skills/following-dragons-code-review-guided/SKILL.md
Extract security-relevant signals from code review comments and translate them into fuzzer-guiding annotations using the EyeQ pipeline. Use when the user says 'guide fuzzing from code reviews', 'find dragons in review comments', 'annotate code for fuzzing', 'review-guided fuzzing', 'extract security signals from PRs', or 'instrument code for AFL++ from review discussions'.
This skill enables Claude to apply the EyeQ technique from "Following Dragons: Code Review-Guided Fuzzing" (Luu et al., 2026). EyeQ bridges the gap between developer intelligence expressed in code review discussions and automated fuzz testing. It extracts implicit security signals from review comments, classifies them against CWE-699 weakness categories, localizes the implicated code regions, and generates IJON annotations that steer AFL++ toward the program states developers flagged as fragile or security-critical. The technique found 40+ previously unknown bugs in the PHP interpreter, outperforming unguided fuzzing by 2.7x.
Standard fuzzers maximize code coverage but are blind to which paths developers consider most dangerous. Code reviewers routinely express security-relevant reasoning -- concerns about buffer boundaries, integer overflow, input validation gaps, resource exhaustion -- but this intelligence is trapped in natural language comments that testing tools ignore. EyeQ systematically harvests this intelligence and converts it into machine-consumable fuzzing guidance.
The core insight is a four-stage pipeline: (1) Review Classification uses a two-phase LLM approach -- coarse filtering identifies candidate CWE-699 categories, then fine-grained classification selects the most specific matching CWE. (2) Code Localization maps security-relevant comments to concrete functions via candidate ranking followed by diff-based verification against actual code changes. (3) Code Instrumentation identifies the specific variables or expressions within those functions that expose the security-critical state and wraps them in IJON annotation macros (IJON_SET, IJON_MAX, IJON_MIN). (4) Fuzzing Execution runs AFL++ with the IJON bitmap, treating annotations as first-class feedback signals that reward semantically meaningful executions even when coverage does not increase.
The annotation macros are lightweight and non-invasive. IJON_SET(x) reports a numeric program state to the fuzzer. IJON_MAX(x) and IJON_MIN(x) reward monotonic progress toward extreme values. IJON_STATE() captures discrete execution phases. All annotations are guarded with #ifdef _USE_IJON so they compile away in production builds. This means the technique requires no changes to program semantics or developer workflows.
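To make the feedback idea concrete, here is a simplified model of how a max-style annotation could turn a program value into a fuzzer signal: an input counts as interesting whenever it pushes a tracked slot past the largest value seen so far, whether or not coverage changed. This is a sketch for intuition only, not the IJON or AFL++ implementation; `ijon_model_max`, `MODEL_SLOTS`, and the slot layout are hypothetical names.

```c
#include <assert.h>
#include <stdint.h>

#define MODEL_SLOTS 8  /* hypothetical number of tracked annotation sites */

/* Largest value observed so far at each annotation site. */
static uint64_t model_best[MODEL_SLOTS];

/* Returns 1 if `value` is a new maximum for `slot` (the input would be
 * kept as interesting), 0 otherwise. Coverage is never consulted. */
int ijon_model_max(unsigned slot, uint64_t value) {
    assert(slot < MODEL_SLOTS);
    if (value > model_best[slot]) {
        model_best[slot] = value;
        return 1;
    }
    return 0;
}
```

A min-style annotation would follow the same pattern with the comparison reversed and the slots initialised to the largest representable value.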
Collect review comments. Gather code review discussions from the target repository -- PR comments, inline review threads, commit-level discussions. Extract the comment text, the file path and diff context, and the reviewer's identity. Focus on comments that discuss behavior, correctness, or edge cases rather than style or formatting.
Classify comments for security relevance (coarse pass). For each comment, determine whether it expresses a security-relevant concern. Act as a senior application security engineer. Look for implicit signals: mentions of bounds, overflow, validation, allocation, parsing, trust boundaries, error handling, or resource limits. Map candidate comments to the CWE-699 taxonomy of 40 coding weakness categories. Prioritize recall -- it is better to flag a benign comment as potentially relevant than to miss a real signal.
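Before spending LLM calls on the coarse pass, a cheap keyword prefilter can rank which comments to look at first. The sketch below is a swapped-in heuristic, not the LLM classifier described above: it counts how many of the signal stems from this step appear in a comment. The stem list and the function names are illustrative assumptions, and the matching is deliberately loose to favour recall.

```c
#include <assert.h>
#include <ctype.h>
#include <stddef.h>

/* Signal stems drawn from the step above; illustrative, not exhaustive. */
static const char *const SIGNALS[] = {
    "bound", "overflow", "validat", "alloc", "pars", "trust", "error", "limit"
};

/* Case-insensitive substring test (portable stand-in for strcasestr). */
static int contains_ci(const char *text, const char *needle) {
    for (; *text; ++text) {
        size_t i = 0;
        while (needle[i] && text[i] &&
               tolower((unsigned char)text[i]) ==
               tolower((unsigned char)needle[i])) {
            ++i;
        }
        if (needle[i] == '\0') return 1;
    }
    return 0;
}

/* Returns how many distinct signal stems occur in a review comment.
 * Zero means "probably style-only"; anything else goes to the LLM pass. */
int count_security_signals(const char *comment) {
    assert(comment != NULL);
    int hits = 0;
    for (size_t s = 0; s < sizeof SIGNALS / sizeof SIGNALS[0]; ++s) {
        hits += contains_ci(comment, SIGNALS[s]);
    }
    return hits;
}
```

Because stems such as "pars" also match unrelated words, a nonzero count should only ever promote a comment for classification, never classify it.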
Refine CWE classification (fine-grained pass). For comments that passed the coarse filter, select the most specific matching CWE entry. Prefer definition matches over keyword matches. For example, a comment about "large on-stack allocations pushing past the stack boundary" maps to CWE-1218 (Memory Buffer Errors) or CWE-787 (Out-of-Bounds Write), not merely "input validation." Output the CWE ID, the concrete phrase from the comment that triggered classification, and a confidence level.
Localize implicated functions. Identify which functions in the codebase are implicated by each security-relevant comment. Use two steps: (a) rank candidate functions by relevance to the review discussion, assigning HIGH/MEDIUM/LOW confidence, and (b) verify the localization by grounding it in the actual diff -- confirm the function appears in the changeset the review addresses. Do not invent function names; only select from functions present in the codebase.
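The verification in step (b) can be approximated mechanically: a candidate function is accepted only if its name appears on a hunk header or on an added or removed line of the unified diff under review. The sketch below assumes plain substring matching and a hypothetical helper name, `function_in_diff`; a real pipeline would likely resolve symbols more carefully.

```c
#include <assert.h>
#include <string.h>

/* Returns 1 if `func` is mentioned on a hunk header ("@@ ... @@ context")
 * or on an added/removed line of a unified diff, 0 otherwise.
 * File header lines ("+++", "---") and unchanged context lines are skipped.
 * Simplification: a removed line that itself starts with "--" is skipped too. */
int function_in_diff(const char *diff, const char *func) {
    assert(diff != NULL && func != NULL && *func != '\0');
    size_t flen = strlen(func);
    const char *line = diff;
    while (*line) {
        const char *end = strchr(line, '\n');
        size_t len = end ? (size_t)(end - line) : strlen(line);
        int is_file_header = len >= 3 &&
            (!strncmp(line, "+++", 3) || !strncmp(line, "---", 3));
        int relevant = !is_file_header &&
            (line[0] == '+' || line[0] == '-' || !strncmp(line, "@@", 2));
        if (relevant) {
            for (size_t i = 0; i + flen <= len; ++i) {
                if (!memcmp(line + i, func, flen)) return 1;
            }
        }
        line += len;
        if (*line == '\n') ++line;
    }
    return 0;
}
```

Candidates that fail this check are dropped rather than downgraded, which keeps invented or unrelated function names out of the later stages.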
Identify annotation targets within functions. For each localized function, identify the specific variables, expressions, or control-flow points that expose the security-critical state the reviewer was concerned about. Look for: parsed input values, buffer size calculations, loop bounds, allocation sizes, pointer arithmetic, return codes from validation functions, or state machine transitions. Select the top five candidates, providing verbatim code substrings as anchors.
Generate IJON annotations. For each annotation target, choose the appropriate IJON macro:
- IJON_SET(expr) for values the fuzzer should explore diversely (e.g., parsed configuration values, hash table sizes)
- IJON_MAX(expr) for values where pushing toward large extremes may trigger bugs (e.g., allocation sizes, recursion depths)
- IJON_MIN(expr) for values where pushing toward zero or negative may trigger bugs (e.g., remaining buffer space, reference counts)
- IJON_STATE() for capturing discrete phases in multi-step operations (e.g., state machine transitions in protocol parsers)

Wrap all annotations in #ifdef _USE_IJON / #endif guards.
Insert annotations into source code. Place each annotation immediately after the statement that computes or assigns the target expression. Ensure the annotation does not alter control flow or data flow. Add a trailing comment explaining the rationale (e.g., // TRACKS EXTREME VALUE, // MONITORS ALLOCATION BOUND).
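The placement rule can be seen on a small toy routine. In the sketch below, each annotation sits directly after the statement that computes its target, carries a rationale comment, and is wrapped in the `_USE_IJON` guard, so the function behaves identically when the guard is undefined. `reserve_bytes` is a hypothetical function written for this illustration; building with `_USE_IJON` defined assumes the IJON macros are provided by the fuzzing toolchain.

```c
#include <assert.h>
#include <stddef.h>

/* Toy routine: returns 1 and advances *used if `len` bytes fit into a
 * buffer of `capacity` bytes, 0 otherwise. */
int reserve_bytes(size_t capacity, size_t *used, size_t len) {
    assert(used != NULL && *used <= capacity);
    size_t remaining = capacity - *used;
#ifdef _USE_IJON
    IJON_MIN(remaining);   // MONITORS REMAINING BUFFER SPACE
#endif
    if (len > remaining) {
        return 0;
    }
    *used += len;
#ifdef _USE_IJON
    IJON_MAX(*used);       // TRACKS EXTREME VALUE
#endif
    return 1;
}
```

The annotations only read values that already exist, so neither control flow nor data flow changes between the instrumented and production builds.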
Configure and execute the fuzzer. Compile the instrumented code with AFL++ using the IJON-enabled bitmap. Set up the fuzzing harness targeting the entry points that exercise the annotated functions. Run the fuzzer with a time budget appropriate to the codebase size (24 hours is a reasonable starting point for medium-sized projects).
Triage crashes against review context. When the fuzzer produces crashes, cross-reference the crashing inputs against the original review comments. Determine whether the crash validates the reviewer's concern. Classify bugs by CWE and severity. Check whether the bug reproduces on the latest version of the codebase.
Report findings with provenance. For each confirmed bug, document: the original review comment that surfaced the concern, the CWE classification, the localized function, the annotation that guided discovery, the crashing input, and the root cause analysis. This provenance chain connects developer intuition to concrete vulnerability evidence.
Example 1: PHP Fiber Stack Overflow
User: "This PR review comment says 'how should fibers guard against stack exhaustion?' and there's concern about 'large on-stack allocations pushing the stack pointer far beyond the stack boundary.' Help me generate fuzzing annotations."
Approach:
OnUpdateFiberStackSize() -- the function that parses user-provided INI values for fiber stack size

EG(fiber_stack_size) -- the variable holding the parsed configuration value

    static ZEND_INI_MH(OnUpdateFiberStackSize) {
        if (new_value) {
            zend_long tmp = zend_ini_parse_quantity_warn(new_value, name);
            if (tmp < 0) {
                return FAILURE;
            }
            EG(fiber_stack_size) = tmp;
    #ifdef _USE_IJON
            IJON_SET(EG(fiber_stack_size)); // TRACKS EXTREME VALUE
    #endif
        }
        return SUCCESS;
    }

Output: The fuzzer discovers that input "9690x-D" passes the negative check but produces a value triggering AddressSanitizer stack-overflow when a Fiber is started.
Example 2: Hash Table Resource Exhaustion
User: "A reviewer wrote 'this doesn't limit the number of hash collisions -- an attacker could craft keys to degrade to O(n) lookup.' I want to fuzz this."
Approach:
zend_hash_do_resize() and _zend_hash_add_or_update_i() -- functions managing hash table growth

ht->nNumUsed (element count), ht->nTableMask (table size), collision chain length

    // In _zend_hash_add_or_update_i():
    // chain_len is assumed to be a local counter added for instrumentation
    // while walking the bucket chain.
    #ifdef _USE_IJON
    IJON_MAX(ht->nNumUsed); // REWARD HIGH ELEMENT COUNTS
    IJON_MAX(chain_len);    // REWARD LONG COLLISION CHAINS
    #endif

Output: Fuzzer steers toward inputs producing deep collision chains, uncovering a denial-of-service condition at scale.
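The `chain_len` value annotated above has to be computed somewhere. The sketch below shows one way to derive it, using a self-contained toy chained hash table rather than PHP's actual HashTable layout; every name in it (`table_insert`, `pool`, `heads`, `TABLE_SIZE`) is hypothetical. The insertion walks the bucket once, counts the entries it passes, and exposes that count to the guarded annotation.

```c
#include <assert.h>
#include <stddef.h>

#define TABLE_SIZE 8    /* hypothetical; power of two so masking works */
#define MAX_NODES  64

struct node { unsigned key; int next; };  /* next: pool index, -1 = end */

static struct node pool[MAX_NODES];
static int heads[TABLE_SIZE];
static int used_nodes;

void table_init(void) {
    for (int i = 0; i < TABLE_SIZE; ++i) heads[i] = -1;
    used_nodes = 0;
}

/* Inserts `key` (duplicates are not checked in this toy) and returns the
 * number of existing entries in its bucket, i.e. the collision chain
 * length seen by this insertion. */
int table_insert(unsigned key) {
    assert(used_nodes < MAX_NODES);
    unsigned bucket = key & (TABLE_SIZE - 1);
    int chain_len = 0;
    for (int i = heads[bucket]; i != -1; i = pool[i].next) {
        ++chain_len;
    }
    pool[used_nodes].key = key;
    pool[used_nodes].next = heads[bucket];
    heads[bucket] = used_nodes++;
#ifdef _USE_IJON
    IJON_MAX(used_nodes);  // REWARD HIGH ELEMENT COUNTS
    IJON_MAX(chain_len);   // REWARD LONG COLLISION CHAINS
#endif
    return chain_len;
}
```

Keys that share their low bits land in the same bucket, so a fuzzer rewarded on `chain_len` is pushed toward exactly the colliding inputs the reviewer worried about.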
Example 3: Scanning a PR for Security Signals
User: "Here are 15 review comments from our latest PR. Which ones are security-relevant and what should I fuzz?"
Approach:
| # | Comment Excerpt | CWE | Confidence | Target Function |
|---|----------------------------------------------|------------|------------|-------------------------|
| 3 | "no check on array index after realloc" | CWE-787 | HIGH | process_input_buffer() |
| 7 | "what if the length field is negative?" | CWE-190 | HIGH | parse_header() |
| 11| "race between free and callback invocation" | CWE-416 | MEDIUM | event_dispatch() |
Then generate IJON annotations for each HIGH-confidence entry.
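The triage table maps naturally onto a small record type, with the HIGH-confidence filter as a single pass over the rows. The sketch below copies the three rows shown above; the struct layout and the name `select_high` are assumptions made for illustration.

```c
#include <assert.h>
#include <stddef.h>

enum confidence { CONF_LOW, CONF_MEDIUM, CONF_HIGH };

struct finding {
    int comment_no;               /* index of the review comment */
    const char *excerpt;          /* phrase that triggered classification */
    const char *cwe;              /* CWE identifier */
    enum confidence conf;
    const char *target_function;  /* localized function */
};

/* Rows copied from the table above. */
static const struct finding FINDINGS[] = {
    { 3,  "no check on array index after realloc",     "CWE-787", CONF_HIGH,   "process_input_buffer" },
    { 7,  "what if the length field is negative?",     "CWE-190", CONF_HIGH,   "parse_header" },
    { 11, "race between free and callback invocation", "CWE-416", CONF_MEDIUM, "event_dispatch" },
};

/* Stores pointers to HIGH-confidence rows in `out` (at most `out_cap`)
 * and returns how many were stored. */
size_t select_high(const struct finding *rows, size_t n,
                   const struct finding **out, size_t out_cap) {
    assert(rows != NULL && out != NULL);
    size_t kept = 0;
    for (size_t i = 0; i < n && kept < out_cap; ++i) {
        if (rows[i].conf == CONF_HIGH) out[kept++] = &rows[i];
    }
    return kept;
}
```

MEDIUM entries such as comment 11 stay in the table for manual review instead of being discarded.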
IJON_SET for configuration/input parsing values and IJON_MAX for size/count variables. The choice of macro determines how the fuzzer steers.

volatile read or use __attribute__((used)) to prevent dead-code elimination.

IJON_STATE() annotations at prerequisite checkpoints.

Paper: "Following Dragons: Code Review-Guided Fuzzing" -- Luu, Pasdar, Charoenwet, Murray, Cohney (2026). arXiv:2602.10487v1. https://arxiv.org/abs/2602.10487v1
Look for: Figures 3-7 (the exact LLM prompt templates for each pipeline stage), Listing 4 (the annotated PHP fiber example), and Table III (the CWE classification accuracy breakdown by category).