skills/evaluating-enhancing-vulnerability-reasoning/SKILL.md
Perform DAG-structured vulnerability reasoning on code, modeling causal dependencies between code facts instead of linear chain-of-thought. Use when asked to: 'analyze this code for vulnerabilities', 'explain why this code is vulnerable', 'trace the root cause of this security bug', 'review this function for memory safety issues', 'is this code exploitable and why', 'reason about the security of this code path'.
npx skillsauth add ndpvt-web/arxiv-claude-skills evaluating-enhancing-vulnerability-reasoning
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to perform rigorous vulnerability analysis using Directed Acyclic Graph (DAG) structured reasoning, based on the DAGVul framework. Linear chain-of-thought explanations often hallucinate plausible-sounding but incorrect logic. This approach instead models vulnerability reasoning as a graph of causal dependencies: it extracts ground facts from code as source nodes, builds intermediate inference nodes through taint tracking and control/data flow analysis, and converges on terminal sink nodes that confirm or refute the vulnerability. The structure enforces consistency and is designed to guard against the 12 systematic failure patterns (hallucination, spurious causality, incomplete evidence, etc.) that plague standard LLM vulnerability analysis.
The problem with linear reasoning: Research shows that 36.4% of correct vulnerability detection verdicts from LLMs are based on incorrect reasoning — the model gets the right answer for the wrong reasons. Linear chain-of-thought (CoT) reasoning is prone to 12 systematic failure patterns grouped into four categories: (1) focus identification errors (analyzing the wrong code region), (2) code comprehension failures (misunderstanding data flow, control flow, intra-procedural or inter-procedural semantics), (3) logic analysis failures (incomplete evidence, spurious causality, flawed premises, contradictions), and (4) generative biases (hallucination, over-inference, redundancy). These errors compound in sequential reasoning because there is no structural constraint preventing a later step from contradicting or disconnecting from earlier ones.
DAG-structured reasoning fixes this. Instead of a linear chain, model the analysis as a graph G = (V, E) with three node types: Source nodes (ground facts extracted directly from code — buffer sizes, API signatures, variable types, entry points), Intermediate nodes (logical inferences such as taint propagation steps, pointer analysis, constraint solving — each requiring explicit citation of parent nodes), and Sink nodes (terminals that either confirm vulnerability as a verified_sink or prove safety as a sanitized_sink). Edges encode causal dependencies: a node is only admissible when ALL its parent dependencies are established. This topological ordering prevents circular reasoning, ensures every claim is grounded in code evidence, and guarantees logical closure — every reasoning path must terminate at a defined sink.
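The graph model described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions: the `Node` class, the `kind` labels, and the `admissible` helper are hypothetical names chosen for this sketch, not part of the DAGVul framework itself.

```python
from dataclasses import dataclass, field

# Node kinds from the text: sources are code facts, intermediates are
# inferences, sinks are terminal verdicts. The string labels are illustrative.
SOURCE, INTERMEDIATE, SINK = "source", "intermediate", "sink"


@dataclass
class Node:
    node_id: str
    kind: str
    claim: str
    parents: list[str] = field(default_factory=list)  # ids this node cites


def admissible(node: Node, established: set[str]) -> bool:
    """A node may be asserted only when ALL of its parents are established."""
    if node.kind == SOURCE:
        # Ground facts come straight from the code and cite no parents.
        return not node.parents
    # Inferences and sinks must cite at least one parent, all established.
    return bool(node.parents) and all(p in established for p in node.parents)


nodes = [
    Node("S1", SOURCE, "dest is a 64-byte stack buffer"),
    Node("S2", SOURCE, "the guard permits len up to 127"),
    Node("I1", INTERMEDIATE, "len can exceed dest's capacity", parents=["S1", "S2"]),
    Node("K1", SINK, "verified_sink: out-of-bounds write", parents=["I1"]),
]

established: set[str] = set()
for node in nodes:
    if admissible(node, established):
        established.add(node.node_id)

print(sorted(established))  # ['I1', 'K1', 'S1', 'S2']
```

An inference that cites a parent which was never established, or that cites no parent at all, is rejected by `admissible`, which is the structural constraint the paragraph above describes.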
Enforcing correctness through structure: The DAG constraint means you cannot assert "this buffer overflows" without first establishing the buffer's allocation size (source node), the write operation and its bounds (intermediate nodes with data flow edges), and the absence of bounds checking (intermediate node). If any link in the causal chain is missing, the graph is structurally incomplete and the conclusion is not supported. This mirrors how human security experts reason — building a chain of evidence from code facts to vulnerability confirmation.
Identify the analysis scope. Read the code and determine the specific function, module, or code path under review. Identify the CWE category if known (e.g., CWE-416 Use-After-Free, CWE-787 Out-of-bounds Write, CWE-89 SQL Injection). Narrow focus to the relevant code region rather than analyzing everything at once.
Extract source nodes (ground facts). Enumerate concrete, verifiable facts directly from the code with line references: buffer allocations and their sizes, API calls and their signatures, variable declarations and types, input sources (user input, file reads, network data), security-relevant constants, and memory management operations (malloc/free, new/delete).
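Build intermediate inference nodes with explicit parent dependencies. For each inference, state what parent nodes it depends on and what program analysis primitive justifies it, for example taint propagation, pointer analysis, constraint solving, or control/data flow analysis.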
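Check each intermediate node for the 12 failure patterns. Before accepting an inference node, verify it against the four categories described above: it addresses the right code region, reads the data and control flow correctly, rests on complete evidence without spurious causality or contradiction, and adds no hallucinated, over-inferred, or redundant claims.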
Enforce topological ordering. Verify that every intermediate node's parent dependencies are satisfied before that node is asserted. If node N depends on nodes A and B, both A and B must be fully established first. No forward references or circular dependencies.
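Converge on sink nodes. Arrive at one of two terminal conclusions: a verified_sink, which confirms the vulnerability, or a sanitized_sink, which proves the code path safe.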
Validate logical closure. Confirm that every source node connects to at least one sink node through a complete path. If any source node is a dead-end (no path to a sink), the analysis is incomplete — either extend the reasoning or explicitly note the gap.
Produce the structured output. Present the DAG as a readable vulnerability report with: (a) the verdict, (b) the causal chain formatted as node dependencies, (c) line references for every claim, and (d) the CWE classification if applicable.
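The two structural checks in this workflow, topological ordering and logical closure, can be sketched with the standard library. This is an illustrative sketch, not the framework's implementation: the graph, the node ids, and the helper names `assertion_order` and `dead_end_sources` are assumptions made for the example.

```python
from collections import deque
from graphlib import CycleError, TopologicalSorter

# Hypothetical analysis graph: node id -> set of parent ids it cites.
parents = {
    "S1": set(),
    "S2": set(),
    "S3": set(),            # extracted fact that no inference uses
    "I1": {"S1", "S2"},
    "VERIFIED_SINK": {"I1"},
}
sinks = {"VERIFIED_SINK"}


def assertion_order(parents):
    """Return node ids so that every node comes after all of its parents."""
    undefined = {p for ps in parents.values() for p in ps} - set(parents)
    if undefined:
        raise ValueError(f"undefined parent nodes: {sorted(undefined)}")
    try:
        return list(TopologicalSorter(parents).static_order())
    except CycleError as exc:
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


def dead_end_sources(parents, sinks):
    """Return source nodes (no parents) that have no path to any sink."""
    # Walk backwards from the sinks and collect every node that feeds one.
    feeds_sink, queue = set(sinks), deque(sinks)
    while queue:
        for parent in parents.get(queue.popleft(), ()):
            if parent not in feeds_sink:
                feeds_sink.add(parent)
                queue.append(parent)
    return {n for n, ps in parents.items() if not ps and n not in feeds_sink}


order = assertion_order(parents)
print(order)
print(dead_end_sources(parents, sinks))  # {'S3'}
```

Here `S3` is reported as a dead end, so the analysis would either need an inference that uses it or an explicit note that the fact turned out to be irrelevant.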
Example 1: Use-After-Free (CWE-416)
User: "Is this C code vulnerable?"
void process_data(int *buf, int len) {
    for (int i = 0; i < len; i++)
        buf[i] = i * 2;
    free(buf);
    // ... other operations ...
    printf("%d\n", buf[0]); // line 6
}
Approach (DAG reasoning):
Source Nodes:
S1: [line 1] buf is a pointer parameter (int*), caller-owned memory
S2: [line 4] free(buf) deallocates the memory pointed to by buf
S3: [line 6] buf[0] dereferences buf after the free
Intermediate Nodes:
I1: [depends on S1, S2] After free(buf) on line 4, the pointer buf
becomes a dangling pointer — it still holds the address but the
memory is no longer valid. (Memory lifecycle analysis)
I2: [depends on I1] No reassignment of buf between line 4 and line 6.
No conditional guard or null check. buf remains dangling.
(Data flow analysis: no re-definition of buf in [4..6])
I3: [depends on I2, S3] The dereference buf[0] on line 6 accesses
freed memory through the dangling pointer. (Taint: dangling
pointer reaches dereference operation)
Sink Node:
VERIFIED_SINK: [depends on I3] Use-After-Free confirmed.
Trigger: free(buf) at line 4 followed by buf[0] read at line 6
with no intervening reallocation or guard.
Verdict: VULNERABLE — CWE-416 Use-After-Free
Root cause: Missing nullification of buf after free(), and no
control flow preventing the post-free access on line 6.
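The memory lifecycle reasoning in I1 through I3 can be mimicked by a tiny checker over a hand-written event trace. This is a sketch under strong assumptions: it handles straight-line code only, and the event format and the `find_use_after_free` helper are hypothetical, not the output of any real analysis tool.

```python
def find_use_after_free(events):
    """Scan (line, op, var) events, with op in {'free', 'assign', 'use'}.

    Returns (var, freed_line, used_line) for every use of a variable that
    was freed earlier and not re-assigned in between.
    """
    freed_at = {}
    findings = []
    for line, op, var in events:
        if op == "free":
            freed_at[var] = line
        elif op == "assign":
            freed_at.pop(var, None)  # a re-definition clears the dangling state
        elif op == "use" and var in freed_at:
            findings.append((var, freed_at[var], line))
    return findings


# Hand-written trace of process_data: write on line 3, free on line 4,
# read on line 6.
events = [(3, "use", "buf"), (4, "free", "buf"), (6, "use", "buf")]
print(find_use_after_free(events))  # [('buf', 4, 6)]
```

The single finding corresponds to the path S2 to I1 to I3 in the analysis above: freed on line 4, still dangling, dereferenced on line 6.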
Example 2: SQL Injection Analysis — Safe Code (CWE-89)
User: "Can this Python code be SQL injected?"
def get_user(db, username):
    query = "SELECT * FROM users WHERE name = ?"
    cursor = db.execute(query, (username,))
    return cursor.fetchone()
Approach (DAG reasoning):
Source Nodes:
S1: [line 1] username is an external parameter (untrusted input)
S2: [line 2] query uses a parameterized placeholder "?"
S3: [line 3] db.execute() is called with query and (username,) as
separate arguments — parameterized execution
Intermediate Nodes:
I1: [depends on S1, S2] The untrusted input username is NOT
concatenated or interpolated into the query string. The query
string is a static literal. (Data flow: no def-use chain from
username to query string construction)
I2: [depends on S2, S3] The database driver receives the query
template and parameters separately. The "?" placeholder is
bound by the driver as a data value (through a prepared statement
or driver-side escaping, depending on the driver), so it is never
parsed as SQL. (API contract analysis)
Sink Node:
SANITIZED_SINK: [depends on I1, I2] SQL injection is prevented.
The parameterized query separates code from data. The untrusted
input never enters the query as executable SQL.
Verdict: SAFE — Parameterized query prevents CWE-89 SQL Injection.
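The SAFE verdict can be checked empirically. The sketch below assumes `db` is a `sqlite3` connection, which the original example does not state (the `?` placeholder matches sqlite3's qmark style). The `get_user_unsafe` function is a hypothetical counter-example added for contrast.

```python
import sqlite3


def get_user(db, username):
    # Same parameterized query as in the example above.
    query = "SELECT * FROM users WHERE name = ?"
    cursor = db.execute(query, (username,))
    return cursor.fetchone()


def get_user_unsafe(db, username):
    # Counter-example: string interpolation puts the input into the SQL text.
    query = f"SELECT * FROM users WHERE name = '{username}'"
    return db.execute(query).fetchone()


db = sqlite3.connect(":memory:")
db.execute("CREATE TABLE users (name TEXT, role TEXT)")
db.execute("INSERT INTO users VALUES ('alice', 'admin')")

payload = "nobody' OR '1'='1"
print(get_user(db, "alice"))         # ('alice', 'admin')
print(get_user(db, payload))         # None
print(get_user_unsafe(db, payload))  # ('alice', 'admin')
```

With the parameterized version the payload is compared as a literal name and matches nothing. With interpolation the same payload rewrites the WHERE clause and returns a row.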
Example 3: Buffer Overflow with Subtle Control Flow (CWE-787)
User: "Review this for buffer overflow."
void copy_input(char *src) {
    char dest[64];
    int len = strlen(src);
    if (len < 128) {
        memcpy(dest, src, len);
    }
}
Approach (DAG reasoning):
Source Nodes:
S1: [line 1] src is an external char* parameter (untrusted)
S2: [line 2] dest is a stack buffer of 64 bytes
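S3: [line 3] len = strlen(src); strlen returns a size_t in
0..SIZE_MAX-1, but the result is stored in a signed int
S4: [line 4] Guard condition: len < 128
S5: [line 5] memcpy(dest, src, len) copies len bytes into dest
Intermediate Nodes:
I1: [depends on S3, S4] For ordinary inputs the guard permits len in
range [0..127]. (Constraint solving: len < 128 passes for any
len <= 127. A negative len, possible if a string longer than
INT_MAX is truncated on conversion to int, also passes.)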
I2: [depends on S2, I1] dest has capacity 64 bytes, but len can
be up to 127. Values in [65..127] exceed dest's capacity.
(Bounds analysis: 127 > 64)
I3: [depends on I2, S5] memcpy writes len bytes (up to 127) into
a 64-byte buffer. For len in [65..127], this overflows dest
by up to 63 bytes on the stack. (Overflow confirmation)
Sink Node:
VERIFIED_SINK: [depends on I3] Stack buffer overflow confirmed.
Trigger: src with strlen in [65..127] passes the guard but
overflows the 64-byte dest buffer.
The guard checks against 128 but should check against
sizeof(dest) which is 64.
Verdict: VULNERABLE — CWE-787 Out-of-bounds Write
Root cause: Bounds check (len < 128) is mismatched with actual
buffer size (64 bytes). Fix: change guard to (len < sizeof(dest))
or (len < 64).
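The bounds arithmetic behind I1 through I3 is easy to verify mechanically. This is a small sketch that considers non-negative lengths only. The constant names are illustrative.

```python
DEST_SIZE = 64      # sizeof(dest) in the example
GUARD_LIMIT = 128   # the guard is: len < 128

# Lengths that pass the guard but do not fit in dest.
overflowing = [n for n in range(GUARD_LIMIT) if n > DEST_SIZE]

print(overflowing[0], overflowing[-1])  # 65 127
print(max(overflowing) - DEST_SIZE)     # 63

# With the guard changed to len < sizeof(dest), no passing length overflows.
fixed = [n for n in range(DEST_SIZE) if n > DEST_SIZE]
print(fixed)                            # []
```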
input is untrusted user data") as explicit source nodes and flag them as assumptions rather than ground facts.