skills/codecircuit-inferring-llm-generated-code/SKILL.md
Assess LLM-generated code correctness using attribution graph analysis inspired by mechanistic interpretability. Apply structural reasoning diagnostics to identify buggy logic, predict failure modes, and suggest targeted fixes. Use when: 'analyze this code for correctness', 'why might this generated code be wrong', 'find structural bugs in this function', 'verify this algorithm logic', 'diagnose why this code fails', 'check this code for logical errors without running it'.
npx skillsauth add ndpvt-web/arxiv-claude-skills codecircuit-inferring-llm-generated-code
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to assess LLM-generated code correctness by applying the attribution graph reasoning framework from the CodeCircuit paper. Instead of relying solely on test execution or surface-level pattern matching, you simulate the mechanistic interpretability approach: decompose code into a line-level dependency graph, extract topological features (density, centrality, clustering), and use structural signatures to identify where reasoning breaks down. This technique is particularly powerful for catching "structural near misses"--code that looks almost right but has subtle logical flaws in control flow, boundary conditions, or state transitions.
CodeCircuit's core insight is that correct code and incorrect code have measurably different internal structure when you trace how each line's output depends on prior computations. The paper constructs attribution graphs where nodes are code lines (or sub-expressions) and directed edges represent data/control flow dependencies weighted by their influence on the final output. Correct implementations form dense, well-connected subgraphs with balanced centrality--meaning no single line dominates the reasoning flow unnaturally. Incorrect code shows characteristic anomalies: fragmented components, abnormally high betweenness centrality on wrong operations (e.g., a greedy heuristic dominating where careful state tracking should), and high error-to-feature influence ratios where the model "shortcuts" past necessary logic.
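To make the betweenness-centrality signal concrete, here is a minimal sketch that computes unnormalized betweenness on a small directed, unweighted graph using Brandes' algorithm. It is an illustration only: the paper's attribution graphs are weighted and built from model internals, whereas the `graph` dictionary and node labels below are hypothetical stand-ins for code lines.

```python
from collections import deque


def betweenness_centrality(graph):
    """Unnormalized betweenness for a directed, unweighted graph (Brandes).

    `graph` maps each node to a list of successor nodes (no duplicate edges).
    """
    nodes = set(graph) | {v for succs in graph.values() for v in succs}
    score = {n: 0.0 for n in nodes}
    for source in nodes:
        # Forward pass: BFS that counts shortest paths (sigma) per node.
        order = []
        preds = {n: [] for n in nodes}
        sigma = {n: 0 for n in nodes}
        dist = {n: -1 for n in nodes}
        sigma[source], dist[source] = 1, 0
        queue = deque([source])
        while queue:
            v = queue.popleft()
            order.append(v)
            for w in graph.get(v, []):
                if dist[w] < 0:
                    dist[w] = dist[v] + 1
                    queue.append(w)
                if dist[w] == dist[v] + 1:
                    sigma[w] += sigma[v]
                    preds[w].append(v)
        # Backward pass: accumulate pair dependencies from the farthest nodes.
        delta = {n: 0.0 for n in nodes}
        for w in reversed(order):
            for v in preds[w]:
                delta[v] += sigma[v] / sigma[w] * (1 + delta[w])
            if w != source:
                score[w] += delta[w]
    return score


# Hypothetical diamond: two branches ("b" and "c") feed the same result "d".
diamond = {"a": ["b", "c"], "b": ["d"], "c": ["d"], "d": []}
scores = betweenness_centrality(diamond)
```

In the diamond, the two branches share the load equally (0.5 each). A graph in which one branch carried every shortest path would show the kind of imbalance described above.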
The practical takeaway for code review is a structured diagnostic: build the dependency graph of a function, compute topological features, and look for specific failure signatures. High graph fragmentation suggests missing logic connections. Centrality spikes on simple expressions (like high = mid instead of high = mid - 1) indicate the code is taking a shortcut that skips a necessary boundary adjustment. Low clustering around loop update logic signals that iteration state is not being properly maintained across iterations.
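Two of the simpler diagnostics mentioned here, density and fragmentation, can be sketched in a few lines. This is an assumption-laden illustration: the edge list is a hypothetical line-level graph, density uses the standard directed-graph definition, and fragmentation is approximated as the number of weakly connected components.

```python
def density(num_nodes, edges):
    """Directed density: distinct edges / (n * (n - 1)); assumes no self-loops."""
    if num_nodes < 2:
        return 0.0
    return len(set(edges)) / (num_nodes * (num_nodes - 1))


def weak_components(nodes, edges):
    """Count weakly connected components with a small union-find."""
    parent = {n: n for n in nodes}

    def find(x):
        while parent[x] != x:
            parent[x] = parent[parent[x]]  # path halving
            x = parent[x]
        return x

    for u, v in edges:
        parent[find(u)] = find(v)
    return len({find(n) for n in nodes})


# Hypothetical graph: lines 4 and 5 are disconnected from lines 1-3.
lines = [1, 2, 3, 4, 5]
deps = [(1, 2), (2, 3), (4, 5)]
graph_density = density(len(lines), deps)
fragments = weak_components(lines, deps)
```

More than one component in a function that is supposed to compute a single result is a cue to look for a missing dependency between the fragments.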
The paper validates this across Python, C++, and Java, showing these structural patterns are language-agnostic. Topological features predict correctness at 79.9% AUROC for Python and improve with code complexity (80-92% AUROC for 10-30 line functions), precisely where surface heuristics fail.
1. Parse the code into a line-level dependency graph. For each line, identify what variables it reads (inputs) and what it writes (outputs). Draw directed edges from each line that produces a value to every line that consumes it. Include control flow edges from conditionals and loop headers to the lines they govern.
2. Annotate edge weights by influence strength. Assign higher weights to edges where a variable is directly used in a computation (e.g., result += arr[i]) than to edges where it only affects control flow indirectly. Lines that contribute to the return value or final state get the strongest backward attribution.
3. Identify the critical path. Trace the highest-weight path from input parameters through to the return statement. This is the "algorithmic spine"--the sequence of operations that most directly determines the output. Verify that this path implements the stated algorithm correctly.
4. Compute topological diagnostics on the graph: density, centrality, clustering, and fragmentation.
5. Check for known failure signatures: high fragmentation, centrality spikes on simple expressions, and low clustering around loop update logic. Paired boundary updates (e.g., low = mid + 1, high = mid - 1) should have balanced centrality. If one dominates, the boundary is likely wrong.
6. Perform targeted causal analysis on suspicious lines. For each flagged line, ask: "If I change this expression, does the dependency structure become more balanced?" Try the most common fix patterns (adjusting < vs <=, mid vs mid-1, i vs i+1) and check if the graph anomaly resolves.
7. Report findings with structural evidence. For each identified issue, explain: (a) which line is problematic, (b) what topological anomaly flagged it, (c) what the fix is, and (d) why the fix restores correct structure.
8. Validate the fix against the full graph. After proposing changes, re-trace the dependency graph to confirm no new anomalies are introduced and the critical path now correctly implements the algorithm.
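The first step of this workflow can be approximated mechanically for Python input. The sketch below uses the standard `ast` module to collect per-line reads and writes and links each read to the most recent earlier line that wrote the same name. It is a deliberate simplification and not the paper's construction: the function name `line_dependency_edges` is hypothetical, and loop-carried dependencies, control-flow edges, and edge weights are all left out.

```python
import ast


def line_dependency_edges(source):
    """Return sorted (producer_line, consumer_line) data-flow edges.

    Simplification: a read is linked to the latest earlier line, in source
    order, that wrote the same name.
    """
    tree = ast.parse(source)
    reads, writes = {}, {}
    for node in ast.walk(tree):
        if isinstance(node, ast.Name):
            bucket = writes if isinstance(node.ctx, ast.Store) else reads
            bucket.setdefault(node.lineno, set()).add(node.id)
        elif isinstance(node, ast.arg):
            # Function parameters count as writes on the `def` line.
            writes.setdefault(node.lineno, set()).add(node.arg)
        elif isinstance(node, ast.AugAssign) and isinstance(node.target, ast.Name):
            # `x += 1` also reads x, although the AST marks the target as Store.
            reads.setdefault(node.lineno, set()).add(node.target.id)

    last_writer = {}
    edges = set()
    for line in sorted(set(reads) | set(writes)):
        for name in reads.get(line, ()):
            if name in last_writer and last_writer[name] != line:
                edges.add((last_writer[name], line))
        for name in writes.get(line, ()):
            last_writer[name] = line
    return sorted(edges)


SAMPLE = """\
def total(arr):
    result = 0
    for x in arr:
        result += x
    return result
"""
sample_edges = line_dependency_edges(SAMPLE)
```

The resulting edge list can be fed to whatever graph diagnostics are applied in the later steps. Names with no recorded writer, such as built-ins, simply produce no edge.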
Example 1: Binary Search Off-by-One Bug
User: "Is this binary search correct?"
def binary_search(arr, target):
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid  # Bug: should be mid - 1
    return -1
Approach:
- low/high -> mid -> arr[mid] -> comparison -> update low/high -> loop condition
- `high = mid` (line 10) has abnormally high betweenness because it creates a dependency cycle where high never advances past mid, making this line the bottleneck of all future iterations
- The `low = mid + 1` path shrinks the search space, but `high = mid` does not--asymmetric narrowing is the structural anomaly
- Replacing `mid` with `mid - 1` balances the centrality between both update branches
Output:
Bug found on line 10: `high = mid` should be `high = mid - 1`.
Structural evidence: The dependency graph shows asymmetric search space
narrowing. The `low` update (mid + 1) advances past mid, but `high = mid`
does not exclude mid from the next iteration. This creates an infinite
loop when `low == high == mid` and `arr[mid] > target`.
The attribution graph confirms: line 10 has 3.2x higher betweenness
centrality than line 8, indicating it acts as a bottleneck rather than
a symmetric counterpart. Balanced binary search requires both updates
to exclude mid.
Fix: Change line 10 to `high = mid - 1`.
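As a quick sanity check of this diagnosis, the sketch below applies the proposed fix and also replays the flagged update under a step budget. The helper `buggy_search_stalls` and its `max_steps` parameter are hypothetical names introduced only for this illustration. They bound the loop so the non-terminating case can be observed safely.

```python
def binary_search(arr, target):
    """Corrected version: both updates exclude mid from the next range."""
    low, high = 0, len(arr) - 1
    while low <= high:
        mid = (low + high) // 2
        if arr[mid] == target:
            return mid
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid - 1
    return -1


def buggy_search_stalls(arr, target, max_steps=100):
    """Replay the `high = mid` variant; True if it is still looping at the budget."""
    low, high = 0, len(arr) - 1
    for _ in range(max_steps):
        if low > high:
            return False  # loop condition failed, search ended normally
        mid = (low + high) // 2
        if arr[mid] == target:
            return False  # target found, search ended normally
        elif arr[mid] < target:
            low = mid + 1
        else:
            high = mid  # the flagged update: mid stays inside the range
    return True


data = [1, 3, 5, 7]
missing = binary_search(data, 4)
found = binary_search(data, 5)
stalls = buggy_search_stalls(data, 4)
```

With `data = [1, 3, 5, 7]` and target 4, the buggy variant reaches `low == high == 2` with `arr[2] > 4` and never leaves that state, while the corrected version returns -1.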
Example 2: Dynamic Programming Missing Dependency
User: "Check this longest common subsequence implementation."
def lcs(s1, s2):
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i-1] == s2[j-1]:
                dp[i][j] = dp[i-1][j-1] + 1
            else:
                dp[i][j] = max(dp[i-1][j], dp[i][j-1])
    return dp[m][n]
Approach:
- `dp[i][j]` depends on `dp[i-1][j-1]` (match branch) or on `dp[i-1][j]` and `dp[i][j-1]` (mismatch branch)
- `dp[m][n]` traces back through the recurrence to `dp[0][*]` and `dp[*][0]` (base cases, initialized to 0)
Output:
No structural anomalies detected. The dependency graph shows:
- Symmetric recurrence: dp[i][j] correctly depends on dp[i-1][j-1] (match
case) and max(dp[i-1][j], dp[i][j-1]) (mismatch case)
- Complete connectivity: every cell traces back to the base case row/column
- Balanced centrality: no single line dominates the computation path
- Proper initialization: dp[0][*] and dp[*][0] are implicitly 0
This implementation is structurally correct for LCS.
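A structural "no anomalies" verdict can be cross-checked cheaply. The sketch below compares the DP against a brute-force oracle on every pair of short strings over a two-letter alphabet. The oracle `lcs_brute` and the helper `agree_on_small_inputs` are hypothetical names added for this check and are not part of the skill or the paper.

```python
from itertools import combinations, product


def lcs(s1, s2):
    """The DP under review, reproduced unchanged."""
    m, n = len(s1), len(s2)
    dp = [[0] * (n + 1) for _ in range(m + 1)]
    for i in range(1, m + 1):
        for j in range(1, n + 1):
            if s1[i - 1] == s2[j - 1]:
                dp[i][j] = dp[i - 1][j - 1] + 1
            else:
                dp[i][j] = max(dp[i - 1][j], dp[i][j - 1])
    return dp[m][n]


def is_subsequence(sub, s):
    """True if `sub` appears in `s` in order (not necessarily contiguously)."""
    it = iter(s)
    return all(ch in it for ch in sub)


def lcs_brute(s1, s2):
    """Try subsequences of s1 from longest to shortest; exponential, tiny inputs only."""
    for k in range(len(s1), -1, -1):
        for idx in combinations(range(len(s1)), k):
            if is_subsequence([s1[i] for i in idx], s2):
                return k
    return 0


def agree_on_small_inputs(alphabet="ab", max_len=4):
    """Compare both implementations on all string pairs up to max_len."""
    words = ["".join(p) for n in range(max_len + 1)
             for p in product(alphabet, repeat=n)]
    return all(lcs(a, b) == lcs_brute(a, b) for a in words for b in words)


all_agree = agree_on_small_inputs()
```

Agreement on small inputs is evidence, not proof, but it exercises the empty-string base cases and both branches of the recurrence.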
Example 3: Graph Traversal with Subtle State Bug
User: "Does this BFS find the shortest path correctly?"
def shortest_path(graph, start, end):
    from collections import deque
    queue = deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        if node == end:
            return path
        for neighbor in graph[node]:
            queue.append((neighbor, path + [neighbor]))
    return None
Approach:
- `queue` -> `node, path` -> neighbor loop -> `queue.append` (cycle back)
- There is no `visited` set. The dependency graph shows the neighbor iteration has no filtering edge--every neighbor is enqueued unconditionally
Output:
Bug found: Missing visited-node tracking. On a cyclic graph the search never
terminates when `end` is unreachable, and nodes reachable by several paths
are explored repeatedly.
Structural evidence: The dependency graph around lines 8-9 shows an
unconditional feedback loop--every neighbor is enqueued with no filtering
edge. A correct BFS requires a "visited" guard that creates a dependency
from the set of processed nodes back to the neighbor enumeration,
pruning already-seen nodes.
Fix: Add a visited set and check membership before enqueuing:
visited = {start}
...
for neighbor in graph[node]:
    if neighbor not in visited:
        visited.add(neighbor)
        queue.append((neighbor, path + [neighbor]))
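For completeness, here is one way the fix might look when folded back into the whole function. Two details are assumptions of this sketch rather than part of the original snippet: the import is moved to module level, and `graph.get(node, [])` is used so that nodes without an adjacency entry are treated as having no neighbors.

```python
from collections import deque


def shortest_path(graph, start, end):
    """BFS that marks nodes when enqueued, so each node is expanded at most once."""
    visited = {start}
    queue = deque([(start, [start])])
    while queue:
        node, path = queue.popleft()
        if node == end:
            return path
        for neighbor in graph.get(node, []):
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, path + [neighbor]))
    return None


# Hypothetical cyclic graph: a <-> b, and d points back to b.
cyclic = {"a": ["b", "c"], "b": ["a", "d"], "c": ["d"], "d": ["b"]}
route = shortest_path(cyclic, "a", "d")

# "c" is unreachable from "a" here; the visited set lets the search terminate.
island = {"a": ["b"], "b": ["a"], "c": []}
no_route = shortest_path(island, "a", "c")
```

Marking a node as visited at enqueue time, rather than at dequeue time, keeps duplicate entries out of the queue while still returning a shortest path in an unweighted graph.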
Paper: CodeCircuit: Toward Inferring LLM-Generated Code Correctness via Attribution Graphs (He et al., 2026). Key sections: Section 3 for attribution graph construction via per-layer transcoders, Section 4 for topological feature extraction (30+ features across mechanical composition, global structure, and centrality categories), and Section 5 for the causal intervention case study on binary search. Code: github.com/bruno686/CodeCircuit.