skills/beyond-blame-rethinking-szz/SKILL.md
Identify bug-inducing commits using temporal knowledge graph search beyond git blame. Use when: 'find what commit introduced this bug', 'trace root cause of regression', 'which commit broke this', 'find the bug-inducing change', 'blame analysis for this fix', 'what caused this defect'
npx skillsauth add ndpvt-web/arxiv-claude-skills beyond-blame-rethinking-szz
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to identify Bug-Inducing Commits (BICs) by constructing and searching a temporal knowledge graph of git history, going far beyond what git blame alone can find. Based on the AgenticSZZ approach, it addresses the critical limitation that over 40% of real bug-inducing commits cannot be found by blame alone -- 28% require traversing commit history beyond blame results, and 14% involve "blameless" bugs where no lines in the fix were directly modified by the culprit commit. The technique reframes BIC identification from a ranking problem over blame commits into a graph search problem where temporal ordering enables causal reasoning about bug introduction.
Use this skill when:
- `git blame` on a bug fix points to commits that don't look like the actual root cause
- `git bisect` is impractical due to build complexity and the user needs a static analysis approach

Traditional SZZ algorithms use `git blame` on the lines modified by a Bug-Fixing Commit (BFC) to find commits that last touched those lines. This works for only ~58% of cases. The remaining ~42% fall into three categories: Blame Ancestors (10.3%), where the real BIC is an ancestor of the blamed commit in file history; BFC Ancestors (17.7%), where the BIC is found by walking backward from the fix commit rather than from blame results; and Blameless (14.1%), where the fix adds entirely new code and there are no deleted or modified lines to blame at all.
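The category definitions above can be read as a small decision procedure. The following is a minimal sketch, not the paper's labelling code: the function name `categorize_bic`, its parameters, and the order of the checks are assumptions made for illustration.

```python
from enum import Enum


class BicCategory(Enum):
    BLAME = "blame"                    # BIC is one of the blame commits
    BLAME_ANCESTOR = "blame_ancestor"  # BIC precedes a blame commit in file history
    BFC_ANCESTOR = "bfc_ancestor"      # BIC found by walking back from the fix
    BLAMELESS = "blameless"            # fix deletes/modifies nothing to blame


def categorize_bic(bic, blame_commits, blame_ancestors, fix_has_blamable_lines):
    """Assign a known BIC to one of the categories described above.

    The order of the checks is an assumption of this sketch.
    """
    if not fix_has_blamable_lines:
        return BicCategory.BLAMELESS
    if bic in blame_commits:
        return BicCategory.BLAME
    if bic in blame_ancestors:
        return BicCategory.BLAME_ANCESTOR
    return BicCategory.BFC_ANCESTOR


# Hypothetical commit ids, for illustration only.
print(categorize_bic("bbb222", {"aaa111"}, {"bbb222"}, True).value)
```

Only the first category is reachable by ranking blame results; the other three need the wider search space.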
AgenticSZZ constructs a Temporal Knowledge Graph (TKG) with three node types -- Commits, Files, and Functions -- connected by edges: PRECEDES (temporal ordering between commits), MODIFIES_FILE and MODIFIES_FUNCTION (structural connections from commits to code), and DEFINED_IN (functions to their containing files). The graph is built by running blame on the BFC, then traversing file history backward from both blame commits and the BFC itself, extracting function-level information from diff hunk headers. Candidates are scored by proximity: blame commits get fitness 1.0, blame ancestors 0.6, and BFC ancestors 0.3.
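One way to picture the graph is as a dictionary of nodes plus a set of typed edges. This is a minimal in-memory sketch, not the paper's implementation. The class and method names are hypothetical. Two behaviours are assumptions: a commit reached by several routes keeps its highest fitness, and PRECEDES links consecutive commits in timestamp order.

```python
from dataclasses import dataclass, field

# Fitness by how a commit was discovered (values from the text above).
FITNESS = {"blame": 1.0, "blame_ancestor": 0.6, "bfc_ancestor": 0.3}


@dataclass
class TemporalKnowledgeGraph:
    nodes: dict = field(default_factory=dict)  # node id -> attribute dict
    edges: set = field(default_factory=set)    # (source, edge type, target)

    def add_commit(self, sha, timestamp, origin):
        # Assumption: a commit reached by several routes keeps its best fitness.
        fitness = FITNESS[origin]
        known = self.nodes.get(sha)
        if known is None or fitness > known["fitness"]:
            self.nodes[sha] = {"kind": "commit", "timestamp": timestamp,
                               "fitness": fitness}

    def add_change(self, sha, path, function=None):
        # Structural edges from a commit to the code it touches.
        self.nodes.setdefault(path, {"kind": "file"})
        self.edges.add((sha, "MODIFIES_FILE", path))
        if function is not None:
            function_id = f"{path}::{function}"
            self.nodes.setdefault(function_id, {"kind": "function"})
            self.edges.add((sha, "MODIFIES_FUNCTION", function_id))
            self.edges.add((function_id, "DEFINED_IN", path))

    def link_precedes(self):
        # Assumption: PRECEDES links consecutive commits in timestamp order.
        commits = sorted(
            (sha for sha, attrs in self.nodes.items() if attrs["kind"] == "commit"),
            key=lambda sha: self.nodes[sha]["timestamp"],
        )
        for earlier, later in zip(commits, commits[1:]):
            self.edges.add((earlier, "PRECEDES", later))


# Hypothetical commits and timestamps, for illustration only.
graph = TemporalKnowledgeGraph()
graph.add_commit("bbb222", timestamp=100, origin="blame_ancestor")
graph.add_commit("aaa111", timestamp=200, origin="blame")
graph.add_change("bbb222", "connection_pool.py", function="resize_pool")
graph.add_change("aaa111", "connection_pool.py")
graph.link_precedes()
```

Storing edges as plain tuples keeps the sketch small; a real build would likely use a graph library or database.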
An LLM agent then searches this graph using four tools: list_candidates (ranked by fitness), traverse_graph (follow edges to discover related commits), query_node (read commit metadata), and read_node_content (examine actual code diffs for causal analysis). The agent iteratively explores candidates, reads diffs, and reasons about causality -- achieving F1-scores of 0.48-0.74 across datasets, up to 27% improvement over prior state-of-the-art.
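The four tools can be sketched as thin read-only views over such a graph. Only the tool names come from the text above. The signatures, return shapes, and the choice to store each diff on its commit node are assumptions of this sketch, which may differ from the paper's agent interface.

```python
class GraphTools:
    """Read-only tool surface an agent could call (hypothetical signatures)."""

    def __init__(self, nodes, edges):
        self.nodes = nodes  # node id -> attribute dict
        self.edges = edges  # list of (source, edge type, target)

    def list_candidates(self, limit=10):
        # Candidate commits, highest fitness first.
        commits = [(node_id, attrs) for node_id, attrs in self.nodes.items()
                   if attrs.get("kind") == "commit"]
        commits.sort(key=lambda item: item[1]["fitness"], reverse=True)
        return [node_id for node_id, _ in commits[:limit]]

    def traverse_graph(self, node_id, edge_type=None):
        # Neighbours of a node in either direction, optionally by edge type.
        neighbours = []
        for source, kind, target in self.edges:
            if edge_type is not None and kind != edge_type:
                continue
            if source == node_id:
                neighbours.append((kind, target))
            elif target == node_id:
                neighbours.append((kind, source))
        return neighbours

    def query_node(self, node_id):
        # Metadata only; the diff is fetched separately to keep answers short.
        return {k: v for k, v in self.nodes[node_id].items() if k != "diff"}

    def read_node_content(self, node_id):
        return self.nodes[node_id].get("diff", "")


# Hypothetical data, for illustration only.
tools = GraphTools(
    nodes={
        "bbb222": {"kind": "commit", "fitness": 0.6, "diff": "+def resize_pool():"},
        "aaa111": {"kind": "commit", "fitness": 1.0, "diff": "(whitespace only)"},
        "connection_pool.py": {"kind": "file"},
    },
    edges=[
        ("bbb222", "PRECEDES", "aaa111"),
        ("bbb222", "MODIFIES_FILE", "connection_pool.py"),
        ("aaa111", "MODIFIES_FILE", "connection_pool.py"),
    ],
)
print(tools.list_candidates())
print(tools.traverse_graph("aaa111", edge_type="PRECEDES"))
```

Separating `query_node` from `read_node_content` lets the agent skim metadata cheaply and read full diffs only for promising candidates.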
1. Identify the Bug-Fixing Commit (BFC). Start from the known fix: a merge or commit that resolves the bug. Extract the exact files and lines changed by this commit using `git diff`.
2. Run `git blame` on fixed lines. For each line deleted or modified in the BFC, run `git blame` to find the commit that last touched it. These are the "blame commits", the traditional SZZ search space. Record which files and functions (from hunk headers) each blame commit touches.
3. Classify the fix type. Check whether the fix only adds new code (blameless), modifies existing code (blame-based), or does both. If the fix is purely additive with no deleted lines, skip to step 5 (Traverse BFC ancestors), because blame cannot help.
4. Traverse blame ancestors. For each blame commit, walk backward through file history (`git log --follow <file>`) up to 24 commits deep. For each ancestor commit, extract the functions it modifies from diff hunk headers. These are "blame ancestor" candidates with fitness score 0.6.
5. Traverse BFC ancestors. Walk backward from the BFC through the history of each modified file, stopping at or before the oldest blame commit timestamp. These are "BFC ancestor" candidates with fitness score 0.3. This step is critical for blameless cases.
6. Build the temporal knowledge graph. Create nodes for every commit, file, and function discovered. Connect them with PRECEDES edges (chronological order), MODIFIES_FILE/MODIFIES_FUNCTION edges, and DEFINED_IN edges. Ensure all edges respect temporal ordering: a BIC must precede the BFC.
7. Rank candidates by fitness. Sort all candidate commits: blame commits (1.0) > blame ancestors (0.6) > BFC ancestors (0.3). Within each tier, prioritize function-level matches over file-level matches.
8. Analyze top candidates causally. For each top candidate (start with highest fitness), read the actual diff. Ask: "Does this change introduce behavior that the bug fix corrects?" Look for semantic connections such as added conditions that are too restrictive, missing null checks, incorrect arithmetic, or wrong API usage patterns.
9. Cross-reference temporal constraints. Verify the candidate BIC was committed before the bug was reported (if the report date is known) and before the BFC. Eliminate candidates that violate temporal causality.
10. Decide and document reasoning. Select the most likely BIC with an explanation of the causal chain: what the commit changed, how that change manifests as the bug, and why the BFC reverses or compensates for it.
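The mechanical parts of the workflow above (reading hunk headers, detecting a purely additive fix, ranking, and temporal filtering) can be sketched without an LLM. This is a minimal sketch that assumes the diff text has already been obtained, for example from `git diff`. The helper names, the candidate dictionary layout, and the sample data are hypothetical.

```python
import re

# Unified-diff hunk header; the text after the second "@@" is the enclosing
# context line that git emits when it can find one.
HUNK_HEADER = re.compile(r"^@@ -\d+(?:,\d+)? \+\d+(?:,\d+)? @@ ?(.*)$")
FITNESS = {"blame": 1.0, "blame_ancestor": 0.6, "bfc_ancestor": 0.3}


def summarize_fix(diff_text):
    """Return (function contexts touched, number of deleted lines)."""
    functions, deleted, in_hunk = set(), 0, False
    for line in diff_text.splitlines():
        if line.startswith("diff --git"):
            in_hunk = False  # file headers such as "--- a/x" follow
            continue
        match = HUNK_HEADER.match(line)
        if match:
            in_hunk = True
            if match.group(1).strip():
                functions.add(match.group(1).strip())
        elif in_hunk and line.startswith("-"):
            deleted += 1  # only "-" lines inside a hunk are real deletions
    return functions, deleted


def rank_candidates(candidates, bfc_timestamp, fixed_functions):
    """Drop candidates that do not precede the fix, then sort by tier."""
    eligible = [c for c in candidates if c["timestamp"] < bfc_timestamp]
    return sorted(
        eligible,
        key=lambda c: (
            -FITNESS[c["origin"]],
            not (set(c["functions"]) & fixed_functions),  # function matches first
        ),
    )


SAMPLE_DIFF = """\
diff --git a/pricing.py b/pricing.py
--- a/pricing.py
+++ b/pricing.py
@@ -10,2 +10,4 @@ def calculate_discount(price, rate):
     # compute the discounted price
+    if rate < 0:
+        raise ValueError("rate must be non-negative")
     return price * (1 - rate)
"""

functions, deleted = summarize_fix(SAMPLE_DIFF)
is_blameless = deleted == 0  # purely additive fix: blame has nothing to target

# Hypothetical candidates; eee555 is dated after the fix and must be dropped.
candidates = [
    {"sha": "ddd444", "timestamp": 150, "origin": "bfc_ancestor", "functions": []},
    {"sha": "ccc333", "timestamp": 100, "origin": "bfc_ancestor",
     "functions": ["def calculate_discount(price, rate):"]},
    {"sha": "eee555", "timestamp": 999, "origin": "blame", "functions": []},
]
ranked = rank_candidates(candidates, bfc_timestamp=200, fixed_functions=functions)
print(is_blameless, [c["sha"] for c in ranked])
```

Git chooses the hunk-header context heuristically, so the captured text may be a full signature line or a class rather than a bare function name. The causal analysis in steps 8 and 10 still requires reading the diffs.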
Example 1: Blame-based BIC (straightforward case)
User: "This commit abc123 fixes a null pointer exception by adding a null check
on line 45 of parser.c. What commit introduced the bug?"
Approach:
1. Run `git diff abc123~1 abc123` to see the fix modifies line 45 of parser.c,
adding `if (node != NULL)` before dereferencing.
2. Run `git blame -L 45,45 abc123~1 -- parser.c` to find that commit def456
last modified this line.
3. Read `git show def456` -- it refactored the parser to remove a previous
null guard during a "cleanup" commit.
4. Causal chain confirmed: def456 removed the null check, abc123 restores it.
Output:
Bug-Inducing Commit: def456
Reason: Commit def456 removed the null-safety check on `node` in parser.c:45
during a refactoring pass. The variable can be NULL when parsing empty input,
which the original guard protected against. The fix abc123 restores this check.
Category: Blame (direct blame match)
Example 2: Blame Ancestor BIC (beyond direct blame)
User: "Commit fix789 fixes a race condition in connection_pool.py by adding a
lock around the pool resize operation. git blame points to commit aaa111
which just reformatted the code. What actually introduced the bug?"
Approach:
1. git blame on the fixed lines points to aaa111 (a formatting-only commit).
2. aaa111 is not the real cause -- it only changed whitespace.
3. Traverse backward from aaa111 through file history of connection_pool.py.
4. At depth 3, find commit bbb222 which introduced the `resize_pool()` method
without thread synchronization, moving from a single-threaded to
multi-threaded architecture.
5. Causal chain: bbb222 added concurrent pool access without locking.
Output:
Bug-Inducing Commit: bbb222
Reason: Commit bbb222 introduced the `resize_pool()` function as part of the
multi-threading migration but did not add synchronization around the shared
pool data structure. The blame commit aaa111 was a red herring (formatting only).
Category: Blame Ancestor (3 commits behind blame, fitness 0.6)
Example 3: Blameless BIC (fix adds entirely new code)
User: "Commit fix999 adds a new validation check for negative values in
calculate_discount(). There are no deleted lines in the fix -- it only
adds code. What commit should have included this validation?"
Approach:
1. The fix is purely additive -- no lines deleted or modified means git blame
has nothing to target. This is a blameless case.
2. Traverse BFC ancestors: walk backward through the history of pricing.py
from fix999.
3. Find commit ccc333 which introduced calculate_discount() and defined its
parameter contract. The function accepts a `rate` parameter but never
validates that it is non-negative.
4. Causal chain: ccc333 introduced the function with an incomplete contract.
Output:
Bug-Inducing Commit: ccc333
Reason: Commit ccc333 introduced calculate_discount() accepting arbitrary
float values for `rate` without validating the sign. The function computes
`price * (1 - rate)` which produces values exceeding the original price
when rate is negative. The fix adds the missing validation.
Category: Blameless (BFC ancestor, fitness 0.3)
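The arithmetic behind Example 3 is easy to check. The sketch below is hypothetical: only the expression `price * (1 - rate)` comes from the example, while the function bodies, the name `calculate_discount_fixed`, and the error message are assumptions.

```python
def calculate_discount(price, rate):
    # Hypothetical body as introduced by ccc333: no validation of `rate`.
    return price * (1 - rate)


def calculate_discount_fixed(price, rate):
    # Hypothetical shape of the fix: reject negative rates before computing.
    if rate < 0:
        raise ValueError("rate must be non-negative")
    return price * (1 - rate)


# A negative rate turns the "discount" into a surcharge.
print(calculate_discount(100.0, -0.5))  # 150.0
```

Because the fix only adds the guard, the diff has no deleted lines and blame has nothing to point at.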
- Parse diff hunk headers (`@@ ... @@ function_name`) to build function-level graph connections. Function-level matches are far more precise than file-level ones.
- Use `git log --follow` to track the file through renames. The TKG must connect pre-rename and post-rename file nodes.

Paper: "Beyond Blame: Rethinking SZZ with Knowledge Graph Search" by Yu Shi, Hao Li, Bram Adams, Ahmed E. Hassan (arXiv:2602.02934, 2026). Look for: the TKG construction algorithm, the four agent tools (list_candidates, traverse_graph, query_node, read_node_content), the BIC category distribution table (Table I), and Algorithm 1 describing the agentic search loop.