skills/debugging-code-world/SKILL.md
Debug code by mentally simulating execution as a Code World Model — predicting runtime state after each statement, catching failures from token-budget exhaustion and string tokenization brittleness, and isolating whether bugs come from incorrect action generation or state propagation errors. Use when: 'trace through this code step by step', 'why does this function return the wrong value', 'debug this execution trace', 'simulate what happens when I run this', 'find where the state goes wrong', 'step through the variables in this loop'.
npx skillsauth add ndpvt-web/arxiv-claude-skills debugging-code-world
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill teaches Claude to debug code by simulating execution as a Code World Model (CWM) — predicting the explicit runtime state after every executed statement, then diagnosing failures using the two dominant error regimes identified in the research: token-budget exhaustion on long execution histories, and string-valued state failures caused by subword tokenization discontinuities. Rather than reasoning about code abstractly, Claude traces concrete variable states at each step, identifies where predicted state diverges from expected state, and classifies the root cause as either an action-generation error (wrong operation chosen) or a state-propagation error (correct operation, wrong result).
A Code World Model simulates program execution by predicting a complete runtime state snapshot after every executed statement. Instead of reasoning about code in natural language ("this loop probably does X"), you produce an explicit execution trace: for each line, write down every variable's value after that line runs. This dense supervision makes errors visible immediately — the first line where predicted state diverges from actual state is your bug location.
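To make the idea of a per-statement state snapshot concrete, here is a minimal sketch that records local variables at every line event with Python's `sys.settrace`. The helper `trace_states` and the `demo` function are hypothetical names introduced for illustration; they are not part of the skill or the research. A line event fires just before a line runs, so each snapshot shows the state left by the previously executed statement.

```python
import sys

def trace_states(func, *args):
    """Run func(*args) and record (relative_line, locals) at each line event.

    Hypothetical helper for illustration only. Locals are shallow-copied,
    so mutable values may still change after being recorded.
    """
    snapshots = []
    code = func.__code__

    def tracer(frame, event, arg):
        if frame.f_code is not code:
            return None  # do not trace other functions
        if event == "line":
            relative_line = frame.f_lineno - code.co_firstlineno
            snapshots.append((relative_line, dict(frame.f_locals)))
        return tracer

    previous = sys.gettrace()
    sys.settrace(tracer)
    try:
        result = func(*args)
    finally:
        sys.settrace(previous)  # restore whatever tracer was active before
    return result, snapshots

def demo(n):
    total = 0
    for i in range(n):
        total += i
    return total

result, snapshots = trace_states(demo, 3)
for relative_line, state in snapshots:
    print(relative_line, state)
```

Comparing a hand-predicted trace against a recorded one like this is one way to find the first line where the two disagree.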
The research identifies two systematic failure modes. First, token-budget exhaustion: when programs have long execution histories (deep loops, element-by-element processing, recursive chains), the execution trace becomes so large that you lose track of earlier state. The mitigation is to compress traces — track only variables that changed, summarize stable state, and checkpoint at loop boundaries rather than every iteration. Second, string-valued state brittleness: string operations fail at disproportionately high rates (73% of failures on CruxEval-O despite being only 46% of outputs) because subword tokenization creates context-dependent mappings. The separator "-." tokenizes as one token in isolation but fragments into five tokens inside "a-.-.b", causing methods like rsplit and rfind to produce unexpected results. When you encounter string bugs, always evaluate string operations character-by-character rather than relying on pattern intuition.
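The character-by-character advice can be sketched as a mechanical right-to-left search, checked against the built-in `str.rfind` on the `"a-.-.b"` string mentioned above. The function name `rfind_mechanical` is an assumption made for this illustration.

```python
def rfind_mechanical(text, sep):
    """Return the highest index where sep occurs in text, or -1.

    Hypothetical helper: scans candidate start positions from right to left
    and compares slices, instead of relying on pattern intuition.
    """
    for start in range(len(text) - len(sep), -1, -1):
        if text[start:start + len(sep)] == sep:
            return start
    return -1

text = "a-.-.b"
sep = "-."

# Write the string out with indices before applying the operation.
for index, char in enumerate(text):
    print(index, repr(char))

print(rfind_mechanical(text, sep))  # index of the last "-."
print(text.rsplit(sep, 1))          # split once, at that last occurrence
```

If the mechanical result and the built-in method disagree, the helper itself is the first thing to re-check; if the mechanical result and your intuition disagree, trust the mechanical result.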
The most actionable finding: long-horizon degradation comes primarily from incorrect action generation, not from state-tracking errors. When the research replaced model-generated operations with ground-truth commands, the CWM tracked state accurately over 128+ steps. This means: if your trace goes wrong, the bug is almost certainly in which operation you think runs next (wrong branch taken, wrong function called, off-by-one in iteration count), not in how you computed the result of a correct operation. Prioritize verifying control flow and operation selection before re-checking arithmetic.
Extract the code under examination. Isolate the function or block to debug. Identify all input parameters, global state, and expected output. Write down the concrete input values you are tracing.
Initialize the state table. Create a variable-to-value mapping for every variable in scope at the entry point. Format as a table or dictionary literal: {x: 5, name: "hello", items: [1,2,3]}.
Execute line-by-line, producing state snapshots. For each statement, predict the runtime state after execution. Write the updated state table. For assignments, update the variable. For function calls, expand inline if the function is available; otherwise, predict the return value.
Compress traces for long executions. If the code contains loops with more than ~10 iterations, do NOT trace every single iteration. Instead: trace the first 2 iterations fully, state the pattern, then trace the last 2 iterations and the exit condition. Checkpoint the full state at loop entry and exit.
Flag string operations for character-level verification. When you encounter string methods (split, join, replace, find, rfind, rsplit, slicing with computed indices), do NOT predict the result from pattern recognition. Instead, write out the string character by character, apply the operation mechanically, and verify the result against your intuition. If they disagree, trust the mechanical trace.
Identify the divergence point. Compare your predicted state at each step against the expected behavior. The first statement where predicted state differs from correct state is your primary bug candidate.
Classify the failure mode. Ask: is the divergence because (a) the wrong operation executed (wrong branch, wrong function, wrong iteration count), which is an action-generation error, or (b) the right operation executed but produced the wrong result (arithmetic error, off-by-one in a computed value, wrong method behavior), which is a state-propagation error? Action-generation errors are far more common.
Verify control flow independently. If you suspect an action-generation error, trace only the control flow (which branches are taken, which iterations occur) without computing values. Confirm the execution path is correct before re-examining value computation.
Test the fix by re-tracing. After identifying the bug, apply the fix and re-run your state-trace from 2 steps before the divergence point through 2 steps after. Confirm the state now matches expected output.
Summarize with the root cause classification. Report: where the state diverged, whether it was action-generation or state-propagation, the specific failure mode (token-budget/string-brittleness/logic error), and the fix.
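Two of the steps above, compressing long loop traces and locating the divergence point, can be sketched as small helpers over a list of state dictionaries. The names `compress_loop_trace` and `first_divergence`, and the choice to represent each snapshot as a dict, are assumptions made for this illustration.

```python
def compress_loop_trace(snapshots, keep=2):
    """Keep the first and last `keep` snapshots and summarize the middle."""
    if len(snapshots) <= 2 * keep:
        return list(snapshots)
    omitted = len(snapshots) - 2 * keep
    marker = {"...": f"{omitted} iterations omitted"}
    return snapshots[:keep] + [marker] + snapshots[-keep:]

def first_divergence(predicted, actual):
    """Return the index of the first step where the traces differ, or None."""
    for step, (p, a) in enumerate(zip(predicted, actual)):
        if p != a:
            return step
    if len(predicted) != len(actual):
        # One trace ended early: the divergence is the first missing step.
        return min(len(predicted), len(actual))
    return None

# One snapshot per iteration of a loop that sums 0..99.
trace = []
total = 0
for i in range(100):
    total += i
    trace.append({"i": i, "total": total})

print(compress_loop_trace(trace))

predicted = [{"x": 1}, {"x": 2}, {"x": 4}]
actual = [{"x": 1}, {"x": 2}, {"x": 3}]
print(first_divergence(predicted, actual))
```

A length mismatch between traces is reported as a divergence because a missing or extra step usually points at control flow, which is the action-generation case.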
Example 1: Debugging a string manipulation function
User: "Why does this function return the wrong result?"
def extract_domain(email):
    parts = email.rsplit("@", 1)
    domain = parts[1]
    subdomain = domain.rsplit(".", 1)
    return subdomain[0]
# Expected: "company" for "user@company.co.uk"
# Actual: "company.co"
Approach:
1. Initial state: {email: "user@company.co.uk"}
2. parts = email.rsplit("@", 1): rsplit("@", 1) splits from the right, max 1 split. State: {parts: ["user", "company.co.uk"]}
3. domain = parts[1]: State: {domain: "company.co.uk"}
4. subdomain = domain.rsplit(".", 1): "company.co.uk".rsplit(".", 1) splits from the right at the last ".", giving ["company.co", "uk"]. State: {subdomain: ["company.co", "uk"]}
5. return subdomain[0] returns "company.co"

Divergence at step 4: The user expects rsplit(".", 1) to split at the first dot, but rsplit splits from the right. This is an action-generation error: the wrong operation was chosen (rsplit instead of split).
Fix: Change domain.rsplit(".", 1) to domain.split(".", 1) — or, if the intent is to get the subdomain before the TLD, rethink the parsing for multi-part TLDs.
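A minimal sketch of the first suggested fix, assuming the intent is to return the first label of the domain. The input address is the one implied by the trace (local part "user", domain "company.co.uk"); the variable name `labels` is introduced here for clarity.

```python
def extract_domain(email):
    """Return the first label of the email's domain, e.g. "company"."""
    # Split on the last "@" so the local part cannot leak into the domain.
    parts = email.rsplit("@", 1)
    domain = parts[1]
    # split (not rsplit) with maxsplit=1 cuts at the FIRST dot.
    labels = domain.split(".", 1)
    return labels[0]

print(extract_domain("user@company.co.uk"))
```

This sketch does not handle hosts with a leading subdomain such as "mail.company.co.uk"; that case needs the more careful parsing the fix note mentions.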
Example 2: Debugging a loop with accumulator drift
User: "This should compute a running average but the result is wrong after many iterations."
def running_avg(values):
    avg = 0
    for i, v in enumerate(values):
        avg = avg + (v - avg) / (i + 1)
    return avg
# Input: [10, 20, 30, 40, 50]
# Expected: 30.0
# Actual: 30.0 (seems correct... but fails on [1, 1, 1, ..., 1] x 1000)
Approach:
1. Initial state: {avg: 0, values: [1]*1000}
2. Iteration 0: avg = 0 + (1 - 0) / 1 = 1.0; state: {avg: 1.0}
3. Iteration 1: avg = 1.0 + (1 - 1.0) / 2 = 1.0 + 0.0 = 1.0; state: {avg: 1.0}
4. Mixed-magnitude input [1e15, 1, 1, 1, ...]:
   - Iteration 0: avg = 1e15
   - Iteration 1: avg = 1e15 + (1 - 1e15) / 2 = 1e15/2 + 0.5 = 5e14 + 0.5
   - (v - avg) loses precision when avg is large and v is small.

Root cause: state-propagation error due to floating-point precision loss in the subtraction (v - avg) when magnitudes differ greatly. This is a numerical stability issue, not a logic error. Fix: use compensated summation (Kahan) or two-pass computation.
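As a hedged illustration of the suggested direction, the sketch below computes the mean from an accurately rounded sum using the standard library's `math.fsum`. This swaps in `math.fsum` in place of a hand-written Kahan loop; `fsum_mean` and `naive_sum` are hypothetical names. The explicit loop in `naive_sum` shows how small addends can vanish next to a large one, which is the kind of magnitude mismatch the trace points at.

```python
import math

def running_avg(values):
    """Incremental mean from the example above."""
    avg = 0
    for i, v in enumerate(values):
        avg = avg + (v - avg) / (i + 1)
    return avg

def fsum_mean(values):
    """Mean computed from an accurately rounded sum (math.fsum)."""
    return math.fsum(values) / len(values)

def naive_sum(values):
    """Plain left-to-right float accumulation, with no compensation."""
    total = 0.0
    for v in values:
        total += v
    return total

data = [1e16, 1.0, 1.0]
# Each 1.0 is below the spacing of doubles near 1e16, so the naive loop drops it.
print(naive_sum(data))
print(math.fsum(data))

print(running_avg([10, 20, 30, 40, 50]), fsum_mean([10, 20, 30, 40, 50]))
```

On well-scaled input both mean functions agree; the difference only matters when magnitudes differ greatly.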
Example 3: Debugging control flow with early exit
User: "This search function sometimes returns None when the item exists."
def find_in_nested(data, target):
    for key, value in data.items():
        if isinstance(value, dict):
            result = find_in_nested(value, target)
            return result  # BUG: returns even if result is None
        elif value == target:
            return key
    return None
# Input: {"a": {"x": 1}, "b": 2}, target=2
Approach:
1. Initial state: {data: {"a": {"x": 1}, "b": 2}, target: 2}
2. First iteration (key "a"): isinstance({"x": 1}, dict) is True, so the function calls find_in_nested({"x": 1}, 2)
   - Inside the recursive call, 1 == 2 is False, so it returns None
   - result = None
   - return result executes immediately and returns None

Divergence at step 2, specifically return result: This is an action-generation error: the function unconditionally returns after the first recursive call, exiting the loop prematurely. The correct action is to return only when result is not None.
Fix: Replace return result with if result is not None: return result.
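Applied to the function above, the fix might look like the following sketch. It keeps the original behavior of returning the key at the level where the match is found, and only changes when the recursive result is returned.

```python
def find_in_nested(data, target):
    """Return the key whose value equals target, searching nested dicts."""
    for key, value in data.items():
        if isinstance(value, dict):
            result = find_in_nested(value, target)
            # Only exit early on a real match; otherwise keep scanning siblings.
            if result is not None:
                return result
        elif value == target:
            return key
    return None

print(find_in_nested({"a": {"x": 1}, "b": 2}, 2))
```

Re-tracing the same input: the recursive call for "a" still yields None, but the loop now continues to "b", where 2 == 2 is True and "b" is returned.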
rsplit, rfind, partition, and regex with special characters are the highest-risk operations.
[1, 2, 3, ..., 98, 99, 100] with explicit length annotation.
Debugging Code World Models (Rahmani, 2026). Key findings: (1) CWM failures concentrate in string-valued state due to tokenization discontinuities (strings are 46% of outputs but 73% of failures), (2) long-horizon degradation is caused by incorrect action generation rather than state-tracking errors: when given correct operations, Transformers track state accurately over 128+ steps, (3) non-string data types achieve 100% accuracy at depth-5 composition while strings degrade to 25%.