skills/agentstepper-interactive-debugging-software/SKILL.md
Interactive debugging of LLM-powered software development agents using structured trajectory analysis, stepwise execution, and live editing of prompts/tool calls. Use when: 'debug my agent', 'why did the agent do that', 'trace agent execution', 'step through agent actions', 'inspect agent trajectory', 'agent is producing wrong output'.
npx skillsauth add ndpvt-web/arxiv-claude-skills agentstepper-interactive-debugging-software
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to apply the AgentStepper debugging methodology to LLM-powered software development agents. Rather than treating agent execution as an opaque black box, this technique decomposes agent trajectories into structured conversations between the LLM, the agent program, and tools -- then provides debugging primitives (breakpoints, stepping, live editing) adapted from conventional debuggers but operating at agent-action-level abstraction. The core insight: debugging agents shares deep similarities with debugging software programs, but requires reasoning at higher abstractions corresponding to meaningful agent actions rather than lines of code.
AgentStepper models agent execution as interleaved structured conversations rather than flat linear logs. Every agent operates in a repeating 4-step cycle: (1) prompt the LLM, (2) receive the LLM response, (3) invoke a tool, (4) capture the tool output. By decomposing raw execution traces into these four message types and displaying them chronologically in parallel columns (agent-to-LLM and agent-to-tools), the trajectory becomes navigable. Each cycle represents one "step" -- the atomic unit of debugging.
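The four message types and the two channels can be written down as a small data model. This is a minimal sketch: the names `EventType`, `Event`, and `channel` are illustrative assumptions, not part of AgentStepper itself.

```python
from dataclasses import dataclass
from enum import Enum


class EventType(Enum):
    PROMPT = "prompt"            # agent -> LLM
    RESPONSE = "response"        # LLM -> agent
    TOOL_CALL = "tool_call"      # agent -> tool
    TOOL_OUTPUT = "tool_output"  # tool -> agent


# Events shown in the agent-to-LLM column; the rest go in the agent-to-tools column.
LLM_CHANNEL = {EventType.PROMPT, EventType.RESPONSE}


@dataclass
class Event:
    cycle: int        # 1-based step number
    type: EventType
    content: str

    @property
    def channel(self) -> str:
        return "llm" if self.type in LLM_CHANNEL else "tools"


# One complete step: the atomic unit of debugging.
step = [
    Event(1, EventType.PROMPT, "Fix the login timeout bug."),
    Event(1, EventType.RESPONSE, "I will search for 'timeout'."),
    Event(1, EventType.TOOL_CALL, "find_file timeout"),
    Event(1, EventType.TOOL_OUTPUT, "auth/timeout.py\nauth/session.py"),
]
```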
The second key innovation is repository-level change tracking. Every tool invocation that modifies code triggers a git commit with an LLM-generated summary. This produces a commit-history-style view of the agent's modifications, allowing developers to see exactly which action introduced a problematic code change. Combined with a message inspector that can diff successive prompts, this makes it possible to pinpoint both what changed and why the agent decided to change it.
The third innovation is live editing at breakpoints. When execution is paused, developers can modify the prompt before it reaches the LLM, simulate an LLM response entirely, alter tool invocation arguments, or edit tool outputs before the agent processes them. This enables controlled experimentation: "what would happen if the agent received different search results?" or "what if I correct the LLM's flawed reasoning before it acts?"
Capture the raw trajectory. Collect the full execution log of the agent run -- every LLM prompt, LLM response, tool invocation (name + arguments), and tool output. If logs are unavailable, reconstruct from available artifacts (shell history, git history, chat transcripts).
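One way to capture the trajectory is to append each event as a JSON line. The sketch below assumes a hypothetical record layout (`cycle`, `type`, `payload`); any format that preserves the four event types and their order would serve equally well.

```python
import io
import json


def record_event(stream, cycle, event_type, payload):
    """Append one trajectory event to a text stream as a JSON line."""
    record = {"cycle": cycle, "type": event_type, "payload": payload}
    stream.write(json.dumps(record) + "\n")


def load_events(stream):
    """Read events back in the order they were recorded."""
    return [json.loads(line) for line in stream if line.strip()]


# An in-memory buffer stands in for a log file here.
log = io.StringIO()
record_event(log, 1, "prompt", "Fix the login timeout bug.")
record_event(log, 1, "tool_call", {"name": "find_file", "args": {"query": "timeout"}})
log.seek(0)
events = load_events(log)
```

Recording tool calls as name plus arguments, rather than as a rendered string, keeps later analysis steps simple.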
Decompose into action cycles. Parse the trajectory into repeating 4-step cycles: prompt -> response -> tool_call -> tool_output. Label each event with its cycle number and type. If a cycle is incomplete (e.g., the agent made consecutive LLM calls without tool use), mark the deviation.
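The decomposition can be sketched as a single pass over the flat log. This assumes events arrive as `(event_type, content)` pairs and that every `prompt` event opens a new cycle; the function name and the dictionary layout are illustrative.

```python
EXPECTED = ["prompt", "response", "tool_call", "tool_output"]


def decompose(log):
    """Group a flat list of (event_type, content) pairs into numbered cycles.

    A cycle is flagged complete only if it contains exactly the four
    expected event types in order; anything else is a deviation.
    """
    cycles = []
    for event_type, content in log:
        if event_type == "prompt" or not cycles:
            cycles.append({"number": len(cycles) + 1, "events": []})
        cycles[-1]["events"].append((event_type, content))
    for cycle in cycles:
        types = [t for t, _ in cycle["events"]]
        cycle["complete"] = types == EXPECTED
    return cycles


log = [
    ("prompt", "p1"), ("response", "r1"), ("tool_call", "t1"), ("tool_output", "o1"),
    ("prompt", "p2"), ("response", "r2"),  # consecutive LLM calls, no tool use
    ("prompt", "p3"), ("response", "r3"), ("tool_call", "t3"), ("tool_output", "o3"),
]
cycles = decompose(log)
```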
Generate one-line summaries for each event. For each message in the trajectory, produce a concise summary: what the prompt asked, what the LLM decided, which tool was called with what intent, and what the tool returned. Use the following summary templates:
Build a structured conversation view. Arrange events in two parallel streams -- left column for agent-LLM interactions, right column for agent-tool interactions -- ordered chronologically. This reveals the causal chain: prompt led to decision led to action led to result led to next prompt.
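A plain-text rendering of the two-stream view might look like the following. It is a rough sketch, assuming events are `(cycle, event_type, summary)` tuples already in chronological order; the column width and header labels are arbitrary choices.

```python
def render_columns(events, width=36):
    """Return text lines with LLM traffic on the left and tool traffic on the right."""
    lines = [f"{'AGENT <-> LLM':<{width}} | AGENT <-> TOOLS"]
    for cycle, event_type, summary in events:
        cell = f"[{cycle}] {event_type}: {summary}"
        if event_type in ("prompt", "response"):
            lines.append(f"{cell:<{width}} |")
        else:
            lines.append(f"{'':<{width}} | {cell}")
    return lines


view = render_columns([
    (1, "prompt", "asks for a fix"),
    (1, "response", "decides to search"),
    (1, "tool_call", "find_file timeout"),
    (1, "tool_output", "2 candidate files"),
])
print("\n".join(view))
```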
Track repository-level changes. For each tool invocation that modified files, capture the diff (or reconstruct it from git history). Associate each diff with its action cycle and summary. Present these as a commit-history view showing progressive code modifications.
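When git history is not at hand, the same commit-history view can be approximated from file snapshots. The sketch below uses Python's standard `difflib` as a stand-in for `git diff`; the snapshot format and the function name are assumptions made for illustration.

```python
import difflib


def change_history(baseline, snapshots, path="file.py"):
    """Build a commit-style history from successive snapshots of one file.

    snapshots: list of (cycle, summary, file_text) in chronological order.
    Cycles that left the file unchanged produce no entry.
    """
    history = []
    previous = baseline
    for cycle, summary, text in snapshots:
        diff = list(difflib.unified_diff(
            previous.splitlines(), text.splitlines(),
            fromfile=f"a/{path}", tofile=f"b/{path}", lineterm="",
        ))
        if diff:
            history.append({"cycle": cycle, "summary": summary, "diff": diff})
        previous = text
    return history


base = "timeout = 5\n"
patched = "timeout = 5\ntimeout = max(timeout, MIN_TIMEOUT)\n"
history = change_history(base, [
    (5, "clamp timeout to minimum", patched),
    (6, "re-run tests", patched),  # no file change in this cycle
])
```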
Identify the divergence point. Walk through the structured trajectory to find the first cycle where the agent's behavior deviates from the correct path. Check: Did the prompt contain incorrect context? Did the LLM misinterpret the prompt? Did the tool call use wrong arguments? Did the tool return unexpected output?
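If a reference trajectory is available (for example, a known-good run of the same task), the four checks can be applied mechanically. This sketch assumes each cycle is a dictionary of normalized event summaries that can be compared for equality; in practice the comparison is usually a human judgment, and the function name is hypothetical.

```python
EVENT_ORDER = ("prompt", "response", "tool_call", "tool_output")


def first_divergence(actual, reference):
    """Return (cycle_number, event_type) for the first mismatch, or None.

    Events inside a cycle are checked in causal order, so the earliest
    point of deviation is the one reported.
    """
    for number, (got, want) in enumerate(zip(actual, reference), start=1):
        for event_type in EVENT_ORDER:
            if got.get(event_type) != want.get(event_type):
                return number, event_type
    return None


search = {
    "prompt": "fix login timeout",
    "response": "search for timeout",
    "tool_call": "find_file timeout",
    "tool_output": "auth/timeout.py, auth/session.py",
}
wrong_open = {
    "prompt": "choose a file",
    "response": "open auth/session.py",
    "tool_call": "open_file auth/session.py",
    "tool_output": "session code",
}
right_open = {
    "prompt": "choose a file",
    "response": "open auth/timeout.py",
    "tool_call": "open_file auth/timeout.py",
    "tool_output": "timeout code",
}
result = first_divergence([search, wrong_open], [search, right_open])
```

Here the prompts of cycle 2 match but the responses do not, so the divergence is attributed to the LLM response rather than to the tool call that followed it.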
Inspect prompt evolution. Compare successive prompts (cycle N vs cycle N+1) to detect context corruption, lost information, or hallucinated state. Diff the prompts to see exactly what changed between cycles.
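A line-level diff is often enough to spot information that dropped out of the context between cycles. The sketch below uses the standard library's `difflib.ndiff`; the function name and the returned dictionary shape are illustrative.

```python
import difflib


def prompt_delta(previous, current):
    """List the lines added to and removed from a prompt between two cycles."""
    added, removed = [], []
    for line in difflib.ndiff(previous.splitlines(), current.splitlines()):
        if line.startswith("+ "):
            added.append(line[2:])
        elif line.startswith("- "):
            removed.append(line[2:])
    return {"added": added, "removed": removed}


cycle_n = "Task: fix login timeout\nTarget file: auth/timeout.py\n"
cycle_n_plus_1 = "Task: fix login timeout\nLast tool output: 3 failures\n"
delta = prompt_delta(cycle_n, cycle_n_plus_1)
```

A non-empty `removed` list is the signal to look at: in this example the target file path has disappeared from the prompt.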
Propose a fix via live-edit simulation. For the identified divergence point, describe what edit to the prompt, LLM response, tool arguments, or tool output would have corrected the trajectory. This is the "what if" analysis: "If the prompt at cycle 3 had included the correct file path, the agent would have edited the right file."
Validate the fix propagates. Trace forward from the proposed edit to confirm it would resolve downstream issues, not just the immediate symptom. An incorrect tool output in cycle 2 may cause cascading failures through cycles 3-7.
Report findings with actionable recommendations. Present: (a) the specific cycle and event type where the bug originated, (b) the root cause (prompt issue, LLM reasoning failure, tool misconfiguration, or unexpected tool output), (c) the proposed fix, and (d) how to prevent recurrence (better prompts, tool validation, guardrails).
Example 1: Agent edits the wrong file
User: "I ran SWE-Agent on issue #142 (fix the login timeout bug) but it edited auth/session.py instead of auth/timeout.py. Why?"
Approach:
- find_file searches for "timeout" -> returns auth/timeout.py and auth/session.py
- open_file opens auth/session.py
- edit_file applies patch to auth/session.py

Output:
TRAJECTORY ANALYSIS
===================
Divergence Point: Cycle 2, Event: LLM Response
Root Cause: LLM selected auth/session.py over auth/timeout.py due to
ambiguous context -- the prompt listed filenames without content previews.
The LLM associated "login timeout" with "session" (plausible but incorrect).
Proposed Fix: At Cycle 2, the prompt should include the first 20 lines of
each candidate file so the LLM can see that auth/timeout.py contains the
LOGIN_TIMEOUT_SECONDS constant and the relevant retry logic.
Prevention: Add a file-preview step to the agent's file-selection pipeline.
Before asking the LLM to choose a file, show it content snippets, not just
filenames.
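The file-preview step suggested in the analysis above could be sketched as a prompt builder. This is an illustrative assumption about how such a step might look, not SWE-Agent's actual pipeline; file contents are passed in as a dictionary so the example stays self-contained.

```python
def build_selection_prompt(task, candidates, preview_lines=20):
    """Build a file-selection prompt that shows content previews, not just names.

    candidates: dict mapping file path -> full file text.
    """
    parts = [f"Task: {task}", "Choose the file to edit. Candidates:"]
    for path, text in candidates.items():
        preview = "\n".join(text.splitlines()[:preview_lines])
        parts.append(f"--- {path} (first {preview_lines} lines) ---\n{preview}")
    return "\n\n".join(parts)


prompt = build_selection_prompt(
    "Fix the login timeout bug",
    {
        "auth/timeout.py": "LOGIN_TIMEOUT_SECONDS = 30\n# retry logic follows",
        "auth/session.py": "class Session:\n    pass",
    },
)
```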
Example 2: Agent enters an infinite tool-call loop
User: "My custom repair agent keeps calling run_tests over and over without making progress. It's stuck in a loop."
Approach:
- Cycle 5: edit_file applies patch -> run_tests returns 3 failures
- Cycle 6: edit_file applies patch -> run_tests returns 3 failures (same tests)
- Cycle 7: edit_file applies patch (reverts to Cycle 5 state) -> run_tests returns 3 failures

Output:
TRAJECTORY ANALYSIS
===================
Divergence Point: Cycle 6, Event: Prompt Construction
Root Cause: The agent's prompt template does not accumulate history of
previous repair attempts. The LLM sees only the current test failures,
causing it to oscillate between two candidate fixes indefinitely.
Code Changes (commit-style view):
Cycle 5: + timeout = max(timeout, MIN_TIMEOUT) # Fix A
Cycle 6: + timeout = self.config.get('timeout') # Fix B
Cycle 7: + timeout = max(timeout, MIN_TIMEOUT) # Fix A (repeated)
Proposed Fix: Modify the prompt template to include a "Previous Attempts"
section listing past patches and their test results. This gives the LLM
the context to try genuinely new approaches.
Prevention: Add loop detection -- if the same tool output appears 3+ times,
inject a meta-prompt: "Previous approaches have failed. Analyze why and
try a fundamentally different strategy."
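Both recommendations from the analysis above fit in a small prompt builder. This is a sketch under stated assumptions: the function names are hypothetical, attempts are `(patch_summary, test_result)` pairs, and a loop is defined as any identical tool output seen three or more times.

```python
from collections import Counter

META_PROMPT = (
    "Previous approaches have failed. Analyze why and "
    "try a fundamentally different strategy."
)


def format_previous_attempts(attempts):
    """Render past patches and their test results, oldest first."""
    lines = ["Previous Attempts:"]
    for number, (patch, result) in enumerate(attempts, start=1):
        lines.append(f"{number}. {patch} -> {result}")
    return "\n".join(lines)


def detect_loop(tool_outputs, threshold=3):
    """Return True once any tool output has appeared `threshold` or more times."""
    counts = Counter(tool_outputs)
    return any(n >= threshold for n in counts.values())


def build_prompt(task, attempts, tool_outputs):
    """Assemble the next prompt, adding history and a meta-prompt when needed."""
    sections = [task]
    if attempts:
        sections.append(format_previous_attempts(attempts))
    if detect_loop(tool_outputs):
        sections.append(META_PROMPT)
    return "\n\n".join(sections)
```

Exact string matching on tool output is a deliberately crude loop signal. Test output that contains timestamps or durations would need normalizing before it is counted.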
Example 3: Instrumenting an existing agent with debugging hooks
User: "I have a Python agent framework. How do I add AgentStepper-style debugging to it?"
Approach:
Output:
    # Minimal instrumentation -- 7 API calls, ~40 lines of code
    from agentstepper import Debugger

    dbg = Debugger(repo_path="./workspace")  # Tracks code changes via git

    class MyAgent:
        def run(self, task):
            dbg.start_run(task_description=task)
            while not self.is_done():
                # 1. Before sending prompt to LLM
                prompt = self.build_prompt()
                prompt = dbg.before_llm_query(prompt)  # Breakpoint + live edit

                # 2. After receiving LLM response
                response = self.llm.complete(prompt)
                response = dbg.after_llm_response(response)  # Inspect + override

                # 3. Before invoking tool
                tool_name, tool_args = self.parse_tool_call(response)
                tool_name, tool_args = dbg.before_tool_call(tool_name, tool_args)

                # 4. After tool execution
                tool_output = self.execute_tool(tool_name, tool_args)
                tool_output = dbg.after_tool_output(tool_output)  # Inspect + edit

                self.update_state(tool_output)
            dbg.end_run()
Each dbg.* call: (a) records the event to the trajectory, (b) checks for breakpoints, (c) pauses if in stepping mode, and (d) allows live editing of the value before returning it. The debugger auto-commits file changes after each tool call.
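To make the record, check, and edit behavior concrete, here is a toy stand-in written for this explanation. `MiniDebugger` is hypothetical and is not the agentstepper library. It replaces the interactive pause with an edit callback so the example can run unattended, and it omits git tracking entirely.

```python
class MiniDebugger:
    """Hypothetical illustration of a debugging hook; not the real library."""

    def __init__(self):
        self.trajectory = []   # (a) every event that passed through a hook
        self.breakpoints = {}  # event_type -> edit callback

    def set_breakpoint(self, event_type, edit):
        self.breakpoints[event_type] = edit

    def _hook(self, event_type, value):
        self.trajectory.append((event_type, value))  # (a) record the original
        edit = self.breakpoints.get(event_type)      # (b) check for a breakpoint
        if edit is None:
            return value
        edited = edit(value)  # (c) + (d): the callback stands in for pause and live edit
        if edited != value:
            self.trajectory.append((event_type + ":edited", edited))
        return edited

    def before_llm_query(self, prompt):
        return self._hook("prompt", prompt)

    def after_tool_output(self, output):
        return self._hook("tool_output", output)


dbg = MiniDebugger()
# "What if the agent had received different search results?"
dbg.set_breakpoint("tool_output", lambda out: out.replace("session.py", "timeout.py"))
seen_by_agent = dbg.after_tool_output("auth/session.py")
```

Recording both the original and the edited value keeps the trajectory honest about which events were observed and which were injected by the developer.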
| Problem | Diagnosis | Resolution |
|---------|-----------|------------|
| Incomplete trajectory logs | Agent crashed mid-cycle or logging was partial | Reconstruct from git history and shell history; mark missing events as [unrecorded] |
| No clear divergence point | Agent gradually drifted rather than making one wrong turn | Compare the trajectory against an ideal trajectory for the task; look for cumulative prompt degradation |
| Trajectory too long to analyze manually | Agent ran 50+ cycles | Start from the end (first visible symptom) and trace backward through the commit-history view to find when the problematic code change was introduced |
| Tool outputs are opaque (binary data, long outputs) | Cannot meaningfully summarize | Truncate tool outputs to first/last 50 lines; focus on exit codes and error messages as diagnostic signals |
| Multiple interleaved bugs | Agent made several independent mistakes | Isolate each bug by identifying which action cycles it affects; debug the earliest one first since later bugs may be cascading consequences |
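The truncation rule from the table (first and last 50 lines) can be sketched as follows. The function name and the format of the omission marker are illustrative choices.

```python
def truncate_output(text, keep=50):
    """Keep the first and last `keep` lines of a long tool output."""
    lines = text.splitlines()
    if len(lines) <= 2 * keep:
        return text  # short enough to show in full
    omitted = len(lines) - 2 * keep
    marker = f"... [{omitted} lines omitted] ..."
    return "\n".join(lines[:keep] + [marker] + lines[-keep:])


long_output = "\n".join(f"line {i}" for i in range(200))
short = truncate_output(long_output)
```

Keeping the tail matters because test runners and compilers usually print their summary and exit status at the end.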
AgentStepper: Interactive Debugging of Software Development Agents -- Hutter & Pradel, 2026. Key sections: the 4-event action cycle model (Section 3), the 7-function instrumentation API (Table 1), the structured conversation view design (Section 4), and the user study showing bug identification success jumping from 17% to 60% with the debugger (Section 5).