Hive Debugger

An interactive debugging companion that helps developers identify and fix runtime issues in Hive agents. The debugger analyzes runtime logs at three levels (L1/L2/L3), categorizes issues, and provides actionable fix recommendations.

When to Use This Skill

Use /hive-debugger when:

Your agent is failing or producing unexpected results
You need to understand why a specific node is retrying repeatedly
Tool calls are failing and you need to identify the root cause
Agent execution is stalled or taking too long
You want to monitor agent behavior in real-time during development

This skill works alongside agents running in TUI mode and provides supervisor-level insights into execution behavior.

Forever-Alive Agent Awareness

Some agents use terminal_nodes=[] (the "forever-alive" pattern), meaning they loop indefinitely and never enter a "completed" execution state. For these agents:

Sessions with status "in_progress" or "paused" are normal, not failures
High step counts, long durations, and many node visits are expected behavior
The agent stops only when the user explicitly exits — there is no graph-driven completion
Debug focus should be on quality of individual node visits and iterations, not whether the session reached a terminal state
Conversation memory accumulates across loops — watch for context overflow and stale data issues

How to identify forever-alive agents: Check agent.py or agent.json for terminal_nodes=[] (empty list). If empty, the agent is forever-alive.
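
The check above can be sketched for the agent.json case as follows. This is a minimal illustration, not part of Hive: `find_key` and `is_forever_alive` are hypothetical helper names, and because the exact location of `terminal_nodes` inside agent.json is not specified here, the sketch searches the whole document for the key.

```python
import json
from pathlib import Path
from typing import Any, Optional


def find_key(data: Any, key: str) -> Optional[Any]:
    """Return the first value stored under `key` anywhere in nested JSON data."""
    if isinstance(data, dict):
        if key in data:
            return data[key]
        children = list(data.values())
    elif isinstance(data, list):
        children = data
    else:
        return None
    for child in children:
        found = find_key(child, key)
        if found is not None:
            return found
    return None


def is_forever_alive(agent_json: Path) -> bool:
    """True when agent.json declares terminal_nodes as an empty list.

    Assumption: a missing terminal_nodes key is treated as "not forever-alive".
    """
    config = json.loads(agent_json.read_text(encoding="utf-8"))
    return find_key(config, "terminal_nodes") == []
```

For an agent defined only in agent.py, the same question has to be answered by reading the source, since this sketch only covers the JSON export.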

Prerequisites

Before using this skill, ensure:

You have an exported agent in exports/{agent_name}/
The agent has been run at least once (logs exist)
Runtime logging is enabled (default in Hive framework)
You have access to the agent's working directory at ~/.hive/agents/{agent_name}/

Workflow

Stage 1: Setup & Context Gathering

Objective: Understand the agent being debugged

What to do:

Ask the developer which agent needs debugging:
- Get agent name (e.g., "deep_research_agent")
- Confirm the agent exists in exports/{agent_name}/
Determine agent working directory:
- Calculate: ~/.hive/agents/{agent_name}/
- Verify this directory exists and contains session logs
Read agent configuration:
- Read file: exports/{agent_name}/agent.json
- Extract goal information from the JSON:
  - goal.id - The goal identifier
  - goal.success_criteria - What success looks like
  - goal.constraints - Rules the agent must follow
- Extract graph information:
  - List of node IDs from graph.nodes
  - List of edges from graph.edges
Store context for the debugging session:
- agent_name
- agent_work_dir (e.g., /home/user/.hive/agents/deep_research_agent)
- goal_id
- success_criteria
- constraints
- node_ids
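
The context-gathering steps above could be captured in a small helper like the one below. It is a sketch under stated assumptions: `DebugContext` and `gather_context` are hypothetical names, the field layout (`goal.id`, `goal.success_criteria`, `goal.constraints`, `graph.nodes`) follows the list above, and each entry in `graph.nodes` is assumed to be either an object with an `id` field or a plain string.

```python
import json
from dataclasses import dataclass
from pathlib import Path
from typing import List


@dataclass
class DebugContext:
    """Values kept for the rest of the debugging session."""
    agent_name: str
    agent_work_dir: Path
    goal_id: str
    success_criteria: List[str]
    constraints: List[str]
    node_ids: List[str]


def gather_context(agent_name: str, exports_root: Path, hive_home: Path) -> DebugContext:
    """Read exports/{agent_name}/agent.json and locate the working directory."""
    config_path = exports_root / agent_name / "agent.json"
    if not config_path.is_file():
        raise FileNotFoundError(f"No exported agent found at {config_path}")

    work_dir = hive_home / "agents" / agent_name
    if not work_dir.is_dir():
        raise FileNotFoundError(f"No working directory at {work_dir}; has the agent run yet?")

    config = json.loads(config_path.read_text(encoding="utf-8"))
    goal = config.get("goal", {})
    nodes = config.get("graph", {}).get("nodes", [])
    # Assumption: nodes are objects with an "id" field; plain strings are accepted too.
    node_ids = [n["id"] if isinstance(n, dict) else str(n) for n in nodes]

    return DebugContext(
        agent_name=agent_name,
        agent_work_dir=work_dir,
        goal_id=goal.get("id", ""),
        success_criteria=list(goal.get("success_criteria", [])),
        constraints=list(goal.get("constraints", [])),
        node_ids=node_ids,
    )
```

A call such as `gather_context("deep_research_agent", Path("exports"), Path.home() / ".hive")` would then fail early if either directory is missing, which mirrors the two existence checks in the list.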

Example:

Developer: "My deep_research_agent keeps failing"

You: "I'll help debug deep_research_agent. Let me gather context..."

[Read exports/deep_research_agent/agent.json]

Context gathered:
- Agent: deep_research_agent
- Goal: deep-research
- Working Directory: /home/user/.hive/agents/deep_research_agent
- Success Criteria: ["Produce a comprehensive research report with cited sources"]
- Constraints: ["Must cite all sources", "Must cover multiple perspectives"]
- Nodes: ["intake", "research", "analysis", "report-writer"]

Stage 2: Mode Selection

Objective: Choose the debugging approach that best fits the situation

What to do:

Ask the developer which debugging mode they want to use. Use AskUserQuestion with these options:

Real-time Monitoring Mode
- Description: Monitor active TUI session continuously, poll logs every 5-10 seconds, alert on new issues immediately
- Best for: Live debugging sessions where you want to catch issues as they happen
- Note: Requires agent to be currently running
Post-Mortem Analysis Mode
- Description: Analyze completed or failed runs in detail, deep dive into specific session
- Best for: Understanding why a past execution failed
- Note: Most common mode for debugging
Historical Trends Mode
- Description: Analyze patterns across multiple runs, identify recurring issues
- Best for: Finding systemic problems that happen repeatedly
- Note: Useful for agents that have run many times

Implementation:

Use AskUserQuestion to present these options and let the developer choose.
Store the selected mode for the session.

Stage 3: Triage (L1 Analysis)

Objective: Identify which sessions need attention

What to do:

Query high-level run summaries using the MCP tool:

query_runtime_logs(
    agent_work_dir="{agent_work_dir}",
    status="needs_attention",
    limit=20
)

Analyze the results:
- Look for runs with needs_attention: true
- Check attention_summary.categories for issue types
- Note the run_id of problematic sessions
- Check status field: "degraded", "failure", "in_progress"
- For forever-alive agents: Sessions with status "in_progress" or "paused" are normal — these agents never reach "completed". Only flag sessions with needs_attention: true or actual error indicators (tool failures, retry loops, missing outputs). High step counts alone do not indicate a problem.
Attention flag triggers to understand: From runtime_logger.py, runs are flagged when:
- retry_count > 3
- escalate_count > 2
- latency_ms > 60000
- tokens_used > 100000
- total_steps > 20
Present findings to developer:
- Summarize how many runs need attention
- List the most recent problematic runs
- Show attention categories for each
- Ask which run they want to investigate (if multiple)
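
To make the attention-flag thresholds listed above concrete, here is a small sketch that applies them to a run's metrics. The threshold values come from the list; the function name `attention_triggers` and the flat dictionary of metrics are assumptions for illustration and do not reflect how runtime_logger.py is actually written.

```python
from typing import Dict, List

# Thresholds as listed above; a metric must exceed its limit to trigger a flag.
ATTENTION_THRESHOLDS: Dict[str, int] = {
    "retry_count": 3,
    "escalate_count": 2,
    "latency_ms": 60_000,
    "tokens_used": 100_000,
    "total_steps": 20,
}


def attention_triggers(run_metrics: Dict[str, float]) -> List[str]:
    """Return the names of metrics that exceed their attention threshold."""
    return [
        name
        for name, limit in ATTENTION_THRESHOLDS.items()
        if run_metrics.get(name, 0) > limit
    ]
```

Knowing which threshold fired helps decide whether a flag matters: for a forever-alive agent, a run flagged only for `total_steps` is usually less interesting than one flagged for `retry_count`.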

Example Output:

Found 2 runs needing attention:

1. session_20260206_115718_e22339c5 (30 minutes ago)
   Status: degraded
   Categories: missing_outputs, retry_loops

2. session_20260206_103422_9f8d1b2a (2 hours ago)
   Status: failure
   Categories: tool_failures, high_latency

Which run would you like to investigate?
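
Relative times such as "30 minutes ago" can be derived from the session ID itself, which follows the session_YYYYMMDD_HHMMSS_{uuid} naming shown in the examples. The sketch below assumes the embedded timestamp is in the same timezone as the `now` value passed in; `session_started_at` and `minutes_since` are hypothetical helper names.

```python
import re
from datetime import datetime

# Matches IDs such as session_20260206_115718_e22339c5.
SESSION_ID_PATTERN = re.compile(r"^session_(\d{8})_(\d{6})_([0-9A-Za-z]+)$")


def session_started_at(session_id: str) -> datetime:
    """Parse the start timestamp embedded in a session ID."""
    match = SESSION_ID_PATTERN.match(session_id)
    if match is None:
        raise ValueError(f"Unrecognized session ID format: {session_id!r}")
    date_part, time_part, _suffix = match.groups()
    return datetime.strptime(date_part + time_part, "%Y%m%d%H%M%S")


def minutes_since(session_id: str, now: datetime) -> int:
    """Whole minutes elapsed between the session start and `now`."""
    elapsed = now - session_started_at(session_id)
    return int(elapsed.total_seconds() // 60)
```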

Stage 4: Diagnosis (L2 Analysis)

Objective: Identify which nodes failed and what patterns exist

What to do:

Query per-node details using the MCP tool:

query_runtime_log_details(
    agent_work_dir="{agent_work_dir}",
    run_id="{selected_run_id}",
    needs_attention_only=True
)

Categorize issues using the Issue Taxonomy:

10 Issue Categories:

| Category | Detection Pattern | Meaning |
|----------|-------------------|---------|
| Missing Outputs | exit_status != "success", attention_reasons contains "missing_outputs" | Node didn't call set_output with required keys |
| Tool Errors | tool_error_count > 0, attention_reasons contains "tool_failures" | Tool calls failed (API errors, timeouts, auth issues) |
| Retry Loops | retry_count > 3, verdict_counts.RETRY > 5 | Judge repeatedly rejecting outputs |
| Guard Failures | guard_reject_count > 0 | Output validation failed (wrong types, missing keys) |
| Stalled Execution | total_steps > 20, verdict_counts.CONTINUE > 10 | EventLoopNode not making progress. Caveat: forever-alive agents may legitimately have high step counts; check whether the agent is blocked at a client-facing node (normal) or genuinely stuck in a loop |
| High Latency | latency_ms > 60000, avg_step_latency > 5000 | Slow tool calls or LLM responses |
| Client-Facing Issues | client_input_requested but no user_input_received | Premature set_output before user input |
| Edge Routing Errors | exit_status == "no_valid_edge", attention_reasons contains "routing_issue" | No edges match current state |
| Memory/Context Issues | tokens_used > 100000, context_overflow_count > 0 | Conversation history too long |
| Constraint Violations | Compare output against goal constraints | Agent violated goal-level rules |

Forever-Alive Agent Caveat: If the agent uses terminal_nodes=[], sessions will never reach "completed" status. This is by design. When debugging these agents, focus on:
- Whether individual node visits succeed (not whether the graph "finishes")
- Quality of each loop iteration — are outputs improving or degrading across loops?
- Whether client-facing nodes are correctly blocking for user input
- Memory accumulation issues: stale data from previous loops, context overflow across many iterations
- Conversation compaction behavior: is the conversation growing unbounded?
Analyze each flagged node:
- Node ID and name
- Exit status
- Retry count
- Verdict distribution (ACCEPT/RETRY/ESCALATE/CONTINUE)
- Attention reasons
- Total steps executed
Present diagnosis to developer:
- List problematic nodes
- Categorize each issue
- Highlight the most severe problems
- Show evidence (retry counts, error types)
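
The detection patterns in the taxonomy can be applied mechanically to a node's L2 record, as in the sketch below. Several things here are assumptions: the record is treated as a flat dictionary using the field names from the table, a comma between two conditions is read as "either condition is enough", and `categorize_node` is a hypothetical helper. Client-Facing Issues and Constraint Violations are left out because they need judgment rather than a numeric check.

```python
from typing import Any, Dict, List


def categorize_node(node: Dict[str, Any]) -> List[str]:
    """Map one node's L2 details to issue categories from the taxonomy."""
    reasons = node.get("attention_reasons", [])
    verdicts = node.get("verdict_counts", {})
    categories: List[str] = []

    if "missing_outputs" in reasons:
        categories.append("Missing Outputs")
    if node.get("tool_error_count", 0) > 0 or "tool_failures" in reasons:
        categories.append("Tool Errors")
    if node.get("retry_count", 0) > 3 or verdicts.get("RETRY", 0) > 5:
        categories.append("Retry Loops")
    if node.get("guard_reject_count", 0) > 0:
        categories.append("Guard Failures")
    if node.get("total_steps", 0) > 20 or verdicts.get("CONTINUE", 0) > 10:
        categories.append("Stalled Execution")
    if node.get("latency_ms", 0) > 60_000 or node.get("avg_step_latency", 0) > 5_000:
        categories.append("High Latency")
    if node.get("exit_status") == "no_valid_edge" or "routing_issue" in reasons:
        categories.append("Edge Routing Errors")
    if node.get("tokens_used", 0) > 100_000 or node.get("context_overflow_count", 0) > 0:
        categories.append("Memory/Context Issues")

    return categories
```

For forever-alive agents, a "Stalled Execution" result from this kind of check is only a prompt to look closer, since high step counts are expected there.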

Example Output:

Diagnosis for session_20260206_115718_e22339c5:

Problem Node: research
├─ Exit Status: escalate
├─ Retry Count: 5 (HIGH)
├─ Verdict Counts: {RETRY: 5, ESCALATE: 1}
├─ Attention Reasons: ["high_retry_count", "missing_outputs"]
├─ Total Steps: 8
└─ Categories: Missing Outputs + Retry Loops

Root Issue: The research node is stuck in a retry loop because it's not setting required outputs.

Stage 5: Root Cause Analysis (L3 Analysis)

Objective: Understand exactly what went wrong by examining detailed logs

What to do:

Query detailed tool/LLM logs using the MCP tool:

query_runtime_log_raw(
    agent_work_dir="{agent_work_dir}",
    run_id="{run_id}",
    node_id="{problem_node_id}"
)

Analyze based on issue category:

For Missing Outputs:
- Check step.tool_calls for set_output usage
- Look for conditional logic that skipped set_output
- Check if LLM is calling other tools instead
For Tool Errors:
- Check step.tool_results for error messages
- Identify error types: rate limits, auth failures, timeouts, network errors
- Note which specific tool is failing
For Retry Loops:
- Check step.verdict_feedback from judge
- Look for repeated failure reasons
- Identify if it's the same issue every time
For Guard Failures:
- Check step.guard_results for validation errors
- Identify missing keys or type mismatches
- Compare actual output to expected schema
For Stalled Execution:
- Check step.llm_response_text for repetition
- Look for LLM stuck in same action loop
- Check if tool calls are succeeding but not progressing
Extract evidence:
- Specific error messages
- Tool call arguments and results
- LLM response text
- Judge feedback
- Step-by-step progression
Formulate root cause explanation:
- Clearly state what is happening
- Explain why it's happening
- Show evidence from logs
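
For the Missing Outputs and Retry Loops cases, the evidence extraction above amounts to counting which tools were called and collecting the judge's feedback. The sketch below assumes each L3 step is a dictionary with a `tool_calls` list (each call carrying a `name`), a `verdict`, and a `verdict_feedback` string; those field shapes and the helper name `summarize_retry_loop` are illustrative assumptions, not the documented log schema.

```python
from collections import Counter
from typing import Any, Dict, List


def summarize_retry_loop(steps: List[Dict[str, Any]]) -> Dict[str, Any]:
    """Collect evidence about tool usage and judge feedback across steps."""
    tools_called: Counter = Counter()
    retry_feedback: List[str] = []

    for step in steps:
        for call in step.get("tool_calls", []):
            tools_called[call.get("name", "unknown")] += 1
        if step.get("verdict") == "RETRY" and step.get("verdict_feedback"):
            retry_feedback.append(step["verdict_feedback"])

    return {
        # False here, combined with RETRY verdicts, points at a missing set_output call.
        "set_output_called": tools_called["set_output"] > 0,
        "tool_call_counts": dict(tools_called),
        "retry_feedback": retry_feedback,
    }
```

If `retry_feedback` contains near-identical messages on every step, the node is most likely repeating one mistake rather than failing in different ways.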

Example Output:

Root Cause Analysis for research:

Step-by-step breakdown:

Step 3:
- Tool Call: web_search(query="latest AI regulations 2026")
- Result: Found relevant articles and sources
- Verdict: RETRY
- Feedback: "Missing required output 'research_findings'. You found sources but didn't call set_output."

Step 4:
- Tool Call: web_search(query="AI regulation policy 2026")
- Result: Found additional policy information
- Verdict: RETRY
- Feedback: "Still missing 'research_findings'. Use set_output to save your findings."

Steps 5-7: Similar pattern continues...

ROOT CAUSE: The node is successfully finding research sources via web_search, but the LLM is not calling set_output to save the results. It keeps searching for more information instead of completing the task.

Stage 6: Fix Recommendations

Objective: Provide actionable solutions the developer can implement

What to do:

Based on the issue category identified, provide specific fix recommendations using these templates:

Template 1: Missing Outputs (Client-Facing Nodes)

## Issue: Premature set_output in Client-Facing Node

**Root Cause:** Node called set_output before receiving user input

**Fix:** Use STEP 1/STEP 2 prompt pattern

**File to edit:** `exports/{agent_name}/nodes/{node_name}.py`

**Changes:**
1. Update the system_prompt to include explicit step guidance:
   ```python
   system_prompt = """
   STEP 1: Analyze the user input and decide what action to take.
   DO NOT call set_output in this step.

   STEP 2: After receiving feedback or completing analysis,
   ONLY THEN call set_output with your results.
   """
   ```
2. If some inputs are optional (like feedback on retry edges), add nullable_output_keys:
   ```python
   nullable_output_keys=["feedback"]
   ```

**Verification:**
- Run the agent with test input
- Verify the client-facing node waits for user input before calling set_output


#### Template 2: Retry Loops

## Issue: Judge Repeatedly Rejecting Outputs

**Root Cause:** {Insert specific reason from verdict_feedback}

**Fix Options:**

**Option A - If outputs are actually correct:** Adjust judge evaluation rules
- File: `exports/{agent_name}/agent.json`
- Update `evaluation_rules` section to accept the current output format
- Example: If judge expects list but gets string, update rule to accept both

**Option B - If prompt is ambiguous:** Clarify node instructions
- File: `exports/{agent_name}/nodes/{node_name}.py`
- Make system_prompt more explicit about output format and requirements
- Add examples of correct outputs

**Option C - If tool is unreliable:** Add retry logic with fallback
- Consider using alternative tools
- Add manual fallback option
- Update prompt to handle tool failures gracefully

**Verification:**
- Run the node with test input
- Confirm judge accepts output on first try
- Check that retry_count stays at 0

Template 3: Tool Errors

## Issue: {tool_name} Failing with {error_type}

**Root Cause:** {Insert specific error message from logs}

**Fix Strategy:**

**If API rate limit:**
1. Add exponential backoff in tool retry logic
2. Reduce API call frequency
3. Consider caching results

**If auth failure:**
1. Check credentials using:
   ```bash
   /hive-credentials --agent {agent_name}
   ```
2. Verify API key environment variables
3. Update mcp_servers.json if needed

**If timeout:**
1. Increase timeout in mcp_servers.json:
   ```json
   {
     "timeout_ms": 60000
   }
   ```
2. Consider using faster alternative tools
3. Break large requests into smaller chunks

**Verification:**
- Test tool call manually
- Confirm successful response
- Monitor for recurring errors
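
For the rate-limit case in this template, "exponential backoff" can be illustrated with the generic sketch below. It is not Hive's retry mechanism: `RateLimitError` is a hypothetical stand-in for whatever error the failing tool raises, and `call_with_backoff` is an illustrative name. The delay doubles on each attempt up to a cap, with random jitter so that parallel callers do not retry in lockstep.

```python
import random
import time
from typing import Callable, TypeVar

T = TypeVar("T")


class RateLimitError(Exception):
    """Hypothetical error type standing in for a tool's rate-limit failure."""


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 5,
    base_delay: float = 1.0,
    max_delay: float = 30.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Retry `call` on RateLimitError, waiting longer after each failure."""
    for attempt in range(max_attempts):
        try:
            return call()
        except RateLimitError:
            if attempt == max_attempts - 1:
                raise  # Out of attempts: surface the original error.
            # Upper bound doubles each attempt: base, 2*base, 4*base, ... capped at max_delay.
            upper_bound = min(max_delay, base_delay * 2 ** attempt)
            sleep(random.uniform(0, upper_bound))
    raise ValueError("max_attempts must be at least 1")
```

The `sleep` parameter exists so the wait can be replaced in tests; in normal use the default `time.sleep` applies.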


#### Template 4: Edge Routing Errors

## Issue: No Valid Edge from Node {node_id}

**Root Cause:** No edge condition matched the current state

**File to edit:** `exports/{agent_name}/agent.json`

**Analysis:**
- Current node output: {show actual output keys}
- Existing edge conditions: {list edge conditions}
- Why no match: {explain the mismatch}

**Fix:**
Add the missing edge to the graph:
```json
{
  "edge_id": "{node_id}_to_{target_node}",
  "source": "{node_id}",
  "target": "{target_node}",
  "condition": "on_success"
}
```

**Alternative:** Update existing edge condition to cover this case

**Verification:**
- Run agent with same input
- Verify edge is traversed successfully
- Check that execution continues to next node


#### Template 5: Stalled Execution

## Issue: EventLoopNode Not Making Progress

**Root Cause:** {Insert analysis - e.g., "LLM repeating same failed action"}

**File to edit:** `exports/{agent_name}/nodes/{node_name}.py`

**Fix:** Update system_prompt to guide LLM out of loops

**Add this guidance:**
```python
system_prompt = """
{existing prompt}

IMPORTANT: If a tool call fails multiple times:
1. Try an alternative approach or different tool
2. If no alternatives work, call set_output with partial results
3. DO NOT retry the same failed action more than 3 times

Progress is more important than perfection. Move forward even with incomplete data.
"""
```

**Additional fix:** Lower `max_node_visits` in the node configuration to prevent infinite loops:

```python
# In node configuration
max_node_visits=3  # Prevent getting stuck
```

**Verification:**
- Run node with same input that caused stall
- Verify it exits after reasonable attempts (< 10 steps)
- Confirm it calls set_output eventually


#### Template 6: Checkpoint Recovery (Post-Fix Resume)

## Recovery Strategy: Resume from Last Clean Checkpoint

**Situation:** You've fixed the issue, but the failed session is stuck mid-execution

**Solution:** Resume execution from a checkpoint before the failure

### Option A: Auto-Resume from Latest Checkpoint (Recommended)

Use CLI arguments to auto-resume when launching TUI:

```bash
PYTHONPATH=core:exports python -m {agent_name} --tui \
    --resume-session {session_id}
```

This will:
- Load session state from state.json
- Continue from where it paused/failed
- Apply your fixes immediately

Option B: Resume from Specific Checkpoint (Time-Travel)

If you need to go back to an earlier point:

PYTHONPATH=core:exports python -m {agent_name} --tui \
    --resume-session {session_id} \
    --checkpoint {checkpoint_id}

Example:

PYTHONPATH=core:exports python -m deep_research_agent --tui \
    --resume-session session_20260208_143022_abc12345 \
    --checkpoint cp_node_complete_intake_143030

Option C: Use TUI Commands

Alternatively, launch TUI normally and use commands:

# Launch TUI
PYTHONPATH=core:exports python -m {agent_name} --tui

# In TUI, use commands:
/resume {session_id}                    # Resume from session state
/recover {session_id} {checkpoint_id}   # Recover from specific checkpoint

When to Use Each Option:

Use /resume (or --resume-session) when:

You fixed credentials and want to retry
Agent paused and you want to continue
Agent failed and you want to retry from last state

Use /recover (or --resume-session + --checkpoint) when:

You need to go back to an earlier checkpoint
You want to try a different path from a specific point
Debugging requires time-travel to earlier state

Find Available Checkpoints:

Use MCP tools to programmatically find and inspect checkpoints:

# List all sessions to find the failed one
list_agent_sessions(agent_work_dir="~/.hive/agents/{agent_name}", status="failed")

# Inspect session state
get_agent_session_state(agent_work_dir="~/.hive/agents/{agent_name}", session_id="{session_id}")

# Find clean checkpoints to resume from
list_agent_checkpoints(agent_work_dir="~/.hive/agents/{agent_name}", session_id="{session_id}", is_clean="true")

# Compare checkpoints to understand what changed
compare_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="{session_id}",
    checkpoint_id_before="cp_node_complete_intake_143030",
    checkpoint_id_after="cp_node_complete_research_143115"
)

# Inspect memory at a specific checkpoint
get_agent_checkpoint(agent_work_dir="~/.hive/agents/{agent_name}", session_id="{session_id}", checkpoint_id="cp_node_complete_intake_143030")

Or in TUI:

/sessions {session_id}

Verification:

Use --resume-session to test your fix immediately
No need to re-run from the beginning
Session continues with your code changes applied


**Selecting the right template:**
- Match the issue category from Stage 4
- Customize with specific details from Stage 5
- Include actual error messages and code snippets
- Provide file paths and line numbers when possible
- **Always include recovery commands** (Template 6) after providing fix recommendations

---

### Stage 7: Verification Support

**Objective:** Help the developer confirm their fixes work

**What to do:**

1. **Suggest appropriate tests based on fix type:**

   **For node-level fixes:**
   ```bash
   # Use hive-test to run goal-based tests
   /hive-test --agent {agent_name} --goal {goal_id}

   # Or run specific test scenarios
   /hive-test --agent {agent_name} --scenario {specific_input}
   ```

For quick manual tests:

# Launch the interactive TUI dashboard
hive tui

Then use arrow keys to select the agent from the list and press Enter to run it.

Provide MCP tool queries to validate the fix:

Check if issue is resolved:

query_runtime_logs(
    agent_work_dir="~/.hive/agents/{agent_name}",
    status="needs_attention",
    limit=5
)
# Should show 0 results if fully fixed

Verify specific node behavior:

query_runtime_log_details(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="{new_run_id}",
    node_id="{fixed_node_id}"
)
# Should show exit_status="success", retry_count=0

Monitor for regression:
- Run the agent multiple times
- Check for similar issues reappearing
- Verify fix works across different inputs

Provide verification checklist:

Verification Checklist:
□ Applied recommended fix to code
□ Ran agent with test input
□ Checked runtime logs show no attention flags
□ Verified specific node completes successfully
□ Tested with multiple inputs
□ No regression of original issue
□ Agent meets success criteria

Example interaction:

Developer: "I applied the fix to research. How do I verify it works?"

You: "Great! Let's verify the fix with these steps:

1. Launch the TUI dashboard:
   hive tui
   Then select your agent from the list and press Enter to run it.

2. After it completes, check the logs:
   [Use query_runtime_logs to check for attention flags]

3. Verify the specific node:
   [Use query_runtime_log_details for research]

Expected results:
- No 'needs_attention' flags
- research shows exit_status='success'
- retry_count should be 0

Let me know when you've run it and I'll help check the logs!"

MCP Tool Usage Guide

Three Levels of Observability

L1: query_runtime_logs - Session-level summaries

When to use: Initial triage, identifying problematic runs, monitoring trends
Returns: List of runs with status, attention flags, timestamps

Example:

query_runtime_logs(
    agent_work_dir="/home/user/.hive/agents/deep_research_agent",
    status="needs_attention",
    limit=20
)

L2: query_runtime_log_details - Node-level details

When to use: Diagnosing which nodes failed, understanding retry patterns
Returns: Per-node completion details, retry counts, verdicts

Example:

query_runtime_log_details(
    agent_work_dir="/home/user/.hive/agents/deep_research_agent",
    run_id="session_20260206_115718_e22339c5",
    needs_attention_only=True
)

L3: query_runtime_log_raw - Step-level details

When to use: Root cause analysis, understanding exact failures
Returns: Full tool calls, LLM responses, judge feedback

Example:

query_runtime_log_raw(
    agent_work_dir="/home/user/.hive/agents/deep_research_agent",
    run_id="session_20260206_115718_e22339c5",
    node_id="research"
)

Session & Checkpoint Tools

list_agent_sessions - Browse sessions with filtering

When to use: Finding resumable sessions, identifying failed sessions, Stage 3 triage
Returns: Session list with status, timestamps, is_resumable, current_node, quality

Example:

list_agent_sessions(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    status="failed",
    limit=10
)

get_agent_session_state - Load full session state (excludes memory values)

When to use: Inspecting session progress, checking is_resumable, examining path
Returns: Full state with memory_keys/memory_size instead of memory values

Example:

get_agent_session_state(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    session_id="session_20260208_143022_abc12345"
)

get_agent_session_memory - Get memory contents from a session

When to use: Stage 5 root cause analysis, inspecting produced data
Returns: All memory keys+values, or a single key's value

Example:

get_agent_session_memory(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    session_id="session_20260208_143022_abc12345",
    key="twitter_handles"
)

list_agent_checkpoints - List checkpoints for a session

When to use: Stage 6 recovery, finding clean checkpoints to resume from
Returns: Checkpoint summaries with type, node, clean status

Example:

list_agent_checkpoints(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    session_id="session_20260208_143022_abc12345",
    is_clean="true"
)

get_agent_checkpoint - Load a specific checkpoint with full state

When to use: Inspecting exact state at a checkpoint, comparing to current state
Returns: Full checkpoint: memory snapshot, execution path, metrics

Example:

get_agent_checkpoint(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    session_id="session_20260208_143022_abc12345",
    checkpoint_id="cp_node_complete_intake_143030"
)

compare_agent_checkpoints - Diff memory between two checkpoints

When to use: Understanding data flow, finding where state diverged
Returns: Memory diff (added/removed/changed keys) + execution path diff

Example:

compare_agent_checkpoints(
    agent_work_dir="/home/user/.hive/agents/twitter_outreach",
    session_id="session_20260208_143022_abc12345",
    checkpoint_id_before="cp_node_complete_intake_143030",
    checkpoint_id_after="cp_node_complete_research_143115"
)

Query Patterns

Pattern 1: Top-Down Investigation (Most common)

1. L1: Find problematic runs
2. L2: Identify failing nodes
3. L3: Analyze specific failures
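
The control flow of this pattern can be sketched as three nested queries. The MCP tools are invoked by the assistant rather than imported as Python functions, so the sketch takes them as injected callables; the keyword arguments mirror the examples in this guide, while the `run_id` and `node_id` fields on the returned records and the name `top_down_investigation` are assumptions.

```python
from typing import Any, Callable, Dict, List

Query = Callable[..., List[Dict[str, Any]]]


def top_down_investigation(
    agent_work_dir: str,
    query_l1: Query,
    query_l2: Query,
    query_l3: Query,
) -> List[Dict[str, Any]]:
    """Drill from flagged runs (L1) to flagged nodes (L2) to raw steps (L3)."""
    findings: List[Dict[str, Any]] = []
    flagged_runs = query_l1(
        agent_work_dir=agent_work_dir, status="needs_attention", limit=20
    )
    for run in flagged_runs:
        flagged_nodes = query_l2(
            agent_work_dir=agent_work_dir,
            run_id=run["run_id"],
            needs_attention_only=True,
        )
        for node in flagged_nodes:
            steps = query_l3(
                agent_work_dir=agent_work_dir,
                run_id=run["run_id"],
                node_id=node["node_id"],
            )
            findings.append(
                {"run_id": run["run_id"], "node_id": node["node_id"], "steps": steps}
            )
    return findings
```

Because L3 is only queried for nodes that L2 already flagged, the expensive step-level logs are read for a handful of nodes rather than for every run.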

Pattern 2: Node-Specific Debugging

1. L2: Get details for specific node across all runs
2. L3: Deep dive into worst failures

Pattern 3: Real-time Monitoring

Loop every 10 seconds:
  1. L1: Check for new needs_attention runs
  2. If found: Alert and drill into L2
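
A polling loop for this pattern mainly needs to remember which runs it has already reported. The sketch below assumes a caller-supplied `fetch_flagged_runs` function that wraps the L1 query and returns records with a `run_id` field; both that field and the name `poll_for_new_issues` are illustrative assumptions.

```python
import time
from typing import Any, Callable, Dict, List, Optional, Set


def poll_for_new_issues(
    fetch_flagged_runs: Callable[[], List[Dict[str, Any]]],
    on_new_run: Callable[[Dict[str, Any]], None],
    interval_seconds: float = 10.0,
    max_polls: Optional[int] = None,
    sleep: Callable[[float], None] = time.sleep,
) -> Set[str]:
    """Poll for flagged runs and report each run ID only once."""
    seen: Set[str] = set()
    polls = 0
    while max_polls is None or polls < max_polls:
        for run in fetch_flagged_runs():
            run_id = run["run_id"]
            if run_id not in seen:
                seen.add(run_id)
                on_new_run(run)  # Alert, then drill into L2 for this run.
        polls += 1
        # Skip the final wait when a poll limit has been reached.
        if max_polls is None or polls < max_polls:
            sleep(interval_seconds)
    return seen
```

With `max_polls=None` the loop runs until interrupted, which suits a live TUI session; a finite `max_polls` is useful for bounded checks.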

Pattern 4: Session State + Checkpoint Recovery

1. list_agent_sessions: Find failed/paused sessions
2. get_agent_session_state: Check is_resumable, see execution path
3. get_agent_session_memory: Inspect what data was produced
4. list_agent_checkpoints: Find clean checkpoints before failure
5. compare_agent_checkpoints: Understand what changed between checkpoints
6. Recommend resume command with specific checkpoint
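
The final step, recommending a resume command, can be generated from the session and checkpoint IDs found in the earlier steps. The sketch below only assembles the command line shown in Template 6 (`--tui`, `--resume-session`, `--checkpoint`); `build_resume_command` is a hypothetical helper name.

```python
import shlex
from typing import List, Optional


def build_resume_command(
    agent_name: str,
    session_id: str,
    checkpoint_id: Optional[str] = None,
) -> str:
    """Build the shell command that resumes a session, optionally from a checkpoint."""
    args: List[str] = [
        "python", "-m", agent_name, "--tui", "--resume-session", session_id,
    ]
    if checkpoint_id is not None:
        args += ["--checkpoint", checkpoint_id]
    # Quote each argument so unusual IDs cannot break the command line.
    return "PYTHONPATH=core:exports " + " ".join(shlex.quote(a) for a in args)
```

Omitting `checkpoint_id` yields the plain resume command; passing one yields the time-travel variant.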

Complete Example Walkthrough

Scenario: deep_research_agent Agent Failing

Developer says: "My deep_research_agent keeps failing. It's not saving research findings correctly."

Stage 1: Setup

You: "I'll help debug the deep_research_agent agent. Let me gather context..."

[Read exports/deep_research_agent/agent.json]

Context:
- Agent: deep_research_agent
- Goal: deep-research
- Working Dir: ~/.hive/agents/deep_research_agent
- Success Criteria: ["Produce a comprehensive research report with cited sources"]
- Constraints: ["Must cite all sources", "Must cover multiple perspectives"]
- Nodes: intake, research, analysis, report-writer

Stage 2: Mode Selection

You: "Which debugging approach would you prefer?"

[Present 3 mode options using AskUserQuestion]

Developer selects: "Post-Mortem Analysis Mode"

Stage 3: Triage (L1)

You: "Querying runtime logs for problematic runs..."

[Call query_runtime_logs with status="needs_attention"]

Results:
- Found 1 run needing attention
- session_20260206_115718_e22339c5 (30 min ago)
- Status: degraded
- Categories: missing_outputs, retry_loops

Stage 4: Diagnosis (L2)

You: "Analyzing node-level details..."

[Call query_runtime_log_details for the session]

Results:
- Problem Node: research
- Exit Status: escalate
- Retry Count: 5
- Verdict Counts: {RETRY: 5, ESCALATE: 1}
- Attention Reasons: ["high_retry_count", "missing_outputs"]
- Total Steps: 8

Diagnosis: research has retry loop + missing outputs issue
Category: Missing Outputs + Retry Loops

Stage 5: Root Cause (L3)

You: "Drilling into detailed logs for research..."

[Call query_runtime_log_raw for node research]

Evidence:
Step 3: web_search("latest AI regulations 2026") → Found articles
  Verdict: RETRY
  Feedback: "Missing required output 'research_findings'. You found sources but didn't call set_output."

Step 4: web_search("AI regulation policy 2026") → Found more info
  Verdict: RETRY
  Feedback: "Still missing 'research_findings'. Use set_output to save your findings."

Steps 5-7: Same pattern repeats...

ROOT CAUSE: Node successfully finds research sources but LLM doesn't call set_output. It keeps searching instead of completing.

Stage 6: Fix Recommendation

You: "Here's the fix for this issue:

## Issue: research Node Not Setting Outputs

**Root Cause:** Node successfully finds research sources but doesn't call set_output

**Category:** Missing Outputs

**File to edit:** exports/deep_research_agent/nodes/research.py

**Changes:**
Update the system_prompt to emphasize the output requirement:

```python
system_prompt = """
You are the Research node for the deep research agent.

Your task:
1. Search for relevant sources on the research topic
2. Extract key findings from each source
3. **CRITICAL: Call set_output(research_findings=[...]) immediately after gathering sufficient findings**

IMPORTANT: You MUST call set_output even if you found findings on the first search.
Do not continue searching indefinitely - set output and finish once you have enough data.

Output format:
{
  "research_findings": [{"source": "...", "summary": "...", "url": "..."}, ...]
}
"""
```

Why this works: The explicit instruction to call set_output immediately prevents the LLM from continuing to search after finding sufficient sources.


#### Stage 7: Verification

Developer: "I updated the prompt. How do I test it?"

You: "Let's verify the fix:

1. Launch the TUI dashboard with hive tui, then select your agent and press Enter to run it.
2. I'll check the logs after it runs..."

[Developer runs agent]

Developer: "It completed! Can you check if it worked?"

You: "Checking now..."

[Call query_runtime_log_details for the new run]

Results:

Node: research
Exit Status: success ✓
Retry Count: 0 ✓
Total Steps: 2 ✓
No attention flags ✓

"Perfect! The fix worked. The research node now:

Completes successfully on first try
No retries needed
Calls set_output properly

Your agent should now work correctly!"


---

## Tips for Effective Debugging

1. **Always start with L1 logs** - Don't jump straight to detailed logs
2. **Focus on attention flags** - They highlight the real issues
3. **Compare verdict_feedback across steps** - Patterns reveal root causes
4. **Check tool error messages carefully** - They often contain the exact problem
5. **Consider the agent's goal** - Fixes should align with success criteria
6. **Test fixes immediately** - Quick verification prevents wasted effort
7. **Look for patterns across multiple runs** - One-time failures might be transient

## Common Pitfalls to Avoid

1. **Don't recommend code you haven't verified exists** - Always read files first
2. **Don't assume tool capabilities** - Check MCP server configs
3. **Don't ignore edge conditions** - Missing edges cause routing failures
4. **Don't overlook judge configuration** - Mismatched expectations cause retry loops
5. **Don't forget nullable_output_keys** - Optional inputs need explicit marking
6. **Don't diagnose "in_progress" as a failure for forever-alive agents** - Agents with `terminal_nodes=[]` are designed to never enter "completed" state. This is intentional. Focus on quality of individual node visits, not session completion status
7. **Don't ignore conversation memory issues in long-running sessions** - In continuous conversation mode, history grows across node transitions and loop iterations. Watch for context overflow (tokens_used > 100K), stale data from previous loops affecting edge conditions, and compaction failures that cause the LLM to lose important context
8. **Don't confuse "waiting for user" with "stalled"** - Client-facing nodes in forever-alive agents block for user input by design. A session paused at a client-facing node is working correctly, not stalled

---

## Storage Locations Reference

**New unified storage (default):**
- Logs: `~/.hive/agents/{agent_name}/sessions/session_YYYYMMDD_HHMMSS_{uuid}/logs/`
- State: `~/.hive/agents/{agent_name}/sessions/{session_id}/state.json`
- Conversations: `~/.hive/agents/{agent_name}/sessions/{session_id}/conversations/`

**Old storage (deprecated, still supported):**
- Logs: `~/.hive/agents/{agent_name}/runtime_logs/runs/{run_id}/`

The MCP tools automatically check both locations.
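
When inspecting logs by hand, the same two-location lookup can be reproduced with a short helper. This is a sketch, not the MCP tools' implementation: `resolve_log_dir` is a hypothetical name, and it assumes the run ID doubles as the session directory name, as in the examples in this guide.

```python
from pathlib import Path
from typing import Optional


def resolve_log_dir(agent_work_dir: Path, run_id: str) -> Optional[Path]:
    """Return the log directory for a run, preferring the unified layout."""
    candidates = [
        agent_work_dir / "sessions" / run_id / "logs",      # new unified storage
        agent_work_dir / "runtime_logs" / "runs" / run_id,  # old, deprecated storage
    ]
    for candidate in candidates:
        if candidate.is_dir():
            return candidate
    return None
```

A `None` result means neither layout contains logs for that run, which usually indicates a wrong working directory or a run that never started.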

---

**Remember:** Your role is to be a debugging companion and thought partner. Guide the developer through the investigation, explain what you find, and provide actionable fixes. Don't just report errors - help understand and solve them.

