Agent Testing

Test agents iteratively: execute, analyze failures, fix, resume from checkpoint, repeat.

When to Use

Testing a newly built agent against its goal
Debugging a failing agent iteratively
Verifying fixes without re-running expensive early nodes
Running final regression tests before deployment

Prerequisites

Agent package at exports/{agent_name}/ (built with /hive-create)
Credentials configured (/hive-credentials)
ANTHROPIC_API_KEY set (or appropriate LLM provider key)

Path distinction (critical — don't confuse these):

exports/{agent_name}/ — agent source code (edit here)
~/.hive/agents/{agent_name}/ — runtime data: sessions, checkpoints, logs (read here)

The Iterative Test Loop

This is the core workflow. Don't re-run the entire agent when a late node fails — analyze, fix, and resume from the last clean checkpoint.

┌──────────────────────────────────────┐
│ PHASE 1: Generate Test Scenarios     │
│ Goal → synthetic test inputs + tests │
└──────────────┬───────────────────────┘
               ↓
┌──────────────────────────────────────┐
│ PHASE 2: Execute                     │◄────────────────┐
│ Run agent (CLI or pytest)            │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
          Pass? ──yes──► PHASE 6: Final Verification     │
               │                                         │
               no                                        │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 3: Analyze                     │                 │
│ Session + runtime logs + checkpoints │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 4: Fix                         │                 │
│ Prompt / code / graph / goal         │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 5: Recover & Resume            │─────────────────┘
│ Checkpoint resume OR fresh re-run    │
└──────────────────────────────────────┘

Phase 1: Generate Test Scenarios

Create synthetic tests from the agent's goal, constraints, and success criteria.

Step 1a: Read the goal

# Read goal from agent.py
Read(file_path="exports/{agent_name}/agent.py")
# Extract the Goal definition and convert to JSON string

Step 1b: Get test guidelines

# Get constraint test guidelines
generate_constraint_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "constraints": [...]}',
    agent_path="exports/{agent_name}"
)

# Get success criteria test guidelines
generate_success_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "success_criteria": [...]}',
    node_names="intake,research,review,report",
    tool_names="web_search,web_scrape",
    agent_path="exports/{agent_name}"
)

These return file_header, test_template, constraints_formatted/success_criteria_formatted, and test_guidelines. They do NOT generate test code — you write the tests.

Step 1c: Write tests

Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + your_test_code
)

Test writing rules

Every test MUST be async with @pytest.mark.asyncio
Every test MUST accept runner, auto_responder, mock_mode fixtures
Use await auto_responder.start() before running, await auto_responder.stop() in finally
Use await runner.run(input_dict) — this goes through AgentRunner → AgentRuntime → ExecutionStream
Access output via result.output.get("key") — NEVER result.output["key"]
result.success=True means no exception, NOT goal achieved — always check output
Write 8-15 tests total, not 30+
Each real test costs ~3 seconds + LLM tokens
NEVER use default_agent.run() — it bypasses the runtime (no sessions, no logs, client-facing nodes hang)

Step 1d: Check existing tests

Before generating, check if tests already exist:

list_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)

Phase 2: Execute

Two execution paths, use the right one for your situation.

Iterative debugging (for complex agents)

Run the agent via CLI. This creates sessions with checkpoints at ~/.hive/agents/{agent_name}/sessions/:

uv run hive run exports/{agent_name} --input '{"query": "test topic"}'

Sessions and checkpoints are saved automatically.

Client-facing nodes: Agents with client_facing=True nodes (interactive conversation) work in headless mode when run from a real terminal — the agent streams output to stdout and reads user input from stdin via a >>> prompt. In non-interactive shells (like Claude Code's Bash tool), client-facing nodes will hang because there is no stdin. For testing interactive agents from Claude Code, use run_tests with mock mode or have the user run the agent manually in their terminal.

Automated regression (for CI or final verification)

Use the run_tests MCP tool to run all pytest tests:

run_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)

Returns structured results:

{
  "overall_passed": false,
  "summary": {"total": 12, "passed": 10, "failed": 2, "pass_rate": "83.3%"},
  "test_results": [{"test_name": "test_success_source_diversity", "status": "failed"}],
  "failures": [{"test_name": "test_success_source_diversity", "details": "..."}]
}

Options:

# Run only constraint tests
run_tests(goal_id, agent_path, test_types='["constraint"]')

# Stop on first failure
run_tests(goal_id, agent_path, fail_fast=True)

# Parallel execution
run_tests(goal_id, agent_path, parallel=4)

Note: run_tests uses AgentRunner with tmp_path storage, so sessions are isolated per test run. For checkpoint-based recovery with persistent sessions, use CLI execution. Use run_tests for quick regression checks and final verification.

Phase 3: Analyze Failures

When a test fails, drill down systematically. Don't guess — use the tools.

Step 3a: Get error category

debug_test(
    goal_id="your-goal-id",
    test_name="test_success_source_diversity",
    agent_path="exports/{agent_name}"
)

Returns error category (IMPLEMENTATION_ERROR, ASSERTION_FAILURE, TIMEOUT, IMPORT_ERROR, API_ERROR) plus full traceback and suggestions.

Step 3b: Find the failed session

list_agent_sessions(
    agent_work_dir="~/.hive/agents/{agent_name}",
    status="failed",
    limit=5
)

Returns session list with IDs, timestamps, current_node (where it failed), execution_quality.

Step 3c: Inspect session state

get_agent_session_state(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345"
)

Returns execution path, which node was current, step count, timestamps — but excludes memory values (to avoid context bloat). Shows memory_keys and memory_size instead.

Step 3d: Examine runtime logs (L2/L3)

# L2: Per-node success/failure, retry counts
query_runtime_log_details(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    needs_attention_only=True
)

# L3: Exact LLM responses, tool call inputs/outputs
query_runtime_log_raw(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    node_id="research"
)

Step 3e: Inspect memory data

# See what data a node actually produced
get_agent_session_memory(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    key="research_results"
)

Step 3f: Find recovery points

list_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    is_clean="true"
)

Returns checkpoint summaries with IDs, types (node_start, node_complete), which node, and is_clean flag. Clean checkpoints are safe resume points.

Step 3g: Compare checkpoints (optional)

To understand what changed between two points in execution:

compare_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id_before="cp_node_complete_research_143030",
    checkpoint_id_after="cp_node_complete_review_143115"
)

Returns memory diff (added/removed/changed keys) and execution path diff.

Phase 4: Fix Based on Root Cause

Use the analysis from Phase 3 to determine what to fix and where.

| Root Cause | What to Fix | Where to Edit | |------------|------------|---------------| | Prompt issue — LLM produces wrong output format, misses instructions | Node system_prompt | exports/{agent}/nodes/__init__.py | | Code bug — TypeError, KeyError, logic error in Python | Agent code | exports/{agent}/agent.py, nodes/__init__.py | | Graph issue — wrong routing, missing edge, bad condition_expr | Edges, node config | exports/{agent}/agent.py | | Tool issue — MCP tool fails, wrong config, missing credential | Tool config | exports/{agent}/mcp_servers.json, /hive-credentials | | Goal issue — success criteria too strict/vague, wrong constraints | Goal definition | exports/{agent}/agent.py (goal section) | | Test issue — test expectations don't match actual agent behavior | Test code | exports/{agent}/tests/test_*.py |

Fix strategies by error category

IMPLEMENTATION_ERROR (TypeError, AttributeError, KeyError):

# Read the failing code
Read(file_path="exports/{agent_name}/nodes/__init__.py")

# Fix the bug
Edit(
    file_path="exports/{agent_name}/nodes/__init__.py",
    old_string="results.get('videos')",
    new_string="(results or {}).get('videos', [])"
)

ASSERTION_FAILURE (test assertions fail but agent ran successfully):

Check if the agent's output is actually wrong → fix the prompt
Check if the test's expectations are unrealistic → fix the test
Use get_agent_session_memory to see what the agent actually produced

TIMEOUT / STALL (agent runs too long):

Check node_visit_counts for feedback loops hitting max_node_visits
Check L3 logs for tool calls that hang
Reduce max_iterations in loop_config or fix the prompt to converge faster

API_ERROR (connection, rate limit, auth):

Verify credentials with /hive-credentials
Check MCP server configuration

Phase 5: Recover & Resume

After fixing the agent, decide whether to resume or re-run.

When to resume from checkpoint

Resume when ALL of these are true:

The fix is to a node that comes AFTER existing clean checkpoints
Clean checkpoints exist (from a CLI execution with checkpointing)
The early nodes are expensive (web scraping, API calls, long LLM chains)

# Resume from the last clean checkpoint before the failing node
uv run hive run exports/{agent_name} \
  --resume-session session_20260209_143022_abc12345 \
  --checkpoint cp_node_complete_research_143030

This skips all nodes before the checkpoint and only re-runs the fixed node onward.

When to re-run from scratch

Re-run when ANY of these are true:

The fix is to the entry node or an early node
No checkpoints exist (e.g., agent was run via run_tests)
The agent is fast (2-3 nodes, completes in seconds)
You changed the graph structure (added/removed nodes/edges)

uv run hive run exports/{agent_name} --input '{"query": "test topic"}'

Inspecting a checkpoint before resuming

get_agent_checkpoint(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id="cp_node_complete_research_143030"
)

Returns the full checkpoint: shared_memory snapshot, execution_path, current_node, next_node, is_clean.

Loop back to Phase 2

After resuming or re-running, check if the fix worked. If not, go back to Phase 3.

Phase 6: Final Verification

Once the iterative fix loop converges (the agent produces correct output), run the full automated test suite:

run_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)

All tests should pass. If not, repeat the loop for remaining failures.

Credential Requirements

CRITICAL: Testing requires ALL credentials the agent depends on. This includes both the LLM API key AND any tool-specific credentials (HubSpot, Brave Search, etc.).

Prerequisites

Before running agent tests, you MUST collect ALL required credentials from the user.

Step 1: LLM API Key (always required)

export ANTHROPIC_API_KEY="your-key-here"

Step 2: Tool-specific credentials (depends on agent's tools)

Inspect the agent's mcp_servers.json and tool configuration to determine which tools the agent uses, then check for all required credentials:

from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS

creds = CredentialManager()

# Determine which tools the agent uses (from agent.json or mcp_servers.json)
agent_tools = [...]  # e.g., ["hubspot_search_contacts", "web_search", ...]

# Find all missing credentials for those tools
missing = creds.get_missing_for_tools(agent_tools)

Common tool credentials: | Tool | Env Var | Help URL | |------|---------|----------| | HubSpot CRM | HUBSPOT_ACCESS_TOKEN | https://developers.hubspot.com/docs/api/private-apps | | Brave Search | BRAVE_SEARCH_API_KEY | https://brave.com/search/api/ | | Google Search | GOOGLE_SEARCH_API_KEY + GOOGLE_SEARCH_CX | https://developers.google.com/custom-search |

Why ALL credentials are required:

Tests need to execute the agent's LLM nodes to validate behavior
Tools with missing credentials will return error dicts instead of real data
Mock mode bypasses everything, providing no confidence in real-world performance

Mock Mode Limitations

Mock mode (--mock flag or MOCK_MODE=1) is ONLY for structure validation:

Validates graph structure (nodes, edges, connections)
Validates that AgentRunner.load() succeeds and the agent is importable
Does NOT execute event_loop agents — MockLLMProvider never calls set_output, so event_loop nodes loop forever
Does NOT test LLM reasoning, content quality, or constraint validation
Does NOT test real API integrations or tool use

Bottom line: If you're testing whether an agent achieves its goal, you MUST use real credentials.

Enforcing Credentials in Tests

When writing tests, ALWAYS include credential checks:

import os
import pytest
from aden_tools.credentials import CredentialManager

pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)


@pytest.fixture(scope="session", autouse=True)
def check_credentials():
    """Ensure ALL required credentials are set for real testing."""
    creds = CredentialManager()
    mock_mode = os.environ.get("MOCK_MODE")

    if not creds.is_available("anthropic"):
        if mock_mode:
            print("\nRunning in MOCK MODE - structure validation only")
        else:
            pytest.fail(
                "\nANTHROPIC_API_KEY not set!\n"
                "Set API key: export ANTHROPIC_API_KEY='your-key-here'\n"
                "Or run structure validation: MOCK_MODE=1 pytest exports/{agent}/tests/"
            )

    if not mock_mode:
        agent_tools = []  # Update per agent
        missing = creds.get_missing_for_tools(agent_tools)
        if missing:
            lines = ["\nMissing tool credentials!"]
            for name in missing:
                spec = creds.specs.get(name)
                if spec:
                    lines.append(f"  {spec.env_var} - {spec.description}")
            pytest.fail("\n".join(lines))

User Communication

When the user asks to test an agent, ALWAYS check for ALL credentials first:

Identify the agent's tools from mcp_servers.json
Check ALL required credentials using CredentialManager
Ask the user to provide any missing credentials before proceeding
Collect ALL missing credentials in a single prompt — not one at a time

Safe Test Patterns

OutputCleaner

The framework automatically validates and cleans node outputs using a fast LLM at edge traversal time. Tests should still use safe patterns because OutputCleaner may not catch all issues.

Safe Access (REQUIRED)

# UNSAFE - will crash on missing keys
approval = result.output["approval_decision"]
category = result.output["analysis"]["category"]

# SAFE - use .get() with defaults
output = result.output or {}
approval = output.get("approval_decision", "UNKNOWN")

# SAFE - type check before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
    category = analysis.get("category", "unknown")

# SAFE - handle JSON parsing trap (LLM response as string)
import json
recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
    try:
        parsed = json.loads(recommendation)
        if isinstance(parsed, dict):
            approval = parsed.get("approval_decision", "UNKNOWN")
    except json.JSONDecodeError:
        approval = "UNKNOWN"
elif isinstance(recommendation, dict):
    approval = recommendation.get("approval_decision", "UNKNOWN")

# SAFE - type check before iteration
items = output.get("items", [])
if isinstance(items, list):
    for item in items:
        ...

Helper Functions for conftest.py

import json
import re

def _parse_json_from_output(result, key):
    """Parse JSON from agent output (framework may store full LLM response as string)."""
    response_text = result.output.get(key, "")
    json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()
    try:
        return json.loads(json_text)
    except (json.JSONDecodeError, AttributeError, TypeError):
        return result.output.get(key)

def safe_get_nested(result, key_path, default=None):
    """Safely get nested value from result.output."""
    output = result.output or {}
    current = output
    for key in key_path:
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, str):
            try:
                json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
                parsed = json.loads(json_text)
                if isinstance(parsed, dict):
                    current = parsed.get(key)
                else:
                    return default
            except json.JSONDecodeError:
                return default
        else:
            return default
    return current if current is not None else default

# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested

ExecutionResult Fields

result.success=True means NO exception, NOT goal achieved

# WRONG
assert result.success

# RIGHT
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"

All fields:

success: bool — Completed without exception (NOT goal achieved!)
output: dict — Complete memory snapshot (may contain raw strings)
error: str | None — Error message if failed
steps_executed: int — Number of nodes executed
total_tokens: int — Cumulative token usage
total_latency_ms: int — Total execution time
path: list[str] — Node IDs traversed (may repeat in feedback loops)
paused_at: str | None — Node ID if paused
session_state: dict — State for resuming
node_visit_counts: dict[str, int] — Visit counts per node (feedback loop testing)
execution_quality: str — "clean", "degraded", or "failed"

Test Count Guidance

Write 8-15 tests, not 30+

2-3 tests per success criterion
1 happy path test
1 boundary/edge case test
1 error handling test (optional)

Each real test costs ~3 seconds + LLM tokens. 12 tests = ~36 seconds, $0.12.

Test Patterns

Happy Path

@pytest.mark.asyncio
async def test_happy_path(runner, auto_responder, mock_mode):
    """Test normal successful execution."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "python tutorials"})
    finally:
        await auto_responder.stop()
    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    assert output.get("report"), "No report produced"

Boundary Condition

@pytest.mark.asyncio
async def test_minimum_sources(runner, auto_responder, mock_mode):
    """Test at minimum source threshold."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "niche topic"})
    finally:
        await auto_responder.stop()
    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    sources = output.get("sources", [])
    if isinstance(sources, list):
        assert len(sources) >= 3, f"Expected >= 3 sources, got {len(sources)}"

Error Handling

@pytest.mark.asyncio
async def test_empty_input(runner, auto_responder, mock_mode):
    """Test graceful handling of empty input."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": ""})
    finally:
        await auto_responder.stop()
    # Agent should either fail gracefully or produce an error message
    output = result.output or {}
    assert not result.success or output.get("error"), "Should handle empty input"

Feedback Loop

@pytest.mark.asyncio
async def test_feedback_loop_terminates(runner, auto_responder, mock_mode):
    """Test that feedback loops don't run forever."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "test"})
    finally:
        await auto_responder.stop()
    visits = result.node_visit_counts or {}
    for node_id, count in visits.items():
        assert count <= 5, f"Node {node_id} visited {count} times — possible infinite loop"

MCP Tool Reference

Phase 1: Test Generation

# Check existing tests
list_tests(goal_id, agent_path)

# Get constraint test guidelines (returns templates, NOT generated tests)
generate_constraint_tests(goal_id, goal_json, agent_path)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines

# Get success criteria test guidelines
generate_success_tests(goal_id, goal_json, node_names, tool_names, agent_path)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines

Phase 2: Execution

# Automated regression (no checkpoints, fresh runs)
run_tests(goal_id, agent_path, test_types='["all"]', parallel=-1, fail_fast=False)

# Run only specific test types
run_tests(goal_id, agent_path, test_types='["constraint"]')
run_tests(goal_id, agent_path, test_types='["success"]')

# Iterative debugging with checkpoints (via CLI)
uv run hive run exports/{agent_name} --input '{"query": "test"}'

Phase 3: Analysis

# Debug a specific failed test
debug_test(goal_id, test_name, agent_path)

# Find failed sessions
list_agent_sessions(agent_work_dir, status="failed", limit=5)

# Inspect session state (excludes memory values)
get_agent_session_state(agent_work_dir, session_id)

# Inspect memory data
get_agent_session_memory(agent_work_dir, session_id, key="research_results")

# Runtime logs: L1 summaries
query_runtime_logs(agent_work_dir, status="needs_attention")

# Runtime logs: L2 per-node details
query_runtime_log_details(agent_work_dir, run_id, needs_attention_only=True)

# Runtime logs: L3 tool/LLM raw data
query_runtime_log_raw(agent_work_dir, run_id, node_id="research")

# Find clean checkpoints
list_agent_checkpoints(agent_work_dir, session_id, is_clean="true")

# Compare checkpoints (memory diff)
compare_agent_checkpoints(agent_work_dir, session_id, cp_before, cp_after)

Phase 5: Recovery

# Inspect checkpoint before resuming
get_agent_checkpoint(agent_work_dir, session_id, checkpoint_id)
# Empty checkpoint_id = latest checkpoint

# Resume from checkpoint via CLI (headless)
uv run hive run exports/{agent_name} \
  --resume-session {session_id} --checkpoint {checkpoint_id}

Anti-Patterns

| Don't | Do Instead | |-------|-----------| | Use default_agent.run() in tests | Use runner.run() with auto_responder fixtures (goes through AgentRuntime) | | Re-run entire agent when a late node fails | Resume from last clean checkpoint | | Treat result.success as goal achieved | Check result.output for actual criteria | | Access result.output["key"] directly | Use result.output.get("key") | | Fix random things hoping tests pass | Analyze L2/L3 logs to find root cause first | | Write 30+ tests | Write 8-15 focused tests | | Skip credential check | Use /hive-credentials before testing | | Confuse exports/ with ~/.hive/agents/ | Code in exports/, runtime data in ~/.hive/ | | Use run_tests for iterative debugging | Use headless CLI with checkpoints for iterative debugging | | Use headless CLI for final regression | Use run_tests for automated regression | | Use --tui from Claude Code | Use headless run command — TUI hangs in non-interactive shells | | Test client-facing nodes from Claude Code | Use mock mode, or have the user run the agent in their terminal | | Run tests without reading goal first | Always understand the goal before writing tests | | Skip Phase 3 analysis and guess | Use session + log tools to identify root cause |

Example Walkthrough: Deep Research Agent

A complete iteration showing the test loop for an agent with nodes: intake → research → review → report.

Phase 1: Generate tests

# Read the goal
Read(file_path="exports/deep_research_agent/agent.py")

# Get success criteria test guidelines
result = generate_success_tests(
    goal_id="rigorous-interactive-research",
    goal_json='{"id": "rigorous-interactive-research", "success_criteria": [{"id": "source-diversity", "target": ">=5"}, {"id": "citation-coverage", "target": "100%"}, {"id": "report-completeness", "target": "90%"}]}',
    node_names="intake,research,review,report",
    tool_names="web_search,web_scrape",
    agent_path="exports/deep_research_agent"
)

# Write tests
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + test_code
)

Phase 2: First execution

run_tests(
    goal_id="rigorous-interactive-research",
    agent_path="exports/deep_research_agent",
    fail_fast=True
)

Result: test_success_source_diversity fails — agent only found 2 sources instead of 5.

Phase 3: Analyze

# Debug the failing test
debug_test(
    goal_id="rigorous-interactive-research",
    test_name="test_success_source_diversity",
    agent_path="exports/deep_research_agent"
)
# → ASSERTION_FAILURE: Expected >= 5 sources, got 2

# Find the session
list_agent_sessions(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    status="completed",
    limit=1
)
# → session_20260209_150000_abc12345

# See what the research node produced
get_agent_session_memory(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    session_id="session_20260209_150000_abc12345",
    key="research_results"
)
# → Only 2 web_search calls made, each returned 1 source

# Check the LLM's behavior in the research node
query_runtime_log_raw(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    run_id="session_20260209_150000_abc12345",
    node_id="research"
)
# → LLM called web_search only twice, then called set_output

Root cause: The research node's prompt doesn't tell the LLM to search for at least 5 diverse sources. It stops after the first couple of searches.

Phase 4: Fix the prompt

Read(file_path="exports/deep_research_agent/nodes/__init__.py")

Edit(
    file_path="exports/deep_research_agent/nodes/__init__.py",
    old_string='system_prompt="Search for information on the user\'s topic."',
    new_string='system_prompt="Search for information on the user\'s topic. You MUST find at least 5 diverse, authoritative sources. Use multiple different search queries to ensure source diversity. Do not stop searching until you have at least 5 distinct sources."'
)

Phase 5: Resume from checkpoint

For this example, the fix is to the research node. If we had run via CLI with checkpointing, we could resume from the checkpoint after intake to skip re-running intake:

# Check if clean checkpoint exists after intake
list_agent_checkpoints(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    session_id="session_20260209_150000_abc12345",
    is_clean="true"
)
# → cp_node_complete_intake_150005

# Resume from after intake, re-run research with fixed prompt
uv run hive run exports/deep_research_agent \
  --resume-session session_20260209_150000_abc12345 \
  --checkpoint cp_node_complete_intake_150005

Or for this simple case (intake is fast), just re-run:

uv run hive run exports/deep_research_agent --input '{"topic": "test"}'

Phase 6: Final verification

run_tests(
    goal_id="rigorous-interactive-research",
    agent_path="exports/deep_research_agent"
)
# → All 12 tests pass

Test File Structure

exports/{agent_name}/
├── agent.py              ← Agent to test (goal, nodes, edges)
├── nodes/__init__.py     ← Node implementations (prompts, config)
├── config.py             ← Agent configuration
├── mcp_servers.json      ← Tool server config
└── tests/
    ├── conftest.py           ← Shared fixtures + safe access helpers
    ├── test_constraints.py   ← Constraint tests
    ├── test_success_criteria.py  ← Success criteria tests
    └── test_edge_cases.py    ← Edge case tests

Integration with Other Skills

| Scenario | From | To | Action | |----------|------|----|--------| | Agent built, ready to test | /hive-create | /hive-test | Generate tests, start loop | | Prompt fix needed | /hive-test Phase 4 | Direct edit | Edit nodes/__init__.py, resume | | Goal definition wrong | /hive-test Phase 4 | /hive-create | Update goal, may need rebuild | | Missing credentials | /hive-test Phase 3 | /hive-credentials | Set up credentials | | Complex runtime failure | /hive-test Phase 3 | /hive-debugger | Deep L1/L2/L3 analysis | | All tests pass | /hive-test Phase 6 | Done | Agent validated |

Evidence Capture Protocol

Adapted from Agent33's TDD evidence-capture discipline. Every test cycle must produce verifiable evidence that can be audited later.

Verification Log Format

After completing a test cycle, record evidence in a verification log entry:

| Date | Cycle ID | PR/Branch | Command | Result | Rationale Link | Artifact Link |
|------|----------|-----------|---------|--------|----------------|---------------|
| YYYY-MM-DD | [task-id] | [branch] | [exact test command] | [pass/fail + count] | [session-log-path] | [PR-link] |

TDD Stage Evidence

For each RED/GREEN/REFACTOR stage, capture structured evidence:

## RED Stage Evidence
- Task ID: [task-id]
- Test file(s): [paths]
- Command: [exact test command]
- Result: FAIL (expected)
- Failure reason: [why it fails - missing implementation, etc.]
- Timestamp: [ISO 8601]

## GREEN Stage Evidence
- Task ID: [task-id]
- Implementation file(s): [paths]
- Command: [exact test command]
- Result: PASS
- Tests passed: [count]
- Regressions: [none / list any]
- Timestamp: [ISO 8601]

## REFACTOR Stage Evidence
- Task ID: [task-id]
- Refactoring type: [extract method, rename, restructure, etc.]
- Files changed: [paths]
- Command: [exact test command]
- Result: PASS (maintained)
- Improvements: [brief description]
- Timestamp: [ISO 8601]

Coverage Requirements

Coverage requirements are linked to the type of change being tested:

| Change Type | Minimum Coverage | Evidence Required | |-------------|------------------|-------------------| | New feature | 80% line coverage for new code | Coverage report artifact | | Bug fix | Test that reproduces bug + fix | Before/after test output | | Refactor | Maintain existing coverage | Coverage diff (no decrease) | | Security fix | 100% coverage of security path | Security test evidence |

Evidence Capture Commands

Run these commands and include their output in the session log or PR description:

# Test output with verbose results
npm test -- --verbose 2>&1 | tee test-output.log
# or for Python projects:
pytest -v --tb=short 2>&1 | tee test-output.log

# Coverage report
npm test -- --coverage 2>&1 | tee coverage-report.log
# or for Python projects:
pytest --cov=src --cov-report=term-missing 2>&1 | tee coverage-report.log

# Diff summary for the PR
git diff --stat HEAD~1
git diff HEAD~1 -- '*.ts' '*.js' '*.py'

Acceptance Criteria Linkage

Map each acceptance criterion to its test(s) to ensure full traceability:

## Acceptance Criteria Coverage
| Criterion | Test(s) | Status |
|-----------|---------|--------|
| AC-1: Feature works as specified | `test_feature_happy_path` | Covered |
| AC-2: Edge case handled | `test_edge_case_boundary` | Covered |
| AC-3: Error handling | `test_error_graceful` | Covered |

Regression Prevention Checklist

Before marking a test cycle as complete:

[ ] All new tests pass
[ ] All existing tests still pass (no regressions introduced)
[ ] Coverage meets or exceeds the minimum for the change type
[ ] Test output captured and saved as evidence
[ ] Coverage report captured and saved as evidence
[ ] Diff summary documented with rationale for each changed file
[ ] Acceptance criteria mapped to specific test cases
[ ] Verification log entry created with exact commands and results

Agent Testing

Test agents iteratively: execute, analyze failures, fix, resume from checkpoint, repeat.

When to Use

Testing a newly built agent against its goal
Debugging a failing agent iteratively
Verifying fixes without re-running expensive early nodes
Running final regression tests before deployment

Prerequisites

Agent package at exports/{agent_name}/ (built with /hive-create)
Credentials configured (/hive-credentials)
ANTHROPIC_API_KEY set (or appropriate LLM provider key)

Path distinction (critical — don't confuse these):

exports/{agent_name}/ — agent source code (edit here)
~/.hive/agents/{agent_name}/ — runtime data: sessions, checkpoints, logs (read here)

The Iterative Test Loop

This is the core workflow. Don't re-run the entire agent when a late node fails — analyze, fix, and resume from the last clean checkpoint.

┌──────────────────────────────────────┐
│ PHASE 1: Generate Test Scenarios     │
│ Goal → synthetic test inputs + tests │
└──────────────┬───────────────────────┘
               ↓
┌──────────────────────────────────────┐
│ PHASE 2: Execute                     │◄────────────────┐
│ Run agent (CLI or pytest)            │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
          Pass? ──yes──► PHASE 6: Final Verification     │
               │                                         │
               no                                        │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 3: Analyze                     │                 │
│ Session + runtime logs + checkpoints │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 4: Fix                         │                 │
│ Prompt / code / graph / goal         │                 │
└──────────────┬───────────────────────┘                 │
               ↓                                         │
┌──────────────────────────────────────┐                 │
│ PHASE 5: Recover & Resume            │─────────────────┘
│ Checkpoint resume OR fresh re-run    │
└──────────────────────────────────────┘

Phase 1: Generate Test Scenarios

Create synthetic tests from the agent's goal, constraints, and success criteria.

Step 1a: Read the goal

# Read goal from agent.py
Read(file_path="exports/{agent_name}/agent.py")
# Extract the Goal definition and convert to JSON string

Step 1b: Get test guidelines

# Get constraint test guidelines
generate_constraint_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "constraints": [...]}',
    agent_path="exports/{agent_name}"
)

# Get success criteria test guidelines
generate_success_tests(
    goal_id="your-goal-id",
    goal_json='{"id": "...", "success_criteria": [...]}',
    node_names="intake,research,review,report",
    tool_names="web_search,web_scrape",
    agent_path="exports/{agent_name}"
)

These return file_header, test_template, constraints_formatted/success_criteria_formatted, and test_guidelines. They do NOT generate test code — you write the tests.

Step 1c: Write tests

Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + your_test_code
)

Test writing rules

Every test MUST be async with @pytest.mark.asyncio
Every test MUST accept runner, auto_responder, mock_mode fixtures
Use await auto_responder.start() before running, await auto_responder.stop() in finally
Use await runner.run(input_dict) — this goes through AgentRunner → AgentRuntime → ExecutionStream
Access output via result.output.get("key") — NEVER result.output["key"]
result.success=True means no exception, NOT goal achieved — always check output
Write 8-15 tests total, not 30+
Each real test costs ~3 seconds + LLM tokens
NEVER use default_agent.run() — it bypasses the runtime (no sessions, no logs, client-facing nodes hang)

Step 1d: Check existing tests

Before generating, check if tests already exist:

list_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)

Phase 2: Execute

Two execution paths, use the right one for your situation.

Iterative debugging (for complex agents)

Run the agent via CLI. This creates sessions with checkpoints at ~/.hive/agents/{agent_name}/sessions/:

uv run hive run exports/{agent_name} --input '{"query": "test topic"}'

Sessions and checkpoints are saved automatically.

Automated regression (for CI or final verification)

Use the run_tests MCP tool to run all pytest tests:

run_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)

Returns structured results:

{
  "overall_passed": false,
  "summary": {"total": 12, "passed": 10, "failed": 2, "pass_rate": "83.3%"},
  "test_results": [{"test_name": "test_success_source_diversity", "status": "failed"}],
  "failures": [{"test_name": "test_success_source_diversity", "details": "..."}]
}

Options:

# Run only constraint tests
run_tests(goal_id, agent_path, test_types='["constraint"]')

# Stop on first failure
run_tests(goal_id, agent_path, fail_fast=True)

# Parallel execution
run_tests(goal_id, agent_path, parallel=4)

Phase 3: Analyze Failures

When a test fails, drill down systematically. Don't guess — use the tools.

Step 3a: Get error category

debug_test(
    goal_id="your-goal-id",
    test_name="test_success_source_diversity",
    agent_path="exports/{agent_name}"
)

Returns error category (IMPLEMENTATION_ERROR, ASSERTION_FAILURE, TIMEOUT, IMPORT_ERROR, API_ERROR) plus full traceback and suggestions.

Step 3b: Find the failed session

list_agent_sessions(
    agent_work_dir="~/.hive/agents/{agent_name}",
    status="failed",
    limit=5
)

Returns session list with IDs, timestamps, current_node (where it failed), execution_quality.

Step 3c: Inspect session state

get_agent_session_state(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345"
)

Returns execution path, which node was current, step count, timestamps — but excludes memory values (to avoid context bloat). Shows memory_keys and memory_size instead.

Step 3d: Examine runtime logs (L2/L3)

# L2: Per-node success/failure, retry counts
query_runtime_log_details(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    needs_attention_only=True
)

# L3: Exact LLM responses, tool call inputs/outputs
query_runtime_log_raw(
    agent_work_dir="~/.hive/agents/{agent_name}",
    run_id="session_20260209_143022_abc12345",
    node_id="research"
)

Step 3e: Inspect memory data

# See what data a node actually produced
get_agent_session_memory(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    key="research_results"
)

Step 3f: Find recovery points

list_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    is_clean="true"
)

Returns checkpoint summaries with IDs, types (node_start, node_complete), which node, and is_clean flag. Clean checkpoints are safe resume points.

Step 3g: Compare checkpoints (optional)

To understand what changed between two points in execution:

compare_agent_checkpoints(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id_before="cp_node_complete_research_143030",
    checkpoint_id_after="cp_node_complete_review_143115"
)

Returns memory diff (added/removed/changed keys) and execution path diff.

Phase 4: Fix Based on Root Cause

Use the analysis from Phase 3 to determine what to fix and where.

Fix strategies by error category

IMPLEMENTATION_ERROR (TypeError, AttributeError, KeyError):

# Read the failing code
Read(file_path="exports/{agent_name}/nodes/__init__.py")

# Fix the bug
Edit(
    file_path="exports/{agent_name}/nodes/__init__.py",
    old_string="results.get('videos')",
    new_string="(results or {}).get('videos', [])"
)

ASSERTION_FAILURE (test assertions fail but agent ran successfully):

Check if the agent's output is actually wrong → fix the prompt
Check if the test's expectations are unrealistic → fix the test
Use get_agent_session_memory to see what the agent actually produced

TIMEOUT / STALL (agent runs too long):

Check node_visit_counts for feedback loops hitting max_node_visits
Check L3 logs for tool calls that hang
Reduce max_iterations in loop_config or fix the prompt to converge faster

API_ERROR (connection, rate limit, auth):

Verify credentials with /hive-credentials
Check MCP server configuration

Phase 5: Recover & Resume

After fixing the agent, decide whether to resume or re-run.

When to resume from checkpoint

Resume when ALL of these are true:

The fix is to a node that comes AFTER existing clean checkpoints
Clean checkpoints exist (from a CLI execution with checkpointing)
The early nodes are expensive (web scraping, API calls, long LLM chains)

# Resume from the last clean checkpoint before the failing node
uv run hive run exports/{agent_name} \
  --resume-session session_20260209_143022_abc12345 \
  --checkpoint cp_node_complete_research_143030

This skips all nodes before the checkpoint and only re-runs the fixed node onward.

When to re-run from scratch

Re-run when ANY of these are true:

The fix is to the entry node or an early node
No checkpoints exist (e.g., agent was run via run_tests)
The agent is fast (2-3 nodes, completes in seconds)
You changed the graph structure (added/removed nodes/edges)

uv run hive run exports/{agent_name} --input '{"query": "test topic"}'

Inspecting a checkpoint before resuming

get_agent_checkpoint(
    agent_work_dir="~/.hive/agents/{agent_name}",
    session_id="session_20260209_143022_abc12345",
    checkpoint_id="cp_node_complete_research_143030"
)

Returns the full checkpoint: shared_memory snapshot, execution_path, current_node, next_node, is_clean.

Loop back to Phase 2

After resuming or re-running, check if the fix worked. If not, go back to Phase 3.

Phase 6: Final Verification

Once the iterative fix loop converges (the agent produces correct output), run the full automated test suite:

run_tests(
    goal_id="your-goal-id",
    agent_path="exports/{agent_name}"
)

All tests should pass. If not, repeat the loop for remaining failures.

Credential Requirements

CRITICAL: Testing requires ALL credentials the agent depends on. This includes both the LLM API key AND any tool-specific credentials (HubSpot, Brave Search, etc.).

Prerequisites

Before running agent tests, you MUST collect ALL required credentials from the user.

Step 1: LLM API Key (always required)

export ANTHROPIC_API_KEY="your-key-here"

Step 2: Tool-specific credentials (depends on agent's tools)

Inspect the agent's mcp_servers.json and tool configuration to determine which tools the agent uses, then check for all required credentials:

from aden_tools.credentials import CredentialManager, CREDENTIAL_SPECS

creds = CredentialManager()

# Determine which tools the agent uses (from agent.json or mcp_servers.json)
agent_tools = [...]  # e.g., ["hubspot_search_contacts", "web_search", ...]

# Find all missing credentials for those tools
missing = creds.get_missing_for_tools(agent_tools)

Why ALL credentials are required:

Tests need to execute the agent's LLM nodes to validate behavior
Tools with missing credentials will return error dicts instead of real data
Mock mode bypasses everything, providing no confidence in real-world performance

Mock Mode Limitations

Mock mode (--mock flag or MOCK_MODE=1) is ONLY for structure validation:

Validates graph structure (nodes, edges, connections)
Validates that AgentRunner.load() succeeds and the agent is importable
Does NOT execute event_loop agents — MockLLMProvider never calls set_output, so event_loop nodes loop forever
Does NOT test LLM reasoning, content quality, or constraint validation
Does NOT test real API integrations or tool use

Bottom line: If you're testing whether an agent achieves its goal, you MUST use real credentials.

Enforcing Credentials in Tests

When writing tests, ALWAYS include credential checks:

import os
import pytest
from aden_tools.credentials import CredentialManager

pytestmark = pytest.mark.skipif(
    not CredentialManager().is_available("anthropic") and not os.environ.get("MOCK_MODE"),
    reason="API key required for real testing. Set ANTHROPIC_API_KEY or use MOCK_MODE=1."
)


@pytest.fixture(scope="session", autouse=True)
def check_credentials():
    """Ensure ALL required credentials are set for real testing."""
    creds = CredentialManager()
    mock_mode = os.environ.get("MOCK_MODE")

    if not creds.is_available("anthropic"):
        if mock_mode:
            print("\nRunning in MOCK MODE - structure validation only")
        else:
            pytest.fail(
                "\nANTHROPIC_API_KEY not set!\n"
                "Set API key: export ANTHROPIC_API_KEY='your-key-here'\n"
                "Or run structure validation: MOCK_MODE=1 pytest exports/{agent}/tests/"
            )

    if not mock_mode:
        agent_tools = []  # Update per agent
        missing = creds.get_missing_for_tools(agent_tools)
        if missing:
            lines = ["\nMissing tool credentials!"]
            for name in missing:
                spec = creds.specs.get(name)
                if spec:
                    lines.append(f"  {spec.env_var} - {spec.description}")
            pytest.fail("\n".join(lines))

User Communication

When the user asks to test an agent, ALWAYS check for ALL credentials first:

Identify the agent's tools from mcp_servers.json
Check ALL required credentials using CredentialManager
Ask the user to provide any missing credentials before proceeding
Collect ALL missing credentials in a single prompt — not one at a time

Safe Test Patterns

OutputCleaner

The framework automatically validates and cleans node outputs using a fast LLM at edge traversal time. Tests should still use safe patterns because OutputCleaner may not catch all issues.

Safe Access (REQUIRED)

# UNSAFE - will crash on missing keys
approval = result.output["approval_decision"]
category = result.output["analysis"]["category"]

# SAFE - use .get() with defaults
output = result.output or {}
approval = output.get("approval_decision", "UNKNOWN")

# SAFE - type check before operations
analysis = output.get("analysis", {})
if isinstance(analysis, dict):
    category = analysis.get("category", "unknown")

# SAFE - handle JSON parsing trap (LLM response as string)
import json
recommendation = output.get("recommendation", "{}")
if isinstance(recommendation, str):
    try:
        parsed = json.loads(recommendation)
        if isinstance(parsed, dict):
            approval = parsed.get("approval_decision", "UNKNOWN")
    except json.JSONDecodeError:
        approval = "UNKNOWN"
elif isinstance(recommendation, dict):
    approval = recommendation.get("approval_decision", "UNKNOWN")

# SAFE - type check before iteration
items = output.get("items", [])
if isinstance(items, list):
    for item in items:
        ...

Helper Functions for conftest.py

import json
import re

def _parse_json_from_output(result, key):
    """Parse JSON from agent output (framework may store full LLM response as string)."""
    response_text = result.output.get(key, "")
    json_text = re.sub(r'```json\s*|\s*```', '', response_text).strip()
    try:
        return json.loads(json_text)
    except (json.JSONDecodeError, AttributeError, TypeError):
        return result.output.get(key)

def safe_get_nested(result, key_path, default=None):
    """Safely get nested value from result.output."""
    output = result.output or {}
    current = output
    for key in key_path:
        if isinstance(current, dict):
            current = current.get(key)
        elif isinstance(current, str):
            try:
                json_text = re.sub(r'```json\s*|\s*```', '', current).strip()
                parsed = json.loads(json_text)
                if isinstance(parsed, dict):
                    current = parsed.get(key)
                else:
                    return default
            except json.JSONDecodeError:
                return default
        else:
            return default
    return current if current is not None else default

# Make available in tests
pytest.parse_json_from_output = _parse_json_from_output
pytest.safe_get_nested = safe_get_nested

ExecutionResult Fields

result.success=True means NO exception, NOT goal achieved

# WRONG
assert result.success

# RIGHT
assert result.success, f"Agent failed: {result.error}"
output = result.output or {}
approval = output.get("approval_decision")
assert approval == "APPROVED", f"Expected APPROVED, got {approval}"

All fields:

success: bool — Completed without exception (NOT goal achieved!)
output: dict — Complete memory snapshot (may contain raw strings)
error: str | None — Error message if failed
steps_executed: int — Number of nodes executed
total_tokens: int — Cumulative token usage
total_latency_ms: int — Total execution time
path: list[str] — Node IDs traversed (may repeat in feedback loops)
paused_at: str | None — Node ID if paused
session_state: dict — State for resuming
node_visit_counts: dict[str, int] — Visit counts per node (feedback loop testing)
execution_quality: str — "clean", "degraded", or "failed"

Test Count Guidance

Write 8-15 tests, not 30+

2-3 tests per success criterion
1 happy path test
1 boundary/edge case test
1 error handling test (optional)

Each real test costs ~3 seconds + LLM tokens. 12 tests = ~36 seconds, $0.12.

Test Patterns

Happy Path

@pytest.mark.asyncio
async def test_happy_path(runner, auto_responder, mock_mode):
    """Test normal successful execution."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "python tutorials"})
    finally:
        await auto_responder.stop()
    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    assert output.get("report"), "No report produced"

Boundary Condition

@pytest.mark.asyncio
async def test_minimum_sources(runner, auto_responder, mock_mode):
    """Test at minimum source threshold."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "niche topic"})
    finally:
        await auto_responder.stop()
    assert result.success, f"Agent failed: {result.error}"
    output = result.output or {}
    sources = output.get("sources", [])
    if isinstance(sources, list):
        assert len(sources) >= 3, f"Expected >= 3 sources, got {len(sources)}"

Error Handling

@pytest.mark.asyncio
async def test_empty_input(runner, auto_responder, mock_mode):
    """Test graceful handling of empty input."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": ""})
    finally:
        await auto_responder.stop()
    # Agent should either fail gracefully or produce an error message
    output = result.output or {}
    assert not result.success or output.get("error"), "Should handle empty input"

Feedback Loop

@pytest.mark.asyncio
async def test_feedback_loop_terminates(runner, auto_responder, mock_mode):
    """Test that feedback loops don't run forever."""
    await auto_responder.start()
    try:
        result = await runner.run({"query": "test"})
    finally:
        await auto_responder.stop()
    visits = result.node_visit_counts or {}
    for node_id, count in visits.items():
        assert count <= 5, f"Node {node_id} visited {count} times — possible infinite loop"

MCP Tool Reference

Phase 1: Test Generation

# Check existing tests
list_tests(goal_id, agent_path)

# Get constraint test guidelines (returns templates, NOT generated tests)
generate_constraint_tests(goal_id, goal_json, agent_path)
# Returns: output_file, file_header, test_template, constraints_formatted, test_guidelines

# Get success criteria test guidelines
generate_success_tests(goal_id, goal_json, node_names, tool_names, agent_path)
# Returns: output_file, file_header, test_template, success_criteria_formatted, test_guidelines

Phase 2: Execution

# Automated regression (no checkpoints, fresh runs)
run_tests(goal_id, agent_path, test_types='["all"]', parallel=-1, fail_fast=False)

# Run only specific test types
run_tests(goal_id, agent_path, test_types='["constraint"]')
run_tests(goal_id, agent_path, test_types='["success"]')

# Iterative debugging with checkpoints (via CLI)
uv run hive run exports/{agent_name} --input '{"query": "test"}'

Phase 3: Analysis

# Debug a specific failed test
debug_test(goal_id, test_name, agent_path)

# Find failed sessions
list_agent_sessions(agent_work_dir, status="failed", limit=5)

# Inspect session state (excludes memory values)
get_agent_session_state(agent_work_dir, session_id)

# Inspect memory data
get_agent_session_memory(agent_work_dir, session_id, key="research_results")

# Runtime logs: L1 summaries
query_runtime_logs(agent_work_dir, status="needs_attention")

# Runtime logs: L2 per-node details
query_runtime_log_details(agent_work_dir, run_id, needs_attention_only=True)

# Runtime logs: L3 tool/LLM raw data
query_runtime_log_raw(agent_work_dir, run_id, node_id="research")

# Find clean checkpoints
list_agent_checkpoints(agent_work_dir, session_id, is_clean="true")

# Compare checkpoints (memory diff)
compare_agent_checkpoints(agent_work_dir, session_id, cp_before, cp_after)

Phase 5: Recovery

# Inspect checkpoint before resuming
get_agent_checkpoint(agent_work_dir, session_id, checkpoint_id)
# Empty checkpoint_id = latest checkpoint

# Resume from checkpoint via CLI (headless)
uv run hive run exports/{agent_name} \
  --resume-session {session_id} --checkpoint {checkpoint_id}

Anti-Patterns

Example Walkthrough: Deep Research Agent

A complete iteration showing the test loop for an agent with nodes: intake → research → review → report.

Phase 1: Generate tests

# Read the goal
Read(file_path="exports/deep_research_agent/agent.py")

# Get success criteria test guidelines
result = generate_success_tests(
    goal_id="rigorous-interactive-research",
    goal_json='{"id": "rigorous-interactive-research", "success_criteria": [{"id": "source-diversity", "target": ">=5"}, {"id": "citation-coverage", "target": "100%"}, {"id": "report-completeness", "target": "90%"}]}',
    node_names="intake,research,review,report",
    tool_names="web_search,web_scrape",
    agent_path="exports/deep_research_agent"
)

# Write tests
Write(
    file_path=result["output_file"],
    content=result["file_header"] + "\n\n" + test_code
)

Phase 2: First execution

run_tests(
    goal_id="rigorous-interactive-research",
    agent_path="exports/deep_research_agent",
    fail_fast=True
)

Result: test_success_source_diversity fails — agent only found 2 sources instead of 5.

Phase 3: Analyze

# Debug the failing test
debug_test(
    goal_id="rigorous-interactive-research",
    test_name="test_success_source_diversity",
    agent_path="exports/deep_research_agent"
)
# → ASSERTION_FAILURE: Expected >= 5 sources, got 2

# Find the session
list_agent_sessions(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    status="completed",
    limit=1
)
# → session_20260209_150000_abc12345

# See what the research node produced
get_agent_session_memory(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    session_id="session_20260209_150000_abc12345",
    key="research_results"
)
# → Only 2 web_search calls made, each returned 1 source

# Check the LLM's behavior in the research node
query_runtime_log_raw(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    run_id="session_20260209_150000_abc12345",
    node_id="research"
)
# → LLM called web_search only twice, then called set_output

Root cause: The research node's prompt doesn't tell the LLM to search for at least 5 diverse sources. It stops after the first couple of searches.

Phase 4: Fix the prompt

Read(file_path="exports/deep_research_agent/nodes/__init__.py")

Edit(
    file_path="exports/deep_research_agent/nodes/__init__.py",
    old_string='system_prompt="Search for information on the user\'s topic."',
    new_string='system_prompt="Search for information on the user\'s topic. You MUST find at least 5 diverse, authoritative sources. Use multiple different search queries to ensure source diversity. Do not stop searching until you have at least 5 distinct sources."'
)

Phase 5: Resume from checkpoint

For this example, the fix is to the research node. If we had run via CLI with checkpointing, we could resume from the checkpoint after intake to skip re-running intake:

# Check if clean checkpoint exists after intake
list_agent_checkpoints(
    agent_work_dir="~/.hive/agents/deep_research_agent",
    session_id="session_20260209_150000_abc12345",
    is_clean="true"
)
# → cp_node_complete_intake_150005

# Resume from after intake, re-run research with fixed prompt
uv run hive run exports/deep_research_agent \
  --resume-session session_20260209_150000_abc12345 \
  --checkpoint cp_node_complete_intake_150005

Or for this simple case (intake is fast), just re-run:

uv run hive run exports/deep_research_agent --input '{"topic": "test"}'

Phase 6: Final verification

run_tests(
    goal_id="rigorous-interactive-research",
    agent_path="exports/deep_research_agent"
)
# → All 12 tests pass

Test File Structure

exports/{agent_name}/
├── agent.py              ← Agent to test (goal, nodes, edges)
├── nodes/__init__.py     ← Node implementations (prompts, config)
├── config.py             ← Agent configuration
├── mcp_servers.json      ← Tool server config
└── tests/
    ├── conftest.py           ← Shared fixtures + safe access helpers
    ├── test_constraints.py   ← Constraint tests
    ├── test_success_criteria.py  ← Success criteria tests
    └── test_edge_cases.py    ← Edge case tests

Integration with Other Skills

Evidence Capture Protocol

Adapted from Agent33's TDD evidence-capture discipline. Every test cycle must produce verifiable evidence that can be audited later.

Verification Log Format

After completing a test cycle, record evidence in a verification log entry:

| Date | Cycle ID | PR/Branch | Command | Result | Rationale Link | Artifact Link |
|------|----------|-----------|---------|--------|----------------|---------------|
| YYYY-MM-DD | [task-id] | [branch] | [exact test command] | [pass/fail + count] | [session-log-path] | [PR-link] |

TDD Stage Evidence

For each RED/GREEN/REFACTOR stage, capture structured evidence:

## RED Stage Evidence
- Task ID: [task-id]
- Test file(s): [paths]
- Command: [exact test command]
- Result: FAIL (expected)
- Failure reason: [why it fails - missing implementation, etc.]
- Timestamp: [ISO 8601]

## GREEN Stage Evidence
- Task ID: [task-id]
- Implementation file(s): [paths]
- Command: [exact test command]
- Result: PASS
- Tests passed: [count]
- Regressions: [none / list any]
- Timestamp: [ISO 8601]

## REFACTOR Stage Evidence
- Task ID: [task-id]
- Refactoring type: [extract method, rename, restructure, etc.]
- Files changed: [paths]
- Command: [exact test command]
- Result: PASS (maintained)
- Improvements: [brief description]
- Timestamp: [ISO 8601]

Coverage Requirements

Coverage requirements are linked to the type of change being tested:

Evidence Capture Commands

Run these commands and include their output in the session log or PR description:

# Test output with verbose results
npm test -- --verbose 2>&1 | tee test-output.log
# or for Python projects:
pytest -v --tb=short 2>&1 | tee test-output.log

# Coverage report
npm test -- --coverage 2>&1 | tee coverage-report.log
# or for Python projects:
pytest --cov=src --cov-report=term-missing 2>&1 | tee coverage-report.log

# Diff summary for the PR
git diff --stat HEAD~1
git diff HEAD~1 -- '*.ts' '*.js' '*.py'

Acceptance Criteria Linkage

Map each acceptance criterion to its test(s) to ensure full traceability:

## Acceptance Criteria Coverage
| Criterion | Test(s) | Status |
|-----------|---------|--------|
| AC-1: Feature works as specified | `test_feature_happy_path` | Covered |
| AC-2: Edge case handled | `test_edge_case_boundary` | Covered |
| AC-3: Error handling | `test_error_graceful` | Covered |

Regression Prevention Checklist

Before marking a test cycle as complete:

[ ] All new tests pass
[ ] All existing tests still pass (no regressions introduced)
[ ] Coverage meets or exceeds the minimum for the change type
[ ] Test output captured and saved as evidence
[ ] Coverage report captured and saved as evidence
[ ] Diff summary documented with rationale for each changed file
[ ] Acceptance criteria mapped to specific test cases
[ ] Verification log entry created with exact commands and results

Adoption

mattmre/hive-test

$ install --global

Security Scan Results

SKILL.md

Agent Testing

When to Use

Prerequisites

The Iterative Test Loop

Phase 1: Generate Test Scenarios

Step 1a: Read the goal

Step 1b: Get test guidelines

Step 1c: Write tests

Test writing rules

Step 1d: Check existing tests

Phase 2: Execute

Iterative debugging (for complex agents)

Automated regression (for CI or final verification)

Phase 3: Analyze Failures

Step 3a: Get error category

Step 3b: Find the failed session

Step 3c: Inspect session state

Step 3d: Examine runtime logs (L2/L3)

Step 3e: Inspect memory data

Step 3f: Find recovery points

Step 3g: Compare checkpoints (optional)

Phase 4: Fix Based on Root Cause

Fix strategies by error category

Phase 5: Recover & Resume

When to resume from checkpoint

When to re-run from scratch

Inspecting a checkpoint before resuming

Loop back to Phase 2

Phase 6: Final Verification

Credential Requirements

Prerequisites

Mock Mode Limitations

Enforcing Credentials in Tests

User Communication

Safe Test Patterns

OutputCleaner

Safe Access (REQUIRED)

Helper Functions for conftest.py

ExecutionResult Fields

Test Count Guidance

Test Patterns

Happy Path

Boundary Condition

Error Handling

Feedback Loop

MCP Tool Reference

Phase 1: Test Generation

Phase 2: Execution

Phase 3: Analysis

Phase 5: Recovery

Anti-Patterns

Example Walkthrough: Deep Research Agent

Phase 1: Generate tests

Phase 2: First execution

Phase 3: Analyze

Phase 4: Fix the prompt

Phase 5: Resume from checkpoint

Phase 6: Final verification

Test File Structure

Integration with Other Skills

Evidence Capture Protocol

Verification Log Format

TDD Stage Evidence

Coverage Requirements

Evidence Capture Commands

Acceptance Criteria Linkage

Regression Prevention Checklist

Related Skills

mattmre/orchestration-framework

mattmre/triage-issue-skill

mattmre/hive

mattmre/hive-patterns

mattmre/hive-test

$ install --global

Security Scan Results