Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

a-organvm/agent-testing-patterns

Name: agent-testing-patterns
Author: a-organvm

distributions/claude/skills/agent-testing-patterns/SKILL.md

npx skillsauth add a-organvm/a-i--skills agent-testing-patterns

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Agent Testing Patterns

Test AI agent systems that use tools, make decisions, and produce non-deterministic outputs.

Testing Challenges

| Challenge | Cause | Strategy | |-----------|-------|----------| | Non-deterministic output | LLM randomness | Assert on structure, not exact text | | Tool use sequences | Agent autonomy | Verify tool calls, not call order | | Multi-turn state | Conversation context | Snapshot-based assertions | | Cost | API calls | Mock LLM in unit tests | | Latency | API round-trips | Parallel test execution | | Flakiness | Model updates | Semantic assertions, not string matches |

Test Pyramid for Agents

        ╱╲
       ╱  ╲        E2E Agent Tests (few, expensive)
      ╱────╲       Full agent loop with real LLM
     ╱      ╲
    ╱────────╲     Integration Tests (moderate)
   ╱          ╲    Tool execution, state management
  ╱────────────╲
 ╱   Unit Tests ╲  Tool implementations, parsers, validators
╱────────────────╲

Unit Testing (No LLM)

Tool Implementation Tests

import pytest

def test_file_read_tool():
    tool = FileReadTool()
    result = tool.execute({"path": "test.txt"})
    assert result["content"] == "expected content"
    assert result["success"] is True

def test_file_read_tool_missing_file():
    tool = FileReadTool()
    result = tool.execute({"path": "nonexistent.txt"})
    assert result["success"] is False
    assert "not found" in result["error"].lower()

def test_tool_input_validation():
    tool = FileReadTool()
    with pytest.raises(ValueError, match="path is required"):
        tool.execute({})

Response Parser Tests

def test_parse_tool_call():
    raw = '{"tool": "search", "args": {"query": "python"}}'
    result = parse_tool_call(raw)
    assert result.tool == "search"
    assert result.args == {"query": "python"}

def test_parse_malformed_tool_call():
    raw = "not json at all"
    result = parse_tool_call(raw)
    assert result is None

Integration Testing (Mocked LLM)

Mock LLM Client

class MockLLMClient:
    def __init__(self, responses: list[dict]):
        self.responses = iter(responses)
        self.calls: list[dict] = []

    async def generate(self, messages: list[dict], tools: list[dict] = None) -> dict:
        self.calls.append({"messages": messages, "tools": tools})
        return next(self.responses)

@pytest.fixture
def mock_agent():
    client = MockLLMClient(responses=[
        {"content": None, "tool_calls": [{"name": "search", "args": {"query": "python packaging"}}]},
        {"content": "Based on the search results, here's how to package Python..."},
    ])
    return Agent(llm=client, tools=[SearchTool(), FileReadTool()])

Tool Execution Sequence Tests

@pytest.mark.asyncio
async def test_agent_uses_search_then_responds(mock_agent):
    result = await mock_agent.run("How do I package a Python project?")

    # Verify tool was called
    assert len(mock_agent.tool_history) == 1
    assert mock_agent.tool_history[0].tool_name == "search"
    assert "python" in mock_agent.tool_history[0].args["query"].lower()

    # Verify final response exists
    assert result.content is not None
    assert len(result.content) > 0

State Management Tests

@pytest.mark.asyncio
async def test_session_preserves_context(mock_agent):
    await mock_agent.run("My name is Alice")
    result = await mock_agent.run("What's my name?")

    # Verify conversation history maintained
    assert len(mock_agent.messages) == 4  # 2 user + 2 assistant

E2E Testing (Real LLM)

Structural Assertions

@pytest.mark.e2e
@pytest.mark.asyncio
async def test_agent_creates_file(real_agent, tmp_path):
    result = await real_agent.run(f"Create a Python hello world script at {tmp_path}/hello.py")

    # Assert on outcome, not exact content
    hello_file = tmp_path / "hello.py"
    assert hello_file.exists()
    content = hello_file.read_text()
    assert "print" in content  # Must use print
    assert content.strip()  # Non-empty

    # Verify it's valid Python
    compile(content, "hello.py", "exec")

Semantic Assertions

@pytest.mark.e2e
@pytest.mark.asyncio
async def test_agent_explains_concept(real_agent):
    result = await real_agent.run("Explain what a circuit breaker pattern is in 2-3 sentences")

    # Semantic checks (not exact string matching)
    assert len(result.content) > 50
    assert len(result.content) < 1000
    assert any(term in result.content.lower() for term in ["fault", "failure", "threshold", "open", "closed"])

Evaluation Metrics

@dataclass
class AgentEvalResult:
    task_completed: bool
    tool_calls_count: int
    tokens_used: int
    latency_ms: float
    error_recovery_count: int

async def evaluate_agent(agent, test_cases: list[dict]) -> list[AgentEvalResult]:
    results = []
    for case in test_cases:
        start = time.perf_counter()
        try:
            result = await agent.run(case["prompt"])
            completed = case["validator"](result)
        except Exception:
            completed = False
        latency = (time.perf_counter() - start) * 1000

        results.append(AgentEvalResult(
            task_completed=completed,
            tool_calls_count=len(agent.tool_history),
            tokens_used=agent.total_tokens,
            latency_ms=latency,
            error_recovery_count=agent.error_count,
        ))
    return results

Regression Testing

Golden File Tests

def test_tool_call_format_regression():
    """Ensure tool call format hasn't changed."""
    response = agent.format_tool_call("search", {"query": "test"})
    expected = load_golden("tool_call_format.json")
    assert response == expected

Benchmark Suite

BENCHMARK_CASES = [
    {"prompt": "List all Python files in the project", "expected_tools": ["glob"], "max_tokens": 500},
    {"prompt": "Fix the syntax error in app.py", "expected_tools": ["read", "edit"], "max_tokens": 2000},
]

async def run_benchmark(agent):
    for case in BENCHMARK_CASES:
        result = await agent.run(case["prompt"])
        tools_used = {t.tool_name for t in agent.tool_history}
        assert tools_used.issubset(set(case["expected_tools"] + ["think"]))
        assert agent.total_tokens <= case["max_tokens"]

Anti-Patterns

Asserting exact LLM output — Models change; assert structure and semantics
No mocking in unit tests — Real API calls make tests slow, expensive, and flaky
Testing only happy path — Test error recovery, malformed responses, tool failures
No cost tracking — Monitor token usage in E2E tests to catch regressions
Ignoring non-determinism — Run E2E tests multiple times; set pass threshold (e.g., 4/5)
Testing agent internals — Test outcomes and tool call patterns, not internal state

a-organvm/agent-testing-patterns

distributions/claude/skills/agent-testing-patterns/SKILL.md

Test AI agent systems including tool use, multi-turn conversations, error recovery, and non-deterministic outputs. Covers mock strategies, evaluation metrics, and regression testing for agent workflows. Triggers on AI agent testing, LLM evaluation, or agent quality assurance requests.

6 stars

tools

Updated Apr 23, 2026

$ install --global

skillsauth

npx skillsauth add a-organvm/a-i--skills agent-testing-patterns

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 17, 2026, 2:21 AM6.3s1 file scanned

SKILL.md

name:: agent-testing-patterns
description:: Test AI agent systems including tool use, multi-turn conversations, error recovery, and non-deterministic outputs. Covers mock strategies, evaluation metrics, and regression testing for agent workflows. Triggers on AI agent testing, LLM evaluation, or agent quality assurance requests.
license:: MIT
complexity:: advanced
time_to_learn:: 30min
governance_phases:: [prove]
governance_norm_group:: quality-gate
organ_affinity:: [all]
triggers:: [user-asks-about-agent-testing, context:llm-testing, context:agent-evaluation, context:tool-use-testing]
complements:: [testing-patterns, tdd-workflow, verification-loop, agent-swarm-orchestrator]

Agent Testing Patterns

Test AI agent systems that use tools, make decisions, and produce non-deterministic outputs.

Testing Challenges

Test Pyramid for Agents

        ╱╲
       ╱  ╲        E2E Agent Tests (few, expensive)
      ╱────╲       Full agent loop with real LLM
     ╱      ╲
    ╱────────╲     Integration Tests (moderate)
   ╱          ╲    Tool execution, state management
  ╱────────────╲
 ╱   Unit Tests ╲  Tool implementations, parsers, validators
╱────────────────╲

Unit Testing (No LLM)

Tool Implementation Tests

import pytest

def test_file_read_tool():
    tool = FileReadTool()
    result = tool.execute({"path": "test.txt"})
    assert result["content"] == "expected content"
    assert result["success"] is True

def test_file_read_tool_missing_file():
    tool = FileReadTool()
    result = tool.execute({"path": "nonexistent.txt"})
    assert result["success"] is False
    assert "not found" in result["error"].lower()

def test_tool_input_validation():
    tool = FileReadTool()
    with pytest.raises(ValueError, match="path is required"):
        tool.execute({})

Response Parser Tests

def test_parse_tool_call():
    raw = '{"tool": "search", "args": {"query": "python"}}'
    result = parse_tool_call(raw)
    assert result.tool == "search"
    assert result.args == {"query": "python"}

def test_parse_malformed_tool_call():
    raw = "not json at all"
    result = parse_tool_call(raw)
    assert result is None

Integration Testing (Mocked LLM)

Mock LLM Client

class MockLLMClient:
    def __init__(self, responses: list[dict]):
        self.responses = iter(responses)
        self.calls: list[dict] = []

    async def generate(self, messages: list[dict], tools: list[dict] = None) -> dict:
        self.calls.append({"messages": messages, "tools": tools})
        return next(self.responses)

@pytest.fixture
def mock_agent():
    client = MockLLMClient(responses=[
        {"content": None, "tool_calls": [{"name": "search", "args": {"query": "python packaging"}}]},
        {"content": "Based on the search results, here's how to package Python..."},
    ])
    return Agent(llm=client, tools=[SearchTool(), FileReadTool()])

Tool Execution Sequence Tests

@pytest.mark.asyncio
async def test_agent_uses_search_then_responds(mock_agent):
    result = await mock_agent.run("How do I package a Python project?")

    # Verify tool was called
    assert len(mock_agent.tool_history) == 1
    assert mock_agent.tool_history[0].tool_name == "search"
    assert "python" in mock_agent.tool_history[0].args["query"].lower()

    # Verify final response exists
    assert result.content is not None
    assert len(result.content) > 0

State Management Tests

@pytest.mark.asyncio
async def test_session_preserves_context(mock_agent):
    await mock_agent.run("My name is Alice")
    result = await mock_agent.run("What's my name?")

    # Verify conversation history maintained
    assert len(mock_agent.messages) == 4  # 2 user + 2 assistant

E2E Testing (Real LLM)

Structural Assertions

@pytest.mark.e2e
@pytest.mark.asyncio
async def test_agent_creates_file(real_agent, tmp_path):
    result = await real_agent.run(f"Create a Python hello world script at {tmp_path}/hello.py")

    # Assert on outcome, not exact content
    hello_file = tmp_path / "hello.py"
    assert hello_file.exists()
    content = hello_file.read_text()
    assert "print" in content  # Must use print
    assert content.strip()  # Non-empty

    # Verify it's valid Python
    compile(content, "hello.py", "exec")

Semantic Assertions

@pytest.mark.e2e
@pytest.mark.asyncio
async def test_agent_explains_concept(real_agent):
    result = await real_agent.run("Explain what a circuit breaker pattern is in 2-3 sentences")

    # Semantic checks (not exact string matching)
    assert len(result.content) > 50
    assert len(result.content) < 1000
    assert any(term in result.content.lower() for term in ["fault", "failure", "threshold", "open", "closed"])

Evaluation Metrics

@dataclass
class AgentEvalResult:
    task_completed: bool
    tool_calls_count: int
    tokens_used: int
    latency_ms: float
    error_recovery_count: int

async def evaluate_agent(agent, test_cases: list[dict]) -> list[AgentEvalResult]:
    results = []
    for case in test_cases:
        start = time.perf_counter()
        try:
            result = await agent.run(case["prompt"])
            completed = case["validator"](result)
        except Exception:
            completed = False
        latency = (time.perf_counter() - start) * 1000

        results.append(AgentEvalResult(
            task_completed=completed,
            tool_calls_count=len(agent.tool_history),
            tokens_used=agent.total_tokens,
            latency_ms=latency,
            error_recovery_count=agent.error_count,
        ))
    return results

Regression Testing

Golden File Tests

def test_tool_call_format_regression():
    """Ensure tool call format hasn't changed."""
    response = agent.format_tool_call("search", {"query": "test"})
    expected = load_golden("tool_call_format.json")
    assert response == expected

Benchmark Suite

BENCHMARK_CASES = [
    {"prompt": "List all Python files in the project", "expected_tools": ["glob"], "max_tokens": 500},
    {"prompt": "Fix the syntax error in app.py", "expected_tools": ["read", "edit"], "max_tokens": 2000},
]

async def run_benchmark(agent):
    for case in BENCHMARK_CASES:
        result = await agent.run(case["prompt"])
        tools_used = {t.tool_name for t in agent.tool_history}
        assert tools_used.issubset(set(case["expected_tools"] + ["think"]))
        assert agent.total_tokens <= case["max_tokens"]

Anti-Patterns

Asserting exact LLM output — Models change; assert structure and semantics
No mocking in unit tests — Real API calls make tests slow, expensive, and flaky
Testing only happy path — Test error recovery, malformed responses, tool failures
No cost tracking — Monitor token usage in E2E tests to catch regressions
Ignoring non-determinism — Run E2E tests multiple times; set pass threshold (e.g., 4/5)
Testing agent internals — Test outcomes and tool call patterns, not internal state

Related Skills

a-organvm/docx

development

VerifiedTrustedCommunity

Comprehensive document creation, editing, and analysis with support for tracked changes, comments, formatting preservation, and text extraction. When Claude needs to work with professional documents (.docx files) for: (1) Creating new documents, (2) Modifying or editing content, (3) Working with tracked changes, (4) Adding comments, or any other document tasks

12SKILL.mdUpdated Jul 20, 2026

a-organvm/workspace-autopsy-governance

development

VerifiedTrustedCommunity

Conducts a full automated autopsy of the current workspace directory to map files, identifies structural issues, proposes a restructuring plan (the signal), and establishes unified governance using templates. Use this skill when a user asks to map, restructure, reorganize, or apply new governance to an existing messy repository.

12SKILL.mdUpdated Jul 20, 2026

a-organvm/workspace-autopsy-governance

a-organvm/workshop-presentation-design

testing

VerifiedTrustedCommunity

Design engaging workshops, conference talks, and educational presentations. Covers learning objectives, activity design, slide craft, and facilitation techniques. Triggers on workshop design, presentation prep, talk structure, or training session requests.

12SKILL.mdUpdated Jul 20, 2026

a-organvm/workshop-presentation-design

a-organvm/webhook-integration-patterns

development

VerifiedTrustedCommunity

Designs reliable webhook systems with proper delivery guarantees, retry logic, signature verification, and idempotent processing for event-driven integrations.

12SKILL.mdUpdated Jul 20, 2026

a-organvm/webhook-integration-patterns

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/a-organvm/a-i--skills.git

# Copy into Claude Code skills folder (global)
cp -r a-i--skills/distributions/claude/skills/agent-testing-patterns ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

a-organvm/a-i--skills

6 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT