skills/automated-structural-testing-llm-based/SKILL.md
Write structural tests for LLM-based agents using trace-based assertions, mocked LLM responses, and the test automation pyramid. Use when the user says 'test my agent', 'write agent tests', 'mock LLM responses', 'add regression tests for my agent', 'structural testing for agents', or 'trace-based assertions'.
npx skillsauth add ndpvt-web/arxiv-claude-skills automated-structural-testing-llm-based
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to write deep structural tests for LLM-based agents — not just end-to-end acceptance tests, but layered tests that capture agent trajectories as traces, mock LLM behavior for deterministic replay, and assert on internal tool invocations and intermediate outputs. The approach adapts the classical test automation pyramid (unit / integration / E2E) to agentic systems, enabling fast regression detection and root-cause analysis without expensive live LLM calls.
Traditional agent testing evaluates from the user's perspective: give it an input, check the final output. This is the top of the test pyramid — slow, expensive, and brittle. Structural testing operates at the lower layers. It captures traces — structured records of every internal operation (LLM invocations, tool calls, intermediate reasoning) — and makes them the foundation for assertions. Instead of asking "did the agent give a good answer?", you ask "did the agent call get_weather with city='Amsterdam' before composing its response?"
The method has three pillars. Traces (OpenTelemetry-compatible) record agent trajectories: which tools were invoked, with what inputs, what they returned, how many LLM turns occurred, token counts, and latency. Mocking replaces the live LLM with deterministic response sequences so that a test always follows the same execution path — no flakiness from model non-determinism. Assertions operate on the collected traces to verify structural properties: tool inclusion/exclusion, output content, invocation order, and custom validation functions.
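To make the three pillars concrete, here is a minimal, library-free sketch. It is not the toolkit's API: `Span`, `ScriptedLLM`, and `run_agent` are hypothetical stand-ins, and the trace shape is an assumption chosen for illustration. It shows a scripted "LLM" driving a tiny agent loop, every step being recorded as a trace, and assertions reading the trace rather than the final answer.

```python
from dataclasses import dataclass


@dataclass
class Span:
    """One recorded operation in the agent's trajectory (hypothetical shape)."""
    kind: str      # "llm" or "tool"
    name: str
    inputs: dict
    output: str


class ScriptedLLM:
    """Stand-in for the model: replays a fixed list of decisions in order."""

    def __init__(self, script):
        self._script = list(script)

    def next_step(self):
        return self._script.pop(0)


def run_agent(llm, tools, user_input):
    """Tiny agent loop that records every LLM turn and tool call as a Span."""
    traces = []
    while True:
        step = llm.next_step()
        traces.append(Span("llm", "decide", {"user_input": user_input}, step["type"]))
        if step["type"] == "text":
            return step["text"], traces
        result = tools[step["tool"]](**step["args"])
        traces.append(Span("tool", step["tool"], step["args"], result))


def get_weather(city):
    return f"Sunny in {city}"


llm = ScriptedLLM([
    {"type": "tool", "tool": "get_weather", "args": {"city": "Amsterdam"}},
    {"type": "text", "text": "It is sunny in Amsterdam."},
])
answer, traces = run_agent(llm, {"get_weather": get_weather}, "Weather in Amsterdam?")

# Structural assertions: inspect the trajectory, not just the final answer.
tool_calls = [s for s in traces if s.kind == "tool"]
assert [s.name for s in tool_calls] == ["get_weather"]
assert tool_calls[0].inputs == {"city": "Amsterdam"}
```

Because the script is fixed, the run follows the same path every time, which is what makes the assertions on `tool_calls` stable.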
This layered approach mirrors the software engineering test pyramid. At the base, unit tests mock the LLM and assert on individual tool invocations. In the middle, integration tests verify multi-step tool chains and conversation flows with mocked responses. At the top, acceptance tests use live LLMs with semantic similarity metrics. Most tests should be at the base — fast, cheap, deterministic — with fewer tests at each higher level.
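One way to keep the layers separable at run time is to tag tests by directory. The sketch below is a possible `conftest.py`, assuming a layout with `unit`, `integration`, and `acceptance` directories under `tests/`; the helper `layer_for` is a hypothetical name, while `pytest_collection_modifyitems` and `item.add_marker` are standard pytest hooks and methods.

```python
# conftest.py (sketch; directory names are an assumed layout)
from pathlib import PurePath

LAYERS = ("unit", "integration", "acceptance")


def layer_for(path):
    """Return the pyramid layer named by a test file's directory, or None."""
    parts = PurePath(str(path)).parts
    for layer in LAYERS:
        if layer in parts:
            return layer
    return None


def pytest_collection_modifyitems(config, items):
    """pytest hook: tag each collected test with its layer as a marker."""
    for item in items:
        layer = layer_for(item.fspath)
        if layer:
            item.add_marker(layer)
```

With this in place, `pytest -m "unit or integration"` would select only the mocked layers and `pytest -m acceptance` the live-LLM layer. The three marker names should also be registered under `markers` in the pytest configuration to avoid unknown-marker warnings.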
Identify the agent's tool surface. List every tool/function the agent can invoke, its parameters, and expected return types. This defines what structural assertions are possible.
Define test cases as conversation turns. For each scenario, write the sequence of user inputs that exercises a specific code path. Use Case(user_inputs=["turn1", "turn2", ...]) to model multi-turn interactions.
Create mock LLM responses. For unit and integration tests, build a mock client that returns predetermined responses. Map each user input to a specific tool-use response or text response the LLM would produce, ensuring deterministic execution:
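from generative_ai_toolkit.agent import BedrockConverseAgent
from generative_ai_toolkit.mock import MockBedrockConverse

# The mock stands in for the Bedrock client, so no live model is called
mock_client = MockBedrockConverse()
agent = BedrockConverseAgent(
    model_id="test-model",
    bedrock_client=mock_client,
    system_prompt="You are a travel assistant.",
)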
Register tools on the agent. Attach the same tool functions (or lightweight stubs) that the production agent uses, so the mock LLM's tool-use responses can be executed:
agent.register_tool(get_current_location)
agent.register_tool(get_interesting_things_to_do)
Run the test case and collect traces. Execute the conversation and capture the trace log:
from generative_ai_toolkit.test import Case
test_case = Case(user_inputs=["Find things to do near me within 30 minutes"])
traces = test_case.run(agent)
Write trace-based assertions. Use the Expect API to assert on tool invocations, outputs, and text responses:
from generative_ai_toolkit.test import Expect
Expect(traces).tool_invocations.to_include("get_current_location")
Expect(traces).tool_invocations.to_include("get_interesting_things_to_do")
Expect(traces).tool_invocations.to_not_include("start_navigation")
Expect(traces).agent_text_response.to_include("suggest")
Add negative and boundary assertions. Verify the agent does NOT call tools it shouldn't, handles missing parameters gracefully, and respects tool ordering constraints.
Organize tests into pyramid layers. Place fast mocked tests in a tests/unit/ directory, integration tests with multi-step chains in tests/integration/, and any live-LLM semantic tests in tests/acceptance/. Weight the count heavily toward unit tests.
Integrate into CI. Unit and integration tests (mocked) run on every commit with zero LLM cost. Acceptance tests run on a schedule or before releases.
Capture traces as regression baselines. When the agent behaves correctly, save the trace as a snapshot. Future test runs compare against this baseline to catch regressions from prompt changes, model upgrades, or tool modifications.
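The baseline idea in the last step can be sketched without any toolkit-specific calls. The example below assumes the tool calls have already been extracted from a trace as `(name, input)` pairs; `summarize` and `check_against_baseline` are hypothetical helpers, not part of generative-ai-toolkit. Only structural facts are stored, since volatile fields such as latency and token counts would make every run differ.

```python
import json
from pathlib import Path


def summarize(tool_calls):
    """Reduce a run to the stable structural facts worth pinning down."""
    return [{"tool": name, "input": args} for name, args in tool_calls]


def check_against_baseline(tool_calls, snapshot_path):
    """Create the baseline on first run; afterwards return a list of differences."""
    current = summarize(tool_calls)
    path = Path(snapshot_path)
    if not path.exists():
        # First run: record the current behavior as the accepted baseline.
        path.write_text(json.dumps(current, indent=2))
        return []
    baseline = json.loads(path.read_text())
    diffs = []
    for i in range(max(len(baseline), len(current))):
        old = baseline[i] if i < len(baseline) else None
        new = current[i] if i < len(current) else None
        if old != new:
            diffs.append({"step": i, "expected": old, "actual": new})
    return diffs
```

A test would then assert that the returned list is empty. A non-empty list names the first step where the trajectory diverged, which is usually the quickest route to the root cause. Baselines should only be written from runs that have been reviewed as correct.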
Example 1: Unit-testing tool selection for a travel agent
User: "Write tests for my travel agent that verify it calls get_location before get_attractions."
Approach:
import pytest

from generative_ai_toolkit.agent import BedrockConverseAgent
from generative_ai_toolkit.mock import MockBedrockConverse
from generative_ai_toolkit.test import Case, Expect


def get_current_location() -> str:
    """Gets the user's current GPS coordinates."""
    return "52.3676, 4.9041"


def get_attractions(location: str, max_drive_minutes: int) -> str:
    """Gets nearby attractions within drive time."""
    return "Rijksmuseum, Vondelpark, Anne Frank House"


@pytest.fixture
def travel_agent():
    agent = BedrockConverseAgent(
        model_id="test-model",
        bedrock_client=MockBedrockConverse(),
        system_prompt="You are a travel assistant. Always get location first.",
    )
    agent.register_tool(get_current_location)
    agent.register_tool(get_attractions)
    return agent


def test_tool_ordering(travel_agent):
    case = Case(user_inputs=["What can I do nearby within 20 minutes?"])
    traces = case.run(travel_agent)
    # Two separate to_include calls confirm both tools ran;
    # they do not compare the order in which they ran
    Expect(traces).tool_invocations.to_include("get_current_location")
    Expect(traces).tool_invocations.to_include("get_attractions")
    # Agent should NOT start navigation without user confirmation
    Expect(traces).tool_invocations.to_not_include("start_navigation")


def test_agent_asks_for_preferences(travel_agent):
    case = Case(user_inputs=["I want to do something fun"])
    traces = case.run(travel_agent)
    # Agent should ask clarifying questions, not jump to tool use
    Expect(traces).agent_text_response.to_include("prefer")
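The request in this example is about call order, which presence checks alone do not cover. A minimal sketch of an ordering check follows. It assumes the invoked tool names have already been pulled out of the traces into a plain list (how to do that depends on the toolkit's trace objects and is not shown), and `assert_called_before` is a hypothetical helper, not a toolkit function.

```python
def assert_called_before(tool_names, first, second):
    """Fail unless `first` is called before the first call to `second`."""
    assert first in tool_names, f"{first!r} was never called: {tool_names}"
    assert second in tool_names, f"{second!r} was never called: {tool_names}"
    assert tool_names.index(first) < tool_names.index(second), (
        f"{first!r} must run before {second!r}, got: {tool_names}"
    )


# In a real test these names would come from the collected traces;
# they are hard-coded here to keep the sketch self-contained.
invoked = ["get_current_location", "get_attractions"]
assert_called_before(invoked, "get_current_location", "get_attractions")
```

The check compares first occurrences only, which is enough for "location before attractions" but would need extending if a tool may legitimately be called several times.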
Example 2: Multi-turn regression test with mocked responses
User: "Add regression tests for my customer support agent's refund flow."
Approach:
# support_agent is assumed to be a pytest fixture built like travel_agent
# in Example 1, with the refund tools registered


def test_refund_flow_happy_path(support_agent):
    case = Case(user_inputs=[
        "I want a refund for order #12345",
        "Yes, the item was damaged",
        "Yes, please process the refund",
    ])
    traces = case.run(support_agent)
    # Verify correct tool chain
    Expect(traces).tool_invocations.to_include("lookup_order")
    Expect(traces).tool_invocations.to_include("check_refund_eligibility")
    Expect(traces).tool_invocations.to_include("process_refund")
    # Verify the refund tool reported approval
    Expect(traces).tool_invocations.to_include("process_refund").with_output("approved")
    # Verify agent confirms with user
    Expect(traces).agent_text_response.to_include("refund has been processed")


def test_refund_flow_ineligible(support_agent):
    case = Case(user_inputs=[
        "Refund order #99999",
        "I just changed my mind",
    ])
    traces = case.run(support_agent)
    Expect(traces).tool_invocations.to_include("lookup_order")
    Expect(traces).tool_invocations.to_include("check_refund_eligibility")
    # Should NOT process refund for ineligible case
    Expect(traces).tool_invocations.to_not_include("process_refund")
Example 3: Test-driven development for a new agent tool
User: "I'm adding a schedule_meeting tool to my assistant. Help me write the tests first."
Approach:
# assistant_agent is assumed to be a pytest fixture that registers
# schedule_meeting and check_availability on a mocked agent


def test_schedule_meeting_requires_all_params(assistant_agent):
    """Agent should ask for missing info before calling schedule_meeting."""
    case = Case(user_inputs=["Schedule a meeting with Alice"])
    traces = case.run(assistant_agent)
    # Agent should NOT schedule without a time; it should ask for one
    Expect(traces).tool_invocations.to_not_include("schedule_meeting")
    Expect(traces).agent_text_response.to_include("time")


def test_schedule_meeting_with_complete_info(assistant_agent):
    """Agent should call schedule_meeting when all info is provided."""
    case = Case(user_inputs=[
        "Schedule a meeting with Alice tomorrow at 2pm in Room B"
    ])
    traces = case.run(assistant_agent)
    Expect(traces).tool_invocations.to_include("schedule_meeting")


def test_schedule_meeting_conflict_handling(assistant_agent):
    """Agent should relay conflict info and suggest alternatives."""
    case = Case(user_inputs=[
        "Schedule a meeting with Alice tomorrow at 2pm",
        "How about 3pm instead?",
    ])
    traces = case.run(assistant_agent)
    Expect(traces).tool_invocations.to_include("schedule_meeting")
    Expect(traces).tool_invocations.to_include("check_availability")
Use Permute to test across multiple system prompts and model variants in a single run, catching prompt fragility early:
agent_parameters={
    "system_prompt": Permute([prompt_v1, prompt_v2]),
    "model_id": Permute(["claude-sonnet", "claude-haiku"]),
}
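Conceptually, permuting parameters amounts to a Cartesian product over the listed values. The sketch below illustrates that idea with the standard library only. It is not the toolkit's implementation, and `expand` is a hypothetical helper. The prompt and model names are placeholders taken from the fragment above.

```python
from itertools import product


def expand(parameters):
    """Yield one kwargs dict per combination of the listed parameter values."""
    names = list(parameters)
    for values in product(*(parameters[n] for n in names)):
        yield dict(zip(names, values))


grid = {
    "system_prompt": ["prompt_v1", "prompt_v2"],
    "model_id": ["claude-sonnet", "claude-haiku"],
}
combos = list(expand(grid))
print(len(combos))  # 4: every prompt paired with every model
```

The count is the product of the list lengths, so each extra dimension multiplies the number of runs. Adding a third parameter with three values would take this grid from 4 to 12 combinations.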
[...] delete_account) without explicit user confirmation.
Use to_include for key phrases rather than to_equal for full responses.
If Expect(traces).tool_invocations.to_include("X") fails, inspect the full trace to see which tools were actually called. Print [t.span_name for t in traces] to debug.
Call agent.reset() between test cases to clear conversation history, preventing state leakage between tests.
The framework (generative-ai-toolkit) targets Amazon Bedrock's Converse API. Adapting to other providers (OpenAI, Anthropic direct, local models) requires writing custom mock classes and trace collectors.
The Permute feature for cross-testing prompts and models produces a combinatorial explosion. Limit permutations to 2-3 dimensions to keep test suites manageable.
[...] AgentContext.current().tracer: this is a code change in production tools, not just test code.
Paper: Automated structural testing of LLM-based agents: methods, framework, and case studies (Kohl et al., IEEE BigData 2025). Focus on Section III (methods: traces, mocking, assertions), Section IV (test automation pyramid adaptation), and Section V (case studies demonstrating faster root-cause analysis).
Code: github.com/awslabs/generative-ai-toolkit — Install with pip install "generative-ai-toolkit[all]". See examples/genai_toolkit_getting_started.ipynb for a complete walkthrough.