skills/langgraph-testing-evaluation/SKILL.md
Use this skill when you need to test or evaluate LangGraph/LangChain agents: writing unit or integration tests, generating test scaffolds, mocking LLM/tool behavior, running trajectory evaluation (match or LLM-as-judge), running LangSmith dataset evaluations, and comparing two agent versions with A/B-style offline analysis. Use it for Python and JavaScript/TypeScript workflows, evaluator design, experiment setup, regression gates, and debugging flaky/incorrect evaluation results.
npx skillsauth add lubu-labs/langchain-agent-skills langgraph-testing-evaluation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Practical workflows for validating agent quality with unit tests and mocks, trajectory evaluation, LangSmith dataset evaluation, and A/B comparison.
Use this file for high-level flow. Load references/* for detailed implementation.
Choose the smallest approach that answers your question:
| Goal | Primary method | Load first |
| --- | --- | --- |
| Validate node logic quickly | Unit tests with mocks | references/unit-testing-patterns.md |
| Validate multi-step agent behavior | Trajectory evaluation | references/trajectory-evaluation.md |
| Track quality over datasets over time | LangSmith evaluation | references/langsmith-evaluation.md |
| Compare old vs new agent versions | A/B comparison | references/ab-testing.md |
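As an illustration of the first row (unit tests with mocks), here is a minimal sketch in plain Python. It assumes a node is a function that takes a state dict and returns a partial state update, and that the model is injected as a callable. `SequenceFakeLLM` and `classify_node` are hypothetical names invented for this example; they are not part of this skill's scripts or of LangGraph.

```python
from typing import Callable, Dict, List


class SequenceFakeLLM:
    """Hypothetical test double: returns canned replies in order."""

    def __init__(self, replies: List[str]) -> None:
        self._replies = list(replies)
        self.prompts: List[str] = []  # records every prompt for later assertions

    def __call__(self, prompt: str) -> str:
        self.prompts.append(prompt)
        if not self._replies:
            raise AssertionError("fake LLM ran out of canned replies")
        return self._replies.pop(0)


def classify_node(state: Dict, llm: Callable[[str], str]) -> Dict:
    """Hypothetical node: asks the model for a label and returns a state update."""
    label = llm(f"Classify this request: {state['question']}").strip().lower()
    return {"route": label if label in {"billing", "support"} else "fallback"}


def test_classify_node_routes_known_label() -> None:
    llm = SequenceFakeLLM(["Billing\n"])
    update = classify_node({"question": "Why was I charged twice?"}, llm)
    assert update == {"route": "billing"}
    assert len(llm.prompts) == 1  # the node called the model exactly once


def test_classify_node_falls_back_on_unknown_label() -> None:
    llm = SequenceFakeLLM(["something else"])
    assert classify_node({"question": "hi"}, llm) == {"route": "fallback"}
```

Because the model is passed in rather than constructed inside the node, the test needs no network access and runs deterministically under pytest.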
Recommended order:
Run from repo root.
# Python (preferred)
uv run skills/langgraph-testing-evaluation/scripts/generate_test_cases.py my_agent:graph --output tests/ --framework pytest
# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/generate_test_cases.js ./my-agent.ts:graph --output tests/ --framework vitest
# Python: LLM-as-judge
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent my_dataset --method llm-judge --model openai:o3-mini
# Python: trajectory match
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent dataset.json --method match --trajectory-match-mode strict --reference-trajectory reference.json
# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.js ./agent.ts:runAgent my_dataset --method llm-judge --model openai:o3-mini --max-concurrency 4
# Python
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy,latency --max-concurrency 4
# Python (do not upload experiment results)
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy --no-upload
# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.js ./agent.ts:runAgent my_dataset --evaluators accuracy,latency --max-concurrency 4
# Python
uv run skills/langgraph-testing-evaluation/scripts/compare_agents.py my_agent:v1 my_agent:v2 dataset.json --output comparison_report.json
# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --output comparison_report.json
# JavaScript/TypeScript (force local dataset file only)
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --no-langsmith
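The comparison scripts' internals and report schema are not shown here, so the following is only a sketch of what an offline A/B comparison can look like. It assumes each agent is a callable from inputs to outputs and that quality is scored per example. The `compare_agents` function, the `exact_match` scorer, and the report keys are hypothetical.

```python
from statistics import mean
from typing import Any, Callable, Dict, List

Agent = Callable[[Any], Any]
Scorer = Callable[[Any, Any], float]


def exact_match(output: Any, reference: Any) -> float:
    """Hypothetical scorer: 1.0 when the output equals the reference, else 0.0."""
    return 1.0 if output == reference else 0.0


def compare_agents(
    baseline: Agent,
    candidate: Agent,
    examples: List[Dict[str, Any]],
    scorer: Scorer = exact_match,
) -> Dict[str, Any]:
    """Run both agents on the same examples and summarize paired scores."""
    if not examples:
        raise ValueError("need at least one example to compare agents")
    base = [scorer(baseline(ex["inputs"]), ex["outputs"]) for ex in examples]
    cand = [scorer(candidate(ex["inputs"]), ex["outputs"]) for ex in examples]
    wins = sum(c > b for b, c in zip(base, cand))
    losses = sum(c < b for b, c in zip(base, cand))
    return {
        "baseline_mean": mean(base),
        "candidate_mean": mean(cand),
        "delta": mean(cand) - mean(base),
        "wins": wins,
        "losses": losses,
        "ties": len(examples) - wins - losses,
    }


# Toy agents and data, purely for illustration.
examples = [
    {"inputs": 1, "outputs": 2},
    {"inputs": 2, "outputs": 4},
    {"inputs": 3, "outputs": 6},
]
report = compare_agents(lambda x: x + 1, lambda x: x * 2, examples)
```

Pairing scores per example (wins, losses, ties) rather than comparing only the two means makes it easier to spot a candidate that improves the average while regressing on specific cases.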
# Python
uv run skills/langgraph-testing-evaluation/scripts/mock_llm_responses.py create --type sequence --output mock_config.json
# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/mock_llm_responses.js create --type sequence --output mock_config.json
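Several of the commands above take a local dataset file. This skill describes the sample dataset as `examples: [{ inputs, outputs, metadata }]` and mentions `input`/`output` keys for legacy datasets. A minimal sketch of normalizing both shapes before evaluation might look like the following. The helper names are hypothetical, and accepting a bare top-level list is an assumption rather than documented behavior of the scripts.

```python
import json
from typing import Any, Dict, List


def normalize_example(raw: Dict[str, Any]) -> Dict[str, Any]:
    """Map one example onto inputs/outputs/metadata, accepting legacy keys."""
    if "inputs" in raw:
        inputs = raw["inputs"]
    elif "input" in raw:  # legacy key
        inputs = raw["input"]
    else:
        raise ValueError("example has neither 'inputs' nor 'input'")
    outputs = raw.get("outputs", raw.get("output"))  # reference may be absent
    return {"inputs": inputs, "outputs": outputs, "metadata": raw.get("metadata", {})}


def load_examples(text: str) -> List[Dict[str, Any]]:
    """Parse dataset JSON and normalize every example."""
    data = json.loads(text)
    # Assumption: either {"examples": [...]} or a bare list of examples.
    rows = data["examples"] if isinstance(data, dict) else data
    return [normalize_example(row) for row in rows]


sample = json.dumps(
    {
        "examples": [
            {"inputs": {"question": "2+2?"}, "outputs": {"answer": "4"}},
            {"input": "hi", "output": "hello", "metadata": {"source": "legacy"}},
        ]
    }
)
examples = load_examples(sample)
```

Values are passed through unchanged, so a legacy example whose `input` is a plain string stays a string. Whether to wrap it in a dict depends on what the agent under test expects.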
inputs and outputs objects (optional metadata).
input, output) for legacy datasets.
references/unit-testing-patterns.md
Load when:
references/trajectory-evaluation.md
Load when:
strict, unordered, subset, superset).
references/langsmith-evaluation.md
Load when:
references/ab-testing.md
Load when:
assets/templates/test_template.py
thread_id
compiled_graph.nodes[...]
assets/datasets/sample_dataset.json
examples: [{ inputs, outputs, metadata }] format.
assets/examples/README.md
scripts/generate_test_cases.py / .js
Use for fast test scaffolding.
Inputs:
my_module:graph or my_module.graph
./file.ts:graph
Outputs:
scripts/run_trajectory_eval.py / .js
Use for trajectory scoring with either:
--method match
--method llm-judge
Supports:
.json)
--reference-trajectory
strict, unordered, subset, superset
Local-only mode:
--no-langsmith in both Python and JavaScript scripts (requires local JSON dataset file)
scripts/evaluate_with_langsmith.py / .js
Use for dataset-based evaluation runs and experiment tracking.
Supports:
--evaluators accuracy,latency,...)
--max-concurrency)
Python-only:
--no-upload to run without uploading experiment results
scripts/compare_agents.py / .js
Use for offline version comparisons:
--no-langsmith to disable remote loading)
scripts/mock_llm_responses.py / .js
Use for deterministic test doubles:
If behavior is deterministic and local:
If behavior depends on tool sequence/routing:
If behavior depends on realistic distribution quality:
If approving a replacement model/prompt/graph:
unordered, subset, or superset) where appropriate.
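To make the difference between the match modes concrete, here is a simplified sketch that compares trajectories reduced to lists of tool names. It reflects a common reading of the mode names and is an assumption, not the implementation used by the scripts; real trajectory evaluators typically compare full messages and tool-call arguments as well. `trajectory_matches` is a hypothetical helper.

```python
from collections import Counter
from typing import List


def trajectory_matches(actual: List[str], reference: List[str], mode: str = "strict") -> bool:
    """Compare two tool-name trajectories under one of four match modes."""
    got, want = Counter(actual), Counter(reference)
    if mode == "strict":
        return actual == reference  # same calls, same order
    if mode == "unordered":
        return got == want  # same calls, any order
    if mode == "subset":
        return not (got - want)  # agent made no call beyond the reference
    if mode == "superset":
        return not (want - got)  # agent made every reference call, extras allowed
    raise ValueError(f"unknown mode: {mode}")


reference = ["search", "lookup_order", "refund"]
reordered = ["lookup_order", "search", "refund"]
shorter = ["search", "refund"]
longer = ["search", "lookup_order", "refund", "send_email"]

results = {
    "reordered/strict": trajectory_matches(reordered, reference, "strict"),
    "reordered/unordered": trajectory_matches(reordered, reference, "unordered"),
    "shorter/subset": trajectory_matches(shorter, reference, "subset"),
    "shorter/superset": trajectory_matches(shorter, reference, "superset"),
    "longer/superset": trajectory_matches(longer, reference, "superset"),
}
```

Under these assumptions, a test that fails only because the agent reordered independent tool calls would pass in `unordered` mode, which is why a relaxed mode can reduce flakiness when call order is not part of the requirement.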