Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

lubu-labs/langgraph-testing-evaluation

Name: langgraph-testing-evaluation
Author: lubu-labs

skills/langgraph-testing-evaluation/SKILL.md

npx skillsauth add lubu-labs/langchain-agent-skills langgraph-testing-evaluation

LangGraph Testing & Evaluation

Practical workflows for validating agent quality with:

- Unit/integration tests
- Trajectory evaluation
- LangSmith dataset evaluations
- A/B-style comparisons between versions

Use this file for high-level flow. Load references/* for detailed implementation.

Start Here

Choose the smallest approach that answers your question:

| Goal | Primary method | Load first |
| --- | --- | --- |
| Validate node logic quickly | Unit tests with mocks | references/unit-testing-patterns.md |
| Validate multi-step agent behavior | Trajectory evaluation | references/trajectory-evaluation.md |
| Track quality over datasets over time | LangSmith evaluation | references/langsmith-evaluation.md |
| Compare old vs new agent versions | A/B comparison | references/ab-testing.md |

Recommended order:

1. Unit tests
2. Integration/trajectory checks
3. Dataset evaluation in LangSmith
4. A/B comparison before deployment

Quick Commands

Run from repo root.

Generate test scaffolding

# Python (preferred)
uv run skills/langgraph-testing-evaluation/scripts/generate_test_cases.py my_agent:graph --output tests/ --framework pytest

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/generate_test_cases.js ./my-agent.ts:graph --output tests/ --framework vitest

Run trajectory evaluation

# Python: LLM-as-judge
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent my_dataset --method llm-judge --model openai:o3-mini

# Python: trajectory match
uv run skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.py my_agent:run_agent dataset.json --method match --trajectory-match-mode strict --reference-trajectory reference.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/run_trajectory_eval.js ./agent.ts:runAgent my_dataset --method llm-judge --model openai:o3-mini --max-concurrency 4

Run LangSmith dataset evaluation

# Python
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy,latency --max-concurrency 4

# Python (do not upload experiment results)
uv run skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.py my_agent:run_agent my_dataset --evaluators accuracy --no-upload

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/evaluate_with_langsmith.js ./agent.ts:runAgent my_dataset --evaluators accuracy,latency --max-concurrency 4

Compare two agent versions

# Python
uv run skills/langgraph-testing-evaluation/scripts/compare_agents.py my_agent:v1 my_agent:v2 dataset.json --output comparison_report.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --output comparison_report.json

# JavaScript/TypeScript (force local dataset file only)
node skills/langgraph-testing-evaluation/scripts/compare_agents.js ./v1.ts:run ./v2.ts:run dataset.json --no-langsmith

Create mock response configs

# Python
uv run skills/langgraph-testing-evaluation/scripts/mock_llm_responses.py create --type sequence --output mock_config.json

# JavaScript/TypeScript
node skills/langgraph-testing-evaluation/scripts/mock_llm_responses.js create --type sequence --output mock_config.json

Core Workflow

1. Define test scope.
   - Unit: deterministic logic in one node/function.
   - Integration: node interactions and routing.
   - End-to-end: complete response quality on realistic inputs.
2. Start from deterministic checks.
   - Mock LLM/tool I/O for speed and repeatability.
   - Keep real-model tests as a smaller, explicit suite.
3. Build/curate dataset examples.
   - Use stable inputs and expected outputs.
   - Keep the schema simple: inputs and outputs objects (metadata is optional).
   - Compatibility note: scripts also accept singular keys (input, output) for legacy datasets.
4. Run evaluation with explicit gates.
   - Use evaluator keys that map to deployment decisions.
   - Set thresholds in CI to prevent regressions.
5. Compare versions before rollout.
   - Run the same dataset on both versions.
   - Check both quality and latency.
6. Diagnose failures from traces/experiments.
   - Inspect low-scoring examples.
   - Split failures by pattern (routing, tool usage, hallucination, latency spikes).

Current References (Load On Demand)

`references/unit-testing-patterns.md`

Load when:

You need node-level and routing test patterns.
You need pytest/vitest/Jest integration patterns.
You need robust mocking and flaky-test reduction.

`references/trajectory-evaluation.md`

Load when:

You need trajectory match evaluation (strict, unordered, subset, superset).
You need LLM-as-judge trajectory scoring.
You need LangSmith experiment comparison for trajectory results.

`references/langsmith-evaluation.md`

Load when:

You need dataset creation/management in LangSmith.
You need evaluator signatures and experiment runs in Python/TS.
You need CI-friendly workflows with quality thresholds.

`references/ab-testing.md`

Load when:

You need offline A/B comparison methodology.
You need significance testing and interpretation.
You need production traffic split strategy and guardrails.
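
The reference file itself is not reproduced here, so the following is only a generic sketch of what "significance testing" can look like for an offline A/B comparison, not necessarily the method the skill uses. It assumes both versions were scored on the same examples and applies a paired sign-flip permutation test; the function name and the scores are hypothetical.

```python
import random
from statistics import mean


def paired_permutation_p_value(baseline, candidate, n_resamples=10_000, seed=0):
    """Two-sided p-value for the mean paired difference (sign-flip permutation test)."""
    if len(baseline) != len(candidate):
        raise ValueError("both versions must be scored on the same examples")
    diffs = [c - b for b, c in zip(baseline, candidate)]
    observed = abs(mean(diffs))
    rng = random.Random(seed)  # fixed seed keeps the result repeatable in CI
    hits = 0
    for _ in range(n_resamples):
        # Under the null hypothesis the sign of each paired difference is arbitrary.
        flipped = [d if rng.random() < 0.5 else -d for d in diffs]
        if abs(mean(flipped)) >= observed:
            hits += 1
    # Add-one smoothing keeps the estimate away from an impossible p = 0.
    return (hits + 1) / (n_resamples + 1)


# Hypothetical per-example scores for the same eight dataset examples.
baseline_scores = [0.6, 0.7, 0.5, 0.8, 0.6, 0.7, 0.5, 0.6]
candidate_scores = [0.8, 0.9, 0.7, 0.9, 0.8, 0.9, 0.6, 0.8]

delta = mean(c - b for b, c in zip(baseline_scores, candidate_scores))
p_value = paired_permutation_p_value(baseline_scores, candidate_scores)
print(f"mean delta={delta:+.3f}, p={p_value:.4f}")
```

A paired test is used because each example is scored by both versions, so per-example differences carry more information than comparing two independent averages.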

Assets

`assets/templates/test_template.py`

Runnable Python pytest template aligned with current LangGraph testing patterns.
Includes:
- Compiled-graph invocation with thread_id
- Single-node testing via compiled_graph.nodes[...]
- Integration-test placeholder

`assets/datasets/sample_dataset.json`

Deterministic seed dataset for LangSmith ingestion.
Uses examples: [{ inputs, outputs, metadata }] format.
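
As a rough illustration of the dataset shape described above, the sketch below loads an `examples` list and normalizes legacy singular keys (`input`, `output`) to the plural form. How the skill's scripts actually handle this is not shown in this file; `normalize_example` and the field names `question`, `answer`, and `category` are hypothetical.

```python
import json


def normalize_example(example: dict) -> dict:
    """Return an example with plural `inputs`/`outputs` keys and a metadata dict."""
    inputs = example.get("inputs", example.get("input"))
    outputs = example.get("outputs", example.get("output"))
    if inputs is None:
        raise ValueError("example has neither 'inputs' nor 'input'")
    return {
        "inputs": inputs,
        "outputs": outputs if outputs is not None else {},
        "metadata": example.get("metadata", {}),
    }


# The second example uses the legacy singular keys and omits metadata.
raw = json.loads("""
{
  "examples": [
    {"inputs": {"question": "What is 2 + 2?"},
     "outputs": {"answer": "4"},
     "metadata": {"category": "math"}},
    {"input": {"question": "Capital of France?"},
     "output": {"answer": "Paris"}}
  ]
}
""")

examples = [normalize_example(e) for e in raw["examples"]]
print(examples[1])
```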

`assets/examples/README.md`

Documentation-only index for current asset usage.
Notes where runnable assets live today.

Script Interface Summary

`scripts/generate_test_cases.py` / `.js`

Use for fast test scaffolding.

Inputs:

Graph module path
- Python: my_module:graph or my_module.graph
- JS/TS: ./file.ts:graph

Outputs:

Framework-specific starter tests in target directory.
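
One plausible way to resolve the two Python path forms listed above (`my_module:graph` and `my_module.graph`) is sketched below with `importlib`. This is an assumption about the approach, not the script's actual code; `load_target` is a hypothetical helper, and standard-library targets stand in for a real graph module so the sketch runs anywhere.

```python
import importlib


def load_target(spec: str):
    """Resolve 'package.module:attr' or 'package.module.attr' to the named object."""
    if ":" in spec:
        module_name, attr = spec.split(":", 1)
    else:
        # Dotted form: treat the last component as the attribute name.
        module_name, _, attr = spec.rpartition(".")
    if not module_name or not attr:
        raise ValueError(f"expected 'module:attr' or 'module.attr', got {spec!r}")
    module = importlib.import_module(module_name)
    return getattr(module, attr)


# `json:dumps` plays the role of `my_agent:graph` here.
print(load_target("json:dumps") is load_target("json.dumps"))
```

The colon form is unambiguous; the dotted form relies on the convention that the final component names the attribute rather than a submodule.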

`scripts/run_trajectory_eval.py` / `.js`

Use for trajectory scoring with either:

--method match
--method llm-judge

Supports:

Local dataset files (.json)
LangSmith dataset names
Optional reference trajectory file with --reference-trajectory
Match modes: strict, unordered, subset, superset

Local-only mode:

--no-langsmith in both Python and JavaScript scripts (requires local JSON dataset file)
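
To make the four match modes concrete, here is a simplified sketch that compares trajectories as plain lists of tool names. It is an illustration under that assumption only: real evaluators typically also compare messages and tool arguments, and `trajectory_matches` and the tool names are hypothetical.

```python
from collections import Counter


def trajectory_matches(actual, reference, mode="strict"):
    """Compare two lists of tool names under one of four match modes."""
    got, want = Counter(actual), Counter(reference)
    if mode == "strict":
        return list(actual) == list(reference)  # same calls, same order
    if mode == "unordered":
        return got == want                       # same calls, any order
    if mode == "subset":
        return not (got - want)                  # no calls beyond the reference
    if mode == "superset":
        return not (want - got)                  # every reference call was made
    raise ValueError(f"unknown match mode: {mode!r}")


reference = ["search", "fetch_page", "summarize"]
actual = ["fetch_page", "search", "summarize", "search"]  # reordered, one extra search

for mode in ("strict", "unordered", "subset", "superset"):
    print(mode, trajectory_matches(actual, reference, mode))
```

In this example only `superset` passes: the agent made every reference call but also repeated `search`, which is the kind of equivalent-path behavior that strict matching would flag as a failure.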

`scripts/evaluate_with_langsmith.py` / `.js`

Use for dataset-based evaluation runs and experiment tracking.

Supports:

Existing dataset by name
Dataset creation from JSON examples file
Multiple evaluators (--evaluators accuracy,latency,...)
Concurrency control (--max-concurrency)

Python-only:

--no-upload to run without uploading experiment results

`scripts/compare_agents.py` / `.js`

Use for offline version comparisons:

Shared dataset input
Success/latency summaries
JSON report output for CI artifacts
Local JSON datasets or LangSmith datasets (JS supports --no-langsmith to disable remote loading)
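
The sketch below shows the general idea of an offline comparison: run two callables over one shared dataset and summarize success and latency side by side. The report layout, the exact-match success rule, and the two toy agents are all hypothetical; the actual script's output format is not documented in this file.

```python
import json
import time
from statistics import mean


def summarize(agent, examples):
    """Run `agent` on every example; report success rate and mean latency."""
    successes, latencies = 0, []
    for example in examples:
        start = time.perf_counter()
        try:
            output = agent(example["inputs"])
            successes += int(output == example["outputs"])  # exact-match success
        except Exception:
            pass  # a crash counts as a failed example
        latencies.append(time.perf_counter() - start)
    return {"success_rate": successes / len(examples), "mean_latency_s": mean(latencies)}


def compare(baseline, candidate, examples):
    """Build a report with both summaries and candidate-minus-baseline deltas."""
    a, b = summarize(baseline, examples), summarize(candidate, examples)
    return {"baseline": a, "candidate": b, "delta": {key: b[key] - a[key] for key in a}}


# Hypothetical stand-ins for two agent versions.
def agent_v1(inputs):
    return {"answer": str(inputs["a"] + inputs["b"])}


def agent_v2(inputs):
    # Regression: gives up on negative inputs.
    return {"answer": str(inputs["a"] + inputs["b"]) if inputs["a"] >= 0 else "?"}


examples = [
    {"inputs": {"a": 1, "b": 2}, "outputs": {"answer": "3"}},
    {"inputs": {"a": -1, "b": 2}, "outputs": {"answer": "1"}},
]
report = compare(agent_v1, agent_v2, examples)
print(json.dumps(report, indent=2))
```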

`scripts/mock_llm_responses.py` / `.js`

Use for deterministic test doubles:

single
sequence
conditional
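
The config format these scripts produce is not shown here, so the following only sketches what the three test-double types could mean in practice. The `MockLLM` class, its `invoke` method, and the substring-matching rule for the conditional type are assumptions made for illustration.

```python
class MockLLM:
    """Deterministic stand-in for a model call; returns canned text only."""

    def __init__(self, mode, responses, default="(no match)"):
        self.mode = mode
        self.default = default
        if mode == "single":
            self._responses = responses        # one string, returned every time
        elif mode == "sequence":
            self._responses = iter(responses)  # returned in order, one per call
        elif mode == "conditional":
            self._responses = dict(responses)  # prompt substring -> response
        else:
            raise ValueError(f"unknown mode: {mode!r}")

    def invoke(self, prompt: str) -> str:
        if self.mode == "single":
            return self._responses
        if self.mode == "sequence":
            return next(self._responses, self.default)
        for needle, response in self._responses.items():
            if needle in prompt:
                return response
        return self.default


planner = MockLLM("sequence", ["call search", "final answer"])
router = MockLLM("conditional", {"refund": "route:billing", "password": "route:account"})

print(planner.invoke("step 1"), "|", planner.invoke("step 2"))
print(router.invoke("I forgot my password"))
```

A sequence double suits multi-step agents where each model call should advance the plan, while a conditional double suits routing nodes whose output depends on the prompt.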

Decision Rules

- If behavior is deterministic and local: use unit tests first.
- If behavior depends on tool sequence/routing: add trajectory evaluation.
- If behavior depends on realistic distribution quality: run LangSmith dataset evaluation.
- If approving a replacement model/prompt/graph: run A/B comparison and check both quality and latency.

Common Failure Patterns

Flaky tests

- Cause: real-model nondeterminism in unit scope.
- Fix: mock LLM/tool calls for unit tests; reserve real-model tests for separate integration marks.

High trajectory variance

- Cause: overly strict matching for workflows with equivalent paths.
- Fix: switch match mode (unordered, subset, or superset) where appropriate.

Regressions hidden by averages

- Cause: only the aggregate score is monitored.
- Fix: inspect per-example failures and segment by category metadata.

Latency regressions with same quality

- Cause: no explicit latency gate.
- Fix: include a latency evaluator and a CI threshold.
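
The "regressions hidden by averages" pattern can be illustrated with a small sketch that groups per-example scores by a metadata field. The result records, the `category` key, and the helper name are hypothetical; the point is only that a healthy-looking aggregate can mask a category that fails completely.

```python
from collections import defaultdict
from statistics import mean


def scores_by_category(results, key="category"):
    """Group per-example scores by a metadata field and average each group."""
    groups = defaultdict(list)
    for result in results:
        groups[result["metadata"].get(key, "uncategorized")].append(result["score"])
    return {name: mean(scores) for name, scores in groups.items()}


# Hypothetical per-example results: eight passing FAQ examples, two failing refund examples.
results = (
    [{"score": 1.0, "metadata": {"category": "faq"}}] * 8
    + [{"score": 0.0, "metadata": {"category": "refunds"}}] * 2
)

overall = mean(r["score"] for r in results)
per_category = scores_by_category(results)
print("overall:", overall)
print("by category:", per_category)
```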

Minimal Best Practices

- Keep fast deterministic tests as the largest share.
- Version datasets and keep them stable.
- Track both correctness and latency.
- Add explicit go/no-go thresholds in CI.
- Compare candidate vs baseline before production rollout.
- Investigate failures with trace-level evidence, not only aggregate scores.
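
A go/no-go gate of the kind listed above might look like the sketch below. The metric names, thresholds, and the `check_gates` helper are hypothetical; in a real pipeline the metrics would come from an evaluation run and the returned code would be passed to `sys.exit` so that a violation fails the CI job.

```python
# Hypothetical thresholds: (comparison kind, limit) per metric.
GATES = {
    "accuracy": ("min", 0.90),
    "p95_latency_s": ("max", 2.5),
}


def check_gates(metrics, gates=GATES):
    """Return a list of human-readable violations; an empty list means go."""
    failures = []
    for name, (kind, limit) in gates.items():
        value = metrics.get(name)
        if value is None:
            failures.append(f"{name}: metric missing")
        elif kind == "min" and value < limit:
            failures.append(f"{name}: {value} is below the minimum {limit}")
        elif kind == "max" and value > limit:
            failures.append(f"{name}: {value} is above the maximum {limit}")
    return failures


def main(metrics):
    """Print each violation and return a process exit code (0 = go, 1 = no-go)."""
    failures = check_gates(metrics)
    for failure in failures:
        print("GATE FAILED:", failure)
    return 1 if failures else 0


# Quality passes, but latency regressed past its limit.
exit_code = main({"accuracy": 0.93, "p95_latency_s": 3.1})
print("exit code:", exit_code)
```

Treating a missing metric as a failure is a deliberate choice in this sketch: a gate that silently passes when an evaluator did not run gives false confidence.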

Use this skill when you need to test or evaluate LangGraph/LangChain agents: writing unit or integration tests, generating test scaffolds, mocking LLM/tool behavior, running trajectory evaluation (match or LLM-as-judge), running LangSmith dataset evaluations, and comparing two agent versions with A/B-style offline analysis. Use it for Python and JavaScript/TypeScript workflows, evaluator design, experiment setup, regression gates, and debugging flaky/incorrect evaluation results.

88 stars

tools

Updated Apr 6, 2026

$ install --global

skillsauth

npx skillsauth add lubu-labs/langchain-agent-skills langgraph-testing-evaluation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
| --- | --- | --- | --- |
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 20, 2026, 10:39 AM (9.5s, 1 file scanned)


Related Skills

lubu-labs/skill-creator

tools

Verified, Trusted, Community

Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.

88 · SKILL.md · Updated Apr 6, 2026

lubu-labs/langsmith-trace-analyzer

tools

Verified, Trusted, Community

Fetch, organize, and analyze LangSmith traces for debugging and evaluation. Use when you need to: query traces/runs by project, metadata, status, or time window; download traces to JSON; organize outcomes into passed/failed/error buckets; analyze token/message/tool-call patterns; compare passed vs failed behavior; or investigate benchmark and production failures.

88 · SKILL.md · Updated Apr 6, 2026

lubu-labs/langgraph-state-management

development

Verified, Trusted, Community

Design state schemas, implement reducers, configure persistence, and debug state issues for LangGraph applications. Use when users want to (1) design or define state schemas for LangGraph graphs, (2) implement reducer functions for state accumulation, (3) configure persistence with checkpointers (InMemorySaver/MemorySaver, SqliteSaver, PostgresSaver), (4) debug state update issues or unexpected state behavior, (5) migrate state schemas between versions, (6) validate state schema structure, (7) choose between TypedDict and MessagesState patterns, (8) implement custom reducers for lists, dicts, or sets, (9) use the Overwrite type to bypass reducers, (10) set up thread-based persistence for multi-turn conversations, or (11) inspect checkpoints for debugging.

88 · SKILL.md · Updated Apr 6, 2026

lubu-labs/langgraph-project-setup

development

Verified, Trusted, Community

Initialize and configure LangGraph projects with proper structure, langgraph.json configuration, environment variables, and dependency management. Use when users want to (1) create a new LangGraph project, (2) set up langgraph.json for deployment, (3) configure environment variables for LLM providers, (4) initialize project structure for agents, (5) set up local development with LangGraph Studio, (6) configure dependencies (pyproject.toml, requirements.txt, package.json), or (7) troubleshoot project configuration issues.

88 · SKILL.md · Updated Apr 6, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/lubu-labs/langchain-agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r langchain-agent-skills/skills/langgraph-testing-evaluation ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

lubu-labs/langchain-agent-skills

88 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
