skills/evaluations/SKILL.md
Set up comprehensive evaluations for your AI agent with LangWatch — experiments (batch testing), evaluators (scoring functions), datasets, online evaluation (production monitoring), and guardrails (real-time blocking). Supports both code (SDK) and platform (MCP) approaches. Use when the user wants to evaluate, test, benchmark, monitor, or safeguard their agent.
npx skillsauth add langwatch/langwatch evaluations
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
LangWatch Evaluations is a comprehensive quality assurance system. Understand which part the user needs:
| User says... | They need... | Go to... |
|---|---|---|
| "test my agent", "benchmark", "compare models" | Experiments | Step A |
| "monitor production", "track quality", "block harmful content", "safety" | Online Evaluation (includes guardrails) | Step B |
| "create an evaluator", "scoring function" | Evaluators | Step C |
| "create a dataset", "test data" | Datasets | Step D |
| "evaluate" (ambiguous) | Ask: "batch test or production monitoring?" | - |
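The routing in the table can be sketched as a simple phrase lookup. This is only an illustration: `route_request`, `ROUTES`, and `CLARIFYING_QUESTION` are hypothetical names, and the sketch assumes that any request matching no row should trigger the clarifying question, which the table only states for "evaluate".

```python
# Hypothetical router; the phrase lists are copied from the table above.
ROUTES = [
    (("test my agent", "benchmark", "compare models"), "Step A"),
    (("monitor production", "track quality", "block harmful content", "safety"), "Step B"),
    (("create an evaluator", "scoring function"), "Step C"),
    (("create a dataset", "test data"), "Step D"),
]
CLARIFYING_QUESTION = "batch test or production monitoring?"


def route_request(request: str) -> str:
    """Return the step for a request, or the clarifying question if unclear."""
    text = request.lower()
    for phrases, step in ROUTES:
        if any(phrase in text for phrase in phrases):
            return step
    # Assumption: no phrase matched (for example a bare "evaluate"),
    # so ask the user instead of guessing.
    return CLARIFYING_QUESTION


if __name__ == "__main__":
    print(route_request("Can you benchmark my agent?"))
    print(route_request("evaluate my agent"))
```

In practice the decision is a judgment call about intent, not a string match; the sketch only makes the order of checks explicit.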
Evaluations sit at the component level of the testing pyramid — they test specific aspects of your agent with many input/output examples. This is different from scenarios (end-to-end multi-turn conversation testing).
Use evaluations when:
Use scenarios instead when:
For onboarding, create 1-2 Jupyter notebooks (or scripts) maximum. Focus on generating domain-realistic data that's as close to real-world inputs as possible.
If the user's request is general ("set up evaluations", "evaluate my agent"):
If the user's request is specific ("add a faithfulness evaluator", "create a dataset for RAG testing"):
package.json, pyproject.toml, requirements.txt, etc.)
Some features are code-only (experiments, guardrails) and some are platform-only (monitors). Evaluators work on both surfaces.
See Plan Limits for how to handle free plan limits gracefully. Focus on delivering value within the limits — create 1-2 high-quality experiments with domain-realistic data rather than many shallow ones. Do NOT try to work around limits by deleting existing resources. Show the user the value of what you created before suggesting an upgrade.
Preferred: Use the LangWatch CLI (see CLI Setup)
The CLI is the primary interface for agents. If the CLI is not available, fall back to MCP tools.
See MCP Setup for MCP installation as an alternative.
If MCP installation fails, see docs fallback.
Read the evaluations overview first: call fetch_langwatch_docs with url https://langwatch.ai/docs/evaluations/overview.md
Create a script or notebook that runs your agent against a dataset and measures quality.
fetch_langwatch_docs with url https://langwatch.ai/docs/evaluations/experiments/sdk.md
Python (Jupyter Notebook, .ipynb):
import langwatch
import pandas as pd

# Dataset tailored to the agent's domain
data = {
    "input": ["domain-specific question 1", "domain-specific question 2"],
    "expected_output": ["expected answer 1", "expected answer 2"],
}
df = pd.DataFrame(data)

evaluation = langwatch.experiment.init("agent-evaluation")

for index, row in evaluation.loop(df.iterrows()):
    response = my_agent(row["input"])
    evaluation.evaluate(
        "ragas/answer_relevancy",
        index=index,
        data={"input": row["input"], "output": response},
        settings={"model": "openai/gpt-5-mini", "max_tokens": 2048},
    )
TypeScript (script, .ts):
import { LangWatch } from "langwatch";

const langwatch = new LangWatch();

const dataset = [
  { input: "domain-specific question", expectedOutput: "expected answer" },
];

const evaluation = await langwatch.experiments.init("agent-evaluation");

await evaluation.run(dataset, async ({ item, index }) => {
  const response = await myAgent(item.input);
  await evaluation.evaluate("ragas/answer_relevancy", {
    index,
    data: { input: item.input, output: response },
    settings: { model: "openai/gpt-5-mini", max_tokens: 2048 },
  });
});
ALWAYS run the experiment after creating it. If it fails, fix it. An experiment that isn't executed is useless.
For Python notebooks: Create an accompanying script to run it:
# run_experiment.py
import subprocess
subprocess.run(["jupyter", "nbconvert", "--to", "notebook", "--execute", "experiment.ipynb"], check=True)
Or simply run the cells in order via the notebook interface.
For TypeScript: npx tsx experiment.ts
Online evaluation has two modes:
Set up monitors that continuously score production traffic.
fetch_langwatch_docs with url https://langwatch.ai/docs/evaluations/online-evaluation/overview.md
Add code to block harmful content before it reaches users (synchronous, real-time).
fetch_langwatch_docs with url https://langwatch.ai/docs/evaluations/guardrails/code-integration.md
import langwatch

@langwatch.trace()
def my_agent(user_input):
    guardrail = langwatch.evaluation.evaluate(
        "azure/jailbreak",
        name="Jailbreak Detection",
        as_guardrail=True,
        data={"input": user_input},
    )
    if not guardrail.passed:
        return "I can't help with that request."
    # Continue with normal processing...
Key distinction: Monitors measure (async, observability). Guardrails act (sync, enforcement via code with as_guardrail=True).
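The difference in control flow can be sketched without the LangWatch SDK. Everything here is a hypothetical stand-in: `fake_jailbreak_check` replaces a real evaluator, `answer` replaces the agent, and the monitor records its score inline only for simplicity (a real monitor scores traffic asynchronously on the platform).

```python
from typing import Callable, List, Tuple

BLOCKED_MESSAGE = "I can't help with that request."


def fake_jailbreak_check(text: str) -> bool:
    """Hypothetical stand-in evaluator: returns True when the input passes."""
    return "ignore previous instructions" not in text.lower()


def answer(user_input: str) -> str:
    """Placeholder for the agent's normal processing."""
    return f"Answer to: {user_input}"


def with_guardrail(user_input: str, check: Callable[[str], bool]) -> str:
    # Guardrail: runs before the agent and can replace the response.
    if not check(user_input):
        return BLOCKED_MESSAGE
    return answer(user_input)


def with_monitor(
    user_input: str,
    check: Callable[[str], bool],
    log: List[Tuple[str, bool]],
) -> str:
    # Monitor: the response is returned unchanged; the score is only recorded.
    response = answer(user_input)
    log.append((user_input, check(user_input)))
    return response
```

With the same failing input, `with_guardrail` returns the blocked message while `with_monitor` still returns the agent's answer and leaves a failed score in the log.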
Create or configure evaluators — the functions that score your agent's outputs.
fetch_langwatch_docs with url https://langwatch.ai/docs/evaluations/evaluators/overview.md
https://langwatch.ai/docs/evaluations/evaluators/list.md
evaluation.evaluate("ragas/faithfulness", index=idx, data={...})
langwatch evaluator list # List evaluators
langwatch evaluator create "My Evaluator" --type langevals/llm_judge
langwatch evaluator get <idOrSlug> # View details
langwatch evaluator update <idOrSlug> --name "New Name" # Update
langwatch evaluation run <slug> --wait # Run evaluation and wait
langwatch evaluation status <runId> # Check run status
langwatch evaluator list # List evaluators
langwatch evaluator get <idOrSlug> # Get details
langwatch evaluator create "Name" --type langevals/llm_judge # Create
langwatch evaluator update <idOrSlug> --name "New Name" # Update
If the CLI is not available, use MCP tools as a fallback:
discover_schema with category "evaluators" to see available types
platform_create_evaluator to create an evaluator on the platform
This is useful for setting up LLM-as-judge evaluators, custom evaluators, or configuring evaluators that will be used in platform experiments and monitors.
Create test datasets for experiments.
langwatch dataset list # List datasets
langwatch dataset create "My Dataset" -c input:string,output:string
langwatch dataset upload my-dataset data.csv # Upload CSV/JSON
langwatch dataset records list my-dataset # View records
langwatch dataset download my-dataset -f csv # Download
fetch_langwatch_docs with url https://langwatch.ai/docs/datasets/overview.md

| Agent type | Dataset examples |
|---|---|
| Chatbot | Realistic user questions matching the bot's persona |
| RAG pipeline | Questions with expected answers testing retrieval quality |
| Classifier | Inputs with expected category labels |
| Code assistant | Coding tasks with expected outputs |
| Customer support | Support tickets and customer questions |
| Summarizer | Documents with expected summaries |
CRITICAL: The dataset MUST be specific to what the agent ACTUALLY does. Before generating any data:
Then generate data that reflects EXACTLY this agent's real-world usage. For example:
NEVER use generic examples like "What is 2+2?", "What is the capital of France?", or "Explain quantum computing". These are useless for evaluating the specific agent. Every single example must be something a real user of THIS specific agent would actually say.
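A minimal sketch of preparing such a dataset as a CSV file follows. The rows are invented for an imagined internet-provider support agent and must be replaced with examples drawn from the actual agent; `write_dataset` and `GENERIC_EXAMPLES` are hypothetical helpers, and the column names assume the `input` / `expected_output` layout used elsewhere in this document.

```python
import csv

# Normalized forms of the generic examples called out above.
GENERIC_EXAMPLES = {
    "what is 2+2",
    "what is the capital of france",
    "explain quantum computing",
}

# Hypothetical rows for an imagined internet-provider support agent.
ROWS = [
    {
        "input": "My router drops the connection every evening. What should I check?",
        "expected_output": "Ask for the router model, then walk through a restart and firmware check.",
    },
    {
        "input": "I was charged twice for this month's plan. Can you fix it?",
        "expected_output": "Apologize, confirm the account, and open a billing correction ticket.",
    },
]


def write_dataset(rows, path):
    """Write rows to a CSV file, rejecting known generic examples."""
    for row in rows:
        normalized = row["input"].strip().lower().rstrip("?.! ")
        if normalized in GENERIC_EXAMPLES:
            raise ValueError(f"Generic example not allowed: {row['input']!r}")
    with open(path, "w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=["input", "expected_output"])
        writer.writeheader()
        writer.writerows(rows)
    return path


if __name__ == "__main__":
    write_dataset(ROWS, "data.csv")
```

The resulting `data.csv` has the shape expected by the `langwatch dataset upload` command shown in this document.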
https://langwatch.ai/docs/datasets/programmatic-access.md
https://langwatch.ai/docs/datasets/ai-dataset-generation.md
When the user has no codebase and wants to set up evaluation building blocks on the platform. Use the CLI as the primary interface:
langwatch prompt list # List existing prompts
langwatch prompt create my-prompt # Create a new prompt YAML
langwatch prompt push # Push to the platform
langwatch prompt versions my-prompt # View version history
langwatch prompt tag assign my-prompt production # Tag a version
Before creating evaluators, verify model providers are configured:
langwatch model-provider list # Check existing providers
langwatch model-provider set openai --api-key sk-... # Set up a provider
langwatch evaluator list # See available evaluators
langwatch evaluator create "Quality Judge" --type langevals/llm_judge
langwatch evaluator get <idOrSlug> --format json # View details
langwatch dataset create "Test Data" -c input:string,expected_output:string
langwatch dataset upload test-data data.csv # Upload from CSV/JSON/JSONL
langwatch dataset records list test-data # View records
langwatch monitor create "Toxicity Check" --check-type ragas/toxicity
langwatch monitor create "PII Detection" --check-type presidio/pii_detection --sample 0.5
langwatch monitor list # View all monitors
Go to https://app.langwatch.ai and:
If the CLI is not available, use MCP tools instead (platform_create_prompt, platform_create_evaluator, etc.).
langwatch CLI over MCP tools for platform features (evaluators, monitors, datasets)
as_guardrail=True); both are online evaluation
LANGWATCH_API_KEY in .env
discover_schema before creating evaluators via MCP to understand available types
langwatch prompt create CLI when using the platform approach; that's for code-based projects