skills/recipes/generate-rag-dataset/SKILL.md
Generate a synthetic evaluation dataset from your RAG knowledge base. Creates diverse Q&A pairs with expected answers and relevant context, ready for LangWatch experiments and platform import. Use when you need test data for your RAG pipeline.
npx skillsauth add langwatch/langwatch generate-rag-dataset
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This recipe analyzes your RAG knowledge base and generates a comprehensive Q&A evaluation dataset.
Read the codebase to find the knowledge base:
Read every document you can access. Understand:
Create questions across these categories:
Direct questions answerable from a single passage:
Questions requiring information from multiple passages:
Questions comparing concepts within the knowledge base:
Questions about boundary conditions or unusual scenarios:
Questions about topics NOT covered by the knowledge base:
These help test that the agent correctly says "I don't know" rather than hallucinating.
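The categories above can be tracked with a small lookup table so the generated set stays balanced. This is a minimal sketch, not part of the recipe: only the `factual_recall` label appears in the recipe's example record, so the other four label names, the even split, and the helper names `plan_question_counts` and `coverage` are assumptions made for illustration.

```python
from collections import Counter

# Hypothetical category labels. Only "factual_recall" comes from the recipe's
# example record; the remaining names are assumptions for this sketch.
QUESTION_TYPES = {
    "factual_recall": "answerable from a single passage",
    "multi_hop": "requires information from multiple passages",
    "comparison": "compares concepts within the knowledge base",
    "edge_case": "boundary conditions or unusual scenarios",
    "negative": "topic NOT covered by the knowledge base",
}


def plan_question_counts(total: int) -> dict[str, int]:
    """Split `total` questions as evenly as possible across the categories."""
    base, extra = divmod(total, len(QUESTION_TYPES))
    return {
        name: base + (1 if i < extra else 0)
        for i, name in enumerate(QUESTION_TYPES)
    }


def coverage(dataset: list[dict]) -> Counter:
    """Count how many generated records fall into each category."""
    return Counter(row["question_type"] for row in dataset)


plan = plan_question_counts(22)
```

An even split is only one option. If the pipeline is expected to see mostly single-passage questions, weighting the plan toward that category may reflect real usage better.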
For each Q&A pair, include the relevant document chunk(s) that contain the answer. This enables:
Format:
{
  "input": "When should I irrigate apple orchards?",
  "expected_output": "Irrigate when soil moisture exceeds 35 kPa...",
  "context": "## Irrigation Management\nSoil moisture threshold for apple orchards: maintain between 25-35 kPa...",
  "question_type": "factual_recall"
}
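A record in this format can be checked before it enters the dataset. The sketch below assumes every field is a string and that `context` may be left empty (for example, for questions about topics the knowledge base does not cover). Both rules, and the helper name `validate_record`, are assumptions rather than requirements stated by the recipe.

```python
REQUIRED_FIELDS = ("input", "expected_output", "context", "question_type")


def validate_record(record: dict) -> list[str]:
    """Return a list of problems found in one Q&A record (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = record.get(field)
        if not isinstance(value, str):
            problems.append(f"{field}: missing or not a string")
        elif not value.strip() and field != "context":
            # Assumption: an empty context is allowed, other fields are not.
            problems.append(f"{field}: empty")
    return problems


record = {
    "input": "When should I irrigate apple orchards?",
    "expected_output": "Irrigate when soil moisture exceeds 35 kPa...",
    "context": "## Irrigation Management\nSoil moisture threshold for apple orchards: maintain between 25-35 kPa...",
    "question_type": "factual_recall",
}

problems = validate_record(record)
```

Returning a list of problems instead of raising on the first one makes it easier to report every issue in a large generated batch at once.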
Create both:
import pandas as pd

# `dataset` is the list of generated Q&A dicts, one dict per row
df = pd.DataFrame(dataset)
df.to_csv("rag_evaluation_dataset.csv", index=False)
Export with columns: input, expected_output, context, question_type
This can be imported directly into LangWatch platform datasets.
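Because the `context` field usually contains newlines and commas, it can be worth reading the exported file back and confirming that the four columns survive intact. The following is a dependency-free sketch using the standard library `csv` module in place of pandas. The helper names `export_csv` and `read_csv` are hypothetical, and the check says nothing about how the LangWatch importer itself parses the file.

```python
import csv
import tempfile
from pathlib import Path

COLUMNS = ["input", "expected_output", "context", "question_type"]


def export_csv(dataset: list[dict], path: Path) -> None:
    """Write the dataset with exactly the four expected columns."""
    with path.open("w", newline="", encoding="utf-8") as f:
        writer = csv.DictWriter(f, fieldnames=COLUMNS, extrasaction="ignore")
        writer.writeheader()
        writer.writerows(dataset)


def read_csv(path: Path) -> list[dict]:
    """Read the file back as a list of row dicts."""
    with path.open(newline="", encoding="utf-8") as f:
        return list(csv.DictReader(f))


sample = [{
    "input": "When should I irrigate apple orchards?",
    "expected_output": "Irrigate when soil moisture exceeds 35 kPa...",
    "context": "## Irrigation Management\nSoil moisture threshold for apple orchards: maintain between 25-35 kPa...",
    "question_type": "factual_recall",
}]

# Write to a temporary directory and read the rows back for comparison.
with tempfile.TemporaryDirectory() as tmp:
    out = Path(tmp) / "rag_evaluation_dataset.csv"
    export_csv(sample, out)
    round_tripped = read_csv(out)
```

If `round_tripped` differs from the input rows, the usual culprits are unquoted newlines or a missing `newline=""` argument when opening the file.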
Before using the dataset: