Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

langwatch/generate-rag-dataset

Name: generate-rag-dataset
Author: langwatch

skills/recipes/generate-rag-dataset/SKILL.md

npx skillsauth add langwatch/langwatch generate-rag-dataset

Generate a RAG Evaluation Dataset

This recipe analyzes your RAG knowledge base and generates a comprehensive Q&A evaluation dataset.

Step 1: Analyze the Knowledge Base

Read the codebase to find the knowledge base:

Document files (PDFs, markdown, text files)
Database schemas (if documents are stored in a DB)
Vector store configuration (what's being embedded)
Chunking strategy (how documents are split)

Read every document you can access. Understand:

What topics does the knowledge base cover?
What's the depth of information?
What terminology is used?
What are the boundaries (what's NOT covered)?
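
The discovery part of Step 1 could be sketched roughly as follows. This is only an illustration: the extension list and the helper names `find_documents` and `summarize` are assumptions, not part of the recipe, and a real knowledge base may also live in a database or vector store that this sketch does not inspect.

```python
from collections import Counter
from pathlib import Path

# Hypothetical extension list; adjust it to the project being analyzed.
DOC_EXTENSIONS = {".pdf", ".md", ".txt"}


def find_documents(root: str) -> list[Path]:
    """Return candidate knowledge-base files under root, sorted for stable output."""
    return sorted(
        p for p in Path(root).rglob("*")
        if p.is_file() and p.suffix.lower() in DOC_EXTENSIONS
    )


def summarize(paths: list[Path]) -> dict[str, int]:
    """Count documents per extension as a first view of what the KB holds."""
    return dict(Counter(p.suffix.lower() for p in paths))
```

The file list is only a starting point; the documents themselves still need to be read to answer the questions above.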

Step 2: Generate Diverse Question Types

Create questions across these categories:

Factual Recall

Direct questions answerable from a single passage:

"What is the recommended threshold for X?"
"When should Y be applied?"

Multi-Hop Reasoning

Questions requiring information from multiple passages:

"Given condition A and condition B, what should be done?"
"How do X and Y interact when Z occurs?"

Comparison

Questions comparing concepts within the knowledge base:

"What's the difference between approach A and approach B?"
"When should you use X instead of Y?"

Edge Cases

Questions about boundary conditions or unusual scenarios:

"What happens if the measurement is outside normal range?"
"What if two recommendations conflict?"

Negative Cases

Questions about topics NOT covered by the knowledge base:

"Does the system support Z?" (when it doesn't)
Questions requiring external knowledge the KB doesn't have

These help test that the agent correctly says "I don't know" rather than hallucinating.
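
One way to keep the five categories balanced is to fix a target count per type before writing any questions. The sketch below assumes an even split, which the recipe does not prescribe. Only the label `factual_recall` appears in the recipe; the other four labels and the helper `plan_counts` are hypothetical.

```python
# Only "factual_recall" comes from the recipe; the other labels are assumed names.
QUESTION_TYPES = ["factual_recall", "multi_hop", "comparison", "edge_case", "negative"]


def plan_counts(total: int, types: list[str] = QUESTION_TYPES) -> dict[str, int]:
    """Split `total` questions as evenly as possible across the question types."""
    base, extra = divmod(total, len(types))
    # The first `extra` types receive one additional question each.
    return {t: base + (1 if i < extra else 0) for i, t in enumerate(types)}
```

For example, `plan_counts(12)` assigns three questions to each of the first two types and two to each of the rest.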

Step 3: Include Context Per Row

For each Q&A pair, include the relevant document chunk(s) that contain the answer. This enables:

Platform experiments without the full RAG pipeline
Evaluating answer quality independent of retrieval quality
Testing with different prompts using the same retrieved context

Format:

{
    "input": "When should I irrigate apple orchards?",
    "expected_output": "Irrigate when soil moisture exceeds 35 kPa...",
    "context": "## Irrigation Management\nSoil moisture threshold for apple orchards: maintain between 25-35 kPa...",
    "question_type": "factual_recall"
}
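
A small check that every row carries the four fields shown above might look like this. It is a sketch under the assumption that all four fields are non-empty strings; the helper name `validate_row` is hypothetical, and the rule may need loosening for negative cases where no supporting context exists.

```python
REQUIRED_FIELDS = ("input", "expected_output", "context", "question_type")


def validate_row(row: dict) -> list[str]:
    """Return a list of problems found in one dataset row (empty if none)."""
    problems = []
    for field in REQUIRED_FIELDS:
        value = row.get(field)
        # Assumption: every field must be a non-empty string.
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty field: {field}")
    return problems
```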

Step 4: Export Formats

Create both:

Python DataFrame (for SDK experiments)

import pandas as pd

# `dataset` is the list of row dicts built in Step 3
df = pd.DataFrame(dataset)
df.to_csv("rag_evaluation_dataset.csv", index=False)

Platform-Ready CSV

Export with columns: input, expected_output, context, question_type. This can be imported directly into LangWatch platform datasets.
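
To pin the column order regardless of how the row dicts were built, the export can name the columns explicitly. The sketch below uses Python's standard `csv` module rather than pandas so that it has no dependencies; `to_csv_text` is a hypothetical helper, and whether the platform expects any further CSV options is not stated in the recipe.

```python
import csv
import io

COLUMNS = ["input", "expected_output", "context", "question_type"]


def to_csv_text(rows: list[dict]) -> str:
    """Serialize rows to CSV text with the four columns in a fixed order."""
    buffer = io.StringIO()
    # Extra keys are dropped; multi-line context values are quoted by the writer.
    writer = csv.DictWriter(buffer, fieldnames=COLUMNS, extrasaction="ignore")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Reading the text back with `csv.DictReader` is a quick way to confirm that newlines inside the context field survive the round trip.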

Step 5: Validate Dataset Quality

Before using the dataset:

Check topic coverage — are all knowledge base topics represented?
Verify answers are actually in the context — no hallucinated expected outputs
Check question diversity — not all the same type
Verify negative cases have appropriate "I don't know" expected outputs
Run a quick experiment to baseline accuracy
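
Some of these checks can be partly automated. The sketch below is a crude heuristic, not a substitute for reading the rows: it assumes negative cases are labeled `negative` and contain the phrase "don't know" (with a plain ASCII apostrophe), and it approximates grounding by requiring that every number in an expected output also appears in the row's context. The function name and both defaults are hypothetical.

```python
import re
from collections import Counter

NUMBER = re.compile(r"\d+(?:\.\d+)?")


def check_dataset(rows: list[dict], negative_type: str = "negative",
                  refusal_marker: str = "don't know") -> list[str]:
    """Return human-readable issues found across the whole dataset."""
    issues = []
    type_counts = Counter(r["question_type"] for r in rows)
    if len(type_counts) < 2:
        issues.append("all questions share one question_type")
    if negative_type not in type_counts:
        issues.append("no negative cases present")
    for i, r in enumerate(rows):
        if r["question_type"] == negative_type:
            # Negative cases should decline to answer.
            if refusal_marker not in r["expected_output"].lower():
                issues.append(f"row {i}: negative case lacks a refusal answer")
            continue
        # Crude grounding check: numbers in the answer must occur in the context.
        context_numbers = set(NUMBER.findall(r["context"]))
        for number in NUMBER.findall(r["expected_output"]):
            if number not in context_numbers:
                issues.append(f"row {i}: number {number} not found in context")
    return issues
```

An empty result does not prove the dataset is sound; it only rules out the specific problems the heuristic looks for.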

Common Mistakes

Do NOT generate questions without reading the actual knowledge base first
Do NOT skip negative cases — testing "I don't know" is crucial for RAG
Do NOT use the same question pattern for every entry — diversify types
Do NOT forget to include the relevant context per row
Do NOT generate expected outputs that aren't actually in the knowledge base

Generate a synthetic evaluation dataset from your RAG knowledge base. Creates diverse Q&A pairs with expected answers and relevant context, ready for LangWatch experiments and platform import. Use when you need test data for your RAG pipeline.

3,203 stars

testing

Updated Apr 15, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add langwatch/langwatch generate-rag-dataset

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Status: Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Status: Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Status: Clean (95%)
Snyk (dep): Open source security scanning. Status: Skipped (50%)
Socket.dev: Supply chain security analysis. Status: Skipped (50%)
VirusTotal: Multi-engine malware detection. Status: Skipped (50%)
CrowdStrike: Advanced threat intelligence. Status: Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Status: Skipped (50%)
OWASP Dep-Check. Status: Skipped (50%)

Last scanned: Apr 24, 2026, 8:47 PM (1.9s, 1 file scanned)

SKILL.md

name: generate-rag-dataset
description: Generate a synthetic evaluation dataset from your RAG knowledge base. Creates diverse Q&A pairs with expected answers and relevant context, ready for LangWatch experiments and platform import. Use when you need test data for your RAG pipeline.
license: MIT
compatibility: Requires LangWatch SDK. Works with Claude Code and similar coding agents.
category: recipe

Related Skills

langwatch/tracing (development)

Add LangWatch tracing and observability to your code. Use for both onboarding (instrument an entire codebase) and targeted operations (add tracing to a specific function or module). Supports Python and TypeScript with all major frameworks.

3,203 stars, SKILL.md, updated Apr 15, 2026

langwatch/scenarios (tools)

Test your AI agent with simulation-based scenarios. Covers writing scenario test code (Scenario SDK), creating platform scenarios (CLI or MCP), and red teaming for security vulnerabilities. Auto-detects whether to use code or platform approach based on context.

3,203 stars, SKILL.md, updated Apr 15, 2026

langwatch/test-compliance (testing)

Test that your AI agent stays observational and doesn't give prescriptive advice in regulated domains (healthcare, finance, legal). Creates scenario tests for boundary enforcement and red team tests for adversarial probing. Use when your agent advises but must not prescribe.

3,203 stars, SKILL.md, updated Apr 15, 2026

langwatch/test-cli-usability (tools)

Write scenario tests that verify your CLI tool is usable by AI agents. Ensures commands work non-interactively, provide clear output, and don't hang on prompts. Use when you want to prove your CLI is agent-friendly.

3,203 stars, SKILL.md, updated Apr 15, 2026

Install manually

# Clone the repo
git clone https://github.com/langwatch/langwatch.git

# Copy into Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r langwatch/skills/recipes/generate-rag-dataset ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: langwatch/langwatch (3,203 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT