skills/generate-synthetic-data/SKILL.md
Create diverse synthetic test inputs for LLM pipeline evaluation using dimension-based tuple generation. Use when bootstrapping an eval dataset, when real user data is sparse, or when stress-testing specific failure hypotheses. Do NOT use when you already have 100+ representative real traces (use stratified sampling instead), or when the task is collecting production logs.
npx skillsauth add hamelsmu/evals-skills generate-synthetic-data
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Generate diverse, realistic test inputs that cover the failure space of an LLM pipeline.
Before generating synthetic data, identify where the pipeline is likely to fail. Ask the user about known failure-prone areas, review existing user feedback, or form hypotheses from available traces. Dimensions (Step 1) must target anticipated failures, not arbitrary variation.
Dimensions are axes of variation specific to your application. Choose dimensions based on where you expect failures.
Dimension 1: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Dimension 2: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Dimension 3: [Name] — [What it captures]
Values: [value_a, value_b, value_c, ...]
Example for a real estate assistant:
Feature: what task the user wants
Values: [property search, scheduling, email drafting]
Client Persona: who the user serves
Values: [first-time buyer, investor, luxury buyer]
Scenario Type: query clarity
Values: [well-specified, ambiguous, out-of-scope]
Start with 3 dimensions. Add more only if initial traces reveal failure patterns along new axes.
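As a rough sketch, the real estate dimensions above can be held in a plain mapping and enumerated to see how large the full combination space is. The names `DIMENSIONS` and `all_tuples` are illustrative assumptions, not part of the skill.

```python
from itertools import product

# Dimension names and values copied from the real estate example above.
DIMENSIONS = {
    "Feature": ["property search", "scheduling", "email drafting"],
    "Client Persona": ["first-time buyer", "investor", "luxury buyer"],
    "Scenario Type": ["well-specified", "ambiguous", "out-of-scope"],
}


def all_tuples(dimensions):
    """Enumerate every combination of dimension values as a tuple."""
    return list(product(*dimensions.values()))


combos = all_tuples(DIMENSIONS)
print(len(combos))  # 3 * 3 * 3 = 27 possible combinations
```

With three dimensions of three values each, the full space is small enough to review by hand. Each added dimension multiplies it, which is one practical reason to start with three.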
A tuple is one combination of dimension values defining a specific test case. Present 20 draft tuples to the user and iterate until they confirm the tuples reflect realistic scenarios. The user's domain knowledge is essential here — they know which combinations actually occur and which are unrealistic.
(Feature: Property Search, Persona: Investor, Scenario: Ambiguous)
(Feature: Scheduling, Persona: First-time Buyer, Scenario: Well-specified)
(Feature: Email Drafting, Persona: Luxury Buyer, Scenario: Out-of-scope)
Generate 10 random combinations of ({dim1}, {dim2}, {dim3})
for a {your application description}.
The dimensions are:
{dim1}: {description}. Possible values: {values}
{dim2}: {description}. Possible values: {values}
{dim3}: {description}. Possible values: {values}
Output each tuple in the format: ({dim1}, {dim2}, {dim3})
Avoid duplicates. Vary values across dimensions.
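One possible way to fill the template above and clean up the model's reply is sketched here. It makes no LLM call. `build_tuple_prompt` and `parse_tuples` are hypothetical helpers, and the parser assumes that no dimension value contains a comma or parentheses.

```python
import re


def build_tuple_prompt(app_description, dimensions, n=10):
    """Fill the tuple-generation template.

    `dimensions` maps a dimension name to (description, list of values).
    """
    fmt = "(" + ", ".join(dimensions) + ")"
    lines = [
        f"Generate {n} random combinations of {fmt}",
        f"for a {app_description}.",
        "The dimensions are:",
    ]
    for name, (description, values) in dimensions.items():
        lines.append(f"{name}: {description}. Possible values: {', '.join(values)}")
    lines.append(f"Output each tuple in the format: {fmt}")
    lines.append("Avoid duplicates. Vary values across dimensions.")
    return "\n".join(lines)


def parse_tuples(raw_output, dimensions):
    """Parse '(a, b, c)' groups, dropping duplicates and unknown values."""
    allowed = [{v.lower() for v in values} for _, values in dimensions.values()]
    seen, result = set(), []
    for match in re.finditer(r"\(([^()]*)\)", raw_output):
        # Accept both "Investor" and "Persona: Investor" style entries.
        parts = tuple(
            p.split(":", 1)[-1].strip().lower() for p in match.group(1).split(",")
        )
        if len(parts) != len(allowed):
            continue  # wrong number of fields
        if any(p not in ok for p, ok in zip(parts, allowed)):
            continue  # value not in the declared dimension
        if parts not in seen:
            seen.add(parts)
            result.append(parts)
    return result
```

Validating against the declared values catches tuples where the model invented a value, and the deduplication enforces the "avoid duplicates" instruction even when the model ignores it.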
Use a separate prompt to convert each tuple into a natural-language query. Single-step generation (tuples and queries together) produces repetitive phrasing.
We are generating synthetic user queries for a {your application}.
{Brief description of what it does.}
Given:
{dim1}: {value}
{dim2}: {value}
{dim3}: {value}
Write a realistic query that a user might enter. The query should
reflect the specified persona and scenario characteristics.
Example: "{one of your hand-written examples}"
Now generate a new query.
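The query template above could be filled once per tuple along these lines. This is a minimal sketch: `QUERY_TEMPLATE` mirrors the prompt text above, while `build_query_prompts` and its parameter names are assumptions for illustration.

```python
QUERY_TEMPLATE = """We are generating synthetic user queries for a {application}.
{description}
Given:
{assignments}
Write a realistic query that a user might enter. The query should
reflect the specified persona and scenario characteristics.
Example: "{example}"
Now generate a new query."""


def build_query_prompts(application, description, dim_names, tuples, example):
    """Return one query-generation prompt per tuple."""
    prompts = []
    for values in tuples:
        # One "name: value" line per dimension, in the declared order.
        assignments = "\n".join(f"{n}: {v}" for n, v in zip(dim_names, values))
        prompts.append(
            QUERY_TEMPLATE.format(
                application=application,
                description=description,
                assignments=assignments,
                example=example,
            )
        )
    return prompts
```

Sending each prompt as its own request keeps the tuple and query steps separate, so the phrasing of one query does not leak into the next.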
Review generated queries. Discard and regenerate when:
Optional: use an LLM to rate realism on a 1-5 scale and discard queries rated below 3.
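The optional realism filter might look like the sketch below. `rate` is a hypothetical callable, for example a wrapper around an LLM call, that returns an integer from 1 to 5. The default threshold of 3 follows the rule above.

```python
def filter_by_realism(queries, rate, threshold=3):
    """Split queries into (kept, discarded) using a 1-5 realism rating.

    `rate` is a hypothetical callable that returns an integer from 1 to 5
    for a query string; queries rated below `threshold` are discarded.
    """
    kept, discarded = [], []
    for query in queries:
        score = rate(query)
        if not 1 <= score <= 5:
            raise ValueError(f"rating out of range for {query!r}: {score}")
        (kept if score >= threshold else discarded).append(query)
    return kept, discarded
```

Returning the discarded queries as well makes it easy to see which tuples need regenerating.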
Execute all queries through the full LLM pipeline. Capture complete traces: input, all intermediate steps, tool calls, retrieved docs, final output.
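A trace record covering the fields listed above could be sketched as a small dataclass. The class name, field names, and JSON Lines output are assumptions. A real pipeline may already emit traces in its own format.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class Trace:
    """One end-to-end run of the pipeline for a single query."""

    input: str
    intermediate_steps: list = field(default_factory=list)
    tool_calls: list = field(default_factory=list)
    retrieved_docs: list = field(default_factory=list)
    final_output: str = ""


def to_jsonl(traces):
    """Serialize traces as JSON Lines, one trace per line."""
    return "\n".join(json.dumps(asdict(t)) for t in traces)
```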
Target: ~100 high-quality, diverse traces. This is a rough heuristic for reaching saturation (where new traces stop revealing new failure categories). The number depends on system complexity.
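One way to make the saturation heuristic concrete is to count how many consecutive recent traces revealed no new failure category. The sketch below assumes each reviewed trace has already been given a set of failure-category labels by hand. The function name and the labeling scheme are illustrative.

```python
def traces_since_last_new_category(labels_per_trace):
    """Count the trailing traces that added no new failure category.

    `labels_per_trace` is a list, in review order, of sets of
    failure-category labels assigned to each trace.
    """
    seen = set()
    streak = 0
    for labels in labels_per_trace:
        new = set(labels) - seen
        if new:
            seen |= new
            streak = 0  # a new category resets the streak
        else:
            streak += 1
    return streak
```

A long streak suggests that further traces are unlikely to surface new categories. How long is long enough depends on system complexity.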
When you have real queries available, don't sample randomly. Use stratified sampling:
When both real and synthetic data are available, use synthetic data to fill gaps in underrepresented query types.
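Stratified sampling and gap detection can be sketched as follows. The stratification key, here a caller-supplied function such as one that returns a query's feature, is an assumption. Both helper names are hypothetical.

```python
import random
from collections import defaultdict


def stratified_sample(items, key, per_stratum, seed=0):
    """Sample up to `per_stratum` items from each group defined by `key`."""
    rng = random.Random(seed)  # fixed seed keeps the sample reproducible
    groups = defaultdict(list)
    for item in items:
        groups[key(item)].append(item)
    sample = {}
    for stratum, members in groups.items():
        k = min(per_stratum, len(members))
        sample[stratum] = rng.sample(members, k)
    return sample


def underrepresented(sample, expected_strata, minimum):
    """List strata with fewer than `minimum` sampled items, including absent ones."""
    return [s for s in expected_strata if len(sample.get(s, [])) < minimum]
```

The strata returned by `underrepresented` are the query types where synthetic tuples can fill the gap left by real data.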