latestaiagents/eval-dataset-design

Name: eval-dataset-design
Author: latestaiagents

skills/evals/eval-dataset-design/SKILL.md

npx skillsauth add latestaiagents/agent-skills eval-dataset-design

Eval Dataset Design

Your evals are only as good as the dataset they run on. Miss a user scenario and you'll never catch regressions on it.

When to Use

Starting an eval program from zero
Your evals pass but users still hit issues → coverage gap
Labels are inconsistent across reviewers → quality problem
Adding evals for a new feature or domain

Dataset Properties Worth Optimizing

Coverage — representative of real user queries
Difficulty distribution — mix of easy/medium/hard, not all easy
Label consistency — two humans agree on the label
Stability — same inputs → same evaluable outputs over time
Uncontaminated — not in the model's training data

Sourcing Inputs

Best to worst:

Real user queries (anonymized) — highest signal
Synthetic queries generated from real templates — fills gaps
Adversarial queries hand-crafted for known failure modes
Existing benchmarks — context, but often contaminated and dated

A good eval set mixes all four. Typical split: 60% real, 20% synthetic, 15% adversarial, 5% benchmark.

Stratification

Split your dataset by categories that matter:

dataset:
  categories:
    simple_qa: 100             # easy, high-frequency
    multi_step_reasoning: 50   # medium
    ambiguous_queries: 30      # hard
    edge_cases: 20             # adversarial
    rare_domains: 20           # coverage of long tail

Report metrics per stratum, not just the aggregate. A model can improve on average while regressing on edge cases — you'll only see it stratified.
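A minimal sketch of per-stratum reporting follows. The function names and the `(category, passed)` result shape are assumptions for illustration, not part of any particular eval harness, and the data is made up to show an aggregate that improves while one stratum regresses.

```python
from collections import defaultdict


def pass_rate_by_stratum(results):
    """results: iterable of (category, passed) pairs -> {category: pass rate}."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    return {c: passes[c] / totals[c] for c in totals}


def regressions(before, after, tolerance=0.0):
    """Categories whose pass rate dropped by more than `tolerance`."""
    return sorted(
        c for c in before
        if c in after and before[c] - after[c] > tolerance
    )


# Hypothetical results for two model versions.
old = ([("simple_qa", True)] * 8 + [("simple_qa", False)] * 2
       + [("edge_cases", True)] * 3 + [("edge_cases", False)] * 1)
new = ([("simple_qa", True)] * 10
       + [("edge_cases", True)] * 2 + [("edge_cases", False)] * 2)

old_rates = pass_rate_by_stratum(old)
new_rates = pass_rate_by_stratum(new)

# The aggregate rises from 11/14 to 12/14, yet edge_cases falls from 3/4 to 2/4.
print(regressions(old_rates, new_rates))
```

A CI job could fail the run whenever `regressions` returns a non-empty list, even if the overall score went up.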

Labeling Quality

Have two people label the same 50 items independently, then compute inter-annotator agreement:

from sklearn.metrics import cohen_kappa_score
# labeler_a and labeler_b: equal-length label sequences for the same items, in the same order
kappa = cohen_kappa_score(labeler_a, labeler_b)

Target:

κ > 0.8: excellent, labels are reliable
κ 0.6-0.8: good, some ambiguity
κ < 0.6: rewrite your labeling rubric — humans can't agree, so neither can models

Resolve disagreements with a tiebreaker, then update the rubric based on what caused disagreement.
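For intuition about what `cohen_kappa_score` computes, here is a dependency-free sketch of the standard two-rater formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The `interpret` helper and its return strings are hypothetical; they only encode the thresholds listed above, treating exactly 0.8 as part of the middle band.

```python
from collections import Counter


def cohen_kappa(a, b):
    """Cohen's kappa for two equal-length label sequences."""
    if len(a) != len(b) or not a:
        raise ValueError("need two non-empty sequences of equal length")
    n = len(a)
    observed = sum(x == y for x, y in zip(a, b)) / n
    counts_a, counts_b = Counter(a), Counter(b)
    # Chance agreement: probability both raters pick the same label independently.
    expected = sum(counts_a[label] * counts_b[label] for label in counts_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both raters used one identical label for every item
    return (observed - expected) / (1 - expected)


def interpret(kappa):
    """Map kappa to the action suggested by the thresholds above."""
    if kappa > 0.8:
        return "reliable"
    if kappa >= 0.6:
        return "some ambiguity"
    return "rewrite rubric"


# Hypothetical labels: the raters disagree on one item out of ten.
labeler_a = ["yes"] * 5 + ["no"] * 5
labeler_b = ["yes"] * 4 + ["no"] * 6
kappa = cohen_kappa(labeler_a, labeler_b)
print(round(kappa, 3), interpret(kappa))
```

Note that 90% raw agreement here yields a noticeably lower kappa, because half of that agreement would be expected by chance with two balanced labels.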

Labeling Rubric

Write explicit guidelines with examples:

### Label: helpful

**Definition**: Response addresses the user's question directly and accurately.

**Examples**:
- Query: "How do I loop in Python?" / Response: Shows `for` loop → YES
- Query: "How do I loop in Python?" / Response: General loop theory → NO (dodges the specific language)
- Query: "Fix this bug" / Response: Points out the bug + fix → YES
- Query: "Fix this bug" / Response: "I'll need more info" (bug is in the code) → NO

If two labelers disagree, add their disputed case as a rubric example.

Contamination

If your eval is in the model's training data, scores are inflated. Check:

Search public datasets, GitHub, and the web for each query, verbatim or by normalized hash; these are the common leak sources
Give the model only the opening of an eval query; if it completes the rest or produces the expected answer, the item is likely memorized
Paraphrase suspect queries so that near-matches in the training data no longer match

For production evals, rotate the dataset yearly and keep a private held-out set.
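One way to sketch the search step locally, assuming you already have some public text to compare against. The names `normalize`, `fingerprint`, and `ngram_overlap` are illustrative, and the 5-token window is an arbitrary assumption to tune for your data.

```python
import hashlib
import re


def normalize(text):
    """Lowercase and keep only alphanumeric tokens, so trivial edits do not hide a match."""
    return " ".join(re.findall(r"[a-z0-9]+", text.lower()))


def fingerprint(text):
    """SHA-256 of the normalized text; equal fingerprints mean an exact match after normalization."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def ngram_overlap(query, document, n=5):
    """Fraction of the query's token n-grams that also appear in the document."""
    def ngrams(text):
        tokens = normalize(text).split()
        return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}

    query_grams = ngrams(query)
    if not query_grams:
        return 0.0
    return len(query_grams & ngrams(document)) / len(query_grams)


# Hypothetical eval query, a public page that contains it, and a paraphrase.
query = "How do I reset my password if I lost my phone?"
public_page = "FAQ: how do I reset my password if I lost my phone? Contact support."
paraphrase = "I lost my phone, what is the way to reset my password?"

print(ngram_overlap(query, public_page))
print(ngram_overlap(paraphrase, public_page))
```

Hashing only catches exact matches after normalization; the n-gram overlap is what surfaces partial copies, and it is also a quick check that a paraphrase actually moved the query away from the public text.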

Difficulty Calibration

Track difficulty via model pass rate:

< 30% pass: too hard; real model improvements may not register
30-80% pass: useful range
80-95% pass: still informative, but headroom is shrinking
> 95% pass: too easy; the dataset has saturated

Prune items that reach 100% for several consecutive model generations — they no longer discriminate.
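The pruning rule can be sketched as below. The `pass_history` shape (item id mapped to one pass rate per model generation, oldest first) and the default of three generations for "several" are assumptions.

```python
def saturated_items(pass_history, generations=3):
    """Return ids of items that passed 100% in each of the last `generations` runs."""
    return sorted(
        item_id
        for item_id, rates in pass_history.items()
        if len(rates) >= generations
        and all(rate == 1.0 for rate in rates[-generations:])
    )


# Hypothetical pass rates per model generation, oldest first.
history = {
    "q1": [0.9, 1.0, 1.0, 1.0],  # saturated for the last three generations
    "q2": [1.0, 1.0, 0.8],       # regressed on the newest model, keep it
    "q3": [1.0, 1.0],            # not enough history yet
}

print(saturated_items(history))
```

Archiving pruned items instead of deleting them keeps older scores reproducible.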

Synthetic Generation

When you need more coverage:

import Anthropic from "@anthropic-ai/sdk";

// The client reads ANTHROPIC_API_KEY from the environment by default.
const client = new Anthropic();

const prompt = `Generate 20 diverse user queries that a customer support bot might receive.
Cover: billing (5), technical issues (5), account access (5), general FAQ (5).
Vary wording: formal, casual, angry, confused.
Return JSON array.`;

const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 4000,
  messages: [{ role: "user", content: prompt }],
});

Then:

Human-review every synthetic query for realism
Label them the same way as real queries
Track whether synthetic vs real have different score profiles — a red flag if they diverge
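Before the human review, a cheap automated pass can reject malformed or duplicated output. The sketch below is in Python and assumes the model's reply text is a bare JSON array of strings; real replies may wrap the array in prose or code fences, which this does not handle. `parse_queries` is a hypothetical helper name.

```python
import json


def parse_queries(raw, expected=20):
    """Parse a JSON array of query strings, dropping blanks and case-insensitive duplicates.

    Returns (queries, count_ok), where count_ok says whether exactly
    `expected` usable queries remain.
    """
    data = json.loads(raw)
    if not isinstance(data, list):
        raise ValueError("expected a JSON array")
    seen = set()
    queries = []
    for item in data:
        if not isinstance(item, str):
            continue  # skip anything that is not a plain string
        text = item.strip()
        key = text.casefold()
        if text and key not in seen:
            seen.add(key)
            queries.append(text)
    return queries, len(queries) == expected


# Hypothetical reply containing a duplicate, a non-string, and a blank entry.
raw_reply = '["Where is my invoice?", "where is my invoice? ", "I cannot log in", 3, ""]'
queries, count_ok = parse_queries(raw_reply, expected=2)
print(queries, count_ok)
```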

Size

Smoke test: 20-50 items, run in CI
Regression set: 200-500 items, run weekly
Full eval: 1000-5000 items, run per major release
Beyond that: use stratified sampling rather than adding volume

Quality > quantity. 200 well-labeled items beat 5000 noisy ones.
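A smaller tier can be drawn from a larger one by sampling each category in proportion to its size. This is a sketch under assumptions: items are dicts with a `category` key, every category keeps at least one item, and because of rounding the result is only approximately `size` items. The example reuses the category counts from the Stratification section.

```python
import random
from collections import Counter, defaultdict


def stratified_sample(items, size, seed=0):
    """Sample roughly `size` items, proportionally per category, at least one each."""
    by_category = defaultdict(list)
    for item in items:
        by_category[item["category"]].append(item)
    rng = random.Random(seed)  # fixed seed so the subset is reproducible
    sample = []
    for category in sorted(by_category):
        group = by_category[category]
        k = max(1, round(size * len(group) / len(items)))
        sample.extend(rng.sample(group, min(k, len(group))))
    return sample


counts = {
    "simple_qa": 100,
    "multi_step_reasoning": 50,
    "ambiguous_queries": 30,
    "edge_cases": 20,
    "rare_domains": 20,
}
items = [
    {"id": f"{category}-{i}", "category": category}
    for category, n in counts.items()
    for i in range(n)
]

smoke = stratified_sample(items, size=44)
print(Counter(item["category"] for item in smoke))
```

Fixing the seed keeps the smoke set stable between CI runs, so a failing item today is the same item that ran yesterday.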

Versioning

Treat datasets like code:

evals/
  customer_support/
    v1/
      dataset.jsonl
      rubric.md
      CHANGELOG.md
    v2/
      dataset.jsonl
      rubric.md
      CHANGELOG.md

Never silently edit. Version bumps communicate "scores before v2 are not comparable to scores after".
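One way to make silent edits detectable is to record a content digest for each version and compare it in CI. The sketch below is an assumption about how that could look, not something the layout above requires: `dataset_digest` is a hypothetical helper, and it deliberately ignores record order and key order so that only real content changes alter the digest.

```python
import hashlib
import json


def dataset_digest(records):
    """Order-insensitive SHA-256 over a list of JSON-serializable records."""
    lines = sorted(
        json.dumps(record, sort_keys=True, ensure_ascii=False)
        for record in records
    )
    digest = hashlib.sha256()
    for line in lines:
        digest.update(line.encode("utf-8"))
        digest.update(b"\n")
    return digest.hexdigest()


# Hypothetical records, as they might be read from dataset.jsonl.
v1 = [
    {"id": 1, "query": "Fix this bug", "label": "helpful"},
    {"id": 2, "query": "How do I loop in Python?", "label": "helpful"},
]
reordered = list(reversed(v1))
edited = [dict(v1[0], label="not_helpful"), v1[1]]

print(dataset_digest(v1) == dataset_digest(reordered))
print(dataset_digest(v1) == dataset_digest(edited))
```

Storing the digest next to the version, for example in its CHANGELOG.md, gives reviewers a quick check that `v1/dataset.jsonl` is still the file the historical scores were computed on.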

Private Held-Out Set

Keep 100-200 items that are never published and never used for prompt iteration. Use them only for:

Measuring generalization on unseen examples
Catching overfitting to your public eval

Rotate a fraction yearly.
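If the held-out items are carved out of a larger pool, a deterministic hash of the item id keeps the assignment stable across runs and machines. This is a sketch with assumed names: `is_held_out`, the `salt` value, and the 10% default are all illustrative. Changing the salt reshuffles the whole assignment, so it is not by itself a way to rotate only a fraction.

```python
import hashlib


def is_held_out(item_id, fraction=0.1, salt="heldout-v1"):
    """Deterministically assign about `fraction` of item ids to the held-out set."""
    digest = hashlib.sha256(f"{salt}:{item_id}".encode("utf-8")).digest()
    bucket = int.from_bytes(digest[:8], "big")  # uniform integer in [0, 2**64)
    return bucket < int(fraction * 2**64)


# Hypothetical item ids.
ids = [f"item-{i}" for i in range(10000)]
held_out = [item_id for item_id in ids if is_held_out(item_id)]
print(len(held_out) / len(ids))
```

Because membership depends only on the id and the salt, nobody needs to keep a separate list of held-out ids in the public repository.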

Anti-Patterns

Evals that only cover easy cases — model passes your eval, fails in prod
Single labeler — no way to know if labels are noisy
No versioning — silent edits invalidate historical trends
Overfit to eval — you tuned prompts against the eval set; now it doesn't generalize
All real or all synthetic — synthetic misses distribution, real misses edge cases

Best Practices

Mix real + synthetic + adversarial; stratify and report per category
Label with a written rubric; measure inter-annotator kappa
Check for training-data contamination; paraphrase suspicious queries
Track item-level difficulty; prune items that hit 100% pass
Version datasets like code; publish CHANGELOGs
Keep a private held-out set for generalization checks
Quality > quantity; 200 good labels beat 5000 rushed ones

Design eval datasets that actually measure model quality — coverage, difficulty distribution, labeling consistency, and avoiding contamination. Covers sourcing, stratification, label quality, and when to generate vs curate. Use this skill when building a new eval set, realizing your current evals don't catch regressions, or labeling is inconsistent. Activate when: eval dataset, benchmark, test set, eval coverage, label quality, synthetic eval, dataset design.

2 stars

development

Updated Apr 23, 2026

npx skillsauth add latestaiagents/agent-skills eval-dataset-design

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 24, 2026, 2:55 AM; 7.3s; 1 file scanned

Install manually

# Clone the repo
git clone https://github.com/latestaiagents/agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r agent-skills/skills/evals/eval-dataset-design ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

latestaiagents/agent-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT