skills/evals/eval-dataset-design/SKILL.md
Design eval datasets that actually measure model quality — coverage, difficulty distribution, labeling consistency, and avoiding contamination. Covers sourcing, stratification, label quality, and when to generate vs curate. Use this skill when building a new eval set, realizing your current evals don't catch regressions, or labeling is inconsistent. Activate when: eval dataset, benchmark, test set, eval coverage, label quality, synthetic eval, dataset design.
npx skillsauth add latestaiagents/agent-skills eval-dataset-design
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Your evals are only as good as the dataset they run on. Miss a user scenario and you'll never catch regressions on it.
Best to worst:
A good eval set mixes all four. Typical split: 60% real, 20% synthetic, 15% adversarial, 5% benchmark.
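To turn the typical split into concrete item counts for a given dataset size, a minimal sketch follows. The function name `source_counts` and the choice to hand any rounding remainder to the largest share are illustrative assumptions, not part of the skill.

```python
# Shares are whole percentages to avoid floating-point rounding surprises.
TYPICAL_SPLIT = {"real": 60, "synthetic": 20, "adversarial": 15, "benchmark": 5}


def source_counts(total: int, split_pct: dict[str, int]) -> dict[str, int]:
    """Convert percentage shares into item counts that sum to `total`."""
    if sum(split_pct.values()) != 100:
        raise ValueError("shares must sum to 100")
    counts = {name: total * pct // 100 for name, pct in split_pct.items()}
    # Assumption: any items lost to integer division go to the largest share.
    largest = max(split_pct, key=split_pct.get)
    counts[largest] += total - sum(counts.values())
    return counts


counts = source_counts(220, TYPICAL_SPLIT)
print(counts)
```

For a 220-item set this gives 132 real, 44 synthetic, 33 adversarial, and 11 benchmark items.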
Split your dataset by categories that matter:
```yaml
dataset:
  categories:
    simple_qa: 100             # samples; easy, high-frequency
    multi_step_reasoning: 50   # medium
    ambiguous_queries: 30      # hard
    edge_cases: 20             # adversarial
    rare_domains: 20           # coverage of long tail
```
Report metrics per stratum, not just the aggregate. A model can improve on average while regressing on edge cases — you'll only see it stratified.
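A small sketch of stratified reporting, assuming each eval result is recorded as a `(category, passed)` pair. The helper names and the toy numbers are made up for illustration; they show how an aggregate can rise while one stratum falls.

```python
from collections import defaultdict


def pass_rates(results):
    """Return (aggregate pass rate, {category: pass rate}) for (category, passed) pairs."""
    totals = defaultdict(int)
    passes = defaultdict(int)
    for category, passed in results:
        totals[category] += 1
        passes[category] += int(passed)
    per_category = {c: passes[c] / totals[c] for c in totals}
    aggregate = sum(passes.values()) / sum(totals.values())
    return aggregate, per_category


def make_run(simple_passes, edge_passes):
    """Toy run: 100 simple_qa items and 20 edge_cases items (invented data)."""
    simple = [("simple_qa", i < simple_passes) for i in range(100)]
    edge = [("edge_cases", i < edge_passes) for i in range(20)]
    return simple + edge


old_agg, old_by_cat = pass_rates(make_run(80, 16))
new_agg, new_by_cat = pass_rates(make_run(90, 10))

print(f"aggregate:  {old_agg:.2f} -> {new_agg:.2f}")
print(f"edge_cases: {old_by_cat['edge_cases']:.2f} -> {new_by_cat['edge_cases']:.2f}")
```

In this toy data the aggregate moves from 0.80 to 0.83 while `edge_cases` drops from 0.80 to 0.50, a regression the aggregate alone would hide.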
Two people label the same 50 items independently. Compute inter-annotator agreement:
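```python
from sklearn.metrics import cohen_kappa_score

# labeler_a, labeler_b: equal-length label sequences for the same items, in the same order
kappa = cohen_kappa_score(labeler_a, labeler_b)
```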
Target:
Resolve disagreements with a tiebreaker, then update the rubric based on what caused disagreement.
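To find the items that need a tiebreaker, one option is to list every item where the two labelers differ. This is a sketch under the assumption that labels are stored as parallel lists keyed by item ID; `disagreements` is a hypothetical helper name.

```python
def disagreements(item_ids, labels_a, labels_b):
    """Return (item_id, label_a, label_b) for every item the two labelers disagree on."""
    if not (len(item_ids) == len(labels_a) == len(labels_b)):
        raise ValueError("item_ids and both label lists must have the same length")
    return [
        (item_id, a, b)
        for item_id, a, b in zip(item_ids, labels_a, labels_b)
        if a != b
    ]


# Invented example labels for four items.
ids = ["q1", "q2", "q3", "q4"]
labeler_a = ["helpful", "helpful", "not_helpful", "helpful"]
labeler_b = ["helpful", "not_helpful", "not_helpful", "not_helpful"]

disputed = disagreements(ids, labeler_a, labeler_b)
for item_id, a, b in disputed:
    print(f"{item_id}: A={a} B={b}")
```

Each disputed item is a candidate for the tiebreaker and, once settled, for a new rubric example.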
Write explicit guidelines with examples:
```markdown
### Label: helpful
**Definition**: Response addresses the user's question directly and accurately.
**Examples**:
- Query: "How do I loop in Python?" / Response: Shows `for` loop → YES
- Query: "How do I loop in Python?" / Response: General loop theory → NO (dodges the specific language)
- Query: "Fix this bug" / Response: Points out the bug + fix → YES
- Query: "Fix this bug" / Response: "I'll need more info" (bug is in the code) → NO
```
If two labelers disagree, add their disputed case as a rubric example.
If your eval is in the model's training data, scores are inflated. Check:
For production evals, rotate the dataset yearly and keep a private held-out set.
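One common way to probe for contamination is word n-gram overlap between an eval item and text you suspect the model has seen. The sketch below is an assumption about how such a check could look, not a method the skill prescribes; the function names, the whitespace tokenization, and the default `n=8` are all illustrative.

```python
def ngrams(text: str, n: int) -> set:
    """Set of lowercase word n-grams in `text` (whitespace tokenization)."""
    tokens = text.lower().split()
    return {tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)}


def overlap_fraction(eval_item: str, corpus_text: str, n: int = 8) -> float:
    """Fraction of the eval item's n-grams that also appear in `corpus_text`."""
    item_grams = ngrams(eval_item, n)
    if not item_grams:
        return 0.0  # item shorter than n words: nothing to compare
    return len(item_grams & ngrams(corpus_text, n)) / len(item_grams)


# Invented strings; n=3 keeps the toy example short.
item = "the quick brown fox jumps"
leaked = overlap_fraction(item, "see the quick brown fox jumps over", n=3)
clean = overlap_fraction(item, "an entirely unrelated sentence about databases", n=3)
print(leaked, clean)
```

A high fraction suggests the item appears near-verbatim in the reference text. A low fraction does not prove the item is clean, since paraphrased leaks share few exact n-grams.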
Track difficulty via model pass rate:
- Below 30% pass: too hard; models improve but you can't measure it
- 30-80% pass: useful range
- Above 95% pass: too easy; dataset has plateaued

Prune items that reach 100% for several consecutive model generations; they no longer discriminate.
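The pass-rate bands and the pruning rule can be sketched as two small functions. The source does not say how to treat the 80-95% range or how many generations count as "several", so the `"unspecified"` band and the default of three generations are assumptions.

```python
def difficulty_band(pass_rate: float) -> str:
    """Map a pass rate in [0, 1] to the bands described above."""
    if pass_rate < 0.30:
        return "too_hard"
    if pass_rate <= 0.80:
        return "useful"
    if pass_rate > 0.95:
        return "too_easy"
    return "unspecified"  # 80-95%: the source gives no guidance for this range


def prunable(history: list[float], min_generations: int = 3) -> bool:
    """True if an item passed 100% of the time for the last `min_generations` model generations.

    `history` holds the item's pass rate per model generation, oldest first.
    Assumption: "several" is read as three.
    """
    recent = history[-min_generations:]
    return len(recent) == min_generations and all(rate >= 1.0 for rate in recent)


print(difficulty_band(0.55))
print(prunable([0.9, 1.0, 1.0, 1.0]))
```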
When you need more coverage:
```typescript
import Anthropic from "@anthropic-ai/sdk";

// Reads ANTHROPIC_API_KEY from the environment by default
const client = new Anthropic();

const prompt = `Generate 20 diverse user queries that a customer support bot might receive.
Cover: billing (5), technical issues (5), account access (5), general FAQ (5).
Vary wording: formal, casual, angry, confused.
Return JSON array.`;

const response = await client.messages.create({
  model: "claude-opus-4-6",
  max_tokens: 4000,
  messages: [{ role: "user", content: prompt }],
});
```
Then:
Quality > quantity. 200 well-labeled items beat 5000 noisy ones.
Treat datasets like code:
```
evals/
  customer_support/
    v1/
      dataset.jsonl
      rubric.md
      CHANGELOG.md
    v2/
      dataset.jsonl
      rubric.md
      CHANGELOG.md
```
Never silently edit. Version bumps communicate "scores before v2 are not comparable to scores after".
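One way to enforce that rule in tooling is to tag every score with the dataset version it was measured on and refuse to diff scores across versions. This is a minimal sketch; `EvalScore` and `score_delta` are hypothetical names, and the scores are invented.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class EvalScore:
    dataset: str   # e.g. "customer_support"
    version: str   # e.g. "v1"
    score: float


def score_delta(before: EvalScore, after: EvalScore) -> float:
    """Return after - before, but only when both scores come from the same dataset version."""
    if (before.dataset, before.version) != (after.dataset, after.version):
        raise ValueError(
            f"not comparable: {before.dataset}/{before.version} "
            f"vs {after.dataset}/{after.version}"
        )
    return after.score - before.score


baseline = EvalScore("customer_support", "v1", 0.81)
candidate = EvalScore("customer_support", "v1", 0.84)
rerun_on_v2 = EvalScore("customer_support", "v2", 0.79)

print(f"{score_delta(baseline, candidate):+.2f}")
# score_delta(baseline, rerun_on_v2) raises ValueError: v1 and v2 scores are not comparable.
```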
Keep 100-200 items never published, never used for prompt iteration. Only for:
Rotate a fraction yearly.