skills/evals/regression-evals/SKILL.md
Set up continuous regression evals so model/prompt/tool changes don't silently break existing behavior. Covers gating thresholds, CI integration, statistical significance, and response to regressions. Use this skill when deploying prompts to production, gating model upgrades, or noticing "it worked yesterday" in AI features. Activate when: regression eval, eval CI, prompt regression, model upgrade gate, eval threshold, eval alert.
Every prompt change, model upgrade, or tool tweak is a potential regression. Regression evals catch breakage before users do — if you gate deploys on them.
1. Maintain a versioned eval dataset (see eval-dataset-design)
2. On every proposed change, run evals on baseline AND candidate
3. Compare per-stratum metrics with significance tests
4. Gate merge on: no stratum regresses beyond threshold
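```typescript
interface EvalCase { id: string; input: any; expected: any; stratum: string }
interface EvalResult { id: string; stratum: string; score: number }

// runModel, scoreOutput, and compareByStratum are project-specific helpers defined elsewhere.
async function runEvals(model: string, prompt: string, cases: EvalCase[]): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const c of cases) {
    const output = await runModel(model, prompt, c.input);
    const score = await scoreOutput(c.expected, output); // 0-1
    results.push({ id: c.id, stratum: c.stratum, score });
  }
  return results;
}

// Same model and same cases on both sides, so only the prompt differs.
const baseline = await runEvals("claude-sonnet-4-6", oldPrompt, cases);
const candidate = await runEvals("claude-sonnet-4-6", newPrompt, cases);

// Fail the run if any stratum regressed at the 5% significance level.
const regressed = compareByStratum(baseline, candidate, { alpha: 0.05 });
if (regressed.length) throw new Error(`Regressions in: ${regressed.join(", ")}`);
```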
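The snippet above leaves `compareByStratum` undefined. As a rough sketch of what such a helper might do (written in Python; the function name, record shape, `min_drop`, and the bootstrap-based p-value are all assumptions, not the skill's actual implementation): pair results by case id, group the paired differences by stratum, and flag strata whose mean drop is both large enough and unlikely to be noise.

```python
import random
from collections import defaultdict

def compare_by_stratum(baseline, candidate, alpha=0.05, min_drop=0.03,
                       n_boot=2000, seed=0):
    """Return the strata where the candidate looks significantly worse.

    baseline / candidate: lists of {"id", "stratum", "score"} dicts that
    cover the same case ids (hypothetical record shape).
    """
    rng = random.Random(seed)
    cand_by_id = {r["id"]: r["score"] for r in candidate}

    # Paired difference per case: positive means the candidate scored lower.
    diffs = defaultdict(list)
    for r in baseline:
        diffs[r["stratum"]].append(r["score"] - cand_by_id[r["id"]])

    regressed = []
    for stratum, d in diffs.items():
        n = len(d)
        mean_drop = sum(d) / n
        # Approximate one-sided p-value: the share of bootstrap resamples
        # in which the mean drop is zero or negative.
        not_worse = 0
        for _ in range(n_boot):
            sample_mean = sum(rng.choice(d) for _ in range(n)) / n
            if sample_mean <= 0:
                not_worse += 1
        p_value = not_worse / n_boot
        if mean_drop >= min_drop and p_value < alpha:
            regressed.append(stratum)
    return sorted(regressed)
```

Here `min_drop=0.03` reads a "3% drop" as 0.03 absolute score points on a 0-1 scale, which is one possible interpretation; a relative threshold would divide by the baseline mean instead.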
A 2% drop on 100 items is two cases, which is usually noise, not signal. Use bootstrap confidence intervals to tell the difference:
```python
import numpy as np

def bootstrap_ci(scores, n=1000, alpha=0.05):
    # Resample with replacement n times and take the percentile interval of the means.
    means = [np.mean(np.random.choice(scores, size=len(scores), replace=True)) for _ in range(n)]
    return np.percentile(means, [100*alpha/2, 100*(1-alpha/2)])

baseline_ci = bootstrap_ci(baseline_scores)
candidate_ci = bootstrap_ci(candidate_scores)
# If the CIs don't overlap AND the candidate is lower, it's a real regression.
```
For pass/fail metrics, use McNemar's test on the paired outcomes. For scalar scores, use a paired bootstrap.
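Both paired tests can be sketched with the standard library alone. The function names and defaults below are illustrative assumptions. `mcnemar_exact` uses the exact binomial form of McNemar's test, which only looks at the discordant pairs (cases where baseline and candidate disagree). `paired_bootstrap_ci` resamples the per-case differences rather than each side independently.

```python
import math
import random

def mcnemar_exact(baseline_pass, candidate_pass):
    """Exact two-sided McNemar test on paired pass/fail outcomes.

    Returns (b, c, p_value), where b counts pass -> fail cases and
    c counts fail -> pass cases.
    """
    pairs = list(zip(baseline_pass, candidate_pass))
    b = sum(1 for x, y in pairs if x and not y)
    c = sum(1 for x, y in pairs if not x and y)
    n = b + c
    if n == 0:
        return b, c, 1.0  # no discordant pairs, so no evidence of change
    # Under the null, each discordant pair is a fair coin flip.
    tail = sum(math.comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return b, c, min(1.0, 2 * tail)

def paired_bootstrap_ci(baseline_scores, candidate_scores,
                        n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean of (candidate - baseline) per case."""
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    n = len(diffs)
    means = sorted(
        sum(rng.choice(diffs) for _ in range(n)) / n for _ in range(n_boot)
    )
    lo = means[int(n_boot * alpha / 2)]
    hi = means[int(n_boot * (1 - alpha / 2)) - 1]
    return lo, hi
```

If the whole paired interval sits below zero, the candidate is worse on the same cases. Because per-case noise cancels out in the differences, this is generally more sensitive than checking whether two independent intervals overlap.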
Hard rules:
| Metric | Regression threshold | Action |
|---|---|---|
| Any stratum | ≥ 3% drop with p < 0.05 | Block merge |
| Aggregate | ≥ 5% drop with p < 0.05 | Block merge |
| Single catastrophic item | New case drops from pass→fail | Investigate, likely block |
| Variance | CI widens significantly | Investigate (noisier outputs) |
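The first three rows of the table translate fairly directly into a gate function. The sketch below is one possible encoding, not a prescribed one: the dict keys, the function name, and the choice to read the percentages as absolute points on a 0-1 score scale are assumptions. The variance row is left out because "widens significantly" needs a project-specific definition.

```python
def gate(strata, aggregate_drop, aggregate_p, alpha=0.05):
    """Apply the hard rules to per-stratum and aggregate comparison results.

    strata: list of dicts with hypothetical keys
      "name", "drop" (baseline mean minus candidate mean), "p_value",
      and "new_failures" (count of cases that went pass -> fail).
    """
    block, investigate = [], []

    # Aggregate rule: >= 5% drop with p < alpha blocks the merge.
    if aggregate_drop >= 0.05 and aggregate_p < alpha:
        block.append("aggregate")

    for s in strata:
        # Stratum rule: >= 3% drop with p < alpha blocks the merge.
        if s["drop"] >= 0.03 and s["p_value"] < alpha:
            block.append(s["name"])
        # Catastrophic-item rule: any new pass -> fail case needs a human look.
        elif s.get("new_failures", 0) > 0:
            investigate.append(s["name"])

    return {"block": block, "investigate": investigate}
```

A CI step could exit non-zero when `block` is non-empty and surface `investigate` as a warning in the report.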
Soft rules:
```yaml
# .github/workflows/eval.yml
name: regression-evals
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npm run evals:baseline # fetch cached baseline scores
      - run: npm run evals:candidate
      - run: npm run evals:compare
      - uses: actions/upload-artifact@v4
        with:
          name: eval-report
          path: eval-report.html
```
Publish the report as a PR comment. Reviewers should see which strata changed and by how much.
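One lightweight way to produce that comment is to render the per-stratum comparison as a Markdown table, worst regressions first. The function name and row shape below are hypothetical; the rendered text could then be posted with whatever mechanism your CI already uses (for example, the GitHub CLI's `gh pr comment --body-file`).

```python
def render_report(rows):
    """Render per-stratum results as a Markdown table.

    rows: list of (stratum, baseline_mean, candidate_mean, p_value) tuples
    (an assumed shape). Rows are sorted so the largest drops come first.
    """
    lines = [
        "| Stratum | Baseline | Candidate | Delta | p |",
        "|---|---|---|---|---|",
    ]
    for stratum, base, cand, p in sorted(rows, key=lambda r: r[2] - r[1]):
        delta = cand - base
        lines.append(
            f"| {stratum} | {base:.3f} | {cand:.3f} | {delta:+.3f} | {p:.3f} |"
        )
    return "\n".join(lines)
```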
When Anthropic ships a new model, don't just swap it into production. Gate:
Your "baseline" needs to be a real, stored artifact — not "whatever main is today".
Store baselines in an artifact store keyed on (prompt version, model, dataset version):
```
evals/baselines/
  prompt-v42_claude-sonnet-4-6_dataset-v3.json
```
When you merge a change, UPDATE the baseline. The new baseline becomes the reference for the next candidate.
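A minimal sketch of such a store on the local filesystem, assuming the file-naming pattern shown above. The helper names are hypothetical, and a real setup might use a CI cache or object storage instead.

```python
import json
from pathlib import Path

def baseline_path(root, prompt_version, model, dataset_version):
    # The key is (prompt version, model, dataset version), encoded in the filename.
    name = f"prompt-{prompt_version}_{model}_dataset-{dataset_version}.json"
    return Path(root) / name

def save_baseline(root, prompt_version, model, dataset_version, results):
    """Write eval results as the new baseline, e.g. right after a merge."""
    path = baseline_path(root, prompt_version, model, dataset_version)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2))
    return path

def load_baseline(root, prompt_version, model, dataset_version):
    """Load a stored baseline; fail loudly rather than fall back to 'main'."""
    path = baseline_path(root, prompt_version, model, dataset_version)
    if not path.exists():
        raise FileNotFoundError(f"No stored baseline at {path}")
    return json.loads(path.read_text())
```

Raising on a missing file is deliberate: a silent fallback to "whatever main is today" is the failure mode a stored baseline is meant to prevent.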
Some items are nondeterministic even at temperature 0 (tools, time-dependent data). Options:
Don't let flakes erode your alert fidelity.
If every PR triggers an alert, reviewers stop reading. Tune:
Fewer, higher-quality alerts get acted on.
When CI blocks the PR:
Never: