latestaiagents/regression-evals

Name: regression-evals
Author: latestaiagents

skills/evals/regression-evals/SKILL.md

npx skillsauth add latestaiagents/agent-skills regression-evals

Regression Evals

Every prompt change, model upgrade, or tool tweak is a potential regression. Regression evals catch breakage before users do — if you gate deploys on them.

When to Use

Any AI feature in production
Before upgrading model versions
Before merging prompt changes
Weekly as a drift/decay check

The Core Loop

1. Maintain a versioned eval dataset (see eval-dataset-design)
2. On every proposed change, run evals on baseline AND candidate
3. Compare per-stratum metrics with significance tests
4. Gate merge on: no stratum regresses beyond threshold

Minimal Harness

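// Project-specific helpers, not shown here: runModel calls the model,
// scoreOutput returns a score in [0, 1], and compareByStratum returns
// the names of the strata that regressed.
interface EvalCase { id: string; input: unknown; expected: unknown; stratum: string }
interface EvalResult { id: string; stratum: string; score: number }

async function runEvals(model: string, prompt: string, cases: EvalCase[]): Promise<EvalResult[]> {
  const results: EvalResult[] = [];
  for (const c of cases) {
    const output = await runModel(model, prompt, c.input);
    const score = await scoreOutput(c.expected, output); // 0-1
    results.push({ id: c.id, stratum: c.stratum, score });
  }
  return results;
}

// Same model and same cases; only the prompt differs between the two runs.
const baseline = await runEvals("claude-sonnet-4-6", oldPrompt, cases);
const candidate = await runEvals("claude-sonnet-4-6", newPrompt, cases);

const regressed = compareByStratum(baseline, candidate, { alpha: 0.05 });
if (regressed.length) throw new Error(`Regressions in: ${regressed.join(", ")}`);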
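
The harness above calls `compareByStratum` without defining it. One way such a helper could work is sketched below in Python, the language of the statistics example in this skill. It assumes baseline and candidate results share item ids, groups the per-item score differences by stratum, and runs a paired bootstrap on each group. The function name, the `min_drop` parameter, and the result shape are illustrative assumptions, not part of any library.

```python
import random
from collections import defaultdict
from statistics import mean


def compare_by_stratum(baseline, candidate, alpha=0.05, min_drop=0.03,
                       n_boot=2000, seed=0):
    """Return the sorted names of strata that regressed.

    baseline / candidate: lists of {"id", "stratum", "score"} dicts
    covering the same item ids (an assumption of this sketch).
    """
    rng = random.Random(seed)
    base_by_id = {r["id"]: r["score"] for r in baseline}

    # Per-item differences (candidate - baseline), grouped by stratum.
    diffs = defaultdict(list)
    for r in candidate:
        diffs[r["stratum"]].append(r["score"] - base_by_id[r["id"]])

    regressed = []
    for stratum, d in diffs.items():
        # Paired bootstrap: resample the differences with replacement.
        boot = sorted(mean(rng.choices(d, k=len(d))) for _ in range(n_boot))
        upper = boot[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
        # Flag only drops that are both large enough and clearly below zero.
        if mean(d) <= -min_drop and upper < 0:
            regressed.append(stratum)
    return sorted(regressed)
```

Requiring both a minimum drop and an interval entirely below zero keeps tiny but statistically detectable changes from blocking a merge.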

Significance Testing

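A 2% drop on 100 items is noise, not signal: it is two items, and with a pass rate around 80% the standard error on 100 items is about 4 percentage points. Use bootstrap confidence intervals:

import numpy as np

rng = np.random.default_rng(0)  # seeded so reruns give the same interval

def bootstrap_ci(scores, n=1000, alpha=0.05):
    """Percentile bootstrap CI for the mean of per-item scores."""
    scores = np.asarray(scores, dtype=float)
    # Resample items with replacement n times; take the mean of each resample.
    means = rng.choice(scores, size=(n, len(scores)), replace=True).mean(axis=1)
    return np.percentile(means, [100 * alpha / 2, 100 * (1 - alpha / 2)])

# baseline_scores / candidate_scores: per-item scores from the two runs
baseline_ci = bootstrap_ci(baseline_scores)
candidate_ci = bootstrap_ci(candidate_scores)

# If the CIs don't overlap AND the candidate is lower, it's a real regression.
# Overlapping CIs do not rule one out: this check ignores the pairing
# between runs, so it is conservative.

For pass/fail metrics, use McNemar's test on the paired outcomes. For scalar scores, use a paired bootstrap over the per-item score differences.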
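
Both paired tests named above fit in a few lines of standard-library Python. The sketch below is a minimal illustration, assuming the two runs are aligned item by item; the function names and defaults are assumptions made here, and a statistics package would be a reasonable substitute in practice.

```python
import random
from math import comb
from statistics import mean


def mcnemar_exact(baseline_pass, candidate_pass):
    """Two-sided exact McNemar test on paired pass/fail outcomes."""
    # Only discordant pairs carry information about a change.
    b = sum(1 for x, y in zip(baseline_pass, candidate_pass) if x and not y)
    c = sum(1 for x, y in zip(baseline_pass, candidate_pass) if y and not x)
    n = b + c
    if n == 0:
        return 1.0
    # Under the null, each discordant pair is a fair coin flip: Binomial(n, 0.5).
    tail = sum(comb(n, k) for k in range(min(b, c) + 1)) / 2 ** n
    return min(1.0, 2 * tail)


def paired_bootstrap_ci(baseline_scores, candidate_scores,
                        n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for the mean per-item difference (candidate - baseline)."""
    rng = random.Random(seed)
    diffs = [c - b for b, c in zip(baseline_scores, candidate_scores)]
    boot = sorted(mean(rng.choices(diffs, k=len(diffs))) for _ in range(n_boot))
    lo = boot[int(n_boot * alpha / 2)]
    hi = boot[min(n_boot - 1, int(n_boot * (1 - alpha / 2)))]
    return lo, hi
```

A p-value below alpha from `mcnemar_exact`, or an interval from `paired_bootstrap_ci` that lies entirely below zero, would count as a significant regression under these assumptions.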

Thresholds

Hard rules:

| Metric | Regression threshold | Action |
|---|---|---|
| Any stratum | ≥ 3% drop with p < 0.05 | Block merge |
| Aggregate | ≥ 5% drop with p < 0.05 | Block merge |
| Single catastrophic item | New case drops from pass→fail | Investigate, likely block |
| Variance | CI widens significantly | Investigate (noisier outputs) |

Soft rules:

1-3% drop: warning, review required
Improvement on one stratum, regression on another: flag for human judgment
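
The hard and soft rules can be encoded as a small decision function. This is a sketch under stated assumptions: the name `gate` and the returned labels are made up here, drops are expressed as fractions, and cases the rules leave open (for example a large drop that is not significant) are routed to review rather than blocked.

```python
def gate(scope, drop, p_value, alpha=0.05):
    """Map an observed drop to 'block', 'review', or 'pass'.

    scope: "stratum" or "aggregate".
    drop: baseline minus candidate, as a fraction (0.03 means a 3% drop).
    p_value: result of the significance test for this comparison.
    """
    # Hard thresholds from the table: 3% per stratum, 5% in aggregate.
    hard = {"stratum": 0.03, "aggregate": 0.05}[scope]
    if drop >= hard and p_value < alpha:
        return "block"
    # Soft rule: drops of 1% or more that did not block need a human look.
    if drop >= 0.01:
        return "review"
    return "pass"
```

For example, `gate("stratum", 0.04, 0.01)` returns `"block"`, while the same 4% drop on the aggregate returns `"review"` because it stays under the 5% aggregate threshold.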

CI Integration

# .github/workflows/eval.yml
name: regression-evals
on: pull_request
jobs:
  eval:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - uses: actions/setup-node@v4
      - run: npm ci
      - run: npm run evals:baseline  # fetch cached baseline scores
      - run: npm run evals:candidate
      - run: npm run evals:compare
      - uses: actions/upload-artifact@v4
        if: always()  # upload the report even when the compare step fails
        with:
          name: eval-report
          path: eval-report.html

The workflow above only uploads the report as an artifact. Also publish a summary as a PR comment, so reviewers see which strata changed and by how much.
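
A PR comment body can be generated from the per-stratum results. The sketch below assumes a hypothetical list of row dictionaries with `stratum`, `baseline`, `candidate`, `ci_low`, and `ci_high` keys; neither that shape nor the function name comes from the skill itself.

```python
def render_pr_comment(rows):
    """Render per-stratum results as a markdown table for a PR comment.

    rows: list of dicts with stratum, baseline, candidate, ci_low, ci_high,
    where the CI bounds describe the delta (candidate - baseline).
    """
    lines = [
        "### Regression eval report",
        "",
        "| Stratum | Baseline | Candidate | Delta | 95% CI of delta |",
        "|---|---|---|---|---|",
    ]
    # Largest drops first, so reviewers see the worst stratum at the top.
    for r in sorted(rows, key=lambda r: r["candidate"] - r["baseline"]):
        delta = r["candidate"] - r["baseline"]
        lines.append(
            f"| {r['stratum']} | {r['baseline']:.3f} | {r['candidate']:.3f} "
            f"| {delta:+.3f} | [{r['ci_low']:+.3f}, {r['ci_high']:+.3f}] |"
        )
    return "\n".join(lines)
```

The resulting text could be written to a file and posted by a later workflow step, for instance with the GitHub CLI's `gh pr comment --body-file`.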

Gating Model Upgrades

When Anthropic ships a new model, don't swap it into production directly. Gate the upgrade:

1. Run the full eval on the new model vs the current model
2. Bucket regressions: acceptable, acceptable-with-prompt-tweak, hard-regression
3. For hard regressions, either stay on the current model or fix before swapping
4. Canary the new model to 5-10% of traffic; monitor live metrics for a week before full rollout
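
The canary step needs a stable way to decide which requests go to the new model. A common approach, sketched here as an assumption rather than a prescribed method, is to hash a user id into one of 100 buckets so the same user always lands on the same side. The function names and the `salt` value are illustrative.

```python
import hashlib


def use_canary(user_id, percent=5, salt="model-upgrade"):
    """Deterministically assign roughly `percent`% of users to the canary."""
    digest = hashlib.sha256(f"{salt}:{user_id}".encode()).digest()
    # Map the first 8 bytes of the hash to a bucket in [0, 100).
    bucket = int.from_bytes(digest[:8], "big") % 100
    return bucket < percent


def pick_model(user_id, current, candidate, percent=5):
    """Return the model name to use for this user."""
    return candidate if use_canary(user_id, percent) else current
```

Changing the salt for each upgrade reshuffles which users are in the canary group, so the same people are not always the first to see a new model.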

Baseline Management

Your "baseline" needs to be a real, stored artifact — not "whatever main is today".

Store baselines in an artifact store keyed on (prompt version, model, dataset version):

evals/baselines/
  prompt-v42_claude-sonnet-4-6_dataset-v3.json

When you merge a change, update the baseline as an explicit, recorded step. The new baseline is the reference for the next candidate.
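
The key scheme above maps naturally onto a small file-backed store. The following is a minimal sketch that assumes baselines are JSON files on a local path; the helper names are hypothetical, and a real artifact store would replace the filesystem calls.

```python
import json
from pathlib import Path


def baseline_path(root, prompt_version, model, dataset_version):
    """Build the file path for one (prompt, model, dataset) combination."""
    name = f"prompt-{prompt_version}_{model}_dataset-{dataset_version}.json"
    return Path(root) / name


def save_baseline(root, prompt_version, model, dataset_version, results):
    """Write eval results as the stored baseline and return its path."""
    path = baseline_path(root, prompt_version, model, dataset_version)
    path.parent.mkdir(parents=True, exist_ok=True)
    path.write_text(json.dumps(results, indent=2, sort_keys=True))
    return path


def load_baseline(root, prompt_version, model, dataset_version):
    """Load a stored baseline, failing loudly if none exists for this key."""
    path = baseline_path(root, prompt_version, model, dataset_version)
    if not path.exists():
        raise FileNotFoundError(f"No stored baseline: {path.name}")
    return json.loads(path.read_text())
```

Failing when the key is missing, instead of silently recomputing from main, keeps the comparison tied to a known artifact.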

Handling Flaky Evals

Some items are nondeterministic even at temperature 0 (tools, time-dependent data). Options:

Mark items as flaky; exclude from regression calc; track separately
Run flaky items N times; use majority vote
Fix the underlying issue — usually a bad eval item, not a bad model

Don't let flakes erode your alert fidelity.
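
The majority-vote option can be sketched as follows. The callable `run_once` is a hypothetical stand-in for running and scoring one flaky item as pass/fail; returning the pass rate alongside the vote is an addition made here so flaky items can be tracked separately.

```python
def majority_vote(run_once, n=5):
    """Run a flaky pass/fail item n times; return (majority outcome, pass rate)."""
    if n < 1 or n % 2 == 0:
        raise ValueError("n must be a positive odd number so the vote cannot tie")
    passes = sum(1 for _ in range(n) if run_once())
    return passes > n // 2, passes / n
```

A pass rate that stays near 0.5 across runs suggests the item itself is the problem and should be fixed or rewritten.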

Alert Fatigue

If every PR triggers an alert, reviewers stop reading. Tune:

Require stronger evidence for borderline cases by lowering the significance level (for example, alpha = 0.01 instead of 0.05)
Classify small drifts as "review required" (not blocking)
Group related strata to reduce notification count

Fewer, higher-quality alerts get acted on.

Handling a Real Regression

When CI blocks the PR:

1. Read the report: which stratum dropped?
2. Look at the specific failed items and reproduce them locally
3. Is it the prompt, the tool, or the model? Bisect if needed
4. Fix and re-run. If you can't fix it, decide: accept the regression (with sign-off) or revert

Never:

Lower the threshold to make it pass
Remove the failing stratum
Skip eval on "urgent" PRs — those are exactly when you need it most

Anti-Patterns

Running evals once, shipping forever — drift happens
No per-stratum breakdown — aggregate hides localized regressions
No significance testing — you chase noise
Manual compare — human eyeballs miss small systematic drops
Baseline updated silently — you lose the ability to detect drift
Lowering thresholds to make CI green — defeats the purpose

Best Practices

Run regression evals on every PR, gate merges
Report per-stratum metrics with confidence intervals
Store versioned baselines in an artifact store
Use paired bootstrap or McNemar's for significance
Canary model upgrades; don't swap in production cold
Publish eval reports as PR comments for visibility
Maintain flaky-item exclusion lists explicitly, don't hide them

Set up continuous regression evals so model/prompt/tool changes don't silently break existing behavior. Covers gating thresholds, CI integration, statistical significance, and response to regressions. Use this skill when deploying prompts to production, gating model upgrades, or noticing "it worked yesterday" in AI features. Activate when: regression eval, eval CI, prompt regression, model upgrade gate, eval threshold, eval alert.

2 stars

tools

Updated Apr 23, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | % |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 2:55 AM (8.1s, 1 file scanned)

Related Skills

latestaiagents/skill-testing

development

Test skills for correct activation, content quality, and regression — both automated checks (frontmatter validity, lint) and manual verification (query-suite activation testing). Covers CI integration and how to catch skill regressions before users do. Use this skill when adding skills to a repo, setting up CI for a skill library, or debugging "the skill exists but doesn't work". Activate when: test skills, validate skills, skill CI, skill linting, skill activation test, skill regression.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/skill-frontmatter

documentation

Write the YAML frontmatter for a SKILL.md file so it activates reliably — name, description, and activation keywords that the model matches against. Covers length, tone, and the most common frontmatter mistakes. Use this skill when authoring a new skill, fixing a skill that isn't auto-activating, or reviewing skills for publication. Activate when: SKILL.md frontmatter, skill description, skill activation, skill YAML, write a skill, author a skill.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/skill-activation-patterns

development

Design skills that fire at the right moment — neither over-eager (noise) nor under-eager (silent). Covers activation specificity, trigger phrases, disambiguation between overlapping skills, and debugging activation. Use this skill when multiple skills could fire on the same query, a skill never fires, or a skill fires too often. Activate when: skill won't activate, skill over-activates, overlapping skills, skill triggers, skill selection, skill disambiguation.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/progressive-disclosure

development

Structure SKILL.md content so the model reads just enough — concise summary up front, progressively deeper detail, examples on demand. Covers section ordering, length budgets, when to split into multiple skills. Use this skill when writing or refactoring a skill body, one skill has grown too long, or a skill is wordy but not useful. Activate when: SKILL.md structure, skill content, skill too long, split skill, progressive disclosure, skill body.

2 · SKILL.md · Updated Apr 23, 2026

Install manually

# Clone the repo
git clone https://github.com/latestaiagents/agent-skills.git

# Copy into Claude Code skills folder (global)
mkdir -p ~/.claude/skills  # create the folder if it does not exist yet
cp -r agent-skills/skills/evals/regression-evals ~/.claude/skills/

Repository

latestaiagents/agent-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT