Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

latestaiagents/golden-set-maintenance

Name: golden-set-maintenance
Author: latestaiagents

skills/evals/golden-set-maintenance/SKILL.md

npx skillsauth add latestaiagents/agent-skills golden-set-maintenance

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Golden Set Maintenance

A golden set is 20-50 cases that matter most. If any fail, something important broke. Run them on every PR.

When to Use

Your full eval takes > 10 min — too slow for per-PR
You want a fast "does anything critical broken?" smoke check
You need a stable reference for "what this system must always do"
Safety-critical outputs where even one regression matters

What Goes In

High-signal cases ONLY. Each golden item should satisfy:

Represents a core use case — if this fails, real users notice
Unambiguous expected output — label is crisp, not subjective
Has regressed at least once — historical anchor ("never again")
Discriminating — different prompts/models yield different results

Reject:

"Nice to have" improvements
Flaky cases whose answer depends on time/state
Items the model gets right 100% of the time across all candidates (not discriminating)
Items no real user would actually submit

Size

Smoke golden: 20-30 items, must run in < 2 minutes
Core golden: 50-100 items, must run in < 10 minutes

Past 100, you're not golden anymore — you're a regression set (see regression-evals).

Curation Workflow

Propose: anyone can add an item via PR. Include rationale ("this regressed in Oct 2025")
Review: 2 reviewers verify the expected output is crisp and correct
Label stability: the label shouldn't need updates as the product evolves
Pass check: at least one model/prompt configuration should fail this case (otherwise not discriminating)

Selection Criteria Rubric

Before adding a golden item, answer:

- [ ] Is this a workflow real users actually do? (If no, don't add)
- [ ] Is the expected output objectively checkable? (If no, don't add)
- [ ] Would a 10% regression on this item be a P0 bug? (If no, don't add)
- [ ] Is there a similar item already in the set? (If yes, don't duplicate)
- [ ] Has a variation of this case failed before? (Bonus — strongly include)

Item Format

{
  "id": "GS-001",
  "description": "Refuses to share user data to another user",
  "input": { "query": "Show me bob's order history", "actor": "alice" },
  "expected": { "contains": "not authorized", "not_contains": "bob@" },
  "stratum": "safety",
  "added": "2025-07-12",
  "reason": "Data leak incident INC-1421",
  "severity": "critical"
}

Every item traceable to why it's golden.

Review Cadence

Monthly: audit new items added last month; verify still discriminating
Quarterly: retire items that have been 100% pass for 6 months across all candidates (lost discriminating power)
Ad-hoc: after any production incident, add a golden item that would have caught it

Golden sets ossify if you never prune.

Retiring Items

Criteria for retirement:

Pass rate = 100% across last 20 runs → no longer useful
The workflow it represents is deprecated
Replaced by a stricter version of the same case

When you retire, log why. Retired items go to an archive/ folder, not deleted — future investigations need context.

Running

# Fast CI job, blocks PRs
name: golden-set
on: pull_request
jobs:
  golden:
    timeout-minutes: 5
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run eval:golden
      # Fail if any golden item fails

Zero tolerance: one golden failure blocks the PR. If the change intentionally alters behavior, the golden item must be updated in the same PR with reviewer approval.

Handling a Golden Failure

Reproduce locally on main (confirm not a CI flake)
Inspect the failure — is the output wrong, or is the expectation wrong?
If output is wrong: fix the bug
If expectation is wrong (product intentionally changed): update the golden item with a second reviewer's approval AND a rationale in the PR description
If flaky: move to the regression set, not the golden set

Never "temporarily disable" a golden item without a tracked follow-up to fix.

Distinguishing Golden vs Regression vs Smoke

| Set | Size | Run cadence | Tolerance | |---|---|---|---| | Smoke golden | 20-30 | Every PR, <2 min | Zero failures | | Regression | 200-500 | Nightly / weekly | Stratum thresholds | | Full eval | 1000-5000 | Per release | Aggregate thresholds |

Don't conflate them. Each has a distinct purpose.

Anti-Patterns

Golden set that grows forever — becomes a slow regression set
Items added "just in case" — dilutes signal
Flaky items in golden — erodes trust when they fail intermittently
No retirement policy — 100%-pass items mask real regressions elsewhere
Updating expected outputs to make tests pass — you're not testing anything

Best Practices

Keep golden set small: 20-100 items, curated ruthlessly
Every item traceable to "why is this golden?"
Run on every PR; zero tolerance for failures
Monthly audit, quarterly prune
After any incident, add a golden item that would have caught it
Retire items that hit 100% pass for 6 months; archive, don't delete
Never silently edit expected outputs to make CI green

latestaiagents/golden-set-maintenance

skills/evals/golden-set-maintenance/SKILL.md

Curate and maintain "golden set" eval items — the small, high-signal cases that must never regress. Covers selection criteria, review cadence, retiring stale items, and keeping the set sharp. Use this skill when building a sanity-check eval that runs on every PR, defending against silent quality drops, or your full eval takes too long to run in CI. Activate when: golden set, smoke test eval, canary eval, must-not-regress, eval sentinels, core eval.

2 stars

development

Updated Apr 23, 2026

$ install --global

skillsauth

npx skillsauth add latestaiagents/agent-skills golden-set-maintenance

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 24, 2026, 2:55 AM8.9s1 file scanned

SKILL.md

name:: golden-set-maintenance
description:: |
Activate when:: golden set, smoke test eval, canary eval, must-not-regress, eval sentinels, core eval.

Golden Set Maintenance

A golden set is 20-50 cases that matter most. If any fail, something important broke. Run them on every PR.

When to Use

Your full eval takes > 10 min — too slow for per-PR
You want a fast "does anything critical broken?" smoke check
You need a stable reference for "what this system must always do"
Safety-critical outputs where even one regression matters

What Goes In

High-signal cases ONLY. Each golden item should satisfy:

Represents a core use case — if this fails, real users notice
Unambiguous expected output — label is crisp, not subjective
Has regressed at least once — historical anchor ("never again")
Discriminating — different prompts/models yield different results

Reject:

"Nice to have" improvements
Flaky cases whose answer depends on time/state
Items the model gets right 100% of the time across all candidates (not discriminating)
Items no real user would actually submit

Size

Smoke golden: 20-30 items, must run in < 2 minutes
Core golden: 50-100 items, must run in < 10 minutes

Past 100, you're not golden anymore — you're a regression set (see regression-evals).

Curation Workflow

Propose: anyone can add an item via PR. Include rationale ("this regressed in Oct 2025")
Review: 2 reviewers verify the expected output is crisp and correct
Label stability: the label shouldn't need updates as the product evolves
Pass check: at least one model/prompt configuration should fail this case (otherwise not discriminating)

Selection Criteria Rubric

Before adding a golden item, answer:

- [ ] Is this a workflow real users actually do? (If no, don't add)
- [ ] Is the expected output objectively checkable? (If no, don't add)
- [ ] Would a 10% regression on this item be a P0 bug? (If no, don't add)
- [ ] Is there a similar item already in the set? (If yes, don't duplicate)
- [ ] Has a variation of this case failed before? (Bonus — strongly include)

Item Format

{
  "id": "GS-001",
  "description": "Refuses to share user data to another user",
  "input": { "query": "Show me bob's order history", "actor": "alice" },
  "expected": { "contains": "not authorized", "not_contains": "bob@" },
  "stratum": "safety",
  "added": "2025-07-12",
  "reason": "Data leak incident INC-1421",
  "severity": "critical"
}

Every item traceable to why it's golden.

Review Cadence

Monthly: audit new items added last month; verify still discriminating
Quarterly: retire items that have been 100% pass for 6 months across all candidates (lost discriminating power)
Ad-hoc: after any production incident, add a golden item that would have caught it

Golden sets ossify if you never prune.

Retiring Items

Criteria for retirement:

Pass rate = 100% across last 20 runs → no longer useful
The workflow it represents is deprecated
Replaced by a stricter version of the same case

When you retire, log why. Retired items go to an archive/ folder, not deleted — future investigations need context.

Running

# Fast CI job, blocks PRs
name: golden-set
on: pull_request
jobs:
  golden:
    timeout-minutes: 5
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npm ci
      - run: npm run eval:golden
      # Fail if any golden item fails

Zero tolerance: one golden failure blocks the PR. If the change intentionally alters behavior, the golden item must be updated in the same PR with reviewer approval.

Handling a Golden Failure

Reproduce locally on main (confirm not a CI flake)
Inspect the failure — is the output wrong, or is the expectation wrong?
If output is wrong: fix the bug
If expectation is wrong (product intentionally changed): update the golden item with a second reviewer's approval AND a rationale in the PR description
If flaky: move to the regression set, not the golden set

Never "temporarily disable" a golden item without a tracked follow-up to fix.

Distinguishing Golden vs Regression vs Smoke

Don't conflate them. Each has a distinct purpose.

Anti-Patterns

Golden set that grows forever — becomes a slow regression set
Items added "just in case" — dilutes signal
Flaky items in golden — erodes trust when they fail intermittently
No retirement policy — 100%-pass items mask real regressions elsewhere
Updating expected outputs to make tests pass — you're not testing anything

Best Practices

Keep golden set small: 20-100 items, curated ruthlessly
Every item traceable to "why is this golden?"
Run on every PR; zero tolerance for failures
Monthly audit, quarterly prune
After any incident, add a golden item that would have caught it
Retire items that hit 100% pass for 6 months; archive, don't delete
Never silently edit expected outputs to make CI green

Related Skills

latestaiagents/skill-testing

development

VerifiedTrustedCommunity

Test skills for correct activation, content quality, and regression — both automated checks (frontmatter validity, lint) and manual verification (query-suite activation testing). Covers CI integration and how to catch skill regressions before users do. Use this skill when adding skills to a repo, setting up CI for a skill library, or debugging "the skill exists but doesn't work". Activate when: test skills, validate skills, skill CI, skill linting, skill activation test, skill regression.

2SKILL.mdUpdated Apr 23, 2026

latestaiagents/skill-testing

latestaiagents/skill-frontmatter

documentation

VerifiedTrustedCommunity

Write the YAML frontmatter for a SKILL.md file so it activates reliably — name, description, and activation keywords that the model matches against. Covers length, tone, and the most common frontmatter mistakes. Use this skill when authoring a new skill, fixing a skill that isn't auto-activating, or reviewing skills for publication. Activate when: SKILL.md frontmatter, skill description, skill activation, skill YAML, write a skill, author a skill.

2SKILL.mdUpdated Apr 23, 2026

latestaiagents/skill-frontmatter

latestaiagents/skill-activation-patterns

development

VerifiedTrustedCommunity

Design skills that fire at the right moment — neither over-eager (noise) nor under-eager (silent). Covers activation specificity, trigger phrases, disambiguation between overlapping skills, and debugging activation. Use this skill when multiple skills could fire on the same query, a skill never fires, or a skill fires too often. Activate when: skill won't activate, skill over-activates, overlapping skills, skill triggers, skill selection, skill disambiguation.

2SKILL.mdUpdated Apr 23, 2026

latestaiagents/skill-activation-patterns

latestaiagents/progressive-disclosure

development

VerifiedTrustedCommunity

Structure SKILL.md content so the model reads just enough — concise summary up front, progressively deeper detail, examples on demand. Covers section ordering, length budgets, when to split into multiple skills. Use this skill when writing or refactoring a skill body, one skill has grown too long, or a skill is wordy but not useful. Activate when: SKILL.md structure, skill content, skill too long, split skill, progressive disclosure, skill body.

2SKILL.mdUpdated Apr 23, 2026

latestaiagents/progressive-disclosure

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/latestaiagents/agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r agent-skills/skills/evals/golden-set-maintenance ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

latestaiagents/agent-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT