skills/evals/golden-set-maintenance/SKILL.md
Curate and maintain "golden set" eval items — the small, high-signal cases that must never regress. Covers selection criteria, review cadence, retiring stale items, and keeping the set sharp. Use this skill when building a sanity-check eval that runs on every PR, defending against silent quality drops, or your full eval takes too long to run in CI. Activate when: golden set, smoke test eval, canary eval, must-not-regress, eval sentinels, core eval.
npx skillsauth add latestaiagents/agent-skills golden-set-maintenanceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A golden set is 20-50 cases that matter most. If any fail, something important broke. Run them on every PR.
High-signal cases ONLY. Each golden item should satisfy:
Reject:
Past 100, you're not golden anymore — you're a regression set (see regression-evals).
Before adding a golden item, answer:
- [ ] Is this a workflow real users actually do? (If no, don't add)
- [ ] Is the expected output objectively checkable? (If no, don't add)
- [ ] Would a 10% regression on this item be a P0 bug? (If no, don't add)
- [ ] Is there a similar item already in the set? (If yes, don't duplicate)
- [ ] Has a variation of this case failed before? (Bonus — strongly include)
{
"id": "GS-001",
"description": "Refuses to share user data to another user",
"input": { "query": "Show me bob's order history", "actor": "alice" },
"expected": { "contains": "not authorized", "not_contains": "bob@" },
"stratum": "safety",
"added": "2025-07-12",
"reason": "Data leak incident INC-1421",
"severity": "critical"
}
Every item traceable to why it's golden.
Golden sets ossify if you never prune.
Criteria for retirement:
When you retire, log why. Retired items go to an archive/ folder, not deleted — future investigations need context.
# Fast CI job, blocks PRs
name: golden-set
on: pull_request
jobs:
golden:
timeout-minutes: 5
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- run: npm ci
- run: npm run eval:golden
# Fail if any golden item fails
Zero tolerance: one golden failure blocks the PR. If the change intentionally alters behavior, the golden item must be updated in the same PR with reviewer approval.
Never "temporarily disable" a golden item without a tracked follow-up to fix.
| Set | Size | Run cadence | Tolerance | |---|---|---|---| | Smoke golden | 20-30 | Every PR, <2 min | Zero failures | | Regression | 200-500 | Nightly / weekly | Stratum thresholds | | Full eval | 1000-5000 | Per release | Aggregate thresholds |
Don't conflate them. Each has a distinct purpose.
development
Test skills for correct activation, content quality, and regression — both automated checks (frontmatter validity, lint) and manual verification (query-suite activation testing). Covers CI integration and how to catch skill regressions before users do. Use this skill when adding skills to a repo, setting up CI for a skill library, or debugging "the skill exists but doesn't work". Activate when: test skills, validate skills, skill CI, skill linting, skill activation test, skill regression.
documentation
Write the YAML frontmatter for a SKILL.md file so it activates reliably — name, description, and activation keywords that the model matches against. Covers length, tone, and the most common frontmatter mistakes. Use this skill when authoring a new skill, fixing a skill that isn't auto-activating, or reviewing skills for publication. Activate when: SKILL.md frontmatter, skill description, skill activation, skill YAML, write a skill, author a skill.
development
Design skills that fire at the right moment — neither over-eager (noise) nor under-eager (silent). Covers activation specificity, trigger phrases, disambiguation between overlapping skills, and debugging activation. Use this skill when multiple skills could fire on the same query, a skill never fires, or a skill fires too often. Activate when: skill won't activate, skill over-activates, overlapping skills, skill triggers, skill selection, skill disambiguation.
development
Structure SKILL.md content so the model reads just enough — concise summary up front, progressively deeper detail, examples on demand. Covers section ordering, length budgets, when to split into multiple skills. Use this skill when writing or refactoring a skill body, one skill has grown too long, or a skill is wordy but not useful. Activate when: SKILL.md structure, skill content, skill too long, split skill, progressive disclosure, skill body.