01-package-scaffolding/skill-eval-runner/SKILL.md
Run trigger tests, behavior tests, and baseline comparisons for a skill's eval suite, then produce a structured quality verdict. Use when a skill has been modified and needs regression testing, when CI/pre-release validation requires documented eval results, or when measuring quality before catalog inclusion. Do not use when no evals exist yet (build them first) or for manual evaluation without test files.
npx skillsauth add chelch5/skilllibrary skill-eval-runnerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Executes a skill's eval suite and produces a structured quality report.
Check for eval files in the skill directory:
evals/triggers.yaml — trigger accuracy testsevals/outputs.yaml — behavior correctness testsevals/baselines.yaml — baseline comparison testsNote which exist and which are missing.
For each case in triggers.yaml:
Record: prompt, expected, actual, Pass/Fail.
For each case in outputs.yaml:
expected_sections, required_patterns, forbidden_patternsRecord: test name, checks passed/total, Pass/Fail.
For each case in baselines.yaml:
Calculate:
| Verdict | Criteria | |---------|----------| | Pass | All rates ≥80% AND baseline win | | Pass with issues | Any rate 60-79% | | Fail | Any rate <60% OR baseline lose |
## Eval Report: [skill-name]
Date: [YYYY-MM-DD]
### Trigger Tests
| Prompt | Type | Expected | Actual | Result |
|--------|------|----------|--------|--------|
Precision: X% Recall: Y%
### Output Tests
| Test | Checks Passed | Result |
|------|--------------|--------|
Pass rate: X%
### Baseline Comparison
Skill adds value: [Yes/No]
### Verdict: [Pass | Pass with issues | Fail]
Issues: [list or "None"]
testing
Manages context window budgets, loading strategies, and compaction techniques for AI-assisted coding sessions. Trigger on 'context window', 'what to load', 'context management', 'context overflow', 'token budget'. DO NOT USE for loading specific project docs into agent context (use project-context) or prompt wording and optimization (use prompt-crafting).
development
Implements authentication, session, token, and authorization patterns for the current stack. Trigger on 'add auth', 'JWT', 'OAuth', 'login endpoint', 'session management', 'API key auth'. DO NOT USE for OWASP hardening checklists (use security-hardening), threat modeling (use security-threat-model), or secret rotation/storage (use security-best-practices).
tools
Defines request/response shapes, versioning, validation, and compatibility rules for API-first work. Trigger on 'design API', 'OpenAPI spec', 'REST schema', 'API versioning', 'generate client SDK'. DO NOT USE for GraphQL schemas, gRPC/protobuf definitions (use stack-standards), auth endpoint logic (use auth-patterns), or external API client wrappers (use external-api-client).
development
Create a repo-local ticket system with an index, machine-readable manifest, board, and individual ticket files. Use when a repo needs task decomposition that autonomous agents can follow without re-planning the whole project each session. Do not use for executing tickets (use ticket-execution) or quick fixes that don't warrant formal tickets.