skills/validation/skill-benchmark/SKILL.md
Evaluates skill effectiveness by comparing agent output on a task with and without skills activated. Use when validating that a skill improves output quality, when creating evidence for a skill PR, or when regression-testing skill changes.
npx skillsauth add pkuppens/pkuppens skill-benchmark
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Measures the real-world impact of a skill by comparing agent output with and without skill activation.
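When to use:
- Validating that a skill improves output quality
- Creating evidence for a skill PR
- Regression-testing skill changes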
Instructions:
Select a benchmark task:
- on_prem_rag
- on_prem_rag 0.9.0
- tmp/skills/benchmark/<skill-name>/task.md
Run baseline (without skill):
- tmp/skills/benchmark/<skill-name>/output-baseline.md
Run with skill activated:
- tmp/skills/benchmark/<skill-name>/output-with-skill.md
Score both outputs on 4 dimensions (1–5 scale each):
| Dimension | Description |
|-----------|-------------|
| Coverage | Did the output address all required sub-tasks? |
| Specificity | Are commands, formats, and values concrete (not vague)? |
| Correctness | Is the output factually and technically correct? |
| Completeness | Could an engineer act on this output without further clarification? |
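
The rubric above can be checked mechanically before any comparison is made. The following is a minimal sketch, not part of the skill itself: `DIMENSIONS` and `validate_scores` are hypothetical names, and the example scores are illustrative values rather than real benchmark results.

```python
# Hypothetical helper: the dimension names come from the rubric table,
# everything else here is an illustrative assumption.
DIMENSIONS = ("Coverage", "Specificity", "Correctness", "Completeness")


def validate_scores(scores: dict[str, int]) -> int:
    """Check that each dimension has an integer score from 1 to 5; return the total."""
    missing = [name for name in DIMENSIONS if name not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {', '.join(missing)}")
    unknown = [name for name in scores if name not in DIMENSIONS]
    if unknown:
        raise ValueError(f"unknown dimensions: {', '.join(unknown)}")
    for name in DIMENSIONS:
        value = scores[name]
        # bool is a subclass of int, so reject it explicitly.
        if isinstance(value, bool) or not isinstance(value, int) or not 1 <= value <= 5:
            raise ValueError(f"{name} must be an integer from 1 to 5, got {value!r}")
    return sum(scores[name] for name in DIMENSIONS)


# Illustrative scores only.
example = {"Coverage": 3, "Specificity": 2, "Correctness": 4, "Completeness": 3}
print(validate_scores(example))  # 12
```

Rejecting missing or out-of-range scores early keeps the totals in the comparison table on the same /20 scale for both runs.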
Produce comparison table:
| Dimension | Without Skill | With Skill | Delta |
|-----------|--------------|------------|-------|
| Coverage | /5 | /5 | +N |
| Specificity | /5 | /5 | +N |
| Correctness | /5 | /5 | +N |
| Completeness | /5 | /5 | +N |
| Total | /20 | /20 | +N |
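
The deltas and totals in this table are simple arithmetic, so they can be generated rather than filled in by hand. The sketch below assumes scores are held in plain dictionaries keyed by dimension name; `comparison_table` is a hypothetical helper, and the numbers are made up for illustration.

```python
# Hypothetical helper; the column layout mirrors the comparison table above.
DIMENSIONS = ("Coverage", "Specificity", "Correctness", "Completeness")


def comparison_table(baseline: dict[str, int], with_skill: dict[str, int]) -> str:
    """Render the comparison table as Markdown with signed per-dimension deltas."""
    rows = [
        "| Dimension | Without Skill | With Skill | Delta |",
        "|-----------|--------------|------------|-------|",
    ]
    for name in DIMENSIONS:
        delta = with_skill[name] - baseline[name]
        # "+d" formatting always prints the sign, so regressions show as -N.
        rows.append(f"| {name} | {baseline[name]}/5 | {with_skill[name]}/5 | {delta:+d} |")
    total_before = sum(baseline[name] for name in DIMENSIONS)
    total_after = sum(with_skill[name] for name in DIMENSIONS)
    rows.append(
        f"| Total | {total_before}/20 | {total_after}/20 | {total_after - total_before:+d} |"
    )
    return "\n".join(rows)


# Illustrative scores only, not measured results.
baseline = {"Coverage": 2, "Specificity": 2, "Correctness": 3, "Completeness": 2}
with_skill = {"Coverage": 4, "Specificity": 4, "Correctness": 4, "Completeness": 3}
print(comparison_table(baseline, with_skill))
```

Printing the sign explicitly means a skill that makes output worse is visible in the Delta column instead of being hidden behind the `+N` placeholder.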
Write benchmark report in tmp/skills/benchmark/<skill-name>/report.md:
Output format:
## Skill Benchmark — <skill-name>
- Task: <one-line description>
- Without skill total: <N>/20
- With skill total: <N>/20
- Delta: +<N>
- Verdict: significant improvement | marginal | no improvement
- Key observation: <one sentence on the most important qualitative difference>
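
The report format above can be filled in programmatically once the totals are known. This is a sketch under stated assumptions: `render_report` and `VERDICTS` are hypothetical names, and because the source does not define score thresholds for each verdict, the verdict is passed in by the reviewer rather than computed.

```python
# The three verdict labels come from the output format above; how a delta maps
# to a verdict is not specified in the source, so the caller chooses it.
VERDICTS = ("significant improvement", "marginal", "no improvement")


def render_report(
    skill_name: str,
    task: str,
    without_total: int,
    with_total: int,
    verdict: str,
    key_observation: str,
) -> str:
    """Fill in the benchmark report template and return it as Markdown text."""
    if verdict not in VERDICTS:
        raise ValueError(f"verdict must be one of {VERDICTS}, got {verdict!r}")
    delta = with_total - without_total
    return "\n".join(
        [
            # \u2014 is the dash character used in the template heading.
            f"## Skill Benchmark \u2014 {skill_name}",
            f"- Task: {task}",
            f"- Without skill total: {without_total}/20",
            f"- With skill total: {with_total}/20",
            f"- Delta: {delta:+d}",
            f"- Verdict: {verdict}",
            f"- Key observation: {key_observation}",
        ]
    )


# Illustrative values only.
print(
    render_report(
        skill_name="skill-benchmark",
        task="create a GitHub issue for a feature",
        without_total=9,
        with_total=15,
        verdict="significant improvement",
        key_observation="Output with the skill named concrete commands and file paths.",
    )
)
```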
Benchmark task library: Reusable prompts live under docs/skills/benchmark/tasks/ in this repo (versioned). Local copies may also sit in tmp/skills/benchmark/tasks/ during runs.
- issue-creation.md: create a GitHub issue for a feature
- adr-authoring.md: write an ADR for a technology choice
- deployment-checklist.md: produce a deployment checklist for a release
Committed reports (examples / evidence): docs/skills/benchmark/<skill-name>/report.md. Session transcripts can remain in tmp/skills/benchmark/<skill-name>/.
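
The file layout described here can be captured in one place so scratch and committed paths stay consistent. The sketch below is an assumption-laden illustration: `benchmark_paths` and `promote_report` are hypothetical helpers, and copying the scratch report into `docs/` is one possible way to produce the committed report, not a step the skill prescribes.

```python
import shutil
from pathlib import Path


def benchmark_paths(repo_root: Path, skill_name: str) -> dict[str, Path]:
    """Return the scratch and committed file locations for one skill's benchmark."""
    scratch = repo_root / "tmp" / "skills" / "benchmark" / skill_name
    committed = repo_root / "docs" / "skills" / "benchmark" / skill_name
    return {
        "task": scratch / "task.md",
        "baseline": scratch / "output-baseline.md",
        "with_skill": scratch / "output-with-skill.md",
        "scratch_report": scratch / "report.md",
        "committed_report": committed / "report.md",
    }


def promote_report(repo_root: Path, skill_name: str) -> Path:
    """Copy the scratch report into docs/ (assumed workflow); transcripts stay in tmp/."""
    paths = benchmark_paths(repo_root, skill_name)
    target = paths["committed_report"]
    target.parent.mkdir(parents=True, exist_ok=True)
    shutil.copy2(paths["scratch_report"], target)
    return target
```

Calling `promote_report(Path("."), "<skill-name>")` from the repository root would leave the scratch files untouched and place only the report under version control.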
Anti-patterns:
Integration: Output feeds architecture-risks-debt if skills have known gaps. Store runnable evidence under docs/skills/benchmark/ (or tmp/skills/benchmark/ for scratch). See COOPERATION.md.