evaluate-plugin/skills/evaluate-report/SKILL.md
View evaluation results and benchmark reports for a skill or plugin. Use when reviewing past eval results, comparing benchmark runs, or tracking quality trends.
npx skillsauth add laurigates/claude-plugins evaluate-reportInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
View evaluation results and benchmark reports for a skill or plugin.
| Use this skill when... | Use alternative when... |
|------------------------|------------------------|
| Want to see results from a past evaluation | Need to run new evaluations -> /evaluate:skill |
| Comparing benchmark runs over time | Want improvement suggestions -> /evaluate:improve |
| Reviewing quality trends for a plugin | Need structural compliance -> plugin-compliance-check.sh |
Parse these from $ARGUMENTS:
| Parameter | Default | Description |
|-----------|---------|-------------|
| <target> | required | plugin/skill-name for skill or plugin-name for plugin |
| --latest | true | Show most recent benchmark only |
| --history | false | Show all benchmark iterations |
| --compare | false | Compare current vs previous benchmark |
Determine if $ARGUMENTS refers to a skill or plugin:
/: treat as plugin-name/skill-name, look for <plugin>/skills/<skill>/eval-results/benchmark.json/: treat as plugin-name, look for <plugin>/eval-results/plugin-benchmark.jsonIf the target file does not exist, report that no evaluation results are available and suggest running /evaluate:skill or /evaluate:plugin.
Skill-level report (--latest or default):
Read benchmark.json and display:
## Evaluation Report: <plugin/skill-name>
**Last run**: <timestamp>
**Runs per eval**: N
| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | 85% | 42% | +43% |
| Duration | 14s | 12s | +2s |
### Per-Eval Results
| Eval ID | Description | Pass Rate | Failures |
|---------|-------------|-----------|----------|
| eval-001 | Basic usage | 100% | — |
| eval-002 | Edge case | 67% | assertion X |
Skill-level history (--history):
Read history.json and display improvement iterations:
## Evaluation History: <plugin/skill-name>
| Version | Date | Pass Rate | Changes |
|---------|------|-----------|---------|
| v3 (current) | 2026-03-04 | 85% | Improved step 3 instructions |
| v2 | 2026-03-01 | 72% | Added error handling |
| v1 | 2026-02-28 | 55% | Initial evaluation |
Skill-level comparison (--compare):
Read current and previous benchmarks, show delta:
## Comparison: <plugin/skill-name>
| Metric | Previous | Current | Delta |
|--------|----------|---------|-------|
| Pass Rate | 72% | 85% | +13% |
| Duration | 16s | 14s | -2s |
Improved evals: eval-002, eval-004
Regressed evals: none
Plugin-level report:
Read plugin-benchmark.json and display:
## Plugin Report: <plugin-name>
| Skill | Evals | Pass Rate | Status |
|-------|-------|-----------|--------|
| skill-a | 4 | 100% | PASS |
| skill-b | 3 | 67% | NEEDS WORK |
**Overall**: 82% pass rate
**Skills evaluated**: N / M total
| Context | Command |
|---------|---------|
| Find skill benchmark | cat <plugin>/skills/<skill>/eval-results/benchmark.json \| jq .summary |
| Find plugin benchmark | cat <plugin>/eval-results/plugin-benchmark.json \| jq .summary |
| Check for history | cat <plugin>/skills/<skill>/eval-results/history.json \| jq '.iterations \| length' |
| List available results | find <plugin> -name benchmark.json -o -name plugin-benchmark.json |
| Flag | Description |
|------|-------------|
| --latest | Most recent benchmark (default) |
| --history | Show all iterations |
| --compare | Compare current vs previous |
tools
Scaffold a new ComfyUI custom-node repo (pyproject, CI, release-please, vitest+pytest, JS extension skeleton) in the picker/gesture vein. Use when bootstrapping or init-ing a comfyui node pack.
tools
Orchestrate a ComfyUI node pack from idea to registry: scaffold, create + seed the repo, open the gitops adoption PR. Use when releasing or spinning up a new comfyui node pack.
testing
macOS EndpointSecurity/EDR high CPU & battery drain. Use when Kandji ESF / XProtect pegs a core; trace the exec storm via powermetrics + eslogger.
development
odiff pixel-by-pixel image diffing. Use when comparing screenshots, detecting visual regressions, diffing before/after PNGs, asserting golden images.