claude/skills/run-skill-eval/SKILL.md
Generate variations, eval all variants (HEAD/working/optimized) via claude CLI, report via @visualizer. Iterate until optimal. Use when: /run-skill-eval, "optimize my skills", "evaluate changed skills", "find the best version", skill quality before committing. Do NOT use for creating new skills (skill-creator), code review (run-review), or one-off skill authoring review (use-skill-craft).
npx skillsauth add kenoxa/spine run-skill-evalInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Optimize (generate variations) → Evaluate (claude -p) → Report (@visualizer HTML) → iterate.
| Agent | Role |
|-------|------|
| @spine:run-skill-eval:reviewer | Craft review per unit |
| @spine:run-skill-eval:optimizer | Generate optimized variants |
| @spine:run-skill-eval:designer | Design quality + trigger eval test cases |
| @spine:run-skill-eval:runner | Calibration + eval execution + metrics extraction |
| @spine:run-skill-eval:grader | Grade variant outputs against expectations |
Workspace: .scratch/<session>/benchmark/<unit>/ per eval unit.
Priority order:
git diff --name-only ${base:-HEAD} # tracked changes
git status --porcelain # untracked (??) files
Filter to: */SKILL.md, agents/*.md, AGENTS.md, CLAUDE.md, **/CLAUDE.md
Deduplicate. No files → "No evaluatable changes detected." → exit.| Pattern | Unit type |
|---------|-----------|
| <dir>/SKILL.md + sibling references/*.md | skill unit |
| agents/<name>.md | agent unit |
| CLAUDE.md / AGENTS.md / @-included files | instruction unit |
Parse root CLAUDE.md @ includes. Changed included file → always-loaded flag (cross-cutting eval, Step 2).
git show ${base:-HEAD}:<path> > .scratch/<session>/baselines/<path>
Grep changed ref filenames across all SKILL.md; add transitive consumers to eval set.
SKILL_CREATOR_DIR="$(jq -r '.plugins["skill-creator@claude-plugins-official"][0].installPath' ~/.claude/plugins/installed_plugins.json)/skills/skill-creator"
Resolve once. Verify the path exists — if null or missing, skill-creator plugin is not installed. All subsequent steps reference $SKILL_CREATOR_DIR.
Classify each unit: tool-dependent (body references Agent, @-agent, Read, Write, Grep, Glob, Bash, or dispatch) vs text-only (no tool/agent references). Grep skill body excluding anti-patterns section. Classification determines whether calibration runs in Step 2.
Dispatch @spine:run-skill-eval:reviewer per eval unit. Output: .scratch/<session>/optimize/<unit>/craft-findings.md.
Dispatch @spine:run-skill-eval:optimizer 1-2 times per unit (one per strategy: compress, restructure, or custom based on craft findings). Each receives working copy, baseline, and craft findings.
.scratch/<session>/optimize/<unit>/variation-<strategy>.mdCandidates: baseline (HEAD), working (current), + variations.
Skills that dispatch subagents need real file context — a model won't dispatch @inspector agents against a diff embedded in text. For skills with dispatch, codebase interaction, or file I/O:
.scratch/<session>/eval-fixtures/<unit>/FIXTURE_DIR=".scratch/<session>/eval-fixtures/<unit>"
mkdir -p "$FIXTURE_DIR"
# Write test source files, init git repo, create test diff
cd "$FIXTURE_DIR" && git init && git add -A && git commit -m "baseline"
# Apply test changes (uncommitted) so the model sees a real diff
For skills that are purely text-processing (no file dispatch): inline prompt with prepended variant content is sufficient — skip fixture setup.
Dispatch @spine:run-skill-eval:designer per eval unit. Dispatch receives:
unit_path — eval unit file pathunit_type — skill | agent | instructioncraft_findings_path — .scratch/<session>/optimize/<unit>/craft-findings.mdbaseline_path — .scratch/<session>/baselines/<path>working_copy_path — current file pathbenchmark_dir — .scratch/<session>/benchmark/<unit>/Output:
<benchmark_dir>/evals.json — quality eval test cases (IDs must be sequential integers)<benchmark_dir>/trigger-evals.json — trigger eval set for description optimizationGate: verify evals.length > 0 after dispatch. Zero evals → fail with diagnostic.
Dispatch @spine:run-skill-eval:runner per eval unit. Each receives:
evals_json_path — <benchmark_dir>/evals.jsonbenchmark_dir — .scratch/<session>/benchmark/<unit>/variant_files — list of variant file paths (baseline, working, variation-*)fixture_dir — .scratch/<session>/eval-fixtures/<unit>/eval_mode — tool-dependent or text-onlyOutput layout:
<benchmark_dir>/
eval-<id>/
eval_metadata.json
<variant>/
run-1/
eval_metadata.json
outputs/
output.jsonl
metrics.json
timing.json
Calibration output goes to .scratch/<session>/calibration/<unit>/ — separate from benchmark dir. Main thread reads calibration-result.json (pass/fail gate) and metrics.json per run — never raw stream-json.
--append-system-prompt (cross-cutting isolation)Dispatch @spine:run-skill-eval:grader per eval unit:
benchmark_dir — .scratch/<session>/benchmark/<unit>/ (grader iterates the hierarchy itself)calibration_result_path — .scratch/<session>/calibration/<unit>/calibration-result.jsonexpectations — assertions from evals.jsonGrader checks calibration gate, grades each variant against expectations using both text output and metrics.json, writes grading.json per variant + grading-summary.md per unit.
Orchestrator reads summaries only — never full grading.json (context management).
Run aggregation per unit (uses $SKILL_CREATOR_DIR from Step 0):
python3 "$SKILL_CREATOR_DIR/scripts/aggregate_benchmark.py" \
.scratch/<session>/benchmark/<unit>/ \
--skill-name <unit-name>
Produces benchmark.json comparing all variants: pass rates, timing, token usage.
Winning variant per unit: highest pass rate → lowest tokens as tiebreak.
When always-loaded files changed, isolate instruction impact:
Present choice BEFORE generating reports: "Review per-run outputs (generate_review.py) or skip to comparison dashboard (@visualizer)?"
python3 "$SKILL_CREATOR_DIR/eval-viewer/generate_review.py" \
.scratch/<session>/benchmark/<unit>/ \
--skill-name <unit-name> \
--benchmark .scratch/<session>/benchmark/<unit>/benchmark.json \
--static .scratch/<session>/benchmark/<unit>/review.html
For iterations: add --previous-workspace .scratch/<session>/benchmark/<unit>/ (or iteration-<N-1>/ for N>2).
Dispatch @visualizer: comparison dashboard — variant x unit pass rates, token delta vs baseline, winning variant highlighted, per-unit assertion breakdown, craft-review findings. Data: [grading-summary.md, benchmark.json paths]. Output: .scratch/<session>/optimize-report.html (iteration N: iteration-<N>/optimize-report.html).
.scratch/<session>/optimize-report.html — opened in browser.
Even when all variants score identically: list every file checked, assertion count, variants compared, baseline commit. Never omit the comparison table.
Triggered by: explicit user request, iteration plateau, or post-final acceptance.
Preflight checks (all blocking):
installed_plugins.json (same resolution as aggregation)trigger-evals.json exists at <benchmark_dir>/trigger-evals.json with >= 2 entries--- frontmatterpython3 -c "import anthropic" succeedsANTHROPIC_API_KEY env var setcd "$SKILL_CREATOR_DIR" && python3 -m scripts.run_loop \
--eval-set <trigger-evals-path> \
--skill-path <unit-dir> \
--model ${model:-sonnet} \
--max-iterations 5 \
--verbose \
--results-dir .scratch/<session>/description-opt/
run_loop.py creates a timestamped subdirectory under --results-dir. Discover it, then read results.json within → present best_description to user → apply if approved.
After user reviews the report:
--previous iteration for comparison.scratch/<session>/benchmark/<unit>/iteration-<N>/.scratch/<session>/benchmark/<unit>/ directly — iteration-1/ never existsFor iterations using generate_review.py: add --previous-workspace pointing to the prior iteration directory.
skill-creator@claude-plugins-official — contracts validated against d5c15b861cd2 (resolved at runtime via installed_plugins.json)git diff alone without git status --porcelain for untracked filesgrading.json in orchestrator — summaries only (context management)@visualizerdeclare -A or other bash 4+ features — macOS ships bash 3.2; use #!/bin/sh and per-variant files@visualizer subagent@visualizer dispatch — subagent reads them directly<session> placeholder in @visualizer dispatch — inject the literal resolved session path and enumerated unit paths@visualizer without scoping output_path to iteration-<N>/ for iterated runs@spine:run-skill-eval:runner, read metrics.json summaries onlyrun_loop.py — it expects [{query, should_trigger}] format (trigger-evals.json)aggregate_benchmark.py expects integer directory namesrun_loop.py without ANTHROPIC_API_KEY env var setrun_loop.py without cd to skill-creator dir firstfind to locate skill-creator instead of installed_plugins.jsonrun_loop.py without verifying SKILL.md --- frontmatter existsaggregate_benchmark.py would include themoutputs/ directory in run layout — generate_review.py requires outputs/output.jsonlbenchmark.md as authoritative for N>2 configs — use benchmark.jsontesting
Use when: 'clawpatch', 'clawpatch campaign', 'clawpatch review fix revalidate'.
tools
Use when: 'create a worktree', 'git worktree', 'parallel branch'.
tools
Use when: 'session state', 'resume work', 'worktree session'.
development
Use when: 'goal prompt', 'frame this', 'scope this', 'design', 'plan the approach', 'implement and review', 'just ship it', 'fix this', 'add this'.