evaluate-plugin/skills/evaluate-skill/SKILL.md
Evaluate a skill by running test cases and grading results. Use when testing whether a skill produces correct guidance, validating improvements, or benchmarking before release.
npx skillsauth add laurigates/claude-plugins evaluate-skillInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Evaluate a skill's effectiveness by running behavioral test cases and grading the results against assertions.
| Use this skill when... | Use alternative when... |
|------------------------|------------------------|
| Want to test if a skill produces correct results | Need structural validation -> scripts/plugin-compliance-check.sh |
| Validating skill improvements before merging | Want to file feedback about a session -> /feedback:session |
| Benchmarking a skill against a baseline | Need to check skill freshness -> /health:audit |
| Creating eval cases for a new skill | Want to review code quality -> /code-review |
bash ${CLAUDE_PLUGIN_ROOT}/scripts/inspect_eval.sh --plugin-dir $1Parse these from $ARGUMENTS:
| Parameter | Default | Description |
|-----------|---------|-------------|
| <plugin/skill-name> | required | Path as plugin-name/skill-name |
| --create-evals | false | Generate eval cases if none exist |
| --runs N | 1 | Number of runs per eval case |
| --baseline | false | Also run without skill for comparison |
Parse $ARGUMENTS to extract <plugin-name> and <skill-name>. The skill file lives at:
<plugin-name>/skills/<skill-name>/SKILL.md
Read the SKILL.md to confirm it exists and understand what the skill does.
Run the compliance check to confirm the skill passes basic structural validation:
bash scripts/plugin-compliance-check.sh <plugin-name>
If structural issues are found, report them and stop. Behavioral evaluation on a structurally broken skill is wasted effort.
Look for <plugin-name>/skills/<skill-name>/evals.json.
If the file exists: read and validate it against the evals.json schema (see evaluate-plugin/references/schemas.md).
If the file does not exist AND --create-evals is set: Analyze the SKILL.md and generate eval cases:
id: Unique identifier (e.g., eval-001)description: What this test validatesprompt: The user prompt to simulateexpectations: List of assertion strings the output should satisfytags: Categorization tags<plugin-name>/skills/<skill-name>/evals.json.If the file does not exist AND --create-evals is NOT set: Report that no eval cases exist and suggest running with --create-evals.
For each eval case, for each run (up to --runs N):
bash ${CLAUDE_PLUGIN_ROOT}/scripts/prepare_run.sh \
--skill-dir <plugin-name>/skills/<skill-name> \
--eval-id <eval-id> --run <N>
Parse RUN_DIR=, MANIFEST=, and STARTED_AT= from output.subagent_type: general-purpose) that:
$RUN_DIR/timing.json.$RUN_DIR/transcript.md.If --baseline is set, repeat Step 4 but without loading the skill content. Pass --baseline to prepare_run.sh so results are written into a parallel baseline/ subdirectory. This creates a comparison point to measure skill effectiveness.
Use the same eval prompts and record results in the baseline/ subdirectory.
For each run, delegate grading to the eval-grader agent via Task:
Task subagent_type: eval-grader
Prompt: Grade this eval run against the assertions.
Eval case: <eval case from evals.json>
Transcript: <path to transcript.md>
Output artifacts: <list of created/modified files>
The grader produces grading.json for each run.
Compute aggregate statistics across all runs:
If --baseline was used, also compute:
Write aggregated results to <plugin-name>/skills/<skill-name>/eval-results/benchmark.json.
Print a summary table:
## Evaluation Results: <plugin/skill-name>
| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | 85% | 42% | +43% |
| Duration | 14s | 12s | +2s |
| Runs | 3 | 3 | — |
### Per-Eval Breakdown
| Eval | Description | Pass Rate | Status |
|------|-------------|-----------|--------|
| eval-001 | Basic usage | 100% | PASS |
| eval-002 | Edge case | 67% | PARTIAL |
| eval-003 | Boundary | 100% | PASS |
| Context | Command |
|---------|---------|
| Inspect skill eval setup | bash evaluate-plugin/scripts/inspect_eval.sh --plugin <plugin> --skill <skill> |
| Print evals JSON | bash evaluate-plugin/scripts/inspect_eval.sh --plugin <plugin> --skill <skill> --print-evals |
| Prepare a run directory | bash evaluate-plugin/scripts/prepare_run.sh --skill-dir <plugin>/skills/<skill> --eval-id <id> --run <N> |
| Aggregate results | bash evaluate-plugin/scripts/aggregate_benchmark.sh <plugin> |
| Flag | Description |
|------|-------------|
| --create-evals | Generate eval cases from SKILL.md analysis |
| --runs N | Number of runs per eval case (default: 1) |
| --baseline | Run without skill for comparison |
tools
Scaffold a new ComfyUI custom-node repo (pyproject, CI, release-please, vitest+pytest, JS extension skeleton) in the picker/gesture vein. Use when bootstrapping or init-ing a comfyui node pack.
tools
Orchestrate a ComfyUI node pack from idea to registry: scaffold, create + seed the repo, open the gitops adoption PR. Use when releasing or spinning up a new comfyui node pack.
testing
macOS EndpointSecurity/EDR high CPU & battery drain. Use when Kandji ESF / XProtect pegs a core; trace the exec storm via powermetrics + eslogger.
development
odiff pixel-by-pixel image diffing. Use when comparing screenshots, detecting visual regressions, diffing before/after PNGs, asserting golden images.