Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

laurigates/evaluate-skill

Name: evaluate-skill
Author: laurigates

evaluate-plugin/skills/evaluate-skill/SKILL.md

npx skillsauth add laurigates/claude-plugins evaluate-skill

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

/evaluate:skill

Evaluate a skill's effectiveness by running behavioral test cases and grading the results against assertions.

When to Use This Skill

| Use this skill when... | Use alternative when... | |------------------------|------------------------| | Want to test if a skill produces correct results | Need structural validation -> scripts/plugin-compliance-check.sh | | Validating skill improvements before merging | Want to file feedback about a session -> /feedback:session | | Benchmarking a skill against a baseline | Need to check skill freshness -> /health:audit | | Creating eval cases for a new skill | Want to review code quality -> /code-review |

Context

Available skills: !find . -path '*/skills/*' -name 'SKILL.md' -not -path '*/.claude/worktrees/*'

Parameters

Parse these from $ARGUMENTS:

| Parameter | Default | Description | |-----------|---------|-------------| | <plugin/skill-name> | required | Path as plugin-name/skill-name | | --create-evals | false | Generate eval cases if none exist | | --runs N | 1 | Number of runs per eval case | | --baseline | false | Also run without skill for comparison |

Execution

Step 1: Resolve skill path

Parse $ARGUMENTS to extract <plugin-name> and <skill-name>. The skill file lives at:

<plugin-name>/skills/<skill-name>/SKILL.md

Read the SKILL.md to confirm it exists and understand what the skill does.

Step 2: Run structural pre-check

Run the compliance check to confirm the skill passes basic structural validation:

bash scripts/plugin-compliance-check.sh <plugin-name>

If structural issues are found, report them and stop. Behavioral evaluation on a structurally broken skill is wasted effort.

Step 3: Load or create eval cases

Look for <plugin-name>/skills/<skill-name>/evals.json.

If the file exists: read and validate it against the evals.json schema (see evaluate-plugin/references/schemas.md). Accept the optional, back-compatible evals[].fixture block (dir / setup / teardown / workdir) — an eval without it is unchanged; an eval with it needs an isolated execution context (Step 4).

If the file does not exist AND --create-evals is set: Analyze the SKILL.md and generate eval cases:

Read the skill thoroughly — understand its purpose, parameters, execution steps, and expected behaviors.
Generate 3-5 eval cases covering:
- Happy path: Standard usage that should work correctly
- Edge case: Unusual but valid inputs
- Boundary: Inputs that test the limits of the skill's scope
For each eval case, write:
- id: Unique identifier (e.g., eval-001)
- description: What this test validates
- prompt: The user prompt to simulate
- expectations: List of assertion strings the output should satisfy
- tags: Categorization tags
Write the generated cases to <plugin-name>/skills/<skill-name>/evals.json.

If the file does not exist AND --create-evals is NOT set: Report that no eval cases exist and suggest running with --create-evals.

Step 4: Run evaluations

For each eval case, for each run (up to --runs N):

Scaffold the run directory and record the start time by running:

bash ${CLAUDE_PLUGIN_ROOT}/scripts/prepare_run.sh \
  --skill-dir <plugin-name>/skills/<skill-name> \
  --eval-id <eval-id> --run <N>

Parse RUN_DIR=, MANIFEST=, and STARTED_AT= from output.

If the eval carries a fixture block, apply it to get an isolated workdir:
```
bash ${CLAUDE_PLUGIN_ROOT}/scripts/apply_fixture.sh \
  --fixture '<eval.fixture JSON>' --repo-root "$(pwd)"
```
Parse WORKDIR= (the subagent then operates there). Skip this for evals without a fixture — they run in the repo as before.
Spawn a Task subagent (subagent_type: general-purpose) that:
- Receives the skill content as context
- Executes the eval prompt
- Works in $WORKDIR if a fixture was applied, else in the repository
Capture the subagent output.
Record timing data (duration) and write to $RUN_DIR/timing.json.
Write the transcript to $RUN_DIR/transcript.md.
If a fixture was applied, tear it down after the transcript is copied out: bash ${CLAUDE_PLUGIN_ROOT}/scripts/apply_fixture.sh --teardown "$WORKDIR" --fixture '<eval.fixture JSON>'.

Step 5: Run baseline (if --baseline)

If --baseline is set, repeat Step 4 but without loading the skill content. Pass --baseline to prepare_run.sh so results are written into a parallel baseline/ subdirectory. This creates a comparison point to measure skill effectiveness.

Use the same eval prompts and record results in the baseline/ subdirectory.

Step 6: Grade results

For each run, delegate grading to the eval-grader agent via Task:

Task subagent_type: eval-grader
Prompt: Grade this eval run against the assertions.
  Eval case: <eval case from evals.json>
  Transcript: <path to transcript.md>
  Output artifacts: <list of created/modified files>

The grader produces grading.json for each run.

Step 7: Aggregate and report

Compute aggregate statistics across all runs:

Mean pass rate (assertions passed / total assertions)
Standard deviation of pass rate
Mean duration

If --baseline was used, also compute:

Baseline mean pass rate
Delta (improvement from skill)

Write aggregated results to <plugin-name>/skills/<skill-name>/eval-results/benchmark.json.

Print a summary table:

## Evaluation Results: <plugin/skill-name>

| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | 85% | 42% | +43% |
| Duration | 14s | 12s | +2s |
| Runs | 3 | 3 | — |

### Per-Eval Breakdown

| Eval | Description | Pass Rate | Status |
|------|-------------|-----------|--------|
| eval-001 | Basic usage | 100% | PASS |
| eval-002 | Edge case | 67% | PARTIAL |
| eval-003 | Boundary | 100% | PASS |

Agentic Optimizations

| Context | Command | |---------|---------| | Inspect skill eval setup | bash evaluate-plugin/scripts/inspect_eval.sh --plugin <plugin> --skill <skill> | | Print evals JSON | bash evaluate-plugin/scripts/inspect_eval.sh --plugin <plugin> --skill <skill> --print-evals | | Prepare a run directory | bash evaluate-plugin/scripts/prepare_run.sh --skill-dir <plugin>/skills/<skill> --eval-id <id> --run <N> | | Aggregate results | bash evaluate-plugin/scripts/aggregate_benchmark.sh <plugin> |

Quick Reference

| Flag | Description | |------|-------------| | --create-evals | Generate eval cases from SKILL.md analysis | | --runs N | Number of runs per eval case (default: 1) | | --baseline | Run without skill for comparison |

laurigates/evaluate-skill

evaluate-plugin/skills/evaluate-skill/SKILL.md

Evaluate a skill by running test cases and grading results. Use when testing whether a skill produces correct guidance, validating improvements, or benchmarking before release.

46 stars

testing

Updated Jul 19, 2026

$ install --global

skillsauth

npx skillsauth add laurigates/claude-plugins evaluate-skill

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Jul 19, 2026, 7:03 AM170.1s1 file scanned

SKILL.md

name:: evaluate-skill
description:: Evaluate a skill by running test cases and grading results. Use when testing whether a skill produces correct guidance, validating improvements, or benchmarking before release.
args:: <plugin/skill-name> [--create-evals] [--runs N] [--baseline]
allowed-tools:: Task, Read, Write, Edit, Glob, Grep, Bash(bash *), TodoWrite
argument-hint:: git-plugin/git-commit [--create-evals] [--runs 3] [--baseline]
agent:: general-purpose
context:: fork
created:: 2026-03-04
modified:: 2026-07-18
compatibility:: claude-code
reviewed:: 2026-03-04

/evaluate:skill

Evaluate a skill's effectiveness by running behavioral test cases and grading the results against assertions.

When to Use This Skill

Context

Available skills: !find . -path '*/skills/*' -name 'SKILL.md' -not -path '*/.claude/worktrees/*'

Parameters

Parse these from $ARGUMENTS:

Execution

Step 1: Resolve skill path

Parse $ARGUMENTS to extract <plugin-name> and <skill-name>. The skill file lives at:

<plugin-name>/skills/<skill-name>/SKILL.md

Read the SKILL.md to confirm it exists and understand what the skill does.

Step 2: Run structural pre-check

Run the compliance check to confirm the skill passes basic structural validation:

bash scripts/plugin-compliance-check.sh <plugin-name>

If structural issues are found, report them and stop. Behavioral evaluation on a structurally broken skill is wasted effort.

Step 3: Load or create eval cases

Look for <plugin-name>/skills/<skill-name>/evals.json.

If the file does not exist AND --create-evals is set: Analyze the SKILL.md and generate eval cases:

Read the skill thoroughly — understand its purpose, parameters, execution steps, and expected behaviors.
Generate 3-5 eval cases covering:
- Happy path: Standard usage that should work correctly
- Edge case: Unusual but valid inputs
- Boundary: Inputs that test the limits of the skill's scope
For each eval case, write:
- id: Unique identifier (e.g., eval-001)
- description: What this test validates
- prompt: The user prompt to simulate
- expectations: List of assertion strings the output should satisfy
- tags: Categorization tags
Write the generated cases to <plugin-name>/skills/<skill-name>/evals.json.

If the file does not exist AND --create-evals is NOT set: Report that no eval cases exist and suggest running with --create-evals.

Step 4: Run evaluations

For each eval case, for each run (up to --runs N):

Scaffold the run directory and record the start time by running:

bash ${CLAUDE_PLUGIN_ROOT}/scripts/prepare_run.sh \
  --skill-dir <plugin-name>/skills/<skill-name> \
  --eval-id <eval-id> --run <N>

Parse RUN_DIR=, MANIFEST=, and STARTED_AT= from output.

If the eval carries a fixture block, apply it to get an isolated workdir:
```
bash ${CLAUDE_PLUGIN_ROOT}/scripts/apply_fixture.sh \
  --fixture '<eval.fixture JSON>' --repo-root "$(pwd)"
```
Parse WORKDIR= (the subagent then operates there). Skip this for evals without a fixture — they run in the repo as before.
Spawn a Task subagent (subagent_type: general-purpose) that:
- Receives the skill content as context
- Executes the eval prompt
- Works in $WORKDIR if a fixture was applied, else in the repository
Capture the subagent output.
Record timing data (duration) and write to $RUN_DIR/timing.json.
Write the transcript to $RUN_DIR/transcript.md.
If a fixture was applied, tear it down after the transcript is copied out: bash ${CLAUDE_PLUGIN_ROOT}/scripts/apply_fixture.sh --teardown "$WORKDIR" --fixture '<eval.fixture JSON>'.

Step 5: Run baseline (if --baseline)

Use the same eval prompts and record results in the baseline/ subdirectory.

Step 6: Grade results

For each run, delegate grading to the eval-grader agent via Task:

Task subagent_type: eval-grader
Prompt: Grade this eval run against the assertions.
  Eval case: <eval case from evals.json>
  Transcript: <path to transcript.md>
  Output artifacts: <list of created/modified files>

The grader produces grading.json for each run.

Step 7: Aggregate and report

Compute aggregate statistics across all runs:

Mean pass rate (assertions passed / total assertions)
Standard deviation of pass rate
Mean duration

If --baseline was used, also compute:

Baseline mean pass rate
Delta (improvement from skill)

Write aggregated results to <plugin-name>/skills/<skill-name>/eval-results/benchmark.json.

Print a summary table:

## Evaluation Results: <plugin/skill-name>

| Metric | With Skill | Baseline | Delta |
|--------|-----------|----------|-------|
| Pass Rate | 85% | 42% | +43% |
| Duration | 14s | 12s | +2s |
| Runs | 3 | 3 | — |

### Per-Eval Breakdown

| Eval | Description | Pass Rate | Status |
|------|-------------|-----------|--------|
| eval-001 | Basic usage | 100% | PASS |
| eval-002 | Edge case | 67% | PARTIAL |
| eval-003 | Boundary | 100% | PASS |

Agentic Optimizations

Quick Reference

Related Skills

laurigates/quoted-description

development

VerifiedTrustedCommunity

Debug HTTP APIs: trace requests, inspect headers. Use when a request fails: check status first.

46SKILL.mdUpdated Jul 20, 2026

laurigates/quoted-description

laurigates/folded-description

documentation

VerifiedTrustedCommunity

Render architecture diagrams from text sources. Use when documenting system topology.

46SKILL.mdUpdated Jul 20, 2026

laurigates/folded-description

laurigates/adapters/tests/fixtures/mini-marketplace-src/beta-plugin/skilldirs/no-name

tools

VerifiedTrustedCommunity

Inspect JSON payloads and extract nested fields. Use when parsing API responses.

46SKILL.mdUpdated Jul 20, 2026

laurigates/adapters/tests/fixtures/mini-marketplace-src/beta-plugin/skilldirs/no-name

laurigates/no-description

tools

VerifiedTrustedCommunity

--- name: no-description allowed-tools: Read --- # No Description This skill has no description and must be dropped with a warning.

46SKILL.mdUpdated Jul 20, 2026

laurigates/no-description

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/laurigates/claude-plugins.git

# Copy into Claude Code skills folder (global)
cp -r claude-plugins/evaluate-plugin/skills/evaluate-skill ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

laurigates/claude-plugins

46 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT