Skill Tester

Design tests, run them, grade the results, evaluate description triggering, and produce an actionable report, all in one workflow with no separate test-design step.

You are team lead, test designer, user actor, and analyst — all in one.

Phase 1: Understand & Design

1a. Read the target skill

User provides skill name or path
Read the target skill's SKILL.md + ALL referenced files completely
Map out:
- Skill type: procedural / informational
- Input: what does the skill expect? (user message, task file, structured data)
- Output: what should the skill produce? (files, messages, actions, decisions)
- Phases: list all phases/steps with their checkpoints
- References: list all files the skill tells agents to read
- Decision points: where does the skill branch based on input?
- Dialogue points: where does the skill ask the user questions?

1b. Design test prompts

Design test prompts applying criteria from test-design-guide.md (realistic prompts, assertion design, persona setup):

Propose 2-3 test prompts:
- 1 happy-path: the most common, standard use case
- 1-2 edge cases: where the skill might break or behave unexpectedly
For each prompt, propose assertions — binary, observable checks. Categories:
- [Process]: Did the agent follow the skill's workflow?
- [Outcome]: Is the result correct and complete?
- [Compliance]: Did the agent obey all skill instructions?

1c. Design trigger eval queries

Design trigger eval queries applying patterns from trigger-eval-guide.md:

Generate 15-20 trigger eval queries:
- 8-10 should-trigger: varied phrasings of the skill's intended use
- 8-10 should-not-trigger: near-misses that share keywords but need something different
These will be used in Phase 4 to test description accuracy

1d. Confirm with user

Present the full test plan:

Test prompts with assertions
Trigger eval queries
Proposed model for runners
Persona (use default, modify only if user requests)

"Here are the test cases and trigger queries I've prepared. Do these look right, or do you want to adjust anything?"

Wait for confirmation before proceeding.

Checkpoint: User confirmed test plan. All prompts have assertions. Trigger eval queries prepared.

Phase 2: Execute Tests

2a. Setup

Create workspace: ~/.claude/skill-tests/{skill-name}/iteration-{N}/ (N = 1 for first run, increment for re-runs)
Save test plan to workspace:
- evals.json with all prompts and assertions
- trigger-evals.json with all trigger queries
TeamCreate(team_name="skill-test-{skill-name}")
Plan runners: per scenario = 2 with-skill + 1 baseline without skill

Show plan to user: "I'll run {scenarios} scenarios, {runners} runners total. Model: {model}. Proceed?" (These counts are separate from the iteration number N used in the workspace path.)
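
The setup arithmetic can be sketched as follows: pick the next iteration directory and count runners at three per scenario. The function names and the `root` parameter are hypothetical, and scanning existing `iteration-*` folders is one assumed way to determine N.

```python
from pathlib import Path

WITH_SKILL_RUNNERS = 2  # per scenario, as stated above
BASELINE_RUNNERS = 1    # per scenario, as stated above


def iteration_dir(skill_name, root=None):
    """Return the next iteration-{N} path for a skill (N starts at 1)."""
    if root is None:
        root = Path.home() / ".claude" / "skill-tests"
    skill_dir = Path(root) / skill_name
    existing = []
    if skill_dir.is_dir():
        for child in skill_dir.iterdir():
            prefix, _, number = child.name.partition("-")
            if child.is_dir() and prefix == "iteration" and number.isdigit():
                existing.append(int(number))
    return skill_dir / f"iteration-{max(existing, default=0) + 1}"


def total_runners(scenario_count):
    """Total runners across all scenarios."""
    return scenario_count * (WITH_SKILL_RUNNERS + BASELINE_RUNNERS)
```

For example, three scenarios give `total_runners(3) == 9`.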

2b. Spawn runners

For each scenario, spawn all runners in parallel:

With-skill runners (2 per scenario):

Prompt = scenario's task prompt (natural, as user would write)
Each runner loads the tested skill: Skill(skill="{tested-skill-name}")
Model: as confirmed with user
Use run_in_background: true

Baseline runner (1 per scenario):

Same task prompt, same model
Receives no skill to load
Use run_in_background: true

Save each runner's task_id — needed for grader agents to retrieve transcripts.

Scenarios run sequentially. Runners within a scenario run in parallel.

2c. Interact as user persona

If runners send questions, answer in character per the scenario's persona. Rules:

Stay in character: answer as the user would
Be consistent: same question from different runners → same answer
Answer naturally — without guidance toward any specific behavior
Keep conversation purely about the task itself
Baseline runner may ask different questions (no skill to guide it) — this is expected, answer them too

2d. Capture timing data

When each runner completes, immediately save timing data:

{
  "total_tokens": 84852,
  "duration_ms": 23332,
  "total_duration_seconds": 23.3
}

This is the only opportunity to capture this data — it comes through the task notification and isn't persisted elsewhere. Process each notification as it arrives rather than trying to batch them.

Save to timing.json in each runner's result directory.
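
A minimal sketch of persisting the timing record as soon as a runner completes. The function `save_timing` is hypothetical, and it assumes the token count and duration have already been read out of the task notification, whose exact shape is not specified here.

```python
import json
from pathlib import Path


def save_timing(result_dir, total_tokens, duration_ms):
    """Write timing.json into a runner's result directory and return its path."""
    timing = {
        "total_tokens": total_tokens,
        "duration_ms": duration_ms,
        # Derived field, rounded to one decimal as in the example above.
        "total_duration_seconds": round(duration_ms / 1000, 1),
    }
    result_dir = Path(result_dir)
    result_dir.mkdir(parents=True, exist_ok=True)
    path = result_dir / "timing.json"
    path.write_text(json.dumps(timing, indent=2))
    return path
```

Calling this once per notification, rather than collecting notifications for later, matches the advice above not to batch.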

Checkpoint: All runners completed. Timing data captured.

Phase 3: Grade & Analyze

3a. Grade via grader agents

When all runners for a scenario finish, spawn one grader agent per runner. Delegate transcript analysis to the graders: transcripts are large, and reading them directly would exhaust the lead agent's context, leaving no room for report compilation.

Each grader receives instructions from grading-guide.md and:

The runner's task_id (grader calls TaskOutput(task_id) for transcript)
The scenario's assertions (copy the criteria list into the prompt)
The skill's SKILL.md path (grader reads it for compliance check)
Whether this is a skill-runner or baseline

Spawn all graders in parallel. Wait for all to return.

3b. Compile results per scenario

Using only grader outputs (not transcripts):

Build results table (assertions × runners)
Cross-runner consistency: where did skill-runners diverge?
- Divergence on a criterion usually points to an ambiguous instruction in the skill
Baseline comparison:
- Passed by skill-runners ONLY → skill adds value
- Passed by ALL → criterion too easy or skill doesn't help here
- Failed by ALL → criterion may be unrealistic
- Passed by baseline ONLY → skill might be harmful for this case
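
The baseline comparison above amounts to a small decision table. The sketch below is one way to encode it; the function name is hypothetical, and reporting mixed skill-runner results as a divergence (instead of forcing them into one of the four cases) is an assumption.

```python
def compare_to_baseline(skill_passes, baseline_pass):
    """Classify one assertion.

    skill_passes: list of bools, one per with-skill runner.
    baseline_pass: bool for the baseline runner.
    """
    all_skill = all(skill_passes)
    none_skill = not any(skill_passes)
    if all_skill and not baseline_pass:
        return "skill adds value"
    if all_skill and baseline_pass:
        return "criterion too easy or skill doesn't help here"
    if none_skill and not baseline_pass:
        return "criterion may be unrealistic"
    if none_skill and baseline_pass:
        return "skill might be harmful for this case"
    # Mixed results: handled by the cross-runner consistency check.
    return "skill-runners diverged"
```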

Clean up runners for this scenario before moving to the next one.

3c. Benchmark aggregation

Across all scenarios, compute:

Pass rate per assertion per config (with-skill / baseline)
Timing comparison: tokens and duration per config
Overall skill value: how many assertions improve vs baseline
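
As a rough sketch of the aggregation, pass rates can be computed from flat grading records. The record shape `(assertion_id, config, passed)` and the config labels `"with-skill"` and `"baseline"` are assumptions for illustration.

```python
from collections import defaultdict


def pass_rates(results):
    """Map (assertion_id, config) to its pass rate.

    results: iterable of (assertion_id, config, passed) tuples.
    """
    counts = defaultdict(lambda: [0, 0])  # key -> [passed, total]
    for assertion, config, passed in results:
        counts[(assertion, config)][0] += int(passed)
        counts[(assertion, config)][1] += 1
    return {key: p / t for key, (p, t) in counts.items()}


def improved_assertions(rates):
    """Assertions whose with-skill pass rate beats the baseline pass rate."""
    assertions = {assertion for assertion, _ in rates}
    return sorted(
        a for a in assertions
        if rates.get((a, "with-skill"), 0.0) > rates.get((a, "baseline"), 0.0)
    )
```

The length of `improved_assertions(...)` gives the "overall skill value" count.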

3d. Analyst pass

Surface patterns the aggregate stats might hide:

Non-discriminating assertions: pass regardless of whether skill is used. These don't prove the skill helps — consider removing or replacing with harder assertions.
High-variance assertions: one skill-runner passes, other fails on same criterion. Usually means the skill's instruction is ambiguous — identify the specific instruction and quote it.
Time/token tradeoffs: skill adds value but costs 2x tokens? Flag it. The user should know the cost of improvement.
Repeated code in transcripts: if multiple runners independently wrote similar helper scripts, flag this as a candidate for bundling in the skill's scripts/ directory.
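
Three of these checks are mechanical and can be sketched directly. The function names and input shapes below are hypothetical; finding the ambiguous instruction behind a high-variance assertion still requires reading the skill.

```python
from statistics import mean


def non_discriminating(skill_results, baseline_results):
    """Assertions every with-skill runner passes and the baseline passes too.

    skill_results: {assertion_id: [bool per with-skill runner]}
    baseline_results: {assertion_id: bool}
    """
    return sorted(
        a for a, passes in skill_results.items()
        if all(passes) and baseline_results.get(a, False)
    )


def high_variance(skill_results):
    """Assertions where with-skill runners disagree with each other."""
    return sorted(
        a for a, passes in skill_results.items()
        if any(passes) and not all(passes)
    )


def token_cost_ratio(with_skill_tokens, baseline_tokens):
    """Mean with-skill token use divided by mean baseline token use."""
    return mean(with_skill_tokens) / mean(baseline_tokens)
```

A ratio of 2.0 or more corresponds to the "costs 2x tokens" case that should be flagged to the user.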

Checkpoint: All scenarios graded. Benchmark computed. Analyst observations recorded.

Phase 4: Test Description Triggering

4a. Evaluate trigger accuracy

For each trigger eval query from trigger-evals.json:

Assess whether the skill's current description would cause Claude to invoke the skill for this query
Consider: does the query's intent match the description's keywords and contexts? Would Claude see this as the skill's domain?

Categorize each query:

True positive: should-trigger → would trigger
True negative: should-not-trigger → would not trigger
False negative: should-trigger → would NOT trigger (undertriggering)
False positive: should-not-trigger → would trigger (overtriggering)

4b. Calculate trigger accuracy

Trigger accuracy = (true positives + true negatives) / total queries
False negative rate = false negatives / should-trigger queries
False positive rate = false positives / should-not-trigger queries

False negatives (undertriggering) are the costlier error: the skill never fires, so users won't discover it exists. False positives waste time but are less harmful.
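
The three formulas above can be computed from a list of per-query judgments. This is a minimal sketch; the function name and the `(should_trigger, would_trigger)` pair representation are assumptions.

```python
def trigger_metrics(queries):
    """Compute trigger accuracy and error rates.

    queries: list of (should_trigger, would_trigger) bool pairs.
    """
    tp = sum(1 for should, would in queries if should and would)
    tn = sum(1 for should, would in queries if not should and not would)
    fn = sum(1 for should, would in queries if should and not would)
    fp = sum(1 for should, would in queries if not should and would)
    total = len(queries)
    return {
        "accuracy": (tp + tn) / total if total else 0.0,
        # Denominator: all should-trigger queries.
        "false_negative_rate": fn / (tp + fn) if (tp + fn) else 0.0,
        # Denominator: all should-not-trigger queries.
        "false_positive_rate": fp / (fp + tn) if (fp + tn) else 0.0,
    }
```

For example, with 10 should-trigger queries of which 8 would trigger, and 10 should-not-trigger queries of which 1 would trigger, accuracy is 17/20 = 0.85, the false negative rate is 2/10 = 0.2, and the false positive rate is 1/10 = 0.1.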

4c. Suggest improved description

If trigger accuracy < 85% or false negative rate > 20%:

Analyze which queries fail and why
Draft an improved description that would trigger correctly
Show before/after comparison in the report
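
The trigger condition for this step is a pair of strict inequalities, sketched below with a hypothetical function name. Values exactly at 85% accuracy or a 20% false negative rate do not trigger a rewrite.

```python
ACCURACY_THRESHOLD = 0.85        # from the condition above
FALSE_NEGATIVE_THRESHOLD = 0.20  # from the condition above


def needs_new_description(accuracy, false_negative_rate):
    """True when the description should be redrafted."""
    return (
        accuracy < ACCURACY_THRESHOLD
        or false_negative_rate > FALSE_NEGATIVE_THRESHOLD
    )
```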

Checkpoint: Trigger accuracy calculated. Description improvement suggested if needed.

Phase 5: Report

Structure the report according to report-template.md.

The report includes:

Results per scenario — assertions × runners table with evidence
Skill compliance — phase-by-phase execution check
Benchmark summary — pass_rate, tokens, time per config
Analyst observations — non-discriminating, high-variance, cost analysis
Description trigger accuracy — accuracy metrics + suggested improvement
Scripts to bundle — if repeated code found across transcripts
Recommendations — priority-ordered specific fixes for skill-master

Save to: ~/.claude/skill-tests/{skill-name}/reports/{timestamp}-report.md

Show report to user: "Here's the test report. Key findings: [summary]. The report is at [path] — you can share it with skill-master to apply fixes."

TeamDelete after report delivery.

Improving the Skill (Iteration)

If the user wants to iterate after receiving the report:

User (or skill-master) applies fixes to the skill
Run skill-tester again → results go to iteration-{N+1}/
Previous iteration results are available for comparison
Report shows delta: what improved, what regressed
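
The delta between two iterations can be sketched as a comparison of per-assertion pass rates. The function name and the `{assertion_id: pass_rate}` input shape are hypothetical; assertions present in only one iteration are ignored here, which is an assumption.

```python
def iteration_delta(previous, current):
    """Compare pass rates between two iterations.

    previous, current: {assertion_id: pass rate between 0.0 and 1.0}
    """
    shared = set(previous) & set(current)
    return {
        "improved": sorted(a for a in shared if current[a] > previous[a]),
        "regressed": sorted(a for a in shared if current[a] < previous[a]),
        "unchanged": sorted(a for a in shared if current[a] == previous[a]),
    }
```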

When iterating, keep these principles in mind:

Generalize from feedback: resist fiddly changes targeted at specific test cases. If a skill works only for its test cases, it's useless at scale.
Keep the prompt lean: read transcripts. If the skill makes the model waste time doing unproductive things, remove those parts.
Explain the why: rather than adding rigid ALWAYS/NEVER rules, explain reasoning so the model understands the intent.

Self-Verification

[ ] Target skill fully read (SKILL.md + all references)
[ ] All scenarios executed (2 skill-runners + 1 baseline each)
[ ] Grader agents used for transcript analysis (not read by lead directly)
[ ] Every assertion graded with cited evidence from tool call transcripts
[ ] Skill compliance checked per runner (phase-by-phase)
[ ] Baseline comparison completed per assertion
[ ] Benchmark aggregated (pass_rate, tokens, time)
[ ] Analyst pass completed (non-discriminating, high-variance, cost, repeated code)
[ ] Trigger eval queries tested against description
[ ] Description improvement suggested if accuracy < 85% or false negative rate > 20%
[ ] Report saved and shown to user
[ ] Team deleted after report delivery

pavel-molyanov/skill-tester
