/evaluate:improve

Analyze evaluation results and suggest concrete improvements to a skill.

When to Use This Skill

| Use this skill when... | Use alternative when... |
|------------------------|------------------------|
| Have eval results and want to improve the skill | Need to run evals first -> /evaluate:skill |
| Want to improve skill description for better triggering | Want to view raw results -> /evaluate:report |
| Iterating on a skill to increase pass rate | Want to file a bug -> /feedback:session |
| Optimizing skill instructions after benchmarking | Need structural fixes -> plugin-compliance-check.sh |

Parameters

Parse these from $ARGUMENTS:

| Parameter | Default | Description |
|-----------|---------|-------------|
| <plugin/skill-name> | required | Path as plugin-name/skill-name |
| --apply | false | Apply approved changes to SKILL.md |
| --description-only | false | Focus on description improvements only |
| --best-of N | 1 | Generate N candidate revisions and apply the eval-ranked winner (requires --apply) |
| --force-apply | false | Apply even when the delta-verify gate shows the edit does not shrink the source-failure set (override; requires --apply) |
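
The parameter rules above can be sketched as an argument parser. This is a minimal illustration, not the skill's actual parsing code: the function name `parse_arguments` and the example skill path `git-plugin/git-commit` are hypothetical, and the rule that a bare `--best-of` means N=3 is taken from the Step 5a notes.

```python
import argparse
import shlex


def parse_arguments(raw: str) -> argparse.Namespace:
    """Parse an $ARGUMENTS string such as 'git-plugin/git-commit --apply'."""
    parser = argparse.ArgumentParser(prog="/evaluate:improve")
    parser.add_argument("skill", metavar="<plugin/skill-name>")
    parser.add_argument("--apply", action="store_true")
    parser.add_argument("--description-only", action="store_true")
    # Absent flag -> 1; flag without a number -> 3; flag with a number -> N.
    parser.add_argument("--best-of", type=int, nargs="?", const=3, default=1, metavar="N")
    parser.add_argument("--force-apply", action="store_true")
    args = parser.parse_args(shlex.split(raw))

    if args.best_of < 1:
        raise ValueError("--best-of N must be at least 1")
    # Both flags only make sense on the apply path.
    if (args.best_of > 1 or args.force_apply) and not args.apply:
        raise ValueError("--best-of and --force-apply require --apply")
    return args
```

Putting the positional skill path first keeps a bare `--best-of` from swallowing it as its value.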

Execution

Step 1: Load eval results

Read the most recent benchmark from:

<plugin-name>/skills/<skill-name>/eval-results/benchmark.json

If no results exist, suggest running /evaluate:skill first and stop.

Also read the current SKILL.md to understand the skill.

Capture the source-failure set. From the benchmark, record the set of eval-case IDs that failed with the skill active — these are the cases the forthcoming edit is meant to fix, and they are the input to the delta-verify gate below:

cat <plugin>/skills/<skill>/eval-results/benchmark.json \
  | jq -r '[.cases[] | select(.with_skill.passed == false) | .id]'

This set is distinct from the golden evals.json suite as a whole: the golden set measures overall pass rate, the source-failure set measures whether the edit fixed the specific failures that motivated it (AEGIS delta-verify). If the set is empty (a clean benchmark, or no per-case data), there is nothing for the gate to verify — skip it and proceed.
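
For callers that prefer Python to jq, the same selection can be sketched as below. It assumes the benchmark shape implied by the jq filter (a top-level `cases` array whose entries carry `id` and `with_skill.passed`); the function name `source_failure_set` is hypothetical.

```python
def source_failure_set(benchmark: dict) -> list[str]:
    """Return IDs of cases that failed with the skill active.

    Mirrors the jq filter: only an explicit `passed == false` counts,
    so cases with missing per-case data are not treated as failures.
    """
    failures = []
    for case in benchmark.get("cases", []):
        with_skill = case.get("with_skill") or {}
        if with_skill.get("passed") is False:
            failures.append(case["id"])
    return failures


# Illustrative data only; real benchmarks come from benchmark.json.
example = {
    "cases": [
        {"id": "eval-001", "with_skill": {"passed": True}},
        {"id": "eval-003", "with_skill": {"passed": False}},
        {"id": "eval-004", "with_skill": {}},
    ]
}
failed_ids = source_failure_set(example)  # ["eval-003"]
```

An empty result corresponds to the "nothing for the gate to verify" case described above.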

Step 2: Analyze results

Delegate analysis to the eval-analyzer agent via Task:

Task subagent_type: eval-analyzer
Prompt: Analyze these evaluation results and identify improvement opportunities.
  Skill: <path to SKILL.md>
  Benchmark: <benchmark.json contents>
  Mode: comparison (if baseline data exists) or benchmark (otherwise)

The analyzer produces categorized suggestions:

- instructions: Execution flow improvements
- description: Better intent-matching text
- examples: Missing or insufficient examples
- error_handling: Missing edge cases
- tools: Better tool configurations
- structure: Organizational improvements

Step 3: Filter suggestions

If --description-only is set, keep only the suggestions in the description category.

Sort remaining suggestions by priority (high > medium > low).
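
A minimal sketch of the filter-and-sort step follows. The suggestion fields `category` and `priority` are assumptions about the analyzer's output, not a documented schema, and `filter_suggestions` is a hypothetical helper name.

```python
# Lower rank sorts first: high > medium > low.
PRIORITY_RANK = {"high": 0, "medium": 1, "low": 2}


def filter_suggestions(suggestions: list[dict], description_only: bool = False) -> list[dict]:
    """Optionally keep only description suggestions, then sort by priority."""
    kept = [
        s for s in suggestions
        if not description_only or s["category"] == "description"
    ]
    # sorted() is stable, so suggestions of equal priority keep the analyzer's order.
    return sorted(kept, key=lambda s: PRIORITY_RANK[s["priority"]])


# Illustrative data only.
suggestions = [
    {"category": "structure", "priority": "low", "title": "Move flag reference"},
    {"category": "description", "priority": "high", "title": "Add trigger phrase"},
    {"category": "instructions", "priority": "high", "title": "Handle missing git config"},
    {"category": "examples", "priority": "medium", "title": "Add breaking change example"},
]
ordered = filter_suggestions(suggestions)
description_only = filter_suggestions(suggestions, description_only=True)
```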

Step 4: Present suggestions

Present the categorized suggestions to the user:

## Improvement Suggestions: <plugin/skill-name>

Current pass rate: 72%

### High Priority

1. **[instructions]** Add explicit error handling for missing git config
   Evidence: eval-003 fails because the skill doesn't check for git user.name

2. **[description]** Add "conventional commit" as trigger phrase
   Evidence: Skill not selected when user says "make a conventional commit"

### Medium Priority

3. **[examples]** Add breaking change example to execution steps
   Evidence: eval-004 inconsistently handles breaking changes

### Low Priority

4. **[structure]** Move flag reference to Quick Reference table
   Evidence: Flags scattered across multiple sections

If --apply is NOT set, stop here.

Delta-verify gate (AEGIS source-cases — required before any apply)

Before any edit is written to the live SKILL.md — both the plain --apply path (Step 5) and the --best-of path (Step 5a) — confirm the edit actually shrinks the source-failure set captured in Step 1, not merely that the overall golden-set pass rate is higher. Ranking by aggregate pass rate can reward a candidate that fixes unrelated cases while leaving the motivating failures broken; this gate closes that gap (HarnessX/AEGIS: re-run on the source cases, confirm the failure count shrinks before applying).

Run the gate against the drafted candidate (a candidate file under eval-results/candidates/, or for plain --apply a draft written there first), never the live SKILL.md:

1. Re-run only the source-failure cases against the candidate: spawn one Task subagent (subagent_type: general-purpose) per case with the candidate content as the skill context (the same rollout machinery as Step 5a; use prepare_run.sh), and grade each transcript with python3 evaluate-plugin/scripts/grade_deterministic.py.
2. Compute delta = (source failures before) − (source failures after).
3. Gate: apply only when delta > 0, meaning the candidate fixes at least one motivating failure. Because the gate re-runs only the source-failure cases, it does not detect regressions in previously passing cases; the Step 6 re-evaluation covers those. When delta <= 0, do not write the edit; report which source cases still fail and suggest revising the suggestions. --force-apply overrides the gate and records the override in history. When the source-failure set is empty, the gate is a no-op and the apply proceeds.
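
The gate decision can be sketched as a small pure function. This is an illustration under assumptions: `delta_verify` and `GateResult` are hypothetical names, and `rerun_passed` stands in for the graded results of re-running the source cases (case ID mapped to pass/fail). The three count fields reuse the names listed under Step 5b.

```python
from dataclasses import dataclass


@dataclass
class GateResult:
    source_failures_before: int
    source_failures_after: int
    source_failure_delta: int
    still_failing: list[str]
    apply: bool    # whether the edit may be written to the live SKILL.md
    forced: bool   # True when --force-apply overrode a non-positive delta


def delta_verify(source_failures: list[str],
                 rerun_passed: dict[str, bool],
                 force_apply: bool = False) -> GateResult:
    """Decide whether a candidate shrinks the source-failure set."""
    if not source_failures:
        # Empty set: the gate is a no-op and the apply proceeds.
        return GateResult(0, 0, 0, [], apply=True, forced=False)

    # A case with no re-run result is conservatively treated as still failing.
    still_failing = [cid for cid in source_failures if not rerun_passed.get(cid, False)]
    before, after = len(source_failures), len(still_failing)
    delta = before - after
    gate_passed = delta > 0
    return GateResult(before, after, delta, still_failing,
                      apply=gate_passed or force_apply,
                      forced=force_apply and not gate_passed)
```

When `apply` is false, `still_failing` is the list to report back to the user.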

Step 5: Apply changes (if --apply)

Use AskUserQuestion to let the user select which suggestions to apply:

Which improvements should I apply?
[x] Add error handling for missing git config
[x] Add trigger phrases to description
[ ] Add breaking change example
[ ] Restructure flag reference

If --best-of N with N > 1, follow Step 5a to pick the winning revision first, then continue with the apply flow below using the winner's content.

Draft the approved edits into a candidate file and run them through the Delta-verify gate above. Only proceed to write the live SKILL.md when the gate passes (or --force-apply is set). For each approved suggestion:

1. Read the current SKILL.md
2. Apply the change using Edit
3. Update the modified date in frontmatter

Step 5a: Generate and rank candidates (if --best-of N > 1)

Instead of drafting the approved edits once, generate N alternative drafts and let evaluation pick the winner.

1. Generate candidates. Write N complete candidate revisions of the SKILL.md to <plugin>/skills/<skill>/eval-results/candidates/candidate-<i>.md (the eval-results/ tree is gitignored). Each candidate implements the approved suggestions with a genuinely different strategy (different instruction placement, phrasing, or example choice), not paraphrases of one draft.
2. Rank with real grading when evals exist. If the skill has evals.json:
   - For each candidate, run one pass per eval case: spawn a Task subagent (subagent_type: general-purpose) that receives the candidate content as the skill context and executes the eval prompt (mirrors /evaluate:skill Step 4; use prepare_run.sh for the run directories).
   - Grade each transcript with python3 evaluate-plugin/scripts/grade_deterministic.py. Typed checks grade for zero judge tokens; defer fuzzy assertions to the eval-grader agent.
   - Rank candidates by source-failure delta first (how many of the Step 1 source-failure cases each candidate fixes, which is the Delta-verify gate signal), then by mean golden-set pass rate, so a candidate that lifts the aggregate while leaving the motivating failures broken never wins. Break remaining ties with the eval-comparator agent: blind pairwise comparison of the tied candidates' transcripts. Discard any candidate with delta <= 0 unless --force-apply is set.
3. Fall back to blind self-preference when no evals exist. Without evals.json there are no prompts to roll out. Rank via the eval-comparator agent: pairwise, candidates presented as Output A/B, the analyzer's weakness list passed as the assertions. Flag this in the report as a weaker signal and suggest re-running /evaluate:skill with --create-evals first.
4. Apply the winner through the Step 5 apply flow, and record the ranking in the history entry (Step 5b below): a candidates array with each candidate's id, pass rate (or comparison score), source-failure delta, and a selected flag.

Token cost is bounded at N × eval cases × 1 run plus grading; treat --best-of without a number as N=3. Prefer this mode for skills that have evals.json — deterministic ranking of real rollouts is the point; text-only self-preference is the fallback.
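
The ranking rule described above (source-failure delta first, then mean pass rate, discarding non-positive deltas unless forced) can be sketched as follows. The helper name `rank_candidates` is hypothetical; the dictionary keys follow the candidates array described in Step 5b. Exact ties are left in input order here, whereas the skill hands them to the eval-comparator agent.

```python
def rank_candidates(candidates: list[dict], force_apply: bool = False) -> list[dict]:
    """Order candidates best-first by (source_failure_delta, pass_rate)."""
    eligible = [
        c for c in candidates
        if force_apply or c["source_failure_delta"] > 0
    ]
    # Tuples compare element by element, so delta dominates and
    # pass rate only separates candidates with an equal delta.
    return sorted(
        eligible,
        key=lambda c: (c["source_failure_delta"], c["pass_rate"]),
        reverse=True,
    )


# Illustrative data only.
candidates = [
    {"id": "candidate-1", "source_failure_delta": 0, "pass_rate": 0.95},
    {"id": "candidate-2", "source_failure_delta": 2, "pass_rate": 0.80},
    {"id": "candidate-3", "source_failure_delta": 1, "pass_rate": 0.90},
]
ranked = rank_candidates(candidates)
winner = ranked[0]["id"] if ranked else None
```

Here candidate-1 has the highest aggregate pass rate but fixes none of the motivating failures, so it is discarded rather than ranked first.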

Step 5b: Record history

After applying changes, update (or create) the history file at:

<plugin-name>/skills/<skill-name>/eval-results/history.json

Add a new iteration entry recording:

- Version number (increment from previous)
- Timestamp
- Pass rate from current benchmark
- Summary of changes made
- Delta-verify result: source_failures_before, source_failures_after, and the resulting source_failure_delta (and whether --force-apply overrode a non-positive delta)
- Candidate ranking when --best-of was used: a candidates array of {id, pass_rate, source_failure_delta, selected} (use comparison_score instead of pass_rate for the no-evals fallback)

Step 6: Suggest re-evaluation

After applying changes, suggest:

Changes applied. Run `/evaluate:skill <plugin/skill-name>` to measure improvement.

Agentic Optimizations

| Context | Command |
|---------|---------|
| Read benchmark | cat <plugin>/skills/<skill>/eval-results/benchmark.json \| jq .summary |
| Read skill | cat <plugin>/skills/<skill>/SKILL.md |
| Read history | cat <plugin>/skills/<skill>/eval-results/history.json \| jq '.iterations[-1]' |
| Check pass rate | cat <plugin>/skills/<skill>/eval-results/benchmark.json \| jq '.summary.with_skill.mean_pass_rate' |
| Source-failure set | cat <plugin>/skills/<skill>/eval-results/benchmark.json \| jq -r '[.cases[] \| select(.with_skill.passed == false) \| .id]' |

Quick Reference

| Flag | Description |
|------|-------------|
| --apply | Apply approved changes to SKILL.md |
| --description-only | Focus on description improvements only |
| --best-of N | Generate N candidate revisions, rank by source-failure delta then pass rate, apply winner |
| --force-apply | Apply even when the delta-verify gate shows the edit does not shrink the source-failure set |
