look-before-you-leap/skills/skill-review-standard/SKILL.md
Post-creation quality gate for skills. Runs structural validation, a functional with/without test, and trigger overlap analysis to produce a SHIP/REVISE/BLOCK verdict. Use after finishing a skill with skill-creator, or when reviewing any skill before shipping. Also use when the user asks to 'review my skill', 'check skill quality', 'is this skill ready to ship', or 'validate this skill'. Do NOT use for: creating skills from scratch (use skill-creator), learning skill conventions (use plugin-dev:skill-development), or reviewing application code.
npx skillsauth add miospotdevteam/claude-control skill-review-standardInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
A quality gate that answers one question: should this skill ship?
Not a prose rubric. Not a checklist. A functional test backed by structural validation. If the skill doesn't add value over baseline Claude, it doesn't ship.
Announce at start: "Running the skill review quality gate."
Run the automated validation script on the target skill directory:
bash ${CLAUDE_PLUGIN_ROOT}/skills/skill-review-standard/scripts/validate-structure.sh <skill-directory>
The script checks:
If any FAIL: Stop here. Verdict is BLOCK. Report the failures and what needs fixing. Structural problems must be resolved before functional testing — there's no point testing a skill that references nonexistent files.
If all PASS/WARN: Continue to Phase 2.
This is the core of the review. A skill that doesn't add value over baseline Claude has no reason to exist.
Create 1 realistic test prompt — the kind of thing a real user would say that should trigger this skill. Make it substantive enough that a skill would genuinely help (not a trivial one-liner Claude can handle without any skill).
Present the prompt to the user: "I'll test the skill with this prompt. Does it look realistic?"
Spawn 2 subagents in parallel:
With-skill agent:
Read the skill at <skill-path>/SKILL.md, then follow its instructions to
complete this task: <test prompt>
Save your output to <workspace>/with_skill/output.md
Without-skill agent:
Complete this task using your best judgment (no special skill or rubric):
<test prompt>
Save your output to <workspace>/without_skill/output.md
Read both outputs. Answer these questions:
If the skill added clear value on at least one dimension without degrading the others, it passes. If the outputs are essentially equivalent — or the baseline was better — the skill fails this phase.
If no value added: Verdict is BLOCK. The skill doesn't justify its existence. Report what both outputs looked like and why the skill didn't help.
If value added: Continue to Phase 3.
Check whether the skill's description conflicts with other installed skills.
Read the available skills list from your context (the skill descriptions injected at session start). If not available, scan installed skill directories for SKILL.md frontmatter.
For each installed skill, check: could a user's prompt reasonably trigger both this skill and the other? Look for:
Flag any overlaps found. An overlap is blocking if both skills would produce conflicting guidance for the same prompt. An overlap is a warning if the skills have related but distinct scopes that could be clarified with better description boundaries.
Produce the review report in this format:
# Review: <skill-name>
## Structural Checks
<output from validate-structure.sh, one line per check>
## Functional Test
Prompt: "<the test prompt used>"
With skill: <1-sentence summary>
Without skill: <1-sentence summary>
Value added: Yes/No — <1-sentence why>
## Trigger Overlap
Checked against <N> installed skills.
<list any overlaps, or "No overlaps found.">
## Verdict: SHIP / REVISE / BLOCK
<1-2 sentences explaining the verdict>
After the verdict is determined, optionally run quantitative benchmarks to measure skill performance with statistical rigor. This phase is recommended for skills that will be used frequently or where trigger precision matters.
Use the eval scripts in scripts/:
SCRIPTS="${CLAUDE_PLUGIN_ROOT}/skills/skill-review-standard/scripts"
# Run 3 evaluation rounds with the test prompt from Phase 2
python3 "$SCRIPTS/run_eval.py" \
--skill-dir <skill-directory> \
--prompt "<test prompt from Phase 2>" \
--output-dir .temp/eval-results/ \
--runs 3
This runs the skill against the prompt multiple times and grades each run on structure (1-5), completeness (1-5), and quality (1-5).
python3 "$SCRIPTS/aggregate_benchmark.py" \
--results-dir .temp/eval-results/ \
--output .temp/eval-results/aggregate.json
Reports mean, stddev, min, max per dimension. Pass threshold: overall mean >= 3.5. High variance (stddev > 1.0) signals inconsistent skill behavior — investigate before shipping.
If trigger precision is a concern (overlaps found in Phase 3, or the skill fires on prompts it shouldn't):
python3 "$SCRIPTS/improve_description.py" \
--skill-dir <skill-directory> \
--dry-run
Reviews the description against all installed skills and suggests
improvements for better trigger accuracy. Remove --dry-run to apply.
python3 "$SCRIPTS/generate_report.py" \
--results-dir .temp/eval-results/ \
--skill-name <skill-name> \
--output .temp/eval-results/report.html
Produces a self-contained HTML report with pass/fail indicators, per-run scores, and the overall verdict. Share with the team or attach to the skill's PR.
This skill must NOT:
The user may proceed autonomously through all three phases. User confirmation is only needed for the test prompt (2.1) and when reporting the final verdict.
development
Use after discovery to write implementation plans with TDD-granularity steps. Produces plan.json (immutable definition, frozen after approval), progress.json (mutable execution state), and masterPlan.md (user-facing proposal for Orbit review). Every step is one component/feature; TDD rhythm (test, verify fail, implement, verify pass, commit) lives in its progress items. Consumes discovery.md from exploration phase. Make sure to use this skill whenever the user says discovery is done, exploration is finished, discovery.md is ready, or asks to write/create/draft the implementation plan — even if they don't mention plan.json or masterPlan.md by name. Also use when the user references completed exploration findings, blast radius analysis, or consumer mappings and wants them converted into actionable steps. Do NOT use when: the user says 'just do it' or 'no plan', resuming or executing an existing plan, during exploration or brainstorming (discovery not yet complete), debugging, or code review.
tools
End-to-end webapp testing with Playwright MCP integration. Use when: writing Playwright tests, E2E testing, browser testing, webapp testing, visual regression testing, accessibility testing with axe-core, testing user flows through a web UI, verifying frontend behavior in a real browser. Integrates with test-driven-development skill for test-first browser tests and engineering-discipline for verification. Do NOT use when: unit tests only (no browser UI involved), API tests without UI, mobile native testing (use react-native-mobile), testing CLI tools, or writing backend-only integration tests.
development
Test-Driven Development workflow enforcing red-green-refactor cycles. Use when writing new features, adding behavior, or implementing functions where tests should drive design. Requires explicit test-first prompting because Claude naturally writes implementation first. Integrates with writing-plans (TDD rhythm in Progress items) and engineering-discipline (verification). Do NOT use when: fixing a bug in existing tested code (use systematic-debugging), writing tests for existing untested code (characterization tests are a different workflow), refactoring without behavior change (use refactoring), or the project has no test infrastructure.
development
Use when encountering any bug, test failure, or unexpected behavior. Enforces root cause investigation before fixes. Four phases: investigate, analyze patterns, form hypotheses, implement. Prevents guess-and-check thrashing. Use ESPECIALLY when under pressure or when 'just one quick fix' seems obvious. Do NOT use for: learning unfamiliar APIs (use exploration), performance optimization without a specific regression, or code review without a reported bug.