plugins/spec-plugin/skills/assess-effectiveness/SKILL.md
Assess how the LATEST spec-plugin version is performing across every previous session that invoked it — aggregate run efficiency (thinking%, compactions, exploration-vs-skills, preload firing, fresh-per-story), process adherence, and recurring spec-quality issues — then propose concrete, evidence-backed improvements for the NEXT version (plugin skills/agents/hooks, and spec/process patterns). Read-only: proposes, never self-modifies. Not tied to a single run.
npx skillsauth add jaisonerick/spec-plugin assess-effectivenessInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
VER="$ARGUMENTS[0]"; case "$VER" in v[0-9]*) VA="--version $VER" ;; *) VA="" ;; esac
python3 "${CLAUDE_SKILL_DIR}/bin/run-metrics.py" $VA 2>/dev/null || echo "(metrics did not run — invoke manually: python3 \"\${CLAUDE_SKILL_DIR}/bin/run-metrics.py\")"
You assess how the latest spec-plugin version performs across all the sessions that invoked it — not one run — and propose evidence-backed improvements for the next version: primarily to the plugin (skills/agents/hooks), secondarily to spec/process patterns. The cross-session metrics for the target version are preloaded above.
You propose, never self-modify — your output is a review a human acts on. (This is the safe replacement for the removed self-modifying retrospective.)
The preload ran bin/run-metrics.py for the latest installed spec-plugin version: every transcript across ~/.claude/projects that loaded a spec-plugin skill at that version, with per-session turns · output-k · thinking% · compactions · explore-Bash:Skill · preload-fired · which-skill, plus an aggregate and the v11 baseline. Re-run with a wider window / pinned version when useful:
python3 "${CLAUDE_SKILL_DIR}/bin/run-metrics.py" [--version vNN] [--since "YYYY-MM-DD HH:MM"] [--all] [--dir <projects-subdir>]
A single bad session is noise; a pattern across the version's sessions is signal.
From the metrics, vs the baseline the script prints:
effort miscalibrated for a role tier.which-skill column).execute-task/validate-execution sessions → the skill sometimes isn't invoked, or the preload is failing — investigate the misses (a real gap we've hit before).How often the design actually held: skills invoked (appearing in the set means it did), preloads fired, exploration delegated, fresh-per-story, peer-to-peer QA. A low rate is a plugin or driving problem — separate a spec/plugin bug from a driving artifact (e.g. a resume run driven in the old persistent style doesn't fault the plugin).
This view spans versions, so there's no single DoD to grade. Instead sample the heaviest/most-anomalous sessions' lessons.md / qa/dod-audit.md (find them via the session's project/cwd) for issues that recur across runs — architecture↔as-built drift, DoD items that needed a REV, oversized stories. A recurring spec issue is a candidate for a build-stories / architect-version guidance change.
Do not read every transcript in your own context. Delegate to sub-agents (run them in parallel) and synthesize their compact findings:
lessons.md / qa for recurring spec-quality + adherence issues (lenses 2–3).Write spec-plugin-effectiveness-<version>.md in the CWD, and report a tight summary to the caller:
vN performs vs the v11 baseline (and the prior version, if known) across the five metrics.file + rationale + the metric it should move.development
Confirm whether a code symbol (method/class/field/endpoint/flag) actually exists and return its REAL signature + definition location — or the nearest match. Uses LSP/introspection, never grep-spelunking. Cheap and fast.
development
Walk one value or action end-to-end across every layer/hop — go-to-definition by go-to-definition, or with a debugger breakpoint — and report the real state transitions and where the contract/shape diverges. The workhorse for architecture sketches and cross-layer debugging.
testing
Bring a fresh worktree/checkout to a runnable state — verify base HEAD, copy gitignored files (.env), allocate per-agent DB/test env, install deps, run the smoke gate. Deterministic, mechanical. Reports a single ready/blocked verdict.
development
Find out how code ACTUALLY behaves by executing the real classes in a REPL — the real request/response shape, the real return value — without relying on a live staging environment. Beats writing a test for 'how does this behave?'.