skills/stocktake/SKILL.md
Audits all crucible skills for overlap, staleness, broken references, and quality. Quick scan or full evaluation modes.
npx skillsauth add raddue/crucible stocktake
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Announce at start: "I'm using the stocktake skill to audit skill health."
Use when the user invokes /stocktake or asks to audit skills.

| Mode | Trigger | Duration |
|------|---------|----------|
| Quick scan | results.json exists (default) | ~5 min |
| Full stocktake | results.json absent, or /stocktake full | ~20 min |
| Efficiency report | /stocktake efficiency | ~5 min |
Results cache: skills/stocktake/results.json
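
The mode table above can be read as a small decision rule. The following is a minimal sketch, assuming the argument after /stocktake arrives as a plain string; the function name `select_mode` is hypothetical and not part of the skill.

```python
from pathlib import Path

# Cache location named in the skill text.
RESULTS_PATH = Path("skills/stocktake/results.json")


def select_mode(argument: str = "", results_path: Path = RESULTS_PATH) -> str:
    """Return 'efficiency', 'full', or 'quick' following the mode table.

    Hypothetical helper: the skill describes the rule, not this function.
    """
    arg = argument.strip().lower()
    if arg == "efficiency":
        return "efficiency"
    # Full stocktake when explicitly requested or when no cache exists yet.
    if arg == "full" or not results_path.exists():
        return "full"
    # Default: the cache exists, so a quick scan is enough.
    return "quick"
```

The explicit `efficiency` argument is checked first so that an efficiency report can be requested whether or not a verdict cache exists; that ordering is an interpretation of the table, not something the skill states.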
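Quick scan (results.json exists):
- Read skills/stocktake/results.json.
- Identify skills changed since the evaluated_at timestamp (compare file mtimes).
- Write the updated results back to skills/stocktake/results.json.

Full stocktake: enumerate all skill directories under skills/. For each, gather the fields shown in the inventory table.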
Present inventory table:
| Skill | Files | Lines | Last Modified | Description |
|-------|-------|-------|---------------|-------------|
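
One way the inventory rows could be gathered is sketched below. It assumes every immediate subdirectory of skills/ is a skill and that all files in it are text; the helper names `inventory` and `to_markdown` are hypothetical. The Description column is left out because it would require parsing each skill's own metadata, which the text here does not specify.

```python
from datetime import datetime, timezone
from pathlib import Path


def inventory(skills_root: str = "skills") -> list[dict]:
    """Collect file count, line count, and newest mtime per skill directory."""
    rows = []
    for skill_dir in sorted(Path(skills_root).iterdir()):
        if not skill_dir.is_dir():
            continue
        files = [f for f in skill_dir.rglob("*") if f.is_file()]
        if not files:
            continue  # an empty directory has nothing to report
        lines = sum(
            len(f.read_text(encoding="utf-8", errors="replace").splitlines())
            for f in files
        )
        newest = max(f.stat().st_mtime for f in files)
        rows.append({
            "skill": skill_dir.name,
            "files": len(files),
            "lines": lines,
            "last_modified": datetime.fromtimestamp(
                newest, tz=timezone.utc
            ).strftime("%Y-%m-%d"),
        })
    return rows


def to_markdown(rows: list[dict]) -> str:
    """Render the rows in the inventory table layout (without Description)."""
    out = [
        "| Skill | Files | Lines | Last Modified |",
        "|-------|-------|-------|---------------|",
    ]
    out += [
        f"| {r['skill']} | {r['files']} | {r['lines']} | {r['last_modified']} |"
        for r in rows
    ]
    return "\n".join(out)
```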
Structural invariants (repo-level). Run the tracked invariant checker from the repo root and treat a non-zero exit as a stocktake failure to surface:
- python3 scripts/check_i2_marker.py: the I2 engine-dispatch marker allowlist. The set of files carrying a column-0 `dispatch: delve-engine` body line must equal exactly {delve, temper} (a stray third dispatcher or a missing one fails). Added #336.

(Other tracked checkers under scripts/check_*.py may be run here too as they are brought into alignment.)
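
The "non-zero exit is a stocktake failure" rule can be sketched with the standard library. This is an illustration only: `run_checker` is a hypothetical wrapper, and nothing is assumed about the checker beyond its exit code.

```python
import subprocess
import sys


def run_checker(command: list[str]) -> tuple[bool, str]:
    """Run one invariant checker and report (passed, combined output).

    A non-zero exit code is treated as a failure to surface in the stocktake.
    """
    proc = subprocess.run(command, capture_output=True, text=True)
    output = (proc.stdout + proc.stderr).strip()
    return proc.returncode == 0, output
```

From the repo root, a call such as `run_checker([sys.executable, "scripts/check_i2_marker.py"])` would return `(False, <checker output>)` when the invariant is violated, and the output can be included in the stocktake summary.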
Dispatch an Opus Explore agent with all skill contents and the evaluation checklist.
Each skill is evaluated against:
- crucible: links resolve to existing skills?

Each skill gets a verdict:
| Verdict | Meaning |
|---------|---------|
| Keep | Useful and current |
| Improve | Worth keeping, specific improvements needed |
| Retire | Low quality, stale, or cost-asymmetric |
| Merge into [X] | Substantial overlap with another skill; name the merge target |
Reason quality requirements — the reason field must be self-contained and decision-enabling:
| Skill | Verdict | Reason |
|-------|---------|--------|
Save the verdicts to skills/stocktake/results.json.

The efficiency report is triggered by /stocktake efficiency or by forge feed-forward when 10+ chronicle signals with efficiency data exist.
Read signals from ~/.claude/projects/<hash>/memory/chronicle/signals.jsonl and keep only those with a metrics.efficiency sub-object. Group filtered signals by skill. For each skill, compute:
- Avg est. tokens: mean of (est_input_tokens + est_output_tokens) across runs
- Avg duration: mean of duration_m
- Avg dispatches: mean of total dispatches per run (sum of dispatches_by_tier values)
- Rework %: mean of rework_pct across runs. If rework_pct is missing (pre-rework-tracking signal), display "—"

If any skill has average rework >30%, append a note: "[skill]: rework >30% — consider reviewing dispatch templates or quality-gate prompts for this skill."
Output:
## Skill Efficiency Report
**Period:** <oldest signal date> to <newest signal date>
**Tracked runs:** N
**Disclaimer:** Estimates based on dispatch file sizes (chars/4). Actual token consumption may vary +/-30%.
### Per-Skill Summary
| Skill | Runs | Avg Est. Tokens (in+out) | Rework % | Avg Duration | Avg Dispatches | Trend |
|-------|------|--------------------------|----------|--------------|----------------|-------|
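
The per-skill aggregation can be sketched as follows. The field names inside metrics.efficiency come from the text; the top-level `skill` key on each signal and the helper names `summarize` and `flag_high_rework` are assumptions made for illustration. The Trend column is not computed because the text does not define it.

```python
from collections import defaultdict
from statistics import mean


def summarize(signals: list[dict]) -> dict:
    """Group signals with efficiency data by skill and average their metrics."""
    by_skill = defaultdict(list)
    for signal in signals:
        eff = signal.get("metrics", {}).get("efficiency")
        if eff:  # skip signals without a metrics.efficiency sub-object
            by_skill[signal["skill"]].append(eff)  # "skill" key is assumed

    report = {}
    for skill, runs in by_skill.items():
        # rework_pct may be absent on signals recorded before rework tracking.
        rework = [r["rework_pct"] for r in runs if r.get("rework_pct") is not None]
        report[skill] = {
            "runs": len(runs),
            "avg_est_tokens": mean(
                r["est_input_tokens"] + r["est_output_tokens"] for r in runs
            ),
            "avg_duration_m": mean(r["duration_m"] for r in runs),
            "avg_dispatches": mean(
                sum(r["dispatches_by_tier"].values()) for r in runs
            ),
            "rework_pct": mean(rework) if rework else None,  # None means "no data"
        }
    return report


def flag_high_rework(report: dict, threshold: float = 30.0) -> list[str]:
    """Return the skills whose average rework exceeds the threshold."""
    return [
        skill
        for skill, row in report.items()
        if row["rework_pct"] is not None and row["rework_pct"] > threshold
    ]
```

A `rework_pct` of `None` corresponds to the placeholder shown in the Rework % column when no run carried that field.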
For each skill, compute dispatch tier distribution and categorize dispatches as review vs. implementation:
- dispatches_by_tier averaged across runs

Note: Review vs. implementation breakdown requires reading manifest entries (role field). If manifests are not available (only chronicle signals), report "N/A" for these columns.
Output:
### Dispatch Breakdown
| Skill | Opus % | Sonnet % | Haiku % | Review % | Impl % |
|-------|--------|----------|---------|----------|--------|
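
The tier percentages and the review/implementation split might be computed as in this sketch. The tier keys and the `"review"` role value are assumptions; the text only names the dispatches_by_tier and role fields. Pooling counts across runs gives the same shares as averaging per-run counts first, because every tier is divided by the same number of runs.

```python
from collections import Counter


def tier_percentages(runs: list[dict]) -> dict:
    """Share of dispatches per tier, pooled over all runs of one skill."""
    totals = Counter()
    for run in runs:
        totals.update(run["dispatches_by_tier"])  # adds the per-tier counts
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {tier: round(100 * n / grand_total, 1) for tier, n in totals.items()}


def review_impl_split(manifest_entries: list[dict]):
    """Return (review %, implementation %), or "N/A" when no manifests exist."""
    if not manifest_entries:
        return "N/A", "N/A"
    # Assumption: a manifest entry marks review work with role == "review".
    review = sum(1 for e in manifest_entries if e.get("role") == "review")
    review_pct = round(100 * review / len(manifest_entries), 1)
    return review_pct, round(100 - review_pct, 1)
```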
For each skill, compute:
- Avg input per dispatch: total_input_chars / total dispatches. Measures context per subagent.
- Quality overhead %: review dispatches / total dispatches * 100. Shows what fraction of work is quality assurance (requires manifest data; "N/A" if unavailable).

Output:
### Structural Efficiency
| Skill | Avg Input/Dispatch | Context Distribution | Quality Overhead % |
|-------|--------------------|-----------------------|--------------------|
For each skill with sufficient data (3+ runs):
- Avg total context: (total_input_chars + total_output_chars) per run. The total context the pipeline touched.
- Avg input per dispatch: total_input_chars / total dispatches per run. How much context each subagent receives on average.
- Context focus ratio: avg input per dispatch / avg total context. Lower values mean each subagent sees a smaller slice of the total, indicating effective context distribution.
- Quality investment: review dispatches / total dispatches. The fraction of dispatches dedicated to quality assurance (requires manifest data; "N/A" if only chronicle signals are available).

Output:
### Baseline Comparison (Structural)
| Skill | Avg Total Context | Avg Input/Dispatch | Context Focus Ratio | Quality Investment |
|-------|-------------------|--------------------|---------------------|--------------------|
**Interpretation:** Context focus ratio measures how much of the total pipeline context each
subagent receives. Lower values mean more focused dispatches. Quality investment shows the
fraction of dispatches dedicated to review, red-team, and quality gates. These are structural
comparisons, not cost savings claims — they measure how the skill distributes work, not what
a monolithic alternative would cost.
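
The structural ratios can be sketched as below for a single skill. The per-run keys `total_input_chars` and `total_output_chars` come from the text; the `dispatches` key (total dispatches in the run, assumed greater than zero) and the function name `baseline` are hypothetical. Quality investment is omitted because it depends on manifest data.

```python
from statistics import mean

MIN_RUNS = 3  # the text requires 3+ runs before reporting a baseline


def baseline(runs: list[dict]):
    """Return structural baseline metrics for one skill, or None if too few runs."""
    if len(runs) < MIN_RUNS:
        return None
    avg_total_context = mean(
        r["total_input_chars"] + r["total_output_chars"] for r in runs
    )
    avg_input_per_dispatch = mean(
        r["total_input_chars"] / r["dispatches"] for r in runs
    )
    return {
        "avg_total_context": avg_total_context,
        "avg_input_per_dispatch": avg_input_per_dispatch,
        # Lower ratio: each subagent sees a smaller slice of the total context.
        "context_focus_ratio": avg_input_per_dispatch / avg_total_context,
    }
```

For example, three runs that each move 8,000 input and 2,000 output characters over 4 dispatches give an average of 2,000 input characters per dispatch against 10,000 total, a context focus ratio of 0.2.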
Save efficiency report data to skills/stocktake/results.json under a new efficiency key (separate from the skill verdict cache):
{
  "efficiency": {
    "computed_at": "2026-04-07T10:00:00Z",
    "signals_with_efficiency": 15,
    "total_signals": 42,
    "per_skill": {
      "build": { "runs": 8, "avg_est_tokens": 52600, "avg_duration_m": 45, "trend": "stable" },
      "debugging": { "runs": 5, "avg_est_tokens": 25000, "avg_duration_m": 22, "trend": "improving" }
    }
  }
}
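
Because the efficiency data shares a file with the verdict cache, the write should update only the `efficiency` key. A minimal sketch, assuming the file holds a single JSON object; `save_efficiency` is a hypothetical helper name.

```python
import json
from pathlib import Path


def save_efficiency(results_path: str, efficiency: dict) -> dict:
    """Store the efficiency report under its own key, keeping other keys intact."""
    path = Path(results_path)
    # Start from the existing cache so verdict data is not overwritten.
    data = json.loads(path.read_text(encoding="utf-8")) if path.exists() else {}
    data["efficiency"] = efficiency
    path.write_text(json.dumps(data, indent=2) + "\n", encoding="utf-8")
    return data
```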
skills/stocktake/results.json:
{
  "evaluated_at": "2026-03-07T10:00:00Z",
  "mode": "full",
  "skills": {
    "skill-name": {
      "path": "skills/skill-name/SKILL.md",
      "verdict": "Keep",
      "reason": "Concrete, actionable, unique value for X workflow",
      "mtime": "2026-01-15T08:30:00Z"
    }
  }
}
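
A quick scan can use this cache to decide which skills need re-evaluation. The sketch below compares each cached skill file's mtime with `evaluated_at`; it is one possible reading of "compare file mtimes", and the function name `changed_skills` is hypothetical. Skills added after the last run are not in the cache, so a real scan would also need to enumerate skills/ for new directories.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def changed_skills(results_path: str, repo_root: str = ".") -> list[str]:
    """List cached skills whose file is missing or newer than evaluated_at."""
    cache = json.loads(Path(results_path).read_text(encoding="utf-8"))
    # Normalise the trailing "Z" so fromisoformat accepts it on older Pythons.
    evaluated_at = datetime.fromisoformat(
        cache["evaluated_at"].replace("Z", "+00:00")
    )
    changed = []
    for name, entry in cache["skills"].items():
        skill_file = Path(repo_root) / entry["path"]
        if not skill_file.exists():
            changed.append(name)  # deleted or moved since the last evaluation
            continue
        mtime = datetime.fromtimestamp(skill_file.stat().st_mtime, tz=timezone.utc)
        if mtime > evaluated_at:
            changed.append(name)
    return changed
```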