skills/calibration-reconcile/SKILL.md
Reconcile the Crucible calibration ledger — walk merged fix/hotfix branches to falsify the originating gating-verdicts, compute per-skill Brier calibration scores, and append a falsification log. Triggers on "/calibration-reconcile", "reconcile ledger", "reconcile calibration", "falsify verdicts", "brier score", "calibration reconcile", "compute brier".
Walks fix branches to falsify the gating-verdicts that should have caught
the bug, computes per-skill Brier calibration scores, and writes a
falsification log to the central ledger ~/.claude/crucible/ledger/
(override CRUCIBLE_LEDGER_DIR). The SKILL.md is a thin prompt wrapper; the
source of truth is the testable Python core at scripts/reconcile_ledger.py.
Skill type: Utility — direct execution, no subagent dispatch.
Invocation: manual, via /calibration-reconcile [--lookback-days N] (default N=30).
If CRUCIBLE_CALIBRATION_DISABLED=1, the CLI skips entirely (graceful
no-op exit 0) before any git or filesystem work. The forward-capture path
(scripts/ledger_append.py) honors the same switch, so the whole calibration
subsystem can be disabled with one env var.
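The two environment variables mentioned so far can be illustrated with a minimal sketch. The helper names `ledger_dir` and `calibration_disabled` are hypothetical, and treating only the exact value "1" as "disabled" is an assumption; the real logic lives in scripts/reconcile_ledger.py and may differ.

```python
import os
from pathlib import Path

DEFAULT_LEDGER_DIR = "~/.claude/crucible/ledger"  # default named in this document


def ledger_dir(env=os.environ):
    """Return the ledger directory, honoring a CRUCIBLE_LEDGER_DIR override."""
    return Path(env.get("CRUCIBLE_LEDGER_DIR", DEFAULT_LEDGER_DIR)).expanduser()


def calibration_disabled(env=os.environ):
    """True when the kill switch is set (assumption: only the value "1" counts)."""
    return env.get("CRUCIBLE_CALIBRATION_DISABLED") == "1"


def main(env=os.environ):
    # Check the switch first, before any git or filesystem work.
    if calibration_disabled(env):
        return 0  # graceful no-op
    target = ledger_dir(env)
    print(f"would reconcile into {target}")
    return 0
```

Passing `env` as a parameter keeps the sketch testable without mutating the real process environment.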
Invoke the reconciler directly. Resolve scripts/reconcile_ledger.py by
absolute path from the plugin root (it self-locates its own modules via
__file__, so no PYTHONPATH is needed and it runs from any cwd):
python3 <plugin_root>/scripts/reconcile_ledger.py --lookback-days 30
It reads runs.jsonl and manual-attribution.jsonl from the central store and
writes falsification.jsonl (append-only, L-1/L-9) + brier-rolling.json
(gitignored) into that SAME dir.
Scope limit: this implements steps 1–6 ONLY. The predicted-falsifier predicate parser / "second pass" is Phase 7 and explicitly out of scope.
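As a rough illustration of the cross-cut threshold, walkback match, and confidence tiers that the steps in this section describe, here is a sketch under stated assumptions. It is not the code in scripts/reconcile_ledger.py: the function names, the `timestamp` and `id` keys, the percentile method, and the treatment of exactly 14 days as "high" are all assumptions.

```python
from datetime import datetime, timedelta
from statistics import quantiles

BOOTSTRAP_THRESHOLD = 20  # fixed value used until enough samples exist
MIN_SAMPLES = 30


def cross_cut_threshold(sizes):
    """p90 of prior fix-branch sizes, or the bootstrap value below 30 samples."""
    if len(sizes) < MIN_SAMPLES:
        return BOOTSTRAP_THRESHOLD
    # quantiles(n=10) returns 9 cut points; the last is the 90th percentile.
    return quantiles(sizes, n=10)[-1]


def earliest_match(entries, touched_files, merge_time):
    """Earliest entry meeting conditions (a)-(d), or None."""
    touched = set(touched_files)
    eligible = [
        e for e in entries
        if touched & set(e["gated_files"])   # (a) files overlap
        and e["timestamp"] < merge_time      # (b) verdict predates the merge
        and e["artifact_type"] == "code"     # (c) code artifacts only
        and not e["backfilled"]              # (d) skip backfilled entries
    ]
    return min(eligible, key=lambda e: e["timestamp"], default=None)


def confidence_tier(verdict_time, merge_time, n_touched, cross_cut):
    """Map a match to high / medium / low."""
    age = merge_time - verdict_time
    # Any fix touching more than 5 files is low: it is either multi-file
    # (at or below the threshold) or cross-cut (above it).
    if cross_cut or age > timedelta(days=30) or n_touched > 5:
        return "low"
    if age <= timedelta(days=14):  # assumption: the boundary day counts as high
        return "high"
    return "medium"


merge = datetime(2025, 3, 20)
entries = [
    {"id": "a", "gated_files": ["src/db.py"], "timestamp": datetime(2025, 3, 1),
     "artifact_type": "code", "backfilled": False},
    {"id": "b", "gated_files": ["src/db.py"], "timestamp": datetime(2025, 3, 10),
     "artifact_type": "code", "backfilled": False},
]
hit = earliest_match(entries, ["src/db.py", "README.md"], merge)
tier = confidence_tier(hit["timestamp"], merge, n_touched=2, cross_cut=False)
```

Here the earlier of the two overlapping entries is selected, and its 19-day gap to the merge places it in the medium tier.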
1. Discover candidates (discover_candidates): fix/* and hotfix/* branches
   merged in the last --lookback-days (default 30), plus regression-labelled
   closed issues with a referenced commit (best-effort; if gh is
   unavailable, skip silently). Each candidate carries touched_files (from
   git show --name-only <merge-sha>) and merge_time.
2. Cross-cut threshold (cross_cut_threshold_from): threshold = p90 of
   fix-branch sizes over the prior 90 days (a DISTINCT window from the
   30-day candidate lookback); bootstrap to a fixed 20 until ≥30 samples
   exist. A candidate with len(touched_files) > threshold is tagged
   cross_cut: true. Cross-cut candidates may still get a confidence: low
   attribution but are EXCLUDED from the Brier denominator.
3. Walkback (reconcile): for each candidate, find the EARLIEST ledger entry
   where ALL of: (a) gated_files ∩ touched_files non-empty, (b) entry
   timestamp < candidate merge_time, (c) artifact_type == "code", (d)
   backfilled == false. If found, record a falsification entry marking it
   falsified: true with falsified_by: {commit, reason, confidence, cross_cut}.
4. Confidence: high when intersection ≥1 file AND merge within 14 days of
   the verdict AND cross_cut == false; medium when 14–30 days AND
   cross_cut == false; low when >30 days OR cross_cut == true OR
   multi-file fixes (>5 files but ≤ the cross-cut threshold).
5. Manual attribution (read_manual_attribution): read
   manual-attribution.jsonl first; entries there (keyed by
   ledger_entry_hash) override the algorithm's attribution. Missing file ⇒
   empty overrides (no error). An entry may carry an optional signal_type
   (manual_override (default) | bad_implementation); see
   Non-code bad_implementation signal.
6. Kill switch: CRUCIBLE_CALIBRATION_DISABLED=1 ⇒ no-op exit.

Per-skill calibration is the Brier score over each skill's falsifiable sample:
brier = mean((confidence - actual)^2)
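A minimal sketch of the formula above, with `actual` assigned per the v1 rules this section gives (any FAIL, or a PASS not marked falsified, scores 1; a falsified PASS scores 0). The function names and the sample verdicts are hypothetical and for illustration only.

```python
def actual_for(verdict, falsified):
    """v1 outcome: 1 for any FAIL or an unfalsified PASS, 0 for a falsified PASS."""
    if verdict == "FAIL":
        return 1
    return 0 if falsified else 1


def brier_score(sample):
    """Mean squared gap between confidence and actual; None for an empty sample."""
    pairs = list(sample)
    if not pairs:
        return None
    return sum((conf - actual) ** 2 for conf, actual in pairs) / len(pairs)


# Hypothetical verdicts: (verdict, confidence, falsified)
verdicts = [
    ("PASS", 0.9, False),  # (0.9 - 1)^2 = 0.01
    ("PASS", 0.8, True),   # (0.8 - 0)^2 = 0.64
    ("FAIL", 0.6, False),  # (0.6 - 1)^2 = 0.16
]
score = brier_score((conf, actual_for(v, f)) for v, conf, f in verdicts)
```

The three squared errors sum to 0.81, so the mean is 0.27; a single confidently wrong PASS dominates the score.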
- actual = 1: a PASS NOT marked falsified, OR any FAIL (at v1 every
  FAIL ⇒ actual=1, since predicted-falsifier predicates are Phase 7).
- actual = 0: a PASS that WAS marked falsified: true (looked up in the
  L-9 latest-entry-wins reduction over falsification.jsonl via
  scripts.ledger_reduce.reduce, the canonical helper cited above).
- Falsifiable sample: artifact_type == "code" OR a non-code verdict
  carrying a bad_implementation falsification (see below);
  backfilled == false; verdict ∈ {PASS, FAIL} with confidence >= 0.5;
  entry older than the 30-day grace period; AND if a matching
  falsification entry exists, its cross_cut must be false.
- Only PASS/FAIL count. STAGNATION, ESCALATED, ARCHITECTURAL,
  SUSTAINED_REGRESSION are excluded from both numerator AND denominator.
- (… n; just don't gate). The advisory threshold is brier > 0.25
  (consumed in Phase 6; here we only compute and store).

Non-code bad_implementation signal
Auto-falsification (walkback + predicted_falsifier) is code-artifact-centric: a
verdict is only falsified when a later fix/revert touches a gated path or fires a
predicate. That can never reach a non-code verdict (a design-doc or plan
PASS), and it misses the case where a PASS was accepted but the resulting
implementation later proved bad for reasons no path-touching fix captures — a
design-level wrong call, downstream rework, or an abandoned approach.
To score those, a human adds a manual-attribution.jsonl entry with
signal_type: "bad_implementation":
{"ledger_entry_hash": "<sha256(run_id:skill)>", "falsified": true,
"confidence": "high", "signal_type": "bad_implementation",
"reasoning": "design PASS led to a full rework of the persistence layer"}
- On a PASS it sets actual = 0 (the verdict was overconfident); on a FAIL
  it is out-of-contract and silently ignored.
- Non-code verdict (design/plan/…): such a verdict is admitted into the
  Brier sample only when it carries this signal; absent it, a non-code
  verdict's outcome is unknown and stays excluded (we never assume a
  non-code PASS was correct just because nothing falsified it). The usual
  filters still apply (PASS/FAIL, confidence >= 0.5, outside grace,
  non-cross-cut), so a Tier-B confidence: null verdict is render-counted
  but not Brier-scored.
- … actual like any other PASS falsification.
- /ledger breaks the falsified count down by source (walkback / predicate /
  manual_override / bad_implementation).
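To make the manual-attribution entry concrete, the sketch below builds one JSONL line in the shape shown earlier in this section. It assumes the hash is the hex SHA-256 of the literal string "run_id:skill" in UTF-8, which is only one reading of the `<sha256(run_id:skill)>` placeholder; the function names, run id, and skill name are hypothetical.

```python
import hashlib
import json


def ledger_entry_hash(run_id, skill):
    """Assumed scheme: hex SHA-256 of 'run_id:skill' encoded as UTF-8."""
    return hashlib.sha256(f"{run_id}:{skill}".encode("utf-8")).hexdigest()


def bad_implementation_line(run_id, skill, reasoning, confidence="high"):
    """Serialize one manual-attribution.jsonl entry as a single JSON line."""
    entry = {
        "ledger_entry_hash": ledger_entry_hash(run_id, skill),
        "falsified": True,
        "confidence": confidence,
        "signal_type": "bad_implementation",
        "reasoning": reasoning,
    }
    return json.dumps(entry)


# Hypothetical identifiers, for illustration only.
line = bad_implementation_line(
    "run-123", "design-review",
    "design PASS led to a full rework of the persistence layer",
)
```

Appending `line` plus a newline to manual-attribution.jsonl keeps the file valid JSONL; verify the real hash scheme against the ledger code before relying on this.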
falsification.jsonl: APPEND-ONLY (L-1/L-9). Each entry:
{ledger_entry_hash, falsified, falsified_by, confidence, reasoning, cross_cut},
plus an optional signal_type (manual_override | bad_implementation) on
manual-attribution-sourced entries.
brier-rolling.json (gitignored):
{"<skill>": {"n": int, "brier": float, "last_updated": "<iso>"}, ...}.
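Because falsification.jsonl is append-only, readers have to collapse it to one entry per ledger_entry_hash. The sketch below illustrates a latest-entry-wins reduction under the assumption that "latest" means "appears later in the file". It is a stand-in for illustration, not the scripts.ledger_reduce.reduce helper, whose actual signature is not shown in this document.

```python
import json


def latest_entry_wins(lines):
    """Collapse JSONL lines to the last entry seen per ledger_entry_hash."""
    latest = {}
    for raw in lines:
        raw = raw.strip()
        if not raw:
            continue  # tolerate blank lines
        entry = json.loads(raw)
        # Later lines overwrite earlier ones for the same hash.
        latest[entry["ledger_entry_hash"]] = entry
    return latest


# Hypothetical log: the second line supersedes the first for hash "abc".
log = [
    '{"ledger_entry_hash": "abc", "falsified": true, "cross_cut": false}',
    '{"ledger_entry_hash": "abc", "falsified": false, "cross_cut": false}',
    '{"ledger_entry_hash": "def", "falsified": true, "cross_cut": true}',
]
reduced = latest_entry_wins(log)
```

With this shape a correction is made by appending a new entry rather than editing an old one, which is what keeps the log append-only.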