skills/temper-eval-calibrate/SKILL.md
Wrapper skill that runs k iterations of (stage → collect-behavior → score) for
npx skillsauth add raddue/crucible temper-eval-calibrateInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Invocation:
/temper-eval-calibrate <run-id-prefix> [--k N] [--source SOURCE] [--fixture <id>]
[--write-baseline] [--compare-baseline]
[--trials-override N] [--max-parallel N]
[--timeout N]
Pre-conditions:
<run-id-prefix> matches ^[A-Za-z0-9_][A-Za-z0-9_-]{0,28}$ (I-9; aligned with _runid.py's _PREFIX_RE; reserves 3 chars for -<i> suffix)Post-conditions:
i in 1..k, skills/temper/evals/.calibrate-state/<prefix>-<i>-complete exists AND skills/temper/evals/.calibrate-state/last_run-<prefix>-<i>.json exists with run_id == <prefix>-<i>Task 13 (per-iter wiring) lands BEFORE this skill is invoked — verified by sequential task numbering. Per F-R4-1, Task 13 lands BOTH the --per-iter argparse declaration AND the main() wiring in a SINGLE atomic commit (Task 4 intentionally omits the argparse declaration to eliminate any silent-drop window). Because Task 13 is numbered ahead of Task 14, operators following the plan sequentially will have --per-iter wired before this skill's SKILL.md exists to be invoked. Defense-in-depth: if Task 13 is somehow skipped, score --per-iter will raise an argparse "unrecognized arguments" error (no silent drop) — the calibrate skill's Step 3e shell-out will fail loud rather than clobbering shared state.
<run-id-prefix> must match ^[A-Za-z0-9_][A-Za-z0-9_-]{0,28}$ (M-FE-3 R3); refuse with explicit error if not.--k (default 3) must be in [1, 99] inclusive; refuse if out-of-range. <!-- 2P-R4-2: k cap 99 ties to _PREFIX_RE's 3-char suffix reservation; k>=100 would overflow into a 4-char suffix and silently push the resulting run-id past the 32-char _RUN_ID_RE ceiling. -->STATE_DIR = skills/temper/evals/.calibrate-state/. If it does not exist, create it (mkdir -p).
i in 1..kRUN_ID = "<prefix>-<i>".
If BOTH of the following are true, skip iteration entirely (no stage, no collect, no score, no billing):
skills/temper/evals/.calibrate-state/<prefix>-<i>-complete existsskills/temper/evals/.calibrate-state/last_run-<prefix>-<i>.json exists AND its run_id field equals <prefix>-<i>Print: [skip] iteration <i>: already complete.
Shell out via Bash tool: python -m skills.temper.evals.run_evals stage "$RUN_ID" [--source ...] [--fixture ...] [--trials-override ...] [--timeout ...]
Pass through whatever flags the user supplied.
Execute the full procedure documented in skills/temper-eval-collect/SKILL.md Steps 1–8 against RUN_ID.
Version pin (M1 R8): this inline-execution reference is pinned to the Steps-1–8 shape at calibrate-skill author time. If skills/temper-eval-collect/SKILL.md is later edited (e.g. a Step 4.5 is added, or steps are renumbered), this calibrate skill MUST be re-validated against the new step set — there is NO automatic inheritance. Treat collect SKILL.md edits as breaking changes for the calibrate skill until re-verified.
This includes:
.tmp cleanup--timeout overridemanifest.jsonl per-dispatch appends.collect-status write with fsync (I-12)The inlined behavior satisfies I-7, I-8, I-10, I-12 identically to standalone /temper-eval-collect.
Shell out: python -m skills.temper.evals.run_evals score "$RUN_ID" --per-iter [--write-baseline] [--compare-baseline]
The --per-iter flag is REQUIRED here (F1): it tells score to write to skills/temper/evals/.calibrate-state/last_run-<prefix>-<i>.json instead of the shared last_run.json. Without --per-iter, iterations would clobber each other's output AND collide with operator-owned last_run.json artifacts. There is NO run-id-shape heuristic; the routing is explicit.
If score returns 2 (fatal), abort all remaining iterations and surface the error.
On successful score, write skills/temper/evals/.calibrate-state/<prefix>-<i>-complete with content complete-<ISO-8601>.
[iter <i>/<k>] RUN_ID=<RUN_ID> score=<rc>
After ALL k iterations succeed (every iteration's .calibrate-state/<prefix>-<i>-complete sentinel is present), copy skills/temper/evals/.calibrate-state/last_run-<prefix>-<k>.json to skills/temper/evals/last_run.json (the shared output the --compare-baseline workflow expects). This is unconditional on success — not optional.
Pre-copy verification (mirrors the sentinel-pair discipline in Step 3b):
last_run-<prefix>-<k>.json exists. If absent, refuse to clobber last_run.json and exit with the fatal error "iteration k completion sentinel present but per-iter file missing at <path>; refusing to consolidate.""per-iter file at <path> is malformed JSON; refusing to consolidate."run_id field equals <prefix>-<k>. If it does not, refuse with "per-iter file at <path> carries run_id=<actual>, expected <prefix>-<k>; refusing to consolidate."If ANY iteration failed (sentinel missing, score returned 2, or stage refused), do NOTHING in this step: leave last_run.json untouched so the operator can inspect the per-iter files manually without a clobbered shared state.
Rationale: the original "optionally copy" wording produced two equally-valid downstream states (shared file = last iter vs. shared file = stale) with no operator signal to disambiguate. The all-or-nothing rule eliminates the ambiguity. Revisit only if best-or-median canonicalization is genuinely needed (defer to a follow-up ticket).
This skill enforces: M-2 (k cap 1..99), I-9 (prefix regex), AC-12 (resume idempotency).
testing
Standalone instance-bug reviewer — runs a parallel finder fan-out + verify gate over a diff or a path and prints ranked, verified findings. Use when the user says "delve", "find bugs in this diff", "review this for bugs", "scan this file/subsystem for defects", "instance-bug sweep", or wants concrete reproducible defects (not a merge verdict, not systemic health). Works on a PR id, a base..head range, or a path, on any forge (GitHub, GitLab, Bitbucket, self-hosted).
testing
Render the Crucible calibration ledger weekly report — the honest "Crucible caught N silent bugs" headline, verdict breakdown, per-skill severity rates, and the inflation detector. Triggers on "/ledger", "weekly report", "weekly ledger", "caught N", "quality ledger", "calibration report", "render the ledger".
development
The Book of Grudges — cross-session bug graveyard. Every fixed bug is recorded as a structured "grudge"; before touching code, skills query the grudgebook for the files in scope and surface past regressions as forced "DO NOT REPEAT" context. Read mode (pre-flight) and write mode (on bug resolution / fix(*) PR). Machine-local, per-repo, never committed. Triggers on /grudge, "check grudges", "record a grudge", "any past bugs here", "regression oracle", "bug graveyard".
testing
Reconcile the Crucible calibration ledger — walk merged fix/hotfix branches to falsify the originating gating-verdicts, compute per-skill Brier calibration scores, and append a falsification log. Triggers on "/calibration-reconcile", "reconcile ledger", "reconcile calibration", "falsify verdicts", "brier score", "calibration reconcile", "compute brier".