skills/temper/SKILL.md
Iteratively review code changes for production readiness through fresh-eyes review loops. Use when completing tasks, implementing major features, or before merging — including when the user says "review this PR", "review my changes", "code review", "check the diff", or "is this ready to ship". Works on PRs from any forge (GitHub, GitLab, Bitbucket, self-hosted) or on raw git SHA ranges.
All subagent dispatches use disk-mediated dispatch. See shared/dispatch-convention.md for the full protocol.
The finder angles, the verify gate, the effort tiers, the cap semantics, and the eight-field output schema are NOT redefined here — they live in shared/delve-engine.md. /temper drives that engine: Round 1 dispatches it (bug-angle subset, high effort) to enumerate the tracked set T, and each later round drives it again over the fixed regions to hunt new gating findings (Track A, Step 4).
The severity and verdict vocabularies, and the gating rule T = {CONFIRMED, PLAUSIBLE} × {Critical, Important}, are the contract's — not temper's (I11). See shared/severity-verdict-contract.md; temper consumes that gating rule to build T and defines no scale or verdict of its own.
Like tempering steel after forging — iterative heat-and-quench cycles that set final hardness and elasticity — /temper runs successive fresh-eyes review rounds until the change converges. Round 1 drives the delve-engine fan-out (bug-angle subset) to enumerate a tracked set T of gating findings; each later round re-verifies whether every member of T is resolved against the fixed code and admits any new gating finding the fix introduced. The loop exits when T is fully resolved and no new gating finding entered.
Core principle: Review early, review often. Fresh eyes every round — but the per-round instrument is the engine's parallel finder fan-out + verify gate (plus a per-member re-verification pass), not one holistic reviewer. Convergence is the resolution status of an enumerated finding set, never a cross-round count comparison. (Distinct from /audit: temper drives delve-engine's instance-bug fan-out — one-reproduction defects; /audit runs systemic lenses — different machines.)
Renamed from /code-review (2026-05-17) to avoid collision with Claude Code's built-in /review command. Same iteration behavior; the argument shape and platform-agnostic PR support are new.
Temper reviews code diffs only. Use a different skill for:
- /audit or /red-team.
- /audit. (temper itself drives delve-engine's parallel fan-out for instance-bug enumeration, meaning one-reproduction defects; it does not defer that wholesale to /audit. The boundary is the reproduction discriminator: one concrete reproduction is temper/delve's; a no-single-repro pattern is /audit's.)
- /inquisitor.
- /security-review or /siege for a deep multi-agent security audit.
- /quality-gate.

/quality-gate

Temper and quality-gate share a loop shape (fresh reviewer each round, stagnation detection, escalate on architectural concerns). They differ in scope and caller:
- /quality-gate is the generic iterative red-team loop over any artifact (design, plan, code, hypothesis, mockup). It is invoked by artifact-producing skills as their terminal gate.
- /temper is the code-diff-specific instance: same loop shape, plus forge integration (PR metadata, optional post-back), plus the fix-verification convergence model. It is user-facing for ad-hoc review and is called by build / debugging / finish on diffs.

temper drives shared/delve-engine.md for its finding enumeration. It is the engine's fix-verification loop driver, in contrast to /delve, which runs the same engine once and never loops or emits a merge verdict.
When in doubt: if the artifact is a code diff, use temper. If it is anything else (or you are inside an artifact-producing skill writing the gate), use quality-gate.
Mandatory:
Optional but valuable:
| Component | Required? | Purpose | Fallback if missing |
|---|---|---|---|
| git | Required | Diff resolution, SHA range, default-branch detection, per-round working-tree snapshots (git stash create, §3.8) | None — abort with clear error |
| shared/delve-engine.md | Required | The parallel finder fan-out + verify gate temper drives to enumerate T (R1) and hunt new gating findings (R2+ Track A) | None — abort; temper has no engine of its own |
| shared/severity-verdict-contract.md | Required | The T = {CONFIRMED, PLAUSIBLE} × {Critical, Important} gating rule and the severity/verdict vocabulary (I11 — temper defines none of its own) | None — abort; temper coins no scale or verdict |
| Forge CLI (gh / glab / bb) | Optional | PR metadata fetch + optional Step 5 post-back | Probe in order; if all missing, fall through to git-plumbing and ask user for description |
| crucible-consensus MCP server | Optional | External-model candidate feed via external_review (R1-only external_candidates, see External Model Review) | Skip silently — gather no external candidates (≡ external_review=skip) |
| crucible:test-coverage | Optional | Test-alignment audit when behavioral changes are made | Skip; recommend manually |
| crucible:checkpoint | Optional | Fallback per-round working-tree snapshot mechanism for the uncommitted-mode fix-delta derivation (§3.8); the primary mechanism is git stash create (no dependency), so checkpoint stays optional. Also a pre-fix rollback target when build wraps temper. | Skip silently — git stash create is the primary path |
/temper # auto-detect (see Step 1 case 3)
/temper 259 # PR identifier on the current forge
/temper https://... # PR URL on any forge
/temper main..HEAD # explicit SHA range
/temper a1b2c3..d4e5f6 # explicit SHA range
/temper 259 max_rounds=8 # override default 5-round circuit breaker
/temper 259 max_rounds=8 external_review=skip # skip redundant external_review on re-invocation
Argument shape: [PR-id-or-URL | <base>..<head>] [max_rounds=<N>] [external_review=skip]. No argument means auto-detect.
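The argument shape above can be sketched as a small parser. This is a minimal illustration, not the skill's own implementation: `TemperArgs` and `parse_temper_args` are hypothetical names, and treating a URL as a PR reference before testing for `..` is an assumption made so a URL is never mistaken for a SHA range.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class TemperArgs:
    """Hypothetical container for parsed /temper arguments."""
    target: Optional[str] = None           # PR id, PR URL, or <base>..<head>
    kind: str = "auto"                     # "pr", "range", or "auto" (no argument)
    max_rounds: int = 5                    # default round cap
    external_review: Optional[str] = None  # "skip" or None


def parse_temper_args(tokens: List[str]) -> TemperArgs:
    args = TemperArgs()
    for token in tokens:
        if token.startswith("max_rounds="):
            value = int(token.split("=", 1)[1])
            if value < 1:
                raise ValueError("max_rounds must be a positive integer")
            args.max_rounds = value
        elif token.startswith("external_review="):
            # Only the "skip" value appears in the documented argument shape.
            if token != "external_review=skip":
                raise ValueError(f"unsupported option: {token!r}")
            args.external_review = "skip"
        elif args.target is None:
            args.target = token
            is_url = token.startswith(("http://", "https://"))
            # Case 2: a SHA range is an argument containing "..".
            args.kind = "range" if (".." in token and not is_url) else "pr"
        else:
            raise ValueError(f"unexpected argument: {token!r}")
    return args
```

With no tokens the result keeps `kind == "auto"`, which corresponds to the auto-detect path.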
Determine what to review based on the argument:
Case 1 — PR number or URL. Fetch metadata (title, body, base ref, head ref). Forge detection is CLI-probe order, not hostname-literal (covers GitHub Enterprise Server and any other GH-flavored host). When the argument is a URL, parse the forge from the URL host first and use only the matching CLI (handles fork workflows where origin and upstream live on different forges). When the argument is a bare PR number, probe CLIs in order against the current origin.
If the argument is a URL, parse the forge from the URL host and try only the matching CLI:
- github.com or any host gh authenticates against (GHE) → gh pr view <id> --json title,body,baseRefName,headRefName,author --repo <owner/repo-from-URL>
- gitlab.com or any GitLab host → glab mr view <id> --repo <project-from-URL>
- bitbucket.org or any Bitbucket host → bb pr view <id> --repo <slug-from-URL>
- Matching CLI missing → fall through to git plumbing (git fetch <remote> <head-ref>); ask the user to paste the description.

If the argument is a bare PR number, try CLIs in order against the current origin:
1. gh pr view <id> --json title,body,baseRefName,headRefName,author: covers GitHub and GitHub Enterprise (any host gh is authenticated against; verify with gh auth status --hostname <host> if needed).
2. glab mr view <id>: covers GitLab and self-hosted GitLab.
3. bb pr view <id> (or REST): covers Bitbucket.
4. If every CLI is missing, fall through to git plumbing: git fetch <remote> pull/<id>/head (GitHub-style ref) or git fetch <remote> merge-requests/<id>/head (GitLab-style); ask the user to paste the description if they want it factored in.

Distinguish CLI errors from missing CLIs. A CLI that exits non-zero with "404 / PR not found / authentication required" is not a missing-CLI fallback path. Surface the error to the user (e.g., "gh found the PR but auth failed; re-authenticate or paste the diff manually") and pause for instruction. Falling through silently on a CLI error would dispatch a review against the wrong scope.
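The distinction between a missing CLI and a present-but-failing CLI can be sketched as follows. This is an illustrative sketch only: `probe_forge_clis` and its status strings are hypothetical, and the `which` / `run` parameters are injected so the probe can be exercised without a real forge CLI.

```python
import shutil
import subprocess
from typing import Callable, Sequence, Tuple


def probe_forge_clis(
    commands: Sequence[Sequence[str]],
    which: Callable = shutil.which,
    run: Callable = subprocess.run,
) -> Tuple[str, str]:
    """Try each CLI command in order.

    Returns ("ok", stdout) on success, ("error", message) when an installed
    CLI fails, and ("missing", "") when none of the CLIs is installed.
    """
    for cmd in commands:
        if which(cmd[0]) is None:
            continue  # CLI not installed: legitimate fall-through to the next one
        result = run(list(cmd), capture_output=True, text=True)
        if result.returncode == 0:
            return "ok", result.stdout
        # The CLI exists but failed (404, auth, ...): surface it, never fall through.
        return "error", f"{cmd[0]}: {result.stderr.strip()}"
    return "missing", ""
```

Only the "missing" outcome should lead to the git-plumbing fallback; "error" is meant to be shown to the user before anything is dispatched.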
Map the fetched metadata to <base>..<head> SHA range using git rev-parse <baseRef> and git rev-parse <headRef>.
Case 2 — SHA range (argument contains ..). Use as-is. Metadata is empty: no PR description, just the diff.
Case 3 — No argument (auto-detect). Precedence (first match wins):
1. If HEAD is detached (git symbolic-ref -q HEAD returns non-zero), require an explicit argument: auto-detect is ambiguous in detached state. Abort with a one-line instruction telling the user to pass a SHA range.
2. Otherwise, resolve the remote's default branch with git symbolic-ref refs/remotes/origin/HEAD (handles main, master, trunk, or anything else the remote uses). Use <that-ref>..HEAD as the SHA range.
3. If no origin/HEAD is set (rare; usually means the remote was never properly cloned), check which of origin/main, origin/master, origin/trunk exist. If exactly one exists, use its merge base with HEAD and narrate the fallback ("[temper] origin/HEAD unset; fell back to origin/<name> as the only main-like ref present"). If more than one exists (legacy repos with both origin/main and origin/master), abort with: "Multiple main-like refs found (origin/main, origin/master). Pass an explicit <base>..<head> range; the model has no signal which one this branch was cut from." If none exist, abort with the existing "Cannot determine default branch" message.

Anti-rationalization: don't hardcode gh calls in the dispatch path. The skill is forge-agnostic; the CLI used is whichever the environment makes available. Skip metadata gracefully on missing CLIs; surface explicit errors on present-but-failing CLIs.
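The Case 3 precedence can be sketched as a pure decision function. This is a hedged illustration: `resolve_auto_base` is a hypothetical helper, its inputs are assumed to have been gathered beforehand from the git commands named above, and it returns only the base ref, leaving the merge-base computation to the caller.

```python
from typing import Iterable, Optional

MAIN_LIKE_REFS = ("origin/main", "origin/master", "origin/trunk")


def resolve_auto_base(
    head_detached: bool,
    origin_head: Optional[str],
    existing_refs: Iterable[str],
) -> str:
    """Pick the base ref for `<base>..HEAD` in auto-detect mode.

    head_detached: True when `git symbolic-ref -q HEAD` exited non-zero.
    origin_head:   the ref origin/HEAD points at, or None when it is unset.
    existing_refs: remote refs known to exist in the repository.
    """
    if head_detached:
        raise ValueError("Detached HEAD: pass an explicit <base>..<head> range")
    if origin_head:
        return origin_head
    present = [ref for ref in MAIN_LIKE_REFS if ref in set(existing_refs)]
    if len(present) == 1:
        return present[0]  # the caller narrates the fallback to the user
    if len(present) > 1:
        raise ValueError(
            f"Multiple main-like refs found ({', '.join(present)}). "
            "Pass an explicit <base>..<head> range"
        )
    raise ValueError("Cannot determine default branch")
```

Raising on the ambiguous cases mirrors the abort behavior described above rather than guessing a base.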
Classify the resolved diff before spending an engine dispatch on it. Empty / binary / submodule-only diffs are recognized and handled explicitly so they cannot produce silent false-Clean verdicts. (This preflight runs ahead of the R1 delve-engine dispatch — see Step 2's short-circuit ordering — so no engine cost is spent on a non-substantive diff.)
Run git diff --numstat <base>..<head> and inspect:
- Empty diff (no numstat rows): return "Clean — no changes to review" immediately. Do not dispatch the engine. Callers (build, finish) see this as "Clean" but with Reason: empty-diff, distinguishable from a substantive Clean.
- Binary-only diff (every numstat row is -\t-\t<path>, indicating binary): note in the engine scope description / round metadata that the diff is binary-only and content cannot be inspected. Do not produce a Clean verdict against unreviewable content; surface "Architectural — binary-only diff requires human review".
- Submodule-only diff (changes only to .gitmodules or submodule SHA pointers): note it in the round metadata and flag a Suggestion to inspect the submodule contents separately. Do not produce a spurious Clean.
- Mixed text and binary diff: pass the reviewable text changes to the engine as scope; note the binary files in the round metadata so they are not treated as reviewed.
- Oversized diff (more than 5,000 changed lines per numstat): warn the user and offer to split per-commit or per-file. If the user proceeds anyway, note the over-cap in the round metadata and dispatch with a context-window degradation warning. This is a soft cap, not a hard block. Non-interactive callers (build / debugging / finish dispatching /temper): on >5,000-line diffs, proceed automatically with the over-cap note and emit a degraded-context flag in the round metadata. Interactive (standalone) callers retain the offer-to-split flow above.

Empty-diff caller contract. Pipeline callers (build / debugging / finish) MUST treat Reason: empty-diff as a soft-warn: surface it to the user ("temper found no changes between BASE and HEAD; confirm this is intended") before proceeding past the gate. The most common causes are uncommitted work, a wrong base, or detached-HEAD post-rebase. Ad-hoc / standalone callers may proceed silently (the user invoked /temper knowing the state).
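The preflight classification can be sketched against raw `git diff --numstat` output, where each row is `added<TAB>deleted<TAB>path` and binary files show `-` in both count columns. This is a minimal sketch: `classify_numstat` and its result keys are hypothetical, counting added plus deleted lines toward the 5,000-line soft cap is an assumption, and submodule-only detection is left out because it needs more than numstat alone.

```python
from typing import Dict, List, Union


def classify_numstat(numstat: str, soft_cap: int = 5000) -> Dict[str, Union[str, int, bool, List[str]]]:
    """Classify a diff from `git diff --numstat <base>..<head>` output."""
    text_files: List[str] = []
    binary_files: List[str] = []
    changed_lines = 0
    for line in numstat.splitlines():
        if not line.strip():
            continue
        added, deleted, path = line.split("\t", 2)
        if added == "-" and deleted == "-":
            binary_files.append(path)  # numstat marks binary files with "-"
        else:
            text_files.append(path)
            changed_lines += int(added) + int(deleted)
    if not text_files and not binary_files:
        kind = "empty"
    elif not text_files:
        kind = "binary-only"
    elif binary_files:
        kind = "mixed"
    else:
        kind = "text"
    return {
        "kind": kind,
        "binary_files": binary_files,
        "changed_lines": changed_lines,
        "over_cap": changed_lines > soft_cap,  # soft cap: warn, do not block
    }
```

An "empty" or "binary-only" result is the signal to stop before any engine dispatch; "mixed" and "over_cap" only add notes to the round metadata.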
Round 1 is a recall pass + enumeration. Drive shared/delve-engine.md to find all gating defects in the diff, then build the tracked set T from its kept records.
Short-circuit ordering (two distinct short-circuits at two stages — the order is load-bearing):
1. The Step 1.5 preflight runs first. Its Reason: empty-diff Clean (no reviewer dispatched) is preserved exactly. The R1 engine dispatch happens only after the preflight admits a substantive, reviewable diff.
2. The empty-T short-circuit is a DISTINCT, LATER condition, reached only when the preflight already admitted a real diff AND the R1 engine ran successfully and enumerated nothing gating. It is not the empty-diff case (empty-diff = "nothing to review, no engine call"; empty-T = "the engine reviewed a real diff and found no Critical/Important CONFIRMED/PLAUSIBLE finding"). The guarded check below applies to this second condition only.

The R1 engine drive. Dispatch shared/delve-engine.md once with:
- scope = the full diff (<base>..<head> from Step 1).
- angles = the bug-finding subset only: line-by-line, removed-behavior, cross-file. (The four quality angles are excluded; they are capped non-gating per the contract and never enter T.)
- effort = high (recall-biased; the tier delve-engine §3 pins for a gating hunt).
- cap = set explicitly HIGH, well above the expected |T|. delve-engine §6: cap truncates the ranked output and does not guarantee a gating finding is preserved, so a Critical/Important above the cap would be silently dropped. Do not leave it at the default 10.

The fan-out and the per-candidate verify gate run through the harness-adapter dispatch mechanism; temper issues no harness-specific call inline (I1). On a harness with no parallel-subagent primitive, the adapter's sequential fallback runs the angles one pass per angle (it warns once that recall may drop).
Build T from the kept records. From the engine's kept eight-field records, T = the gating 2×2 of the contract (§3): every kept record whose verdict is CONFIRMED or PLAUSIBLE and whose severity is Critical or Important. PLAUSIBLE@C/I gates even without a runnable repro — that is the recall hole this model closes. (Reference shared/severity-verdict-contract.md for the gating rule; temper copies no table.)
Identity key vs. adjudication payload (distinct). Each member of T has:
- An identity key, the five-field tuple {file, line, summary, severity, verdict}, used to dedup T and carry a member forward across rounds. (A subset of the engine's eight-field record; failure_scenario, scope, effort are not needed to distinguish one member from another.)
- Its full originating eight-field record ({file, line, summary, failure_scenario, severity, verdict, scope, effort}), retained per member keyed by that identity. This is the adjudication input the R2+ per-member re-verifier (Track B, Step 4) receives: it needs failure_scenario to re-derive REFUTED-after-fix / a code-based downgrade against the fixed code. The identity key is the carry-forward handle; the eight-field record is the adjudication payload.

Each member also carries a transient readjudicated boolean. It is per-round auxiliary state, NOT part of the identity key (two members with identical identity tuples are the same member regardless of readjudicated). Initialize a newly-enumerated R1 member to readjudicated = false.
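Building T from the engine's kept records, keyed by the five-field identity, can be sketched as below. This is an illustration under assumptions: records are taken to be plain dictionaries carrying the eight fields named above, and `build_tracked_set` is a hypothetical helper, not a function defined by the skill or the engine.

```python
from typing import Dict, Iterable, Tuple

GATING_VERDICTS = {"CONFIRMED", "PLAUSIBLE"}
GATING_SEVERITIES = {"Critical", "Important"}
IDENTITY_FIELDS = ("file", "line", "summary", "severity", "verdict")


def build_tracked_set(kept_records: Iterable[dict]) -> Dict[Tuple, dict]:
    """Return T as {identity key: {"record": ..., "readjudicated": False}}."""
    tracked: Dict[Tuple, dict] = {}
    for record in kept_records:
        gating = (
            record["verdict"] in GATING_VERDICTS
            and record["severity"] in GATING_SEVERITIES
        )
        if not gating:
            continue  # REFUTED or below-C/I records never enter T
        key = tuple(record[field] for field in IDENTITY_FIELDS)
        # setdefault dedups: identical identity tuples are the same member.
        tracked.setdefault(key, {"record": record, "readjudicated": False})
    return tracked
```

Keeping the full record next to the key means the later per-member re-verifier can be handed `failure_scenario` without a second lookup.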
Empty-T-at-R1 short-circuit (guarded — two-part sanity check). If the R1 engine drive (reached only past the Step 1.5 preflight) yields an empty T, short-circuit to Clean only after both:
1. The engine ran successfully. If the dispatch failed and merely returned an empty T: do not short-circuit; surface it as a dispatch failure / re-attempt, never as Clean.
2. The cap did not truncate the GATING subset. Key on the Critical/Important kept-record count, not the total kept count: short-circuit only if the C/I kept-record count is strictly below cap AND T is empty. (A non-gating Minor/Suggestion tail filling cap is harmless; a gating finding is lost to truncation only when the C/I subset alone reaches cap.) If the C/I kept-record count equals or exceeds cap, treat it as gating-subset truncation and do not short-circuit.

Scope of this guard: empty-T only. This C/I-saturation guard decides solely whether a genuinely-empty T may short-circuit to Clean. When T is non-empty, a cap-saturated C/I set means "proceed to fix a (large) T," never "re-run R1": the enumerated members are already a valid gating T and the loop proceeds to Step 3; R2+'s changed-region fan-out + cheap full-diff sweep (Step 4 Track A) re-hunts the whole range each round and admits any gating finding the R1 cap dropped.
Bounded re-run (no unbounded loop). When T is empty AND the C/I subset is cap-saturated, re-run R1 exactly once at a doubled cap:
- Re-run yields a non-empty T → proceed to fix T (Issues-Found path); no further re-run.
- Re-run yields an empty T with C/I subset now strictly below the doubled cap → genuine empty-T, short-circuit to Clean.
- Re-run yields an empty T but C/I subset still at/above the doubled cap → do NOT re-run again. Proceed with the (empty) T and emit a cap-saturation signal in the round metadata (analogous to degraded-context, Step 1.5) so the caller sees "the gating set could not be bounded under cap". This is surfaced as a degraded/indeterminate verdict requiring human attention, never a clean Clean and never a runaway re-run loop. At most ONE doubled-cap re-run ever occurs.

If the C/I count is strictly below cap on the first run, no gating finding was truncated, so an empty T is genuine even if the non-gating tail consumed the rest of cap: short-circuit to Clean directly, no re-run.
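The guarded empty-T check and its single doubled-cap re-run can be sketched as one decision function. This is a hedged sketch: `empty_t_decision` and its outcome strings are hypothetical labels, and the caller is assumed to invoke it once after the first R1 run and, if told to re-run, once more with the doubled cap and `already_rerun=True`.

```python
def empty_t_decision(
    engine_ok: bool,
    ci_kept_count: int,
    cap: int,
    t_is_empty: bool,
    already_rerun: bool = False,
) -> str:
    """Decide what follows an R1 engine drive.

    ci_kept_count is the number of kept Critical/Important records,
    not the total kept count.
    """
    if not t_is_empty:
        return "proceed-to-fix"     # a non-empty T is never re-run
    if not engine_ok:
        return "dispatch-failure"   # an empty T from a failed run is never Clean
    if ci_kept_count < cap:
        return "clean"              # gating subset provably un-truncated
    if not already_rerun:
        return "rerun-doubled-cap"  # exactly one bounded re-run
    return "cap-saturation"         # degraded verdict, needs human attention
```

Because `already_rerun` can only flip once, the function cannot drive an unbounded re-run loop.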
Per-invocation dispatch-id (concurrency isolation). Every /temper invocation generates a unique dispatch-id at Step 1: temper-YYYYMMDDTHHmmss-<6-char-nonce>. Generate via a cryptographic RNG (e.g., python -c 'import secrets; print(secrets.token_hex(3))' for 6 hex chars). If a dispatch file path already exists on disk, regenerate the nonce and retry — never overwrite. The dispatch file path and the metadata.dispatch_id field both include this id, so concurrent invocations (e.g., user-initiated overlapping with build's Phase 4) cannot collide. Round numbering remains per-invocation; the dispatch-id disambiguates (skill, round) traceability tuples in the external_review MCP and in any session-log consumers. (The R2+ Track-B per-member dispatches extend this stem with a -m<NN> member suffix — see Step 4.) Dispatches are disk-mediated per shared/dispatch-convention.md: write the filled prompt/inputs to a dispatch file (one file per dispatch-id), then dispatch a Task subagent that reads that file — never paste inputs directly into the Task tool prompt.
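Generating a collision-checked dispatch-id can be sketched as follows. This is a minimal illustration: `new_dispatch_id`, the `<id>.md` file naming, the retry limit, and the use of UTC for the timestamp are all assumptions made for the example rather than details fixed by the dispatch convention.

```python
import secrets
from datetime import datetime, timezone
from pathlib import Path
from typing import Optional


def new_dispatch_id(
    dispatch_dir: Path,
    now: Optional[datetime] = None,
    max_attempts: int = 10,
) -> str:
    """Return temper-YYYYMMDDTHHmmss-<6 hex chars>, unused on disk."""
    stamp = (now or datetime.now(timezone.utc)).strftime("%Y%m%dT%H%M%S")
    for _ in range(max_attempts):
        # secrets.token_hex(3) yields 6 hex characters from a CSPRNG.
        dispatch_id = f"temper-{stamp}-{secrets.token_hex(3)}"
        if not (dispatch_dir / f"{dispatch_id}.md").exists():
            return dispatch_id  # never reuse an existing dispatch file path
    raise RuntimeError("could not find an unused dispatch-id")
```

The existence check is not atomic; a stricter variant could create the file with exclusive mode (`open(path, "x")`) and retry on `FileExistsError`.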
Temper's core principle ("fresh agent every round, no anchoring beyond the enumerated T") is convention-plus-mechanism. Each round uses a fresh agent — no reviewer reuse — with one deliberate, documented exception: the R2+ per-member re-verifier (Track B) must receive T (the enumerated tracked set is its input — that is fix-verification). The exception is scoped to T only.
The R2+ Track-B verifier receives, across the boundary, only these inputs and nothing else:
- The fixed code to adjudicate against ({FIXED_BASE_SHA} / {FIXED_HEAD_SHA}, see Step 4).
- For each member of T: its full originating eight-field delve-engine record keyed by the 5-field identity, plus the transient readjudicated flag.

It must not receive:
- Prior-round review output beyond T (only the enumerated T records cross the boundary, not "everything the last round said").
- Commit history. The verifier is instructed (in temper-reviewer.md) to read diff/code content only, not git log. The no-git log anchoring guard still holds; this shifts the boundary from orchestrator-side redaction (unenforceable, since the verifier runs its own git) to verifier-side discipline.

This boundary is what keeps round-N independent of round-N-1 apart from the sanctioned T carry. Step 5's optional post-to-PR happens after a round completes; on subsequent rounds the fresh agent is dispatched against the fixed code, and PR comments (which now contain prior findings) are excluded from the metadata fetch.
- Record T membership, NOT a count. For each member persist its five-field identity ({file, line, summary, severity, verdict}), its full eight-field record, and its readjudicated flag. Convergence keys on the resolution status of these enumerated members, never on a Critical+Important count compared across rounds.
- Fix every member of T. A PLAUSIBLE@C/I member gets the same fix priority as a CONFIRMED one: it is a real regression the verifier could only call PLAUSIBLE for lack of a runnable repro, not a doubtful finding (per the contract / delve-engine). Fix priority follows severity, not verdict.
- Non-gating findings (Minor / Suggestion) are reported but not tracked (they never enter T; see Severity / Verdict Vocabulary below).
- A repro-less PLAUSIBLE@C/I is not discharged by fixer prose; it discharges only via the §3.3 paths re-derived against the fixed code (Step 4).

After fixing the members of T, each round R ≥ 2 runs two SEPARATE dispatch tracks against the FIXED code. They are not the same dispatch and not the same gate. Both use fresh agents (no reviewer reuse); the only prior-round input that crosses the freshness boundary is the enumerated T (§3.7, Freshness Boundary above).
Fixed-region derivation (the scope Track A receives). Per round R, compute the incremental changed-region set = the regions the fixer touched since the prior round's verification pass, PLUS a cheap full-diff regression backstop:
- Committed mode (HEAD@R != HEAD@R-1): fix delta = diff(HEAD@R-1 .. HEAD@R).
- Uncommitted mode (HEAD@R == HEAD@R-1): snapshot the working tree at each round boundary via git stash create (it builds a tree/commit object capturing tracked working-tree modifications and prints its SHA without touching HEAD, the index, or the working tree, so the fixer is never disturbed; crucible:checkpoint is the documented fallback). Fix delta = diff(snapshot@R-1 .. working-tree@R). Record each round's snapshot SHA in the per-invocation round metadata keyed by (dispatch_id, round) (the same structure carrying readjudicated flags and round verdicts); the snapshot is a dangling object addressed by SHA, kept out of refs/. Mode is selected per round boundary by checking whether HEAD advanced, so a run may switch modes round-to-round.
- Full-diff backstop: a sweep over the whole base..head range, kept cheap (it is a backstop), to catch a regression outside the touched hunks that a narrow fix-delta scope would miss (the exact hole the old count-delta model had).
- Snapshot cleanup: the dangling snapshot objects are left to git gc. No snapshot survives past the invocation.

Drive shared/delve-engine.md over the round-R changed-region set above (incremental fix delta + cheap full-diff sweep) as scope, bug-angle subset, with cap set explicitly HIGH, well above the expected count of new gating findings a single round's fix can introduce (mirror R1's cap reasoning; the default cap=10 is insufficient for a gating hunt). The fan-out proposes new candidates; each passes through delve-engine's own verify gate (one verifier per deduped candidate). This is delve-engine's only R2+ job: hunting NEW findings. The engine has no per-member re-verification input, so re-verification of T does not route through it.
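The per-round mode selection for the fix delta can be sketched as below. This is an illustrative sketch: `snapshot_worktree` and `fix_delta_range` are hypothetical helpers, and the returned tuple shape is an assumption; the only git behavior relied on is that `git stash create` prints a commit SHA, or nothing when there are no local changes.

```python
import subprocess
from typing import Optional, Tuple


def snapshot_worktree() -> Optional[str]:
    """Snapshot tracked working-tree changes without touching HEAD or the index.

    Returns the snapshot SHA, or None when there is nothing to snapshot.
    """
    result = subprocess.run(
        ["git", "stash", "create"], capture_output=True, text=True, check=True
    )
    return result.stdout.strip() or None


def fix_delta_range(
    prev_head: str,
    head: str,
    prev_snapshot: Optional[str],
    snapshot: Optional[str],
) -> Tuple[str, str, str]:
    """Return (mode, from_ref, to_ref) for the incremental fix delta of round R."""
    if head != prev_head:
        # HEAD advanced since the prior round: committed mode.
        return ("committed", prev_head, head)
    if prev_snapshot is None or snapshot is None:
        raise ValueError("uncommitted mode needs a snapshot at both round boundaries")
    return ("uncommitted", prev_snapshot, snapshot)
```

Because the mode is recomputed from the two HEAD values every round, a run can move between committed and uncommitted mode without extra state.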
A new candidate the gate assigns CONFIRMED/PLAUSIBLE @ Critical/Important is admitted to T with readjudicated = false (the new-member admission gate). Raw fan-out output is never admitted directly; an unverified, REFUTED, or below-C/I candidate stays out of T.
Track-A cap-saturation signal. Apply the same gating-subset truncation check Step 2 applies at R1: if the Track-A Critical/Important kept-record count reaches cap (equals or exceeds it), the admitted new-finding set may be truncated — a new gating finding the fix introduced could sit above the cap. In that case the round is not read as a clean "no new gating finding entered"; surface a cap-saturation signal in the round metadata. The Clean condition's "no new gating finding entered" (Done When) is satisfiable only when Track-A's C/I kept-record count is strictly below cap (gating set provably un-truncated). (Key on the C/I subset, not the total kept count — a non-gating tail filling cap is harmless.)
Track B: re-verify each t ∈ T (temper-owned per-member dispatch)

Re-verification is temper-owned, separate from the engine fan-out. For each member t ∈ T, dispatch one fresh temper-reviewer.md adjudicator (one per member) to adjudicate that single member against the WHOLE round-R fixed tree. temper-reviewer.md re-applies the contract's verdicts/severity (CONFIRMED / PLAUSIBLE / REFUTED + severity, per shared/severity-verdict-contract.md §2) to the fixed code; it defines no verdict vocabulary of its own (I11) and needs no delve-engine input. It is a temper template, dispatched through the harness-adapter subagent mechanism and fed the member's full eight-field record. It is not a delve-engine verify-gate adjudicator.
Track-B adjudication range = the WHOLE FIXED TREE, never the incremental delta (the most important Track-A-vs-Track-B distinction). For each member the adjudicator re-checks t.failure_scenario against original base .. round-R fixed-tree:
- Uncommitted mode: original base .. working-tree-snapshot@R (the git stash create snapshot SHA), NOT diff(snapshot@R-1 .. working-tree@R).
- Committed mode: original base .. HEAD@R.

A member's failure_scenario may live in code the fixer did not touch this round; if Track-B re-verified against only the incremental delta, that member's code would be absent from the diff and the adjudicator would falsely conclude RESOLVED-by-absence, re-opening the recall hole. Re-verifying against the whole fixed tree guarantees a member whose defect persists in unchanged code is correctly re-affirmed STILL-GATING.
Per-member dispatch contract (Track-B slots — see §5 of the plan / temper-reviewer.md):
- {MEMBER_RECORD}: the full eight-field delve-engine record for the single member being adjudicated, JSON-serialized, keyed by its 5-field identity. failure_scenario is the construct re-checked against the fixed code.
- {FIXED_BASE_SHA} = the original base (the same base R1 enumerated against, NOT the prior round's HEAD/snapshot).
- {FIXED_HEAD_SHA} = the round-R fixed-tree ref: the git stash create snapshot SHA (uncommitted mode) or HEAD@R (committed mode).
- {READJUDICATED}: the member's transient readjudicated flag carried across the boundary; the adjudicator sets the flag on emitting its per-member outcome (feeds the §3.5 defer-once-only bookkeeping).

One-dispatch-file-per-member rule. N members in a round ⇒ N per-member dispatch files, each adjudicating exactly one member; there is no aggregate "review all of T in one dispatch." This preserves fresh-eyes isolation per member.
Per-member dispatch-id scheme. Each per-member dispatch extends the per-invocation stem (temper-YYYYMMDDTHHmmss-<6hex>, Step 2) with a zero-padded member-index suffix -m<NN> (e.g. temper-20260603T101500-a7f3c2-m01, …-m02). Sharing the round stem but differing in -m<NN> means N concurrent per-member dispatch files cannot collide on disk; if a generated path already exists, regenerate-and-retry / never-overwrite as for the base id. Each per-member metadata.dispatch_id carries the full …-<6hex>-m<NN> id.
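The member suffix scheme can be sketched in a few lines. `member_dispatch_ids` is a hypothetical helper; it assumes members are numbered from 1 in a stable order and that two digits of zero-padding suffice (an index above 99 would simply widen the suffix).

```python
from typing import List


def member_dispatch_ids(stem: str, member_count: int) -> List[str]:
    """Extend the per-invocation stem with -m01, -m02, ... one id per member."""
    return [f"{stem}-m{index:02d}" for index in range(1, member_count + 1)]
```

For example, `member_dispatch_ids("temper-20260603T101500-a7f3c2", 2)` returns the two ids ending in `-m01` and `-m02`.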
(R1 dispatches no temper-reviewer.md — it is pure delve-engine enumeration. Track A never uses temper-reviewer.md; temper-reviewer.md is the Track-B per-member re-verifier only.)
temper-reviewer.md re-applies the contract verdicts/severity to the fixed code and emits, per member, one of:
- RESOLVED: the defect is fixed in the fixed code; the member leaves T.
- REFUTED-after-fix (PLAUSIBLE@C/I only): the adjudicator actively re-derives the contract's REFUTED verdict against the fixed code (the suspect construct is provably gone). A DISCHARGE: the member leaves T. (Same REFUTED verdict from the contract, re-applied; temper coins no new verdict.)
- Code-based downgrade: the adjudicator re-assigns severity per the contract's scale to a tier below C/I against the fixed code → it leaves the gating 2×2 (folds to Minor/Suggestion, reported verbatim). A DISCHARGE: the member leaves T.
- STILL-GATING: the defect persists in the fixed code; the member stays in T.
- Unresolved (PLAUSIBLE@C/I only): the adjudicator can neither re-derive REFUTED nor downgrade → the member is escalation-eligible (architectural / human-ack path, §3.3).

Repro-less PLAUSIBLE@C/I members (§3.3)

A repro-less PLAUSIBLE@C/I could otherwise trap T non-empty forever. Each round the adjudicator RE-ADJUDICATES it against the fixed code. It discharges ONLY by becoming REFUTED-after-fix or by a code-based severity downgrade below C/I. It NEVER discharges on fixer rationale alone: prose-discharge of a repro-less PLAUSIBLE is the exact recall hole this redesign closes.
A member the adjudicator can neither re-derive-REFUTED nor downgrade is escalation-eligible. Escalation is NOT a DISCHARGE: such a member is neither RESOLVED nor DISCHARGED, so it stays LIVE in T and BLOCKS Clean. It does not silently leave T to permit a Clean in the same run — instead the loop routes to the terminal Architectural verdict (handing it to human-ack) once the previously-seen unresolved subset is solely escalation-eligible (branch table below). It leaves the loop by escalating, never by silent accept. (The fixer-rationale-the-verifier-accepts path remains available only for genuinely-CONFIRMED findings the fixer argues are false positives — the verifier still adjudicates the rationale — and does not apply to repro-less PLAUSIBLE@C/I.)
New-member admission gate. A new gating finding enters T in round R only after Track A's verify gate assigns it CONFIRMED/PLAUSIBLE @ C/I; it initializes readjudicated = false.
"Previously-seen" boundary. A member is previously-seen in round R iff it was in T at the END of round R-1. A member admitted in round R is NOT previously-seen until R+1 — so in its admitting round it cannot trip Stagnation. R1's enumerated T is the end-of-R1 set; members survive into R2 as previously-seen. R1 has no prior round, so Stagnation can never fire on R1.
Evaluate the round (the Stagnation / branch-table logic is subordinate to escalation):
- Clean: every member of T is RESOLVED or DISCHARGED (RESOLVED, or DISCHARGED via §3.3, meaning REFUTED-after-fix or a code-based downgrade-below-C/I only) AND no new gating finding entered this round. An escalation-eligible unresolved member is neither RESOLVED nor DISCHARGED, stays live in T, and BLOCKS Clean; temper never emits Clean while a live escalation-eligible member is in T (I6). Such a member leaves the gate by the loop routing to Architectural, never by silent removal so a Clean can be emitted.
- A round that admits a new gating finding is never Clean, however much of T resolved that round. Even if every previously-seen member RESOLVED/DISCHARGED, the "no new gating finding entered" condition fails. The new member enters with readjudicated = false and becomes Clean-blocking and Stagnation-eligible only from R+1. Clean is reachable only in a round that both resolves/discharges all carried members and admits no new gating finding.
- Issues Found: T has unresolved members; loop continues. (Reported into the round report, but loop-continuing; not a terminal verdict.)
- Stagnation: the previously-seen unresolved subset of T does NOT shrink across two consecutive (non-deferred) rounds, i.e. no previously-seen member became resolved/discharged. Judged only on the previously-seen unresolved subset; fires regardless of newly-admitted members (resolving one old member while admitting one new is progress; resolving zero old members while admitting new ones trips Stagnation). The earliest evaluation is R2-vs-(end-of-R1). Termination is guaranteed by the round cap (Max-Rounds), NOT by Stagnation; Stagnation is an early-exit optimization. A Defer round does NOT count in the consecutive-round sequence (it is skipped; the next non-deferred round compares against the last non-deferred prior evaluation). The earliest that a member whose only prior round was its defer can FIRE Stagnation is R+2 (it is eligible to contribute a non-shrink observation at R+1; two consecutive non-deferred non-shrinking evaluations are required to fire).
- Defer: one round, once only per member, enforced via readjudicated: deferring sets the flag; a member whose flag is already set does NOT defer again. Not a merge verdict the user sees.

Before declaring Stagnation, branch on WHY the non-shrinking previously-seen unresolved subset did not shrink (these rows are EXHAUSTIVE over composition of a non-empty, non-shrinking subset; the empty case is Clean / Issues-Found, not here; the verdict is total, no silent fall-through to Max-Rounds):
| Subset composition (non-empty, non-shrinking) | Verdict |
|---|---|
| SOLELY escalation-eligible members (repro-less PLAUSIBLE@C/I the adjudicator can neither re-derive-REFUTED nor downgrade) | Architectural (needs human adjudication — not wheel-spinning) |
| SOLELY members not yet re-adjudicated this round (readjudicated == false) | Defer one round (give each its re-adjudication pass) |
| MIXED escalation-eligible + not-yet-re-adjudicated, with NO genuinely-stuck member | Defer one round (next round reduces to SOLELY escalation-eligible → Architectural, or shrinks normally) |
| Any subset containing ≥1 genuinely-stuck member (readjudicated == true AND not escalation-eligible AND still unresolved), regardless of what else it contains | Stagnation (real churn: a member that could resolve via fix but the fixer keeps failing) |
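The branch table can be sketched as a total function over the composition of the non-shrinking subset. This is a hedged illustration: members are modeled as dictionaries with hypothetical `readjudicated` and `escalation_eligible` keys, and the sketch assumes an escalation-eligible member is classified as such before the flag is consulted.

```python
from typing import Iterable, List


def classify_member(member: dict) -> str:
    """Label one unresolved, previously-seen member."""
    if member["escalation_eligible"]:
        return "escalation"  # needs human adjudication
    if not member["readjudicated"]:
        return "pending"     # still awaiting its re-adjudication pass
    return "stuck"           # re-adjudicated, not escalation-eligible, unresolved


def non_shrinking_verdict(members: Iterable[dict]) -> str:
    """Verdict for a non-empty, non-shrinking previously-seen unresolved subset."""
    kinds: List[str] = [classify_member(m) for m in members]
    if not kinds:
        raise ValueError("empty subset is the Clean / Issues-Found case, not this table")
    if "stuck" in kinds:
        return "Stagnation"     # any genuinely-stuck member wins
    if all(kind == "escalation" for kind in kinds):
        return "Architectural"  # solely escalation-eligible
    return "Defer"              # solely pending, or pending mixed with escalation
```

Checking for a stuck member first is what makes the four rows exhaustive: every remaining composition is either all escalation-eligible or contains at least one pending member.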
- readjudicated == true && !escalation-eligible && unresolved: it has had its re-verification pass against the fixed code and is still gating, and is not escalation-eligible. Fully observable (no counterfactual). A member still readjudicated == false is awaiting its first re-verification pass (→ Defer), not genuinely stuck.
- readjudicated == false). "Fix-attempted" cannot distinguish a defer-eligible member from a stuck one (Step 3 fixes all unresolved C/I every round, so a carried member was already fix-attempted in its admitting round).
- readjudicated == false, never by freshly-admitted ones.

The loop is bounded by 5 rounds by default; this cap is the termination guarantee (Stagnation is only an early-exit optimization). At round max_rounds without a terminal verdict, escalate to the user with T's unresolved members:
"Temper reached the {max_rounds}-round cap without resolving every member of the tracked set
T. Unresolved members: [{file:line — summary — severity/verdict} for each live member of T]. To extend, re-invoke /temper <scope> max_rounds=N; this starts a fresh review loop with a higher cap (the new loop re-enumerates T from scratch; round counting restarts at 1, and the fresh agents have no anchoring from prior rounds beyond a re-enumerated T). If the remaining members appear structural rather than fixable in another loop, escalate to design / plan instead."
Callers (build Phase 4) treat the cap escalation as a soft block: the diff is not approved; the user decides whether to extend, refactor, or accept the remaining members. Overridable via trailing max_rounds=<N>; defaults to 5 to keep runaway protection on by default.
The outcomes split into four terminal merge verdicts (the loop settles on exactly one and STOPS) and two non-terminal loop-continuation outcomes (the loop CONTINUES). Do not call this "five terminal verdicts."
Terminal merge verdicts (loop STOPS):
- T is RESOLVED or DISCHARGED (§3.3 only), AND no new gating finding entered this round. Caller may proceed.
- PLAUSIBLE@C/I the adjudicator can neither re-derive-REFUTED nor downgrade), or any round emits an architectural concern. Caller escalates immediately, regardless of round number (subordinate-to-escalation: this pre-empts the Stagnation antecedent).
- max_rounds reached without a terminal verdict. The termination guarantee. Caller escalates with T's unresolved members.

Non-terminal loop-continuation outcomes (loop CONTINUES):
- T has unresolved members (or a new gating finding was admitted this round). Reported into the round report but loop-continuing: it is emitted (so it appears in the caller-visible verdict set) yet does not terminate the loop.
- readjudicated). An internal continuation state the user does not see as a merge verdict; it suppresses Stagnation that round and consumes one round against Max-Rounds.

This step is an output convenience, not part of the review contract; findings are complete after Step 4 regardless of whether they're posted. It exists for users who want the local review surfaced on the PR for asynchronous collaborators.
If the user explicitly asks ("post this to the PR", "leave a review comment"), publish using whichever CLI fits the forge:
- gh pr review <id> --comment --body-file <findings.md>
- glab mr note <id> -m "$(cat findings.md)"
- bb pr comment <id> --file findings.md (or REST)

Confirm success explicitly. Check the CLI's exit code. On non-zero exit, classify the failure mode and respond per the table below; do not silently skip:

| Failure mode | Response |
|---|---|
| Auth-fail (gh auth status failure / token expired) / rate-limit (403) / network error | Paste-mode with retry guidance: "Posting failed with <error> — re-authenticate / wait and retry, or paste the body manually below." |
| PR closed-without-merge | Paste-mode with conditional guidance: "PR is closed; if you intend to reopen, paste the body. Otherwise the findings remain in your session." |
| PR merged or deleted | Do not offer paste-mode. Surface the findings locally: "The PR is no longer postable (merged / deleted). Findings remain in your session for reference." |
Never post without an explicit user instruction. Findings live in the user's session by default.
temper drives shared/delve-engine.md through the harness-adapter fan-out mechanism (shared/harness-adapter.md §4, §7), disk-mediated per shared/dispatch-convention.md — never a harness-specific call inline (I1). Round 1 drives the engine (high effort, bug-angle subset) to enumerate the tracked set T; Rounds 2+ re-hunt the changed range the same way. Where a harness has no parallel-subagent primitive, the adapter's sequential fallback (§5) runs the angles as multiple sequential passes, warning once that recall may drop.
temper is one of the exactly two files that dispatch delve-engine directly (the other is delve). The canonical engine-dispatch marker line follows; the I2 allowlist test keys on it with the anchored pattern grep -rn '^dispatch: delve-engine':
dispatch: delve-engine
temper defines no severity scale and no verdict vocabulary of its own (I11). The four-tier severity scale (Critical / Important / Minor / Suggestion), the verify-gate verdicts (CONFIRMED / PLAUSIBLE / REFUTED), and the gating rule are the contract's — see shared/severity-verdict-contract.md (canonical-included in the header; this section references it, it does not copy its tables).
T)

Convergence keys on the tracked set T, computed by temper from the engine's {severity, verdict} kept records per the contract §3:

T = {CONFIRMED, PLAUSIBLE} × {Critical, Important}
A kept finding enters T iff its verdict is CONFIRMED or PLAUSIBLE and its severity is Critical or Important. A PLAUSIBLE@C/I gates even without a runnable repro (the recall hole this model closes). Minor and Suggestion are both non-gating — they never enter T (reported verbatim, never dropped). See the contract's full verdict × severity matrix; temper adds nothing to it. There is no count-delta mapping and no Suggestion-folding-into-Minor: convergence is the resolution status of T's enumerated members, never a count.
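
The gating rule can be sketched as a plain membership test. This is an illustrative sketch only: the dict-shaped finding records and the function names are assumptions, and the authoritative matrix remains the one in shared/severity-verdict-contract.md.

```python
GATING_VERDICTS = {"CONFIRMED", "PLAUSIBLE"}
GATING_SEVERITIES = {"Critical", "Important"}


def enters_tracked_set(finding: dict) -> bool:
    """True iff the kept finding gates, i.e. belongs in T."""
    return (
        finding["verdict"] in GATING_VERDICTS
        and finding["severity"] in GATING_SEVERITIES
    )


def split_kept(kept: list[dict]) -> tuple[list[dict], list[dict]]:
    """Partition kept findings into (T, non-gating). Nothing is dropped."""
    tracked = [f for f in kept if enters_tracked_set(f)]
    non_gating = [f for f in kept if not enters_tracked_set(f)]
    return tracked, non_gating
```

Note that `split_kept` returns the non-gating findings rather than discarding them, matching the rule that Minor and Suggestion findings are reported verbatim. Convergence is then judged on the members of `tracked`, not on `len(tracked)`.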
Two parallel vocabularies for temper's merge verdict exist for historical reasons; temper recognizes both as synonyms:
| Canonical | Accepted synonym |
|---|---|
| Clean | Approved |
| Issues Found | Needs Fixes |
| Architectural | Architectural Concern ≡ Escalate |
The left column is canonical; the right-column synonyms remain accepted to avoid breaking older callers.
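
A caller that accepts both vocabularies might normalize to the canonical column before branching. The mapping below is copied from the table; the function name and the pass-through behavior for unlisted strings are assumptions made for illustration.

```python
# Accepted synonym -> canonical merge verdict, as listed in the table.
VERDICT_SYNONYMS = {
    "Approved": "Clean",
    "Needs Fixes": "Issues Found",
    "Architectural Concern": "Architectural",
    "Escalate": "Architectural",
}


def canonical_verdict(verdict: str) -> str:
    """Map an accepted synonym to its canonical name; pass other values through."""
    return VERDICT_SYNONYMS.get(verdict, verdict)
```

Passing unknown strings through unchanged keeps canonical names idempotent, so `canonical_verdict` can be applied safely to output from either older or newer callers.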
These are temper's loop outcomes — distinct from the contract's per-finding verify-gate verdicts above. Four terminal merge verdicts: Clean, Stagnation, Architectural, Max-Rounds. Two non-terminal loop-continuation outcomes: Issues-Found (reported but loop-continuing) and Defer-one-round (internal continuation state). Each is defined in Done When (Step 4).
When enabled, external_review is a candidate source for the R1 verify gate, not a parallel scored reviewer. External findings inject into delve-engine's external_candidates input on Round 1 only; the same verify gate adjudicates them (CONFIRMED / PLAUSIBLE / REFUTED + severity) before any can enter T. They never bypass the gate and never run as a separate scored pass.
- external_review=skip (default-on stays): no external candidates; delve's fan-out is the only candidate source.
- external_candidates feed; the verify gate adjudicates them cross-origin with the internal fan-out candidates. R1-only.

Prose → external_candidates DRAFT transform (the wiring step). temper's external_review step emits free-form prose findings; delve-engine's external_candidates input (delve-engine §2) takes a list of DRAFT records, each {file, line, summary, severity} with NO verdict. So temper converts each external prose finding into one draft record: extract file/line from its location, summary from its one-line what-and-where, and a severity DRAFT hint from its stated severity. Per delve-engine §2/§5: any inbound verdict is discarded (the verify gate is the sole verdict authority) and the inbound severity is a draft hint only that the gate re-assigns per the contract, so a malformed/over-stated external severity cannot inject an authoritative Critical. The drafts merge into delve-engine's pre-dedup candidate pool and are adjudicated alongside the internal fan-out candidates.
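
The transform can be sketched as follows, assuming the prose finding has already been reduced to a dict with hypothetical `location`, `summary`, `severity`, and optional `verdict` keys. The record type and function name are illustrative, not part of delve-engine.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DraftCandidate:
    """Draft record {file, line, summary, severity}; deliberately has no verdict field."""
    file: str
    line: int
    summary: str
    severity: str  # draft hint only; the verify gate re-assigns it


def to_draft(finding: dict) -> DraftCandidate:
    """Convert one external finding into a draft record, discarding any inbound verdict."""
    path, sep, line_text = finding["location"].rpartition(":")
    if not sep or not line_text.isdigit():
        raise ValueError(f"cannot extract file:line from {finding['location']!r}")
    # finding.get("verdict") is intentionally never read.
    return DraftCandidate(
        file=path,
        line=int(line_text),
        summary=finding["summary"],
        severity=finding["severity"],
    )
```

Because the record type has no verdict field at all, an inbound verdict cannot reach the candidate pool by accident; only the severity hint travels with the draft.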
Cadence (R1-only) + documented limitation. External candidates are gathered once per /temper invocation, on Round 1 only — to second-opinion the initial finding set without multiplying external-API cost across the fix loop. R2+ do not re-run external_review: an external candidate REFUTED on R1 cannot re-enter T; fix-introduced regressions are caught by delve's own R2+ fan-out (changed-region scan + cheap full-diff sweep, Step 4 Track A), not by re-running external_review.
Re-invocation skip rule: Re-invoking /temper via max_rounds=N after a stagnated run normally triggers another R1 external candidate gather. To avoid redundant external API spend on essentially-the-same diff, pass external_review=skip. (No automatic same-diff detection — skip is explicit-only.)
Gather external candidates by calling external_review with:
- prompt: contents of skills/shared/external-review-prompt.md
- context: the same diff and requirements context the R1 engine drive receives
- skill: "temper" (top-level argument for per-skill toggle enforcement)
- metadata: {"skill": "temper", "round": 1, "dispatch_id": "<from Step 2>"} (traceability)

Per-skill toggle: The server checks the skill argument against skills.temper in the external review config. If false, the server returns unavailable and temper gathers no external candidates. Server hyphen-normalization: mcp-servers/crucible-consensus/server.py normalizes hyphens to underscores in the skill name before lookup, so a hyphenated skill name (red-team) and its underscored form (red_team) resolve to the same toggle. Today temper has no hyphen, so the contract works trivially, but the rule is documented here for future renames.
Config-rename note for opt-out users. The toggle key was renamed from code_review to temper on 2026-05-17. If you previously set skills.code_review: false to opt out, rename the key to skills.temper: false to preserve your opt-out. Otherwise the toggle inherits the default True.
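
The toggle lookup described above can be approximated like this. It is a sketch of the documented behavior, not the code in server.py: the function name is hypothetical, and it assumes the config's `skills` mapping uses underscored keys.

```python
def skill_enabled(skill: str, skills_config: dict) -> bool:
    """Look up a per-skill toggle, normalizing hyphens to underscores first."""
    key = skill.replace("-", "_")
    # A missing key inherits the default True.
    return bool(skills_config.get(key, True))
```

For example, `skill_enabled("temper", {"code_review": False})` returns `True`, which is why an opt-out written under the old key has to be renamed to `temper` to stay in effect.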
Graceful degradation → skip (gather no external candidates, ≡ external_review=skip). When the external source is degraded, temper silently gathers no external candidates and the engine run proceeds on its own fan-out:
- external_review tool not available (MCP server not running): gather none.
- status is "unavailable" (no config or disabled): gather none.
- status is "partial" (some models failed): feed the available external findings as drafts; note which models failed.

Either way the R1 engine drive proceeds: the external feed never blocks or delays it; on external failure the fan-out stands alone.
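
The degradation cases can be sketched as one function that never raises and never blocks. The response shape is an assumption: `None` stands for "tool not available", and the `findings` and `failed_models` keys are hypothetical names, since the actual response fields are not specified here.

```python
from typing import Optional


def gather_external(response: Optional[dict]) -> tuple[list, list]:
    """Return (findings_to_feed_as_drafts, failed_models_to_note)."""
    if response is None:
        # external_review tool not available: gather none.
        return [], []
    status = response.get("status")
    if status == "unavailable":
        # No config or disabled: gather none.
        return [], []
    findings = list(response.get("findings", []))
    if status == "partial":
        # Some models failed: feed what is available and note the failures.
        return findings, list(response.get("failed_models", []))
    return findings, []
```

Every branch returns a (possibly empty) list, so the caller can start the R1 engine drive unconditionally and merge whatever came back.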
For GitHub PRs only, Claude Code's built-in /ultrareview <PR> runs a deeper multi-agent review in a cloud sandbox. After a local /temper round, suggest /ultrareview to the user when either of these holds:
- shared/delve-engine.md §4), OR

"Distinct angles" means the engine's finder angles, not severity tiers; the trigger is breadth of issue surface, not depth of any single issue.
/ultrareview is GitHub-specific; do not suggest it for GitLab / Bitbucket / other-forge PRs.
[Just completed Task 2: Add verification function]
You: Let me request review before proceeding.
[Step 1: Resolve scope]
- No argument given; HEAD is on branch `feat/verify`
- gh pr view (current branch) → no PR yet
- origin/HEAD → main; resolved range: origin/main..HEAD
- BASE_SHA=$(git rev-parse origin/main)
- HEAD_SHA=$(git rev-parse HEAD)
[Step 1.5: Preflight]
- numstat: 4 files changed, 87 added, 12 deleted — text diff, in-cap, proceed
[Step 2: Round 1 — drive delve-engine (bug-angle subset, effort=high, cap=20), dispatch-id temper-20260603T150500-a7f3c2]
Engine ran non-error; C/I kept-count (2) < cap → T genuine, not truncated.
Enumerate T (2 members):
t1 verify.py:40 CONFIRMED / Important — no error handling for empty input
t2 verify.py:55 PLAUSIBLE / Important — progress callback may fire after cancellation (no runnable repro)
Minor: 1 (magic number) — non-gating, not in T.
Each member: readjudicated=false.
You: [Fix both members of T — t2 (PLAUSIBLE@Important) gets the same fix priority as t1 (CONFIRMED)]
[Step 4: Round 2 — Track A (delve-engine over fix delta + cheap full-diff sweep, cap=20) + Track B (one temper-reviewer per member, against the WHOLE fixed tree)]
Track B — re-verify carried members:
t1 → RESOLVED (empty-input guard now present)
t2 → REFUTED-after-fix (DISCHARGE: the post-cancellation callback path is provably gone) — leaves T
Track A — NEW gating finding admitted to T (readjudicated=false):
t3 verify.py:48 CONFIRMED / Important — the new guard swallows a real I/O error
All carried members resolved/discharged, BUT a new gating finding entered this round.
→ Verdict: Issues-Found (NOT Clean — a new-admit round is never Clean). Loop continues.
You: [Fix t3]
[Step 4: Round 3 — Track A + Track B]
Track B — re-verify carried member:
t3 → RESOLVED (I/O error now re-raised)
Track A — no new gating finding admitted; C/I kept-count (0) < cap (un-truncated).
Every member of T resolved/discharged AND no new gating finding entered.
→ Verdict: Clean. Proceed to Task 3.
The Round-2 verdict is Issues-Found, not Clean, even though both carried members resolved — Track A admitted a new gating finding (t3), and a new-admit round is never Clean. Clean is reached only in Round 3, which resolves the carried member and admits no new gating finding.
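
The Clean versus Issues-Found distinction in this walkthrough can be sketched as a two-condition check. The function and argument names are hypothetical, and the sketch deliberately ignores the Stagnation, Architectural, Defer, and Max-Rounds outcomes.

```python
SETTLED = {"RESOLVED", "DISCHARGED"}


def round_verdict(carried: dict[str, str], new_admits: list[str]) -> str:
    """Clean needs every carried member settled AND no new gating finding admitted."""
    all_settled = all(status in SETTLED for status in carried.values())
    if all_settled and not new_admits:
        return "Clean"
    return "Issues-Found"


# Round 2 of the example: both carried members settle, but t3 is admitted.
round_2 = round_verdict({"t1": "RESOLVED", "t2": "DISCHARGED"}, ["t3"])
# Round 3 of the example: t3 resolves and nothing new is admitted.
round_3 = round_verdict({"t3": "RESOLVED"}, [])
```

Under these assumptions `round_2` is `"Issues-Found"` and `round_3` is `"Clean"`, matching the verdicts in the transcript.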
When behavioral changes were made, consider dispatching crucible:test-coverage after temper completes. This catches stale tests, missing coverage, or assertion drift introduced by the fixes.
Caller context determines who runs it:
- crucible:test-coverage automatically; temper does not.
- /temper directly) is responsible for the hand-off. Recommend crucible:test-coverage when the review noted behavioral changes that might affect existing tests, when the diff modified functions with dedicated test files, or when the reviewer said "tests should be updated" without specifics.

This is the single canonical statement of the rule; the workflow sections below cross-link rather than restate.
Build pipeline (Phase 4): Temper runs after each task. Build dispatches crucible:test-coverage automatically (see Test Alignment).
Standalone plan execution: Temper after each batch (3 tasks). The user dispatches crucible:test-coverage afterward if behavioral changes were made.
Ad-hoc development: Temper before merge, when stuck, after a complex bug fix. The user dispatches crucible:test-coverage afterward if behavioral changes were made.
Migration note — pre-rename retrospectives. crucible:forge retrospectives written before 2026-05-17 are tagged with code_review. Forge's consult-past-lessons step does not auto-alias; if you want the old lessons to surface for temper, query both keys. (Out of temper's scope to fix; flagged here so users know.)
Never:
- T records cross the freshness boundary (see Freshness Boundary)
- T's enumerated members, never a count delta
- PLAUSIBLE@C/I member on fixer prose; it discharges only via REFUTED-after-fix or a code-based downgrade re-derived against the fixed code (§3.3)
- readjudicated)
- gh (or any single forge's CLI) in the dispatch path; temper is forge-agnostic

If the reviewer is wrong:
See template at: temper/temper-reviewer.md