internals/skills/strict-policy/SKILL.md
Operationalization of CLAUDE.md R1-R5 — the engineering-discipline rules that come BEFORE runtime verification. Covers: (R1) RCA on every failure via /ov-internals:root-cause-analyzer; (R2) no "pre-existing" / "out of scope" / "follow-up PR" classifications; (R3) no code duplication, generic over ad-hoc; (R4) no ad-hoc workarounds; (R5) hard cutover deletes the deprecated path AND every stale reference in the same commit. MUST be invoked when a failure / warning / anomaly surfaces, when the same pattern is about to land in a second surface, when a sleep / retry / magic-number is tempting, or when a cutover commit is about to ship.
R1–R5 in CLAUDE.md are the engineering-discipline gate. They sit ABOVE the runtime-verification rules R6–R9 and ABOVE the final acceptance gate R10. The order is deliberate: discipline failures are what produce the runtime failures that R6–R9 catch — so the rules that prevent discipline failures must come first.
This skill is the operational reference for R1–R5. Each section below restates the rule, lists what it forbids precisely, lists what it permits, explains why it matters, and notes how it interacts with the other rules. The rules exist because these specific failure modes actually happened — see CHANGELOG.md for the commit-referenced incidents that motivated each one.
A violation of any R1–R5 rule (or any of R6–R10, or the "Prioritize Clean Architecture Above All Else" section in CLAUDE.md) FORBIDS commit. There is no "downgrade tier and ship anyway" path. The agent fixes the violation in the same working tree and re-runs all verification, OR escalates to the operator and STOPS. No commit ships at any tier with a known violation. See CLAUDE.md "AI Attribution" section.
The rule. Every failure, error, anomaly, or warning surfaced by ANY tool (build, test, validator, runtime, eval, deploy, lint, hook) triggers IMMEDIATE invocation of /ov-internals:root-cause-analyzer BEFORE any remediation attempt. The first occurrence is the investigation trigger; there is no second-occurrence threshold.
Forbidden first responses. "probably a flake" / "rerun and see" / "transient" / "intermittent" / "works on retry" / "environmental" / "let me try once more" / "maybe the network was slow" / "let me clear the cache and re-run". These are confessions, not defences. Each is a way of avoiding the analysis the rule mandates.
What is permitted. The /ov-internals:root-cause-analyzer agent's 8-step process. After running it, if the analyzer concludes the root cause is genuinely external (network partition, upstream registry outage, kernel bug with a tracked upstream report), the conclusion is documented in the conversation with evidence — never assumed. Only after that documented conclusion is "retry" an authorized response.
Why it matters. Failures that look transient often aren't. A test that fails 1-in-N times because of a race condition will fail 1-in-(N/scale) times under load — pretending the failure is "flaky" hides the race instead of fixing it. R1 forces the investigation early, when the symptoms are simple, before the bug accumulates obscuring complications.
Interaction with other rules. R1 is the first response to ANY failure surfaced by any other rule's verification step. R7 (mandatory end-to-end gate) produces failures that R1 must investigate. R8 (generated-artifact invariants) produces failures that R1 must investigate. R10 (disposable + fresh-rebuild) produces failures that R1 must investigate. R1 is also the first response to a self-detected anomaly mid-session — including during planning, exploration, or normal coding.
The rule. ALWAYS validate ANY HIGH-RISK assumption empirically on a live `disposable: true` bed (/ov-internals:disposable) in the planning / early-coding phase, BEFORE the edits that depend on it. This is Risk Driven Development (RDD). R1 runs RCA on every failure AFTER it happens; RDD runs the validation FORWARD so wrong assumptions, unnecessary cautions, and erroneous root-cause theories never survive into the final plan or code. Never accept the skills, CLAUDE.md, or the current code as automatically correct: docs drift and code has bugs; for a high-risk call, reality is the only ground truth. RDD is the COMPLEMENT of skills-first (R0), not a substitute for a lookup: you still load the skill first, you just don't treat its high-risk claims as proven until a bed confirms them. Never trust, verify.
Risk — not documentation status — is the trigger. Low-risk orientation ("roughly what does this layer do") is a zero-risk skill lookup (R0) — do NOT burn a bed on it. High-risk (being wrong invalidates the plan, is costly / hard-to-reverse, or would mislead RCA) is proven on a bed REGARDLESS of what any doc or code asserts. The archetypal high-risk unknown: whether a SPECIFIC layer composition, at the LATEST available versions the resolver picks, builds / deploys / runs TOGETHER — no skill can certify a never-composed combination.
Forbidden internal-voice triggers. "the layers probably compose, I'll find out at R10" / "the newest version is surely drop-in" / "each layer works alone, so the stack is fine" / "the skill says so, it's true" / "the code does X, so it's safe" / "I'll add a guard to be safe" (without proving the danger is real) / "let me bed-test what the skill already says" (for a LOW-risk item). Each either defers proving the riskiest unknown to the most expensive moment, or wastes a bed on a settled one.
What is permitted / required. Building the real composition, running the bed, inspecting the running deployment, reading the emitted artifact, or a focused ov eval image / ov eval live probe — riskiest-assumption first — to PROVE a high-risk claim before editing on it. When the bed contradicts a skill or CLAUDE.md, the DOC IS STALE — fix it in the same change (skills are living documents).
Why it matters. The most expensive bug is an edit built on a false premise validated only at the end — the cutover is large, the dependent edits many, and the disproof arrives after hours of work. RDD validates the premise when disproving it is cheap. It also kills over-engineering: a "guard to be safe" added without a proven need is an unvalidated CAUTION — the over-engineering twin of an unvalidated assumption.
Interaction with other rules. RDD precedes R1: validate forward so RCA is rarely needed, and so the RCA that does run reasons from a real bed, not a guess. It feeds R10 (/ov-eval:eval): the riskiest assumptions are proven on a disposable bed early, so the final fresh-rebuild gate confirms a design already de-risked. Canonical definition in CLAUDE.md "Risk Driven Development (RDD)". Its enforcement surfaces are testing-validator standard #9, the root-cause-analyzer forbidden-rationalization, and the soft End-of-turn / post-execution checklists — never a blocking gate, because "highest-risk, validated early" is a judgment, not a mechanical invariant.
The rule. Every issue surfaced during the active cutover — failing test, validator warning, runtime crash, deprecated-marker hit, dead-code reference, stale doc paragraph — is fixed in the SAME working tree as the cutover, OR escalated to the operator for explicit re-scoping. The classifications "pre-existing", "unrelated to this change", "out of scope", "follow-up PR", "tracked separately", "we'll get to it later" are FORBIDDEN.
Forbidden phrasings. "this was already failing before my change" / "unrelated to this PR" / "out of scope for this cutover" / "I'll file a follow-up issue" / "this is tracked separately" / "we can address that later" / "noted but not required for this commit" / "intentionally deferred to keep the diff focused".
What is permitted — classify BLOCKING vs NON-BLOCKING first. A blocking issue (the current change is incorrect, incomplete, or unsafe without it) is, by DEFAULT, fixed in the same working tree — the AI does not ask permission to fix what it found:
A non-blocking issue (the current change is correct AND complete without it, and it is genuinely separable from this change) takes a third path:
main, and no indefinite parking.
The discriminator: would shipping the current cutover WITHOUT this fix leave the tree correct and the cutover's claim true? Yes → non-blocking (own immediate-next cutover); No → blocking (paths 1–2); unsure → blocking. Objective test for "separable": the current cutover's OWN R10 (eval-coverage + fresh-rebuild) passes and proves its claim WITHOUT the fix: the fix is neither exercised by nor changes the verdict of this cutover's test coverage; a fix that would alter this cutover's R10 result or eval-coverage gate is BLOCKING. Mislabeling a blocking issue "non-blocking" to ship faster (or carving the current change's OWN scope into two cutovers) is the forbidden split; a genuinely separate concern getting its own cutover is not.
Why the escape hatch is closed. Deferring a surfaced test failure as "pre-existing, unrelated" leaves brokenness on main for the window between the deferral and the eventual fix, when fixing in place would have cost a few lines and 30 seconds of attention. R2 closes this escape hatch absolutely. (See CHANGELOG.md for the incident that motivated this.)
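The blocking / non-blocking discriminator above can be sketched as a small decision function. This is only an illustration: `Classification` and `classify_issue` are hypothetical names, and the single input stands for the human judgment "would the cutover be correct and its claim true without this fix?", with `None` meaning "unsure".

```python
from enum import Enum
from typing import Optional


class Classification(Enum):
    BLOCKING = "blocking"          # fix in the same working tree, or escalate
    NON_BLOCKING = "non-blocking"  # gets its own immediate-next cutover


def classify_issue(cutover_correct_without_fix: Optional[bool]) -> Classification:
    """Apply the R2 discriminator.

    True  -> non-blocking
    False -> blocking
    None  -> unsure, which the rule treats as blocking
    """
    if cutover_correct_without_fix is True:
        return Classification.NON_BLOCKING
    # Both an explicit "no" and "unsure" land here on purpose.
    return Classification.BLOCKING
```

The deliberate asymmetry is that only an explicit "yes" escapes the blocking path; anything weaker defaults to blocking.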
Why it matters. "Pre-existing" is the most common AI-agent escape hatch — it lets the agent claim the work is complete while leaving brokenness in the tree. Every "pre-existing" deferral compounds: the next cutover finds two pre-existing issues, the third finds three, and the codebase entropy grows monotonically. R2 forces the agent to either pay the small fix-cost now OR escalate to a human; it removes the silent third option of "leave it".
Interaction with other rules. R2 covers both approved-plan phasing AND incidentally-surfaced issues mid-session: no deferral, whether planned or incidental.
The rule. On the FIRST surface where the same pattern, predicate, filter, transform, or guard appears in two places, refactor to ONE shared abstraction in the SAME working tree. Every fix MUST apply cleanly to ALL surfaces it logically covers, not just the surface that prompted the report.
Forbidden patterns. Sibling-layer naming (`<name>-host`, `<name>-pod`, `<name>-bootc` when they share content); parallel filter functions in adjacent files; per-call-site re-implementations of the same predicate; copy-pasted YAML stanzas across multiple layer.yml files; copy-pasted Go function bodies with one-token differences; "let me just patch this one consumer for now and unify later"; "the abstraction is unclear, so I'll duplicate for now and refactor when the pattern firms up".
What is permitted. ONE shared abstraction, in the obvious shared location. If the abstraction is unclear, the answer is to think harder, ask the operator, or escalate — not duplicate. If the cost of the refactor is high enough to warrant a discussion, escalate; do not silently duplicate.
Why it bites. Sibling-layer duplication (a `<name>-host` spawned for every host-vs-container difference instead of extending the one layer with init-system-aware logic) crystallizes into divergent surfaces that drift in their package lists, eval probes, and service definitions, and the eventual unification deletes far more than the original duplicate. The canonical fix is ONE compile-time filter, not a per-call-site band-aid: when the same predicate appears on N targets, it collapses to one shared filter, applied to all N in the same commit. (See CHANGELOG.md for the worked examples that motivated this.)
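A minimal sketch of the "one shared filter, applied to all N" shape described above. The target names, the `only` field, and the entry names are assumptions made up for illustration; they are not the project's actual schema or code.

```python
from typing import Dict, Iterable, List

# Hypothetical targets and entry schema, for illustration only.
TARGETS = ("host", "pod", "bootc")

ENTRIES = [
    {"name": "sshd"},                                    # applies everywhere
    {"name": "systemd-timer", "only": ["host", "bootc"]},
]


def applies_to(entry: dict, target: str) -> bool:
    """The ONE shared predicate: an entry applies to a target unless it
    declares an explicit `only` list that excludes that target."""
    only = entry.get("only")
    return only is None or target in only


def entries_for(entries: Iterable[dict], target: str) -> List[str]:
    """Every target resolves its entries through the same predicate."""
    return [e["name"] for e in entries if applies_to(e, target)]


# All targets go through the same code path; no per-target copy exists.
resolved: Dict[str, List[str]] = {t: entries_for(ENTRIES, t) for t in TARGETS}
```

Because every target calls `applies_to`, a fix to the predicate reaches all of them in one change, which is the property R3 asks for.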
Why it matters. Duplication has compounding cost. Two divergent copies become three, then four, then eight; each copy hides bugs the others fixed. The cost of the unification grows superlinearly with the number of copies. R3 enforces unification at copy-count = 2 — the cheapest possible moment.
Interaction with other rules. R3 is paired with the architectural-philosophy framing in CLAUDE.md "Prioritize Clean Architecture Above All Else" — which now contains three labeled sub-paragraphs (No duplication on first surface, Generic over ad-hoc, No workarounds) that mirror R3 + R4 from the architectural angle. Both framings are binding.
The rule. Sleep loops, retry-on-flake harnesses, magic-number tuning, hardcoded paths chosen because "the standard one was busy", environment-specific shims, and "works on my machine" fixes are FORBIDDEN. Every fix must apply cleanly across every supported environment.
Forbidden patterns. `sleep 5; retry` (race-condition cover-up); `for i in 1..3 do try; done` (retry loops disguising flake); hardcoded ports chosen because "8080 was busy" (port-allocation bug disguised as a config); environment-specific paths like `/Users/$USER/...` in shipped code; default-fallbacks that hide a missing config (silent fallback to a wrong value); "this is what worked when I tried it locally" (single-environment validation).
What is permitted instead.
| Forbidden pattern | Authorized replacement |
|---|---|
| sleep N; check | Synchronization primitive: file lock, readiness probe, condition variable, deterministic ordering |
| Retry on flake | Identify the race; fix the race; remove the retry |
| Magic number | Named constant, sourced from config, validated on load |
| Hardcoded port | Port allocation from a registry, or service-discovery lookup |
| Environment-specific path | Standard XDG/FHS path resolved at startup |
| "Works on my machine" fix | Cross-environment validation before the fix ships |
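As one illustration of the first table row, the sketch below replaces a `sleep N; check` with a readiness signal. It uses Python's standard `threading.Event`; the service, the function names, and the timeout value are hypothetical stand-ins rather than anything from the project.

```python
import threading

# Named constant instead of a magic number. In a real system this would be
# sourced from config and validated on load (assumption for illustration).
STARTUP_TIMEOUT_S = 5.0


def start_service(ready: threading.Event) -> threading.Thread:
    """Hypothetical service: performs startup, then signals readiness."""
    def run() -> None:
        # ... startup work would happen here ...
        ready.set()  # deterministic "I am ready" signal

    worker = threading.Thread(target=run, daemon=True)
    worker.start()
    return worker


def wait_until_ready(ready: threading.Event, timeout_s: float) -> bool:
    """Block on the readiness event instead of sleeping a guessed interval.

    The timeout is an upper bound for declaring failure, not a tuning knob:
    a False result is a failure to investigate, not a cue to retry.
    """
    return ready.wait(timeout=timeout_s)


ready = threading.Event()
worker = start_service(ready)
ok = wait_until_ready(ready, STARTUP_TIMEOUT_S)
worker.join()
```

The waiter proceeds as soon as the service signals, so there is no interval to tune and no environment in which the guess is too short.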
The rule is preventive. R4 exists to forbid the patterns BEFORE they crystallize. Each forbidden pattern is the kind of "quick fix" that, once accepted, becomes tribal knowledge ("oh, that test always needs a sleep; that's just how it is"). R4 closes the door before the tribe forms.
Why it matters. "Temporary" fixes never get removed. Every sleep loop in the codebase was added with a "this is just for now" justification that was never revisited. R4 forbids the pattern at addition time, before the temporary becomes permanent.
Interaction with other rules. R4 is paired with R3 in the architectural-philosophy sub-paragraph "No workarounds" in the "Prioritize Clean Architecture" section. R4 violations also typically violate R1 — the workaround is an attempt to dodge the failure rather than RCA it.
The rule. When a cutover introduces a replacement, the SAME commit deletes (a) the deprecated code path, (b) every comment / TODO / DEPRECATED marker referencing the old path, AND (c) every reference, comment, docstring, error message, skill paragraph, migration help-text, test fixture, or hook string naming a deleted identifier. After commit, `git grep '<deleted-id>'` returns ONLY historical mentions in CHANGELOG.md or migration help-text.
The acceptance test. A cutover is not "clean" until you can run `git grep '<deleted-id>'` and have ZERO live (non-historical) hits. Historical hits in CHANGELOG.md and migration command help-text that names the legacy form for the user's benefit are permitted: these intentionally preserve the old name to help users migrate. Everything else is a violation.
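The acceptance test above could be sketched as a partition of `git grep -n` output into historical and live hits. This is an assumption-laden illustration: the path allowlist is a simplification (the migration help-text path is a made-up placeholder), and the sample lines are invented, not real repository output.

```python
from typing import Iterable, List, Tuple

# Assumption: paths whose hits count as historical. CHANGELOG.md comes from
# the rule itself; the migration help-text path is a hypothetical placeholder.
HISTORICAL_PATHS = ("CHANGELOG.md", "cmd/migrate_help.txt")


def split_hits(grep_lines: Iterable[str]) -> Tuple[List[str], List[str]]:
    """Partition `git grep -n` lines ("path:line:text") into
    (historical, live) according to the path allowlist."""
    historical: List[str] = []
    live: List[str] = []
    for line in grep_lines:
        path = line.split(":", 1)[0]
        (historical if path in HISTORICAL_PATHS else live).append(line)
    return historical, live


# Invented sample output for a deleted identifier `qc`.
sample = [
    "CHANGELOG.md:12:renamed qc to ov-cachyos",
    "deploy.yml:4:image: qc",
]
historical, live = split_hits(sample)
cutover_is_clean = not live  # clean only when there are zero live hits
```

In this sample the `deploy.yml` hit is live, so the cutover would not count as clean until that reference is swept in the same commit.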
The regression classes R5 prevents. Two motivate the rule. First, a silent-skip regression: deleting the old artifact (e.g. image.yml) while the replacement path quietly drops a stage it used to wire produces an artifact that builds but misbehaves at runtime — which is why the acceptance test is "rebuild from the new config, run the resulting image, observe the service reach steady-state", not just "it compiles". Second, stale references: a rename that doesn't sweep every mention in the same commit leaves a code search returning matches that imply the retired thing is still live. (See CHANGELOG.md for the incidents that motivated this.)
What is permitted in historical contexts. CHANGELOG.md entries and migration-command help-text that names the legacy form for the user's benefit ("rename qc to ov-cachyos"). The grep self-test distinguishes these via context.
Why it matters. Stale references confuse new contributors and AI agents. A code search for qc that returns matches in deploy.yml suggests the deployment is still live; a search that returns matches only in CHANGELOG.md suggests it was retired. R5's grep self-test enforces this distinction.
Interaction with other rules. R5 covers stale references everywhere, not just the deleted artifact itself. R5 is the cleanup discipline that R3 enables — once you've refactored to the unified abstraction, R5 ensures every old reference points to the new one.
R1–R5 are authoring discipline — what you do during the work.
R6–R9 are artifact discipline — what the produced artifact must be.
R10 is live-system discipline — what the deployed-and-running system must do.
The three layers compose. A cutover that violates R3 (duplication) but passes R7 (end-to-end gate) is still a violation — the duplication is a future bug. A cutover that passes R3 but fails R10 (fresh-rebuild verification) is still a violation — the artifact failed live. The three layers are AND-gated; all must pass.
Per CLAUDE.md "AI Attribution" section: a violation at any layer FORBIDS commit. The four-tier table describes the proof level the agent has when committing IS permitted; a known violation means committing is NOT permitted, regardless of tier. The agent fixes the violation or escalates to the operator; it never downgrades the tier and ships anyway.
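The AND-gating of the three layers can be pictured as a single boolean conjunction. The sketch below is illustrative only; `GateResult` and `may_commit` are hypothetical names, and each field stands for the outcome of that layer's full set of checks.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class GateResult:
    authoring_ok: bool    # R1-R5: authoring discipline
    artifact_ok: bool     # R6-R9: artifact discipline
    live_system_ok: bool  # R10: live-system discipline


def may_commit(result: GateResult) -> bool:
    """All three layers must pass; there is no partial-credit path."""
    return result.authoring_ok and result.artifact_ok and result.live_system_ok
```

For example, `may_commit(GateResult(False, True, True))` is `False`: passing the end-to-end gate does not offset an authoring violation such as duplication.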
- `/ov-internals:cutover-policy`: the operationalization of R5 for schema/API/rename changes specifically. Cutover-policy is one of strict-policy's children: R5 governs all hard-cutover behavior, and cutover-policy operationalizes it for the most common case.
- `/ov-internals:root-cause-analyzer` agent: the R1 mandatory-invocation target. The agent's 8-step process is the only authorized first response to a failure.
- `/ov-internals:disposable`: R10's verification target. Strict-policy R5 (cutover) cooperates with R10 (verification) to ensure the post-rename state is both clean and live.
- `/ov-internals:skills`: the meta-skill for skill maintenance. R5's stale-reference sweep includes skill paragraphs; the skills meta-skill has a "When to Update Skills" row dedicated to R5 self-test failures.

MUST be invoked when:

- a failure, warning, or anomaly surfaces;
- the same pattern is about to land in a second surface;
- a sleep, retry, or magic number is tempting;
- a cutover commit is about to ship.

The skill is also a useful read before starting a non-trivial cutover — both as a refresher and as a way to internalize the forbidden-internal-voice triggers in each rule's "Forbidden patterns" / "Forbidden phrasings" lists.