Dev Workflow

Usage

/dev-workflow --init                             # Project setup (detect check/test commands)
/dev-workflow [-i N | --iterations N] <task>    # Execute workflow (default)
/dev-workflow --resume <state-file> [-i N]      # Resume next subtask from a decomposition state file

Prerequisites

Reviewer skill (reviewer setting, default: ask-peer): Required for plan/code review. Supported: ask-peer, ask-claude, ask-codex, ask-gemini, ask-copilot, ask-agy. If a Skill() call for the configured reviewer fails, attempt once more before declaring unavailable. If still unavailable, present the user with three explicit fallback options, each with its own resume semantics: (a) switch to another supported reviewer from the list — re-invoke the current review step with the new reviewer immediately (the original reviewer is not retried); (b) self-review — perform the review inline and advance past the current step (no later retry of the original reviewer); (c) pause at the current gate until the skill is installed — name the specific step where the original reviewer call will be retried once the skill is available. Do not silently advance past a review pass without the user knowing their options.
rules-review skill: Required for rules compliance review (Step 7.5). If a Skill(rules-review) call fails, attempt once more before declaring unavailable. If still unavailable: skip Step 7.5 with a message that names the fallback (Step 8 reviewer as a lightweight backup) and the resume point (re-run rules-review manually after the session or re-run the workflow once the skill is installed).
extract-rules skill: Required for rule update. If a Skill(extract-rules) call fails, attempt once more before declaring unavailable. If still unavailable: skip Step 11 with a message that names the fallback (no rule updates this run) and the resume point (invoke extract-rules manually after the session to capture rule changes).
Cleanup skill (Step 6 Tidy): The Step 6 cleanup pass prefers the built-in simplify skill. Invoke Skill(simplify); if the call fails (skill-not-found or equivalent — a Claude Code version that lacks the built-in simplify), attempt once more, then emit a one-line note naming the fallback (e.g. simplify unavailable — falling back to in-house tidy) and fall back to the bundled Skill(tidy). "Available" is defined by the observable call outcome (a successful call), not by introspecting the in-context skill list — this mirrors the reviewer / rules-review / extract-rules bullets above so the orchestrator follows it deterministically. This bullet is the single source of truth for the simplify→tidy resolution; Step 6 references it rather than restating the definition. The fallback proceeds without a user gate (unlike the reviewer bullet's three-option prompt): tidy is a functionally-equivalent cleanup pass, so swapping it for simplify does not change outcomes materially enough to warrant a user decision — whereas a reviewer swap changes review quality and so warrants one. After simplify (or the tidy fallback) returns, judge the result semantically and proceed per § No-Stall Principle.
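
The simplify→tidy resolution above (call, retry once, note the fallback, then call tidy) can be sketched as follows. This is a minimal illustration, not the workflow's actual code: `call_skill` is a hypothetical stand-in for a Skill(<name>) invocation that returns True on a successful call, and raising an error when tidy also fails is an assumption, since the bullet does not define that case.

```python
from typing import Callable


def resolve_cleanup_skill(call_skill: Callable[[str], bool]) -> tuple[str, list[str]]:
    """Return the cleanup skill that succeeded and any one-line notes emitted.

    call_skill is hypothetical: it stands in for Skill(<name>) and reports
    availability purely by the observable call outcome (True = success).
    """
    notes: list[str] = []
    for _attempt in range(2):  # the first call plus exactly one retry
        if call_skill("simplify"):
            return "simplify", notes
    # No user gate here: tidy is treated as a functionally equivalent pass.
    notes.append("simplify unavailable; falling back to in-house tidy")
    if call_skill("tidy"):
        return "tidy", notes
    # Assumption: a failing bundled tidy is handled as a fatal skill failure.
    raise RuntimeError("neither simplify nor tidy could be invoked")
```

Deciding availability from the call result rather than from a skill listing keeps the orchestrator's branch deterministic, which is the property the bullet asks for.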

Configuration

Settings files (YAML frontmatter only, merged across layers):

~/.claude/dev-workflow.local.md — User global defaults (lowest priority)
.claude/dev-workflow.md — Project shared settings (git tracked, team-shared)
.claude/dev-workflow.local.md — Personal overrides (gitignored, highest priority)

Merge strategy per key type (summary — the canonical operational definition, including the null/empty-clears and absent-inherits rules, is the Step 1: Load Settings "Overlay" procedure; keep the two in sync):

Scalar (reviewer, review_iterations, subagent_model, task_decomposition, interactive_commits, compact_rules, custom_instructions, language): higher layer wins (replaces) when the key is present; a key absent from a higher layer inherits from lower layers (see the inherit note below). When review_iterations carries a map value ({plan, code}) it is still a scalar key here — a higher layer's value replaces the lower layer's wholesale, with no per-key cross-layer merge (an absent map key is not back-filled from a lower layer; it falls to default 3 at resolution time). The subagent_model map ({<tier>: <model>}) is the same scalar/map class — a higher layer's map replaces the lower layer's wholesale (no per-key cross-layer merge), and an absent tier key falls to its built-in per-tier default at resolution time (sonnet for trivial / simple, inherit for moderate / complex)
List (check_commands): append — lower-layer items first, then higher-layer items, duplicates removed (keep first occurrence)
List-replace (test_commands): higher layer's list replaces lower layer's list as a whole (no item-level merge or dedup). Defaults to ["Skill(run-tests)"] when unset
hooks: deep-merge at the hooks level — each sub-key (on_complete) is merged as a list (append, deduplicated)

Keys absent from a higher layer inherit from lower layers. Only specify keys you want to override or extend.
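
As a rough illustration of the merge strategy above, the per-key-type rules could be written like this. It is a sketch under stated assumptions: the function and constant names are hypothetical, a cleared key is modeled by removing it from the merged result (so the default applies later), and any key not named in the strategy (for example self_retrospective) is assumed to take the replace branch.

```python
from typing import Any

LIST_APPEND_KEYS = {"check_commands"}  # append + dedup across layers


def _append_dedup(base: list, extra: list) -> list:
    """Append extra after base, keeping the first occurrence of each item."""
    out: list = []
    for item in [*base, *extra]:
        if item not in out:
            out.append(item)
    return out


def _is_clear(value: Any) -> bool:
    """null or an empty list/map explicitly clears the key."""
    return value is None or value == [] or value == {}


def overlay(merged: dict, layer: dict) -> dict:
    """Overlay one layer's frontmatter onto the settings merged so far."""
    result = dict(merged)
    for key, value in layer.items():  # absent keys are never visited: they inherit
        if _is_clear(value):
            result.pop(key, None)  # assumption: cleared means "unset"
        elif key in LIST_APPEND_KEYS:
            result[key] = _append_dedup(result.get(key, []), value)
        elif key == "hooks":
            hooks = dict(result.get("hooks", {}))
            for sub_key, entries in value.items():
                hooks[sub_key] = _append_dedup(hooks.get(sub_key, []), entries)
            result["hooks"] = hooks
        else:
            # Scalars, whole maps (review_iterations, subagent_model),
            # and list-replace keys (test_commands) are replaced wholesale.
            result[key] = value
    return result
```

Applying `overlay` three times, lowest-priority layer first, reproduces the layering order listed under Settings files.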

---
reviewer: "ask-peer"
review_iterations: 3
subagent_model:
  trivial: sonnet
  simple: sonnet
task_decomposition: true
interactive_commits: true
compact_rules: false
custom_instructions: "Always use TDD. Write tests before implementation."
language: "ja"
check_commands:
  - "pnpm run lint:fix"
  - "pnpm run format"
  - "pnpm run typecheck"
test_commands:
  - "Skill(run-tests)"
hooks:
  on_complete:
    - "Skill(work-complete)"
self_retrospective:
  feedback: "owner/repo"        # or "/abs/path", "~/rel", "./rel"
---

reviewer: Reviewer skill name (default: ask-peer). Choose from: ask-peer, ask-claude, ask-codex, ask-gemini, ask-copilot, ask-agy. Unsupported values fall back to ask-peer
review_iterations: Max iterations for Plan Review (Step 3) and Code Review (Step 8) (default: 3). Two accepted forms: (i) a scalar positive integer applies the same cap to both phases (e.g. review_iterations: 2); (ii) a map {plan: N, code: M} sets the Plan Review cap (Step 3) and the Code Review cap (Step 8) independently (e.g. {plan: 1, code: 3}). Each map key must be a positive integer; an absent or invalid key falls back to default 3 for that phase only (per-key validation — see Step 1 sub-step 4). Adopting the map form is the user's explicit opt-in; the scalar form and an absent key are fully backward-compatible (both phases default to 3). Can be overridden per invocation with -i N / --iterations N, which overrides both phases with the same value regardless of the config form
subagent_model: Optional. A map from difficulty tier (trivial / simple / moderate / complex) to the model the workflow's Agent-tool subagent dispatches run on — sonnet / opus / haiku, or inherit (use the session model). It governs (i) the workflow's direct Agent dispatches (Step 7's two background launches, Step 11.5) and (ii) the model propagated via the Model: argument to the named callees the workflow dispatches (Step 7.5 rules-review; the Step 3 / Step 8 inline reviewer when the resolved reviewer is Claude-family — see Step 1's reviewer-family classification; external-CLI reviewers are not affected). Resolved once in Step 2 from the assessed tier (see Step 2's Adjust N for the resolution chain and the -i-path handling). Built-in default = {trivial: sonnet, simple: sonnet} (moderate / complex inherit). Behavior change: under this default, Trivial and Simple tasks run their subagent dispatches on sonnet instead of the session model. To opt out (restore all-inherit on the low tiers), set subagent_model: {trivial: inherit, simple: inherit} in .claude/dev-workflow.md or ~/.claude/dev-workflow.local.md. Invalid values / unknown tier keys warn and fall back to the built-in per-tier default. hooks.on_complete skill entries' models are independent of this key — the workflow never passes Model: to hooks.on_complete callee skills; each callee's model is set skill-side and is unaffected by subagent_model. Per-subagent effort is out of scope — the Agent tool exposes only model.
task_decomposition: Whether Step 1.5 runs the auto-decomposition check in Normal sub-mode (default: true). Set to false to treat Normal sub-mode requests (/dev-workflow <task>) as single tasks — Step 1.5 is omitted from the task list and the decomposition judgment is skipped entirely. --resume <state-file> is unaffected and still executes existing state files. Non-boolean values fall back to true with a warning
interactive_commits: Whether Step 10 (Interactive Commits) runs after hooks.on_complete (default: true). When true, after Step 9 (Completion Hooks) the workflow proposes commit groupings and messages, then iterates per-commit with the user. When false, Step 10 is omitted from the task list and never executes — the workflow ends with an uncommitted tree as before. Non-boolean values fall back to true with a warning. To opt out, set interactive_commits: false in .claude/dev-workflow.md or ~/.claude/dev-workflow.local.md
compact_rules: Whether Step 11 sub-step 3 (Char-count compaction gate) runs (default: false). The compaction mode added in v1.38.0 is currently experimental — when false (the default), sub-step 3 is skipped entirely: Skill(extract-rules) --compact is never invoked, the gate is never opened, and compaction_applied_count / below_threshold_failed_files stay at their initial values so § Completion's compaction reminder is automatically omitted. When true, the workflow invokes Skill(extract-rules) --compact and may enter the Step 11 compaction approval gate (USER APPROVAL GATE). Non-boolean values fall back to false with a warning. To opt in for a specific project, set compact_rules: true in .claude/dev-workflow.md or .claude/dev-workflow.local.md
custom_instructions: Free-form development instructions applied as guiding principles across planning, implementation, review, and tidy phases (e.g., "Always use TDD", "Prefer functional style"). Optional. .claude/rules/ and explicit user requests take precedence if they conflict
language: Optional. Output language code (e.g. ja, en) for user-facing prose produced by this skill — Step 4 plan body (Overview / Decisions / Design / Test plan / Risks / Unknowns content), user-gate preambles (Step 4 / Step 7.5 / Step 8), Step 2 difficulty-assessment log, Step 10 commit-plan / per-commit gate output (subjects, body, diff blocks framed in the resolved language; verbatim git output and file paths remain English), Completion summary, and Step 11.5 finding Description / Suggested fix direction paragraphs. Resolution: merged skill config → Claude Code settings (~/.claude/settings.json → language field) → default ja. null / empty string / non-string values fall through to the next resolution step. For the localization boundary between translated concepts and verbatim identifiers, see references/plan-format.md § Localization granularity. See references/self-retrospective.md §2.1 Language handling / §5 Contract note for the Step 11.5 scope contract. No Step 11.5 output unless self_retrospective.feedback is also configured
check_commands: Static checks (lint, format, typecheck, etc.). Always run all in order
test_commands: Defaults to ["Skill(run-tests)"]. Each entry must be a Skill(<name>) string (no shell commands). Entries run sequentially during Step 7. Run --init to generate or update run-tests; additional structural-check skills can be appended in project config (e.g. for bundle-sync drift detection, custom marketplace structure validators, or other repository-specific checks)
hooks: Execute skills/commands at specific workflow timing points
- on_complete: Runs as Step 9 (immediately after Step 8 Code Review). Entry format: Skill(<name>) or shell command string
- Entries not covered by allowed-tools require user approval
self_retrospective: Optional. Emits sanitized improvement signal for the dev-workflow-bundle skills (dev-workflow, ask-peer, extract-rules, rules-review) at Step 11.5 (between Step 11 and Completion). Raw conversation stays in-session; only abstracted text leaves
- feedback: Destination string. Auto-detected:
  - Starts with /, ~/, ./, or ../ → local directory path → retrospective written as a markdown file under that directory
  - Matches ^[\w.-]+/[\w.-]+$ → GitHub owner/repo → retrospective submitted via gh api POST to /repos/<feedback>/issues
  - Any other string (including empty) → warn and skip Step 11.5
- If feedback is unset, Step 11.5 is not registered as a task and never executes — the workflow behaves as before
- Step 11.5 runs whenever self_retrospective.feedback is configured, regardless of the Step 2 difficulty assessment — difficulty gates the review-iteration counts N_plan / N_code (Step 3 / Step 8) and the difficulty-skip matrix (Step 6 Tidy / Step 7.5 Rules Compliance on Trivial / Simple), but not the self-retrospective. Even Simple / Trivial tasks emit a retrospective when feedback is set; when nothing notable surfaced, the retrospective is simply short
- Agent tool usage: Exactly two steps directly spawn subagents via the Agent tool — Step 11.5 (for jsonl scan + sanitization) and Step 7's two concurrent background launches, the per-pass rules-review launch and the per-pass code review (run_in_background dispatches for test-phase overlap; see Step 7's "Concurrent rules-review launch" and "Concurrent code review launch" paragraphs for why Agent rather than Skill()). This is two steps / three dispatch sites (Step 7 carries two launches, Step 11.5 one); each of the three dispatch sites passes the Step 2-resolved subagent_model as the Agent model parameter (omitted when the resolution is inherit). All other steps delegate to named skills (Skill(ask-peer), Skill(run-tests), Skill(rules-review), Skill(simplify) / Skill(tidy), etc.) and must not invoke Agent directly. (The Step 3 / Step 8 inline reviewer's subagent_model propagation rides the named Skill(<reviewer>) call's Model: argument — it is not a direct Agent spawn and does not count against the "two steps" above.)
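
The self_retrospective.feedback destination detection listed above can be sketched as a small classifier. The function name and the returned labels are hypothetical; the prefixes and the pattern are taken from the bullets. Note that the path prefixes must be tested first, because a value such as ./rel would otherwise also satisfy the owner/repo pattern.

```python
import re

# Pattern copied from the feedback bullet; fullmatch avoids matching a
# value that carries a trailing newline.
REPO_PATTERN = re.compile(r"^[\w.-]+/[\w.-]+$")
PATH_PREFIXES = ("/", "~/", "./", "../")


def classify_feedback(feedback: object) -> str:
    """Return 'local_dir', 'github_repo', or 'skip' (warn and skip Step 11.5)."""
    if not isinstance(feedback, str) or feedback == "":
        return "skip"
    if feedback.startswith(PATH_PREFIXES):  # checked before the repo pattern
        return "local_dir"
    if REPO_PATTERN.fullmatch(feedback):
        return "github_repo"
    return "skip"
```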

Mode Detection

--init → Init Mode (-i / --iterations is ignored)
--resume <state-file> → Execution Mode (Resume sub-mode; see Step 1.5)
Otherwise → Execution Mode (Normal sub-mode)
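
The mode rules above can be illustrated with a small argument classifier. This is only a sketch: how the skill actually tokenizes its arguments is not specified here, so the pre-split `argv` list, the function name, and the returned dictionary shape are all assumptions.

```python
from typing import Optional


def detect_mode(argv: list[str]) -> dict:
    """Classify an invocation into Init Mode or an Execution sub-mode."""
    init = False
    state_file: Optional[str] = None
    iterations: Optional[int] = None
    task_words: list[str] = []
    i = 0
    while i < len(argv):
        arg = argv[i]
        if arg == "--init":
            init = True
        elif arg in ("-i", "--iterations") and i + 1 < len(argv):
            value = argv[i + 1]
            # Only a positive integer counts; anything else is left unset.
            if value.isascii() and value.isdigit() and int(value) > 0:
                iterations = int(value)
            i += 1
        elif arg == "--resume" and i + 1 < len(argv):
            state_file = argv[i + 1]
            i += 1
        else:
            task_words.append(arg)
        i += 1
    if init:
        return {"mode": "init"}  # -i / --iterations is ignored in Init Mode
    if state_file is not None:
        return {"mode": "execution", "sub_mode": "resume",
                "state_file": state_file, "iterations": iterations}
    return {"mode": "execution", "sub_mode": "normal",
            "task": " ".join(task_words), "iterations": iterations}
```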

Init Mode

Read references/init-mode.md and follow the procedure.

Note: Skills generated by --init (e.g. run-tests) are recognized from the next session onward. Do not run /dev-workflow <task> in the same session as --init.

Execution Mode

No-Stall Principle

Once the workflow has started (after Step 1.5 resolves the effective task), it must run to Completion without pausing, except at the explicit user-gate points enumerated below. Every other step — including every skill invocation, every no-op outcome, every "nothing to report" result — must be judged semantically by the agent and passed through automatically. Do not rely on exact-phrase matching; if the skill result reads as a successful completion (fixes applied, no changes needed, no violations, no new rules, or any equivalent "success / no-op" outcome regardless of wording), treat it as success and proceed to the next step.

Explicit user-gates (the only permissible pause points):

Each bullet names the gate and points to the authoritative definition site. When editing either the enumeration or the definition, update both together.

Step 1.5 task-decomposition proposal dialogue — yes / adjust / no confirmation (Normal sub-mode; defined in Step 1.5 dispatch and references/task-decomposition.md § B. Normal sub-mode)
Step 1.5 leftover-subtask picker dialogue — selecting which subtask to run when more than one leftover in_progress subtask is runnable (Resume sub-mode; defined in references/task-decomposition.md § A. Resume sub-mode)
Step 4 plan approval (defined in Step 4: Finalize Plan)
Step 5 probe → real-implementation user-observation gate — when the Plan explicitly stages a probe / intermediate-artifact step before its real-implementation replacement: hold the workflow at the boundary until the user signals observation completion (defined in Step 5's "User-observable artifact protection gate" paragraph). Fires conditionally per the Plan's content — non-probe-staged plans never enter this gate
Step 7 pre-execution scope-narrowing stop — when a check_commands entry is assessed as a repo-wide auto-fix tool, the working tree has unrelated existing changes, and scope narrowing is not feasible given the tool's interface: stop and ask the user for direction (options: run accepting full-width effect, skip, or provide an alternative scoped invocation) (defined in Step 7: Check / Test)
Step 7 scope-drift stop — when check_commands writes non-trivial changes outside the task-scope snapshot (trivial = whitespace-or-comment-only formatting on ≤ 5 lines attributable to the formatter/linter that just ran — those proceed automatically with a one-line note): warn and wait for user direction (defined in Step 7: Check / Test)
Step 7 check/test fail-stop — failure after 3 retries: report the error and stop (defined in Step 7: Check / Test). Note: this is an error-stop, not a pause for user decision
Step 7.5 persistent-violations decision — rule violations still present after the 2nd review cycle (defined in Step 7.5: Rules Compliance Review)
Step 8 unresolved-findings decision — reviewer-reported actionable findings still unresolved after the N_code-th iteration (defined in Step 8: Code Review)
Step 10 commit-plan approval gate — accept the proposed commit grouping (subjects + file lists) for the working-tree changes; fires once on the initial plan and re-fires whenever a Mid-loop adjust file-regrouping / split-adding branch rebuilds the un-landed portion of the plan (defined in references/interactive-commits.md § Propose commit plan)
Step 10 per-commit accept gate — accept each individual commit (subject / body / files / diff) before it lands; repeats N times where N is the approved commit count (defined in references/interactive-commits.md § Per-commit loop, judged per § Approval token closed list inside Step 10)
Step 10 fold-or-defer gate — after a pre-commit hook auto-modifies the working tree following a zero-exit commit, ask the user whether to amend the just-landed commit (fold) or leave the changes uncommitted for a later iteration (defer); judged per the dedicated 5-branch → fold / defer / cancel / re-present-as-adjust classifier in references/interactive-commits.md § Post-commit auto-modify cycle bound (the 5 input branches extend § Approval token closed list's 4 buckets with an additional defer-direction branch; this gate is not the per-commit-accept-gate enum — cancel routes via Mid-loop cancel and ambiguous adjust responses re-enter the gate via § Mid-loop adjust branch f, both in the same reference)
Step 10 ambiguous-adjust clarifier — when a Mid-loop adjust request cannot be classified into branches a–e, ask the user a clarifying question and re-enter the gate that issued the request — this gate is itself the disposition for branch f of Mid-loop adjust — closed-list branches (in references/interactive-commits.md; categorization vocabulary depends on which gate originated the request)
Step 11 compaction approval gate — when Skill(extract-rules) --compact returns top-level status: "compacted", present per-file diff (chars_before / chars_after / iterations_used / applied_edits_count / structural_notes / per_file_status / below_threshold) per § User-gate summary preamble and wait for accept/reject/adjust/cancel per the Step 11 local closed list (defined in references/update-rules.md § Char-count compaction gate). cancel aligns with Step 10's Mid-loop cancel semantic (no revert; see references/interactive-commits.md § Mid-loop cancel); adjust uses Step 11's own three-case closed list (per-file disposition / clarification / other), not Step 10's branch f
Completion execution-time deferral/exclusion gate — when executing a decomposed subtask, if in-scope work items were excluded / deferred / discovered-unassigned during implementation or testing, ask the user to promote each uncovered item to a tracked subtask entry: (a) add as a new pending subtask (with depends_on if sequencing matters), (b) fold into an existing pending subtask's scope, or (c) explicitly accept as permanently out of parent-task scope (defined in Completion's "Execution-time deferral/exclusion gate" paragraph). Fires conditionally — only on decomposed-subtask runs that surfaced uncovered items
Completion subtask PR URL prompt — when executing a decomposed subtask, ask for optional PR URL before resuming (defined in Completion)

Fatal errors are out of scope for this principle: configuration-file absence, malformed state file, irrecoverable skill / tool failures, and similar infrastructure-level errors halt the workflow with a diagnostic regardless of whether they appear in the list above. The No-Stall Principle governs successful step outcomes (including no-op successes); it does not force the agent to push through genuine errors.

At any point not listed above — including after Skill(simplify) / Skill(tidy), Skill(rules-review), Skill(extract-rules), Skill(run-tests), and reviewer skills return, and including collecting the background rules-review result (Step 7.5 sub-step 1) and the background code-review result (Step 8 sub-step 1), both launched in Step 7 — the agent must never wait for the user to say "continue" / "続けて". Semantic judgment of the returned result is sufficient.

No-summary turn at review-return boundaries. When a reviewer or sub-skill returns a result that is semantically "nothing actionable" (no findings, no violations, no changes needed — regardless of the exact wording or the length of the response), the immediately next turn must begin with a tool call (a TaskUpdate to mark the iteration as completed, or the next step's invocation), not with a prose summary of the review outcome. Category-by-category verdict lists, conclusion paragraphs, and "shall I proceed?" sentences are the stall pattern — emit them only in the Completion summary (the ### Completion section that runs after Step 11.5), never at review-return transition boundaries. This applies to: Skill(ask-peer) / Skill(ask-claude) / Skill(ask-codex) / Skill(ask-gemini) / Skill(ask-copilot) / Skill(ask-agy) returning no actionable findings at Step 3 or Step 8 (at Step 8, whether returned inline or collected from the Step 7 background launch), Skill(simplify) / Skill(tidy) returning no changes, Skill(rules-review) returning no violations (whether returned inline or collected from the Step 7 background launch), Skill(extract-rules) returning no new rules at Step 11, and any other sub-skill whose result is treated as success.

Callee verdict transcription is not a turn boundary. When a sub-skill (Skill(simplify) / Skill(tidy) / Skill(rules-review) / Skill(extract-rules) / Skill(run-tests) / reviewer skills / any other callee) returns a fenced JSON verdict, status token, or structured summary, and the orchestrator's response re-transcribes that block (verbatim or paraphrased) in its own output, the transcribed block does not end the orchestrator's turn. The same agent must immediately issue the next tool call in the same turn — the next sub-step's invocation, the next iteration's dispatch, the next phase's transition, the next Step's first tool call. Specifically forbidden: inserting a "shall I proceed?" sentence after the transcribed verdict; emitting "ここまでで一区切り" / "ここまでで完了です" prose summaries between the verdict and the next action; ending the response on the verdict block and waiting for the user to say "continue" / "続けて". This rule extends the "no-summary turn" rule above to the case where the sub-skill returned an actionable result and the orchestrator's response carries the verdict's content forward — the verdict transcription itself is informational, not terminal. (For skill development this covers Pattern A iteration loop verdict returns where the orchestrator re-renders the JSON before re-dispatching, orchestrator multi-callee chains where one callee's verdict feeds the next callee dispatch, sequential sub-step completion marking, and hook-chain continuations.) Sub-step completion prose ("Step N complete", "(d) verify-diff returned converged") follows the same rule: completion reports in prose are not turn-end signals; the next sub-step's first tool call must follow in the same turn.

Progress Visibility

Before any subagent-backed skill call (Skill(<name>) invocations including run-tests, ask-peer, simplify, tidy, rules-review, extract-rules) or any shell command expected to take ≥ 30 seconds, emit a brief status message naming what is starting — e.g. "Starting test run via run-tests…" or "Calling ask-peer for plan review (iteration 1 of N)…". Emit the message as prose in the same assistant turn that issues the tool call, not as a separate preceding turn. This lets the user distinguish an agent in active progress from one that has stalled. After the step returns, proceed immediately to the next step per the No-Stall Principle — do not emit a separate acknowledgment turn. When Step 7 launches the background rules-review and/or the code review as background Agents, each dispatch is a subagent-backed call and emits its status line here; collecting the background results later (rules-review at Step 7.5 sub-step 1, code review at Step 8 sub-step 1) is a non-stalling return-boundary — proceed without a separate acknowledgment turn.

Mid-chain visibility (chained sub-skill calls or extended interpretation between tool calls). When a workflow phase issues sub-skill calls in a chain or spans extended internal interpretation / preparation across multiple tool calls (for skill development this includes a pre-implementation feasibility-check phase that fires several sub-skill dispatches in sequence, or a routine skill that issues several sub-skill dispatches back-to-back), the single pre-call status message rule above does not fully cover the user-visibility window. Extend the rule with a "current-location" line emitted at semantic checkpoints between dispatches — one short sentence naming the current phase and the next action ("Finished verify-diff for Finding 1; next: skill-review polish on the same file"). To keep the addition from re-introducing stall, three constraints bind its shape: (a) emit the current-location line as prose in the same turn as the next tool call, never as a standalone turn that waits for user input; (b) restrict the content to current phase name and next action only — review-result summaries, decision rationales, and "shall I proceed?" sentences stay out; (c) the rule does not apply to short same-turn chains of a few tool calls that complete inside a single turn — only to phases where the gap between user-visible signals would otherwise span multiple turns. Intent: in chained sub-skill phases (feasibility checks, routine dispatch loops, multi-call interpretation work) the user keeps seeing "this is alive and moving", while the No-Stall Principle's confirmation-prohibition stays intact.

Workflow artifacts (cross-step fixed exclusion)

Files this workflow itself creates and maintains as in-session state — plan documents under .claude/plans/, decomposition state files written by Step 1.5 or Step 10, and other workflow-internal staging artifacts placed under .claude/ by this skill — are cross-step fixed exclusions from any per-step changed-file enumeration (Step 6 Tidy scope, Step 7.5 rules-review diff input, Step 10 Interactive Commits' commit grouping, sub-skill dispatch payloads, scope checks). The exclusion is structural — the workflow owns these files as its own operational substrate — and is not gated on whether the path appears in .gitignore, whether a formatter ignore-file aligns, or whether the user happens to be touching them in this run. Steps that build a changed-file set, a diff-scope set, or a commit grouping must apply this single shared exclusion rather than re-deriving the rationale per step against ad-hoc justifications. If a future change adds another in-session-state path, extend this canonical list once rather than threading the exclusion through per-step prose (for skill development this is the canonical workflow-artifact set; sub-skills the workflow dispatches that maintain their own in-session state under .claude/ follow the same convention).

Step 1: Load Settings

Read settings from up to three layers and merge (type-aware):
```
merged = {}
if ~/.claude/dev-workflow.local.md exists:  overlay its frontmatter onto merged
if .claude/dev-workflow.md exists:          overlay its frontmatter onto merged
if .claude/dev-workflow.local.md exists:    overlay its frontmatter onto merged
```
"Overlay" = for each key present in the file:
- Scalar keys: merged[key] = file[key] (replace). This includes review_iterations when its value is a map ({plan, code}) and the subagent_model map ({<tier>: <model>}): the whole map replaces the lower layer's value with no per-key cross-layer merge (a map key absent from the higher layer is not back-filled from the lower layer)
- List keys (check_commands): append file[key] items after merged[key], then deduplicate (keep first occurrence)
- List-replace keys (test_commands): merged[key] = file[key] — the higher layer's whole list replaces the lower layer's (no item-level merge or dedup)
- hooks: deep-merge — for each sub-key (e.g. on_complete), append and deduplicate the list
- null or empty ([], {}) explicitly clears the key — lower-layer value is discarded, not inherited
- Key absent from the file: left untouched (inherit from lower layers)
If a file's YAML frontmatter is malformed (parse error), warn the user naming the file, skip that layer, and continue with remaining layers.
If none of the three files exist, prompt user to run /dev-workflow --init and stop
Resolve reviewer from config. If not specified or not in the supported list (ask-peer, ask-claude, ask-codex, ask-gemini, ask-copilot, ask-agy), use ask-peer. Reviewer-family classification (the single definition referenced by the Step 3 / Step 8 inline-reviewer subagent_model propagation): Claude-family = ask-peer / ask-claude — model-controllable (ask-peer via its Model: argument applied to its internal Agent dispatch; ask-claude via the claude -p --model flag); external-CLI = ask-codex / ask-gemini / ask-copilot / ask-agy — these run their own non-Claude models and are not subagent_model-controllable (never receive a propagated model)
Resolve the review iteration counts — N_plan (Plan Review, Step 3) and N_code (Code Review, Step 8). A scalar config, the -i option, and the default all set both values equally; only the map config form makes them differ:
1. If -i / --iterations option is present and is a positive integer, set both N_plan and N_code to it (the option overrides both phases)
2. Else if config review_iterations is present:
  - scalar positive integer → set both N_plan and N_code to it
  - map ({plan, code}) → N_plan = plan if it is a positive integer else default 3; N_code = code if it is a positive integer else default 3 (per-key validation, independent per phase; warn on each absent/invalid key)
  - any other value (non-positive or non-integer scalar, list, string, or any non-map / non-scalar type) → warn and set both to default 3 (any map takes the map branch above — an empty map, or a map with no valid plan / code key, already resolves to default 3 per phase there)
3. Else use default 3 for both
Wherever a later step says "N" without a phase qualifier, Step 3 references resolve to N_plan and Step 8 references to N_code.
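
The resolution order above (option, then config scalar or map, then default) can be sketched as below. The function name and the warning strings are hypothetical; the branch structure follows the numbered rules.

```python
DEFAULT_ITERATIONS = 3


def _is_positive_int(value: object) -> bool:
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def resolve_iterations(cli_value: object, config: dict) -> tuple[int, int, list[str]]:
    """Return (N_plan, N_code, warnings)."""
    warnings: list[str] = []
    # 1. A valid -i / --iterations overrides both phases.
    if _is_positive_int(cli_value):
        return cli_value, cli_value, warnings
    # 2. Otherwise consult config review_iterations when present.
    if "review_iterations" in config:
        value = config["review_iterations"]
        if _is_positive_int(value):
            return value, value, warnings
        if isinstance(value, dict):
            resolved = {}
            for phase in ("plan", "code"):  # per-key validation
                if _is_positive_int(value.get(phase)):
                    resolved[phase] = value[phase]
                else:
                    warnings.append(f"review_iterations.{phase} absent or invalid; using {DEFAULT_ITERATIONS}")
                    resolved[phase] = DEFAULT_ITERATIONS
            return resolved["plan"], resolved["code"], warnings
        warnings.append(f"review_iterations invalid; using {DEFAULT_ITERATIONS} for both phases")
    # 3. Default for both phases.
    return DEFAULT_ITERATIONS, DEFAULT_ITERATIONS, warnings
```

Only the map branch can make the two counts differ, which matches the note that a scalar, the option, and the default all set both values equally.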
Parse the remaining keys from config:
- hooks: warn and ignore if hooks.on_complete has invalid format
- review_iterations: emit the invalid-value / invalid-map-key warnings as sub-step 4's resolution defines (sub-step 4 owns the case analysis and the default-3 fallback per phase)
- custom_instructions (optional, string): warn and ignore if not a string
- task_decomposition (optional, boolean, default true): warn and fall back to true if present but not a boolean
- interactive_commits (optional, boolean, default true): warn and fall back to true if present but not a boolean
- compact_rules (optional, boolean, default false): warn and fall back to false if present but not a boolean
- subagent_model (optional, map of tier → model): warn and ignore a non-map value (the built-in per-tier default applies); for each tier entry, warn and drop an unknown tier key or a value outside the enum sonnet / opus / haiku / inherit (that tier then falls to its built-in default at resolution time). The merged map is consumed by Step 2's subagent_model resolution
- language: parse per the Configuration bullet above. For ~/.claude/settings.json, silently accept missing file / absent key / null value; warn once per Step 1 settings-load pass on malformed JSON, non-string, or empty string. The resolved language is available to Step 11.5
- self_retrospective.feedback (optional, string): warn and ignore if not a string or if empty string "". When feedback matches the owner/repo pattern (^[\w.-]+/[\w.-]+$), additionally run gh auth status as an early warning only: if auth fails, warn but do not block the run
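
The subagent_model validation described above, together with the built-in per-tier defaults from the Configuration section, might look like this. It is a sketch only: the function names are hypothetical, and it covers validation plus the per-tier default lookup, not the full Step 2 resolution chain or its -i-path handling.

```python
TIERS = ("trivial", "simple", "moderate", "complex")
MODELS = ("sonnet", "opus", "haiku", "inherit")
BUILTIN_DEFAULT = {"trivial": "sonnet", "simple": "sonnet",
                   "moderate": "inherit", "complex": "inherit"}


def parse_subagent_model(raw: object) -> tuple[dict, list[str]]:
    """Validate the merged subagent_model value; return (cleaned map, warnings)."""
    warnings: list[str] = []
    if raw is None:
        return {}, warnings  # key absent: built-in defaults apply silently
    if not isinstance(raw, dict):
        warnings.append("subagent_model is not a map; using built-in defaults")
        return {}, warnings
    cleaned = {}
    for tier, model in raw.items():
        if tier not in TIERS:
            warnings.append(f"unknown tier {tier!r} dropped")
        elif model not in MODELS:
            warnings.append(f"invalid model for {tier}; using built-in default")
        else:
            cleaned[tier] = model
    return cleaned, warnings


def model_for_tier(cleaned: dict, tier: str) -> str:
    """A dropped or absent tier falls to its built-in per-tier default."""
    return cleaned.get(tier, BUILTIN_DEFAULT[tier])
```

For example, a config of `{trivial: inherit}` leaves simple on sonnet, because an absent tier key is filled from the built-in default rather than from a lower layer.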
Determine execution sub-mode: Resume if --resume <state-file> was provided, otherwise Normal. Step 1.5 branches on this
Register all workflow phases with the Task tools, including review iterations — issue one TaskCreate per phase below (each returns an auto-numbered taskId). Do NOT skip any phase:
- Step 1.5: Task Decomposition (Normal sub-mode only, AND only when task_decomposition is true — omit this entry entirely in Resume sub-mode or when task_decomposition is false, since in either case the step has nothing to do at registration time)
- Step 2: Create Plan
- Step 3: Plan Review
- Step 3-1 through Step 3-N_plan: Plan Review - iteration 1 through N_plan (generate N_plan items based on resolved N_plan)
- Step 4: Finalize Plan
- Step 5: Implement
- Step 6: Tidy
- Step 7: Check / Test [check: {check_commands} | test: {test_commands}]
- Step 7.5: Rules Compliance Review
- Step 8: Code Review
- Step 8-1 through Step 8-N_code: Code Review - iteration 1 through N_code (generate N_code items based on resolved N_code)
- Step 9: Completion Hooks (only if hooks.on_complete is configured)
- Step 10: Interactive Commits (only if interactive_commits is true; single row — per-commit iteration is handled inline within Step 10 because the commit count is not known until the proposal phase)
- Step 11: Update Rules
- Step 11.5: Self-Retrospective (only if self_retrospective.feedback is set and parses as a valid destination; see Step 11.5 for detection rules, and omit this entry if unset/invalid)
Tool availability (Task tools vs TodoWrite): these steps name the Task tools (TaskCreate / TaskUpdate / TaskList), the default since Claude Code v2.1.142. Where the Task tools are unavailable (e.g. the VSCode extension, or Claude Code before v2.1.142), use the equivalent TodoWrite operations instead: the status values (pending / in_progress / completed) and the register-all-upfront semantics are identical, and a TaskList-by-subject status read becomes a read of the TodoWrite list. allowed-tools grants both, so use whichever the environment exposes.
Registration mechanics (Task tools): issue every TaskCreate in a single upfront burst (one tool-call batch) so all phases are registered before Step 2 begins. Two conditional cases: (i) conditionally-omitted phases (the list items above carrying a condition) are omitted by not issuing their TaskCreate; (ii) N-reduced excess iteration tasks (Step 3-x beyond resolved N_plan / Step 8-x beyond resolved N_code) are still TaskCreated here at the resolved ceiling (Step 3 ceiling = N_plan, Step 8 ceiling = N_code), then marked completed via TaskUpdate by Step 2's Adjust N by difficulty. Mark each task in_progress (via TaskUpdate {taskId, status}) when starting and completed when done.
Task-handle resolution convention: every later "mark Step N as in_progress / completed" instruction in this skill is shorthand for: resolve that Step's task (by its registration-time captured taskId, or by subject via TaskList), then TaskUpdate {taskId, status}. The per-step lines name tasks by their human-readable subject and do not restate this resolution path. Registering all phases upfront gives the user visibility into overall progress and prevents steps from being accidentally dropped.
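The registration list can be sketched as a function of the resolved counts and the conditional flags. This is an illustrative sketch only: the function and parameter names are hypothetical, the per-iteration subject strings are an assumed naming, and the Step 7 bracket suffix with the check/test commands is omitted.

```python
def build_phase_subjects(n_plan, n_code, *, resume=False, task_decomposition=True,
                         has_hooks=False, interactive_commits=True,
                         retrospective=False):
    """Return the task subjects to register upfront, in workflow order."""
    subjects = []
    # Step 1.5 is omitted entirely in Resume sub-mode or when decomposition is off.
    if not resume and task_decomposition:
        subjects.append("Step 1.5: Task Decomposition")
    subjects += ["Step 2: Create Plan", "Step 3: Plan Review"]
    subjects += [f"Step 3-{i}: Plan Review - iteration {i}" for i in range(1, n_plan + 1)]
    subjects += [
        "Step 4: Finalize Plan",
        "Step 5: Implement",
        "Step 6: Tidy",
        "Step 7: Check / Test",
        "Step 7.5: Rules Compliance Review",
        "Step 8: Code Review",
    ]
    subjects += [f"Step 8-{i}: Code Review - iteration {i}" for i in range(1, n_code + 1)]
    if has_hooks:
        subjects.append("Step 9: Completion Hooks")
    if interactive_commits:
        subjects.append("Step 10: Interactive Commits")
    subjects.append("Step 11: Update Rules")
    if retrospective:
        subjects.append("Step 11.5: Self-Retrospective")
    return subjects
```

Each returned subject corresponds to one TaskCreate in the upfront batch; conditionally-omitted phases never appear in the list.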
Phase-boundary self-audit: at every top-level Step transition (not the iteration sub-rows Step 3-i / Step 8-i, which are governed by the Return-point no-stall reminders below), before issuing the first tool call that advances into a new Step's procedure, name the Step number you are entering, resolve the prior Step's task by subject via TaskList, and verify it is completed — if it is still pending or in_progress, return to the unfinished Step first instead of advancing. This guards against silent phase-skipping (e.g. jumping from Step 5 Implement to Step 7 Check / Test without running Step 6 Tidy, only to discover the gap during a later phase) that the task registration alone cannot prevent. Implementation sub-tasks in Step 5 are additions, not replacements. Note: Unless -i / --iterations was explicitly specified, Step 2 may reduce N_plan / N_code based on task difficulty.
Context-compaction recovery: if the session context was compacted (prior turns summarized) before reaching this step in the current turn, re-read the configuration files from disk rather than relying on the summary — verify each step's skip conditions (e.g. whether self_retrospective.feedback is set, whether hooks.on_complete is configured, whether interactive_commits is true, whether compact_rules is true) from the actual merged config, not from compacted context.

Step 1.5: Task Decomposition

This step decides whether the user's request should be split into multiple smaller subtasks (each delivered as its own PR), or — in Resume sub-mode — picks the next subtask from an existing state file under .claude/plans/dev-workflow.<slug>.md.

State-file semantics are critical (a malformed or mis-routed file silently corrupts subtask boundaries), so the full procedure lives in a dedicated reference. Dispatch:

Resume sub-mode (--resume <state-file> was provided): read references/task-decomposition.md and follow section A. Resume sub-mode from top to bottom.
Normal sub-mode + task_decomposition: true (the default): read references/task-decomposition.md and follow section B. Normal sub-mode.
Normal sub-mode + task_decomposition: false: no decomposition work. Set the "effective task" to the original request and proceed to Step 2 without creating a state file. Step 1.5 is not registered as a task in this case (see Step 1), so there is nothing to mark completed. You do not need to read the reference file.

EnterPlanMode is reserved for Step 2 — any decomposition proposal in Step 1.5 is a plain yes/no dialogue, not a plan.

After section A or B completes, the "effective task" is set for Step 2 onward: the selected subtask when decomposed, otherwise the original request.
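The three-way dispatch above reduces to a small decision function. A minimal sketch, with a hypothetical function name and return shape; the section labels come from references/task-decomposition.md as cited in the text.

```python
def dispatch_step_1_5(resume_state_file, task_decomposition):
    """Return which reference section to follow, or None when Step 1.5 has no work.

    resume_state_file: path passed via --resume, or None in Normal sub-mode.
    task_decomposition: the merged-config boolean (default true).
    """
    if resume_state_file is not None:
        return "A. Resume sub-mode"
    if task_decomposition:
        return "B. Normal sub-mode"
    # No decomposition: the effective task is the original request,
    # no state file is created, and the reference file is not read.
    return None
```

Note that --resume wins regardless of task_decomposition, which matches the registration rule in Step 1 (the Step 1.5 task row is omitted in Resume sub-mode).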

Step 2: Create Plan

Record the current commit as base-commit (git rev-parse HEAD) for later diff comparison. Initialize the difficulty-skip ledger here: set difficulty_skipped_steps = [] (a cross-step list of human-readable records — <step> skipped (<tier> tier) — that § Completion's difficulty-skip reminder renders). This init lives at Step 2 entry, outside the -i-gated Adjust N sub-step below, so the variable is well-defined on every path: when Adjust N is skipped (because -i / --iterations was given) or no tier qualifies for a skip, the ledger simply stays empty and § Completion omits the reminder. Same purpose as the compaction_applied_count State-variable contract (well-defined on the skipped path) but a different technique: compaction initializes inside its conditionally-skipped sub-step and relies on a prose contract; this ledger is physically hoisted out of the skippable sub-step instead — do not relocate it into Adjust N expecting prose to cover the -i path, since the init statement would then not run under -i. Also initialize the subagent_model cross-step variable here (same hoist rationale): set subagent_model = inherit (no model override). The built-in tier → model map (§ Configuration) is consulted only after a tier is assessed in Adjust N; the pre-assessment value — and the value on the -i path, where Adjust N is skipped and no tier is assessed — is always inherit, so every downstream Agent dispatch / Model: propagation omits the model (current behavior). Read sites: Step 7's two background launches, Step 7.5's sequential rules-review, Step 11.5, and the Step 3 / Step 8 inline reviewer (Claude-family only).
EnterPlanMode
Analyze the task and codebase, create implementation plan. Apply custom_instructions to shape plan priorities and structure. Follow the structure defined in references/plan-format.md — Overview / Decisions / Design / Test plan required; Risks / Unknowns optional. When the work is sequential, Design defaults to an ordered, numbered list of implementation steps (see references/plan-format.md § Template, the source of truth). Section-level content rules live in the reference file; do not re-derive them here.
- If a state file exists (this run is executing one subtask of a decomposed parent): the "effective task" = the current in_progress subtask. Frame the plan around just this subtask while keeping the full parent task and other subtasks as background context so the plan stays consistent with the overall direction. Do not plan work belonging to other subtasks. See references/plan-format.md § Subtask / Resume handling for how Decisions is scoped in this case
- TDD-conflict resolution: if custom_instructions includes a TDD-style requirement (e.g. "Always use TDD", "write tests before implementation") AND the current task is adding tests for existing behavior (characterization tests, coverage tests, or relocating existing tests — keywords: "add tests for", "characterize behavior", "test coverage", "move tests", "固定する", "追加する") rather than driving new implementation, declare explicitly in Plan Overview or Risks that this subtask is TDD-loop-external: tests describe and fix already-implemented behavior, not specification of new behavior. This resolves the apparent conflict: the TDD guideline governs feature-implementation subtasks; characterization and coverage subtasks are outside the TDD loop by design.
- Version/identifier string replacement tasks: if the core operation is replacing a specific version string, identifier, or constant across the project (e.g. version bump, rename, migration), grep the entire repository for the old value before drafting the plan. Include the complete list of affected files in the Design section — missing even one location is the primary regression source for this task class
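For the version/identifier replacement case, the "grep the entire repository" step could look like the sketch below. It is one possible approach under stated assumptions: the helper name is hypothetical, only the .git directory is skipped by default, and files are read as text with undecodable bytes ignored. A plain repository grep achieves the same result.

```python
import os


def find_files_containing(root, needle, skip_dirs=(".git",)):
    """Return sorted relative paths of files under root whose text contains needle."""
    hits = []
    for dirpath, dirnames, filenames in os.walk(root):
        # Prune skipped directories in place so os.walk does not descend into them.
        dirnames[:] = [d for d in dirnames if d not in skip_dirs]
        for name in filenames:
            path = os.path.join(dirpath, name)
            try:
                with open(path, "r", encoding="utf-8", errors="ignore") as fh:
                    if needle in fh.read():
                        hits.append(os.path.relpath(path, root))
            except OSError:
                # Unreadable file: skip it rather than abort the scan.
                continue
    return sorted(hits)
```

The returned list is what the Design section should enumerate in full before the plan goes to review.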
Simplicity self-audit: Before proceeding to Step 3, read references/simplicity-self-audit.md and audit the plan against its checklist.
Plan self-check: Run the checklist in references/plan-format.md § Step 2 self-check against the plan. This is the author's first-pass judgment on Decisions content; fix any failures before Step 3.
No code changes in this phase
Adjust N by difficulty (skip if -i / --iterations was explicitly specified): A typo fix doesn't need 3 rounds of review. Based on the plan just created, assess task difficulty and reduce the iteration counts to avoid unnecessary iterations — the configured value is a ceiling, not a target. The same difficulty cap is applied independently to N_plan and N_code (the two values that may differ only when review_iterations is a map; otherwise they are already equal):
- Trivial (a genuinely self-evident change with a single unambiguous solution — a typo fix, a one-line edit, a config value change): N_plan = N_code = 0 — Step 3 (Plan Review) and Step 8 (Code Review) are skipped entirely. Difficulty-skip matrix (Trivial): additionally skip Step 6 Tidy and Step 7.5 Rules Compliance Review — at this tier the cleanup pass and the rules-compliance walk are low-yield, and the Step 4 plan-approval gate plus Step 7 check_commands / test_commands remain the safety net. Conservative tie-break: classify as Trivial only when the solution is truly unique and obvious; if the change spans more than a trivial edit, the solution is not uniquely determined, or there is any doubt at all, fall to Simple or above so internal review is retained. The same external-library major-bump exception described under Simple applies here too (such a change is never Trivial)
- Simple (typo fix, config tweak, straightforward bug fix with obvious solution): N_plan = N_code = 1 — unless the change touches an external library's config file or type-level API AND that library had a recent major-version bump (primary check: git diff <base-commit> of the package manifest; if absent in this run, judge from other context since the bump may predate this run); then classify as at least Moderate. Similar qualitative risks (external config-DSL rewrites, etc.) follow the same rule. Purely cosmetic edits (comments, whitespace, auto-formatting) do not trigger the exception — use judgment. Difficulty-skip matrix (Simple): additionally skip Step 6 Tidy only (a pure quality-cleanup pass, correctness-neutral); Step 7.5 Rules Compliance Review still runs, since rule violations do not correlate with change size
- Moderate (multi-file within one module, feature following existing patterns): N_plan = min(2, N_plan), N_code = min(2, N_code) — no step is skipped (difficulty-skip matrix applies to Trivial / Simple only)
- Complex (cross-module, new patterns, API changes, significant refactoring): keep N_plan and N_code — no step is skipped
Step 9 (Completion Hooks) is never skipped by the difficulty-skip matrix at any tier — hooks.on_complete is a project-configured open list whose callee set varies per project, so difficulty-gating it would make behavior project-dependent; the matrix covers only the whole-step-skippable Step 6 / Step 7.5.

File count is a hint, not the sole criterion. If adjusted, mark excess task iteration items (Step 3-x beyond N_plan, Step 8-x beyond N_code) completed via TaskUpdate. When Trivial reduces both to 0, mark every Step 3-x / Step 8-x iteration item AND the top-level Step 3: Plan Review / Step 8: Code Review rows as completed — both steps are skipped entirely (their entry-point guards in Step 3 / Step 8 recognize this pre-completed state and pass straight through; only Trivial produces 0, and it zeroes both counts together, so the "Trivial → both steps skipped" coupling holds). Difficulty-skip matrix marking (Step 6 / Step 7.5): apply the same pre-completed-mark + entry-point-guard mechanism to the whole-step-skippable quality steps. Keyed on the assessed tier alone (no config flag): for Trivial, mark the top-level Step 6: Tidy AND Step 7.5: Rules Compliance Review rows completed; for Simple, mark only Step 6: Tidy completed; for Moderate / Complex, mark neither. For each row marked completed here, append one record to difficulty_skipped_steps (e.g. Step 6 Tidy skipped (Trivial tier) / Step 7.5 Rules Compliance Review skipped (Trivial tier)) so § Completion's difficulty-skip reminder can render it (the skip is never silent). Step 9 (Completion Hooks) is never marked here (see the Step 9 note above). Resolve subagent_model here (after the tier is assessed, in the same Adjust N pass): set subagent_model = the merged-config subagent_model map entry for the lowercased assessed tier name (trivial / simple / moderate / complex) when that key is present and valid, else the built-in default for that tier (sonnet for Trivial / Simple, inherit for Moderate / Complex), else inherit. A resolved value of inherit means downstream dispatches omit the model (current behavior). This resolution is skipped under -i / --iterations (Adjust N does not run), leaving subagent_model at its sub-step 1 inherit init. 
Log the assessed difficulty and effective N_plan / N_code in the resolved language (see §Configuration; default ja). The Step 11.5 task row is not affected by the difficulty assessment — it stays pending regardless, since the self-retrospective is gated only on self_retrospective.feedback.
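The Adjust N rules, the difficulty-skip matrix, and the subagent_model resolution can be summarized in one sketch. Names such as Adjustment and adjust_by_difficulty are hypothetical, and config_models is assumed to be the already-validated map from Step 1; the tier assessment itself remains a judgment call that code cannot make.

```python
from dataclasses import dataclass, field

# Built-in tier -> model defaults as described in the resolution rule.
BUILTIN_MODEL = {"trivial": "sonnet", "simple": "sonnet",
                 "moderate": "inherit", "complex": "inherit"}


@dataclass
class Adjustment:
    n_plan: int
    n_code: int
    skipped_steps: list = field(default_factory=list)
    subagent_model: str = "inherit"


def adjust_by_difficulty(tier, n_plan, n_code, config_models=None, iterations_flag=False):
    """Apply the difficulty cap independently to n_plan and n_code."""
    if iterations_flag:
        # -i / --iterations given: Adjust N does not run, ledger stays empty,
        # and subagent_model keeps its inherit init.
        return Adjustment(n_plan, n_code)
    key = tier.lower()
    skipped = []
    if key == "trivial":
        n_plan = n_code = 0
        skipped = ["Step 6 Tidy skipped (Trivial tier)",
                   "Step 7.5 Rules Compliance Review skipped (Trivial tier)"]
    elif key == "simple":
        n_plan = n_code = 1
        skipped = ["Step 6 Tidy skipped (Simple tier)"]
    elif key == "moderate":
        n_plan, n_code = min(2, n_plan), min(2, n_code)
    elif key != "complex":
        raise ValueError(f"unknown tier: {tier!r}")
    model = (config_models or {}).get(key, BUILTIN_MODEL[key])
    return Adjustment(n_plan, n_code, skipped, model)
```

Step 9 never appears in skipped_steps at any tier, mirroring the note that completion hooks are outside the matrix.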
Do not present the plan to the user or ask for approval/confirmation — presenting an unreviewed plan wastes user time and risks approval of a suboptimal approach. This prohibition extends to confirmation-seeking transition sentences such as "if this design looks good, I'll proceed to Step 3 (Plan Review)", "shall I move on to Plan Review?", or any equivalent ask-for-go-ahead phrasing — these read as natural conversation but constitute the same approval-gate that wastes user attention on an unreviewed plan. The moment Step 2 ends, advance directly to Step 3 without emitting any user-facing message about the plan or the transition. The user will see the plan in Step 4 (internally reviewed in Step 3, unless the task was assessed Trivial — N_plan=0 — in which case Step 3 is skipped and the plan reaches Step 4 unreviewed).

Step 3: Plan Review

This step is an internal review — the reviewer refines the plan before the user sees it, so the user receives a higher-quality plan in Step 4. Do not present the plan to the user or ask for feedback during this step.

Difficulty exception (Trivial / N_plan=0). When Step 2's difficulty assessment set N_plan = 0 (a Trivial task — Trivial zeroes both N_plan and N_code), this entire step is skipped: its task rows (top-level Step 3: Plan Review and every Step 3-x) were already marked completed by Step 2's Adjust N by difficulty. This skip is gated on task difficulty, not on the presence of user-provided analysis — the analysis-substitution prohibition below still applies in full to every Simple / Moderate / Complex task (N_plan ≥ 1).

Always run (for N_plan ≥ 1). Step 3 is not skippable on the grounds that the user's task prompt contained design analysis, prior-session handoff material, or review-like commentary. User-provided analysis is upstream planning content the user wrote — it is not an independent bias-free peer review pass and does not substitute for the reviewer dispatch. Handling rules (closed list):

(i) The Step 3 reviewer skill is always invoked.
(ii) User-provided analysis (long task descriptions that themselves argue for the approach, embedded justification in handoff docs, etc.) is fed into the reviewer skill's dispatch payload as additional context so the reviewer can build on it rather than re-derive it.
(iii) An explicit user override in the task prompt ("you may skip Step 3 for this run", or equivalent) is the only analysis-driven path to skipping (distinct from the difficulty exception above). When this fires, record a warning in the Completion summary so the user has a visible signal that the bias-free review pass was bypassed.

The existing per-iteration "No actionable findings" semantic-judgment skip continues to work — that is a reviewer-side decision (the reviewer ran and returned no actionable feedback), not a Step-skip.

If N_plan = 0 (Trivial), skip this step entirely — its rows are already completed (see the Difficulty exception above), so do not re-mark them in_progress and proceed directly to Step 4. The following in_progress marking and per-iteration processing apply only when N_plan ≥ 1.

Mark Step 3: Plan Review as in_progress. Process each pending iteration item (Step 3-1 through 3-N_plan) in order:

Mark the iteration item as in_progress. Call the reviewer skill resolved in Step 1 (e.g. Skill(ask-peer)): Review the plan. subagent_model propagation (inline reviewer): when the resolved reviewer is Claude-family (per Step 1's reviewer-family classification) and subagent_model is a model id, propagate it — pass Model: <subagent_model> to ask-peer, or include --model <subagent_model> in the dispatch instruction to ask-claude. External-CLI reviewers and an inherit resolution carry no model (current behavior). Step 3 is always inline, so there is no background-launch path to double-apply against here. Pre-dispatch dispatch-boundary reminder: Issue the Skill(<reviewer>) call in the same turn as any accompanying status prose — never produce a standalone status turn before the Skill() call, as that creates a stall point. Reading the reviewer's SKILL.md is preparation, not dispatch; the Skill() call is the dispatch.
- Instruct reviewer to read all files under .claude/rules/ for project conventions, references/plan-format.md for the Decisions (a)+(b) criterion and § Step 3 (f) content-quality rubric, references/simplicity-self-audit.md for the Step 2 audit checklist that category (a) below verifies, and references/review-categories.md § Plan review categories for the full per-category rubric of the six categories below (resolve these references/*.md links to concrete readable paths when composing the request — the reviewer lacks the skill-directory context)
- Request feedback organized into six categories (labels only — full rubric per the read-instruction above): a. Scope & feasibility b. Approach & alternatives c. Completeness d. Incrementality e. External library primary-source verification f. Presentation & attention allocation (content quality)
- If custom_instructions is configured, include the instructions text in the review request and have the reviewer verify alignment and report conflicts
- If a state file is active (executing a subtask from a decomposition), include the current subtask's scope in the reviewer request: list the subtask's title and description, then list what the other subtasks cover (to define out-of-scope). Instruct the reviewer that missing functionality belonging to other subtasks is not an actionable finding for this review — only findings scoped to the current subtask qualify. Omit this when no state file is active (single-task execution has no defined out-of-scope boundary).
- Reviewer should only report actionable findings. If none, explicitly state "No actionable findings"
Judge the reviewer's response semantically: if the reviewer reports nothing actionable — no actionable findings, no improvements to apply, no review points raised, or any other "nothing to report" outcome regardless of exact wording — mark this and remaining iteration items as completed (skip). Mark Step 3: Plan Review as completed and proceed to Step 4 automatically. Per the No-Stall Principle, do not wait for user input and do not rely on exact-phrase matching; trust semantic judgment since the reviewer skill's phrasing varies (especially Skill(ask-peer) and other free-form-prose reviewers whose verdicts are natural-language Markdown rather than a fixed token).
Otherwise: autonomously apply improvements or reject inapplicable points with reason — do not ask the user for judgment on individual review findings. Mark this iteration item as completed.
- Approach-reconsideration self-audit on high findings count (iter 1 only): at the iter 1 → iter 2 boundary, before applying findings individually for iter 1's output, count the reviewer's findings by severity. If either threshold trips — Critical ≥ 3 OR (Critical + Major) ≥ 10 — additionally scan the findings list for any item that surfaces an approach-level alternative (typical phrasings: "X の方が筋がよい", "existing X と統合できる", "switch to <sibling>", "use <existing-mechanism> instead", or any equivalent "the plan should adopt a different overall approach" framing). If at least one such approach-alternative finding is present, do not proceed with mechanical apply-and-iterate — instead, treat the findings cluster as a signal that the plan's Approach selection itself is the root cause. Rewrite the plan with the approach-alternative finding's direction promoted into the Decisions section (Recommendation / Alternative swap or insertion-direction new Decision item, per the rewrite class), add a new review iteration item Step 3-(N_plan+1), and return to Step 3 to re-review the rewritten plan. The remaining iter-1 findings are carried forward as context for the next reviewer. When the threshold trips but no approach-alternative finding is present (mechanical-fix-level findings only), proceed with the usual per-finding apply-and-iterate path. This audit applies only at the iter 1 → iter 2 boundary; later iterations have already exercised one or more apply cycles and approach-reconsideration after that point is the Step 4 user-gate's responsibility (general principle: high finding density paired with an approach-level alternative finding is a structural signal, not a quality signal — keep applying mechanical fixes and the plan still fails at Step 4 user gate).
- Prose-integrity self-check (post-fix): after applying a fix that edits plan prose adjacent to its target line (Decisions / Design / Test plan / Risks / Unknowns paragraphs), re-read the surrounding paragraph as a single unit before continuing — verify no sentence is cut mid-word, no logical connective is broken (the connectives however / therefore / because / but / etc. still anchor real clauses), and the paragraph's overall logic still holds. Mechanical fix patches see only the line-level diff and routinely leave the surrounding prose semantically broken in ways the next iteration's reviewer flags as a Major finding, costing an extra iter.
- If the plan was modified: continue to the next pending iteration item (back to step 1). Plan modifications often introduce new gaps or ripple effects that the previous reviewer had no chance to see — the re-review round-trip is cheap compared to shipping a plan that looks fine to the author but has an unvetted section. Don't short-circuit even when the fixes feel airtight
- If all points were rejected (no modifications): mark remaining iteration items as completed (skip, since there is nothing new for the next reviewer to look at)
Continue to the next pending iteration item with:
- the updated plan
- a summary of changes made and rejections with reasons
- an iteration-scope instruction: from iteration 2 onward, the reviewer's primary verification scope is the plan changes applied since the previous iteration (conveyed by the summary of changes above — no separate diff artifact is provided) plus landing confirmation of the previous iteration's findings — the full-coverage pass (re-verifying every plan section, decision, and cited reference from scratch) belongs to iteration 1 only. The reviewer must still escalate back to full re-verification when content outside that primary scope raises a new concern, so coverage is reordered, not reduced
- the same six-category structure (a–f), .claude/rules/ reference, and "No actionable findings" requirement
Return-point no-stall reminder: At each iteration boundary (regardless of reviewer outcome — findings reported, "No actionable findings", any non-error result), the next action — the next iteration's reviewer dispatch when more iteration items remain, or the Step 4 transition when this was the last iteration or "No actionable findings" was returned — must be issued in the next tool call. Do not insert an interstitial summary or acknowledgment turn between iterations; the abstract enumeration in § No-Stall Principle is intentionally duplicated here so the rule fires at the decision moment.
If all N_plan iteration items are completed and actionable feedback still remains, carry the unresolved points forward to Step 4.
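The approach-reconsideration trigger in the iteration rules above combines a numeric threshold with a semantic signal. A minimal sketch, assuming the severity counts and the approach-alternative flag have already been extracted from the reviewer's findings (the flag is a semantic judgment, not a string match); the function name is hypothetical.

```python
def needs_approach_rewrite(iteration, critical, major, has_approach_alternative):
    """Return True when the plan should be rewritten instead of patched.

    Applies only at the iter 1 -> iter 2 boundary: Critical >= 3 OR
    (Critical + Major) >= 10, and at least one finding proposes a
    different overall approach.
    """
    if iteration != 1:
        return False
    threshold_tripped = critical >= 3 or (critical + major) >= 10
    return threshold_tripped and has_approach_alternative
```

When this returns True, the rewrite adds Step 3-(N_plan+1) and the remaining iter-1 findings are carried forward as context; when the threshold trips without an approach-level finding, the usual per-finding path applies.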

Mark Step 3: Plan Review as completed.

Step 4: Finalize Plan (USER APPROVAL GATE)

Before presenting, verify via TaskList that Step 3: Plan Review and every Step 3-x iteration item are completed. ExitPlanMode is the effective exit from Plan Mode, so issuing it while any Step 3 item is still pending or in_progress skips the internal review entirely. If any Step 3 item is not completed, emit a one-line inline note to the user naming all incomplete items (e.g., "Plan review found incomplete (Step 3-2 still pending): running the remaining review pass before presenting the plan.", substituting the actual incomplete iteration item label(s)), then return to Step 3 to process them (do not flip a row to completed without doing the review work). Exception: when Step 2's difficulty assessment set N_plan=0 (a Trivial task) and therefore pre-marked all Step 3 rows completed, that completed state is the intended skip, not an unrun-review bug, so proceed to ExitPlanMode normally.
1.5. Prose-language self-audit: Before calling ExitPlanMode, verify that explanation prose in the plan body (Overview narrative, Decisions rationale, Design descriptions, Test plan steps, Risks/Unknowns paragraphs) is written in the resolved language. Schema tokens (Overview / Decisions / Design / Test plan / Risks / Unknowns), step labels, enum values, identifiers, and quoted code strings stay in their original form regardless of language. Audit both directions: (a) whether any explanation sentences are in a language other than the resolved language, and (b) whether concept words outside the verbatim-preserve scope (ordinary nouns, adjectives, conjunctions, verb phrases) are over-preserved in the source language rather than rendered in the resolved language (per references/plan-format.md § Localization granularity's Negative-direction rule). Revise any failures now per references/plan-format.md § Localization granularity before proceeding to sub-step 2.
Re-entry coverage: this audit must re-run on every entry into Step 4 — both the initial entry and any re-entry triggered by sub-step 1's "return to Step 3" path or sub-step 3's material-change path — since revisions during Step 3 iteration may introduce prose in a language different from the resolved language.
This is the first time the user sees the plan. Write the full plan body to the Plan Mode plan file with the Write tool (the ExitPlanMode approval modal renders that file's contents), and present a condensed view in chat per the two-tier protocol in references/plan-format.md § Step 4 presentation order — internally reviewed in Step 3 for N_plan ≥ 1 (include any unresolved review points from Step 3); for a Trivial task (N_plan=0) Step 3 was skipped, so present the plan as unreviewed and rely on this user-approval gate as the sole review. Render the chat view in this order: a. ## Plan header as a visual boundary. b. The > Review guide line (per references/plan-format.md § Review guide line) followed by the condensed plan body, following references/plan-format.md § Localization granularity in the resolved language (see §Configuration; default ja): Overview in full (including Highlights when present), Decisions in full, and Design as a file-list only (files to change + one line of what-changes each). Test plan and Risks / Unknowns are not rendered in chat — they live in the full plan file and surface via the preamble's verification approach / known risks slots. Section headings render at ### level (one below the ## Plan container); sub-sections (Title, Goal, Scope, Decision N, Implementation, etc.) at ####. c. Horizontal rule (---) separator. d. Summary preamble per references/plan-format.md § User-gate summary preamble. e. Guidance line per references/plan-format.md § Step 4 guidance lines (verbatim, no paraphrasing, no concatenation). f. Call ExitPlanMode in the same turn, immediately after the guidance line. ExitPlanMode triggers the approval modal (which renders the full plan file) — if it is not called, the user sees the plan text but has no way to approve. Delaying ExitPlanMode to a subsequent turn is the primary cause of Step 4 appearing stalled.

Section headings (Overview / Decisions / Design / Test plan / Risks / Unknowns) and the Step 4 guidance line stay English.
Collaborate with the user to refine the plan as needed (normal Plan Mode interaction). Categorize each user response into one of the four buckets below via semantic judgment (per § No-Stall Principle's "do not rely on exact-phrase matching" rule — example phrasings are illustrative, not literal discriminators):
- accept: explicit affirmative — "OK" / "approve" / "looks good" / "進めて" / any semantic equivalent. Begin implementation.
- swap-decisions (Decisions Recommendation/Alternative swap on one or more specific items — "Decision 1 を Alternative に", "swap the recommendation on the language flag", "use the alternative for Decision N", "Decision N と M は Alternative で残りはそのまま"): re-render the plan with the specified Recommendation / Alternative pairs swapped on the named Decisions items, leave other items unchanged, run the read-back sub-step below, then re-present the plan (re-enter the gate). When the user names multiple Decisions in one message, list every affected item on the read-back line so partial-coverage misses cannot slip through.
- rewrite-approach (Approach / Design / Scope-level material change — "switch from independent skill to extending sibling mode", "split this into two subtasks", "scope down to only the canonical site", or any change that does not fit a clean Decisions swap): add a new review iteration item (Step 3-(N_plan+1)), run the read-back sub-step below, return to Step 3 to re-review the modified plan, then re-enter Step 4 from sub-step 1 (so sub-step 1's task completion check on the new Step 3-(N_plan+1) item and sub-step 1.5's prose-language re-entry-coverage audit both run before re-presenting at sub-step 2). Trivial (N_plan=0) re-activation: if the task had been assessed Trivial (N_plan=N_code=0) so Step 3 was skipped, an Approach-level material change means the task is no longer trivially self-evident — re-run Step 2's Adjust N by difficulty against the rewritten plan to re-derive the difficulty assessment itself (it will no longer be Trivial) and the effective N_plan / N_code (re-running the independent-cap logic on both values, not a single value). Updating the difficulty assessment — not just the counts — is required because every downstream gate keys on the assessment, not on a bare count: Step 4's unreviewed-plan presentation and the references/plan-format.md Trivial conditional both read "the task was assessed Trivial"; leaving the stale Trivial label in place would keep announcing "sole review". Then re-mark the task rows for the re-derived difficulty: register Step 3-1 … Step 3-N_plan (and the Step 8-1 … Step 8-N_code rows) as fresh pending, clear the previously-skip-completed top-level Step 3: Plan Review / Step 8: Code Review rows back to pending. The difficulty-skip matrix is re-derived in the same pass: reset difficulty_skipped_steps = [] and re-run Adjust N, which recomputes which of Step 6: Tidy / Step 7.5: Rules Compliance Review the new tier skips and re-populates the ledger from scratch (no find-and-remove of individual records). 
Clear any previously-skip-completed Step 6 / Step 7.5 row back to pending when the higher tier no longer skips it — the same re-pending treatment applied to the Step 3 / Step 8 rows. Without this re-derivation the Step 3 entry-point guard would skip the new review item (it skips whenever N_plan=0), Step 4's completion check would loop on the unprocessed item, and a Step 6 / Step 7.5 row left stale-completed would silently skip a quality step the higher tier now requires.
- withdraw: explicit halt — "stop" / "cancel" / "abort" / "やめる" / "取り下げ". Exit the workflow with no further steps; do not proceed to implementation.
Read-back sub-step (mandatory before applying any swap-decisions / rewrite-approach interpretation): emit a one-line summary of the interpreted change in the resolved language (e.g. Decision 1 を Alternative に切り替え、Decisions 2 と 3 は Recommendation のまま保持します — このまま反映してよろしいですか？) and wait for the user to confirm before re-rendering. The read-back is the gate-of-origin's own resolution branch; do not nest a separate ExitPlanMode call inside it. If the user's confirmation response itself reads as another swap-decisions / rewrite-approach / withdraw instruction, treat the read-back as un-confirmed and re-classify under the four buckets above. The read-back catches multi-Decisions instructions with partial coverage and Approach-level instructions that masquerade as Decisions swaps — both are common failure modes that silently lose user-specified scope when interpreted without read-back.
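
The swap-decisions re-render and its read-back line can be sketched as plain data manipulation. This is a minimal illustration, not the workflow's actual implementation: the `Decision` shape, the function names, and the English read-back wording are all assumptions made for the example (the real read-back is emitted in the resolved language).

```python
from dataclasses import dataclass, replace


@dataclass(frozen=True)
class Decision:
    """One Decisions item in the plan (hypothetical shape, for illustration only)."""
    name: str
    recommendation: str
    alternative: str


def swap_decisions(decisions, names_to_swap):
    """Swap Recommendation/Alternative on the named items; leave the rest unchanged."""
    known = {d.name for d in decisions}
    unknown = [n for n in names_to_swap if n not in known]
    if unknown:
        # A name that matches no item is a coverage miss: surface it, do not guess.
        raise ValueError(f"unknown Decisions items: {unknown}")
    wanted = set(names_to_swap)
    return [
        replace(d, recommendation=d.alternative, alternative=d.recommendation)
        if d.name in wanted else d
        for d in decisions
    ]


def read_back_line(decisions, names_to_swap):
    """List every swapped and every kept item so the user can confirm coverage."""
    wanted = set(names_to_swap)
    swapped = [d.name for d in decisions if d.name in wanted]
    kept = [d.name for d in decisions if d.name not in wanted]
    return (f"Switching to Alternative: {', '.join(swapped) or 'none'}; "
            f"keeping Recommendation: {', '.join(kept) or 'none'}. Apply this?")
```

Naming both the swapped and the kept items on one line is what lets the user spot a multi-Decisions instruction that was only partially picked up.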

NOT approval (interrogative or non-committal: "look good?" / "どう？" / "これでいい？"): treat as ambiguous. Ask the user whether they meant to approve or to request a change, then re-classify the response under the four buckets above. Do not silently advance.

After the user accepts (accept bucket), begin implementation.

Step 5: Implement

Plan entry self-check — user-side manual action extraction: before issuing the first implementation tool call, scan the approved Plan body (Overview / Decisions / Design / Test plan / Risks / Unknowns) for embedded user-side manual actions — environment-prerequisite probes the user must run themselves, configuration values the user must add to a config file outside the agent's write surface, external authentications / API keys / hook installations / OS-level changes the user must perform manually, manual verification steps the user must execute against external systems (for general software development this includes "run <command> and confirm output", "edit ~/<config> to set X=Y", "log in to <external dashboard> and authorize Z"; for skill development this includes ~/.claude/settings.json edits, hook installation, external CLI installation, or workspace-level config the user must place outside the repo). When at least one such manual action is present, emit a short independent block at the top of Step 5 — separately from any other implementation prose — listing each manual action verbatim with the Plan section it came from, before proceeding with the first implementation tool call. The block ensures the user sees the manual action items distinctly rather than discovering them buried inside long-running plan execution. When no manual actions are present (purely agent-executable plan), skip this block. The block is informational — Step 5 continues without waiting for user input on the manual actions themselves (the user-side observation gate at the probe → real-implementation boundary handles cases where the workflow needs to wait).
Follow the plan and track progress with the Task tools (TaskUpdate). When the Design is an ordered, numbered list of implementation steps (per references/plan-format.md § Template), you MAY register each step as an implementation sub-task and execute them in order, marking each completed as it lands; this is recommended for long ordered plans and optional for short ones. Registering sub-tasks is consistent with Step 1's "Implementation sub-tasks in Step 5 are additions, not replacements" rule and does not change the Phase-boundary self-audit (which governs only top-level Step transitions). Apply custom_instructions throughout implementation.
Respect prior in-session edits: content the user explicitly removed earlier in this session (comments, guards, logs) must not reappear. Treat deletion as authoritative, not as a gap to fill. This discipline holds when applying plan steps, Step 6 tidy output, and Step 8 review fixes; the reviewer/tidy subagents see only the diff and cannot enforce it themselves.
Late-stage scaffolding self-audit: when implementation introduces a structural element that was not present in the Step 2 plan (a new sub-step, an additional enum value, a new branch arm, an additional call site that invokes the same callee at a new location, or a new recovery / fall-through path; for skill development this includes a new SKILL.md sub-step, an additional status enum in a return contract, a new error class, or a new disposition mapping row), re-apply the same Step 2 § Simplicity self-audit rigor to the newly introduced element before moving on. Sub-checks (i)–(iv) fire only when a new structural element is introduced; sub-check (v) fires unconditionally for every .md file edit in the diff:

- (i) sibling symmetry: when the new element parallels existing sibling elements, verify same fields / same disposition values / same error-class coverage.
- (ii) error-path symmetry: for any success path introduced, trace its corresponding failure path explicitly (counter increment vs. non-increment, success-only vs. failure-included).
- (iii) boundary-value coverage: for any predicate, threshold, or count introduced, trace the boundary cases (empty input, all-same-classification, mixed-classification) and verify the predicate truth value matches design intent.
- (iv) reference-site sweep: if the new element is referenced from prose elsewhere in the file, verify those references use stable phrase anchors (not raw sub-step numbers / branch letters).
- (v) Markdown block-element structural integrity: for each edited .md file (unconditional: applies regardless of whether a new structural element was introduced), scan newly added or changed content for adjacent block elements (list items, paragraphs, code fences, headings, blockquotes) that lack a separating blank line; when a gap is found, insert the missing blank line immediately (in the same Edit call) before moving on. Missing blank lines cause the following element to be parsed as a lazy continuation of the preceding block and rendered merged; the gap typically surfaces in skill-review as a mechanical edit (for skill development this includes SKILL.md, references/*.md, and README.md edits).

The reviewer / tidy subagents see only the diff and cannot enforce this self-audit, so it must run in the main thread at Step 5. Late-stage scaffolding correctness gaps that first surface at Step 8 Code Review iter 1 indicate this audit was skipped.
Final-pass literal-value full-repo grep: at Step 5 completion (after all planned edits are applied and before advancing to Step 6 Tidy), for each literal value the plan replaced or introduced — numeric constants (threshold values, version numbers, magic numbers), token strings (status enum values, config keys, identifiers), file path fragments, or any other literal whose semantics are tied to a specific value — grep the entire repository (not just the plan's enumerated sweep targets) for the old value and confirm zero hits. The Plan's Test plan typically enumerates known sweep sites, but narrative examples embedded in prose (illustrative numbers in descriptive text, story-style usage examples in SKILL.md / references/*.md / README files) routinely sit outside the enumerated list and silently retain the old value through mechanical search-and-replace passes. Multi-stage structure: (i) enumerated sites — the explicit list from the Plan's Test plan, verified one by one; (ii) final-pass full-repo grep — git ls-files | xargs grep -l <old-value> (or equivalent) for any residual hits outside the enumerated list, with each hit reviewed in context and either updated (if it carries the old value's semantics) or marked as out-of-scope (e.g. a different concept that coincidentally uses the same literal); (iii) alias and derived-form sweep — for rename and migration tasks, additionally grep for mechanically-derivable aliases and derived forms of the old value (abbreviated forms, synonyms, or alternate identifiers the codebase uses interchangeably to refer to the same concept); when the derived-form set can be enumerated upfront, list and grep each form before declaring the sweep complete; grepping only the exact old value misses same-concept usages expressed under an alternate spelling (for skill development this includes aliased import names, short-form identifiers referenced in SKILL.md narrative examples, and config-key abbreviations). 
The reviewer / tidy subagents see only the diff and cannot enforce the full-repo sweep, so it runs as a Step 5 completion gate (for skill development this includes literal numeric thresholds cited in references/*.md narrative examples, version strings in README usage snippets, and example values in compaction / extract-rules-style descriptive prose). Authoritative-tool cross-check for load-bearing enumeration claims: when the completeness of the enumeration is mechanically verifiable by a downstream authoritative tool — a compiler or type checker reporting all affected call sites for a changed type or interface, a language server returning all references for a renamed symbol (for skill development, when no compiler or language server is available, the authoritative check is a structured manifest audit: e.g. jq against marketplace.json to enumerate all skills array entries affected by a renamed skill, or a targeted Grep scoped to SKILL.md / plugin.json / marketplace.json for all hook-firing paths or all subagent dispatch routes that a configuration change affects — when no such structured verification is available, the grep-only pass is sufficient) — cross-check the grep results against that tool's output; if the tool reports additional sites that grep missed, treat those as unresolved hits and apply the same two-option disposition defined above for the full-repo grep pass (update if it carries the old value's semantics; mark out-of-scope if it coincidentally uses the same literal for a different concept) — and complete this resolution before presenting the enumeration as confirmed. Search filter prefix/anchor errors can silently drop matches with no error signal; authoritative-tool verification catches these missed cases. When both grep and an authoritative tool are available, treat the tool output as the oracle. Non-literal-replacement tasks skip this audit.
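
Stages (ii) and (iii) of the sweep above can be sketched in Python as follows. This is an illustrative equivalent of the git ls-files plus grep pipeline, assuming a UTF-8 text repository; `tracked_files` and `residual_hits` are hypothetical helper names, and the list of derived forms is something the caller enumerates by hand.

```python
import os
import subprocess
from pathlib import Path


def tracked_files(repo_root):
    """Return tracked file paths via `git ls-files -z` (NUL-separated output)."""
    out = subprocess.run(
        ["git", "ls-files", "-z"],
        cwd=repo_root, check=True, capture_output=True,
    ).stdout
    return [Path(repo_root) / os.fsdecode(p) for p in out.split(b"\0") if p]


def residual_hits(paths, old_forms):
    """Return (path, line_number, form) for every line still containing an old form.

    old_forms holds the exact old value plus any aliases or derived forms.
    """
    hits = []
    for path in paths:
        try:
            text = Path(path).read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # this sketch skips binary or unreadable files
        for lineno, line in enumerate(text.splitlines(), start=1):
            for form in old_forms:
                if form in line:
                    hits.append((str(path), lineno, form))
    return hits
```

A call such as `residual_hits(tracked_files("."), ["old_name", "oldName"])` returns candidates only. Each hit still needs the in-context review described above, because a match may be a different concept that happens to share the literal.
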
Pre-write path scope check (Write / Edit / new-file path safety): before every Write / Edit / similar file-creation tool call whose file_path argument does not match a path that already exists in git ls-files output (typically: new files generated by Step 5 / Plan rewrite / staging document creation / new test fixtures / new CHANGELOG entries — file paths that the tool will create rather than modify-in-place), run a two-stage path verification before issuing the tool call: (i) repo-root containment — verify the absolute resolved path sits under git rev-parse --show-toplevel (no ../ escape from the working directory, no absolute path leading outside the repo); (ii) prefix sanity — verify the path's leading directory matches an expected location for its content class (.claude/plans/ for plan documents, skills/<name>/ for skill content, src/ or tests/ or equivalent for code, .triage/ or tmp/ for staging, etc.). If either check fails, abort the tool call with a fail-loud diagnostic naming the resolved path and the expected prefix set, rather than silently creating the file. The allowed-tools permission grant alone does not prevent parent-directory landing (Write accepts any string file_path), so a procedural pre-check is the only structural defense against typo-induced orphaned files (for general software development this includes accidental migration / config / test-fixture writes landing one directory up; for skill development this includes .claude/plans/<slug>.md typos depositing files at ../<slug>.md, marketplace.json paired-bump operations writing to the wrong manifest, or staging documents landing outside .triage/ / .claude/). If a tool call has already created a file in the wrong location, instruct the user to delete it manually — the workflow's auto-mode classifier cannot reach files outside the project scope, so manual cleanup is the only path.
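
A minimal sketch of the two-stage check, assuming the repository root has already been obtained (for example from git rev-parse --show-toplevel) and that relative paths are interpreted against it. The `EXPECTED_PREFIXES` table and the function name are hypothetical; the prefixes are copied from the examples in the paragraph above and a real project would supply its own.

```python
from pathlib import Path

# Hypothetical content classes, using the example prefixes from the text above.
EXPECTED_PREFIXES = {
    "plan": (".claude/plans",),
    "skill": ("skills",),
    "code": ("src", "tests"),
    "staging": (".triage", "tmp"),
}


def check_new_file_path(file_path, repo_root, content_class):
    """Return the resolved path if both stages pass; raise ValueError otherwise."""
    root = Path(repo_root).resolve()
    # Joining with an absolute file_path discards root, so absolute paths
    # outside the repo are caught by stage (i) as well.
    resolved = (root / file_path).resolve()

    # Stage (i): repo-root containment (catches ../ escapes).
    if not resolved.is_relative_to(root):
        raise ValueError(f"{resolved} is outside the repo root {root}")

    # Stage (ii): prefix sanity for the content class.
    rel = resolved.relative_to(root)
    prefixes = EXPECTED_PREFIXES[content_class]
    if not any(rel.is_relative_to(prefix) for prefix in prefixes):
        raise ValueError(
            f"{resolved} is not under any expected prefix {prefixes} "
            f"for content class {content_class!r}"
        )
    return resolved
```

The error messages name both the resolved path and the expected prefix set, which is the fail-loud behaviour the check calls for. `Path.is_relative_to` requires Python 3.9 or later.
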
User-observable artifact protection gate at probe → real-implementation boundary: when the Plan explicitly stages an implementation as probe / intermediate-artifact → real-implementation replacement (e.g. "first emit a debug-instrumented version for user to observe, then replace with the production implementation", "scaffold a placeholder file the user will manually inspect, then overwrite with the final content", "log expected probe output as a verification step, then remove the logging"), do not advance to the real-implementation step until the user has signaled observation completion. The probe-output observation gate is the only user-side wait state permitted inside Step 5 — every other Step 5 sub-step proceeds autonomously per § No-Stall Principle. When the probe is committed to disk and the user has not yet acknowledged observation, hold the workflow at this boundary and emit a one-line wait prompt in the resolved language (e.g. Probe artifact deployed at <path> — please observe its output before the workflow replaces it with the final implementation. Reply when ready to proceed.). Resume the real-implementation step on any non-empty user reply. When no probe → real sequence is in the Plan (typical case — purely incremental implementation), this gate does not fire (for general software development this includes debug-log-instrumented scaffolds replaced by clean production versions, mock-data fixtures replaced by real-data fetches; for skill development this includes verbose-tracing skill versions replaced by streamlined final versions). The gate exists to prevent the probe artifact from being silently overwritten before the user has had a chance to inspect it — a failure mode the No-Stall Principle's autonomy guarantee otherwise creates.

AskUserQuestion option design (applies to the probe gate above and any future user-state-query call in this workflow): when the workflow uses AskUserQuestion (or any equivalent multi-option user-query tool) to query the user about a plan-derived state — probe-execution outcome, manual-verification result, environment-prerequisite check, or any equivalent state confirmation — the options list MUST include a meta-confusion branch alongside the result enumeration. Concretely, do not present only outcome categories (e.g. success / failure / skipped); also include an option phrased as "the procedure / expected outcome is not yet understood (please re-explain)" in the resolved language (e.g. language: ja: 手順 / 期待結果がまだ把握できていない（要再説明）; language: en: the procedure / expected outcome is not yet clear (please re-explain)). The meta-confusion branch absorbs the "I cannot answer the question as posed" state — without it, the user is forced into Other free-text and the workflow consumes an extra clarification turn re-explaining what was already in the Plan. General principle: user-state queries enumerate outcomes AND leave a fallback for the premise-not-conveyed case, never outcomes alone (for general software development this includes deployment-readiness queries, migration-completion confirmations, external-system-state checks; for skill development this includes probe-result queries inside this Step 5 gate, callee-execution-outcome confirmations, manual-config-applied verifications).
Derived-value claim deferral: when deliverable prose embeds a value derived from content that later phases can still change — a size claim about a generated artifact, an item or step count, or any other body-derived figure (for skill development this includes char-count claims about SKILL.md / references/*.md and step-count mentions in CHANGELOG entries or descriptive prose) — do not finalize that value during Step 5. Keep it as a clearly-marked provisional value (e.g. render the figure as <provisional — finalized at Step 10 entry> so the placeholder is grep-able at the application point) and compute + write the final figure exactly once at the last gate where the source content is settled — the plan-deferred bookkeeping application point at Step 10 entry (the deferred-bookkeeping paragraph at the top of references/interactive-commits.md, applied before its § Collect changes step collects the working tree), after Step 6 Tidy, Step 7.5 fixes, Step 8 review fixes, and any Step 9 hooks.on_complete working-tree modifications have all landed. Re-verifying and re-correcting the figure after every downstream phase that touches the body is the anti-pattern this item forbids — each chase is an avoidable rework turn. When interactive_commits: false (Step 10 is omitted and execution proceeds directly from Step 9 to Step 11 — see § Step 10: Interactive Commits), the Step 10 entry gate never occurs: finalize the figure at the same settledness point — immediately after Step 9 completes or is skipped, before proceeding to Step 11 — so the provisional marker never survives into the final tree.
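
The write-once finalization can be sketched as a single substitution over the settled text. This is a simplified illustration that assumes one derived figure per document and that every provisional marker is an angle-bracket token beginning with `<provisional`; the function names are hypothetical.

```python
import re

# Assumption: provisional figures are written as angle-bracket markers that
# start with "<provisional", so one pattern finds all of them.
PROVISIONAL = re.compile(r"<provisional[^>]*>")


def finalize_figure(text, final_value):
    """Replace every provisional marker with the final figure, exactly once.

    Returns the new text and the number of markers replaced.
    """
    # A function replacement avoids backslash-escape processing of the value.
    return PROVISIONAL.subn(lambda _match: str(final_value), text)


def surviving_markers(text):
    """Return any provisional markers still present (should be empty afterwards)."""
    return PROVISIONAL.findall(text)
```

Running `surviving_markers` on the final tree is a cheap way to confirm that no placeholder outlived the finalization point.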

Step 6: Tidy

Implementation often introduces unnecessary complexity that's easier to spot in a dedicated pass after the code works.

Difficulty exception (difficulty-skip matrix). When Step 2 marked Step 6: Tidy completed under the difficulty-skip matrix (Trivial or Simple tier — see Step 2's Adjust N by difficulty), the row is already completed: do not re-mark it in_progress; proceed directly to Step 7. The Phase-boundary self-audit (§ Step 1 registration mechanics) treats this pre-completed row as the intended skip exactly as it does the Trivial Step 3 / Step 8 skips, not an unrun-step bug.

The Step 6 cleanup callee is resolved per the Cleanup skill bullet in § Prerequisites (built-in simplify preferred, bundled tidy as fallback). The phase is named "Tidy" after that in-house fallback skill; when simplify is available it — not tidy — is the primary callee.

Cross-layer review handoff ledger. Step 6 (cleanup), Step 7.5 (rules-review), Step 8 (code review), and any review-class hooks.on_complete entries (an entry is review-class when it is a Skill(<name>) entry whose skill reviews or inspects the change and reports findings — judge semantically from the skill's name and purpose; plain shell-command entries are never review-class and receive no ledger) run sequentially against the same deliverable but share no state by default — without a handoff, the same structural concern is re-raised and re-judged independently by each layer, and a finding one layer deferred or applied only partially resurfaces later as scattered per-site fixes. From this step onward, keep a lightweight in-memory ledger of each review layer's dispositions: findings deferred (with the reason), findings applied (with the sites covered), and known leftover sites or residual concerns. Include the ledger as a short context item in each subsequent review layer's dispatch payload (the rules-review dispatch, whether Step 7's background launch or its Step 7.5 sequential fallback; the Step 8 review payload, where it complements that payload's same-layer continuation item; and review-class hooks.on_complete callees). When the ledger has no recorded dispositions yet (no prior layer deferred, applied, or left anything over), omit the ledger item from that payload entirely — do not render an empty placeholder. When a later layer re-surfaces a concern the ledger records as deferred or partially applied, resolve it once: sweep all remaining sibling sites in one pass when they are enumerable and within this task's scope; otherwise (sites outside this task's scope, or a sweep too large for this run) record the leftover explicitly in the plan's Risks — do not let each layer independently re-apply the same structural fix to a different subset of sites.
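
One possible shape for the ledger, sketched under the assumption that dispositions are recorded as simple tuples. The class name, field layout, and rendered wording are illustrative only; the text above specifies the three categories and the omit-when-empty rule, not a format.

```python
from dataclasses import dataclass, field


@dataclass
class ReviewLedger:
    """In-memory record of each review layer's dispositions (hypothetical shape)."""
    deferred: list = field(default_factory=list)   # (layer, finding, reason)
    applied: list = field(default_factory=list)    # (layer, finding, [sites covered])
    leftover: list = field(default_factory=list)   # (layer, site or residual concern)

    def is_empty(self):
        return not (self.deferred or self.applied or self.leftover)

    def payload_item(self):
        """Return a short context item, or None so the caller omits it entirely."""
        if self.is_empty():
            return None  # no empty placeholder is rendered
        lines = ["Cross-layer review handoff ledger:"]
        for layer, finding, reason in self.deferred:
            lines.append(f"- deferred by {layer}: {finding} (reason: {reason})")
        for layer, finding, sites in self.applied:
            lines.append(f"- applied by {layer}: {finding} (sites: {', '.join(sites)})")
        for layer, concern in self.leftover:
            lines.append(f"- leftover from {layer}: {concern}")
        return "\n".join(lines)
```

A dispatcher would append the returned string to the next layer's payload only when it is not None.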

Pre-dispatch rename-sweep self-audit: if the Implement diff (since <base-commit> recorded in Step 2) includes a term-rename operation — a search-and-replace across the project that swapped a step name, callee name, config key, identifier, or domain concept for a new one — sweep the changed-path SKILL.md / references/*.md / README prose for synonyms and derived forms of the rename target before dispatching the Step 6 cleanup skill (the resolved simplify or tidy), and fix any residue inline. General principle: mechanical search-and-replace leaves synonym / derived-form residue that the substitution alone cannot catch — gerund forms when a verb is renamed, nominalizations and related-noun forms when an action is renamed, conceptual paraphrases of the original term in surrounding description text when a step or concept is renamed (for skill development this includes renaming a procedural verb leaving its -ing form in description prose, renaming a step leaving the prior step-concept paraphrase in cross-section reference text, or renaming a callee leaving the old concept noun in doc-comment / SKILL.md narrative). Detect at this Step 6 so the Completion-time integrity check (Step 8 reviewer / hooks.on_complete) remains a backstop rather than the primary detection point. Non-rename diffs skip this audit.
Dispatch the cleanup skill (resolved per § Prerequisites' Cleanup skill bullet): review changed code for reuse, quality, and efficiency, then apply cleanup edits.
- Primary — Skill(simplify): invoke Skill(simplify). Its argument interface is unverified (built-in skill, no on-disk SKILL.md), so do not assume any tidy-specific field — pass no scope argument (Base ref / --base-commit; simplify auto-scopes to the changed working-tree code), and when custom_instructions is set, pass it only as a short best-effort natural-language hint (simplify may ignore it; do not name a Custom instructions field it may not expose). Omit it entirely when custom_instructions is unset or empty (per § Step 2's Sub-skill natural-language argument minimalism note).
- Fallback — Skill(tidy): Do not pass Base ref / --base-commit <sha> — tidy's default working-tree mode is the intended scope here (covers tracked + staged + untracked changes per tidy's § Invocation contract); passing Base ref would switch tidy to committed-history mode and silently drop untracked files from the cleanup scope, even though sibling Steps (Step 7's test_commands, Step 7.5's Skill(rules-review)) invoke their callees with --base-commit <sha>. This Base ref asymmetry rationale is scoped to the tidy path only — the simplify path above passes no scope argument regardless. Pass the workflow's custom_instructions config value through tidy's natural-language Custom instructions field (omit the field entirely when custom_instructions is unset or empty — do not render (none) / empty string / fabricated default). General principles: (i) when a caller-skill dispatch field is driven by an optional config key, state the absent-key behavior inline on the dispatch line rather than relying on cross-reference to the config-parse step; (ii) when a caller depends on a callee's default-mode behavior for scope correctness and sibling steps use a different argument convention, name the asymmetry on the dispatch line as load-bearing rather than implicit — the executor cannot rely on a default-by-omission when sibling steps create an extrapolation pull toward the explicit form.
Regardless of the outcome — whether the cleanup skill (simplify, or the tidy fallback) applied fixes, reported no actionable findings, or returned any other non-error result — mark Step 6: Tidy as completed and proceed to Step 7 automatically. Per the No-Stall Principle, do not wait for user input.
If the cleanup skill result is not observable (e.g. context compaction occurred during or immediately after the call): inspect git diff <base-commit>. If the diff contains changes clearly attributable to a cleanup pass, treat Step 6 as completed and proceed to Step 7. Otherwise (no cleanup-attributed changes visible, or ambiguous), re-execute the Step 6 cleanup skill once (the resolved simplify, or tidy if simplify is unavailable) — inspection-and-fix-class skills are idempotent — then proceed to Step 7.
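
The simplify-to-tidy resolution that Step 6 relies on (two attempts at simplify, a one-line note, then tidy with no user gate) can be sketched as below. `invoke_skill` and `notify` are hypothetical callables standing in for a Skill(<name>) call and for emitting a status line; the sketch assumes a failed call surfaces as an exception.

```python
def run_cleanup(invoke_skill, notify):
    """Return (skill_name, result) for the cleanup pass.

    invoke_skill(name) is assumed to raise when the skill call fails.
    """
    for _attempt in range(2):  # the first call plus one retry
        try:
            return "simplify", invoke_skill("simplify")
        except Exception:
            continue  # availability is judged by the call outcome only
    # No user gate here: tidy is treated as a functionally equivalent pass.
    notify("simplify unavailable, falling back to in-house tidy")
    return "tidy", invoke_skill("tidy")
```

Whatever non-error result comes back, the caller then marks Step 6 completed and moves on to Step 7.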

Step 7: Check / Test (max 3 retries)

Run check_commands in order (always run all)
- On failure, fix and retry (do not proceed to test execution)
- Pre-execution scope-narrowing: before running each check command, assess whether it is a repo-wide auto-fix tool — a command that writes to files across the repository regardless of which files are in the task scope (e.g. a project-wide formatter, linter with --fix / --write, or bulk document transformer). If the command is a repo-wide auto-fix tool and the working tree contains files changed outside the task-scope snapshot (unrelated existing changes), narrow the command's scope to the task-scope snapshot files before running (e.g. pass the snapshot file list as explicit path arguments if the tool supports it). If scope narrowing is not feasible given the tool's interface, stop and ask the user for direction before running the command — options: run the command accepting the full-width effect, skip the command, or provide an alternative scoped invocation. The Scope-drift guard below is the second safety net for cases where pre-execution assessment is not feasible.
- Scope-drift guard: before each command, record git diff --name-only <base-commit> as the task-scope snapshot (the file set scoped to this task at the start of Step 7). After the command, re-check — any file newly appearing outside that snapshot was written by the command (auto-fix/write behavior sweeping unrelated drift). If scope drift is detected, classify the out-of-scope changes before acting: if all of the following hold — (i) the out-of-scope diff is whitespace or comment changes only (no code-skeleton changes: no non-blank, non-comment lines added or removed), (ii) the total changed line count across all out-of-scope files is ≤ 5, and (iii) the changes are attributable to the formatter or linter that just ran (the command is a known formatter/linter, e.g. lint:fix, format, prettier, black) — then proceed automatically without a user-direction stop: emit a one-line note (e.g. Scope-drift note: <file>(s) received whitespace-only formatting from <command> — proceeding) and continue to the next command. Otherwise (non-trivial drift): warn the user (list both the in-scope files and the newly-appeared out-of-scope files), do not auto-revert / git checkout / delete the out-of-scope changes (leave the working tree as the command left it for user inspection), leave Step 7: Check / Test as in_progress, and wait for user direction. This is a step-internal stop directive — one of two allowed non-completing exits from the check_commands phase (the other being the pre-execution scope-narrowing infeasibility stop above) — and is consistent with the No-Stall Principle, which permits explicit step-defined stops
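
The scope-drift guard's snapshot comparison and its three-part triviality test can be sketched as follows. Several simplifications are assumptions of this sketch: condition (i) is read through its parenthetical (every added or removed line must be blank or a comment, so a re-indented code line counts as non-trivial), comments are recognized by a fixed prefix list, and the known-formatter set just repeats the examples from the bullet above.

```python
# Example formatter/linter names taken from the text; a real list is project-specific.
KNOWN_FORMATTERS = {"lint:fix", "format", "prettier", "black"}


def new_out_of_scope(snapshot_files, files_after):
    """Files that appear in the diff after the command but were not in the snapshot."""
    return sorted(set(files_after) - set(snapshot_files))


def is_trivial_drift(changed_lines, command_name, comment_prefixes=("#", "//")):
    """Apply conditions (i)-(iii) to the out-of-scope changes.

    changed_lines: contents of every added or removed line across all
    out-of-scope files, without the leading +/- diff markers.
    """
    only_blank_or_comment = all(
        not line.strip() or line.strip().startswith(comment_prefixes)
        for line in changed_lines
    )
    return (
        only_blank_or_comment                    # (i) no code-skeleton changes
        and len(changed_lines) <= 5              # (ii) total changed lines
        and command_name in KNOWN_FORMATTERS     # (iii) attributable to a formatter
    )
```

When `is_trivial_drift` returns False the workflow warns and waits. It never reverts the out-of-scope files itself.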

Two read-only analyses can be launched concurrently here: the Step 7.5 Skill(rules-review) (below) and the Step 8 code review (see the Concurrent code review launch paragraph that follows). Both only return findings — the main thread applies any fixes later (rules-review fixes in Step 7.5, reviewer fixes in Step 8) — so overlapping their analysis with the test phase is a pure wall-clock optimization. test_commands is never backgrounded: a backgrounded callee must have an inline fallback for the nested-Agent-unavailable case, which rules-review (its SKILL.md § 5) and the default reviewer ask-peer (its SKILL.md § Process 1) both have, but run-tests does not. This extends to long-running test entries: do not attempt to offload a Skill(<name>) test command to a one-off background subagent to avoid context accumulation — a subagent that receives a minutes-long command will typically background it internally and return an empty verdict; the main-thread synchronous Skill() invocation is the reliable dispatch path regardless of test duration.

Concurrent rules-review launch (per pass). After check_commands pass and before running test_commands, optionally launch the Step 7.5 Skill(rules-review) concurrently so its read-only analysis overlaps the test phase. A pass here is a Step 7 entry that a Step 7.5 sub-step 1 collect will follow: the initial Step 7 entry, and each full Step 7 → Step 7.5 re-entry triggered by Step 8's post-fix re-run (the "Always re-run Step 7 and Step 7.5" bullet). The Step 7-only re-run inside Step 7.5's fix flow is not a pass: the rules-review call that follows it is the fix flow's direct 2nd-cycle invocation, which has no collect branch — a launch there would be an orphan dispatch with no collector (overlapping that path stays out of scope). This paragraph's pass definition is the single definition the bullets below, the Concurrent code review launch paragraph, the Step 7.5 collect, the Step 8 sub-step 1 collect, and Step 8's "Always re-run Step 7 and Step 7.5" bullet refer to.

Initialize tracking (at every Step 7 entry — pass or not — unconditionally before the availability branch, so the unavailable / skip / re-run paths never read an uninitialized variable): rules_review_launched = false and rules_review_stale = false. Re-initializing on a non-pass entry (the Step 7-only re-run inside Step 7.5's fix flow) is harmless: that entry occurs only after the current pass's Step 7.5 sub-step 1 collect has already consumed the result. Lifecycle: this bullet is the only init site; the only set sites are the If available bullet (rules_review_launched) and the discard bullet below (rules_review_stale); the skip / unavailable paths set neither; Step 7.5 sub-step 1 is the only read site.
Availability detection: inspect the current tool list the same way rules-review SKILL.md § 5 detects Agent availability — do not make a speculative call. The capability gated here is specifically background dispatch (Agent with run_in_background), not a bare foreground Agent. Positive criterion: background dispatch is available when the Agent tool is exposed (top-level or via ToolSearch) AND the session offers an Agent run_in_background parameter or equivalent async-dispatch mechanism — the common case in a standard interactive session, so default to parallel. Treat it as unavailable only when one of these two signals holds (closed list — if neither holds, choose parallel): (a) Agent is absent (this also covers this skill running inside a non-recursing subagent, surfacing as nested Agent being unavailable); (b) Agent is exposed but the session offers no background/detached dispatch capability (e.g. an older Claude Code). This two-item list is the single definition of "unavailable" the If unavailable branches below refer to. Do not treat "unsure" as "unavailable": if a background-dispatch capability is present, choose parallel.
If available (and this Step 7 entry is a pass per the definition above — note that on a Trivial task Step 2 pre-completes Step 7.5 under the difficulty-skip matrix, so no Step 7.5 sub-step 1 collect follows and the entry is not a pass; skip the launch, same orphan-avoidance as the code-review launch's N_code=0 handling): dispatch a background subagent (Agent with run_in_background: true, subagent_type: general-purpose, plus model: <subagent_model> when the Step 2-resolved subagent_model is a model id — omit model when it is inherit) instructed to run Skill(rules-review) --base-commit <sha> — including the cross-layer review handoff ledger as a short context item (per § Step 6's Cross-layer review handoff ledger paragraph; omit when the ledger has no recorded dispositions) — and return the findings report verbatim, applying no edits (the main thread applies fixes in Step 7.5). On a successful dispatch set rules_review_launched = true. Emit a Progress Visibility status line in the same turn. Include in the dispatch payload a note that nested Agent is unavailable in this subagent context — execute Skill(rules-review) directly without probing for sub-subagent availability (the § 5 inline fallback applies automatically; no runtime discovery is needed). Collect the report from the background Agent's completion notification (no extra tool needed).
If unavailable (per the Availability detection criterion above): skip the launch (rules_review_launched stays false) — Step 7.5 invokes Skill(rules-review) sequentially as before (fully backward-compatible).
If test_commands then fail and you fix them (the diff changes): discard both background results — the rules-review result (set rules_review_stale = true; Step 7.5 sub-step 1 then falls back to a fresh sequential Skill(rules-review) dispatch) and the code-review result (set code_review_stale = true; see the next paragraph) — since the prior analyses are now stale. No-op on the no-launch path: setting rules_review_stale = true when rules_review_launched == false has no effect — Step 7.5 sub-step 1's collect branch is gated on rules_review_launched == true, so the unconditional set is safe on every path. Handle disposition: a stale (or never-collected) background rules-review result is simply ignored at the Step 7.5 collect point — no explicit cancellation of the background subagent is owed.
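The closed two-item "unavailable" list above can be sketched as a small predicate. This is a minimal illustration only: `ToolSnapshot` and `background_dispatch_available` are hypothetical names, and how the two booleans are derived from the real tool list is assumed rather than specified here.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ToolSnapshot:
    """Hypothetical summary of what the current tool list exposes."""
    agent_exposed: bool        # Agent visible top-level or via ToolSearch
    background_dispatch: bool  # run_in_background or an equivalent async mechanism


def background_dispatch_available(tools: ToolSnapshot) -> bool:
    """Return True unless one of the two closed-list signals holds."""
    agent_absent = not tools.agent_exposed                                 # signal (a)
    no_background = tools.agent_exposed and not tools.background_dispatch  # signal (b)
    # If neither signal holds, the default is parallel (background) dispatch.
    return not (agent_absent or no_background)
```

Keeping the check as a pure function of inspected state reflects the rule that detection must not rely on a speculative call.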

Concurrent code review launch (per pass). In the same window — after check_commands pass and before running test_commands — optionally launch the Step 8 reviewer (the reviewer skill resolved in Step 1, e.g. Skill(ask-peer)) as a background subagent so its read-only analysis also overlaps the test phase. This is a sibling of the rules-review launch above and shares its mechanics — availability detection, dispatch shape, staleness vocabulary, and the pass definition (the rules-review paragraph above is the single home of that definition).

Initialize tracking (at every Step 7 entry — pass or not — unconditionally before the availability branch, so the unavailable / skip / re-run paths never read an uninitialized variable): code_review_launched = false and code_review_stale = false. Re-initializing on a non-pass entry (the Step 7-only re-run inside Step 7.5's fix flow) is harmless: it rewinds the code_review_stale = true that Step 7.5 sub-step 3.a just set, but code_review_launched is also reset to false, and the Step 8 collect branch requires code_review_launched == true — both states route Step 8 sub-step 1 to a fresh sequential dispatch, so the routing is equivalent. Lifecycle: this bullet is the only init site; the only set sites are the If available bullet (code_review_launched) and the staleness set-sites named in the Staleness bullet below (code_review_stale); the skip / unavailable paths set neither; Step 8 sub-step 1 is the only read site. Because every continuation path to a next Step 8 iteration item passes through a Step 7 re-entry (and therefore through this re-initialization), each pass's launch is collected at most once (the Step 8 sub-step 1 collect bullet names which iteration collects).
Availability detection: use the rules-review launch's detection above verbatim — its positive criterion (default to parallel in the common interactive case), its two-item closed list defining "unavailable", and its "do not treat 'unsure' as 'unavailable'" directive all apply here unchanged. The gated capability is background dispatch (Agent with run_in_background), not a bare foreground Agent.
If available (and this Step 7 entry is a pass per the shared definition): launch only when a pending Step 8 iteration item remains to collect the result after this pass — on the initial pass this means N_code ≥ 1 (iteration 1 collects; when N_code = 0 / Trivial, Step 8 is skipped entirely and no iteration item exists); on a re-run pass it holds only when the fix-applying iteration k satisfies k < N_code (a re-run triggered from the final iteration k = N_code leaves no pending iteration item, so a launch there would be an orphan dispatch with no collector — skip it; same orphan-avoidance vocabulary as the rules-review paragraph's non-pass rationale). Dispatch a background subagent (Agent with run_in_background: true, subagent_type: general-purpose, plus model: <subagent_model> when the Step 2-resolved subagent_model is a model id — omit model when it is inherit) instructed to run Skill(<reviewer>) with the same payload Step 8 sub-step 1 would compose for the next pending iteration item — sub-step 1's review-payload definition (including its rubric-link resolution note and, on a re-run pass, the continuation item and the iteration-scope instruction) is the single parametric source; do not restate its list here. Omitting the continuation item on a re-run pass would hand the reviewer a context-free diff and re-surface already-rejected findings. The reviewer returns its report verbatim, applying no edits. On a successful dispatch set code_review_launched = true. Emit a Progress Visibility status line in the same turn. Include in the dispatch payload a note that nested Agent is unavailable in this subagent context — execute Skill(<reviewer>) directly without probing for sub-subagent availability (the reviewer's inline fallback applies automatically; no runtime discovery is needed). The result is collected at Step 8 sub-step 1, not here. 
The Step 7-only re-run inside Step 7.5's fix flow is not a pass and does not re-fire this launch; the rules-review paragraph's orphan rationale does not transfer here (the code-review collect point is Step 8, where a collector exists even for that path) — overlapping that path simply stays out of scope.
If unavailable (per the Availability detection criterion above): skip the launch (code_review_launched stays false) — Step 8 dispatches the reviewer sequentially as before (fully backward-compatible).
Staleness — discard owned by Step 8 sub-step 1: this background result is speculative. The discard decision lives at Step 8 sub-step 1 (it reads code_review_stale); this paragraph only names the set sites. code_review_stale is set true whenever an edit lands between this pass's launch and the Step 8 collect point that changes the diff the reviewer analyzed: (i) a test_commands failure fix during Step 7 (the discard bullet above), or (ii) any fix Step 7.5 applies (see Step 7.5). Both set-site descriptions are pass-independent and apply to every pass unchanged. The condition is broader than the rules-review launch's (whose collect point is Step 7.5, before Step 7.5's own fixes land); the code-review launch's collect point is Step 8, after Step 7.5's fixes land, so Step 7.5 fixes also count. No-op on the no-launch path: when code_review_launched == false (the launch was skipped or unavailable), setting code_review_stale has no effect — Step 8 sub-step 1's collect branch is already gated on code_review_launched == true, so the no-launch path dispatches the reviewer fresh regardless of the flag's value; the unconditional code_review_stale = true set-sites are therefore safe to execute on every path. Handle disposition: a stale (or never-collected) background result is simply ignored at the Step 8 collect point — no explicit cancellation of the background subagent is owed.
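The orphan-avoidance condition for the code-review launch can be sketched as follows. The function and parameter names are hypothetical; `fix_iteration` stands for the iteration k whose fix triggered a re-run pass and is `None` on the initial pass.

```python
from typing import Optional


def should_launch_code_review(
    is_pass: bool,
    dispatch_available: bool,
    n_code: int,
    fix_iteration: Optional[int] = None,
) -> bool:
    """Return True only when a later Step 8 iteration item exists to collect the result."""
    if not is_pass or not dispatch_available:
        return False
    if fix_iteration is None:
        # Initial pass: iteration 1 collects, so at least one iteration must exist.
        return n_code >= 1
    # Re-run pass: iteration k + 1 collects, which exists only when k < N_code.
    return fix_iteration < n_code
```

For example, with `n_code = 2` a re-run triggered from iteration 1 launches, while a re-run triggered from iteration 2 does not, because no collector remains.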

After launching (or skipping) both, run test_commands in the main thread per sub-step 2 below; the background rules-review and (when launched) the code review proceed concurrently.

Iterate over test_commands in order. For each entry (which must be of the form Skill(<name>)), invoke that skill with --base-commit <sha> (from Step 2) via $ARGUMENTS. Each invocation must return a structured summary with one of three statuses (SUCCESS / TEST_FAILED / EXECUTION_ERROR); a TEST_FAILED or EXECUTION_ERROR from any entry halts the loop immediately and triggers the retry path in sub-step 3 — subsequent entries do not run on the failing pass.
- Each test skill handles scope decision and test execution internally via subagent (when applicable)
- Returns structured summary: SUCCESS / TEST_FAILED / EXECUTION_ERROR
- Bulk-vs-split execution: when the change is cross-cutting (shared components, mirrored services, or parallel handlers) and the test suite includes long-duration categories (E2E, integration tests with external dependencies), prefer passing scoped or split arguments rather than requesting a single bulk run. A single command bundling long-running jobs makes intermediate progress opaque and failure recovery harder — scope-targeted execution lets each category succeed or fail independently.
- Shared-path re-run scope: when a fix touches a shared path — a utility, helper, or function invoked by multiple distinct test suites (for skill development this includes subagent dispatch shared forms, hook wiring, state-file processing, or any cross-suite path) — include all suites that exercise that path in the re-run scope, not just a representative suite. A green representative suite proves only the paths it exercises; when the changed code is on a shared path, every suite that routes through it is a potential regression surface. When running all affected suites is impractical, record the excluded suites explicitly in the Completion summary as uncovered risk rather than treating the representative re-run as sufficient verification.
- Pre-existing vs regression discrimination: before entering the retry path on TEST_FAILED / EXECUTION_ERROR, classify each reported failure as a regression (introduced by this run's changes) or pre-existing (already failing at <base-commit> from Step 2). Two paths: (i) if the invoked test skill's structured summary already classifies failures as pre-existing / regression (a recommended return-contract extension for any verification-class skill: lint, test runners, structural checkers, marketplace validators), trust that classification. (ii) Otherwise, re-run the same test skill against <base-commit>: stash the working changes (git stash --include-untracked), check out <base-commit> into a scratch worktree (git worktree add ../base-commit-check <base-commit>) or rely on the test skill's own --base-commit argument if it supports re-evaluating at that ref without working-tree manipulation; then compare the failures. Failures that reproduce at <base-commit> are pre-existing: record each as an informational warning in the summary (pre-existing failure: <skill> / <case> — out-of-scope for this PR), do not count it toward the 3-retry budget, and do not auto-fix it. Only failures that do not reproduce at <base-commit> are regressions; for those, proceed with the existing retry / fix path. General principle: regression-vs-pre-existing discrimination via base-commit comparison applies to any verification step that runs a checker against a working tree (lint, test, structural validator; for skill development this includes marketplace structure validation and plugin integrity checks, where docs and implementation can disconnect independently of the current change).
- EXECUTION_ERROR + pre-declared degraded procedure: when a test invocation returns EXECUTION_ERROR AND the approved Plan explicitly pre-declared a degraded procedure for this failure mode (e.g. a Risks entry naming the environmental constraint and a fallback verification path), apply the degraded procedure automatically — execute the fallback, emit a one-line note in the resolved language (e.g. Step 7: EXECUTION_ERROR — applying pre-declared degraded procedure: <procedure-summary>), and continue without consuming a retry. Pre-declared degraded procedures are user-approved accommodations for predictable environmental constraints; routing them through the retry-and-stop path contradicts the plan's prior approval and violates § No-Stall Principle. When no degraded procedure is pre-declared, treat EXECUTION_ERROR as before (trigger the existing retry / fix path).
After 3 retries, report to the user and stop.
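The halt-on-first-failure loop and the base-commit comparison described above can be sketched as two small helpers. This is an illustrative sketch under assumptions: `run_test_commands`, `split_failures`, and the `invoke` callback are hypothetical names, and failures are modelled as plain string identifiers.

```python
from typing import Callable, List, Optional, Set, Tuple


def run_test_commands(
    entries: List[str],
    invoke: Callable[[str], str],
) -> Tuple[List[str], Optional[str]]:
    """Invoke entries in order and stop at the first non-SUCCESS status.

    Returns (entries that ran, failing entry or None).
    """
    ran: List[str] = []
    for entry in entries:
        ran.append(entry)
        if invoke(entry) != "SUCCESS":
            return ran, entry  # later entries do not run on the failing pass
    return ran, None


def split_failures(
    current: Set[str],
    at_base_commit: Set[str],
) -> Tuple[Set[str], Set[str]]:
    """Partition current failures into (regressions, pre_existing)."""
    pre_existing = current & at_base_commit  # also fails at the base commit
    regressions = current - at_base_commit   # introduced by this run's changes
    return regressions, pre_existing
```

Only the `regressions` set would feed the retry budget; the `pre_existing` set would be reported as informational warnings.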

Coverage note (TypeScript multi-tsconfig): For projects with Project References or multiple tsconfig*.json files, a single tsc --noEmit may miss changed files that belong to other tsconfigs. --init auto-registers a per-tsconfig tsc -p <path> --noEmit in this case (see references/init-mode.md for detection rules). If coverage still looks incomplete, re-run --init or append the missing command manually.

GATE: Verify Steps 2-7 are completed (check task status via TaskList; if status is inconsistent, verify actual completion by reviewing work done). Mark Step 7.5 as in_progress unless Step 2 pre-completed it under the difficulty-skip matrix (Trivial tier) — in that case the row is already completed; do not re-mark it in_progress, skip straight to Step 8 (same already-completed-row handling as the Step 8 GATE's N_code=0 case). (If Step 7 launched a background rules-review, it may still be in flight — Step 7 is "complete" once the test phase passes; Step 7.5 sub-step 1 collects the rules-review result.)

Step 7.5: Rules Compliance Review

Dedicated rules compliance check, separate from code review (Step 8). This ensures rule enforcement gets focused attention rather than competing with correctness and design concerns.

Difficulty exception (difficulty-skip matrix). When Step 2 marked Step 7.5: Rules Compliance Review completed under the difficulty-skip matrix (Trivial tier only — Simple skips only Step 6, not Step 7.5; see Step 2's Adjust N by difficulty), the row is already completed: do not re-mark it in_progress; proceed directly to Step 8. The Phase-boundary self-audit (§ Step 1 registration mechanics) treats this pre-completed row as the intended skip exactly as it does the Trivial Step 3 / Step 8 skips, not an unrun-step bug.

Responsibility scope (so the same rule class is not double-reviewed across passes and no class slips through every pass):

Step 7.5 owns the mechanical walk of every matched .claude/rules/ rule against the diff — hard rules (explicit prohibitions, naming, reference form, import paths, placement, file structure) are evaluated strictly; intent-style rules (judgment-based principles, prose conventions) are evaluated best-effort with low-confidence markers per rules-review SKILL.md.
Step 6 Tidy covers reuse, prose quality, dead code, and redundancy; rule compliance is not its primary responsibility — if the Step 6 cleanup skill (Skill(simplify) or the Skill(tidy) fallback) surfaces rule findings as a side effect, treat them as bonus and do not extend its reviewer prompt to take on .claude/rules/ walks.
Step 8 Code Review covers correctness, edge cases, conventions / consistency lightly (a safety-net pass for files modified after Step 7.5), and simplicity / maintainability — the thorough rules check stays at Step 7.5.
Step 11 Update Rules owns the rule-doc-drift class: findings where the code under review is internally consistent with itself (and with the broader file's existing pattern across 3+ call sites per rules-review SKILL.md's drift detection criteria) but the rule document describes different behavior — i.e. the rule text has gone stale relative to the code. Step 7.5 surfaces this class via the reviewer's Classification: rule-doc-drift finding and does not apply a code fix; the disposition is to route the rule-text update to Step 11 (Skill(extract-rules)) rather than rewriting the code to match a stale rule. When rules-review returns a finding tagged Classification: rule-doc-drift, treat it as out-of-scope for Step 7.5's fix loop (no Skill(rules-review) re-run is required to clear it, since the code is the source of truth), record the routing intent so Step 11 picks it up, and continue.

When a rule violation is reported in both passes (Step 7.5 and Step 8), treat Step 7.5 as authoritative and skip the duplicate fix attempt in Step 8 to avoid double-counting in the iteration budget.

Obtain the rules-review report — collect the Step 7 background launch, or invoke directly. If Step 7's "Concurrent rules-review launch" dispatched a background rules-review this pass and it is still fresh (rules_review_launched == true and rules_review_stale == false), collect that background result now (it ran concurrently with the test phase). If the background subagent has not yet reported when you reach this point (the test phase finished first), wait for its completion notification before judging the report — this wait is a non-stalling return boundary (harness-tracked background work), not a user gate, and a not-yet-arrived notification must never be read as "no findings". If the collected background result is itself an error completion — the subagent died or returned something unusable as a rules-review findings report — treat it the same as not-launched and fall back to the fresh sequential dispatch below; this route only redirects (it does not mutate the launch/stale flags, so the lifecycle closed list is unchanged). Otherwise — rules_review_launched == false (background dispatch was unavailable, or the dispatch attempt did not succeed) or rules_review_stale == true (the background result was discarded after a test failure) — invoke Skill(rules-review) with --base-commit <sha> (base-commit recorded in Step 2) — and Model: <subagent_model> when subagent_model is a model id (omit when inherit), which rules-review applies to its internal per-category Agent dispatch — via $ARGUMENTS, including the cross-layer review handoff ledger as a short context item in the dispatch (per § Step 6's Cross-layer review handoff ledger paragraph; omit when the ledger has no recorded dispositions). Either way the report comes from the external rules-review skill: do not substitute an inline rules-walk based on perceived scope, change size, or any other self-judgment of the diff's complexity — small / "obvious" / single-file changes still go through the external skill. 
The skip-to-fallback path is documented in Prerequisites and fires only on objective skill unavailability (the Skill(rules-review) call itself fails after one retry), never on subjective judgment that an inline equivalent would suffice. The external skill enforces consistent coverage across runs; inline substitution silently degrades that coverage and the user has no visible signal that it happened.
Judge the result semantically: if the skill reports that there is nothing to act on — no actionable violations, no changed files, no applicable rules, no rule files found, or any other "nothing to report" outcome regardless of exact wording — mark Step 7.5: Rules Compliance Review as completed and proceed to Step 8 automatically. Per the No-Stall Principle, do not wait for user input and do not rely on exact-phrase matching; trust semantic judgment since the skill's phrasing may evolve across versions.
If violations are found:
a. Fix all reported violations. Applying these fixes changes the diff that the Step 8 code-review background launch (if any) analyzed, so set code_review_stale = true; Step 8 sub-step 1 then discards the now-stale background result and dispatches the reviewer fresh against the post-fix diff.
b. Re-run Step 7 (Check / Test) to ensure the fixes did not break anything (this sequential re-run does not re-fire Step 7's concurrent rules-review or code-review launches).
c. Re-run Skill(rules-review) with --base-commit <sha> for verification (2nd cycle). Apply the same semantic judgment as sub-step 2: if the re-run reports nothing actionable, mark Step 7.5: Rules Compliance Review as completed and proceed to Step 8 automatically (per the No-Stall Principle). When a 2nd-cycle verdict differs from the 1st on a specific location (a previously-flagged item now passes, or a previously-clean location is now flagged), record the reason inline in the user-facing Step 7.5 summary (1–2 lines per drifted location: which location, 1st-cycle verdict, 2nd-cycle verdict, why) before completing. Judgment drift between cycles is acceptable but must be explained; otherwise repeat-cycle stability cannot be assessed.
d. If violations still persist after the 2nd review cycle, present the remaining violations to the user for decision. Above the violations list, emit a summary preamble per references/plan-format.md § User-gate summary preamble. Render the violations following references/plan-format.md § Localization granularity in the resolved language. Wait for the user's response before marking completed. (This is one of the explicit user-gates enumerated in the No-Stall Principle.)

Mark Step 7.5: Rules Compliance Review as completed only after all violations are resolved or the user has decided on the remaining violations.
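The collect-or-dispatch decision in sub-step 1 reduces to a pure function of the two flags plus the error-completion signal. The sketch below is an assumption-laden illustration: `ReportSource` and `report_source` are hypothetical names, and `error_completion` stands for a background result that turned out to be unusable once collected.

```python
from enum import Enum


class ReportSource(Enum):
    COLLECT_BACKGROUND = "collect_background"
    DISPATCH_FRESH = "dispatch_fresh"


def report_source(
    launched: bool,
    stale: bool,
    error_completion: bool = False,
) -> ReportSource:
    """Decide where the review report comes from without mutating any flag."""
    if launched and not stale and not error_completion:
        return ReportSource.COLLECT_BACKGROUND
    # Not launched, stale, or an unusable background result: redirect only.
    return ReportSource.DISPATCH_FRESH
```

Because the function returns a route and never writes to the flags, it mirrors the statement that the error-completion route only redirects. The same shape would apply to the Step 8 collect with the code_review_launched and code_review_stale flags.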

GATE: Verify Steps 2-7.5 are completed (check task status via TaskList; if status is inconsistent, verify actual completion by reviewing work done). Mark Step 8 as in_progress only when N_code ≥ 1; if Step 2 set N_code=0 (Trivial), Step 8 is already completed — do not re-mark it in_progress, skip straight to Step 9. (If Step 7 launched a background code review, it may still be in flight — Step 8 sub-step 1 collects it.)

Step 8: Code Review

Code review catches bugs, convention violations, and design issues that tests alone miss — skipping it risks shipping preventable defects. Always run this step even when tests pass cleanly.

Difficulty exception (Trivial / N_code=0). When Step 2's difficulty assessment set N_code = 0 (a Trivial task — Trivial zeroes both N_plan and N_code), this entire step is skipped: its task rows (top-level Step 8: Code Review and every Step 8-x) were already marked completed by Step 2's Adjust N by difficulty. As with Step 3, this skip is gated on task difficulty; for any Simple / Moderate / Complex task (N_code ≥ 1) Step 8 always runs.

If N_code = 0 (Trivial), skip this step entirely — its rows are already completed, so do not re-mark them in_progress and proceed directly to Step 9 (Completion Hooks). The following in_progress marking and per-iteration processing apply only when N_code ≥ 1.

Mark Step 8: Code Review as in_progress. Process each pending iteration item (Step 8-1 through 8-N_code) in order:

Mark the iteration item as in_progress. Obtain this iteration's reviewer report — collect the Step 7 background launch when it is fresh, otherwise dispatch the reviewer skill resolved in Step 1 (e.g. Skill(ask-peer)):
- Collect the Step 7 background launch when fresh: if code_review_launched == true and code_review_stale == false, the Step 7 "Concurrent code review launch" ran the reviewer in a background subagent concurrently with this pass's test phase — collect that result now as this iteration's reviewer report. Each pass's launch is collected at most once — by the first iteration item processed after that pass (iteration 1 on the initial pass; iteration k+1 on a re-run pass triggered from iteration k; derivation at the Initialize tracking bullet of Step 7's "Concurrent code review launch" paragraph). If the background subagent has not yet reported, wait for its completion notification before judging it, per the same non-stalling wait-boundary rule as Step 7.5 sub-step 1's background collect (a not-yet-arrived notification must never be read as "No actionable findings"). If the collected background result is itself an error completion, apply the same error-completion route as Step 7.5 sub-step 1's background collect — treat it as not-launched and dispatch fresh per the next bullet (the route only redirects; it does not mutate the launch/stale flags).
- Otherwise — dispatch fresh: when code_review_launched == false (background dispatch unavailable, the launch was skipped for this pass, or the dispatch attempt did not succeed), or code_review_stale == true (the diff changed since this pass's launch so the background result is stale), or when redirected here by the collect bullet's error-completion route, call the reviewer skill (e.g. Skill(ask-peer)) to review the code changes now. subagent_model propagation (inline fresh-dispatch only): propagate subagent_model exactly as in Step 3 (Claude-family reviewers only). This applies only to this inline fresh-dispatch path — the Step 8 background-launch path already carries subagent_model via the Step 7 launch's Agent model, so the two paths never double-apply. Pre-dispatch dispatch-boundary reminder: Issue the Skill(<reviewer>) call in the same turn as any accompanying status prose — never produce a standalone status turn before the Skill() call, as that creates a stall point. Reading the reviewer's SKILL.md is preparation, not dispatch; the Skill() call is the dispatch. In both paths the collecting iteration is an ordinary iteration (it judges and applies findings per sub-steps 2–3 below); the collect path only substitutes the report's source. The reviewer report addresses the following — this list is the single parametric source for both paths (it is the fresh-dispatch request, and the same payload the Step 7 background launch bakes; the continuation item and the iteration-scope instruction below each apply per their own conditions):
- Include git diff <base-commit> (base-commit recorded in Step 2) to capture all changes since workflow start
- Thorough rules compliance has been verified in Step 7.5, but instruct reviewer to also flag any obvious .claude/rules/ violations as a safety net — especially for code modified after Step 7.5
- Request feedback organized into three categories (labels only — the full per-category rubric lives in references/review-categories.md § Code review categories; instruct the reviewer to read that section, resolving the link to a concrete readable path when composing the request — the reviewer lacks the skill-directory context): a. Correctness & edge cases b. Conventions & consistency c. Simplicity & maintainability
- If custom_instructions is configured, include the instructions text in the review request and have the reviewer verify compliance and report conflicts
- If a state file is active (executing a subtask from a decomposition), include the current subtask's scope in the reviewer request: list the subtask's title and description, then list what the other subtasks cover (to define out-of-scope). Instruct the reviewer that missing functionality belonging to other subtasks is not an actionable finding for this code review — only findings scoped to the current subtask qualify. Omit this when no state file is active (single-task execution has no defined out-of-scope boundary).
- Include the cross-layer review handoff ledger as a short context item (per § Step 6's Cross-layer review handoff ledger paragraph; omit when the ledger has no recorded dispositions); it complements the same-layer continuation item below.
- If a prior Step 8 iteration completed this run (a re-run pass): include the continuation item — the summary of fixes made and rejections with reasons from the completed iterations, including any class-level sweep record — so the reviewer builds on prior rounds rather than re-raising already-rejected findings (the latest git diff <base-commit> is already the first item above). Omit this item on the initial pass (no prior iteration).
- On a re-run pass, also include an iteration-scope instruction: the reviewer's primary verification scope is the changes applied since the prior iteration (identified via the continuation item's summary of fixes, located within the latest git diff <base-commit>) plus landing confirmation of the prior iteration's findings — the full-coverage pass (re-verifying every target file in the full git diff <base-commit> from scratch) belongs to the initial pass only. The reviewer must still escalate back to full re-verification when content outside that primary scope raises a new concern, so coverage is reordered, not reduced. Omit this item on the initial pass (the initial pass is the full-coverage pass).
- Reviewer should only report actionable findings. If none, explicitly state "No actionable findings"
Judge the reviewer's response semantically: if the reviewer reports nothing actionable — no actionable findings, no bugs / convention violations / design issues raised, or any other "nothing to report" outcome regardless of exact wording — mark this and remaining iteration items as completed (skip). Mark Step 8: Code Review as completed and proceed to Step 9 (Completion Hooks) automatically. Per the No-Stall Principle, do not wait for user input and do not rely on exact-phrase matching; trust semantic judgment since the reviewer skill's phrasing varies (especially Skill(ask-peer) and other free-form-prose reviewers whose verdicts are natural-language Markdown rather than a fixed token).
Otherwise: autonomously fix genuine issues or reject inapplicable points with reason — do not ask the user for judgment on individual review findings. Mark this iteration item as completed.
- Rejection self-question (severity-label override): before rejecting any finding solely because the reviewer labeled it Minor (or any other low-severity bucket), ask "if I rejected this and presented the resulting code to the user, would the user re-raise the same point themselves?" — judging by which areas the user has historically commented on (intent expression, reader-comprehension, placement consistency for test fixtures / helper functions / dependency locality, and other readability concerns where runtime correctness is unaffected but a reader's interpretation is). If the answer is yes or ambiguous, apply the fix instead of rejecting on the Minor label alone; reject on Minor only when you are confident the user would not surface the same point.
- Class-level extension audit (post-Critical/Major-fix): immediately after applying a fix for a Critical-severity finding, or a Major-severity finding whose fix addresses a structural pattern (external I/O boundary conditions, closed enum / form-set networks, shared helper / safety-rail callers, parallel route handlers — for skill development: subagent return-value schemas, shared handler fallback paths, mirrored form-set network audits), and before the modified-vs-rejected branches below, scan the rest of the diff for other instances of the same defect class — same operation, same broken assumption, same side-effect pattern (e.g. shared-resource-destroying API call sequences, direct processing of unverified input, race conditions). Reviewer feedback typically names one instance; the underlying class often spans the diff (cross-construct propagation, shared safety-rail callers, parallel route handlers, etc.). Apply the same fix direction to additional matches found here, then record the sweep outcome (e.g. class-level sweep for <defect-class>: N additional instances found and fixed or no additional instances found) in the summary passed to the next iteration so the next reviewer does not re-trigger the same audit on already-swept ground, then continue to the modified-vs-rejected branch.
- Prose-integrity self-check (post-fix): after applying a fix that edits prose adjacent to its target line (comments, docstrings, paragraph-level documentation; for skill development this includes SKILL.md and references/*.md content), re-read the surrounding paragraph as a single unit before continuing. Verify that no sentence is cut mid-word, that no logical connective is broken (connectives such as however / therefore / because / but still anchor real clauses), and that the paragraph's overall logic still holds. Mechanical fix patches see only the line-level diff and routinely leave the surrounding prose semantically broken in ways the next iteration's reviewer flags as a Major finding, costing an extra iteration.
- Natural-language quality self-check (post-fix): when a fix adds new natural-language content that mechanical lint / test cannot verify (code comments, config-file annotations, error messages, UI copy, documentation fragments — for skill development this includes SKILL.md / references/*.md prose additions, frontmatter description text, log messages), re-read each added fragment as a standalone unit in the resolved language. Judge it on four axes: concise (no padding or runaway sentences), phrasing natural for the target reader, vocabulary consistent with surrounding text, register and sentence structure not awkward. Revise any fragment that fails. This self-check is the only gate before natural-language content reaches the user-visible commit gate — Step 7 (check_commands / test_commands) and Step 7.5 (rules-review) cannot evaluate natural-language quality.
- If code was modified: re-run Step 7 and Step 7.5 (with same base-commit from Step 2) — this full re-entry is a new pass (per Step 7's "Concurrent rules-review launch" pass definition): both the rules-review launch and the code-review launch re-fire (the code-review launch bakes its payload per sub-step 1's definition, continuation item and iteration-scope instruction included, and only when a pending iteration item remains — see Step 7's "Concurrent code review launch") — then continue to the next pending iteration item (back to step 1). Code fixes routinely introduce fresh bugs, tighten one place while loosening another, or miss a caller the author didn't know about — the next review round is how those leaks get caught. Always re-run Step 7 and Step 7.5 — no exceptions. Do not short-circuit on any rationalization: not on confidence in the fix, not because the diff is small, not because the modified files appear out of scope for the configured check_commands / test_commands (e.g. edits land entirely under a local-skill directory or a docs-only path), not because re-running "would be a no-op". If a re-run is genuinely a no-op, the no-op outcome is the audit trail; skipping the re-run removes the trail. The only permissible skip is when no code was modified in this iteration (handled by the next bullet).
- If all points were rejected (no modifications): mark remaining iteration items as completed (skip: there is nothing new for the next reviewer to look at). Continue to the next pending iteration item; the next pass's reviewer dispatch (the Step 7 background bake and the fresh-dispatch path alike) composes its payload per sub-step 1's definition, continuation item and iteration-scope instruction included (not restated here).
Return-point no-stall reminder: At each iteration boundary (regardless of reviewer outcome — findings reported, "No actionable findings", any non-error result), the next action — the next iteration's reviewer dispatch when more iteration items remain, or the Step 9 (Completion Hooks) transition when this was the last iteration or "No actionable findings" was returned, or the Step 7 / Step 7.5 re-run when code was modified — must be issued in the next tool call. Do not insert an interstitial summary or acknowledgment turn between iterations; the abstract enumeration in § No-Stall Principle is intentionally duplicated here so the rule fires at the decision moment.
If all N_code iteration items are completed and actionable feedback still remains, present the unresolved points to the user for a decision. Above the unresolved points, emit a summary preamble per references/plan-format.md § User-gate summary preamble. Render the findings following references/plan-format.md § Localization granularity in the resolved language.

Mark Step 8: Code Review as completed.

Step 9: Completion Hooks

Skip this step if hooks.on_complete is not configured. Mark Step 9: Completion Hooks as in_progress.

Task-derived-change gate: before executing any entry, check whether the tracked diff since <base-commit> (recorded in Step 2) contains changes produced by this task. When it does not — the tracked diff is empty or every changed path in it is pre-existing work unrelated to this task, and git status --porcelain=v1 --untracked-files=all shows no task-derived untracked files (gitignored paths never appear in that output, so the typical case — the task's only deliverables living under a gitignored directory — still skips) — skip the whole hooks.on_complete list, mark this step completed, and emit one line in the Completion summary naming the skip reason (e.g. hooks.on_complete skipped: no task-derived changes), then proceed to Step 10 (or directly to Step 11 when interactive_commits: false — per § Step 10). When an unrelated pre-existing diff exists, also add a warning line surfacing those paths (e.g. hooks.on_complete skip warning: pre-existing unrelated diff in <path>, <path>) so the user can notice unintended pre-run changes. Review-class hooks dispatched against an unrelated diff bind their findings to content the task never touched — a misleading record rather than a safety net; skipping the non-review entries along with them is likewise intended — with no task-derived changes there is no task output for any hook entry to act on. On any doubt about whether a changed path is task-derived, run the hooks as usual (the gate skips only when the absence of task-derived changes (tracked or untracked) is clear).

Execute each entry in hooks.on_complete in order:
- Skill(<name>) pattern: invoke the skill — for review-class entries, include the cross-layer review handoff ledger as a short context item (per § Step 6's Cross-layer review handoff ledger paragraph, which also defines review-class; omit when the ledger has no recorded dispositions)
- Other strings: execute as a Bash command
If a hook fails, report the error but continue executing the remaining hooks. Include each failure as a warning in the Completion summary.
After all hooks complete (or are skipped), mark Step 9: Completion Hooks as completed and proceed to Step 10.
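
The dispatch rule above (Skill(<name>) entries go to the skill runner, everything else runs as a shell command, and a failure never stops the remaining hooks) can be sketched as follows. This is a minimal illustration, not the workflow's implementation: `run_on_complete_hooks`, `invoke_skill`, and `run_shell` are hypothetical names, and the exact shape of the Skill(...) pattern is an assumption.

```python
import re
import subprocess

# Assumed shape of a skill entry: the whole string is "Skill(<name>)".
SKILL_PATTERN = re.compile(r"^Skill\((?P<name>[^()]+)\)$")


def run_on_complete_hooks(entries, invoke_skill, run_shell=None):
    """Run hook entries in order and return one warning per failed entry.

    invoke_skill is a hypothetical callback that invokes a skill by name.
    run_shell defaults to running the string through the shell.
    """
    if run_shell is None:
        def run_shell(command):
            subprocess.run(command, shell=True, check=True)

    warnings = []
    for entry in entries:
        match = SKILL_PATTERN.match(entry.strip())
        try:
            if match:
                invoke_skill(match.group("name"))
            else:
                run_shell(entry)
        except Exception as exc:
            # A failed hook is reported but does not stop the remaining hooks.
            warnings.append(f"hook failed: {entry}: {exc}")
    return warnings
```

The returned list is what the Completion summary would surface as warnings.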

GATE: Verify Steps 2-9 are completed (check task status via TaskList; if status is inconsistent, verify actual completion by reviewing work done). Mark Step 10 as in_progress.

Step 10: Interactive Commits

After hooks.on_complete (which may itself modify the working tree, e.g. via auto-formatter or apply-edit hook entries), group the working-tree changes into commits and iterate with the user one commit at a time. Step 10 runs only when interactive_commits: true — Step 1's task registration omits the row otherwise and execution proceeds directly from Step 9 to Step 11. The git push is never performed by this step (or any other step): pushing commits to a remote is the user's responsibility.

On entry to Step 10, initialize landed_count = 0 before running the procedure — so the value is well-defined for the Completion section even when the empty-output skip path in references/interactive-commits.md § Collect changes fires before its § Per-commit loop ever starts.

Read references/interactive-commits.md and follow the procedure from top to bottom — it is the single canonical home for Step 10's procedure body. The Approval token closed list and Localized summary tokens below stay defined in this file and are referenced from both that procedure and other Steps.

Approval token closed list (per § No-Stall Principle's "do not rely on exact-phrase matching" rule). The example phrases below are illustrative, not literal discriminators — categorize each user response into one of the four buckets via semantic judgment. When presenting an approval gate, include at least one short-form token from the accept bucket (e.g., "OK", "LGTM", "next") so users know brief responses are valid.

accept: explicit affirmative — "OK" / "approve" / "next" / "LGTM" / "コミットして" / "進めて" / "いいよ" or any semantic equivalent
adjust: specific revision request — "subject を ... に" / "this file should be in commit 2" / "split this commit" / any other concrete change demand
cancel / stop: explicit halt — "stop" / "abort" / "やめる" / "中断"
NOT approval: interrogative or non-committal — "look good?" / "どう？" / "これでいい？" / "OK ？". Treat as adjust and re-present (do not silently advance)

Localized summary tokens (per references/plan-format.md § Localization granularity). These tokens are defined here as the single source of truth — § Completion below references the same paired form rather than re-rendering it:

language: ja: Step 10 部分完了: <N>/<total> コミット適用済み
language: en: Step 10 partial completion: <N>/<total> commits landed

§ Completion below emits the localized token whenever Step 10 ended via Mid-loop cancel (see references/interactive-commits.md § Mid-loop cancel). On a normal completion path (every commit landed, or the Mid-loop adjust un-landed-drops-to-zero / merge-absorbs-into-landed branches — see references/interactive-commits.md § Mid-loop adjust — closed-list branches), no partial-state line is needed.

Step 11: Update Rules

Skill(extract-rules) with --from-conversation — Skip if already run this session: if (a) any entry in hooks.on_complete (as resolved in Step 1) contains the string extract-rules (direct invocation), OR (b) Step 9 executed at least one hook and the output produced by Step 9's hook invocations (visible in this session's context) contains evidence that extract-rules --from-conversation ran this session (sufficient signal: output contains staged_count or promoted_count), skip this sub-step — extract-rules --from-conversation has already run this session. Running --from-conversation twice against the same session causes the staged-promotion mechanism (first-observation → second-observation escalation) to miscount one session as two independent observations, prematurely promoting staging candidates to confirmed rules.
Skill(extract-rules) with --update — Skip if --from-conversation ran this session: if extract-rules --from-conversation ran at any point this session — whether via sub-step 1's skip condition or because sub-step 1 itself ran in this Step 11 execution — skip this sub-step. Running --update immediately after --from-conversation risks prematurely promoting staging candidates that were just created before they accumulate a second observation. Trigger (when not skipped): significant structural/pattern changes to application code occurred — new frameworks, libraries, architectural patterns, or API conventions introduced in the diff; prose-only changes to SKILL.md, agent definitions, references, or rule files do not qualify. A dependency major-version bump alone (no implementation code changes in the diff) does not trigger --update; the major-bump signal (detected via git diff <base-commit> of the package manifest — the same signal used in the Step 2 difficulty assessment) instead triggers the extract-rules Update Mode operational note: surface it to the user as a reminder to run --update after the session and to manually review .examples.md samples that may have gone stale after the bump.
Char-count compaction gate:

Skip condition: If compact_rules is not true (i.e. the default false, or any non-boolean value that fell back to false), skip this entire sub-step — do not invoke Skill(extract-rules) --compact, do not open the Step 11 compaction approval gate, and proceed directly to sub-step 4 (variable initialization is not skipped — it is governed by the State-variable contract below, which covers the skipped case). Emit a one-line informational note in the resolved language so the user has a visible signal that compaction is intentionally not running:
- language: ja: Step 11 sub-step 3（圧縮）を skip しました — `compact_rules: true` が設定されていません（実験的機能 / デフォルト無効）
- language: en: Step 11 sub-step 3 (compaction) skipped — `compact_rules: true` is not set (experimental feature / disabled by default)
State-variable contract (cross-step declaration — § Completion reads both variables; the full 4-point lifecycle is specified in references/update-rules.md § Char-count compaction gate): at sub-step 3 entry, initialize compaction_applied_count = 0 and below_threshold_failed_files = []. When the skip condition above fired, both variables simply stay at these initial values (no advance ever runs), so § Completion's reads are well-defined and its compaction reminder is omitted.

When not skipped (compact_rules: true): read references/update-rules.md and follow § Char-count compaction gate from top to bottom — it is the single canonical home for this sub-step's procedure body, including the Step 11 compaction approval gate (USER APPROVAL GATE).
If extract-rules is unavailable, skip this step and inform the user.
After the applicable invocations above return, or after the step was skipped because extract-rules is unavailable — regardless of whether new rules were added or the report indicates nothing changed — mark Step 11: Update Rules as completed and proceed automatically. Per the No-Stall Principle, do not wait for user input.
If extract-rules wrote any changes to .claude/rules/ during sub-steps 1, 2, or 3, record the count so § Completion can surface the manual-commit reminder. The compaction-specific count (file-unit compaction_applied_count) is rendered separately by § Completion's "Step 11 compaction reminder" — see § Completion below
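
The skip condition for the --from-conversation sub-step can be read as a two-branch predicate. The sketch below is an assumption-laden illustration: the function name and its three parameters are hypothetical, and in the real workflow the "evidence" check is a judgment over the session context rather than a substring test.

```python
def from_conversation_already_ran(on_complete_entries, step9_ran_hook, step9_output):
    """Return True when extract-rules --from-conversation should be skipped.

    on_complete_entries: hooks.on_complete as resolved in Step 1.
    step9_ran_hook: whether Step 9 executed at least one hook.
    step9_output: text produced by Step 9's hook invocations.
    """
    # (a) Direct invocation: some hook entry names extract-rules.
    if any("extract-rules" in entry for entry in on_complete_entries):
        return True
    # (b) Indirect evidence: hook output carries the counters that
    # extract-rules --from-conversation reports.
    if step9_ran_hook and any(
        token in step9_output for token in ("staged_count", "promoted_count")
    ):
        return True
    return False
```

Branch (b) is gated on Step 9 having run a hook, so stale text from earlier in the session does not trigger a skip by itself.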

Step 11.5: Self-Retrospective

Emit a sanitized improvement signal for the dev-workflow-bundle skills (dev-workflow, ask-peer, extract-rules, rules-review) to a user-configured destination. Raw conversation jsonl stays in-session; only abstracted, project-agnostic text leaves.

Skip this step if self_retrospective.feedback is unset/invalid (Step 1 did not register the row). Otherwise read references/self-retrospective.md and follow the procedure from top to bottom — the difficulty assessment does not gate this step. Thread the Step 2-resolved subagent_model into the procedure the same way the resolved language is passed through — §2.1 uses it to set the scan subagent's Agent model (omitted when inherit).

Completion

Derived staging artifact cleanup: before reporting summary, delete any per-agent staging documents under .claude/plans/ that dispatched review subagents generated this run (files matching <slug>-agent-*.md, where <slug> is the plan slug established in Step 1). These files are excluded from commit scope but accumulate as untracked noise in the working tree. Use rm -f .claude/plans/<slug>-agent-*.md (-f suppresses the error when no files match). Do not delete the main plan document (<slug>.md) or any decomposition state file — those are canonical workflow artifacts that Step 1.5 / --resume depend on.
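
For environments where a shell glob is inconvenient (some shells raise an error on an unmatched glob before rm runs), the same cleanup can be sketched in Python. This is an illustrative equivalent, not part of the workflow: `remove_agent_staging_docs` is a hypothetical helper, and it assumes the slug contains no glob metacharacters.

```python
from pathlib import Path


def remove_agent_staging_docs(plans_dir, slug):
    """Delete <slug>-agent-*.md under plans_dir and return the removed names.

    The main plan document (<slug>.md) never matches the pattern, so it is
    left in place. No matching files simply yields an empty list.
    """
    removed = []
    for path in sorted(Path(plans_dir).glob(f"{slug}-agent-*.md")):
        path.unlink()
        removed.append(path.name)
    return removed
```

Calling it a second time returns an empty list, which mirrors the no-error behavior of rm -f.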

Report summary: tasks completed, files modified, test results, review outcomes, rules updated. Output in the resolved language following references/plan-format.md § Localization granularity.

Difficulty-skip reminder (per references/plan-format.md § Localization granularity): when difficulty_skipped_steps (initialized at Step 2 entry, populated by Step 2's Adjust N by difficulty) is non-empty, surface a line in the resolved language naming the steps the difficulty-skip matrix skipped, so the skip is never silent. Render the recorded steps with their tier; the example below pairs the two language values:

language: ja: 難易度判定（<tier> tier）により <steps> を skip しました — 例: 難易度判定（Trivial tier）により Step 6 Tidy / Step 7.5 Rules Compliance Review を skip しました
language: en: Skipped <steps> per the difficulty-skip matrix (<tier> tier) — e.g. Skipped Step 6 Tidy / Step 7.5 Rules Compliance Review per the difficulty-skip matrix (Trivial tier)

The reminder is omitted when difficulty_skipped_steps is empty (Moderate / Complex tasks, or -i-skipped Adjust N runs — see the Step 2-entry init). The step names (Step 6 Tidy / Step 7.5 Rules Compliance Review) stay verbatim regardless of language.

Step 10 partial-state line: if Step 10 ended via its Mid-loop cancel branch (see references/interactive-commits.md § Mid-loop cancel), emit the localized partial-completion token defined at § Step 10's "Localized summary tokens" paragraph. On a normal completion path, omit this line.

Step 11 rule-update reminder (per references/plan-format.md § Localization granularity): if Skill(extract-rules) wrote any changes to .claude/rules/ during Step 11, surface a manual-commit reminder in the resolved language:

language: ja: extract-rules が `.claude/rules/` に <N> 件の変更を加えました — PR を開く前に手動で commit してください
language: en: extract-rules made <N> changes to `.claude/rules/` — please commit manually before opening a PR

The reminder is omitted when the rule-change count is zero.

Step 11 compaction reminder (per references/plan-format.md § Localization granularity): when compaction_applied_count > 0 (the Step 11 sub-step 3 char-count compaction gate landed user-accepted edits), surface a separate manual-commit reminder in the resolved language (rendered in file-unit count, distinct from the rule-update reminder above which counts entry-level writes):

language: ja: Step 11 で <N> 件のルールファイルを圧縮しました — PR を開く前に手動で commit してください
language: en: Step 11 compacted <N> rule files — please commit manually before opening a PR

When below_threshold_failed_files is non-empty, additionally surface a follow-up reminder naming the files that remain over threshold. <files> always renders at the sentence tail so the block-level list never appears mid-sentence:

language: ja: <M> 件のファイルが閾値を超えています。手動で再度 `Skill(extract-rules) --compact` を実行するか、当該ファイルを直接編集してください: followed by <files> on the next line
language: en: <M> files still exceed the threshold. Re-run `Skill(extract-rules) --compact` manually or edit the files directly: followed by <files> on the next line

Render <files> as one path per line — verbatim from files_processed[].path (repo-root-relative, e.g. .claude/rules/project.rules.local.md; never rewritten to user-absolute /Users/... form) — each prefixed with - (hyphen + space, no leading indent) directly below the reminder sentence as a top-level markdown bullet list. This applies for any M ≥ 1 — single-element lists render as a one-bullet list, not inline, so the layout is identical across runs and the trailing prose clause never floats after the bullet list.

The compaction reminder is omitted when compaction_applied_count == 0 AND below_threshold_failed_files is empty.
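
The follow-up reminder layout (sentence first, then one top-level bullet per path, even for a single file) can be sketched as below. The function name and the template dictionary are hypothetical; the sentence text is copied from the paired tokens above, and paths are emitted exactly as given.

```python
# Sentence templates copied from the paired language tokens above.
OVER_THRESHOLD_TEMPLATES = {
    "ja": "{count} 件のファイルが閾値を超えています。手動で再度 `Skill(extract-rules) --compact` を実行するか、当該ファイルを直接編集してください:",
    "en": "{count} files still exceed the threshold. Re-run `Skill(extract-rules) --compact` manually or edit the files directly:",
}


def render_over_threshold_reminder(failed_files, language="en"):
    """Render the reminder sentence followed by a bullet list of paths.

    Returns an empty string when no file remains over threshold.
    """
    if not failed_files:
        return ""
    sentence = OVER_THRESHOLD_TEMPLATES[language].format(count=len(failed_files))
    # One path per line, "- " prefix, no leading indent, paths left verbatim.
    bullets = [f"- {path}" for path in failed_files]
    return "\n".join([sentence, *bullets])
```

Because the bullets always follow the sentence, a one-element list and a ten-element list produce the same layout.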

If this run was executing a subtask from a decomposition state file, also do the following (all reads/writes target the canonical state-file path recorded in Step 1.5):

Execution-time deferral/exclusion gate: before marking the subtask as completed, check whether any in-scope work items were excluded, deferred, or discovered as unassigned during implementation or testing. Items recorded only in prose (Risks entries, inline notes) are invisible to --resume and will be silently skipped — each such item must be promoted to a tracked subtask entry in the state file before completion is declared. For each uncovered item, get user approval on one of: (a) add as a new pending subtask with a depends_on link if sequencing matters, (b) fold into an existing pending subtask's scope, or (c) explicitly accept as permanently out of parent-task scope. The completion report must confirm that no goal-required items remain in untracked prose form.

Mark the current subtask's status as completed in the canonical state file and write back
Ask the user for an optional PR URL for this subtask. On a non-empty answer, set the subtask's pr field and write back; otherwise leave it null
Refresh the parent-task progress row's <done>/<total> count
Find the next runnable subtask (smallest-id pending with all depends_on completed)
If a next subtask exists: branch on whether Step 10 actually landed any commits this run (use the landed_count from Step 10 — taking the config flag alone would mis-route the case where interactive_commits: true met the Step 10 skip conditions and exited at zero commits):
- landed_count > 0: tell the user the current subtask's changes have already been committed by Step 10 — open a PR for those commits, then start a new session with /dev-workflow --resume <slug> once the PR is up
- landed_count == 0 (either because interactive_commits: false or because Step 10 was skipped): tell the user to commit the current subtask's changes and open a PR before resuming, then start a new session with /dev-workflow --resume <slug>. Explain why this matters: the next run records a fresh base-commit from HEAD, so uncommitted changes would leak into the next subtask's diff.
In both branches, if Step 11 also wrote rule updates (i.e., the Step 11 rule-update reminder above fired with <N> > 0), tell the user to commit those .claude/rules/ writes manually before resuming; otherwise they leak into the next subtask's diff the same way uncommitted feature changes would. The "no push" invariant for both branches is stated at § Step 10's preamble.
If no next subtask exists (all subtasks completed): delete the canonical state file via rm -f <canonical-path>, remove the parent-task progress row, and include every subtask's title and recorded pr (if any) in the parent-task completion summary
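
The "next runnable subtask" rule (smallest-id pending subtask whose depends_on entries are all completed) can be sketched as below. The state-file schema is not shown in this section, so the dictionary keys `id`, `status`, and `depends_on` are assumptions, as is the function name.

```python
def next_runnable_subtask(subtasks):
    """Return the smallest-id pending subtask whose dependencies are all
    completed, or None when nothing is runnable.

    Each subtask is assumed to be a dict with "id", "status", and an
    optional "depends_on" list of subtask ids.
    """
    completed = {s["id"] for s in subtasks if s["status"] == "completed"}
    runnable = [
        s
        for s in subtasks
        if s["status"] == "pending"
        and all(dep in completed for dep in s.get("depends_on", []))
    ]
    return min(runnable, key=lambda s: s["id"], default=None)
```

A None result corresponds to the "no next subtask" branch only when every subtask is completed; a None result with pending subtasks left would indicate blocked dependencies, which this sketch does not distinguish.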

Configuration

Settings files (YAML frontmatter only, merged across layers):

~/.claude/dev-workflow.local.md — User global defaults (lowest priority)
.claude/dev-workflow.md — Project shared settings (git tracked, team-shared)
.claude/dev-workflow.local.md — Personal overrides (gitignored, highest priority)

Scalar (reviewer, review_iterations, subagent_model, task_decomposition, interactive_commits, compact_rules, custom_instructions, language): higher layer wins (replaces) when the key is present; a key absent from a higher layer inherits from lower layers (see the inherit note below). When review_iterations carries a map value ({plan, code}) it is still a scalar key here — a higher layer's value replaces the lower layer's wholesale, with no per-key cross-layer merge (an absent map key is not back-filled from a lower layer; it falls to default 3 at resolution time). The subagent_model map ({<tier>: <model>}) is the same scalar/map class — a higher layer's map replaces the lower layer's wholesale (no per-key cross-layer merge), and an absent tier key falls to its built-in per-tier default at resolution time (sonnet for trivial / simple, inherit for moderate / complex)
List (check_commands): append — lower-layer items first, then higher-layer items, duplicates removed (keep first occurrence)
List-replace (test_commands): higher layer's list replaces lower layer's list as a whole (no item-level merge or dedup). Defaults to ["Skill(run-tests)"] when unset
hooks: deep-merge at the hooks level — each sub-key (on_complete) is merged as a list (append, deduplicated)

Keys absent from a higher layer inherit from lower layers. Only specify keys you want to override or extend.

---
reviewer: "ask-peer"
review_iterations: 3
subagent_model:
  trivial: sonnet
  simple: sonnet
task_decomposition: true
interactive_commits: true
compact_rules: false
custom_instructions: "Always use TDD. Write tests before implementation."
language: "ja"
check_commands:
  - "pnpm run lint:fix"
  - "pnpm run format"
  - "pnpm run typecheck"
test_commands:
  - "Skill(run-tests)"
hooks:
  on_complete:
    - "Skill(work-complete)"
self_retrospective:
  feedback: "owner/repo"        # or "/abs/path", "~/rel", "./rel"
---
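
The layer-merge rules above can be sketched as a small function over already-parsed frontmatter dictionaries. This is a minimal sketch under stated assumptions: `merge_layers` and `_append_dedup` are hypothetical names, YAML parsing and value validation are out of scope, and any key not named in the merge rules (for example self_retrospective) is treated here as replace-wholesale.

```python
def _append_dedup(lower, higher):
    """Append higher-layer items after lower-layer items, keeping the first
    occurrence of each duplicate."""
    result = []
    for item in [*lower, *higher]:
        if item not in result:
            result.append(item)
    return result


def merge_layers(layers):
    """Merge settings dicts ordered from lowest to highest priority."""
    merged = {}
    for layer in layers:
        for key, value in layer.items():
            if key == "check_commands":
                # List: append with dedup.
                merged[key] = _append_dedup(merged.get(key, []), value)
            elif key == "hooks":
                # Deep-merge at the hooks level: each sub-key merges as a list.
                hooks = dict(merged.get("hooks", {}))
                for sub_key, items in value.items():
                    hooks[sub_key] = _append_dedup(hooks.get(sub_key, []), items)
                merged["hooks"] = hooks
            else:
                # Scalars, map-valued scalars, and test_commands: the higher
                # layer replaces the lower layer's value wholesale.
                merged[key] = value
    merged.setdefault("test_commands", ["Skill(run-tests)"])
    return merged
```

Note that a map-valued review_iterations is replaced as a unit, so a higher layer that sets only {code: 3} leaves plan to fall back to its default at resolution time rather than inheriting the lower layer's plan value.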

reviewer: Reviewer skill name (default: ask-peer). Choose from: ask-peer, ask-claude, ask-codex, ask-gemini, ask-copilot, ask-agy. Unsupported values fall back to ask-peer
review_iterations: Max iterations for Plan Review (Step 3) and Code Review (Step 8) (default: 3). Two accepted forms: (i) a scalar positive integer applies the same cap to both phases (e.g. review_iterations: 2); (ii) a map {plan: N, code: M} sets the Plan Review cap (Step 3) and the Code Review cap (Step 8) independently (e.g. {plan: 1, code: 3}). Each map key must be a positive integer; an absent or invalid key falls back to default 3 for that phase only (per-key validation — see Step 1 sub-step 4). Adopting the map form is the user's explicit opt-in; the scalar form and an absent key are fully backward-compatible (both phases default to 3). Can be overridden per invocation with -i N / --iterations N, which overrides both phases with the same value regardless of the config form
subagent_model: Optional. A map from difficulty tier (trivial / simple / moderate / complex) to the model the workflow's Agent-tool subagent dispatches run on — sonnet / opus / haiku, or inherit (use the session model). It governs (i) the workflow's direct Agent dispatches (Step 7's two background launches, Step 11.5) and (ii) the model propagated via the Model: argument to the named callees the workflow dispatches (Step 7.5 rules-review; the Step 3 / Step 8 inline reviewer when the resolved reviewer is Claude-family — see Step 1's reviewer-family classification; external-CLI reviewers are not affected). Resolved once in Step 2 from the assessed tier (see Step 2's Adjust N for the resolution chain and the -i-path handling). Built-in default = {trivial: sonnet, simple: sonnet} (moderate / complex inherit). Behavior change: under this default, Trivial and Simple tasks run their subagent dispatches on sonnet instead of the session model. To opt out (restore all-inherit on the low tiers), set subagent_model: {trivial: inherit, simple: inherit} in .claude/dev-workflow.md or ~/.claude/dev-workflow.local.md. Invalid values / unknown tier keys warn and fall back to the built-in per-tier default. hooks.on_complete skill entries' models are independent of this key — the workflow never passes Model: to hooks.on_complete callee skills; each callee's model is set skill-side and is unaffected by subagent_model. Per-subagent effort is out of scope — the Agent tool exposes only model.
task_decomposition: Whether Step 1.5 runs the auto-decomposition check in Normal sub-mode (default: true). Set to false to treat Normal sub-mode requests (/dev-workflow <task>) as single tasks — Step 1.5 is omitted from the task list and the decomposition judgment is skipped entirely. --resume <state-file> is unaffected and still executes existing state files. Non-boolean values fall back to true with a warning
interactive_commits: Whether Step 10 (Interactive Commits) runs after hooks.on_complete (default: true). When true, after Step 9 (Completion Hooks) the workflow proposes commit groupings and messages, then iterates per-commit with the user. When false, Step 10 is omitted from the task list and never executes — the workflow ends with an uncommitted tree as before. Non-boolean values fall back to true with a warning. To opt out, set interactive_commits: false in .claude/dev-workflow.md or ~/.claude/dev-workflow.local.md
compact_rules: Whether Step 11 sub-step 3 (Char-count compaction gate) runs (default: false). The compaction mode added in v1.38.0 is currently experimental — when false (the default), sub-step 3 is skipped entirely: Skill(extract-rules) --compact is never invoked, the gate is never opened, and compaction_applied_count / below_threshold_failed_files stay at their initial values so § Completion's compaction reminder is automatically omitted. When true, the workflow invokes Skill(extract-rules) --compact and may enter the Step 11 compaction approval gate (USER APPROVAL GATE). Non-boolean values fall back to false with a warning. To opt in for a specific project, set compact_rules: true in .claude/dev-workflow.md or .claude/dev-workflow.local.md
custom_instructions: Free-form development instructions applied as guiding principles across planning, implementation, review, and tidy phases (e.g., "Always use TDD", "Prefer functional style"). Optional. .claude/rules/ and explicit user requests take precedence if they conflict
language: Optional. Output language code (e.g. ja, en) for user-facing prose produced by this skill — Step 4 plan body (Overview / Decisions / Design / Test plan / Risks / Unknowns content), user-gate preambles (Step 4 / Step 7.5 / Step 8), Step 2 difficulty-assessment log, Step 10 commit-plan / per-commit gate output (subjects, body, diff blocks framed in the resolved language; verbatim git output and file paths remain English), Completion summary, and Step 11.5 finding Description / Suggested fix direction paragraphs. Resolution: merged skill config → Claude Code settings (~/.claude/settings.json → language field) → default ja. null / empty string / non-string values fall through to the next resolution step. For the localization boundary between translated concepts and verbatim identifiers, see references/plan-format.md § Localization granularity. See references/self-retrospective.md §2.1 Language handling / §5 Contract note for the Step 11.5 scope contract. No Step 11.5 output unless self_retrospective.feedback is also configured
check_commands: Static checks (lint, format, typecheck, etc.). Always run all in order
test_commands: Defaults to ["Skill(run-tests)"]. Each entry must be a Skill(<name>) string (no shell commands). Entries run sequentially during Step 7. Run --init to generate or update run-tests; additional structural-check skills can be appended in project config (e.g. for bundle-sync drift detection, custom marketplace structure validators, or other repository-specific checks)
hooks: Execute skills/commands at specific workflow timing points
- on_complete: Runs as Step 9 (immediately after Step 8 Code Review). Entry format: Skill(<name>) or shell command string
- Entries not covered by allowed-tools require user approval
self_retrospective: Optional. Emits sanitized improvement signal for the dev-workflow-bundle skills (dev-workflow, ask-peer, extract-rules, rules-review) at Step 11.5 (between Step 11 and Completion). Raw conversation stays in-session; only abstracted text leaves
- feedback: Destination string. Auto-detected:
  - Starts with /, ~/, ./, or ../ → local directory path → retrospective written as a markdown file under that directory
  - Matches ^[\w.-]+/[\w.-]+$ → GitHub owner/repo → retrospective submitted via gh api POST to /repos/<feedback>/issues
  - Any other string (including empty) → warn and skip Step 11.5
- If feedback is unset, Step 11.5 is not registered as a task and never executes — the workflow behaves as before
- Step 11.5 runs whenever self_retrospective.feedback is configured, regardless of the Step 2 difficulty assessment — difficulty gates the review-iteration counts N_plan / N_code (Step 3 / Step 8) and the difficulty-skip matrix (Step 6 Tidy / Step 7.5 Rules Compliance on Trivial / Simple), but not the self-retrospective. Even Simple / Trivial tasks emit a retrospective when feedback is set; when nothing notable surfaced, the retrospective is simply short
- Agent tool usage: Exactly two steps directly spawn subagents via the Agent tool — Step 11.5 (for jsonl scan + sanitization) and Step 7's two concurrent background launches, the per-pass rules-review launch and the per-pass code review (run_in_background dispatches for test-phase overlap; see Step 7's "Concurrent rules-review launch" and "Concurrent code review launch" paragraphs for why Agent rather than Skill()). This is two steps / three dispatch sites (Step 7 carries two launches, Step 11.5 one); each of the three dispatch sites passes the Step 2-resolved subagent_model as the Agent model parameter (omitted when the resolution is inherit). All other steps delegate to named skills (Skill(ask-peer), Skill(run-tests), Skill(rules-review), Skill(simplify) / Skill(tidy), etc.) and must not invoke Agent directly. (The Step 3 / Step 8 inline reviewer's subagent_model propagation rides the named Skill(<reviewer>) call's Model: argument — it is not a direct Agent spawn and does not count against the "two steps" above.)
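
The self_retrospective.feedback auto-detection described above can be sketched as a classifier. The function name and the returned labels are hypothetical; the prefixes and the regular expression are taken from the rules above. The path-prefix check runs first because a value such as ./rel would otherwise also satisfy the owner/repo pattern.

```python
import re

# Pattern and prefixes copied from the feedback detection rules.
REPO_PATTERN = re.compile(r"^[\w.-]+/[\w.-]+$")
PATH_PREFIXES = ("/", "~/", "./", "../")


def classify_feedback(feedback):
    """Classify a feedback destination as "local_dir", "github_repo", or "skip".

    Non-string values (including an unset key) are treated as "skip" here;
    that mapping is an assumption of this sketch.
    """
    if not isinstance(feedback, str):
        return "skip"
    # Path prefixes are checked first: "./rel" also matches the repo pattern.
    if feedback.startswith(PATH_PREFIXES):
        return "local_dir"
    if REPO_PATTERN.fullmatch(feedback):
        return "github_repo"
    return "skip"
```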

Mode Detection

--init → Init Mode (-i / --iterations is ignored)
--resume <state-file> → Execution Mode (Resume sub-mode; see Step 1.5)
Otherwise → Execution Mode (Normal sub-mode)
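As a rough sketch of the three-way dispatch above, assuming the invocation has already been split into an argument list (the function name and the tuple return shape are illustrative, not part of the skill):

```python
def detect_mode(args):
    """Map a /dev-workflow argument list to (mode, sub_mode)."""
    if "--init" in args:
        return ("init", None)               # -i / --iterations is ignored here
    if "--resume" in args:
        return ("execution", "resume")      # the state file is consumed in Step 1.5
    return ("execution", "normal")
```

For example, `detect_mode(["--init", "-i", "2"])` yields `("init", None)`, which reflects that the iteration option has no effect in Init Mode.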

Init Mode

Read references/init-mode.md and follow the procedure.

Note: Skills generated by --init (e.g. run-tests) are recognized from the next session onward. Do not run /dev-workflow <task> in the same session as --init.

Execution Mode

No-Stall Principle

Explicit user-gates (the only permissible pause points):

Each bullet names the gate and points to the authoritative definition site. When editing either the enumeration or the definition, update both together.

Step 1.5 task-decomposition proposal dialogue — yes / adjust / no confirmation (Normal sub-mode; defined in Step 1.5 dispatch and references/task-decomposition.md § B. Normal sub-mode)
Step 1.5 leftover-subtask picker dialogue — selecting which subtask to run when more than one leftover in_progress subtask is runnable (Resume sub-mode; defined in references/task-decomposition.md § A. Resume sub-mode)
Step 4 plan approval (defined in Step 4: Finalize Plan)
Step 5 probe → real-implementation user-observation gate — when the Plan explicitly stages a probe / intermediate-artifact step before its real-implementation replacement: hold the workflow at the boundary until the user signals observation completion (defined in Step 5's "User-observable artifact protection gate" paragraph). Fires conditionally per the Plan's content — non-probe-staged plans never enter this gate
Step 7 pre-execution scope-narrowing stop — when a check_commands entry is assessed as a repo-wide auto-fix tool, the working tree has unrelated existing changes, and scope narrowing is not feasible given the tool's interface: stop and ask the user for direction (options: run accepting full-width effect, skip, or provide an alternative scoped invocation) (defined in Step 7: Check / Test)
Step 7 scope-drift stop — when check_commands writes non-trivial changes outside the task-scope snapshot (trivial = whitespace-or-comment-only formatting on ≤ 5 lines attributable to the formatter/linter that just ran — those proceed automatically with a one-line note): warn and wait for user direction (defined in Step 7: Check / Test)
Step 7 check/test fail-stop — failure after 3 retries: report the error and stop (defined in Step 7: Check / Test). Note: this is an error-stop, not a pause for user decision
Step 7.5 persistent-violations decision — rule violations still present after the 2nd review cycle (defined in Step 7.5: Rules Compliance Review)
Step 8 unresolved-findings decision — reviewer-reported actionable findings still unresolved after the N_code-th iteration (defined in Step 8: Code Review)
Step 10 commit-plan approval gate — accept the proposed commit grouping (subjects + file lists) for the working-tree changes; fires once on the initial plan and re-fires whenever a Mid-loop adjust file-regrouping / split-adding branch rebuilds the un-landed portion of the plan (defined in references/interactive-commits.md § Propose commit plan)
Step 10 per-commit accept gate — accept each individual commit (subject / body / files / diff) before it lands; repeats N times where N is the approved commit count (defined in references/interactive-commits.md § Per-commit loop, judged per § Approval token closed list inside Step 10)
Step 10 fold-or-defer gate — after a pre-commit hook auto-modifies the working tree following a zero-exit commit, ask the user whether to amend the just-landed commit (fold) or leave the changes uncommitted for a later iteration (defer); judged per the dedicated 5-branch → fold / defer / cancel / re-present-as-adjust classifier in references/interactive-commits.md § Post-commit auto-modify cycle bound (the 5 input branches extend § Approval token closed list's 4 buckets with an additional defer-direction branch; this gate is not the per-commit-accept-gate enum — cancel routes via Mid-loop cancel and ambiguous adjust responses re-enter the gate via § Mid-loop adjust branch f, both in the same reference)
Step 10 ambiguous-adjust clarifier — when a Mid-loop adjust request cannot be classified into branches a–e, ask the user a clarifying question and re-enter the gate that issued the request — this gate is itself the disposition for branch f of Mid-loop adjust — closed-list branches (in references/interactive-commits.md; categorization vocabulary depends on which gate originated the request)
Step 11 compaction approval gate — when Skill(extract-rules) --compact returns top-level status: "compacted", present per-file diff (chars_before / chars_after / iterations_used / applied_edits_count / structural_notes / per_file_status / below_threshold) per § User-gate summary preamble and wait for accept/reject/adjust/cancel per the Step 11 local closed list (defined in references/update-rules.md § Char-count compaction gate). cancel aligns with Step 10's Mid-loop cancel semantic (no revert; see references/interactive-commits.md § Mid-loop cancel); adjust uses Step 11's own three-case closed list (per-file disposition / clarification / other), not Step 10's branch f
Completion execution-time deferral/exclusion gate — when executing a decomposed subtask, if in-scope work items were excluded / deferred / discovered-unassigned during implementation or testing, ask the user to promote each uncovered item to a tracked subtask entry: (a) add as a new pending subtask (with depends_on if sequencing matters), (b) fold into an existing pending subtask's scope, or (c) explicitly accept as permanently out of parent-task scope (defined in Completion's "Execution-time deferral/exclusion gate" paragraph). Fires conditionally — only on decomposed-subtask runs that surfaced uncovered items
Completion subtask PR URL prompt — when executing a decomposed subtask, ask for optional PR URL before resuming (defined in Completion)

Progress Visibility

Workflow artifacts (cross-step fixed exclusion)

Step 1: Load Settings

Read settings from up to three layers and merge (type-aware):
```
merged = {}
if ~/.claude/dev-workflow.local.md exists:  overlay its frontmatter onto merged
if .claude/dev-workflow.md exists:          overlay its frontmatter onto merged
if .claude/dev-workflow.local.md exists:    overlay its frontmatter onto merged
```
"Overlay" = for each key present in the file:
- Scalar keys: merged[key] = file[key] (replace) — this includes review_iterations when its value is a map ({plan, code}): the whole map replaces the lower layer's value with no per-key cross-layer merge (a map key absent from the higher layer is not back-filled from the lower layer)
- List keys (check_commands): append file[key] items after merged[key], then deduplicate (keep first occurrence)
- List-replace keys (test_commands): merged[key] = file[key] — the higher layer's whole list replaces the lower layer's (no item-level merge or dedup)
- hooks: deep-merge — for each sub-key (e.g. on_complete), append and deduplicate the list
- null or empty ([], {}) explicitly clears the key — lower-layer value is discarded, not inherited
- Key absent from the file: left untouched (inherit from lower layers)
If a file's YAML frontmatter is malformed (parse error), warn the user naming the file, skip that layer, and continue with the remaining layers.
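The overlay rules above can be sketched in Python as follows. This is a minimal illustration under stated assumptions: each layer's frontmatter is already parsed into a dict, list items are hashable strings, and "clearing" a key is modelled by removing it from the merged result. The helper names `dedup`, `overlay`, and `merge_layers` are hypothetical.

```python
APPEND_LIST_KEYS = {"check_commands"}   # append, then deduplicate


def dedup(items):
    """Drop repeats, keeping the first occurrence (items must be hashable)."""
    return list(dict.fromkeys(items))


def overlay(merged, layer):
    """Overlay one layer's frontmatter (a dict) onto the merged settings."""
    out = dict(merged)
    for key, value in layer.items():
        if value is None or value == [] or value == {}:
            out.pop(key, None)                        # explicit clear (assumed: drop the key)
        elif key in APPEND_LIST_KEYS:
            out[key] = dedup(out.get(key, []) + list(value))
        elif key == "hooks":
            hooks = dict(out.get("hooks", {}))
            for sub_key, items in value.items():
                hooks[sub_key] = dedup(hooks.get(sub_key, []) + list(items))
            out["hooks"] = hooks
        else:
            out[key] = value                          # scalar, map, or list-replace key
    return out


def merge_layers(layers):
    """Apply layers lowest-priority first, as in the pseudocode above."""
    merged = {}
    for layer in layers:
        merged = overlay(merged, layer)
    return merged
```

Keys absent from a layer are never visited by the loop, so they inherit from lower layers without any special handling.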
If none of the three files exist, prompt user to run /dev-workflow --init and stop
Resolve reviewer from config. If not specified or not in the supported list (ask-peer, ask-claude, ask-codex, ask-gemini, ask-copilot, ask-agy), use ask-peer. Reviewer-family classification (the single definition referenced by the Step 3 / Step 8 inline-reviewer subagent_model propagation): Claude-family = ask-peer / ask-claude — model-controllable (ask-peer via its Model: argument applied to its internal Agent dispatch; ask-claude via the claude -p --model flag); external-CLI = ask-codex / ask-gemini / ask-copilot / ask-agy — these run their own non-Claude models and are not subagent_model-controllable (never receive a propagated model)
Resolve the review iteration counts — N_plan (Plan Review, Step 3) and N_code (Code Review, Step 8). A scalar config, the -i option, and the default all set both values equally; only the map config form makes them differ:
1. If -i / --iterations option is present and is a positive integer, set both N_plan and N_code to it (the option overrides both phases)
2. Else if config review_iterations is present:
  - scalar positive integer → set both N_plan and N_code to it
  - map ({plan, code}) → N_plan = plan if it is a positive integer else default 3; N_code = code if it is a positive integer else default 3 (per-key validation, independent per phase; warn on each absent/invalid key)
  - any other value (non-positive or non-integer scalar, list, string, or any non-map / non-scalar type) → warn and set both to default 3 (any map takes the map branch above — an empty map, or a map with no valid plan / code key, already resolves to default 3 per phase there)
3. Else use default 3 for both
Wherever a later step says "N" without a phase qualifier, Step 3 references resolve to N_plan and Step 8 references to N_code.
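The resolution order for N_plan and N_code could look like the sketch below. It assumes the `-i` value and the merged `review_iterations` value are passed in as plain Python objects, and that a cleared (null) config value counts as absent; the function names and the warning strings are illustrative only.

```python
DEFAULT_N = 3


def is_positive_int(value):
    # bool is a subclass of int in Python, so exclude it explicitly.
    return isinstance(value, int) and not isinstance(value, bool) and value > 0


def resolve_iterations(cli_value=None, config_value=None):
    """Return (n_plan, n_code, warnings) following the three-branch order."""
    warnings = []
    if is_positive_int(cli_value):
        return cli_value, cli_value, warnings          # -i overrides both phases
    if config_value is not None:
        if is_positive_int(config_value):
            return config_value, config_value, warnings
        if isinstance(config_value, dict):
            resolved = []
            for key in ("plan", "code"):               # validated independently per phase
                if is_positive_int(config_value.get(key)):
                    resolved.append(config_value[key])
                else:
                    warnings.append(f"review_iterations.{key} absent or invalid; using {DEFAULT_N}")
                    resolved.append(DEFAULT_N)
            return resolved[0], resolved[1], warnings
        warnings.append(f"review_iterations invalid; using {DEFAULT_N}")
    return DEFAULT_N, DEFAULT_N, warnings
```

Only the map form can make the two counts differ; every other branch returns the same value twice.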
Parse and validate the remaining config keys:
- hooks: warn and ignore if hooks.on_complete has invalid format
- review_iterations: emit the invalid-value / invalid-map-key warnings as sub-step 4's resolution defines (sub-step 4 owns the case analysis and the default-3 fallback per phase)
- custom_instructions (optional, string): warn and ignore if not a string
- task_decomposition (optional, boolean, default true): warn and fall back to true if present but not a boolean
- interactive_commits (optional, boolean, default true): warn and fall back to true if present but not a boolean
- compact_rules (optional, boolean, default false): warn and fall back to false if present but not a boolean
- subagent_model (optional, map of tier → model): warn and ignore a non-map value (the built-in per-tier default applies); for each tier entry, warn and drop an unknown tier key or a value outside the enum sonnet / opus / haiku / inherit (that tier then falls to its built-in default at resolution time). The merged map is consumed by Step 2's subagent_model resolution
- language: parse per the Configuration bullet above. For ~/.claude/settings.json, silently accept missing file / absent key / null value; warn once per Step 1 settings-load pass on malformed JSON, non-string, or empty string. The resolved language is available to Step 11.5
- self_retrospective.feedback (optional, string): warn and ignore if not a string or if empty string "". When feedback matches the owner/repo pattern (^[\w.-]+/[\w.-]+$), additionally run gh auth status as an early warning only; if auth fails, warn but do not block the run
Determine execution sub-mode: Resume if --resume <state-file> was provided, otherwise Normal. Step 1.5 branches on this
Register all workflow phases with the Task tools, including review iterations — issue one TaskCreate per phase below (each returns an auto-numbered taskId). Do NOT skip any phase:
- Step 1.5: Task Decomposition (Normal sub-mode only, AND only when task_decomposition is true — omit this entry entirely in Resume sub-mode or when task_decomposition is false, since in either case the step has nothing to do at registration time)
- Step 2: Create Plan
- Step 3: Plan Review
- Step 3-1 through Step 3-N_plan: Plan Review - iteration 1 through N_plan (generate N_plan items based on resolved N_plan)
- Step 4: Finalize Plan
- Step 5: Implement
- Step 6: Tidy
- Step 7: Check / Test [check: {check_commands} | test: {test_commands}]
- Step 7.5: Rules Compliance Review
- Step 8: Code Review
- Step 8-1 through Step 8-N_code: Code Review - iteration 1 through N_code (generate N_code items based on resolved N_code)
- Step 9: Completion Hooks (only if hooks.on_complete is configured)
- Step 10: Interactive Commits (only if interactive_commits is true; single row — per-commit iteration is handled inline within Step 10 because the commit count is not known until the proposal phase)
- Step 11: Update Rules
- Step 11.5: Self-Retrospective (only if self_retrospective.feedback is set and parses as a valid destination — see Step 11.5 for detection rules; if unset/invalid, omit this entry)
Tool availability (Task tools vs TodoWrite): these steps name the Task tools (TaskCreate / TaskUpdate / TaskList), the default since Claude Code v2.1.142. Where the Task tools are unavailable (e.g. the VSCode extension, or Claude Code before v2.1.142), use the equivalent TodoWrite operations instead — the status values (pending / in_progress / completed) and the register-all-upfront semantics are identical, and a TaskList-by-subject status read becomes a read of the TodoWrite list. allowed-tools grants both, so use whichever the environment exposes.
Registration mechanics (Task tools): issue every TaskCreate in a single upfront burst (one tool-call batch) so all phases are registered before Step 2 begins. Two conditional cases: (i) conditionally-omitted phases (the list items above carrying a condition) are omitted by not issuing their TaskCreate; (ii) N-reduced excess iteration tasks (Step 3-x beyond resolved N_plan / Step 8-x beyond resolved N_code) are still TaskCreated here at the resolved ceiling (Step 3 ceiling = N_plan, Step 8 ceiling = N_code), then marked completed via TaskUpdate by Step 2's Adjust N by difficulty. Mark each task in_progress (via TaskUpdate {taskId, status}) when starting and completed when done.
Task-handle resolution convention: every later "mark Step N as in_progress / completed" instruction in this skill is shorthand for resolve that Step's task — by its registration-time captured taskId, or by subject via TaskList — then TaskUpdate {taskId, status}; the per-step lines name tasks by their human-readable subject and do not restate this resolution path.
Registering all phases upfront gives the user visibility into overall progress and prevents steps from being accidentally dropped.
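To make the conditional entries in the registration list concrete, here is a sketch that builds the ordered task subjects from the resolved settings. It is an assumption-laden illustration: the function and parameter names are hypothetical, the exact subject strings for iteration rows are guessed from the list above, and the `[check: ... | test: ...]` suffix on Step 7 is left out for brevity.

```python
def phase_subjects(n_plan, n_code, *, resume=False, task_decomposition=True,
                   has_on_complete=False, interactive_commits=True,
                   feedback_valid=False):
    """List task subjects in registration order, omitting conditional phases."""
    subjects = []
    if not resume and task_decomposition:
        subjects.append("Step 1.5: Task Decomposition")
    subjects += ["Step 2: Create Plan", "Step 3: Plan Review"]
    subjects += [f"Step 3-{i}: Plan Review - iteration {i}" for i in range(1, n_plan + 1)]
    subjects += ["Step 4: Finalize Plan", "Step 5: Implement", "Step 6: Tidy",
                 "Step 7: Check / Test", "Step 7.5: Rules Compliance Review",
                 "Step 8: Code Review"]
    subjects += [f"Step 8-{i}: Code Review - iteration {i}" for i in range(1, n_code + 1)]
    if has_on_complete:
        subjects.append("Step 9: Completion Hooks")
    if interactive_commits:
        subjects.append("Step 10: Interactive Commits")
    subjects.append("Step 11: Update Rules")
    if feedback_valid:
        subjects.append("Step 11.5: Self-Retrospective")
    return subjects
```

Each returned subject would correspond to one TaskCreate call (or one TodoWrite entry) issued in the single upfront batch.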
Phase-boundary self-audit: at every top-level Step transition (not the iteration sub-rows Step 3-i / Step 8-i, which are governed by the Return-point no-stall reminders below), before issuing the first tool call that advances into a new Step's procedure, name the Step number you are entering, resolve the prior Step's task by subject via TaskList, and verify it is completed — if it is still pending or in_progress, return to the unfinished Step first instead of advancing. This guards against silent phase-skipping (e.g. jumping from Step 5 Implement to Step 7 Check / Test without running Step 6 Tidy, only to discover the gap during a later phase) that the task registration alone cannot prevent. Implementation sub-tasks in Step 5 are additions, not replacements. Note: Unless -i / --iterations was explicitly specified, Step 2 may reduce N_plan / N_code based on task difficulty.
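The phase-boundary self-audit amounts to a guard evaluated before entering each top-level Step. A minimal sketch, assuming the registered top-level subjects are available as an ordered list and their statuses as a dict (both hypothetical stand-ins for a TaskList read):

```python
def next_step_to_run(ordered_subjects, status_by_subject, entering):
    """Return `entering` if the prior step is completed, else the unfinished prior step."""
    index = ordered_subjects.index(entering)
    if index == 0:
        return entering                       # nothing precedes the first step
    prior = ordered_subjects[index - 1]
    if status_by_subject[prior] != "completed":
        return prior                          # go back instead of advancing
    return entering
```

With Step 6 still pending, an attempt to enter Step 7 is redirected to Step 6, which is the silent-skip case the audit is meant to catch.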
Context-compaction recovery: if the session context was compacted (prior turns summarized) before reaching this step in the current turn, re-read the configuration files from disk rather than relying on the summary — verify each step's skip conditions (e.g. whether self_retrospective.feedback is set, whether hooks.on_complete is configured, whether interactive_commits is true, whether compact_rules is true) from the actual merged config, not from compacted context.

Step 1.5: Task Decomposition

State-file semantics are critical (a malformed or mis-routed file silently corrupts subtask boundaries), so the full procedure lives in a dedicated reference. Dispatch:

Resume sub-mode (--resume <state-file> was provided): read references/task-decomposition.md and follow section A. Resume sub-mode from top to bottom.
Normal sub-mode + task_decomposition: true (the default): read references/task-decomposition.md and follow section B. Normal sub-mode.
Normal sub-mode + task_decomposition: false: no decomposition work. Set the "effective task" to the original request and proceed to Step 2 without creating a state file. Step 1.5 is not registered as a task in this case (see Step 1), so there is nothing to mark completed. You do not need to read the reference file.

EnterPlanMode is reserved for Step 2 — any decomposition proposal in Step 1.5 is a plain yes/no dialogue, not a plan.

After section A or B completes, the "effective task" is set for Step 2 onward: the selected subtask when decomposed, otherwise the original request.

Step 2: Create Plan

Record the current commit as base-commit (git rev-parse HEAD) for later diff comparison. Initialize the difficulty-skip ledger here: set difficulty_skipped_steps = [] (a cross-step list of human-readable records — <step> skipped (<tier> tier) — that § Completion's difficulty-skip reminder renders). This init lives at Step 2 entry, outside the -i-gated Adjust N sub-step below, so the variable is well-defined on every path: when Adjust N is skipped (because -i / --iterations was given) or no tier qualifies for a skip, the ledger simply stays empty and § Completion omits the reminder. Same purpose as the compaction_applied_count State-variable contract (well-defined on the skipped path) but a different technique: compaction initializes inside its conditionally-skipped sub-step and relies on a prose contract; this ledger is physically hoisted out of the skippable sub-step instead — do not relocate it into Adjust N expecting prose to cover the -i path, since the init statement would then not run under -i. Also initialize the subagent_model cross-step variable here (same hoist rationale): set subagent_model = inherit (no model override). The built-in tier → model map (§ Configuration) is consulted only after a tier is assessed in Adjust N; the pre-assessment value — and the value on the -i path, where Adjust N is skipped and no tier is assessed — is always inherit, so every downstream Agent dispatch / Model: propagation omits the model (current behavior). Read sites: Step 7's two background launches, Step 7.5's sequential rules-review, Step 11.5, and the Step 3 / Step 8 inline reviewer (Claude-family only).
EnterPlanMode
Analyze the task and codebase, create implementation plan. Apply custom_instructions to shape plan priorities and structure. Follow the structure defined in references/plan-format.md — Overview / Decisions / Design / Test plan required; Risks / Unknowns optional. When the work is sequential, Design defaults to an ordered, numbered list of implementation steps (see references/plan-format.md § Template, the source of truth). Section-level content rules live in the reference file; do not re-derive them here.
- If a state file exists (this run is executing one subtask of a decomposed parent): the "effective task" = the current in_progress subtask. Frame the plan around just this subtask while keeping the full parent task and other subtasks as background context so the plan stays consistent with the overall direction. Do not plan work belonging to other subtasks. See references/plan-format.md § Subtask / Resume handling for how Decisions is scoped in this case
- TDD-conflict resolution: if custom_instructions includes a TDD-style requirement (e.g. "Always use TDD", "write tests before implementation") AND the current task is adding tests for existing behavior (characterization tests, coverage tests, or relocating existing tests — keywords: "add tests for", "characterize behavior", "test coverage", "move tests", "固定する" ("pin down"), "追加する" ("add")) rather than driving new implementation, declare explicitly in Plan Overview or Risks that this subtask is TDD-loop-external: the tests describe and pin down already-implemented behavior rather than specify new behavior. This resolves the apparent conflict: the TDD guideline governs feature-implementation subtasks; characterization and coverage subtasks are outside the TDD loop by design.
- Version/identifier string replacement tasks: if the core operation is replacing a specific version string, identifier, or constant across the project (e.g. version bump, rename, migration), grep the entire repository for the old value before drafting the plan. Include the complete list of affected files in the Design section — missing even one location is the primary regression source for this task class
Simplicity self-audit: Before proceeding to Step 3, read references/simplicity-self-audit.md and audit the plan against its checklist.
Plan self-check: Run the checklist in references/plan-format.md § Step 2 self-check against the plan. This is the author's first-pass judgment on Decisions content; fix any failures before Step 3.
No code changes in this phase
Adjust N by difficulty (skip if -i / --iterations was explicitly specified): A typo fix doesn't need 3 rounds of review. Based on the plan just created, assess task difficulty and reduce the iteration counts to avoid unnecessary iterations — the configured value is a ceiling, not a target. The same difficulty cap is applied independently to N_plan and N_code (the two values that may differ only when review_iterations is a map; otherwise they are already equal):
- Trivial (a genuinely self-evident change with a single unambiguous solution — a typo fix, a one-line edit, a config value change): N_plan = N_code = 0 — Step 3 (Plan Review) and Step 8 (Code Review) are skipped entirely. Difficulty-skip matrix (Trivial): additionally skip Step 6 Tidy and Step 7.5 Rules Compliance Review — at this tier the cleanup pass and the rules-compliance walk are low-yield, and the Step 4 plan-approval gate plus Step 7 check_commands / test_commands remain the safety net. Conservative tie-break: classify as Trivial only when the solution is truly unique and obvious; if the change spans more than a trivial edit, the solution is not uniquely determined, or there is any doubt at all, fall to Simple or above so internal review is retained. The same external-library major-bump exception described under Simple applies here too (such a change is never Trivial)
- Simple (typo fix, config tweak, straightforward bug fix with obvious solution): N_plan = N_code = 1 — unless the change touches an external library's config file or type-level API AND that library had a recent major-version bump (primary check: git diff <base-commit> of the package manifest; if absent in this run, judge from other context since the bump may predate this run); then classify as at least Moderate. Similar qualitative risks (external config-DSL rewrites, etc.) follow the same rule. Purely cosmetic edits (comments, whitespace, auto-formatting) do not trigger the exception — use judgment. Difficulty-skip matrix (Simple): additionally skip Step 6 Tidy only (a pure quality-cleanup pass, correctness-neutral); Step 7.5 Rules Compliance Review still runs, since rule violations do not correlate with change size
- Moderate (multi-file within one module, feature following existing patterns): N_plan = min(2, N_plan), N_code = min(2, N_code) — no step is skipped (difficulty-skip matrix applies to Trivial / Simple only)
- Complex (cross-module, new patterns, API changes, significant refactoring): keep N_plan and N_code — no step is skipped
Step 9 (Completion Hooks) is never skipped by the difficulty-skip matrix at any tier — hooks.on_complete is a project-configured open list whose callee set varies per project, so difficulty-gating it would make behavior project-dependent; the matrix covers only the whole-step-skippable Step 6 / Step 7.5.
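The per-tier caps and the difficulty-skip matrix above can be summarized in a short sketch. The function name, the lowercase tier strings, and the return shape are assumptions made for illustration; the record format follows the `<step> skipped (<tier> tier)` pattern used for difficulty_skipped_steps.

```python
SKIP_MATRIX = {
    "trivial": ["Step 6 Tidy", "Step 7.5 Rules Compliance Review"],
    "simple": ["Step 6 Tidy"],
    "moderate": [],
    "complex": [],
}


def adjust_for_difficulty(tier, n_plan, n_code, iterations_flag_given=False):
    """Return (n_plan, n_code, difficulty_skipped_steps) for an assessed tier."""
    if iterations_flag_given:
        return n_plan, n_code, []             # Adjust N is skipped under -i
    if tier not in SKIP_MATRIX:
        raise ValueError(f"unknown tier: {tier}")
    if tier == "trivial":
        n_plan, n_code = 0, 0                 # Step 3 and Step 8 are skipped entirely
    elif tier == "simple":
        n_plan, n_code = 1, 1
    elif tier == "moderate":
        n_plan, n_code = min(2, n_plan), min(2, n_code)
    # complex: keep both counts unchanged
    ledger = [f"{step} skipped ({tier.capitalize()} tier)" for step in SKIP_MATRIX[tier]]
    return n_plan, n_code, ledger
```

Step 9 never appears in the matrix, matching the note that completion hooks are not difficulty-gated.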

File count is a hint, not the sole criterion. If adjusted, mark excess task iteration items (Step 3-x beyond N_plan, Step 8-x beyond N_code) completed via TaskUpdate. When Trivial reduces both to 0, mark every Step 3-x / Step 8-x iteration item AND the top-level Step 3: Plan Review / Step 8: Code Review rows as completed — both steps are skipped entirely (their entry-point guards in Step 3 / Step 8 recognize this pre-completed state and pass straight through; only Trivial produces 0, and it zeroes both counts together, so the "Trivial → both steps skipped" coupling holds). Difficulty-skip matrix marking (Step 6 / Step 7.5): apply the same pre-completed-mark + entry-point-guard mechanism to the whole-step-skippable quality steps. Keyed on the assessed tier alone (no config flag): for Trivial, mark the top-level Step 6: Tidy AND Step 7.5: Rules Compliance Review rows completed; for Simple, mark only Step 6: Tidy completed; for Moderate / Complex, mark neither. For each row marked completed here, append one record to difficulty_skipped_steps (e.g. Step 6 Tidy skipped (Trivial tier) / Step 7.5 Rules Compliance Review skipped (Trivial tier)) so § Completion's difficulty-skip reminder can render it (the skip is never silent). Step 9 (Completion Hooks) is never marked here (see the Step 9 note above). Resolve subagent_model here (after the tier is assessed, in the same Adjust N pass): set subagent_model = the merged-config subagent_model map entry for the lowercased assessed tier name (trivial / simple / moderate / complex) when that key is present and valid, else the built-in default for that tier (sonnet for Trivial / Simple, inherit for Moderate / Complex), else inherit. A resolved value of inherit means downstream dispatches omit the model (current behavior). This resolution is skipped under -i / --iterations (Adjust N does not run), leaving subagent_model at its sub-step 1 inherit init. 
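The subagent_model resolution described above might be sketched like this, assuming the merged config map has already been validated in Step 1 (the function name and argument names are hypothetical; the tier defaults and the model enum are taken from the text):

```python
BUILTIN_DEFAULTS = {"trivial": "sonnet", "simple": "sonnet",
                    "moderate": "inherit", "complex": "inherit"}
VALID_MODELS = {"sonnet", "opus", "haiku", "inherit"}


def resolve_subagent_model(tier, config_map=None, iterations_flag_given=False):
    """Resolve the model for downstream dispatches; 'inherit' means omit the model."""
    if iterations_flag_given or tier is None:
        return "inherit"                      # no tier assessed: keep the Step 2 init value
    key = tier.lower()
    value = (config_map or {}).get(key)
    if value in VALID_MODELS:
        return value                          # valid per-tier override from config
    return BUILTIN_DEFAULTS.get(key, "inherit")
```

A caller would omit the Agent model parameter (or the reviewer's Model: argument) whenever the result is `inherit`.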
Log the assessed difficulty and effective N_plan / N_code in the resolved language (see § Configuration; default ja). The Step 11.5 task row is not affected by the difficulty assessment — it stays pending regardless, since the self-retrospective is gated only on self_retrospective.feedback.
Do not present the plan to the user or ask for approval/confirmation — presenting an unreviewed plan wastes user time and risks approval of a suboptimal approach. This prohibition extends to confirmation-seeking transition sentences such as "if this design looks good, I'll proceed to Step 3 (Plan Review)", "shall I move on to Plan Review?", or any equivalent ask-for-go-ahead phrasing — these read as natural conversation but constitute the same approval-gate that wastes user attention on an unreviewed plan. The moment Step 2 ends, advance directly to Step 3 without emitting any user-facing message about the plan or the transition. The user will see the plan in Step 4 (internally reviewed in Step 3, unless the task was assessed Trivial — N_plan=0 — in which case Step 3 is skipped and the plan reaches Step 4 unreviewed).

Step 3: Plan Review

(i) The Step 3 reviewer skill is always invoked.
(ii) User-provided analysis (long task descriptions that themselves argue for the approach, embedded justification in handoff docs, etc.) is fed into the reviewer skill's dispatch payload as additional context so the reviewer can build on it rather than re-derive it.
(iii) An explicit user override in the task prompt ("you may skip Step 3 for this run", or equivalent) is the only analysis-driven path to skipping (distinct from the difficulty exception above). When this fires, record a warning in the Completion summary so the user has a visible signal that the bias-free review pass was bypassed.

Mark Step 3: Plan Review as in_progress. Process each pending iteration item (Step 3-1 through 3-N_plan) in order:

Mark the iteration item as in_progress. Call the reviewer skill resolved in Step 1 (e.g. Skill(ask-peer)): Review the plan. subagent_model propagation (inline reviewer): when the resolved reviewer is Claude-family (per Step 1's reviewer-family classification) and subagent_model is a model id, propagate it — pass Model: <subagent_model> to ask-peer, or include --model <subagent_model> in the dispatch instruction to ask-claude. External-CLI reviewers and an inherit resolution carry no model (current behavior). Step 3 is always inline, so there is no background-launch path to double-apply against here. Pre-dispatch dispatch-boundary reminder: Issue the Skill(<reviewer>) call in the same turn as any accompanying status prose — never produce a standalone status turn before the Skill() call, as that creates a stall point. Reading the reviewer's SKILL.md is preparation, not dispatch; the Skill() call is the dispatch.
- Instruct reviewer to read all files under .claude/rules/ for project conventions, references/plan-format.md for the Decisions (a)+(b) criterion and § Step 3 (f) content-quality rubric, references/simplicity-self-audit.md for the Step 2 audit checklist that category (a) below verifies, and references/review-categories.md § Plan review categories for the full per-category rubric of the six categories below (resolve these references/*.md links to concrete readable paths when composing the request — the reviewer lacks the skill-directory context)
- Request feedback organized into six categories (labels only — full rubric per the read-instruction above): a. Scope & feasibility b. Approach & alternatives c. Completeness d. Incrementality e. External library primary-source verification f. Presentation & attention allocation (content quality)
- If custom_instructions is configured, include the instructions text in the review request and have the reviewer verify alignment and report conflicts
- If a state file is active (executing a subtask from a decomposition), include the current subtask's scope in the reviewer request: list the subtask's title and description, then list what the other subtasks cover (to define out-of-scope). Instruct the reviewer that missing functionality belonging to other subtasks is not an actionable finding for this review — only findings scoped to the current subtask qualify. Omit this when no state file is active (single-task execution has no defined out-of-scope boundary).
- Reviewer should only report actionable findings. If none, explicitly state "No actionable findings"
Judge the reviewer's response semantically: if the reviewer reports nothing actionable — no actionable findings, no improvements to apply, no review points raised, or any other "nothing to report" outcome regardless of exact wording — mark this and remaining iteration items as completed (skip). Mark Step 3: Plan Review as completed and proceed to Step 4 automatically. Per the No-Stall Principle, do not wait for user input and do not rely on exact-phrase matching; trust semantic judgment since the reviewer skill's phrasing varies (especially Skill(ask-peer) and other free-form-prose reviewers whose verdicts are natural-language Markdown rather than a fixed token).
Otherwise: autonomously apply improvements or reject inapplicable points with reason — do not ask the user for judgment on individual review findings. Mark this iteration item as completed.
- Approach-reconsideration self-audit on high findings count (iter 1 only): at the iter 1 → iter 2 boundary, before applying findings individually for iter 1's output, count the reviewer's findings by severity. If either threshold trips — Critical ≥ 3 OR (Critical + Major) ≥ 10 — additionally scan the findings list for any item that surfaces an approach-level alternative (typical phrasings: "X の方が筋がよい" ("X is the sounder approach"), "existing X と統合できる" ("can be merged with the existing X"), "switch to <sibling>", "use <existing-mechanism> instead", or any equivalent "the plan should adopt a different overall approach" framing). If at least one such approach-alternative finding is present, do not proceed with mechanical apply-and-iterate — instead, treat the findings cluster as a signal that the plan's Approach selection itself is the root cause. Rewrite the plan with the approach-alternative finding's direction promoted into the Decisions section (Recommendation / Alternative swap or insertion-direction new Decision item, per the rewrite class), add a new review iteration item Step 3-(N_plan+1), and return to Step 3 to re-review the rewritten plan. The remaining iter-1 findings are carried forward as context for the next reviewer. When the threshold trips but no approach-alternative finding is present (mechanical-fix-level findings only), proceed with the usual per-finding apply-and-iterate path. This audit applies only at the iter 1 → iter 2 boundary; later iterations have already exercised one or more apply cycles, and approach-reconsideration after that point is the Step 4 user-gate's responsibility (general principle: high finding density paired with an approach-level alternative finding is a structural signal, not a quality signal — keep applying mechanical fixes and the plan still fails at the Step 4 user gate).
- Prose-integrity self-check (post-fix): after applying a fix that edits plan prose adjacent to its target line (Decisions / Design / Test plan / Risks / Unknowns paragraphs), re-read the surrounding paragraph as a single unit before continuing — verify no sentence is cut mid-word, no logical connective is broken (the connectives however / therefore / because / but / etc. still anchor real clauses), and the paragraph's overall logic still holds. Mechanical fix patches see only the line-level diff and routinely leave the surrounding prose semantically broken in ways the next iteration's reviewer flags as a Major finding, costing an extra iter.
- If the plan was modified: continue to the next pending iteration item (back to step 1). Plan modifications often introduce new gaps or ripple effects that the previous reviewer had no chance to see, and the re-review round-trip is cheap compared to shipping a plan that looks fine to the author but has an unvetted section. Don't short-circuit even when the fixes feel airtight.
- If all points were rejected (no modifications): mark remaining iteration items as completed (skip: there is nothing new for the next reviewer to look at).

Continue to the next pending iteration item with:
- the updated plan
- a summary of changes made and rejections with reasons
- an iteration-scope instruction: from iteration 2 onward, the reviewer's primary verification scope is the plan changes applied since the previous iteration (conveyed by the summary of changes above — no separate diff artifact is provided) plus landing confirmation of the previous iteration's findings — the full-coverage pass (re-verifying every plan section, decision, and cited reference from scratch) belongs to iteration 1 only. The reviewer must still escalate back to full re-verification when content outside that primary scope raises a new concern, so coverage is reordered, not reduced
- the same six-category structure (a–f), .claude/rules/ reference, and "No actionable findings" requirement
Return-point no-stall reminder: At each iteration boundary (regardless of reviewer outcome — findings reported, "No actionable findings", any non-error result), the next action — the next iteration's reviewer dispatch when more iteration items remain, or the Step 4 transition when this was the last iteration or "No actionable findings" was returned — must be issued in the next tool call. Do not insert an interstitial summary or acknowledgment turn between iterations; the abstract enumeration in § No-Stall Principle is intentionally duplicated here so the rule fires at the decision moment.
If all N_plan iteration items are completed and actionable feedback still remains, carry the unresolved points forward to Step 4.

Mark Step 3: Plan Review as completed.
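
The iter 1 approach-reconsideration audit described above combines a severity threshold with the presence of an approach-level alternative. The following is a minimal sketch of that decision, not the workflow's actual implementation: the `Finding` type, its field names, and the function name are hypothetical, and the `approach_alternative` flag is assumed to have been set by semantic judgment beforehand rather than by phrase matching.

```python
from dataclasses import dataclass

CRITICAL_THRESHOLD = 3              # Critical >= 3 trips the audit
CRITICAL_PLUS_MAJOR_THRESHOLD = 10  # (Critical + Major) >= 10 trips the audit


@dataclass(frozen=True)
class Finding:
    """Hypothetical record for one reviewer finding."""
    severity: str               # assumed values: "critical", "major", "minor"
    approach_alternative: bool  # assumed to be set by semantic judgment


def should_rewrite_approach(findings: list[Finding], iteration: int) -> bool:
    """Return True when iter 1's findings call for a plan rewrite."""
    if iteration != 1:
        # Later iterations leave approach reconsideration to the Step 4 gate.
        return False
    critical = sum(f.severity == "critical" for f in findings)
    major = sum(f.severity == "major" for f in findings)
    tripped = (
        critical >= CRITICAL_THRESHOLD
        or critical + major >= CRITICAL_PLUS_MAJOR_THRESHOLD
    )
    # A tripped threshold alone means "many mechanical fixes"; the rewrite
    # path also needs at least one approach-level alternative finding.
    return tripped and any(f.approach_alternative for f in findings)
```

When the function returns False the usual per-finding apply-and-iterate path applies, including the case where the threshold trips on mechanical-fix-level findings only.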

Step 4: Finalize Plan (USER APPROVAL GATE)

Before presenting, verify via TaskList that Step 3: Plan Review and every Step 3-x iteration item are completed. ExitPlanMode is the effective exit from Plan Mode, so issuing it while any Step 3 item is still pending or in_progress skips the internal review entirely. If any Step 3 item is not completed, emit a one-line inline note to the user naming all incomplete items (e.g., "Plan review found incomplete (Step 3-2 still pending); running the remaining review pass before presenting the plan.", substituting the actual incomplete iteration item label(s)), then return to Step 3 to process them (do not flip a row to completed without doing the review work). Exception: when Step 2's difficulty assessment set N_plan=0 (a Trivial task) and therefore pre-marked all Step 3 rows completed, that completed state is the intended skip, not an unrun-review bug, so proceed to ExitPlanMode normally.

1.5. Prose-language self-audit: Before calling ExitPlanMode, verify that explanation prose in the plan body (Overview narrative, Decisions rationale, Design descriptions, Test plan steps, Risks/Unknowns paragraphs) is written in the resolved language. Schema tokens (Overview / Decisions / Design / Test plan / Risks / Unknowns), step labels, enum values, identifiers, and quoted code strings stay in their original form regardless of language. Audit both directions: (a) whether any explanation sentences are in a language other than the resolved language, and (b) whether concept words outside the verbatim-preserve scope (ordinary nouns, adjectives, conjunctions, verb phrases) are over-preserved in the source language rather than rendered in the resolved language (per references/plan-format.md § Localization granularity's Negative-direction rule). Revise any failures now per references/plan-format.md § Localization granularity before proceeding to sub-step 2.
Re-entry coverage: this audit must re-run on every entry into Step 4 — both the initial entry and any re-entry triggered by sub-step 1's "return to Step 3" path or sub-step 3's material-change path — since revisions during Step 3 iteration may introduce prose in a language different from the resolved language.
This is the first time the user sees the plan. Write the full plan body to the Plan Mode plan file with the Write tool (the ExitPlanMode approval modal renders that file's contents), and present a condensed view in chat per the two-tier protocol in references/plan-format.md § Step 4 presentation order. For N_plan ≥ 1 the plan was internally reviewed in Step 3 (include any unresolved review points from Step 3); for a Trivial task (N_plan=0) Step 3 was skipped, so present the plan as unreviewed and rely on this user-approval gate as the sole review. Render the chat view in this order:
- a. ## Plan header as a visual boundary.
- b. The > Review guide line (per references/plan-format.md § Review guide line) followed by the condensed plan body, following references/plan-format.md § Localization granularity in the resolved language (see §Configuration; default ja): Overview in full (including Highlights when present), Decisions in full, and Design as a file-list only (files to change + one line of what-changes each). Test plan and Risks / Unknowns are not rendered in chat; they live in the full plan file and surface via the preamble's verification approach / known risks slots. Section headings render at ### level (one below the ## Plan container); sub-sections (Title, Goal, Scope, Decision N, Implementation, etc.) at ####.
- c. Horizontal rule (---) separator.
- d. Summary preamble per references/plan-format.md § User-gate summary preamble.
- e. Guidance line per references/plan-format.md § Step 4 guidance lines (verbatim, no paraphrasing, no concatenation).
- f. Call ExitPlanMode in the same turn, immediately after the guidance line. ExitPlanMode triggers the approval modal (which renders the full plan file); if it is not called, the user sees the plan text but has no way to approve. Delaying ExitPlanMode to a subsequent turn is the primary cause of Step 4 appearing stalled.

Section headings (Overview / Decisions / Design / Test plan / Risks / Unknowns) and the Step 4 guidance line stay English.
Collaborate with the user to refine the plan as needed (normal Plan Mode interaction). Categorize each user response into one of the four buckets below via semantic judgment (per § No-Stall Principle's "do not rely on exact-phrase matching" rule — example phrasings are illustrative, not literal discriminators):
- accept: explicit affirmative — "OK" / "approve" / "looks good" / "進めて" / any semantic equivalent. Begin implementation.
- swap-decisions (Decisions Recommendation/Alternative swap on one or more specific items — "Decision 1 を Alternative に", "swap the recommendation on the language flag", "use the alternative for Decision N", "Decision N と M は Alternative で残りはそのまま"): re-render the plan with the specified Recommendation / Alternative pairs swapped on the named Decisions items, leave other items unchanged, run the read-back sub-step below, then re-present the plan (re-enter the gate). When the user names multiple Decisions in one message, list every affected item on the read-back line so partial-coverage misses cannot slip through.
- rewrite-approach (Approach / Design / Scope-level material change — "switch from independent skill to extending sibling mode", "split this into two subtasks", "scope down to only the canonical site", or any change that does not fit a clean Decisions swap): add a new review iteration item (Step 3-(N_plan+1)), run the read-back sub-step below, return to Step 3 to re-review the modified plan, then re-enter Step 4 from sub-step 1 (so sub-step 1's task completion check on the new Step 3-(N_plan+1) item and sub-step 1.5's prose-language re-entry-coverage audit both run before re-presenting at sub-step 2). Trivial (N_plan=0) re-activation: if the task had been assessed Trivial (N_plan=N_code=0) so Step 3 was skipped, an Approach-level material change means the task is no longer trivially self-evident — re-run Step 2's Adjust N by difficulty against the rewritten plan to re-derive the difficulty assessment itself (it will no longer be Trivial) and the effective N_plan / N_code (re-running the independent-cap logic on both values, not a single value). Updating the difficulty assessment — not just the counts — is required because every downstream gate keys on the assessment, not on a bare count: Step 4's unreviewed-plan presentation and the references/plan-format.md Trivial conditional both read "the task was assessed Trivial"; leaving the stale Trivial label in place would keep announcing "sole review". Then re-mark the task rows for the re-derived difficulty: register Step 3-1 … Step 3-N_plan (and the Step 8-1 … Step 8-N_code rows) as fresh pending, clear the previously-skip-completed top-level Step 3: Plan Review / Step 8: Code Review rows back to pending. The difficulty-skip matrix is re-derived in the same pass: reset difficulty_skipped_steps = [] and re-run Adjust N, which recomputes which of Step 6: Tidy / Step 7.5: Rules Compliance Review the new tier skips and re-populates the ledger from scratch (no find-and-remove of individual records). 
Clear any previously-skip-completed Step 6 / Step 7.5 row back to pending when the higher tier no longer skips it — the same re-pending treatment applied to the Step 3 / Step 8 rows. Without this re-derivation the Step 3 entry-point guard would skip the new review item (it skips whenever N_plan=0), Step 4's completion check would loop on the unprocessed item, and a Step 6 / Step 7.5 row left stale-completed would silently skip a quality step the higher tier now requires.
- withdraw: explicit halt — "stop" / "cancel" / "abort" / "やめる" / "取り下げ". Exit the workflow with no further steps; do not proceed to implementation.
Read-back sub-step (mandatory before applying any swap-decisions / rewrite-approach interpretation): emit a one-line summary of the interpreted change in the resolved language (e.g. Decision 1 を Alternative に切り替え、Decisions 2 と 3 は Recommendation のまま保持します — このまま反映してよろしいですか？) and wait for the user to confirm before re-rendering. The read-back is the gate-of-origin's own resolution branch; do not nest a separate ExitPlanMode call inside it. If the user's confirmation response itself reads as another swap-decisions / rewrite-approach / withdraw instruction, treat the read-back as un-confirmed and re-classify under the four buckets above. The read-back catches multi-Decisions instructions with partial coverage and Approach-level instructions that masquerade as Decisions swaps — both are common failure modes that silently lose user-specified scope when interpreted without read-back.

NOT approval (interrogative or non-committal, e.g. "look good?" / "どう？" / "これでいい？"): treat as ambiguous. Ask the user whether they meant to approve or to request a change, then re-classify the response under the four buckets above. Do not silently advance.

After the user accepts (accept bucket), begin implementation.
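
The sub-step 1 completion check at the top of Step 4 amounts to a filter over the task rows. The sketch below is illustrative only: the `Task` type and the function name are hypothetical, and the row labels and status strings are assumed to match the ones named in this document. It also shows why the Trivial exception needs no special case, since pre-marked rows already read as completed.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Task:
    """Hypothetical view of one TaskList row."""
    label: str   # e.g. "Step 3: Plan Review" or "Step 3-2"
    status: str  # assumed values: "pending", "in_progress", "completed"


def incomplete_step3_items(tasks: list[Task]) -> list[str]:
    """Return labels of Step 3 rows that still block ExitPlanMode."""
    return [
        t.label
        for t in tasks
        if (t.label == "Step 3: Plan Review" or t.label.startswith("Step 3-"))
        and t.status != "completed"
    ]


# Usage: an empty result means ExitPlanMode may be issued; otherwise name
# every returned label in the inline note and return to Step 3.
rows = [
    Task("Step 3: Plan Review", "in_progress"),
    Task("Step 3-1", "completed"),
    Task("Step 3-2", "pending"),
    Task("Step 4: Finalize Plan", "pending"),
]
blocking = incomplete_step3_items(rows)
```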

Step 5: Implement

Plan entry self-check — user-side manual action extraction: before issuing the first implementation tool call, scan the approved Plan body (Overview / Decisions / Design / Test plan / Risks / Unknowns) for embedded user-side manual actions — environment-prerequisite probes the user must run themselves, configuration values the user must add to a config file outside the agent's write surface, external authentications / API keys / hook installations / OS-level changes the user must perform manually, manual verification steps the user must execute against external systems (for general software development this includes "run <command> and confirm output", "edit ~/<config> to set X=Y", "log in to <external dashboard> and authorize Z"; for skill development this includes ~/.claude/settings.json edits, hook installation, external CLI installation, or workspace-level config the user must place outside the repo). When at least one such manual action is present, emit a short independent block at the top of Step 5 — separately from any other implementation prose — listing each manual action verbatim with the Plan section it came from, before proceeding with the first implementation tool call. The block ensures the user sees the manual action items distinctly rather than discovering them buried inside long-running plan execution. When no manual actions are present (purely agent-executable plan), skip this block. The block is informational — Step 5 continues without waiting for user input on the manual actions themselves (the user-side observation gate at the probe → real-implementation boundary handles cases where the workflow needs to wait).
Follow the plan and track progress with the Task tools (TaskUpdate). When the Design is an ordered, numbered list of implementation steps (per references/plan-format.md § Template), you MAY register each step as an implementation sub-task and execute them in order, marking each completed as it lands (recommended for long ordered plans, optional for short ones). This is consistent with Step 1's "Implementation sub-tasks in Step 5 are additions, not replacements" rule and does not change the Phase-boundary self-audit (which governs only top-level Step transitions). Apply custom_instructions throughout implementation.
Respect prior in-session edits: content the user explicitly removed earlier in this session (comments, guards, logs) must not reappear. Treat deletion as authoritative, not as a gap to fill. This discipline applies when applying plan steps, when applying Step 6 tidy output, and when applying Step 8 review fixes, because the reviewer/tidy subagents only see the diff and cannot enforce this themselves.
Late-stage scaffolding self-audit: when implementation introduces a structural element that was not present in the Step 2 plan (a new sub-step, an additional enum value, a new branch arm, an additional call site that invokes the same callee at a new location, a new recovery / fall-through path; for skill development this includes a new SKILL.md sub-step, an additional status enum in a return contract, a new error class, a new disposition mapping row), re-apply the same Step 2 § Simplicity self-audit rigor to the newly introduced element before moving on. Sub-checks (i)–(iv) fire only when a new structural element is introduced; sub-check (v) fires unconditionally for every .md file edit in the diff: (i) sibling symmetry: when the new element parallels existing sibling elements, verify same fields / same disposition values / same error-class coverage; (ii) error-path symmetry: for any success path introduced, trace its corresponding failure path explicitly (counter increment vs. non-increment, success-only vs. failure-included); (iii) boundary-value coverage: for any predicate, threshold, or count introduced, trace the boundary cases (empty input, all-same-classification, mixed-classification) and verify the predicate truth value matches design intent; (iv) reference-site sweep: if the new element is referenced from prose elsewhere in the file, verify those references use stable phrase anchors (not raw sub-step numbers / branch letters); (v) Markdown block-element structural integrity: for each edited .md file (unconditional: applies regardless of whether a new structural element was introduced), scan newly added or changed content for adjacent block elements (list items, paragraphs, code fences, headings, blockquotes) that lack a separating blank line; when a gap is found, insert the missing blank line immediately (in the same Edit call) before moving on. Missing blank lines cause the following element to be parsed as a lazy continuation of the preceding block and rendered merged; the gap typically surfaces in skill-review as a mechanical edit (for skill development this includes SKILL.md, references/*.md, and README.md edits). The reviewer / tidy subagents see only the diff and cannot enforce this self-audit, so it must run in the main thread at Step 5. Late-stage scaffolding correctness gaps surfacing first at Step 8 Code Review iter 1 indicate this audit was skipped.
Final-pass literal-value full-repo grep: at Step 5 completion (after all planned edits are applied and before advancing to Step 6 Tidy), for each literal value the plan replaced or introduced — numeric constants (threshold values, version numbers, magic numbers), token strings (status enum values, config keys, identifiers), file path fragments, or any other literal whose semantics are tied to a specific value — grep the entire repository (not just the plan's enumerated sweep targets) for the old value and confirm zero hits. The Plan's Test plan typically enumerates known sweep sites, but narrative examples embedded in prose (illustrative numbers in descriptive text, story-style usage examples in SKILL.md / references/*.md / README files) routinely sit outside the enumerated list and silently retain the old value through mechanical search-and-replace passes. Multi-stage structure: (i) enumerated sites — the explicit list from the Plan's Test plan, verified one by one; (ii) final-pass full-repo grep — git ls-files | xargs grep -l <old-value> (or equivalent) for any residual hits outside the enumerated list, with each hit reviewed in context and either updated (if it carries the old value's semantics) or marked as out-of-scope (e.g. a different concept that coincidentally uses the same literal); (iii) alias and derived-form sweep — for rename and migration tasks, additionally grep for mechanically-derivable aliases and derived forms of the old value (abbreviated forms, synonyms, or alternate identifiers the codebase uses interchangeably to refer to the same concept); when the derived-form set can be enumerated upfront, list and grep each form before declaring the sweep complete; grepping only the exact old value misses same-concept usages expressed under an alternate spelling (for skill development this includes aliased import names, short-form identifiers referenced in SKILL.md narrative examples, and config-key abbreviations). 
The reviewer / tidy subagents see only the diff and cannot enforce the full-repo sweep, so it runs as a Step 5 completion gate (for skill development this includes literal numeric thresholds cited in references/*.md narrative examples, version strings in README usage snippets, and example values in compaction / extract-rules-style descriptive prose). Authoritative-tool cross-check for load-bearing enumeration claims: when the completeness of the enumeration is mechanically verifiable by a downstream authoritative tool — a compiler or type checker reporting all affected call sites for a changed type or interface, a language server returning all references for a renamed symbol (for skill development, when no compiler or language server is available, the authoritative check is a structured manifest audit: e.g. jq against marketplace.json to enumerate all skills array entries affected by a renamed skill, or a targeted Grep scoped to SKILL.md / plugin.json / marketplace.json for all hook-firing paths or all subagent dispatch routes that a configuration change affects — when no such structured verification is available, the grep-only pass is sufficient) — cross-check the grep results against that tool's output; if the tool reports additional sites that grep missed, treat those as unresolved hits and apply the same two-option disposition defined above for the full-repo grep pass (update if it carries the old value's semantics; mark out-of-scope if it coincidentally uses the same literal for a different concept) — and complete this resolution before presenting the enumeration as confirmed. Search filter prefix/anchor errors can silently drop matches with no error signal; authoritative-tool verification catches these missed cases. When both grep and an authoritative tool are available, treat the tool output as the oracle. Non-literal-replacement tasks skip this audit.
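
The final-pass sweep, including the alias and derived-form stage, can be approximated with a small scanner. This is a sketch under stated assumptions, not the workflow's own tooling: `residual_hits` is a hypothetical helper, it walks the directory tree instead of reading `git ls-files` (so it also sees untracked files), and it uses plain substring matching, which means every hit still needs the in-context review described above.

```python
from pathlib import Path


def residual_hits(root: Path, old_forms: list[str]) -> dict[str, list[tuple[int, str]]]:
    """Map each file under root to (line number, form) pairs that still
    mention one of the old value's forms."""
    hits: dict[str, list[tuple[int, str]]] = {}
    for path in sorted(root.rglob("*")):
        # Skip directories and anything inside the .git metadata directory.
        if not path.is_file() or ".git" in path.parts:
            continue
        try:
            text = path.read_text(encoding="utf-8")
        except (UnicodeDecodeError, OSError):
            continue  # skip binary or unreadable files
        rel = path.relative_to(root).as_posix()
        for lineno, line in enumerate(text.splitlines(), start=1):
            for form in old_forms:
                if form in line:
                    hits.setdefault(rel, []).append((lineno, form))
    return hits
```

Passing the exact old value together with its enumerated derived forms in `old_forms` covers stages (ii) and (iii) in one pass; an empty result corresponds to the "zero hits" condition.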
Pre-write path scope check (Write / Edit / new-file path safety): before every Write / Edit / similar file-creation tool call whose file_path argument does not match a path that already exists in git ls-files output (typically: new files generated by Step 5 / Plan rewrite / staging document creation / new test fixtures / new CHANGELOG entries — file paths that the tool will create rather than modify-in-place), run a two-stage path verification before issuing the tool call: (i) repo-root containment — verify the absolute resolved path sits under git rev-parse --show-toplevel (no ../ escape from the working directory, no absolute path leading outside the repo); (ii) prefix sanity — verify the path's leading directory matches an expected location for its content class (.claude/plans/ for plan documents, skills/<name>/ for skill content, src/ or tests/ or equivalent for code, .triage/ or tmp/ for staging, etc.). If either check fails, abort the tool call with a fail-loud diagnostic naming the resolved path and the expected prefix set, rather than silently creating the file. The allowed-tools permission grant alone does not prevent parent-directory landing (Write accepts any string file_path), so a procedural pre-check is the only structural defense against typo-induced orphaned files (for general software development this includes accidental migration / config / test-fixture writes landing one directory up; for skill development this includes .claude/plans/<slug>.md typos depositing files at ../<slug>.md, marketplace.json paired-bump operations writing to the wrong manifest, or staging documents landing outside .triage/ / .claude/). If a tool call has already created a file in the wrong location, instruct the user to delete it manually — the workflow's auto-mode classifier cannot reach files outside the project scope, so manual cleanup is the only path.
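
The two-stage verification above maps onto a short path check. The following is a minimal sketch, assuming the repository root has already been obtained (for example from `git rev-parse --show-toplevel`) and is passed in; `check_new_file_path` and `EXPECTED_PREFIXES` are hypothetical names, and the prefix set is only the illustrative list from the paragraph above, not a complete policy.

```python
from pathlib import Path

# Illustrative prefix set taken from the examples above; a real project
# would substitute its own expected locations.
EXPECTED_PREFIXES = (".claude/plans", "skills", "src", "tests", ".triage", "tmp")


def check_new_file_path(repo_root: Path, file_path: str) -> Path:
    """Return the resolved path if it passes both checks, else raise ValueError."""
    root = repo_root.resolve()
    resolved = (root / file_path).resolve()

    # (i) repo-root containment: reject ../ escapes and outside absolute paths.
    if not resolved.is_relative_to(root):
        raise ValueError(f"{resolved} is outside the repository root {root}")

    # (ii) prefix sanity: the leading directory must be an expected location.
    rel = resolved.relative_to(root).as_posix()
    if not any(rel.startswith(prefix + "/") for prefix in EXPECTED_PREFIXES):
        raise ValueError(
            f"{resolved} does not start with any expected prefix: {EXPECTED_PREFIXES}"
        )
    return resolved
```

Raising with both the resolved path and the expected prefix set mirrors the fail-loud diagnostic the check calls for. `Path.is_relative_to` requires Python 3.9 or later.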
User-observable artifact protection gate at probe → real-implementation boundary: when the Plan explicitly stages an implementation as probe / intermediate-artifact → real-implementation replacement (e.g. "first emit a debug-instrumented version for user to observe, then replace with the production implementation", "scaffold a placeholder file the user will manually inspect, then overwrite with the final content", "log expected probe output as a verification step, then remove the logging"), do not advance to the real-implementation step until the user has signaled observation completion. The probe-output observation gate is the only user-side wait state permitted inside Step 5 — every other Step 5 sub-step proceeds autonomously per § No-Stall Principle. When the probe is committed to disk and the user has not yet acknowledged observation, hold the workflow at this boundary and emit a one-line wait prompt in the resolved language (e.g. Probe artifact deployed at <path> — please observe its output before the workflow replaces it with the final implementation. Reply when ready to proceed.). Resume the real-implementation step on any non-empty user reply. When no probe → real sequence is in the Plan (typical case — purely incremental implementation), this gate does not fire (for general software development this includes debug-log-instrumented scaffolds replaced by clean production versions, mock-data fixtures replaced by real-data fetches; for skill development this includes verbose-tracing skill versions replaced by streamlined final versions). The gate exists to prevent the probe artifact from being silently overwritten before the user has had a chance to inspect it — a failure mode the No-Stall Principle's autonomy guarantee otherwise creates.

AskUserQuestion option design (applies to the probe gate above and any future user-state-query call in this workflow): when the workflow uses AskUserQuestion (or any equivalent multi-option user-query tool) to query the user about a plan-derived state — probe-execution outcome, manual-verification result, environment-prerequisite check, or any equivalent state confirmation — the options list MUST include a meta-confusion branch alongside the result enumeration. Concretely, do not present only outcome categories (e.g. success / failure / skipped); also include an option phrased as "the procedure / expected outcome is not yet understood (please re-explain)" in the resolved language (e.g. language: ja: 手順 / 期待結果がまだ把握できていない（要再説明）; language: en: the procedure / expected outcome is not yet clear (please re-explain)). The meta-confusion branch absorbs the "I cannot answer the question as posed" state — without it, the user is forced into Other free-text and the workflow consumes an extra clarification turn re-explaining what was already in the Plan. General principle: user-state queries enumerate outcomes AND leave a fallback for the premise-not-conveyed case, never outcomes alone (for general software development this includes deployment-readiness queries, migration-completion confirmations, external-system-state checks; for skill development this includes probe-result queries inside this Step 5 gate, callee-execution-outcome confirmations, manual-config-applied verifications).
Derived-value claim deferral: when deliverable prose embeds a value derived from content that later phases can still change — a size claim about a generated artifact, an item or step count, or any other body-derived figure (for skill development this includes char-count claims about SKILL.md / references/*.md and step-count mentions in CHANGELOG entries or descriptive prose) — do not finalize that value during Step 5. Keep it as a clearly-marked provisional value (e.g. render the figure as <provisional — finalized at Step 10 entry> so the placeholder is grep-able at the application point) and compute + write the final figure exactly once at the last gate where the source content is settled — the plan-deferred bookkeeping application point at Step 10 entry (the deferred-bookkeeping paragraph at the top of references/interactive-commits.md, applied before its § Collect changes step collects the working tree), after Step 6 Tidy, Step 7.5 fixes, Step 8 review fixes, and any Step 9 hooks.on_complete working-tree modifications have all landed. Re-verifying and re-correcting the figure after every downstream phase that touches the body is the anti-pattern this item forbids — each chase is an avoidable rework turn. When interactive_commits: false (Step 10 is omitted and execution proceeds directly from Step 9 to Step 11 — see § Step 10: Interactive Commits), the Step 10 entry gate never occurs: finalize the figure at the same settledness point — immediately after Step 9 completes or is skipped, before proceeding to Step 11 — so the provisional marker never survives into the final tree.

Step 6: Tidy

Implementation often introduces unnecessary complexity that's easier to spot in a dedicated pass after the code works.

Pre-dispatch rename-sweep self-audit: if the Implement diff (since <base-commit> recorded in Step 2) includes a term-rename operation — a search-and-replace across the project that swapped a step name, callee name, config key, identifier, or domain concept for a new one — sweep the changed-path SKILL.md / references/*.md / README prose for synonyms and derived forms of the rename target before dispatching the Step 6 cleanup skill (the resolved simplify or tidy), and fix any residue inline. General principle: mechanical search-and-replace leaves synonym / derived-form residue that the substitution alone cannot catch — gerund forms when a verb is renamed, nominalizations and related-noun forms when an action is renamed, conceptual paraphrases of the original term in surrounding description text when a step or concept is renamed (for skill development this includes renaming a procedural verb leaving its -ing form in description prose, renaming a step leaving the prior step-concept paraphrase in cross-section reference text, or renaming a callee leaving the old concept noun in doc-comment / SKILL.md narrative). Detect at this Step 6 so the Completion-time integrity check (Step 8 reviewer / hooks.on_complete) remains a backstop rather than the primary detection point. Non-rename diffs skip this audit.
Dispatch the cleanup skill (resolved per § Prerequisites' Cleanup skill bullet): review changed code for reuse, quality, and efficiency, then apply cleanup edits.
- Primary — Skill(simplify): invoke Skill(simplify). Its argument interface is unverified (built-in skill, no on-disk SKILL.md), so do not assume any tidy-specific field — pass no scope argument (Base ref / --base-commit; simplify auto-scopes to the changed working-tree code), and when custom_instructions is set, pass it only as a short best-effort natural-language hint (simplify may ignore it; do not name a Custom instructions field it may not expose). Omit it entirely when custom_instructions is unset or empty (per § Step 2's Sub-skill natural-language argument minimalism note).
- Fallback — Skill(tidy): Do not pass Base ref / --base-commit <sha> — tidy's default working-tree mode is the intended scope here (covers tracked + staged + untracked changes per tidy's § Invocation contract); passing Base ref would switch tidy to committed-history mode and silently drop untracked files from the cleanup scope, even though sibling Steps (Step 7's test_commands, Step 7.5's Skill(rules-review)) invoke their callees with --base-commit <sha>. This Base ref asymmetry rationale is scoped to the tidy path only — the simplify path above passes no scope argument regardless. Pass the workflow's custom_instructions config value through tidy's natural-language Custom instructions field (omit the field entirely when custom_instructions is unset or empty — do not render (none) / empty string / fabricated default). General principles: (i) when a caller-skill dispatch field is driven by an optional config key, state the absent-key behavior inline on the dispatch line rather than relying on cross-reference to the config-parse step; (ii) when a caller depends on a callee's default-mode behavior for scope correctness and sibling steps use a different argument convention, name the asymmetry on the dispatch line as load-bearing rather than implicit — the executor cannot rely on a default-by-omission when sibling steps create an extrapolation pull toward the explicit form.
Regardless of the outcome — whether the cleanup skill (simplify, or the tidy fallback) applied fixes, reported no actionable findings, or returned any other non-error result — mark Step 6: Tidy as completed and proceed to Step 7 automatically. Per the No-Stall Principle, do not wait for user input.
If the cleanup skill result is not observable (e.g. context compaction occurred during or immediately after the call): inspect git diff <base-commit>. If the diff contains changes clearly attributable to a cleanup pass, treat Step 6 as completed and proceed to Step 7. Otherwise (no cleanup-attributed changes visible, or ambiguous), re-execute the Step 6 cleanup skill once (the resolved simplify, or tidy if simplify is unavailable) — inspection-and-fix-class skills are idempotent — then proceed to Step 7.
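
The simplify-to-tidy resolution used by this step is defined by call outcomes: one call, one retry, then a noted fallback with no user gate. The sketch below models that control flow only. `invoke_skill` and `notify` are hypothetical callables standing in for the Skill() call and the one-line note, a failed call is assumed to surface as `RuntimeError`, and the note text is a paraphrase of the example in § Prerequisites.

```python
from typing import Callable


def run_cleanup(
    invoke_skill: Callable[[str], str],
    notify: Callable[[str], None],
) -> tuple[str, str]:
    """Return (skill name used, skill result) for the Step 6 cleanup pass."""
    for _attempt in range(2):  # the initial call plus exactly one retry
        try:
            return "simplify", invoke_skill("simplify")
        except RuntimeError:
            continue  # availability is judged by the call outcome
    # Both attempts failed: emit the one-line note, then fall back without
    # asking the user, since tidy is treated as functionally equivalent.
    notify("simplify unavailable, falling back to in-house tidy")
    return "tidy", invoke_skill("tidy")
```

Whatever non-error result comes back, the caller marks Step 6 as completed and proceeds to Step 7.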

Step 7: Check / Test (max 3 retries)

Run check_commands in order (always run all)
- On failure, fix and retry (do not proceed to test execution)
- Pre-execution scope-narrowing: before running each check command, assess whether it is a repo-wide auto-fix tool — a command that writes to files across the repository regardless of which files are in the task scope (e.g. a project-wide formatter, linter with --fix / --write, or bulk document transformer). If the command is a repo-wide auto-fix tool and the working tree contains files changed outside the task-scope snapshot (unrelated existing changes), narrow the command's scope to the task-scope snapshot files before running (e.g. pass the snapshot file list as explicit path arguments if the tool supports it). If scope narrowing is not feasible given the tool's interface, stop and ask the user for direction before running the command — options: run the command accepting the full-width effect, skip the command, or provide an alternative scoped invocation. The Scope-drift guard below is the second safety net for cases where pre-execution assessment is not feasible.
- Scope-drift guard: before each command, record git diff --name-only <base-commit> as the task-scope snapshot (the file set scoped to this task at the start of Step 7). After the command, re-check — any file newly appearing outside that snapshot was written by the command (auto-fix/write behavior sweeping unrelated drift). If scope drift is detected, classify the out-of-scope changes before acting: if all of the following hold — (i) the out-of-scope diff is whitespace or comment changes only (no code-skeleton changes: no non-blank, non-comment lines added or removed), (ii) the total changed line count across all out-of-scope files is ≤ 5, and (iii) the changes are attributable to the formatter or linter that just ran (the command is a known formatter/linter, e.g. lint:fix, format, prettier, black) — then proceed automatically without a user-direction stop: emit a one-line note (e.g. Scope-drift note: <file>(s) received whitespace-only formatting from <command> — proceeding) and continue to the next command. Otherwise (non-trivial drift): warn the user (list both the in-scope files and the newly-appeared out-of-scope files), do not auto-revert / git checkout / delete the out-of-scope changes (leave the working tree as the command left it for user inspection), leave Step 7: Check / Test as in_progress, and wait for user direction. This is a step-internal stop directive — one of two allowed non-completing exits from the check_commands phase (the other being the pre-execution scope-narrowing infeasibility stop above) — and is consistent with the No-Stall Principle, which permits explicit step-defined stops
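The three-part triviality test in the Scope-drift guard can be sketched as a small classifier. This is only an illustration under stated assumptions: the function name, the return labels, the `changed_after` data shape, and the comment prefixes are hypothetical and not part of the workflow. It reads criterion (i) literally, so every added or removed line must be blank or a comment.

```python
from __future__ import annotations

# Hypothetical sketch: names and data shapes are assumptions, not workflow API.
KNOWN_FORMATTERS = {"lint:fix", "format", "prettier", "black"}  # examples named in the text
COMMENT_PREFIXES = ("#", "//")  # assumption: adjust per language


def is_blank_or_comment(line: str) -> bool:
    """True when the line is empty, whitespace-only, or a comment."""
    stripped = line.strip()
    return stripped == "" or stripped.startswith(COMMENT_PREFIXES)


def classify_scope_drift(
    snapshot: set[str],
    changed_after: dict[str, list[str]],
    command: str,
) -> str:
    """Return 'no-drift', 'proceed-with-note', or 'stop-and-ask'.

    snapshot: files listed by `git diff --name-only <base-commit>` before the command.
    changed_after: for each file in the post-command diff, the added/removed
        lines (without +/- markers).
    """
    out_of_scope = {p: lines for p, lines in changed_after.items() if p not in snapshot}
    if not out_of_scope:
        return "no-drift"
    all_lines = [line for lines in out_of_scope.values() for line in lines]
    trivial = (
        all(is_blank_or_comment(line) for line in all_lines)  # criterion (i)
        and len(all_lines) <= 5                                # criterion (ii)
        and command in KNOWN_FORMATTERS                        # criterion (iii)
    )
    return "proceed-with-note" if trivial else "stop-and-ask"
```

On `stop-and-ask` the working tree is left untouched and Step 7 stays in_progress, as the guard requires; the sketch only decides which branch applies.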

Initialize tracking (at every Step 7 entry — pass or not — unconditionally before the availability branch, so the unavailable / skip / re-run paths never read an uninitialized variable): rules_review_launched = false and rules_review_stale = false. Re-initializing on a non-pass entry (the Step 7-only re-run inside Step 7.5's fix flow) is harmless: that entry occurs only after the current pass's Step 7.5 sub-step 1 collect has already consumed the result. Lifecycle: this bullet is the only init site; the only set sites are the If available bullet (rules_review_launched) and the discard bullet below (rules_review_stale); the skip / unavailable paths set neither; Step 7.5 sub-step 1 is the only read site.
Availability detection: inspect the current tool list the same way rules-review SKILL.md § 5 detects Agent availability — do not make a speculative call. The capability gated here is specifically background dispatch (Agent with run_in_background), not a bare foreground Agent. Positive criterion: background dispatch is available when the Agent tool is exposed (top-level or via ToolSearch) AND the session offers an Agent run_in_background parameter or equivalent async-dispatch mechanism — the common case in a standard interactive session, so default to parallel. Treat it as unavailable only when one of these two signals holds (closed list — if neither holds, choose parallel): (a) Agent is absent (this also covers this skill running inside a non-recursing subagent, surfacing as nested Agent being unavailable); (b) Agent is exposed but the session offers no background/detached dispatch capability (e.g. an older Claude Code). This two-item list is the single definition of "unavailable" the If unavailable branches below refer to. Do not treat "unsure" as "unavailable": if a background-dispatch capability is present, choose parallel.
If available (and this Step 7 entry is a pass per the definition above — note that on a Trivial task Step 2 pre-completes Step 7.5 under the difficulty-skip matrix, so no Step 7.5 sub-step 1 collect follows and the entry is not a pass; skip the launch, same orphan-avoidance as the code-review launch's N_code=0 handling): dispatch a background subagent (Agent with run_in_background: true, subagent_type: general-purpose, plus model: <subagent_model> when the Step 2-resolved subagent_model is a model id — omit model when it is inherit) instructed to run Skill(rules-review) --base-commit <sha> — including the cross-layer review handoff ledger as a short context item (per § Step 6's Cross-layer review handoff ledger paragraph; omit when the ledger has no recorded dispositions) — and return the findings report verbatim, applying no edits (the main thread applies fixes in Step 7.5). On a successful dispatch set rules_review_launched = true. Emit a Progress Visibility status line in the same turn. Include in the dispatch payload a note that nested Agent is unavailable in this subagent context — execute Skill(rules-review) directly without probing for sub-subagent availability (the § 5 inline fallback applies automatically; no runtime discovery is needed). Collect the report from the background Agent's completion notification (no extra tool needed).
If unavailable (per the Availability detection criterion above): skip the launch (rules_review_launched stays false) — Step 7.5 invokes Skill(rules-review) sequentially as before (fully backward-compatible).
If test_commands then fail and you fix them (the diff changes): discard both background results — the rules-review result (set rules_review_stale = true; Step 7.5 sub-step 1 then falls back to a fresh sequential Skill(rules-review) dispatch) and the code-review result (set code_review_stale = true; see the next paragraph) — since the prior analyses are now stale. No-op on the no-launch path: setting rules_review_stale = true when rules_review_launched == false has no effect — Step 7.5 sub-step 1's collect branch is gated on rules_review_launched == true, so the unconditional set is safe on every path. Handle disposition: a stale (or never-collected) background rules-review result is simply ignored at the Step 7.5 collect point — no explicit cancellation of the background subagent is owed.
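The availability criterion and the lifecycle of the two rules-review flags can be modeled as follows. This is a minimal sketch, assuming the two availability signals are already known as booleans; the class and method names are hypothetical, and only the flag semantics come from the text above.

```python
from dataclasses import dataclass


def background_dispatch_available(agent_exposed: bool, has_background_dispatch: bool) -> bool:
    """Closed list of 'unavailable' signals: (a) Agent absent, (b) no background dispatch."""
    return agent_exposed and has_background_dispatch


@dataclass
class RulesReviewTracking:
    """Re-created at every Step 7 entry, so both flags always start as False."""
    launched: bool = False  # rules_review_launched
    stale: bool = False     # rules_review_stale

    def on_dispatch_success(self) -> None:
        # Only set site for `launched`: the "If available" branch.
        self.launched = True

    def on_test_failure_fix(self) -> None:
        # Unconditional set; has no effect on routing when nothing was launched.
        self.stale = True

    def collect_background_result(self) -> bool:
        # Step 7.5 sub-step 1: collect only a launched, still-fresh result.
        return self.launched and not self.stale
```

When `collect_background_result()` is false, Step 7.5 sub-step 1 falls back to a fresh sequential Skill(rules-review) dispatch, whichever flag caused it.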

Initialize tracking (at every Step 7 entry — pass or not — unconditionally before the availability branch, so the unavailable / skip / re-run paths never read an uninitialized variable): code_review_launched = false and code_review_stale = false. Re-initializing on a non-pass entry (the Step 7-only re-run inside Step 7.5's fix flow) is harmless: it rewinds the code_review_stale = true that Step 7.5 sub-step 3.a just set, but code_review_launched is also reset to false, and the Step 8 collect branch requires code_review_launched == true — both states route Step 8 sub-step 1 to a fresh sequential dispatch, so the routing is equivalent. Lifecycle: this bullet is the only init site; the only set sites are the If available bullet (code_review_launched) and the staleness set-sites named in the Staleness bullet below (code_review_stale); the skip / unavailable paths set neither; Step 8 sub-step 1 is the only read site. Because every continuation path to a next Step 8 iteration item passes through a Step 7 re-entry (and therefore through this re-initialization), each pass's launch is collected at most once (the Step 8 sub-step 1 collect bullet names which iteration collects).
Availability detection: use the rules-review launch's detection above verbatim — its positive criterion (default to parallel in the common interactive case), its two-item closed list defining "unavailable", and its "do not treat 'unsure' as 'unavailable'" directive all apply here unchanged. The gated capability is background dispatch (Agent with run_in_background), not a bare foreground Agent.
If available (and this Step 7 entry is a pass per the shared definition): launch only when a pending Step 8 iteration item remains to collect the result after this pass — on the initial pass this means N_code ≥ 1 (iteration 1 collects; when N_code = 0 / Trivial, Step 8 is skipped entirely and no iteration item exists); on a re-run pass it holds only when the fix-applying iteration k satisfies k < N_code (a re-run triggered from the final iteration k = N_code leaves no pending iteration item, so a launch there would be an orphan dispatch with no collector — skip it; same orphan-avoidance vocabulary as the rules-review paragraph's non-pass rationale). Dispatch a background subagent (Agent with run_in_background: true, subagent_type: general-purpose, plus model: <subagent_model> when the Step 2-resolved subagent_model is a model id — omit model when it is inherit) instructed to run Skill(<reviewer>) with the same payload Step 8 sub-step 1 would compose for the next pending iteration item — sub-step 1's review-payload definition (including its rubric-link resolution note and, on a re-run pass, the continuation item and the iteration-scope instruction) is the single parametric source; do not restate its list here. Omitting the continuation item on a re-run pass would hand the reviewer a context-free diff and re-surface already-rejected findings. The reviewer returns its report verbatim, applying no edits. On a successful dispatch set code_review_launched = true. Emit a Progress Visibility status line in the same turn. Include in the dispatch payload a note that nested Agent is unavailable in this subagent context — execute Skill(<reviewer>) directly without probing for sub-subagent availability (the reviewer's inline fallback applies automatically; no runtime discovery is needed). The result is collected at Step 8 sub-step 1, not here. 
The Step 7-only re-run inside Step 7.5's fix flow is not a pass and does not re-fire this launch; the rules-review paragraph's orphan rationale does not transfer here (the code-review collect point is Step 8, where a collector exists even for that path) — overlapping that path simply stays out of scope.
If unavailable (per the Availability detection criterion above): skip the launch (code_review_launched stays false) — Step 8 dispatches the reviewer sequentially as before (fully backward-compatible).
Staleness — discard owned by Step 8 sub-step 1: this background result is speculative. The discard decision lives at Step 8 sub-step 1 (it reads code_review_stale); this paragraph only names the set sites. code_review_stale is set true whenever an edit lands between this pass's launch and the Step 8 collect point that changes the diff the reviewer analyzed: (i) a test_commands failure fix during Step 7 (the discard bullet above), or (ii) any fix Step 7.5 applies (see Step 7.5). Both set-site descriptions are pass-independent and apply to every pass unchanged. The condition is broader than the rules-review launch's (whose collect point is Step 7.5, before Step 7.5's own fixes land); the code-review launch's collect point is Step 8, after Step 7.5's fixes land, so Step 7.5 fixes also count. No-op on the no-launch path: when code_review_launched == false (the launch was skipped or unavailable), setting code_review_stale has no effect — Step 8 sub-step 1's collect branch is already gated on code_review_launched == true, so the no-launch path dispatches the reviewer fresh regardless of the flag's value; the unconditional code_review_stale = true set-sites are therefore safe to execute on every path. Handle disposition: a stale (or never-collected) background result is simply ignored at the Step 8 collect point — no explicit cancellation of the background subagent is owed.
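The orphan-avoidance condition for the code-review launch reduces to a small predicate. The sketch below assumes availability has already been confirmed and represents the initial pass as `fixing_iteration=None`; both function names and that encoding are hypothetical.

```python
from typing import Optional


def should_launch_code_review(is_pass: bool, n_code: int, fixing_iteration: Optional[int]) -> bool:
    """Launch only when a pending Step 8 iteration item remains to collect the result.

    fixing_iteration: None on the initial pass, otherwise k, the iteration whose
    fixes triggered this re-run pass.
    """
    if not is_pass:
        return False              # the Step 7-only re-run inside Step 7.5 never launches
    if fixing_iteration is None:
        return n_code >= 1        # iteration 1 collects; N_code = 0 skips Step 8 entirely
    return fixing_iteration < n_code  # k = N_code would leave an orphan dispatch


def collecting_iteration(fixing_iteration: Optional[int]) -> int:
    """Iteration that collects this pass's launch: 1 initially, k + 1 on a re-run pass."""
    return 1 if fixing_iteration is None else fixing_iteration + 1
```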

After launching (or skipping) both, run test_commands in the main thread per sub-step 2 below; the background rules-review and (when launched) the code review proceed concurrently.

Iterate over test_commands in order. For each entry (which must be of the form Skill(<name>)), invoke that skill with --base-commit <sha> (from Step 2) via $ARGUMENTS. Each invocation must return a structured summary with one of three statuses (SUCCESS / TEST_FAILED / EXECUTION_ERROR); a TEST_FAILED or EXECUTION_ERROR from any entry halts the loop immediately and triggers the retry path in sub-step 3 — subsequent entries do not run on the failing pass.
- Each test skill handles scope decision and test execution internally via subagent (when applicable)
- Returns structured summary: SUCCESS / TEST_FAILED / EXECUTION_ERROR
- Bulk-vs-split execution: when the change is cross-cutting (shared components, mirrored services, or parallel handlers) and the test suite includes long-duration categories (E2E, integration tests with external dependencies), prefer passing scoped or split arguments rather than requesting a single bulk run. A single command bundling long-running jobs makes intermediate progress opaque and failure recovery harder — scope-targeted execution lets each category succeed or fail independently.
- Shared-path re-run scope: when a fix touches a shared path — a utility, helper, or function invoked by multiple distinct test suites (for skill development this includes subagent dispatch shared forms, hook wiring, state-file processing, or any cross-suite path) — include all suites that exercise that path in the re-run scope, not just a representative suite. A green representative suite proves only the paths it exercises; when the changed code is on a shared path, every suite that routes through it is a potential regression surface. When running all affected suites is impractical, record the excluded suites explicitly in the Completion summary as uncovered risk rather than treating the representative re-run as sufficient verification.
- Pre-existing vs regression discrimination: before entering the retry path on TEST_FAILED / EXECUTION_ERROR, discriminate each reported failure as regression (introduced by this run's changes) or pre-existing (already failing at <base-commit> from Step 2). Two paths: (i) if the invoked test skill's structured summary already classifies failures as pre-existing / regression (recommended return-contract extension for any verification-class skill — lint, test runners, structural checkers, marketplace validators), trust that classification. (ii) Otherwise, re-run the same test skill against <base-commit>: stash the working changes (git stash --include-untracked), check out <base-commit> into a scratch worktree (git worktree add ../base-commit-check <base-commit>) or rely on the test skill's own --base-commit argument if it supports re-evaluating at that ref without working-tree manipulation; compare the failures. Failures reproducing at <base-commit> are pre-existing — record as an informational warning in the summary (pre-existing failure: <skill> / <case> — out-of-scope for this PR) and do not count toward the 3-retry budget and do not auto-fix. Only failures that do not reproduce at <base-commit> are regressions — proceed with the existing retry / fix path. General principle: regression-vs-pre-existing discrimination via base-commit comparison applies to any verification step running a checker against a working tree (lint, test, structural validator — for skill development this includes marketplace structure validation and plugin integrity checks where docs and implementation can disconnect independently of the current change).
- EXECUTION_ERROR + pre-declared degraded procedure: when a test invocation returns EXECUTION_ERROR AND the approved Plan explicitly pre-declared a degraded procedure for this failure mode (e.g. a Risks entry naming the environmental constraint and a fallback verification path), apply the degraded procedure automatically — execute the fallback, emit a one-line note in the resolved language (e.g. Step 7: EXECUTION_ERROR — applying pre-declared degraded procedure: <procedure-summary>), and continue without consuming a retry. Pre-declared degraded procedures are user-approved accommodations for predictable environmental constraints; routing them through the retry-and-stop path contradicts the plan's prior approval and violates § No-Stall Principle. When no degraded procedure is pre-declared, treat EXECUTION_ERROR as before (trigger the existing retry / fix path).
After 3 retries, report to the user and stop.
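Two mechanics from this phase lend themselves to a short sketch: the halt-on-first-failure loop over test_commands, and the base-commit comparison that separates regressions from pre-existing failures. The function names, the `invoke` callback, and the use of plain strings as failure identifiers are assumptions made for illustration; a real test skill returns a richer structured summary.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional


def run_test_commands(
    skills: list[str],
    invoke: Callable[[str], str],
) -> tuple[Optional[str], str]:
    """Invoke each test skill in order and stop at the first non-SUCCESS status.

    Returns (failing_skill, status), or (None, "SUCCESS") when every entry passes.
    """
    for name in skills:
        status = invoke(name)  # expected: SUCCESS / TEST_FAILED / EXECUTION_ERROR
        if status != "SUCCESS":
            return name, status  # later entries do not run on the failing pass
    return None, "SUCCESS"


def split_failures(
    current: Iterable[str],
    at_base: Iterable[str],
) -> tuple[list[str], list[str]]:
    """Split current failures into (regressions, pre_existing).

    A failure that also reproduces at <base-commit> is pre-existing: it is
    reported as a warning, is not auto-fixed, and does not consume a retry.
    """
    base = set(at_base)
    now = set(current)
    regressions = sorted(f for f in now if f not in base)
    pre_existing = sorted(f for f in now if f in base)
    return regressions, pre_existing
```

Only a non-empty `regressions` list would send the run into the retry / fix path.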

Coverage note (TypeScript multi-tsconfig): For projects with Project References or multiple tsconfig*.json files, a single tsc --noEmit may miss changed files that belong to other tsconfigs. --init auto-registers a per-tsconfig tsc -p <path> --noEmit in this case (see references/init-mode.md for detection rules). If coverage still looks incomplete, re-run --init or append the missing command manually.
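As a rough illustration of the per-tsconfig registration, the helper below turns a list of tsconfig paths into one check command each. The helper name and the sorted ordering are assumptions; the actual detection rules live in references/init-mode.md and are not reproduced here.

```python
from __future__ import annotations


def per_tsconfig_checks(tsconfig_paths: list[str]) -> list[list[str]]:
    """Build one `tsc -p <path> --noEmit` argv per tsconfig file (hypothetical helper)."""
    return [["tsc", "-p", path, "--noEmit"] for path in sorted(tsconfig_paths)]
```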

GATE: Verify Steps 2-7 are completed (check task status via TaskList; if status is inconsistent, verify actual completion by reviewing work done). Mark Step 7.5 as in_progress unless Step 2 pre-completed it under the difficulty-skip matrix (Trivial tier) — in that case the row is already completed; do not re-mark it in_progress, skip straight to Step 8 (same already-completed-row handling as the Step 8 GATE's N_code=0 case). (If Step 7 launched a background rules-review, it may still be in flight — Step 7 is "complete" once the test phase passes; Step 7.5 sub-step 1 collects the rules-review result.)

Step 7.5: Rules Compliance Review

Dedicated rules compliance check, separate from code review (Step 8). This ensures rule enforcement gets focused attention rather than competing with correctness and design concerns.

Responsibility scope (so the same rule class is not double-reviewed across passes and no class slips through every pass):

Step 7.5 owns the mechanical walk of every matched .claude/rules/ rule against the diff — hard rules (explicit prohibitions, naming, reference form, import paths, placement, file structure) are evaluated strictly; intent-style rules (judgment-based principles, prose conventions) are evaluated best-effort with low-confidence markers per rules-review SKILL.md.
Step 6 Tidy covers reuse, prose quality, dead code, and redundancy; rule compliance is not its primary responsibility — if the Step 6 cleanup skill (Skill(simplify) or the Skill(tidy) fallback) surfaces rule findings as a side effect, treat them as bonus and do not extend its reviewer prompt to take on .claude/rules/ walks.
Step 8 Code Review covers correctness, edge cases, conventions / consistency lightly (a safety-net pass for files modified after Step 7.5), and simplicity / maintainability — the thorough rules check stays at Step 7.5.
Step 11 Update Rules owns the rule-doc-drift class: findings where the code under review is internally consistent with itself (and with the broader file's existing pattern across 3+ call sites per rules-review SKILL.md's drift detection criteria) but the rule document describes different behavior — i.e. the rule text has gone stale relative to the code. Step 7.5 surfaces this class via the reviewer's Classification: rule-doc-drift finding and does not apply a code fix; the disposition is to route the rule-text update to Step 11 (Skill(extract-rules)) rather than rewriting the code to match a stale rule. When rules-review returns a finding tagged Classification: rule-doc-drift, treat it as out-of-scope for Step 7.5's fix loop (no Skill(rules-review) re-run is required to clear it, since the code is the source of truth), record the routing intent so Step 11 picks it up, and continue.

When a rule violation is reported in both passes (Step 7.5 and Step 8), treat Step 7.5 as authoritative and skip the duplicate fix attempt in Step 8 to avoid double-counting in the iteration budget.

Obtain the rules-review report — collect the Step 7 background launch, or invoke directly. If Step 7's "Concurrent rules-review launch" dispatched a background rules-review this pass and it is still fresh (rules_review_launched == true and rules_review_stale == false), collect that background result now (it ran concurrently with the test phase). If the background subagent has not yet reported when you reach this point (the test phase finished first), wait for its completion notification before judging the report — this wait is a non-stalling return boundary (harness-tracked background work), not a user gate, and a not-yet-arrived notification must never be read as "no findings". If the collected background result is itself an error completion — the subagent died or returned something unusable as a rules-review findings report — treat it the same as not-launched and fall back to the fresh sequential dispatch below; this route only redirects (it does not mutate the launch/stale flags, so the lifecycle closed list is unchanged). Otherwise — rules_review_launched == false (background dispatch was unavailable, or the dispatch attempt did not succeed) or rules_review_stale == true (the background result was discarded after a test failure) — invoke Skill(rules-review) with --base-commit <sha> (base-commit recorded in Step 2) — and Model: <subagent_model> when subagent_model is a model id (omit when inherit), which rules-review applies to its internal per-category Agent dispatch — via $ARGUMENTS, including the cross-layer review handoff ledger as a short context item in the dispatch (per § Step 6's Cross-layer review handoff ledger paragraph; omit when the ledger has no recorded dispositions). Either way the report comes from the external rules-review skill: do not substitute an inline rules-walk based on perceived scope, change size, or any other self-judgment of the diff's complexity — small / "obvious" / single-file changes still go through the external skill. 
The skip-to-fallback path is documented in Prerequisites and fires only on objective skill unavailability (the Skill(rules-review) call itself fails after one retry), never on subjective judgment that an inline equivalent would suffice. The external skill enforces consistent coverage across runs; inline substitution silently degrades that coverage and the user has no visible signal that it happened.
Judge the result semantically: if the skill reports that there is nothing to act on — no actionable violations, no changed files, no applicable rules, no rule files found, or any other "nothing to report" outcome regardless of exact wording — mark Step 7.5: Rules Compliance Review as completed and proceed to Step 8 automatically. Per the No-Stall Principle, do not wait for user input and do not rely on exact-phrase matching; trust semantic judgment since the skill's phrasing may evolve across versions.
If violations are found:
a. Fix all reported violations. Applying these fixes changes the diff that the Step 8 code-review background launch (if any) analyzed, so set code_review_stale = true; Step 8 sub-step 1 then discards the now-stale background result and dispatches the reviewer fresh against the post-fix diff.
b. Re-run Step 7 (Check / Test) to ensure the fixes did not break anything (this sequential re-run does not re-fire Step 7's concurrent rules-review or code-review launches).
c. Re-run Skill(rules-review) with --base-commit <sha> for verification (2nd cycle). Apply the same semantic judgment as sub-step 2: if the re-run reports nothing actionable, mark Step 7.5: Rules Compliance Review as completed and proceed to Step 8 automatically (per the No-Stall Principle). When a 2nd-cycle verdict differs from the 1st on a specific location (a previously-flagged item now passes, or a previously-clean location is now flagged), record the reason inline in the Step 7.5 user-facing summary (1–2 lines per drifted location: which location, 1st-cycle verdict, 2nd-cycle verdict, why) before completing. Judgment drift between cycles is acceptable but must be explained; otherwise repeat-cycle stability cannot be assessed.
d. If violations still persist after the 2nd review cycle, present the remaining violations to the user for decision. Above the violations list, emit a summary preamble per references/plan-format.md § User-gate summary preamble. Render the violations following references/plan-format.md § Localization granularity in the resolved language. Wait for the user's response before marking completed. (This is one of the explicit user-gates enumerated in the No-Stall Principle.)

Mark Step 7.5: Rules Compliance Review as completed only after all violations are resolved or the user has decided on the remaining violations.
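The two-cycle control flow of Step 7.5 can be summarized in a few lines. In this sketch the semantic "nothing actionable" judgment is reduced to an empty list, the three callbacks are hypothetical stand-ins for the rules-review call, the fix pass, and the Step 7 re-run, and rule-doc-drift routing to Step 11 is left out for brevity.

```python
from __future__ import annotations

from typing import Callable


def run_rules_compliance(
    review: Callable[[], list[str]],
    fix: Callable[[list[str]], None],
    rerun_step7: Callable[[], None],
) -> str:
    """Return 'completed' or 'ask-user' after at most two review cycles."""
    violations = review()        # 1st cycle: collected or freshly dispatched report
    if not violations:
        return "completed"       # nothing actionable: proceed to Step 8
    fix(violations)              # a. fix; this is also where code_review_stale is set
    rerun_step7()                # b. sequential re-run, no concurrent launches
    remaining = review()         # c. 2nd-cycle verification
    return "completed" if not remaining else "ask-user"  # d. user gate
```

The `ask-user` result corresponds to the explicit user gate; there is deliberately no third automatic cycle.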

GATE: Verify Steps 2-7.5 are completed (check task status via TaskList; if status is inconsistent, verify actual completion by reviewing work done). Mark Step 8 as in_progress only when N_code ≥ 1; if Step 2 set N_code=0 (Trivial), Step 8 is already completed — do not re-mark it in_progress, skip straight to Step 9. (If Step 7 launched a background code review, it may still be in flight — Step 8 sub-step 1 collects it.)

Step 8: Code Review

Code review catches bugs, convention violations, and design issues that tests alone miss — skipping it risks shipping preventable defects. Always run this step even when tests pass cleanly.

Mark Step 8: Code Review as in_progress. Process each pending iteration item (Step 8-1 through 8-N_code) in order:

Mark the iteration item as in_progress. Obtain this iteration's reviewer report — collect the Step 7 background launch when it is fresh, otherwise dispatch the reviewer skill resolved in Step 1 (e.g. Skill(ask-peer)):
- Collect the Step 7 background launch when fresh: if code_review_launched == true and code_review_stale == false, the Step 7 "Concurrent code review launch" ran the reviewer in a background subagent concurrently with this pass's test phase — collect that result now as this iteration's reviewer report. Each pass's launch is collected at most once — by the first iteration item processed after that pass (iteration 1 on the initial pass; iteration k+1 on a re-run pass triggered from iteration k; derivation at the Initialize tracking bullet of Step 7's "Concurrent code review launch" paragraph). If the background subagent has not yet reported, wait for its completion notification before judging it, per the same non-stalling wait-boundary rule as Step 7.5 sub-step 1's background collect (a not-yet-arrived notification must never be read as "No actionable findings"). If the collected background result is itself an error completion, apply the same error-completion route as Step 7.5 sub-step 1's background collect — treat it as not-launched and dispatch fresh per the next bullet (the route only redirects; it does not mutate the launch/stale flags).
- Otherwise — dispatch fresh: when code_review_launched == false (background dispatch unavailable, the launch was skipped for this pass, or the dispatch attempt did not succeed), or code_review_stale == true (the diff changed since this pass's launch so the background result is stale), or when redirected here by the collect bullet's error-completion route, call the reviewer skill (e.g. Skill(ask-peer)) to review the code changes now. subagent_model propagation (inline fresh-dispatch only): propagate subagent_model exactly as in Step 3 (Claude-family reviewers only). This applies only to this inline fresh-dispatch path — the Step 8 background-launch path already carries subagent_model via the Step 7 launch's Agent model, so the two paths never double-apply. Pre-dispatch dispatch-boundary reminder: Issue the Skill(<reviewer>) call in the same turn as any accompanying status prose — never produce a standalone status turn before the Skill() call, as that creates a stall point. Reading the reviewer's SKILL.md is preparation, not dispatch; the Skill() call is the dispatch. In both paths the collecting iteration is an ordinary iteration (it judges and applies findings per sub-steps 2–3 below); the collect path only substitutes the report's source. The reviewer report addresses the following — this list is the single parametric source for both paths (it is the fresh-dispatch request, and the same payload the Step 7 background launch bakes; the continuation item and the iteration-scope instruction below each apply per their own conditions):
- Include git diff <base-commit> (base-commit recorded in Step 2) to capture all changes since workflow start
- Thorough rules compliance has been verified in Step 7.5, but instruct the reviewer to also flag any obvious .claude/rules/ violations as a safety net, especially for code modified after Step 7.5
- Request feedback organized into three categories: (a) Correctness & edge cases, (b) Conventions & consistency, (c) Simplicity & maintainability. These are labels only; the full per-category rubric lives in references/review-categories.md § Code review categories. Instruct the reviewer to read that section, resolving the link to a concrete readable path when composing the request, because the reviewer lacks the skill-directory context
- If custom_instructions is configured, include the instructions text in the review request and have the reviewer verify compliance and report conflicts
- If a state file is active (executing a subtask from a decomposition), include the current subtask's scope in the reviewer request: list the subtask's title and description, then list what the other subtasks cover (to define out-of-scope). Instruct the reviewer that missing functionality belonging to other subtasks is not an actionable finding for this code review — only findings scoped to the current subtask qualify. Omit this when no state file is active (single-task execution has no defined out-of-scope boundary).
- Include the cross-layer review handoff ledger as a short context item (per § Step 6's Cross-layer review handoff ledger paragraph; omit when the ledger has no recorded dispositions); it complements the same-layer continuation item below.
- If a prior Step 8 iteration completed this run (a re-run pass): include the continuation item — the summary of fixes made and rejections with reasons from the completed iterations, including any class-level sweep record — so the reviewer builds on prior rounds rather than re-raising already-rejected findings (the latest git diff <base-commit> is already the first item above). Omit this item on the initial pass (no prior iteration).
- On a re-run pass, also include an iteration-scope instruction: the reviewer's primary verification scope is the changes applied since the prior iteration (identified via the continuation item's summary of fixes, located within the latest git diff <base-commit>) plus landing confirmation of the prior iteration's findings — the full-coverage pass (re-verifying every target file in the full git diff <base-commit> from scratch) belongs to the initial pass only. The reviewer must still escalate back to full re-verification when content outside that primary scope raises a new concern, so coverage is reordered, not reduced. Omit this item on the initial pass (the initial pass is the full-coverage pass).
- The reviewer should report only actionable findings and, if there are none, explicitly state "No actionable findings"
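Because the payload list above serves both the fresh dispatch and the Step 7 background launch, it helps to picture it as one function with conditional items. The sketch below is illustrative only: the function name, parameter names, and item strings are hypothetical, and a non-empty `prior_iterations_summary` is assumed to mean a re-run pass.

```python
from __future__ import annotations

from typing import Optional


def review_payload_items(
    custom_instructions: Optional[str] = None,
    subtask_scope: Optional[str] = None,
    ledger: Optional[str] = None,
    prior_iterations_summary: Optional[str] = None,
) -> list[str]:
    """Compose the reviewer request; optional items are omitted when unset or empty."""
    items = [
        "diff: git diff <base-commit>",
        "safety net: flag obvious .claude/rules/ violations",
        "categories: correctness, conventions, simplicity",
    ]
    if custom_instructions:
        items.append(f"custom instructions: {custom_instructions}")
    if subtask_scope:  # only when a decomposition state file is active
        items.append(f"subtask scope: {subtask_scope}")
    if ledger:  # omitted when the ledger has no recorded dispositions
        items.append(f"handoff ledger: {ledger}")
    if prior_iterations_summary:  # re-run pass only
        items.append(f"continuation: {prior_iterations_summary}")
        items.append("iteration scope: changes since the prior iteration first")
    items.append('report only actionable findings, else "No actionable findings"')
    return items
```

Calling the same function from both paths is one way to keep the background launch and the inline dispatch from drifting apart.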
Judge the reviewer's response semantically: if the reviewer reports nothing actionable — no actionable findings, no bugs / convention violations / design issues raised, or any other "nothing to report" outcome regardless of exact wording — mark this and remaining iteration items as completed (skip). Mark Step 8: Code Review as completed and proceed to Step 9 (Completion Hooks) automatically. Per the No-Stall Principle, do not wait for user input and do not rely on exact-phrase matching; trust semantic judgment since the reviewer skill's phrasing varies (especially Skill(ask-peer) and other free-form-prose reviewers whose verdicts are natural-language Markdown rather than a fixed token).
Otherwise: autonomously fix genuine issues or reject inapplicable points with reason — do not ask the user for judgment on individual review findings. Mark this iteration item as completed.
- Rejection self-question (severity-label override): before rejecting any finding solely because the reviewer labeled it Minor (or any other low-severity bucket), ask "if I rejected this and presented the resulting code to the user, would the user re-raise the same point themselves?" — judging by which areas the user has historically commented on (intent expression, reader-comprehension, placement consistency for test fixtures / helper functions / dependency locality, and other readability concerns where runtime correctness is unaffected but a reader's interpretation is). If the answer is yes or ambiguous, apply the fix instead of rejecting on the Minor label alone; reject on Minor only when you are confident the user would not surface the same point.
- Class-level extension audit (post-Critical/Major-fix): immediately after applying a fix for a Critical-severity finding, or a Major-severity finding whose fix addresses a structural pattern (external I/O boundary conditions, closed enum / form-set networks, shared helper / safety-rail callers, parallel route handlers — for skill development: subagent return-value schemas, shared handler fallback paths, mirrored form-set network audits), and before the modified-vs-rejected branches below, scan the rest of the diff for other instances of the same defect class — same operation, same broken assumption, same side-effect pattern (e.g. shared-resource-destroying API call sequences, direct processing of unverified input, race conditions). Reviewer feedback typically names one instance; the underlying class often spans the diff (cross-construct propagation, shared safety-rail callers, parallel route handlers, etc.). Apply the same fix direction to additional matches found here, then record the sweep outcome (e.g. class-level sweep for <defect-class>: N additional instances found and fixed or no additional instances found) in the summary passed to the next iteration so the next reviewer does not re-trigger the same audit on already-swept ground, then continue to the modified-vs-rejected branch.
- Prose-integrity self-check (post-fix): after applying a fix that edits prose adjacent to its target line (comments, docstrings, paragraph-level documentation — for skill development this includes SKILL.md and references/*.md content), re-read the surrounding paragraph as a single unit before continuing — verify no sentence is cut mid-word, no logical connective is broken (the connectives however / therefore / because / but / etc. still anchor real clauses), and the paragraph's overall logic still holds. Mechanical fix patches see only the line-level diff and routinely leave the surrounding prose semantically broken in ways the next iteration's reviewer flags as a Major finding, costing an extra iter.
- Natural-language quality self-check (post-fix): when a fix adds new natural-language content that mechanical lint / test cannot verify (code comments, config-file annotations, error messages, UI copy, documentation fragments — for skill development this includes SKILL.md / references/*.md prose additions, frontmatter description text, log messages), re-read each added fragment as a standalone unit in the resolved language. Judge it on four axes: concise (no padding or runaway sentences), phrasing natural for the target reader, vocabulary consistent with surrounding text, register and sentence structure not awkward. Revise any fragment that fails. This self-check is the only gate before natural-language content reaches the user-visible commit gate — Step 7 (check_commands / test_commands) and Step 7.5 (rules-review) cannot evaluate natural-language quality.
- If code was modified: re-run Step 7 and Step 7.5 (with same base-commit from Step 2) — this full re-entry is a new pass (per Step 7's "Concurrent rules-review launch" pass definition): both the rules-review launch and the code-review launch re-fire (the code-review launch bakes its payload per sub-step 1's definition, continuation item and iteration-scope instruction included, and only when a pending iteration item remains — see Step 7's "Concurrent code review launch") — then continue to the next pending iteration item (back to step 1). Code fixes routinely introduce fresh bugs, tighten one place while loosening another, or miss a caller the author didn't know about — the next review round is how those leaks get caught. Always re-run Step 7 and Step 7.5 — no exceptions. Do not short-circuit on any rationalization: not on confidence in the fix, not because the diff is small, not because the modified files appear out of scope for the configured check_commands / test_commands (e.g. edits land entirely under a local-skill directory or a docs-only path), not because re-running "would be a no-op". If a re-run is genuinely a no-op, the no-op outcome is the audit trail; skipping the re-run removes the trail. The only permissible skip is when no code was modified in this iteration (handled by the next bullet).
- If all points were rejected (no modifications): mark remaining iteration items as completed (skip: there is nothing new for the next reviewer to look at).
Continue to the next pending iteration item; the next pass's reviewer dispatch (the Step 7 background bake and the fresh-dispatch path alike) composes its payload per sub-step 1's definition, continuation item and iteration-scope instruction included (not restated here).
Return-point no-stall reminder: At each iteration boundary (regardless of reviewer outcome — findings reported, "No actionable findings", any non-error result), the next action — the next iteration's reviewer dispatch when more iteration items remain, or the Step 9 (Completion Hooks) transition when this was the last iteration or "No actionable findings" was returned, or the Step 7 / Step 7.5 re-run when code was modified — must be issued in the next tool call. Do not insert an interstitial summary or acknowledgment turn between iterations; the abstract enumeration in § No-Stall Principle is intentionally duplicated here so the rule fires at the decision moment.
If all N_code iteration items are completed and actionable feedback still remains, present the unresolved points to the user for decision. Above the unresolved points, emit a summary preamble per references/plan-format.md § User-gate summary preamble. Render the findings following references/plan-format.md § Localization granularity in the resolved language.

Mark Step 8: Code Review as completed.
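The branching at each iteration boundary can be summarized as a small decision function. This is a minimal sketch under stated assumptions, not the skill's implementation: the function name, its parameters, and the action labels are hypothetical, and the semantic judgment of the reviewer's response is assumed to have happened before the call. The case where unresolved feedback is handed to the user after the last iteration is left out.

```python
from typing import List


def next_actions(actionable: bool, code_modified: bool,
                 pending_after_this: int) -> List[str]:
    """Return the ordered actions to issue after one Step 8 iteration.

    All names and labels here are illustrative, not part of the skill.
    """
    if not actionable:
        # Nothing to fix: remaining iteration items are skipped.
        return ["step9_completion_hooks"]
    if code_modified:
        # A fix always forces a Step 7 / Step 7.5 re-run first.
        follow_up = ("next_iteration_review" if pending_after_this > 0
                     else "step9_completion_hooks")
        return ["rerun_step7_and_step7_5", follow_up]
    # Every point was rejected: nothing new for another reviewer to see.
    return ["step9_completion_hooks"]
```

Whatever the result, the first action in the returned list is the one that belongs in the very next tool call, which is the point of the return-point no-stall reminder.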

Step 9: Completion Hooks

Skip this step if hooks.on_complete is not configured. Mark Step 9: Completion Hooks as in_progress.

Execute each entry in hooks.on_complete in order:
- Skill(<name>) pattern: invoke the skill — for review-class entries, include the cross-layer review handoff ledger as a short context item (per § Step 6's Cross-layer review handoff ledger paragraph, which also defines review-class; omit when the ledger has no recorded dispositions)
- Other strings: execute as a Bash command
If a hook fails, report the error but continue executing the remaining hooks. Include the failures as warnings in the Completion summary.
After all hooks complete (or are skipped), mark Step 9: Completion Hooks as completed and proceed to Step 10.
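The dispatch rule above (skill pattern versus Bash command, with failures collected rather than aborting) could look roughly like the following. It is only a sketch: the exact shape of the Skill(<name>) pattern is assumed to be the whole entry, and `run_hooks`, `invoke_skill`, and `run_shell_command` are hypothetical names standing in for whatever the orchestrator actually uses.

```python
import re
import subprocess
from typing import Callable, List

# Assumption: a skill hook is written exactly as "Skill(<name>)".
SKILL_PATTERN = re.compile(r"^Skill\((?P<name>[^)]+)\)$")


def run_shell_command(command: str) -> None:
    # check=True raises CalledProcessError on a non-zero exit status.
    subprocess.run(command, shell=True, check=True)


def run_hooks(entries: List[str],
              invoke_skill: Callable[[str], None],
              run_shell: Callable[[str], None] = run_shell_command) -> List[str]:
    """Run every hook in order; return warnings for the ones that failed."""
    warnings: List[str] = []
    for entry in entries:
        match = SKILL_PATTERN.match(entry.strip())
        try:
            if match:
                invoke_skill(match.group("name"))
            else:
                run_shell(entry)
        except Exception as exc:
            # A failing hook is reported but never stops the remaining hooks.
            warnings.append(f"{entry}: {exc}")
    return warnings
```

The returned list is what would be surfaced as warnings in the Completion summary.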

GATE: Verify Steps 2-9 are completed (check task status via TaskList; if status is inconsistent, verify actual completion by reviewing work done). Mark Step 10 as in_progress.

Step 10: Interactive Commits

accept: explicit affirmative — "OK" / "approve" / "next" / "LGTM" / "コミットして" / "進めて" / "いいよ" or any semantic equivalent
adjust: specific revision request — "subject を ... に" / "this file should be in commit 2" / "split this commit" / any other concrete change demand
cancel / stop: explicit halt — "stop" / "abort" / "やめる" / "中断"
NOT approval: interrogative or non-committal — "look good?" / "どう？" / "これでいい？" / "OK ？". Treat as adjust and re-present (do not silently advance)

language: ja: Step 10 部分完了: <N>/<total> コミット適用済み
language: en: Step 10 partial completion: <N>/<total> commits landed

Step 11: Update Rules

Skill(extract-rules) with --from-conversation — Skip if already run this session: if (a) any entry in hooks.on_complete (as resolved in Step 1) contains the string extract-rules (direct invocation), OR (b) Step 9 executed at least one hook and the output produced by Step 9's hook invocations (visible in this session's context) contains evidence that extract-rules --from-conversation ran this session (sufficient signal: output contains staged_count or promoted_count), skip this sub-step — extract-rules --from-conversation has already run this session. Running --from-conversation twice against the same session causes the staged-promotion mechanism (first-observation → second-observation escalation) to miscount one session as two independent observations, prematurely promoting staging candidates to confirmed rules.
Skill(extract-rules) with --update — Skip if --from-conversation ran this session: if extract-rules --from-conversation ran at any point this session — whether via sub-step 1's skip condition or because sub-step 1 itself ran in this Step 11 execution — skip this sub-step. Running --update immediately after --from-conversation risks prematurely promoting staging candidates that were just created before they accumulate a second observation. Trigger (when not skipped): significant structural/pattern changes to application code occurred — new frameworks, libraries, architectural patterns, or API conventions introduced in the diff; prose-only changes to SKILL.md, agent definitions, references, or rule files do not qualify. A dependency major-version bump alone (no implementation code changes in the diff) does not trigger --update; the major-bump signal (detected via git diff <base-commit> of the package manifest — the same signal used in the Step 2 difficulty assessment) instead triggers the extract-rules Update Mode operational note: surface it to the user as a reminder to run --update after the session and to manually review .examples.md samples that may have gone stale after the bump.
Char-count compaction gate:

Skip condition: If compact_rules is not true (i.e. the default false, or any non-boolean value that fell back to false), skip this entire sub-step — do not invoke Skill(extract-rules) --compact, do not open the Step 11 compaction approval gate, and proceed directly to sub-step 4 (variable initialization is not skipped — it is governed by the State-variable contract below, which covers the skipped case). Emit a one-line informational note in the resolved language so the user has a visible signal that compaction is intentionally not running:
- language: ja: Step 11 sub-step 3（圧縮）を skip しました — `compact_rules: true` が設定されていません（実験的機能 / デフォルト無効）
- language: en: Step 11 sub-step 3 (compaction) skipped — `compact_rules: true` is not set (experimental feature / disabled by default)
State-variable contract (cross-step declaration — § Completion reads both variables; the full 4-point lifecycle is specified in references/update-rules.md § Char-count compaction gate): at sub-step 3 entry, initialize compaction_applied_count = 0 and below_threshold_failed_files = []. When the skip condition above fired, both variables simply stay at these initial values (no advance ever runs), so § Completion's reads are well-defined and its compaction reminder is omitted.

When not skipped (compact_rules: true): read references/update-rules.md and follow § Char-count compaction gate from top to bottom — it is the single canonical home for this sub-step's procedure body, including the Step 11 compaction approval gate (USER APPROVAL GATE).
If extract-rules is unavailable, skip this step and inform the user.
After the applicable invocations above return, or after the step was skipped because extract-rules is unavailable — regardless of whether new rules were added or the report indicates nothing changed — mark Step 11: Update Rules as completed and proceed automatically. Per the No-Stall Principle, do not wait for user input.
If extract-rules wrote any changes to .claude/rules/ during sub-steps 1, 2, or 3, record the count so § Completion can surface the manual-commit reminder. The compaction-specific count (file-unit compaction_applied_count) is rendered separately by § Completion's "Step 11 compaction reminder" — see § Completion below
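Two of the skip conditions in this step are mechanical enough to sketch in code: the "already run this session" check for --from-conversation and the compact_rules gate with its state-variable initialization. The function names and argument shapes below are hypothetical; only the signals (the extract-rules substring, staged_count / promoted_count, and the two state variables) come from the text above.

```python
from typing import Dict, List, Tuple


def should_skip_from_conversation(on_complete: List[str],
                                  step9_hooks_executed: int,
                                  step9_output: str) -> bool:
    """True when extract-rules --from-conversation already ran this session."""
    # (a) a configured hook invokes extract-rules directly
    direct = any("extract-rules" in entry for entry in on_complete)
    # (b) Step 9 ran at least one hook and its output shows the counters
    evidence = step9_hooks_executed > 0 and any(
        marker in step9_output for marker in ("staged_count", "promoted_count")
    )
    return direct or evidence


def enter_compaction_substep(compact_rules: object) -> Tuple[bool, Dict[str, object]]:
    """Initialize the state variables, then decide whether compaction runs."""
    # Initialization happens even when the sub-step is skipped.
    state: Dict[str, object] = {
        "compaction_applied_count": 0,
        "below_threshold_failed_files": [],
    }
    # Only the boolean True enables compaction; anything else counts as false.
    return compact_rules is True, state
```

Because the state is initialized before the skip decision, § Completion can read both variables safely in either case.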

Step 11.5: Self-Retrospective

Completion

Report summary: tasks completed, files modified, test results, review outcomes, rules updated. Output in the resolved language following references/plan-format.md § Localization granularity.

language: ja: 難易度判定（<tier> tier）により <steps> を skip しました — 例: 難易度判定（Trivial tier）により Step 6 Tidy / Step 7.5 Rules Compliance Review を skip しました
language: en: Skipped <steps> per the difficulty-skip matrix (<tier> tier) — e.g. Skipped Step 6 Tidy / Step 7.5 Rules Compliance Review per the difficulty-skip matrix (Trivial tier)

language: ja: extract-rules が `.claude/rules/` に <N> 件の変更を加えました — PR を開く前に手動で commit してください
language: en: extract-rules made <N> changes to `.claude/rules/` — please commit manually before opening a PR

The reminder is omitted when the rule-change count is zero.

language: ja: Step 11 で <N> 件のルールファイルを圧縮しました — PR を開く前に手動で commit してください
language: en: Step 11 compacted <N> rule files — please commit manually before opening a PR

language: ja: <M> 件のファイルが閾値を超えています。手動で再度 `Skill(extract-rules) --compact` を実行するか、当該ファイルを直接編集してください: followed by <files> on the next line
language: en: <M> files still exceed the threshold. Re-run `Skill(extract-rules) --compact` manually or edit the files directly: followed by <files> on the next line

The compaction reminder is omitted when compaction_applied_count == 0 AND below_threshold_failed_files is empty.
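The two omission rules can be read as simple predicates over the Step 11 state. The sketch below assumes a hypothetical `Step11State` container and reminder labels; only the three variable names and the omission conditions are taken from the text.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Step11State:
    # Hypothetical container; the field names mirror the variables above.
    rule_change_count: int = 0
    compaction_applied_count: int = 0
    below_threshold_failed_files: List[str] = field(default_factory=list)


def reminders_to_show(state: Step11State) -> List[str]:
    """Return labels of the Completion reminders that should be rendered."""
    shown: List[str] = []
    if state.rule_change_count > 0:
        shown.append("rule-update")
    # Omitted only when nothing was compacted AND no file is still too large.
    if state.compaction_applied_count > 0 or state.below_threshold_failed_files:
        shown.append("compaction")
    return shown
```

A run where compaction was skipped keeps both compaction variables at their initial values, so only the rule-update reminder can appear.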

If this run was executing a subtask from a decomposition state file, also do the following (all reads/writes target the canonical state-file path recorded in Step 1.5):

Mark the current subtask's status as completed in the canonical state file and write back
Ask the user for an optional PR URL for this subtask. On a non-empty answer, set the subtask's pr field and write back; otherwise leave it null
Refresh the parent-task progress row's <done>/<total> count
Find the next runnable subtask (smallest-id pending with all depends_on completed)
If a next subtask exists: branch on whether Step 10 actually landed any commits this run (use the landed_count from Step 10 — taking the config flag alone would mis-route the case where interactive_commits: true met the Step 10 skip conditions and exited at zero commits):
- landed_count > 0: tell the user the current subtask's changes have already been committed by Step 10 — open a PR for those commits, then start a new session with /dev-workflow --resume <slug> once the PR is up
- landed_count == 0 (either because interactive_commits: false or because Step 10 was skipped): tell the user to commit the current subtask's changes and open a PR before resuming, then start a new session with /dev-workflow --resume <slug>. Explain why this matters: the next run records a fresh base-commit from HEAD, so uncommitted changes would leak into the next subtask's diff.
In both branches, if Step 11 also wrote rule updates (i.e., the Step 11 rule-update reminder above fired with <N> > 0), tell the user to commit those .claude/rules/ writes manually before resuming; otherwise they leak into the next subtask's diff the same way uncommitted feature changes would. The "no push" invariant for both branches is stated at § Step 10's preamble.
If no next subtask exists (all subtasks completed): delete the canonical state file via rm -f <canonical-path>, remove the parent-task progress row, and include every subtask's title and recorded pr (if any) in the parent-task completion summary
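The subtask selection and the landed_count branch can be sketched as follows. The state-file layout is not shown in this section, so the list-of-dicts shape and the `id` key are assumptions; `status` and `depends_on` are the field names used above, and the returned guidance labels are purely illustrative.

```python
from typing import Dict, List, Optional


def next_runnable(subtasks: List[Dict]) -> Optional[Dict]:
    """Smallest-id pending subtask whose dependencies are all completed."""
    completed = {t["id"] for t in subtasks if t["status"] == "completed"}
    candidates = [
        t for t in subtasks
        if t["status"] == "pending"
        and all(dep in completed for dep in t.get("depends_on", []))
    ]
    return min(candidates, key=lambda t: t["id"], default=None)


def resume_guidance(landed_count: int) -> str:
    """Pick the message branch from what Step 10 actually committed."""
    if landed_count > 0:
        return "open_pr_then_resume"
    # Zero commits landed, whether by config or by a Step 10 skip.
    return "commit_then_open_pr_then_resume"
```

Branching on landed_count rather than on the interactive_commits flag keeps the zero-commit case from being routed to the wrong message.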

