skills/planner-worker/SKILL.md
--- name: planner-worker description: Multi-agent coding swarm. Planner / worker / merger over git worktrees on the Max plan. Reads a grilled PRD from `.ralph/<name>/prd.md` OR a pre-ranked queue from `/ro:night-shift`, dispatches parallel Claude Code Agent workers (one per vertical-slice issue in its own git worktree) with hybrid file-area conflict detection, and a merger Agent reviews + merges clean work back to the target branch. Defers to /ro:repo-mode for `--github` default — `personal` rep
npx skillsauth add RonanCodes/ronan-skills skills/planner-workerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Multi-agent coding swarm. Sibling to /ro:ralph. Where Ralph is one agent doing stories in series, planner-worker is N agents doing stories in parallel under a planner and a merger, with optional judge-agent termination.
This skill is one part of the local factory — the family of agent-loop skills that run autonomously on Ronan's machine. Siblings: /ro:ralph, /ro:matt-pocock-coding-workflow, /ro:night-shift, /ro:day-shift. They share artefact shape, gitignore rules, and PR conventions. See /ro:ralph § "Run artefacts (the canonical shape)" for the canonical reference.
The companion is the remote factory — the Factory app (tracked separately) that runs equivalent loops as a cloud service. Story formats and PR conventions are kept compatible so a queue from one can be consumed by the other.
For parallel workers specifically: every worker MUST write its scratch to .ralph/sessions/<session-id>/<worker-id>.md (gitignored). The planner harvests learnings into .ralph/patterns.md, writes the per-session detail log to .ralph/sessions/<session-id>.md, and updates the per-phase rolling aggregate at .ralph/<phase>.session.md at end-of-wave. At the same end-of-wave point the planner also harvests .swarm/shared-learnings.md into .ralph/patterns.md and clears .swarm/claims.jsonl (see US-3c2). No worker touches a shared file. No merge conflicts on the progress log.
Close-with-summary (mandatory for every shipped slice): every worker MUST post a structured comment on the GH issue BEFORE merging via gh issue comment <N>. Sections: Shipped, Surprises encountered, Patterns promoted to patterns.md, Local learnings, Follow-ups. See /ro:ralph § "Close-with-summary comment (required for every story)" for the canonical body shape. The merger should reject any PR whose linked issue is missing this comment.
/ro:grill-me and written via /ro:write-a-prd and now has 5+ vertical-slice issues to ship--afk)depends_on to serialise hot-file work)/ro:ralph --mode single)--sandcastle flag)/ro:grill-me then /ro:write-a-prd first)# Interactive: grills you for config, then runs against a named PRD
/ro:planner-worker --prd lekkertaal-pwa-bootstrap
# Autonomous night-shift (AFK)
/ro:planner-worker --prd lekkertaal-pwa-bootstrap --afk
# Real GitHub issues + PRs
/ro:planner-worker --prd lekkertaal-pwa-bootstrap --github
# Resume after a human-resolved escalation
/ro:planner-worker --resume
prd:draft is NEVER picked upprd:draft issues are NEVER picked up by this skill. They represent ideas captured in the agent-native repo's "inbox", not ready work. Drafts have freeform bodies, NOT Pocock's 7-section template, and have not been grilled. The planner picking one up would mean dispatching workers against an unfinished spec.
To promote a draft into ready work, the user runs /grill on the issue. The grill-with-docs flow rewrites the body into the Pocock 7-section template, then the user swaps the label from prd:draft to the repo's gate label (ready-for-agent by default, or the project synonym like Sandcastle / swarm configured in docs/agents/triage-labels.md).
When the planner (or any phase of this skill) queries GitHub for backlog issues — typically via --source github:<label> from /agentic-e2e-flow or /ro:night-shift — ALWAYS exclude prd:draft. gh label semantics are tricky: --label <gate> matches issues that have the gate label, but an issue can have BOTH prd:draft AND the gate label (mis-labelling drift). Defence in depth:
--label <gate> to scope the initial query.labels[].name contains prd:draft.gh issue list --label <gate> --search "-label:prd:draft" ... to push the exclusion server-side.Tip: if you're not sure what's queued vs what's still an idea, run /ro:list-draft-prds first to see the drafts inbox before kicking off the swarm.
Unless --skip-grill (or --afk which implies it) is passed, the skill opens with a project-aware grilling phase. Each missing flag is asked via AskUserQuestion, one at a time, with a recommendation derived from project state.
Order and project-awareness rules:
GitHub issues + PRs for this run?
/ro:repo-mode. If mode == work → recommend "no" (hard rule, do NOT prompt; work-mode repos must stay invisible to the work GH/Jira/ADO project). If mode == personal → recommend "yes" when a gh remote resolves. If mode == unset → run the first-run prompt from /ro:repo-mode § "First-run prompt", persist, then re-resolve.gh repo view --json visibility,nameWithOwner 2>/dev/nullpersonal mode: recommend "yes" when a remote resolves and visibility is PRIVATE; "no" when no remote configured or repo is public + you don't want noise.--githubWorker count?
sysctl -n hw.ncpu (macOS) or nproc (linux); free disk on .swarm/worktrees/ parent via df -k .min(5, cores - 2) and warn if free disk < (repo_size_mb * worker_count * 2)--workers N, .swarm.json workers, or --unsafe-many past 10.--workers NRun mode?
--afk for AFK, nothing for interactiveMerge target?
git rev-parse --abbrev-ref HEADswarm-staging) as alt--staging-branch <name> to override; default = current branchWorker model?
sonnet; offer opus only if the PRD is flagged as "research-heavy" (more than 3 issues mention "investigate" or "explore").swarm.json models.workerAdd a Judge agent?
--judge-agentTrust local pre-push CI; skip waiting for GitHub Actions?
test -f .husky/pre-push AND grep that file for any of pnpm quality, pnpm test, pnpm build, pnpm lint, pnpm typecheck/ro:new-tanstack-app).github/workflows/*.yml runs checks that are NOT in the pre-push hook (e.g. cross-OS matrix, integration tests against a managed service)--trust-local-ci (yes) or --wait-for-gh-ci (no).swarm.json as trust-local-ci: true|falsetrust-local-ci=true push, verify the push hook ran clean, and squash-merge immediately. They do NOT poll gh api .../check-runs.gh api repos/<owner>/<repo>/branches/main/protection and overrides to wait-for-gh-ci=true (printing why)Each answer is echoed back as the CLI flag the user could have passed next time, e.g. "noted, equivalent to --workers 5". CLI flags supplied at invocation skip their corresponding prompt.
--skip-grill short-circuits to defaults: workers=5, github=<true if repo-mode=personal AND gh remote present, else false>, mode=interactive, merge-target=current-branch, worker-model=sonnet, judge-agent=false, trust-local-ci=true. The github default specifically defers to /ro:repo-mode — work-mode repos always run with github=false so nothing leaks to the work GH/Jira/ADO project. If repo-mode is unset and --skip-grill is in play (e.g. --afk), default github=false (safe failure mode); a separate session will catch the unset state and prompt.
The planner is invoked once at the start of a run via the Agent tool, with the Opus model.
Planner prompt template (see scripts/planner-prompt.sh for the canonical version):
You are the PLANNER agent for /ro:planner-worker.
Read the PRD at .ralph/<name>/prd.md.
Explore the codebase enough to size the work.
Emit a backlog of vertical-slice issues to .ralph/<name>/issues/<NNN>-<slug>.md.
Each issue file MUST have frontmatter:
id: 001
title: <one-line>
status: ready
depends_on: [] # array of issue ids that must merge first
estimate: <S|M|L>
dod:
- <pnpm test passes>
- <pnpm typecheck passes>
- <any extra acceptance check>
Body: 5-30 lines describing what to build + why, file pointers, test plan.
Every body MUST end with the verbatim `### Close-the-loop tests` block
(unit + integration + e2e + live smoke) from `/ro:slice-into-issues`.
The dispatcher refuses to send a worker against a slice missing it
(see US-2a "Close-the-loop AC gate"); pre-emit it so we never hit
that gate.
Rules:
- Vertical slices only. No "scaffold the schema" issues without a UI touchpoint.
- Mark depends_on for hot-file conflicts (two issues touching the same component).
- 3-12 issues for v1. If you'd plan more than 12, split the PRD instead.
- Exit cleanly when issues are written. Do not implement anything.
The planner is only re-invoked under US-7 (worker failure re-decomposition).
After the planner exits, the skill prints a one-screen summary:
Backlog plan ready (.ralph/<name>/issues/):
001 [S] schema: add points column deps: -
002 [M] service: award points on completed task deps: 001
003 [M] dashboard: tile rendering points trend deps: 002
004 [S] hook: usePoints deps: 001
005 [M] integration test: points e2e deps: 003,004
Dispatch order: [001] -> [002, 004] -> [003] -> [005]
Workers will run up to N=3 in parallel.
[y] dispatch [n] re-plan with feedback [q] abort
Skipped under --auto-approve or --afk. Rejecting with feedback re-invokes the planner Agent with the feedback inlined.
kind:research issues BYPASS this gate. A research/investigation issue legitimately has no test matrix — its close-the-loop is a doc in docs/research/ + the LLM wiki, per [[canon:research-tasks]]. When a candidate issue carries kind:research, skip the test-AC check below, dispatch the research-worker prompt (see "US-3d: Research-worker mode"), and let the merger assert the research close-the-loop matrix (doc exists + index updated + wiki ingested + bidirectional issue↔doc links) instead of tests.
Before dispatching ANY non-research worker, the planner parses each candidate slice body for the marker ### Close-the-loop tests (the narrower predecessor) OR ### Close-the-loop verification matrix (the full 7-row matrix per [[close-the-loop-verification-matrix]]). Slices missing both are guaranteed leaks: the night-shift swarm only implements what the AC list spells out, so a slice without an explicit e2e AC ships with no Playwright coverage (lesson captured at [[close-the-loop-tests-acs]]; canonical example is dataforce #229 / #231).
Behaviour is controlled by the repo-local .ronan-skills.json flag swarm.missing_test_acs:
refuse (default, safe): print the offending slice ids + bodies, skip dispatching them, fail-loud if zero dispatchable slices remain. Forces the human to either re-run /ro:slice-into-issues against the parent PRD (which now emits the section automatically) or hand-edit the slice body and re-queue.inject: auto-prepend the boilerplate ### Close-the-loop tests block (verbatim from /ro:slice-into-issues) before dispatch AND tag the issue with the acs-auto-injected label (so reviewers know to read the diff for actual test files). Faster but riskier: a worker may implement the slice without writing the tests if the issue body doesn't already reference them in the "What to build" prose.Parse logic (cheap, no LLM needed) — accepts EITHER the narrower ### Close-the-loop tests marker OR the full ### Close-the-loop verification matrix marker:
issue_body="$(gh issue view "$NUM" --json body --jq .body)"
if ! grep -qE '^### Close-the-loop (tests|verification matrix)' <<< "$issue_body"; then
case "$(jq -r '.swarm.missing_test_acs // "refuse"' .ronan-skills.json)" in
inject) inject_block_and_label_acs_auto_injected ;;
*) mark_slice_skipped_with_reason "missing-close-the-loop-tests" ;;
esac
fi
This gate runs in both --prd <name> mode (planner Agent-emitted slices) AND --queue <path> mode (ranked night-shift input). The night-shift loop should not be expected to backfill ACs into pre-existing issues; if a slice arrives without the section, the gate is the last line of defence.
When invoked with --queue <path>, the skill skips US-1 (planner runs) and US-2 (confirmation) and reads the ranked slice list straight from the file. Format is a one-slice-per-line plaintext list of <issue-number> (when --github) or <issue-id> (when local). Used by /ro:night-shift to pass its ranker output through and prevent every wave from re-running the planner Agent.
The file is consumed in order — top of file = highest priority. The skill still respects depends_on and the US-3b file-area graph; the ranked order only sets dispatch priority among work that's otherwise dispatchable.
--queue and --prd are mutually exclusive. When --queue is set the planner Agent is NOT re-invoked; that's the night-shift loop's job between waves.
For each unblocked issue (depends_on all merged or empty), up to the worker cap:
gh issue develop <gh-issue-number> --name swarm/<id> — creates the branch with the issue↔branch dev-link (canon ronan-skills/canon/labels.md § Branch flow). NEVER plain git checkout -b here; the dev-link is what makes the Closes #N propagation reliable and lets the night-shift retro walk issue↔branch↔PR without title-matching.git worktree add .swarm/worktrees/<id> swarm/<id> — check the just-created branch out into the isolated worktree..swarm/ to .gitignore if not already present (commit only on first run, on a chore(swarm): gitignore .swarm commit)Workers are dispatched as multiple Agent tool calls in a single assistant message so the runtime fans out. They are independent and do not coordinate.
Two parallel workers touching the same files race each other at merge time. The skill mitigates this with a hybrid graph, separate from the depends_on graph.
Canonical-label hints (read first, free): before doing any path-set analysis, read these labels from the canonical label system (~/Dev/ronan-skills/canon/labels.md):
parallel-eligible — the slicer asserted this slice is file-disjoint from its siblings. The planner treats this as "safe to fan out unless the static pass disagrees with a path overlap".repo-lock — the slice takes a repo-wide lock (lockfile churn, schema reset, top-level config rewrite). Mark it serial-only: it runs in its own wave after every parallel-eligible slice drains.hitl-likely — reviewer expected to escalate (ORM, schema-migration, billing, OAuth, secret rotation). The planner counts these against the HITL concurrency cap (default 1 in flight at a time) to keep reviewer attention focused.These hints are advisory: the static pass below still runs and a path conflict still serialises. The labels exist so the planner doesn't need to re-derive the slicer's intent.
Static pass (every wave, free): read each candidate slice body for:
## Files touched or ## Touches section listing pathsestimate: or paths: frontmatter arraysrc/..., app/..., apps/..., packages/..., drizzle/..., etc.Two slices conflict if their declared path sets intersect by exact file OR share a directory at depth ≥ 2 (e.g. src/routes/settings.tsx and src/routes/settings.test.ts collide via src/routes/). Conflicting slices are never dispatched in the same wave, even if they're both unblocked.
Scout pass (only after a wave hits a cross-worker merge conflict): spawn a one-shot scout sub-agent per remaining candidate slice with the prompt:
You are a SCOUT sub-agent for /ro:planner-worker.
Slice: <issue body inlined>
Task: read the slice and predict the SET of file paths the implementation will touch.
Output a JSON object on stdout, nothing else:
{"paths": ["src/routes/...", "src/lib/..."], "confidence": "high|medium|low"}
Rules:
- Do not edit anything.
- Do not run pnpm install or any side-effects.
- Read the codebase enough to be accurate. ~2-3 min budget.
- If you cannot tell, output paths: [] with confidence: low.
Scout outputs are merged into the file-area graph and persist for subsequent waves of the same run.
--no-scout disables the scout pass entirely (purely static). Use when you'd rather waste worker time on conflicts than spend planner time on probing.
The static + scout passes above predict conflicts BEFORE dispatch. This protocol catches the conflicts prediction cannot: scope drift, where a worker legitimately wanders outside its predicted territory mid-implementation. Proof case: nutmeg #167 and #168 (2026-06-11) both created admin/sources.tsx in the same wave — the predicted territories were disjoint, the actual work wasn't, and the collision only surfaced at merge time. A live claims file would have surfaced it mid-run.
Two append-only files in .swarm/, active in every wave:
.swarm/claims.jsonl — the live claims registry. One JSON object per line:
{"worker": "003", "issue": 168, "paths": ["src/routes/admin/sources.tsx", "src/lib/sources/"], "ts": "2026-06-11T21:14:02Z"}
Appends are flock-guarded (flock .swarm/claims.jsonl -c 'cat >> .swarm/claims.jsonl <<< "$line"', or single-write >> lines under PIPE_BUF). Protocol:
.swarm/status.md and in the final report so the planner arbitrates (serialise, re-scope, or hand the path to one worker)..swarm/shared-learnings.md — append-only one-liner discoveries: build quirks ("run pnpm build first to regenerate routeTree.gen"), flaky tests, migration-journal gotchas. Workers append discoveries as they hit them and read the file on arrival and between phases — wave-mates benefit immediately instead of waiting for the next wave's prompts.
Honest framing: this is cooperative, not enforced — LLM workers follow it because the prompt says so. The merge-time rebase + full DoD re-run remains the hard gate. Claims exist to catch scope drift early, not to replace the gate. It pays off most at 4+ workers; at 2-3 the static graph already does well.
Planner duties: at wave end, harvest .swarm/shared-learnings.md into .ralph/patterns.md (alongside the existing worker-scratch harvest) and clear .swarm/claims.jsonl — stale claims from a finished wave must not block the next one.
Relation to the factory: this is the local, cooperative sibling of the remote factory's enforced DO file-lock graph (github.com/RonanCodes/factory). Keep the claim record shape (worker, issue, paths, ts) compatible so a queue can move between local swarm and cloud factory.
The planner is the lifecycle-label authority while a wave is in flight, per the canonical label system (~/Dev/ronan-skills/canon/labels.md):
gh issue list --label kind:slice --label ready-for-agent --state open. The planner picks from this set, applying the conflict graph + label hints above.ready-for-agent → in-progress on the chosen issue (single gh issue edit --add-label in-progress --remove-label ready-for-agent).in-progress → ready-for-agent (back into the queue for the next wave).in-progress → needs-human and post a structured comment naming the human action required.gh issue edit <NUM> --remove-label in-progress BEFORE the squash-merge call. The Closes #N autoclose flips the issue from open to closed but does NOT strip labels — without the explicit remove, the closed issue keeps a stale in-progress label and the lifecycle-label invariant breaks. Closed state is the absence of any lifecycle label.Branch flow: workers branch off their assigned issue with gh issue develop <issue-number> --name <slug> --checkout rather than plain git checkout -b. This produces the GitHub issue→branch dev-link, which makes Closes #N automatic in the PR and lets the night-shift retro walk issue→PR without title-matching.
kind:research)When the dispatched issue carries kind:research, the planner routes to the research-worker prompt instead of the code-worker template below, and the merger asserts the research close-the-loop matrix instead of tests. The full prompt + matrix + folder convention live in [[canon:research-tasks]]. In short: the worker does deep web research (WebSearch + WebFetch + /ro:perplexity-research), writes a cited doc to docs/research/<slug>/README.md with bidirectional issue↔doc front-matter, updates the docs/research/README.md index, ingests the doc into the designated LLM wiki vault via /ro:wiki, posts a close-with-summary comment, and opens a docs-only PR. CI gate is lint/build only — research PRs touch docs/**, never application code, so there are no unit/integration/e2e tests to wait on. Research issues are always parallel-eligible (they only touch docs/research/, never conflicting with code workers' file-areas).
Worker prompt template (code workers; for kind:research see US-3d above and [[canon:research-tasks]]):
You are a WORKER agent for /ro:planner-worker.
Issue: .ralph/<name>/issues/<id>-<slug>.md (full body inlined here)
OR a GH issue number; in that case, FIRST run:
gh issue develop <issue-number> --name <slug> --checkout
to branch off the issue (NOT git checkout -b). This produces the
issue->branch dev-link required by the canon.
Worktree: .swarm/worktrees/<id>/
You are pinned to this worktree. Do NOT touch sibling worktrees.
Workflow:
1. On arrival: read .swarm/shared-learnings.md (repo quirks your wave-mates
already hit) and .swarm/claims.jsonl (which paths other live workers hold)
2. Read the issue body
3. Append your claim to .swarm/claims.jsonl BEFORE editing an area — one
JSON line {"worker","issue","paths","ts"}, flock-guarded append. Re-read
claims and append a fresh claim BEFORE any scope expansion (touching a
path you didn't originally claim)
4. NEVER edit a path another live claim holds. On a clash: do NOT edit —
flag it in .swarm/status.md and your final report; the planner arbitrates
5. Implement the feature/fix
6. Append discoveries (build quirks, flaky tests, migration gotchas) to
.swarm/shared-learnings.md as one-liners AS YOU HIT THEM, not at the end
7. Run the DoD commands (auto-detected or .swarm.json `dod`)
8. Only exit successfully when ALL DoD commands pass
9. Commit on the worktree branch with an emoji-conventional message
10. Append a one-line summary to .swarm/logs/<id>.log
11. Before squash-merge (or just before exit, whichever applies in your dispatch shape): `gh issue edit <NUM> --remove-label in-progress`. The `Closes #N` autoclose strips the issue's open state but does NOT strip labels; without this step the closed issue keeps its stale `in-progress` label and the canonical lifecycle invariant breaks.
Worktree guard — run BEFORE every git command (including `git status` / `git rev-parse`):
```bash
test "$(git rev-parse --abbrev-ref HEAD)" != "main" \
|| { echo "ERROR: on main branch, refusing to commit"; exit 1; }
test "$(pwd)" != "<MAIN_REPO_ABSOLUTE_PATH>" \
|| { echo "ERROR: in main checkout, refusing to commit"; exit 1; }
If either check fails, exit "stuck" immediately — do NOT recover by switching branches. Lesson #5a documents the failure mode: the Agent tool's isolation: "worktree" flag occasionally fails to take effect and the worker lands in the shared main checkout; once there, every subsequent git op mutates the main checkout's index instead of the worktree's. Branch protection on main is the load-bearing backstop (rejects the eventual push), but the in-prompt guard catches the drift earlier and avoids a "stuck on rejected push" loop.
If git push is rejected by GitHub's branch-protection hook (e.g. remote: error: GH006: Protected branch update failed), do NOT retry, do NOT use --force, --force-with-lease, or --admin. That rejection means you drifted out of your worktree onto main. Exit "stuck" with the rejection message verbatim so the orchestrator can intervene.
Rules:
## US-4: DoD detection + override
Auto-detect from `package.json` in this order:
1. `.swarm.json` `dod` block (highest precedence)
2. If `package.json` has `scripts.test` and `scripts.typecheck`: use them
3. Else if `tsconfig.json` present: `tsc --noEmit`
4. Else: `pnpm test` if `pnpm-lock.yaml` present; `npm test` if `package-lock.json` present
If `.swarm.json` lists `dod.extras`, append those (e.g. `pnpm lint`, `./verify-quick.sh`).
Worker exits successfully **only** when every DoD command passes (non-zero exit = stuck).
## US-5: Merger agent
When a worker reports green, dispatch a **merger Agent** (Opus) for that worktree:
You are the MERGER agent for /ro:planner-worker, issue <id>.
Worktree: .swarm/worktrees/<id>/ Merge target: <merge-target-branch>
Workflow:
git merge --no-ff swarm/<id> (or gh pr merge --squash if --github)git worktree remove .swarm/worktrees/<id> and git branch -D swarm/<id>MANDATORY stale-squash guard (--github mode): after ANY
git push --force-with-lease (e.g. post-rebase), do NOT merge blind.
Re-read the PR head:
remote_sha=$(gh pr view <PR#> --json headRefOid --jq .headRefOid) local_sha=$(git rev-parse HEAD)
Only run gh pr merge --squash when remote_sha == local_sha (your rebased
HEAD). On mismatch: wait a few seconds and re-check — GitHub's view of the
branch lags the push. NEVER merge a stale head. Lesson: nutmeg night-shift
ns-20260611-1624 — a merge fired immediately after a force-push, the squash
captured the pre-push branch state, main landed red, and fix-forward #170
was needed. Applied manually for waves 5-6: zero recurrence.
The merger has merge authority on the merge-target branch. Default = the branch the skill was invoked from. `--staging-branch swarm-staging` routes merges to a staging branch instead; main is updated only when the user explicitly merges staging.
## US-6: Escalation
If the merger exits `escalated`:
1. Leave the worktree intact (forensics)
2. Append a one-line summary to stdout AND `.swarm/status.md` (under `Escalations:`)
3. Mark the issue `status: escalated` in its file
4. The dispatcher continues with the next unblocked issue
5. On all-issues-resolved-or-escalated, the run pauses for the user
6. User fixes the conflict, then `/ro:planner-worker --resume` picks up where we left off
## US-7: Worker failure -> planner re-plan (retry-and-split state machine)
Implements [[close-the-loop-verification-matrix]] § "Retry-and-split state machine". If a worker exits stuck:
1. Mark the issue `status: stuck`
2. **Attempt 2** — Retry ONCE with the failure context inlined in the worker prompt (single retry, fresh context, same models). Capture the failure-class (the matrix row that failed) so the retro can track patterns.
3. If the retry hits a **different** failure-class, treat that as attempt 2.b (reset, no counter bump): re-dispatch fresh with both failure contexts injected as constraints.
4. **Attempt 3** — If still stuck on the same failure-class, re-dispatch with model bump (Opus implementer + Opus reviewer; both fresh contexts).
5. **Auto-split** — If attempt 3 fails: re-invoke the **planner Agent** with:
- The stuck issue body
- All three workers' failure logs
- The matrix-row(s) that failed across attempts
- Instructions to decompose this single issue into 2-4 smaller sub-issues that route around the failure mode
6. New sub-issues are appended to `.ralph/<name>/issues/` with ids like `001a`, `001b` and `depends_on` set sensibly. When `--github`, they also get GH issue numbers via the standard slicer flow and the parent gets the `split-by-loop` label.
7. Original stuck issue is marked `status: replaced-by: [001a, 001b]`
8. Dispatcher picks up the new sub-issues next cycle
9. **Hard escalation** — If the same parent ancestry triggers auto-split twice (e.g. `001` → `001a` + `001b`, then `001a` → `001a-i` + `001a-ii`), STOP and escalate to HITL via Pushover + Telegram. Anthropic's "0% across many trials = broken task" rule says the spec is wrong, not the agent.
### Emit `failures[]` for the retro
For every auto-split, EVERY model-bumped attempt, AND every successful merge that required >= 2 attempts, append a JSON line to `.swarm/failures.jsonl`:
```json
{
"slice": 234,
"mode": "thrashing|proxy-gaming|misalignment|non-convergence|context-drift|runaway-resource|other",
"evidence_ref": "<PR url or .swarm/logs/<id>.log>",
"narrative": "<one to two sentences>"
}
/ro:night-shift-retro reads .swarm/failures.jsonl as part of US-1 (Gather run data) and folds entries into the retro JSON's failures[] block. The retro classifier may re-classify the mode based on cross-slice patterns (a single thrashing looks like non-convergence; three across slices is a SYSTEM signal).
This is the only path that re-invokes the planner mid-run.
When --github is set:
gh issue create --title <title> --body <body> --label swarm for each backlog issue and stores the issue number in the issue file's frontmatter (github-issue: 42)git push origin swarm/<id> and gh pr create --base <merge-target> --head swarm/<id> --title <title> --body "Closes #42"gh pr merge --squash --delete-branch instead of local merge — gated by the MANDATORY stale-squash guard: after any git push --force-with-lease, re-read gh pr view <PR#> --json headRefOid and only merge when it equals the local rebased HEAD SHA; on mismatch wait and re-check, never merge a stale head (nutmeg #168 incident: the squash captured pre-push state, main landed red, fix-forward #170 required)swarm:escalated label addedgh auth status must be green; gh repo view must resolve; the skill fails loudly otherwiseDefault termination: empty backlog = exit.
With --judge-agent, after each dispatch cycle (one cycle = "dispatch all unblocked, wait for all merged or escalated"), invoke a judge Agent (Opus):
You are the JUDGE agent for /ro:planner-worker.
PRD: .ralph/<name>/prd.md (re-read it)
Backlog state: .ralph/<name>/issues/ (re-read all)
This cycle: <summary of merged + escalated>
Decide: KEEP_GOING or STOP.
KEEP_GOING means: the PRD has goals not yet covered by the merged backlog, and you can describe what's missing in 1-3 bullets. Returning KEEP_GOING re-invokes the planner with those bullets.
STOP means: the PRD's acceptance criteria are met, or all remaining work is escalated, or you're going in circles.
Hard caps (regardless of judge): --max-cycles (default 10), --max-runtime (default 4h).
KEEP_GOING re-invokes the planner with the judge's missing-work bullets; STOP exits cleanly.
--afk is the autonomous combination flag. It implies:
--auto-approve (skip US-2 confirmation)--judge-agent (Cursor-style termination)--skip-grill (use defaults for unspecified flags)ANY /ro:planner-worker run against a real backlog ends with TWO tail calls, in this order:
/ro:completion-report --prs <merged-pr-list> --title "<prd-name> swarm" --no-open — writes a browsable HTML report to <repo>/.completion-reports/<ts>-<slug>.html with per-PR cards (CI status, file stats, summary), per-file diffs, per-PR revert command, per-file rollback command, and a risk panel (schema migrations, env changes, large deletions, lockfile-only PRs, no-tests-on-feat). Capture the absolute path it prints. Pass --prs rather than --prd because the swarm tracks the merged PR list directly; falls back to --prd <name> if PR list isn't available./ro:pushover --url file://<path-from-step-1> — sends the done / paused / blocked / crashed ping with the report path as a deep link.If step 1 reports an empty range, skip --url on step 2 but STILL send the ping with state.
Pushover message anatomy:
night shift done: <N> stories merged, <M> stuck, <duration>
with the report URL attached so tapping the notification opens the diff browser.
Skip BOTH tail calls ONLY when:
--plan-only (planner runs but no workers dispatch)--no-pingPer the global Pushover firing rule (see ~/CLAUDE.md § Pushover Notifications, condition 4) and the completion-report skill at ~/Dev/ronan-skills/skills/completion-report/SKILL.md.
N defaults to 5. Hard ceiling of 10 without --unsafe-many. Above 10:
WARNING: >10 worktrees will likely thermal-throttle your laptop and consume
<estimate_mb> MB of disk via duplicate node_modules. Pass --unsafe-many to
override.
The skill computes estimate_mb as repo_size_mb * worker_count.
After a human-resolved escalation, --resume:
.swarm/status.md and .ralph/<name>/issues/*.mdstatus: ready and unmet depends_on now metstatus: escalated whose worktree was modified since the escalation timestamp--resume on a clean state is a no-opDuring the run:
.swarm/logs/<id>.log is the worker's own append-only log.swarm/status.md is rewritten on every state change with this template:# planner-worker live status
PRD: <name>
Started: <ISO8601>
Cycle: <N> of <max>
## Workers active
- 002 (worker, 4 min)
- 003 (merger, rebasing)
## Merged
- 001 schema: add points column <commit-hash>
## Escalated
- (none)
## Stuck (retrying)
- (none)
## Last 5 events
- 12:14:02 worker-001 GREEN
- 12:14:05 merger-001 dispatched
- 12:14:18 merger-001 MERGED abc1234
- 12:14:19 worker-002 dispatched
- 12:14:19 worker-003 dispatched
Tail with watch -n 2 cat .swarm/status.md or code -r .swarm/status.md.
On final exit, the skill writes .swarm/run-<ISO8601>.md:
# Swarm run postmortem
PRD: <name>
Duration: 47m 12s (started 2026-05-14T07:30Z, ended 2026-05-14T08:17Z)
Workers cap: 5
## Planned issues: 5
- 001 schema: add points column
- 002 service: award points on completed task
- 003 dashboard: tile
- 004 hook: usePoints
- 005 integration test
## Merged: 4
- 001 abc1234 (4 min worker, 2 min merger)
- 002 def5678 (8 min worker, 1 min merger)
- 003 ghi9abc (12 min worker, 3 min merger)
- 004 jkl0def (6 min worker, 1 min merger)
## Stuck: 1
- 005 integration test: failed e2e setup after retry; planner re-decomposed into 005a, 005b (queued for next run)
## Escalated: 0
## PRs (github mode)
- (n/a; local merge mode)
Optional. If present in the repo root:
{
"dod": {
"test": "pnpm test",
"typecheck": "pnpm typecheck",
"extras": ["pnpm lint", "./verify-quick.sh"]
},
"models": {
"planner": "opus",
"worker": "sonnet",
"merger": "opus",
"judge": "opus"
},
"workers": 5,
"github": false,
"merge-target": "main"
}
Sensible defaults when the file is absent. CLI flags always win over the file.
.ralph/<name>/
prd.md
issues/
001-schema-points.md
002-service-award.md
...
.swarm/ # gitignored
swarm.json # optional config
worktrees/ # active git worktrees
001-schema-points/
...
logs/
001-schema-points.log
...
claims.jsonl # live claims registry (US-3c2), cleared at wave end
shared-learnings.md # append-only one-liner discoveries (US-3c2)
status.md # live state
run-<ISO8601>.md # postmortem (one per run)
.ralph/<name>/prd.md is missing, fail loudly and point to /ro:write-a-prdmain directly from the worker; only the merger has merge authority--github mode (no noise on origin)--unsafe-many. Real laptops thermal-throttle hardCaptured from the lekkertaal swarm run on 2026-05-14 (8 PRs landed across two waves). These are baked into the prompt template each worker receives.
When multiple workers each run pnpm drizzle-kit generate, they all grab the same next number (e.g. 0004) and stomp drizzle/meta/_journal.json. The conflict is mechanical to resolve but burns planner cycles.
Fix: the planner assigns each DB-touching worker a unique migration slot in its dispatch prompt: "your migration is 0007_<slug>.sql; do not run drizzle-kit generate until you've manually claimed that number in _journal.json." Workers without a DB change don't get a slot.
Workers that delegate CI watching to an in-context monitor tool can park silently if the monitor doesn't deliver events back into the worker's conversation. We saw one worker exit with "I'll wait for the monitor events to come through" — the PR was open, conflicts present, and the worker never came back.
Fix: the dispatch prompt MUST tell the worker to poll CI via a foreground bash loop using gh api repos/<owner>/<repo>/commits/<sha>/check-runs, sleep 30s, re-poll. Hard cap 15 minutes. NEVER use an in-context monitor for this. If checks stay pending past the cap, STOP and report; do not retry blindly.
We observed a worker push a branch + open a PR, and GitHub Actions did not fire on it. Force-pushing the same branch after a no-op rebase triggered CI normally. Root cause not fully isolated — could be path-filter quirk, draft-PR gating, or a brief Actions hiccup.
Fix: after git push, the worker must verify CI fired within 60s via gh api .../actions/runs?head_sha=<sha>. If no run exists, force-push a no-op (git commit --amend --no-edit && git push -f) to nudge it. If that still doesn't work, escalate to the planner.
When PR-1 lands while PR-2 is still in flight, PR-2 will hit conflicts. Rebasing PR-2 onto the new main and force-pushing is cleaner than merging main into PR-2 (which leaves merge bubbles in the PR). The lekkertaal run #66 (prompt caching) rebased cleanly onto #65 (telemetry) mid-flow and ended up composing both behaviours — a real win.
Fix: worker prompt should say "if your push is rejected because main moved, rebase onto origin/main, resolve conflicts, force-push-with-lease. Do not merge main into your branch."
Even with isolation: "worktree" in the Agent tool, workers occasionally land on the SHARED main checkout instead of a fresh worktree. When that happens, parallel workers stomp each other's uncommitted edits as they git checkout between branches. Observed 4+ times in the lekkertaal run (PRs #79, #82, #100, #103).
Symptoms: a worker reports "the main checkout had uncommitted edits from another worker" or "my edits were wiped when another worker switched branches".
Fix in worker prompt: every dispatch prompt should include a self-check at the top:
ON ARRIVAL:
1. Run `git worktree list` and `pwd`. Confirm your pwd is NOT
<main-repo-path>. If it IS, create your own worktree at
`<main-repo-path>-wt-<issue-number>` and `cd` into it.
2. `git status` should show clean (no other worker's edits).
3. Only then start work.
This guarantees isolation even when the harness's isolation: "worktree" doesn't take effect for some reason. The cost is one extra mkdir + checkout per worker; the benefit is no lost work.
Fix in planner: when 3+ workers are dispatched in parallel and ANY two report worktree collisions, the planner should add a pre-dispatch step that creates the worktrees explicitly and passes the path to each worker.
If the repo has a husky pre-push hook running pnpm quality-checks (format + lint + build + test + audit), pushing a branch already proves the same checks GitHub Actions will run. Waiting for GH CI to repeat the gauntlet adds 1-2 minutes per worker for zero additional safety.
Default (--trust-local-ci, ON): workers run local checks, push, immediately squash-merge. If pre-push hook is missing or SKIP_QUALITY_CHECKS=1 was used, fail loudly and do not merge.
Opt-out (--wait-for-gh-ci): workers still poll gh api .../check-runs and only merge on green. Use when:
Setup-time prompt (US-0): the planner asks the user once per repo:
"This repo has a pre-push hook running [<list of checks>]. GitHub Actions runs effectively the same checks. Skip GH CI wait and merge immediately after a clean push? [Y/n]"
The answer is persisted in .swarm.json as trust-local-ci: true|false.
Safety net: GH CI still runs on the merged commit on main. If a fluke bug lands, the post-merge CI run will flag it on main, and the planner can revert. The frequency of this happening (zero in the lekkertaal 8-PR run) does not justify the per-PR wait cost.
From the nutmeg double-shift, 2026-06-11: slices #167 and #168 had disjoint predicted territories, passed the static file-area pass, and were dispatched in the same wave. Both workers then independently created admin/sources.tsx — the work drifted into the same file mid-implementation, and the collision only surfaced at merge time.
Fix: the live claims registry (US-3c2). Workers append {worker, issue, paths, ts} to .swarm/claims.jsonl BEFORE editing an area and re-check before any scope expansion; a clash with a live claim means stop-and-flag, not edit. Prediction handles the conflicts you can see from the slice bodies; claims handle the ones that only appear once the work is underway. Cooperative, not enforced — the merge-time rebase + DoD re-run is still the hard gate.
Two failure modes from the nutmeg night-shift ns-20260611-1624:
Stale squash: a merger ran git push --force-with-lease (post-rebase) and immediately gh pr merge --squash --admin. The squash captured the STALE pre-push branch state — GitHub's view of the head hadn't caught up — so main landed red and fix-forward #170 was required.
Fix: the mandatory stale-squash guard in the merger template (US-5): after any force-push, re-read gh pr view <PR#> --json headRefOid and only merge when it equals the local rebased HEAD SHA; on mismatch wait and re-check. Applied manually for waves 5-6 of the same run — zero recurrence.
Worker stall: a wave-5 worker died mid-task with 111 lines of uncommitted work — all lost, and the stall itself was invisible because the worker waited silently on a hung command.
Fix: stall-hardening in the worker template (US-3 Rules): commit each coherent unit as soon as it exists so progress survives a kill, and never wait silently more than 5 minutes on a build/test — kill, diagnose verbosely, report. Attempt 2 with this protocol shipped clean.
/ro:ralph — single-agent serial sibling/ro:grill-me — upstream PRD grilling/ro:write-a-prd — PRD writer that emits .ralph/<name>/prd.md/ro:pushover — notification fired at end of --afk runs/ro:drain-pr-queue — clean up old swarm-opened PRs after the run/ro:swarm — friendly alias for this skill/ro:night-shift — wraps this skill in a multi-wave drain + refill loop with explicit-scope opening grill. Use when you want indefinite overnight runs across the whole backlog, not just one PRDvaults/llm-wiki-skill-lab/wiki/skills/planner-worker.md — skill-lab page with provenancedevelopment
Close the loop on a Linear ticket when its work ships - move the status and post a deploy comment with the PR link, what shipped, and a try-it link, mentioning the collaborator. Used as the tail of /ro:linear-nightshift for every merged mirror, or manually after an ad-hoc build. Triggers on "linear update", "update the linear ticket", "mark NUT-x done", "tell eoin it shipped", "/ro:linear-update".
devops
Run a night-shift against a collaborator's Linear board. Pulls the team's Grilled tickets (/ro:linear-grill moves a ticket to Grilled once its questions are answered), VERIFIES the questions were actually answered (unanswered → bounce the ticket to the "Question for <name>" state), mirrors verified tickets to ephemeral GitHub issues with ready-for-agent, then runs the standard /ro:night-shift machinery on GitHub. Tail-calls /ro:linear-update for everything that merged + deployed. Triggers on "linear nightshift", "nightshift linear", "drain the linear board", "run the shift off linear", "/ro:linear-nightshift".
development
Grill a collaborator's Linear tickets and move every processed ticket to where it belongs. Resolves the board from the repo's .ro-linear.json, reads the collaborator's Backlog / Ready-for-agent issues, then per ticket either posts 3-5 decision-extracting questions (state moves to "Question for <name>") or confirms it build-ready (state moves to "Grilled", the gate /ro:linear-nightshift consumes); shipped-and-confirmed tickets close as Done. The async-collaborator counterpart of /ro:day-shift for people who never touch GitHub. Triggers on "grill linear", "grill eoin's tickets", "linear grill", "add questions to the linear tickets", "/ro:linear-grill".
development
--- name: about-page description: Add a standard About page to any web app, what it is, the tech stack, and an FAQ, wired into a footer link with a sticky footer. Built with Spartan + Tailwind (the canonical component layer) and falls back to semantic HTML so it ships reliably. Use whenever building, polishing, or shipping an app, every app should have one. Triggers on "add an about page", "about page", "footer about link", or as a standard step in app build/polish. category: frontend argument-h