skills/golem-powers/large-plan/SKILL.md
Scaffold and execute folder-based multi-phase implementation plans with async agent collaboration. Creates docs.local/plans phases with acceptance criteria, owners, and PR boundaries. Use for large features, multi-PR refactors, parallel cmux agent work, or complex specs. NOT for single-file changes, simple bugs, short tasks, or brainstorming.
npx skillsauth add etanhey/golems large-planInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Invoke as: /large-plan (single segment).
Source: ~/Gits/golems/skills/golem-powers/large-plan/ (symlinked at ~/.claude/commands/large-plan).
Scaffold folder-based plans with phase folders, execute them through the branch-PR-review cycle, and coordinate async agent collaboration.
| What you want to do | Workflow | |---------------------|----------| | Create a new plan from a description | workflows/scaffold.md | | Execute the next phase in a plan | workflows/execute-phase.md | | Start async collab on a phase | workflows/collab.md |
| Script | Purpose | Usage |
|--------|---------|-------|
| scripts/scaffold-plan.sh | Create folder-based plan structure | bash scripts/scaffold-plan.sh <plan-dir> <plan-name> <phase-count> |
Large plans are folder-based: one folder per phase, each containing a README.md (steps) and findings.md (shared knowledge). A main README.md acts as the index with a progress table and routing.
plan-dir/
README.md # Index: progress table, routing, execution rules
collab.md # Created when parallel phases exist (see below)
phase-1-name/
README.md # Steps for this phase
findings.md # Shared knowledge room (agents write here)
phase-2-name/
README.md
findings.md
...
EVERY plan must decide this at scaffold time. Analyze the dependency graph:
Phases with NO cross-dependencies → Parallel (collab.md + multiple agents)
Phases that depend on each other → Sequential (execute-phase, one at a time)
Mixed → Rounds (parallel within round, sequential between rounds)
Decision tree:
Depends On fieldscollab.md at plan root## Execution Strategy to the main README.md showing rounds and parallelismExample:
## Execution Strategy
| Round | Phases | Mode | Agents |
|-------|--------|------|--------|
| 1 | Phase 1, Phase 2 | **parallel** (collab) | brainClaude, golemsClaude |
| 2 | Phase 3 (depends on 1+2) | sequential | mainClaude |
| 3 | Phase 4, Phase 5 | **parallel** (collab) | brainClaude, golemsClaude |
When a round has parallel phases, the orchestrator:
collab.md using the collab protocolScaffold plan → Analyze dependencies → Group into rounds
|
┌───────────────────────────┘
▼
Round has 1 phase? → Execute sequentially (execute-phase)
Round has 2+ phases? → Create collab.md, spawn agents in parallel
|
▼
All round phases done → Advance to next round → Repeat
Root cause (April 5 overnight sprint): orcClaude missed the second track — user wanted 9 entity files enhanced for morning walk + code PRs by dawn. Agent only scaffolded the code track.
At scaffold time, ALWAYS ask: "Are there non-code deliverables alongside the code phases?"
Common non-code deliverables:
If yes, add a separate phase or parallel track for the non-code work. Non-code deliverables are often the user's PRIMARY goal — the code is just infrastructure supporting it.
master -> feature/phase-N-name -> implement -> commit -> push -> PR
^ |
|__ merge <-- approve <-- fix <-- review (CodeRabbit + Cursor Bugbot + DeepSource)
Each phase README follows this template:
# Phase N: Name
> [Back to main plan](../README.md)
## Goal
One sentence describing what this phase achieves.
## Time
- **Estimate:** NNmin (basis: [complexity/rolling avg from prior phases])
- **Started:** HH:MM
- **Completed:** —
- **Actual:** —
- **Error ratio:** —
## Round
Round M (parallel with Phase X, Phase Y) OR Round M (sequential).
## Tools
- **Research:** [gemini|cursor|codex] — what to research
- **Code:** [cursor|haiku|sonnet] — what to implement
- **MCPs:** [list relevant MCP servers]
## Steps
1. Step one
2. Step two
3. ...
## Depends On
- Phase X (for Y reason)
## Status
- [ ] Step one
- [ ] Step two
Each phase findings.md is the shared collaboration room:
# Phase N Findings
## Decisions
- [timestamp] Decision: ...
## Research
- [timestamp] Agent: Found that ...
## Task Board
| Task | Owner | Status |
|------|-------|--------|
| Research X | gemini | done |
| Implement Y | cursor | in progress |
When a round has 2+ independent phases, use the full collab protocol defined in workflows/collab.md.
The orchestrator MUST:
collab.md at plan root using the template from the collab workflowdoneKey rule: If the human has to tell you to update the collab file, the collab has failed. Agents must self-coordinate.
Complexity tiers (from collab workflow):
See workflows/collab.md for the full protocol, mandatory sections, update gates, message format, and anti-patterns.
MANDATORY for every phase:
| Skill | When | Why |
|-------|------|-----|
| /pr-loop | Every phase completion | The FULL loop — branch through MERGED. Not optional. |
| /superpowers:test-driven-development | All implementation | Red-green-refactor. No code without failing test first. |
| /superpowers:verification-before-completion | Before claiming "done" | Evidence before assertions. Always. |
| /never-fabricate | Before reporting results | Read() files before summarizing them. |
Optional per phase:
| Skill | When to use |
|-------|-------------|
| /coderabbit | Verify phase output with targeted review |
| Manual QA checklist | Generate test plans per phase from the diff |
| /prd | Create PRDs from phase specs |
| /pr-loop step 5 | CodeRabbit review + atomic commit |
| /create-pr | Create PR (step 7 of pr-loop) |
After push, automated reviewers comment. Classify each:
| Type | Action | |------|--------| | Real bug | FIX immediately | | Style preference | Fix if genuinely better | | Over-engineering | SKIP | | Out of context | Comment explaining why |
Repeat push-fix cycle until no real bugs remain.
Claude Code features are listed first. If running on Codex or Cursor, use the universal fallback. Full adapter docs: adapters/
| Feature | Claude Code | Universal Fallback |
|---------|-------------|-------------------|
| Parallel phase agents | Agent(isolation="worktree", run_in_background=true) | Spawn CLI agents manually: cd <wt> && codex --full-auto "phase N" |
| Phase worktree isolation | Agent(isolation="worktree") — auto-creates + cleans up | git worktree add -b feature/phase-N ../<dir> master |
| Collab file monitoring | CronCreate or /loop 5m | nohup bash -c 'while true; do grep done collab.md; sleep 300; done' & |
| Cron cleanup (plan done) | CronDelete(<id>) — mandatory | kill <bg-monitor-pid> |
| Plan mode (spec first) | EnterPlanMode → ExitPlanMode | Write plan to docs.local/plan/<name>/README.md manually |
| Memory persistence | brain_store() / brain_search() via BrainLayer | Append to <plan-dir>/findings.md |
| Session resume | claude --resume | Not available — pass <plan-dir>/README.md in next session's context |
| Background phase execution | Agent(run_in_background=true) | nohup codex --full-auto "..." > phase.log 2>&1 & |
Data from April 5 overnight sprint (brainlayer, Codex workers): estimated 90min/phase, actual 15min average. Started at 6x overestimate, auto-calibrated to 1.25x by phase 7. Record timestamps at phase start + PR creation. Without tracking, estimates never calibrate.
The main README.md progress table MUST include estimate and actual columns:
## Progress
| Phase | Status | Estimate | Started | Completed | Actual | Error |
|-------|--------|----------|---------|-----------|--------|-------|
| 1. Setup | ✅ done | 30min | 1:15 AM | 1:28 AM | 13min | 2.3x |
| 2. Search | ✅ done | 30min | 1:30 AM | 1:42 AM | 12min | 2.5x |
| 3. Hybrid | 🔄 active | 15min* | 1:45 AM | — | — | — |
| 4. Evals | ⏳ pending | 15min* | — | — | — | — |
*Auto-recalibrated from rolling avg of phases 1-2 (12.5min → round to 15min)
Rolling calibration: 2.3x → 2.5x → tracking...
brain_store(
content: "CLOCK IN [plan-name / Phase N]: Started HH:MM. Estimate: NNmin. Basis: [first phase=complexity, later=rolling avg].",
tags: ["time-tracking", "clock-in", "<project>"],
importance: 5
)
Fill in the phase template's Time section: Started, Estimate.
brain_store(
content: "CLOCK OUT [plan-name / Phase N]: PR merged HH:MM. Actual: NNmin. Estimated: NNmin. Error: X.Xx. Rolling avg (last 3): NNmin.",
tags: ["time-tracking", "clock-out", "<project>"],
importance: 5
)
Fill in the phase template's Time section: Completed, Actual, Error ratio. Update the main README progress table.
Once 3 phases have actuals:
rolling_avg = average(last 3 actuals)
remaining_phases × rolling_avg = estimated total remaining
Report: "Phases 1-3 done in 38min total. Rolling avg: 12.7min.
Remaining 4 phases: ~51min at current pace.
Sprint total ETA: ~89min (original estimate was 630min = 7.1x overestimate)"
Rule: After 3+ phases, new estimates MUST be within 2x of rolling average. Don't keep estimating 90min when actuals are 15min.
User correction (April 5): "No, I'm saying it will take probably hours, not weeks" — after orc estimated a 2-week timeline for work that took one evening. Time tracking turns this from a repeated correction into self-correcting behavior.
| Gate | Check |
|------|-------|
| Typed right | No any, proper interfaces |
| Documented | JSDoc on exports, CLAUDE.md updated if needed |
| DRY | No duplicated logic |
| Tests pass | bun test / npm test green |
| Build passes | No compile errors |
Closes the self-audit-as-evaluator substitution loophole. Observed at P5 fix queue 2026-05-17 — agent self-graded "evaluator replay PASS" without dispatching a separate evaluator. /goal hook silently passed.
Every /large-plan output that produces code, scripts, configs, or plist drafts MUST end with a Phase N+1 that:
Agent(subagent_type=evaluator, ...) or equivalent platform fallback.ITERATE verdict with specific fixes.subagent_type ≠ producing agent.Template: workflows/phase-evaluator.md — minimal evaluator-subagent dispatch (prompt format, scoring rubric link).
Done-gate semantics:
| Producing agent emits | /goal hook treats as |
|-----------------------|----------------------|
| TASK_DONE without evaluator dispatch transcript | FAIL (substitution loophole) |
| TASK_NEEDS_EVALUATOR + transcript of separate evaluator scoring ≥8/10 | PASS |
| TASK_NEEDS_EVALUATOR + evaluator ITERATE verdict | RE-DISPATCH (do not declare done) |
Evidence: 4-of-4 /goal outputs 2026-05-17 night surfaced critical issues only when externally evaluated. P5 fix queue silently substituted self-audit for the required external evaluator replay (skillcreator-p5fix mine [1438]).
tools
The human-eval UX contract for Phoenix views: turn-by-turn scrollable replay (not a scorecard), hide-but-copyable IDs, collapsed thinking, identity chips, tool filters, tiny frozen starter datasets, mark-wrong-in-thread, mobile-first. Use when: building or reviewing ANY Phoenix/eval view, annotation UI, session replay, or human-grading surface. Triggers: phoenix view, eval UI, annotation view, session replay, human eval UX, grading interface. NOT for: Phoenix data pipelines/ingest (capture scripts have their own specs).
tools
macOS systems specialist — AppKit NSPanel architecture, launchd services, socket activation, MCP bridge resilience, syspolicyd, and high-frequency SwiftUI dashboards. Use when building menu-bar apps, LaunchAgents, debugging syspolicyd/Gatekeeper/TCC, resilient UDS/MCP bridges, or SwiftUI dashboards at 10Hz+.
development
Bulk LLM-judging protocol for fleet-dispatched verdict runs (KG cluster, eval harness). Use when: dispatching or running judge workers (J1/J2/RT), planning bulk-apply from verdict JSONL, or triaging evidence_degraded outputs. Triggers: judge fleet, bulk judge, R3 verdicts, kg-judge, RT gate, evidence_degraded. NOT for: single-item code review, Phoenix view UX (use phoenix-human-view), or non-judge eval pipelines.
development
Quiet-down protocol for sprint close: when the fleet wraps, delete ALL polling crons and monitors, send ONE final dashboard + ONE message, then go SILENT. Use when: fleet wraps, all workers done, overnight queue exhausted, sprint close, Etan asleep/away with nothing approved left. Triggers: fleet wrap, wrap the fleet, stand down, going quiet, sprint close. NOT for: mid-sprint monitoring (keep your loops), spawning a successor (use /session-handoff first).