skills/mav-claude-code-recovery/SKILL.md
Patterns for Claude Code workflow resilience — state persistence, crash recovery, command failure handling, subagent failure handling, and artefact durability. Not about application-level error handling.
Patterns for making Claude Code workflows resilient to failures. This covers Claude Code's own recovery mechanisms, not error handling in generated application code.
Maintain a state file to track workflow progress. The mav-github-issue-workflow skill defines the canonical state file format for issue-driven work (.claude/issue-state.json). This section covers the general principles.
| Field | Why |
|---|---|
| Current phase | Resume from the right point |
| Branch name | Avoid re-deriving or creating duplicates |
| GitHub comment IDs | Update existing comments instead of posting new ones |
| Issue/task reference | Know what we're working on |
Write to the state file after each significant milestone, not before:
digraph state_writes {
"Complete a phase" [shape=box];
"Write state" [shape=box];
"Start next phase" [shape=box];
"Failure mid-phase?" [shape=diamond];
"State reflects last completed phase" [shape=box];
"Resume from last completed phase" [shape=box];
"Complete a phase" -> "Write state";
"Write state" -> "Start next phase";
"Start next phase" -> "Failure mid-phase?";
"Failure mid-phase?" -> "State reflects last completed phase" [label="yes"];
"State reflects last completed phase" -> "Resume from last completed phase";
"Failure mid-phase?" -> "Complete a phase" [label="no — phase completes"];
}
Writing after completion means the state file always reflects a consistent, completed milestone. If a failure occurs mid-phase, the state file points to the last successfully completed phase, and the incomplete phase can be re-executed from scratch.
jq '.phase = "design"' .claude/issue-state.json > .claude/issue-state.tmp && mv .claude/issue-state.tmp .claude/issue-state.json
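The command above writes to a temporary file and then renames it, so a crash mid-write cannot leave a truncated state file. A minimal sketch of the same pattern as a reusable function, assuming a hypothetical helper name `write_state` and that the caller supplies the complete JSON text (neither is defined by this skill):

```shell
#!/usr/bin/env bash
# Hypothetical helper: atomically replace a state file with new content.
# Usage: write_state <path> <json-text>
write_state() {
  local path="$1" content="$2" dir tmp
  dir="$(dirname "$path")"
  mkdir -p "$dir" || return 1
  # Create the temp file next to the target so mv is a rename on the
  # same filesystem rather than a cross-device copy.
  tmp="$(mktemp "$dir/.state.XXXXXX")" || return 1
  if printf '%s\n' "$content" > "$tmp" && mv "$tmp" "$path"; then
    return 0
  fi
  # Writing or renaming failed: drop the temp file, leave the old state intact.
  rm -f "$tmp"
  return 1
}
```

Call it only after a phase has completed, for example `write_state .claude/issue-state.json "$NEW_STATE"`, where `NEW_STATE` is whatever JSON the workflow has assembled.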
Post artefacts (designs, plans, status updates) to GitHub immediately after they are produced. This ensures they survive Claude Code session failures.
| Artefact | Where to post | When |
|---|---|---|
| Solution design | Issue comment | Immediately after design is approved/produced |
| Implementation plan | Issue comment | Immediately after plan is approved/produced |
| Completion summary | Issue comment | After all steps pass verification |
| Blocking questions | Issue comment | When blocked and unable to resolve |
If you batch artefacts and post them at the end, a session failure before that point loses every artefact produced so far.
Post each artefact as soon as it is ready. Use the mav-github-issue-workflow comment patterns to capture comment IDs for later updates.
When starting a session that may be resuming previous work:
digraph resume {
"Session starts" [shape=box];
"State file exists?" [shape=diamond];
"Read state file" [shape=box];
"Validate state against reality" [shape=box];
"State valid?" [shape=diamond];
"Resume from recorded phase" [shape=box];
"Reconcile state" [shape=box];
"Fresh start" [shape=box];
"Session starts" -> "State file exists?";
"State file exists?" -> "Read state file" [label="yes"];
"State file exists?" -> "Fresh start" [label="no"];
"Read state file" -> "Validate state against reality";
"Validate state against reality" -> "State valid?";
"State valid?" -> "Resume from recorded phase" [label="yes"];
"State valid?" -> "Reconcile state" [label="no — mismatch"];
"Reconcile state" -> "Resume from recorded phase";
}
Before resuming, verify that the recorded state matches what actually exists:
| State field | Validation check |
|---|---|
| Branch | Does the branch exist locally? `git branch --list $BRANCH` |
| Branch | Does it exist on remote? `git ls-remote --heads origin $BRANCH` |
| Phase | Are the expected commits present? `git log --oneline -5` |
| Comment IDs | Do the comments exist on the issue? `gh api repos/$REPO/issues/comments/$ID --jq '.id'` |
| Issue | Is the issue still open? `gh issue view $ISSUE --json state -q '.state'` |
If validation finds discrepancies:
| Mismatch | Action |
|---|---|
| Branch exists locally but not on remote | Branch was never pushed — continue from implementation phase |
| Branch exists on remote but not locally | `git fetch origin && git checkout $BRANCH` |
| Branch does not exist anywhere | State is stale — start fresh, but check if a PR was already created |
| Comment ID does not exist | Comment was deleted — post a new one and update state |
| Issue is closed | Check if a PR was merged — work may already be complete |
| Phase says implement but no commits on branch | Implementation was interrupted — restart implementation phase |
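The branch rows of the mismatch table can be expressed as a small lookup. This is a sketch only: the function name `branch_action` and the action labels it prints are hypothetical, and the `yes`/`no` inputs are assumed to come from the validation checks (for example, treating non-empty output from `git branch --list $BRANCH` as `yes`).

```shell
# Hypothetical helper mirroring the branch rows of the mismatch table.
# Usage: branch_action <exists_locally: yes|no> <exists_on_remote: yes|no>
branch_action() {
  case "$1:$2" in
    yes:yes) echo "resume" ;;                        # state matches reality
    yes:no)  echo "continue-from-implementation" ;;  # branch was never pushed
    no:yes)  echo "fetch-and-checkout" ;;            # git fetch origin && git checkout
    no:no)   echo "start-fresh-check-for-pr" ;;      # stale state
    *)       echo "invalid-input" >&2; return 2 ;;
  esac
}
```

Keeping the decision separate from the git calls makes the table easy to review and extend without touching the commands that gather the facts.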
When resuming, briefly report to the user what was recovered:
Resuming work on issue #42 (feat/42-add-export).
- Phase: implementation (3 of 5 steps completed)
- Branch: feat/42-add-export (exists locally and on remote)
- Design and plan already posted to issue
- Continuing from step 4.
When a tool or command fails during execution:
digraph cmd_failure {
"Command fails" [shape=box];
"Read error output" [shape=box];
"Known failure pattern?" [shape=diamond];
"Apply known fix" [shape=box];
"Diagnose root cause" [shape=box];
"Can fix autonomously?" [shape=diamond];
"Fix and retry" [shape=box];
"Report to user with diagnosis" [shape=box];
"Command fails" -> "Read error output";
"Read error output" -> "Known failure pattern?";
"Known failure pattern?" -> "Apply known fix" [label="yes"];
"Known failure pattern?" -> "Diagnose root cause" [label="no"];
"Apply known fix" -> "Fix and retry";
"Diagnose root cause" -> "Can fix autonomously?";
"Can fix autonomously?" -> "Fix and retry" [label="yes"];
"Can fix autonomously?" -> "Report to user with diagnosis" [label="no"];
}
| Error pattern | Likely cause | Fix |
|---|---|---|
| ENOENT / file not found | Wrong path or file not yet created | Verify path, check if a prerequisite step was skipped |
| EACCES / permission denied | File permissions or sandbox restriction | Check permissions, do not bypass sandbox |
| npm ERR! / dependency resolution | Lock file out of sync or missing dependency | Run pnpm install / npm install |
| tsc type errors after code change | Introduced type mismatch | Read the error, fix the type issue |
| Test timeout | Test is hanging or async issue | Check for missing await, unclosed handles |
| gh: Not Found | Wrong repo, issue number, or permissions | Verify repo and issue exist, check auth |
| Git conflict markers in file | Unresolved merge conflict | Follow mav-git-workflow merge conflict procedure |
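As an illustration of the "known failure pattern?" step, the table can be approximated by matching captured error output against a few substrings. The function name `classify_error`, the exact substrings, and the hint labels are assumptions made for this sketch. Test timeouts are left out because their output depends on the test runner.

```shell
# Hypothetical helper: map captured error output to a hint from the table.
# Usage: classify_error "<error text>"
classify_error() {
  case "$1" in
    *ENOENT*|*'No such file or directory'*) echo "verify-path" ;;
    # 'ermission denied' matches both capitalised and lower-case forms.
    *EACCES*|*'ermission denied'*)          echo "check-permissions" ;;
    *'npm ERR!'*)                           echo "reinstall-dependencies" ;;
    *'error TS'*)                           echo "fix-type-error" ;;
    *'<<<<<<<'*)                            echo "resolve-merge-conflict" ;;
    *'gh: Not Found'*)                      echo "verify-repo-and-auth" ;;
    *)                                      echo "diagnose-root-cause" ;;
  esac
}
```

An unmatched error falls through to diagnosis rather than a guessed fix, which matches the "no" branch of the graph.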
Do not use --force, --no-verify, or || true to make a failing command succeed. Fix the root cause.

Multi-story (do-epic) workflows have additional state on GitHub that
single-issue recovery does not. On entry, always re-hydrate from GitHub
per mav-durability-on-gh before touching anything local.
| Signal | Location | What it means |
| --- | --- | --- |
| maverick-dag marker | Epic issue | DAG exists — do not rebuild, read it |
| maverick-state marker | Epic issue | Current story statuses — trust over local cache |
| maverick-claim + maverick-lease | Each claimed issue | Who holds what, and whether they are alive |
| maverick-bprop marker | Epic issue | A block walk was in progress — must resume it first |
| blocked-by:#N labels | Each story | Which stories are currently blocked |
If a local .claude/epic-state.json exists but the GitHub markers
disagree, GitHub wins — overwrite the local cache.
digraph epic_resume {
"Session starts" [shape=box];
"maverick-bprop present?" [shape=diamond];
"Resume block propagation — idempotent walk from the ejected story" [shape=box];
"Claim still live on any issue?" [shape=diamond];
"Re-verify heartbeat — extend or release" [shape=box];
"Stale claims present?" [shape=diamond];
"Decide takeover vs defer" [shape=box];
"Proceed with next wave" [shape=box];
"Session starts" -> "maverick-bprop present?";
"maverick-bprop present?" -> "Resume block propagation — idempotent walk from the ejected story" [label="yes"];
"maverick-bprop present?" -> "Claim still live on any issue?" [label="no"];
"Resume block propagation — idempotent walk from the ejected story" -> "Claim still live on any issue?";
"Claim still live on any issue?" -> "Re-verify heartbeat — extend or release" [label="yes"];
"Claim still live on any issue?" -> "Stale claims present?" [label="no"];
"Stale claims present?" -> "Decide takeover vs defer" [label="yes"];
"Stale claims present?" -> "Proceed with next wave" [label="no"];
"Re-verify heartbeat — extend or release" -> "Proceed with next wave";
"Decide takeover vs defer" -> "Proceed with next wave";
}
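The order of checks in the graph above can be sketched as a function that prints the steps to take. The name `epic_resume_steps` and the step labels are hypothetical. The three `yes`/`no` inputs are assumed to have been derived from the GitHub markers beforehand.

```shell
# Hypothetical helper: print resume steps in the order the graph visits them.
# Usage: epic_resume_steps <bprop_present> <live_claim> <stale_claims>
epic_resume_steps() {
  local bprop="$1" live="$2" stale="$3"
  # Block propagation always comes first when its marker is present.
  if [ "$bprop" = yes ]; then
    echo "resume-block-propagation"
  fi
  # Stale claims are only examined when no live claim exists.
  if [ "$live" = yes ]; then
    echo "reverify-heartbeat"
  elif [ "$stale" = yes ]; then
    echo "decide-takeover-or-defer"
  fi
  echo "proceed-with-next-wave"
}
```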
If a maverick-bprop marker exists on the epic when you enter, the prior
instance was mid-block-walk when it died. The marker payload names the
ejected story, the full descendant set, and which descendants are already
labelled. Per mav-block-propagation:
For each descendant not yet in labelled:
- Apply blocked-by:#<ejected> if not present (idempotent).
- Update maverick-state for that story.
- Record the story under labelled in the marker.

Once every descendant is in labelled, delete the marker.

Only after maverick-bprop is cleared should you proceed with other
epic work. Leaving a partially-applied block set running invites
downstream PRs to merge against a broken base.
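Resuming the walk amounts to a set difference: descendants minus those already labelled. A minimal sketch, assuming both sets are passed as space-separated issue numbers. That representation and the name `pending_descendants` are illustrative only. The real marker payload format is defined by mav-block-propagation.

```shell
# Hypothetical helper: list descendants that still need the blocked-by label.
# Usage: pending_descendants "<descendants>" "<labelled>"
pending_descendants() {
  # Pad the labelled set with spaces so whole numbers match (12 must not match 120).
  local descendants="$1" labelled=" $2 " story
  for story in $descendants; do
    case "$labelled" in
      *" $story "*) ;;           # already labelled, skip
      *) echo "$story" ;;        # still needs the label
    esac
  done
}
```

Because already-labelled stories are skipped, running the walk again after another crash repeats no work, which is what makes the resume idempotent.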
Local worktrees under .maverick/worktrees/ are a cache. After
re-hydration, walk them and reconcile:
| Worktree state | GitHub state | Action |
| --- | --- | --- |
| Exists | Story merged on GH | Destroy the worktree — work is done |
| Exists | Story ejected on GH | Keep — note the path for the human |
| Exists | Story blocked on GH | Destroy — abandon the work |
| Exists | Story in_flight by us (live lease) | Resume where you left off |
| Exists | Story in_flight by another instance | Destroy — they own it now |
| Missing | Story in_flight by us | Recreate from origin/<branch> |
See mav-durability-on-gh for the worktree-recreate command.
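The reconciliation table maps directly onto a lookup. In this sketch the function name `worktree_action`, the `us`/`other` owner values, and the printed action labels are hypothetical. The story statuses are the ones used in the table.

```shell
# Hypothetical helper mirroring the worktree reconciliation table.
# Usage: worktree_action <exists|missing> <story status> [lease owner: us|other]
worktree_action() {
  case "$1:$2:${3:-}" in
    exists:merged:*)        echo "destroy" ;;               # work is done
    exists:ejected:*)       echo "keep-and-report-path" ;;  # a human needs it
    exists:blocked:*)       echo "destroy" ;;               # abandon the work
    exists:in_flight:us)    echo "resume" ;;                # live lease is ours
    exists:in_flight:other) echo "destroy" ;;               # another instance owns it
    missing:in_flight:us)   echo "recreate-from-origin" ;;
    *)                      echo "no-action-defined" ;;     # not covered by the table
  esac
}
```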
When a subagent fails or returns incomplete results:
digraph subagent_failure {
"Subagent returns" [shape=box];
"Result complete and correct?" [shape=diamond];
"Accept result" [shape=box];
"Partial result?" [shape=diamond];
"Dispatch new subagent with corrective context" [shape=box];
"Total failure?" [shape=diamond];
"Dispatch new subagent with simplified scope" [shape=box];
"Report to user" [shape=box];
"Subagent returns" -> "Result complete and correct?";
"Result complete and correct?" -> "Accept result" [label="yes"];
"Result complete and correct?" -> "Partial result?" [label="no"];
"Partial result?" -> "Dispatch new subagent with corrective context" [label="yes — some work done"];
"Partial result?" -> "Total failure?" [label="no work done"];
"Total failure?" -> "Dispatch new subagent with simplified scope" [label="scope too large"];
"Total failure?" -> "Report to user" [label="unclear cause"];
}
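The subagent graph above reduces to a short decision function. This is a sketch under assumptions: the name `subagent_action`, the `complete`/`partial`/`none` result values, and the cause labels are all hypothetical, and judging whether a result is complete is left to the caller.

```shell
# Hypothetical helper mirroring the subagent failure graph.
# Usage: subagent_action <complete|partial|none> [cause: scope-too-large|unclear]
subagent_action() {
  case "$1:${2:-}" in
    complete:*)           echo "accept-result" ;;
    partial:*)            echo "redispatch-with-corrective-context" ;;
    none:scope-too-large) echo "redispatch-with-simplified-scope" ;;
    # Unclear cause or unrecognised input: escalate instead of retrying blindly.
    *)                    echo "report-to-user" ;;
  esac
}
```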