skills/error-recovery/SKILL.md
Use when encountering failures - assess severity, preserve evidence, execute rollback decision tree, and verify post-recovery state
Handle failures gracefully with structured recovery.
Core principle: When things break, don't panic. Assess, preserve, recover, verify.
Announce at start: "I'm using error-recovery to handle this failure."
Error Detected
       │
       ▼
┌─────────────┐
│ 1. ASSESS   │ ← Severity? Scope? Impact?
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 2. PRESERVE │ ← Capture evidence before it's lost
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 3. RECOVER  │ ← Follow decision tree
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 4. VERIFY   │ ← Confirm clean state
└──────┬──────┘
       │
       ▼
┌─────────────┐
│ 5. DOCUMENT │ ← Record what happened
└─────────────┘
| Level | Description | Examples |
|-------|-------------|----------|
| Critical | System unusable, data at risk | Build completely broken, tests cause data loss |
| Major | Significant functionality broken | Feature doesn't work, many tests failing |
| Minor | Isolated issue, workaround exists | Single test flaky, style error |
| Info | Warning only, not blocking | Deprecation notice, performance hint |
## Error Assessment
**Error:** [Description of error]
**Location:** [Where it occurred]
### Severity Checklist
- [ ] Is the system still functional?
- [ ] Is any data at risk?
- [ ] Are other features affected?
- [ ] Is this blocking progress?
### Scope
- Files affected: [list]
- Features affected: [list]
- Users affected: [none/some/all]
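
The checklist answers can be mapped to one of the four severity levels mechanically. The sketch below shows one possible mapping; it is not part of the skill itself. The function name `classify_severity` and the precedence (system or data risk first, then spread to other features, then blocking) are assumptions made for illustration.

```shell
#!/bin/sh
# Hypothetical helper: map yes/no checklist answers to a severity level.
# Usage: classify_severity <functional> <data_at_risk> <others_affected> <blocking>
classify_severity() {
  functional=$1
  data_at_risk=$2
  others_affected=$3
  blocking=$4

  if [ "$functional" = no ] || [ "$data_at_risk" = yes ]; then
    echo Critical   # system unusable or data at risk
  elif [ "$others_affected" = yes ]; then
    echo Major      # breakage spreads beyond one feature
  elif [ "$blocking" = yes ]; then
    echo Minor      # isolated, but it stops current work (assumed mapping)
  else
    echo Info       # nothing is blocked
  fi
}

classify_severity yes no yes no   # prints "Major"
```

Treat the result as a starting point: a single failing test that guards a data migration may deserve a higher level than this mapping gives it.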
Capture BEFORE attempting fixes:
# Capture error output
pnpm test 2>&1 | tee error-log.txt
# Or from failed command
./failing-command 2>&1 | tee error-log.txt
## Stack Trace
Error: Connection refused
    at Database.connect (src/db/connection.ts:45)
    at UserService.init (src/services/user.ts:23)
    at main (src/index.ts:12)
# Git state
git status
git diff
# Environment state
env | grep -E "NODE|NPM|PATH"
# Dependency state
pnpm list
For UI errors, capture screenshots before changes.
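
One detail worth noting when capturing output with `tee`: a plain `command 2>&1 | tee file` pipeline reports the exit status of `tee`, not of the failing command. The sketch below keeps both the log and the original status. The helper name `capture_evidence`, the log file naming, and the `.last-status` side file are assumptions for illustration, not part of the skill.

```shell
#!/bin/sh
# Hypothetical helper: run a command, save combined stdout/stderr to a
# timestamped log, and return the command's own exit status.
# Usage: capture_evidence <log-dir> <command> [args...]
capture_evidence() {
  evidence_dir=$1
  shift
  mkdir -p "$evidence_dir" || return 1
  evidence_log="$evidence_dir/error-$(date +%Y%m%dT%H%M%S).log"
  status_file="$evidence_dir/.last-status"

  # The command's status is written to a side file from inside the
  # pipeline, because the pipeline itself only reports tee's status.
  { rc=0; "$@" 2>&1 || rc=$?; echo "$rc" > "$status_file"; } | tee "$evidence_log"

  evidence_status=$(cat "$status_file")
  rm -f "$status_file"
  echo "Evidence saved to $evidence_log" >&2
  return "$evidence_status"
}
```

A call such as `capture_evidence .evidence pnpm test` would then show the output, keep a copy under `.evidence/`, and still fail when the tests fail.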
              What type of failure?
                        │
     ┌────────────┬─────┴──────┬────────────┐
     │            │            │            │
   Code         Build     Environment   External
   Error        Error        Issue       Service
     │            │            │            │
     ▼            ▼            ▼            ▼
┌─────────┐  ┌─────────┐  ┌─────────┐  ┌─────────┐
│   Git   │  │  Clean  │  │   Re-   │  │  Wait/  │
│recovery │  │  build  │  │  init   │  │  Retry  │
└─────────┘  └─────────┘  └─────────┘  └─────────┘
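
The decision tree amounts to a four-way dispatch on failure type. A minimal sketch of that dispatch follows; the function name `recovery_for` and the path labels it prints are hypothetical, chosen only to mirror the boxes in the tree.

```shell
#!/bin/sh
# Hypothetical dispatcher: map a failure type to the recovery path named in
# the decision tree. It only prints the path; it does not run anything.
# Usage: recovery_for <code|build|environment|external>
recovery_for() {
  case "$1" in
    code)        echo "git-recovery" ;;
    build)       echo "clean-build" ;;
    environment) echo "re-init" ;;
    external)    echo "wait-retry" ;;
    *)
      echo "unknown failure type: $1" >&2
      return 1
      ;;
  esac
}

recovery_for build   # prints "clean-build"
```

Rejecting unknown types explicitly keeps an unclassified failure from silently falling into the wrong recovery path.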
Single file broken:
# Revert just that file
git checkout HEAD -- path/to/file.ts
Feature broken (multiple files):
# Find last good commit
git log --oneline
# Reset to that commit (soft reset moves HEAD and keeps the later changes staged)
git reset --soft [GOOD_COMMIT]
# Or hard reset (permanently discards uncommitted changes)
git reset --hard [GOOD_COMMIT]
Working directory is a mess:
# Stash current changes
git stash
# Verify clean state
git status
# Optionally recover stash later
git stash pop
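
Because a hard reset moves the branch away from recent commits, it can help to leave a named pointer to the old state first. The sketch below is one way to do that. The helper name `safe_hard_reset` and the `backup/pre-reset-…` branch naming are assumptions; uncommitted changes are still discarded, so stash them beforehand if they matter.

```shell
#!/bin/sh
# Hypothetical helper: record the current commit on a backup branch, then
# hard-reset to a known-good commit. Only committed work stays reachable
# through the backup branch; uncommitted changes are lost.
# Usage: safe_hard_reset <good-commit>
safe_hard_reset() {
  good_commit=$1
  backup_branch="backup/pre-reset-$(date +%Y%m%dT%H%M%S)"

  # Refuse to continue if the target does not resolve to a commit.
  git rev-parse --verify --quiet "$good_commit^{commit}" >/dev/null || {
    echo "Not a commit: $good_commit" >&2
    return 1
  }

  # Create the backup branch at the current HEAD before moving anything.
  git branch "$backup_branch" || return 1
  echo "Current HEAD saved as $backup_branch" >&2

  git reset --hard "$good_commit"
}
```

The commits would usually remain reachable through `git reflog` as well; the backup branch just makes the old state easier to find and harder to lose.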
# Clean build artifacts
rm -rf node_modules dist build .cache
# Reinstall dependencies
pnpm install --frozen-lockfile # Clean install from lock file
# Rebuild
pnpm build
# Check environment
env | grep -E "NODE|PNPM"
# Reset Node modules
rm -rf node_modules
pnpm install --frozen-lockfile
# If using nvm, verify version
nvm use
# Re-run init script
./scripts/init.sh
# Check if service is up
curl -I https://service.example.com/health
# If down, wait and retry
sleep 60
curl -I https://service.example.com/health
# If still down, check status page
# Document as external blocker
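
The wait-and-retry step can be generalized into a small loop with a growing delay. The following is a sketch under stated assumptions: the helper name `retry_with_backoff`, the doubling delay, and the attempt limit are illustrative choices, not something the skill prescribes.

```shell
#!/bin/sh
# Hypothetical helper: retry a command, doubling the delay between tries.
# Usage: retry_with_backoff <max-attempts> <initial-delay-seconds> <command> [args...]
retry_with_backoff() {
  max_attempts=$1
  delay=$2
  shift 2
  attempt=1

  while :; do
    if "$@"; then
      return 0
    fi
    if [ "$attempt" -ge "$max_attempts" ]; then
      echo "Still failing after $attempt attempts: $*" >&2
      return 1
    fi
    echo "Attempt $attempt failed; retrying in ${delay}s" >&2
    sleep "$delay"
    attempt=$((attempt + 1))
    delay=$((delay * 2))
  done
}
```

For the health check, a call could look like `retry_with_backoff 4 15 curl -fsSI https://service.example.com/health`. The `-f` flag matters here: without it, `curl` exits with status 0 even when the server answers with an HTTP error such as 503, so the loop would stop on the first response.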
After recovery, verify clean state:
# Clean working directory
git status
# Expected: "nothing to commit, working tree clean" or known changes
# Tests pass
pnpm test
# Build succeeds
pnpm build
# Types check
pnpm typecheck
# Run the specific thing that was broken
pnpm test --grep "specific test"
# Or verify the feature manually
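
Running the checks one by one stops at the first failure if they are chained, which hides whether the later checks would also fail. A minimal sketch of a runner that executes every check and summarizes the results is shown below; the name `verify_all` and the PASS/FAIL output format are assumptions for illustration.

```shell
#!/bin/sh
# Hypothetical helper: run every verification command, report each result,
# and fail at the end if any check failed (instead of stopping at the first).
# Command output is discarded to keep the summary short.
# Usage: verify_all "<command 1>" "<command 2>" ...
verify_all() {
  failed=0
  for check in "$@"; do
    if sh -c "$check" >/dev/null 2>&1; then
      echo "PASS  $check"
    else
      echo "FAIL  $check"
      failed=$((failed + 1))
    fi
  done
  [ "$failed" -eq 0 ]
}
```

For this project the call might be `verify_all 'test -z "$(git status --porcelain)"' "pnpm test" "pnpm build" "pnpm typecheck"`, where the first check passes only when the working tree is clean. Re-run any failing check directly to see its full output.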
gh issue comment [ISSUE_NUMBER] --body "## Error Recovery
**Error encountered:** [Description]
**Severity:** Major
**Evidence:**
\`\`\`
[Error output]
\`\`\`
**Recovery actions:**
1. [Action 1]
2. [Action 2]
**Verification:**
- [x] Tests pass
- [x] Build succeeds
**Root cause:** [If known]
**Prevention:** [If applicable]
"
// Store for future reference
mcp__memory__add_observations({
observations: [{
entityName: "Issue #[NUMBER]",
contents: [
"Encountered [error type] on [date]",
"Caused by: [root cause]",
"Resolved by: [recovery action]"
]
}]
});
# What changed?
git diff HEAD~3
# Did dependencies change?
git diff HEAD~3 pnpm-lock.yaml
# Clean reinstall
rm -rf node_modules && pnpm install --frozen-lockfile
# Check for environment differences
# - Node version
# - OS differences
# - Env vars
# Run with CI-like settings
CI=true pnpm test
# Check TypeScript errors
pnpm typecheck
# Check for circular dependencies
pnpm dlx madge --circular src/
# Clean build
rm -rf dist && pnpm build
# Don't panic
# Find last known good state
git log --oneline
# Reset to that state
git reset --hard [GOOD_COMMIT]
# Verify
pnpm test
# Start again more carefully
If recovery fails after 2-3 attempts:
## Escalation: Unrecoverable Error
**Issue:** #[NUMBER]
**Error:** [Description]
**Recovery attempts:**
1. [Attempt 1] - [Result]
2. [Attempt 2] - [Result]
**Current state:** [Broken/Partially working]
**Evidence preserved:** [Links to logs, screenshots]
**Requesting help with:** [Specific question]
Mark issue as Blocked and await human input.
When error occurs:
This skill is called by:
- issue-driven-development - When errors occur
- ci-monitoring - CI failures
This skill may trigger:
- research-after-failure - If cause is unknown
- issue-lifecycle