skills/self-healing/SKILL.md
Active runtime recovery for coding agents: when something breaks mid-task, diagnose the root cause, write a fix, VERIFY by re-running the broken thing, then file a `HEAL-` entry to `.learnings/HEALS.md` with proof. Use whenever a command, test, build, or lint fails or exits non-zero; on missing tooling, dependency/lockfile mismatch, wrong runtime version, venv or permission errors, port conflicts, dirty git state, or a missing `.env`; when the agent needs a helper or one-off script that doesn't exist yet; when an external API, tool, or MCP errors or rate-limits; or when a test flakes. Search `HEALS.md` by `Pattern-Key` first — most heals are recurrences, so increment `Recurrence-Count` instead of duplicating. Verify is mandatory: mark `pending-verify` honestly if sandboxed, `abandoned` if the fix can't be made to work. Pairs with `self-improvement` (which promotes recurring heals to durable memory) but owns the verify-before-persist discipline self-improvement doesn't.
npx skillsauth add pskoett/pskoett-ai-skills self-healingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Active runtime recovery for coding agents. When something breaks, run the loop: diagnose → patch → verify → file. Leave behind a reusable, verified artifact instead of a swept-under-the-rug failure.
The premise mirrors browser-use/browser-harness: the harness improves itself every run. An agent that hits a gap doesn't fail — it writes the fix during execution, verifies it works, and files the durable artifact for future runs. Coding tasks deserve the same loop.
When a coding agent hits a wall mid-task, the default failure modes are:
All three turn a one-time failure into a recurrence. The next agent on the same project hits the same wall.
This skill enforces one discipline: verify before persist. A patch isn't real until you've re-run the failing operation and watched it succeed. When it does, file the verified fix so the next run benefits.
These two skills are deliberately split. Run both — they feed each other but don't overlap.
| Aspect | self-healing (this skill) | self-improvement |
| ----------- | -------------------------------------------------------------------- | ------------------------------------------------------------- |
| When | During execution, failure is live | After the fact, at natural breakpoints |
| Verb | Heal now — restore working state | Remember for later — accumulate knowledge |
| Outcome | Verified patch + (optional) reusable artifact | Logged learning, correction, request |
| Verify | Mandatory — no persist without proof | Not required |
| Files | .learnings/HEALS.md + .learnings/heals/<HEAL-ID>/ (lazy) | .learnings/ERRORS.md, LEARNINGS.md, FEATURE_REQUESTS.md |
| Trigger | Failure observed mid-task | Correction, knowledge gap, feature request, recurrence |
Boundary rule: if you're capturing a fact, a correction, or a wish — that's self-improvement. If you're applying and verifying a fix to a live failure — that's self-healing.
● failure observed
│
● 1. DIAGNOSE capture context — command, error, env, what was attempted
│ search HEALS.md for the same Pattern-Key first
│ (most heals are recurrences; don't reinvent)
│
● 2. PATCH write the fix — script, helper, env tweak, alt command
│ artifacts → .learnings/heals/<HEAL-ID>/ (only if needed)
│
● 3. VERIFY re-run the failing op — must succeed
│ ↻ if still failing: refine and retry, cap at 3 attempts
│ ✗ if uncrackable: file Status: abandoned with notes
│
● 4. FILE write HEAL-YYYYMMDD-XXX to .learnings/HEALS.md
│ with Pattern-Key, status, verification proof
│
✓ working state restored, heal persisted
(conditional) PROMOTE if Pattern-Key recurrence ≥ 3 across distinct tasks,
append a Handoff block → self-improvement promotes to memory
If you abandon a heal mid-loop, don't pretend it succeeded. File a HEAL- entry with Status: abandoned and notes on what didn't work. The next agent learns from the dead end too.
Self-healing fires on active failures during execution — the agent has just observed something not working and needs to make it work to continue. Five shapes:
Any invocation exits non-zero or produces wrong output. Don't acknowledge and retry verbatim — diagnose, patch, verify.
Examples: npm install errors when a pnpm-lock.yaml is present (switch tool); pytest fails with ModuleNotFoundError (activate the venv); tsc flags a stale type (regenerate the client); eslint reports a config error (install the missing parser).
The agent needs something that doesn't exist yet — a script, a helper, a wrapper, a glue function. Write it in the moment. This is the closest analog to browser-harness's agent_helpers.py.
Examples: dedupe a CSV by custom key (write a small Python helper); bootstrap 12 microservices the same way (write scripts/bootstrap-all.sh); bulk-rename branches matching a pattern (write a gh-based shell helper).
The local environment isn't what the project expects. Detect, patch, verify.
Examples: runtime version mismatch (nvm use, pyenv local, rustup override); stale dependency cache after a branch switch; dirty git state blocking a checkout; missing .env (copy from .env.example and surface gaps).
A service the agent depends on returns something unexpected. Find a workaround and capture it.
Examples: an MCP tool returns InputValidationError because the schema changed (patch the call shape); a public API hits a rate limit (back off, switch endpoint, batch); an upstream lib bumped a default and broke a script (pin the version).
The agent catches itself about to redo the failing step. That self-recognition is a heal forming — capture the alternate approach as the patch.
command not found / module not found / permission deniedAppend to .learnings/HEALS.md (create if missing):
## [HEAL-YYYYMMDD-XXX] short_kebab_name
**Logged**: ISO-8601 timestamp
**Status**: verified | pending-verify | abandoned
**Trigger**: tool-failure | missing-capability | env-issue | external-change | <free-form>
**Active-Context**: (optional) — current skill, task phase, or workflow stage; omit if not applicable
**Area**: free-form tag — what part of the system (`build`, `tests`, `ci`, `auth`, `data-pipeline`, `mobile`, ...)
**Priority**: low | medium | high | critical
### Failure
What broke — concrete: the command, the error message, the action that was blocked. Include exit codes and verbatim error lines.
### Diagnosis
The root cause as understood after investigation. Why the obvious approach didn't work. Not a guess — what was actually verified during the heal.
### Fix
The patch that was applied. Verbatim commands, code snippets, or pointers to files under `.learnings/heals/<HEAL-ID>/`. Keep it minimal — just enough to reproduce.
### Verification
What was run after the fix and what it returned. Exit code, output snippet, test pass count. **This is the proof.** Without it, the entry is `pending-verify` or `abandoned`.
### Artifacts
(omit this section if no files were generated; otherwise list relative paths under `.learnings/heals/<HEAL-ID>/`)
### Metadata
- Related Files: path/to/file.ext
- See Also: HEAL-... | LRN-... | ERR-... (related entries)
- Pattern-Key: lower.snake.case key for recurrence detection (e.g. `env.lockfile_mismatch`)
- Recurrence-Count: 1
- First-Seen / Last-Seen: YYYY-MM-DD
---
verified = the verify step passed. pending-verify = patch applied but couldn't be fully proven (sandboxed/offline/CI-only) — surface to the user. abandoned = patch didn't work or diagnosis was wrong — document what was tried.domain-skills/<site>/.frontend, data-pipeline, ci, auth, terraform, mobile, embedded — anything that fits your project shape.env.lockfile_mismatch is good; fixed_thing_tuesday isn't.Format: HEAL-YYYYMMDD-XXX. XXX is sequential 3-digit or 3-char random alphanumeric. Examples: HEAL-20260524-001, HEAL-20260524-A7B.
Only create .learnings/heals/<HEAL-ID>/ when the heal generated something worth preserving. One-line fixes don't need a folder; the HEAL entry text is enough. Abandoned heals with no applied patch also skip the folder.
.learnings/
├── HEALS.md
├── ERRORS.md / LEARNINGS.md / FEATURE_REQUESTS.md (self-improvement)
└── heals/
└── HEAL-20260524-001/
├── helper.sh
├── patch.diff
└── notes.md
Put here: generated scripts/helpers, patch files, supplementary notes, output captures that document the diagnosis. Don't put here: project source changes (those go in the project tree, referenced via Related Files); secrets; output already captured in the HEAL text.
Verify is the load-bearing wall. The whole point of self-healing over self-improvement is that the fix is proven, not theorized.
| Failure shape | Verification | | ------------------------------------- | ------------------------------------------------------------------ | | Tool / command / test / build / lint | Re-run the original invocation; expect exit 0 / pass | | Missing capability | Invoke the helper end-to-end on a real input; expect the intent | | Environment drift | Re-run the operation that triggered the diagnosis | | External service workaround | Re-run the failed call with the patch; expect a usable response |
When you genuinely can't run the verify step (no network, no real remote, sandboxed shell, CI-only reproduction), file Status: pending-verify with:
pending-verify is honest. Faking verified is the failure mode this skill exists to prevent.
Most heals don't need a separate proof script — the verify step is just re-running the failing thing. Build a proper proof script when:
Status: abandoned with notes on what you tried. Surface to the user. Don't flail.|| true, --ignore-errors) — that's hidingPrefer reversible patches. If your heal modifies project files, capture the diff in patch.diff. If the heal is destructive (deletes generated files, rewrites locks), note it explicitly — a future agent reading the HEAL needs to know what was destroyed.
Most heals are recurrences. Before filing a new HEAL, search:
grep -n "Pattern-Key: <your-pattern-key>" .learnings/HEALS.md
If found:
Recurrence-CountLast-SeenAdd a Handoff block to an existing entry when all are true:
Recurrence-Count >= 3### Handoff
- **Promoted To**: self-improvement at YYYY-MM-DD
- **Promotion Target**: CLAUDE.md | AGENTS.md | .github/copilot-instructions.md | new-skill
- **Distilled Rule**: One-line prevention guidance derived from the heal
Then self-improvement (or a learning aggregator) takes over: distills the rule, writes it into the right context file, or extracts a reusable skill. The HEAL stays for traceability.
pending-verify or abandoned.pytest.skip, it.skip, xit). A flaky CI isn't healed by --retry. Find the root cause; if you can't, abandon honestly.HEALS.md by Pattern-Key. Most heals are recurrences.scripts/, Makefile, justfile, package.json, pyproject.toml first. Heal = write what's missing, not what's there..learnings/heals/<HEAL-ID>/ if nothing goes in it.env.node_version_mismatch is reusable; fixed_the_thing_on_tuesday isn't..learnings/heals/ are reference material; if a script becomes load-bearing, promote it to scripts/.mkdir -p .learnings # heals/ is lazy — created only when artifacts exist
touch .learnings/HEALS.md
Gitignore choices match self-improvement. Keep heals local (.learnings/ in .gitignore) or share them as team knowledge (don't gitignore — they become reviewable durable context).
Automatic triggering on command failures is optional and agent-specific. See references/hooks.md for Claude Code / Codex configuration.
The skill is agent-agnostic. The .learnings/HEALS.md format is plain markdown — any agent (Claude Code, Codex CLI, Copilot, Cursor, Aider, ...) can read and write it. Agents without hook support can be reminded via their instruction file (e.g. .github/copilot-instructions.md). See references/hooks.md for examples.
How self-healing slots into a larger skill pipeline (with upstream surfacing of past heals, downstream promotion of recurrences, and machine-verification gates) is documented in references/pipeline-integration.md. Not required to use this skill — it stands alone.
references/examples.md — canonical HEAL entry shapes (command failure, missing capability, env drift, external API workaround, abandoned heal)references/interop-with-self-improvement.md — decision table and handoff payload between the two skillsreferences/pipeline-integration.md — how self-healing relates to upstream/downstream skills in a larger pipelinereferences/hooks.md — automatic triggering setup for Claude Code / Codexdevelopment
Control-plane workflow for coordinating multi-agent, multi-session project work from a single Codex, GitHub Copilot, or agent-app control session. Use this skill whenever the user asks to orchestrate agents, create or steer worker sessions, run a workflow-like effort, fan out audits/research/migrations, coordinate parallel implementation streams, monitor other project sessions, or compare this control-session pattern to Claude Code dynamic workflows. This skill is especially relevant when the current session can spawn persistent project sessions and those sessions can spawn their own subagents, creating a two-level orchestration hierarchy.
tools
Active runtime recovery for coding agents: when something breaks mid-task, diagnose the root cause, write a fix, VERIFY by re-running the broken thing, then file a `HEAL-` entry to `.learnings/HEALS.md` with proof. Use whenever a command, test, build, or lint fails or exits non-zero; on missing tooling, dependency/lockfile mismatch, wrong runtime version, venv or permission errors, port conflicts, dirty git state, or a missing `.env`; when the agent needs a helper or one-off script that doesn't exist yet; when an external API, tool, or MCP errors or rate-limits; or when a test flakes. Search `HEALS.md` by `Pattern-Key` first — most heals are recurrences, so increment `Recurrence-Count` instead of duplicating. Verify is mandatory: mark `pending-verify` honestly if sandboxed, `abandoned` if the fix can't be made to work. Pairs with `self-improvement` (which promotes recurring heals to durable memory) but owns the verify-before-persist discipline self-improvement doesn't.
development
Control-plane workflow for coordinating multi-agent, multi-session project work from a single Codex, GitHub Copilot, or agent-app control session. Use this skill whenever the user asks to orchestrate agents, create or steer worker sessions, run a workflow-like effort, fan out audits/research/migrations, coordinate parallel implementation streams, monitor other project sessions, or compare this control-session pattern to Claude Code dynamic workflows. This skill is especially relevant when the current session can spawn persistent project sessions and those sessions can spawn their own subagents, creating a two-level orchestration hierarchy.
tools
CI-only self-healing workflow using gh-aw (GitHub Agentic Workflows) for active runtime recovery on pull requests and scheduled runs. When a CI check fails (test, build, lint, deploy, scan), this skill diagnoses the failure from CI logs, proposes a verified patch as a PR comment or follow-up commit, and commits a HEAL entry to `.learnings/HEALS.md`. Verify-before-persist discipline preserved: a HEAL is only `verified` if a re-run check passes in the same workflow; otherwise it ships as `pending-verify` for human follow-up. Recurrent heal patterns across PRs accumulate `Recurrence-Count` and append a `Handoff` block at ≥3 to flag promotion via self-improvement-ci. Use this skill when: you want headless heal-loop execution in CI/scheduled pipelines, you want recurring failure patterns captured automatically, or you want PRs that surface non-obvious environmental / tooling fixes without human triage. For interactive/local sessions, use `self-healing` instead.