plugin/skills/self-healing/SKILL.md
Active runtime recovery for coding agents: when something breaks mid-task, diagnose the root cause, write a fix, VERIFY by re-running the broken thing, then file a `HEAL-` entry to `.learnings/HEALS.md` with proof. Use whenever a command, test, build, or lint fails or exits non-zero; on missing tooling, dependency/lockfile mismatch, wrong runtime version, venv or permission errors, port conflicts, dirty git state, or a missing `.env`; when the agent needs a helper or one-off script that doesn't exist yet; when an external API, tool, or MCP errors or rate-limits; or when a test flakes. Search `HEALS.md` by `Pattern-Key` first — most heals are recurrences, so increment `Recurrence-Count` instead of duplicating. Verify is mandatory: mark `pending-verify` honestly if sandboxed, `abandoned` if the fix can't be made to work. Pairs with `self-improvement` (which promotes recurring heals to durable memory) but owns the verify-before-persist discipline self-improvement doesn't.
npx skillsauth add pskoett/pskoett-ai-skills self-healingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Active runtime recovery for coding agents. When something breaks, run the loop: diagnose → patch → verify → file. Leave behind a reusable, verified artifact instead of a swept-under-the-rug failure.
The premise mirrors browser-use/browser-harness: the harness improves itself every run. An agent that hits a gap doesn't fail — it writes the fix during execution, verifies it works, and files the durable artifact for future runs. Coding tasks deserve the same loop.
When a coding agent hits a wall mid-task, the default failure modes are:
All three turn a one-time failure into a recurrence. The next agent on the same project hits the same wall.
This skill enforces one discipline: verify before persist. A patch isn't real until you've re-run the failing operation and watched it succeed. When it does, file the verified fix so the next run benefits.
These two skills are deliberately split. Run both — they feed each other but don't overlap.
| Aspect | self-healing (this skill) | self-improvement |
| ----------- | -------------------------------------------------------------------- | ------------------------------------------------------------- |
| When | During execution, failure is live | After the fact, at natural breakpoints |
| Verb | Heal now — restore working state | Remember for later — accumulate knowledge |
| Outcome | Verified patch + (optional) reusable artifact | Logged learning, correction, request |
| Verify | Mandatory — no persist without proof | Not required |
| Files | .learnings/HEALS.md + .learnings/heals/<HEAL-ID>/ (lazy) | .learnings/ERRORS.md, LEARNINGS.md, FEATURE_REQUESTS.md |
| Trigger | Failure observed mid-task | Correction, knowledge gap, feature request, recurrence |
Boundary rule: if you're capturing a fact, a correction, or a wish — that's self-improvement. If you're applying and verifying a fix to a live failure — that's self-healing.
● failure observed
│
● 1. DIAGNOSE capture context — command, error, env, what was attempted
│ search HEALS.md for the same Pattern-Key first
│ (most heals are recurrences; don't reinvent)
│
● 2. PATCH write the fix — script, helper, env tweak, alt command
│ artifacts → .learnings/heals/<HEAL-ID>/ (only if needed)
│
● 3. VERIFY re-run the failing op — must succeed
│ ↻ if still failing: refine and retry, cap at 3 attempts
│ ✗ if uncrackable: file Status: abandoned with notes
│
● 4. FILE write HEAL-YYYYMMDD-XXX to .learnings/HEALS.md
│ with Pattern-Key, status, verification proof
│
✓ working state restored, heal persisted
(conditional) PROMOTE if Pattern-Key recurrence ≥ 3 across distinct tasks,
append a Handoff block → self-improvement promotes to memory
If you abandon a heal mid-loop, don't pretend it succeeded. File a HEAL- entry with Status: abandoned and notes on what didn't work. The next agent learns from the dead end too.
Self-healing fires on active failures during execution — the agent has just observed something not working and needs to make it work to continue. Five shapes:
Any invocation exits non-zero or produces wrong output. Don't acknowledge and retry verbatim — diagnose, patch, verify.
Examples: npm install errors when a pnpm-lock.yaml is present (switch tool); pytest fails with ModuleNotFoundError (activate the venv); tsc flags a stale type (regenerate the client); eslint reports a config error (install the missing parser).
The agent needs something that doesn't exist yet — a script, a helper, a wrapper, a glue function. Write it in the moment. This is the closest analog to browser-harness's agent_helpers.py.
Examples: dedupe a CSV by custom key (write a small Python helper); bootstrap 12 microservices the same way (write scripts/bootstrap-all.sh); bulk-rename branches matching a pattern (write a gh-based shell helper).
The local environment isn't what the project expects. Detect, patch, verify.
Examples: runtime version mismatch (nvm use, pyenv local, rustup override); stale dependency cache after a branch switch; dirty git state blocking a checkout; missing .env (copy from .env.example and surface gaps).
A service the agent depends on returns something unexpected. Find a workaround and capture it.
Examples: an MCP tool returns InputValidationError because the schema changed (patch the call shape); a public API hits a rate limit (back off, switch endpoint, batch); an upstream lib bumped a default and broke a script (pin the version).
The agent catches itself about to redo the failing step. That self-recognition is a heal forming — capture the alternate approach as the patch.
command not found / module not found / permission deniedAppend to .learnings/HEALS.md (create if missing):
## [HEAL-YYYYMMDD-XXX] short_kebab_name
**Logged**: ISO-8601 timestamp
**Status**: verified | pending-verify | abandoned
**Trigger**: tool-failure | missing-capability | env-issue | external-change | <free-form>
**Active-Context**: (optional) — current skill, task phase, or workflow stage; omit if not applicable
**Area**: free-form tag — what part of the system (`build`, `tests`, `ci`, `auth`, `data-pipeline`, `mobile`, ...)
**Priority**: low | medium | high | critical
### Failure
What broke — concrete: the command, the error message, the action that was blocked. Include exit codes and verbatim error lines.
### Diagnosis
The root cause as understood after investigation. Why the obvious approach didn't work. Not a guess — what was actually verified during the heal.
### Fix
The patch that was applied. Verbatim commands, code snippets, or pointers to files under `.learnings/heals/<HEAL-ID>/`. Keep it minimal — just enough to reproduce.
### Verification
What was run after the fix and what it returned. Exit code, output snippet, test pass count. **This is the proof.** Without it, the entry is `pending-verify` or `abandoned`.
### Artifacts
(omit this section if no files were generated; otherwise list relative paths under `.learnings/heals/<HEAL-ID>/`)
### Metadata
- Related Files: path/to/file.ext
- See Also: HEAL-... | LRN-... | ERR-... (related entries)
- Pattern-Key: lower.snake.case key for recurrence detection (e.g. `env.lockfile_mismatch`)
- Recurrence-Count: 1
- First-Seen / Last-Seen: YYYY-MM-DD
---
verified = the verify step passed. pending-verify = patch applied but couldn't be fully proven (sandboxed/offline/CI-only) — surface to the user. abandoned = patch didn't work or diagnosis was wrong — document what was tried.domain-skills/<site>/.frontend, data-pipeline, ci, auth, terraform, mobile, embedded — anything that fits your project shape.env.lockfile_mismatch is good; fixed_thing_tuesday isn't.Format: HEAL-YYYYMMDD-XXX. XXX is sequential 3-digit or 3-char random alphanumeric. Examples: HEAL-20260524-001, HEAL-20260524-A7B.
Only create .learnings/heals/<HEAL-ID>/ when the heal generated something worth preserving. One-line fixes don't need a folder; the HEAL entry text is enough. Abandoned heals with no applied patch also skip the folder.
.learnings/
├── HEALS.md
├── ERRORS.md / LEARNINGS.md / FEATURE_REQUESTS.md (self-improvement)
└── heals/
└── HEAL-20260524-001/
├── helper.sh
├── patch.diff
└── notes.md
Put here: generated scripts/helpers, patch files, supplementary notes, output captures that document the diagnosis. Don't put here: project source changes (those go in the project tree, referenced via Related Files); secrets; output already captured in the HEAL text.
Verify is the load-bearing wall. The whole point of self-healing over self-improvement is that the fix is proven, not theorized.
| Failure shape | Verification | | ------------------------------------- | ------------------------------------------------------------------ | | Tool / command / test / build / lint | Re-run the original invocation; expect exit 0 / pass | | Missing capability | Invoke the helper end-to-end on a real input; expect the intent | | Environment drift | Re-run the operation that triggered the diagnosis | | External service workaround | Re-run the failed call with the patch; expect a usable response |
When you genuinely can't run the verify step (no network, no real remote, sandboxed shell, CI-only reproduction), file Status: pending-verify with:
pending-verify is honest. Faking verified is the failure mode this skill exists to prevent.
Most heals don't need a separate proof script — the verify step is just re-running the failing thing. Build a proper proof script when:
Status: abandoned with notes on what you tried. Surface to the user. Don't flail.|| true, --ignore-errors) — that's hidingPrefer reversible patches. If your heal modifies project files, capture the diff in patch.diff. If the heal is destructive (deletes generated files, rewrites locks), note it explicitly — a future agent reading the HEAL needs to know what was destroyed.
Most heals are recurrences. Before filing a new HEAL, search:
grep -n "Pattern-Key: <your-pattern-key>" .learnings/HEALS.md
If found:
Recurrence-CountLast-SeenAdd a Handoff block to an existing entry when all are true:
Recurrence-Count >= 3### Handoff
- **Promoted To**: self-improvement at YYYY-MM-DD
- **Promotion Target**: CLAUDE.md | AGENTS.md | .github/copilot-instructions.md | new-skill
- **Distilled Rule**: One-line prevention guidance derived from the heal
Then self-improvement (or a learning aggregator) takes over: distills the rule, writes it into the right context file, or extracts a reusable skill. The HEAL stays for traceability.
pending-verify or abandoned.pytest.skip, it.skip, xit). A flaky CI isn't healed by --retry. Find the root cause; if you can't, abandon honestly.HEALS.md by Pattern-Key. Most heals are recurrences.scripts/, Makefile, justfile, package.json, pyproject.toml first. Heal = write what's missing, not what's there..learnings/heals/<HEAL-ID>/ if nothing goes in it.env.node_version_mismatch is reusable; fixed_the_thing_on_tuesday isn't..learnings/heals/ are reference material; if a script becomes load-bearing, promote it to scripts/.mkdir -p .learnings # heals/ is lazy — created only when artifacts exist
touch .learnings/HEALS.md
Gitignore choices match self-improvement. Keep heals local (.learnings/ in .gitignore) or share them as team knowledge (don't gitignore — they become reviewable durable context).
Automatic triggering on command failures is optional and agent-specific. See references/hooks.md for Claude Code / Codex configuration.
The skill is agent-agnostic. The .learnings/HEALS.md format is plain markdown — any agent (Claude Code, Codex CLI, Copilot, Cursor, Aider, ...) can read and write it. Agents without hook support can be reminded via their instruction file (e.g. .github/copilot-instructions.md). See references/hooks.md for examples.
How self-healing slots into a larger skill pipeline (with upstream surfacing of past heals, downstream promotion of recurrences, and machine-verification gates) is documented in references/pipeline-integration.md. Not required to use this skill — it stands alone.
references/examples.md — canonical HEAL entry shapes (command failure, missing capability, env drift, external API workaround, abandoned heal)references/interop-with-self-improvement.md — decision table and handoff payload between the two skillsreferences/pipeline-integration.md — how self-healing relates to upstream/downstream skills in a larger pipelinereferences/hooks.md — automatic triggering setup for Claude Code / Codexdevelopment
Implementation + audit loop using parallel agent teams with structured simplify, harden, and document passes. Spawns implementation agents to do the work, then audit agents to find complexity, security gaps, and spec deviations, then loops until code compiles cleanly, all tests pass, and auditors find zero issues or the loop cap is reached. Use when: implementing features from a spec or plan, hardening existing code, fixing a batch of issues, or any multi-file task that benefits from a build-verify-fix cycle.
tools
Active runtime recovery for coding agents: when something breaks mid-task, diagnose the root cause, write a fix, VERIFY by re-running the broken thing, then file a `HEAL-` entry to `.learnings/HEALS.md` with proof. Use whenever a command, test, build, or lint fails or exits non-zero; on missing tooling, dependency/lockfile mismatch, wrong runtime version, venv or permission errors, port conflicts, dirty git state, or a missing `.env`; when the agent needs a helper or one-off script that doesn't exist yet; when an external API, tool, or MCP errors or rate-limits; or when a test flakes. Search `HEALS.md` by `Pattern-Key` first — most heals are recurrences, so increment `Recurrence-Count` instead of duplicating. Verify is mandatory: mark `pending-verify` honestly if sandboxed, `abandoned` if the fix can't be made to work. Pairs with `self-improvement` (which promotes recurring heals to durable memory) but owns the verify-before-persist discipline self-improvement doesn't.
development
Control-plane workflow for coordinating multi-agent, multi-session project work from a single Codex, GitHub Copilot, or agent-app control session. Use this skill whenever the user asks to orchestrate agents, create or steer worker sessions, run a workflow-like effort, fan out audits/research/migrations, coordinate parallel implementation streams, monitor other project sessions, or compare this control-session pattern to Claude Code dynamic workflows. This skill is especially relevant when the current session can spawn persistent project sessions and those sessions can spawn their own subagents, creating a two-level orchestration hierarchy.
development
Control-plane workflow for coordinating multi-agent, multi-session project work from a single Codex, GitHub Copilot, or agent-app control session. Use this skill whenever the user asks to orchestrate agents, create or steer worker sessions, run a workflow-like effort, fan out audits/research/migrations, coordinate parallel implementation streams, monitor other project sessions, or compare this control-session pattern to Claude Code dynamic workflows. This skill is especially relevant when the current session can spawn persistent project sessions and those sessions can spawn their own subagents, creating a two-level orchestration hierarchy.