Self-Healing

Active runtime recovery for coding agents. When something breaks, run the loop: diagnose → patch → verify → file. Leave behind a reusable, verified artifact instead of a swept-under-the-rug failure.

The premise mirrors browser-use/browser-harness: the harness improves itself every run. An agent that hits a gap doesn't fail — it writes the fix during execution, verifies it works, and files the durable artifact for future runs. Coding tasks deserve the same loop.

What this skill is for

When a coding agent hits a wall mid-task, the default failure modes are:

Paper over it — "let me try a different approach" — and lose the recovery
Pretend the fix worked — without re-running the broken thing
Symptom-fix — skip the test, swallow the error, retry until green

All three turn a one-time failure into a recurrence. The next agent on the same project hits the same wall.

This skill enforces one discipline: verify before persist. A patch isn't real until you've re-run the failing operation and watched it succeed. When it does, file the verified fix so the next run benefits.

Relationship to self-improvement

These two skills are deliberately split. Run both — they feed each other but don't overlap.

| Aspect | self-healing (this skill) | self-improvement | | ----------- | -------------------------------------------------------------------- | ------------------------------------------------------------- | | When | During execution, failure is live | After the fact, at natural breakpoints | | Verb | Heal now — restore working state | Remember for later — accumulate knowledge | | Outcome | Verified patch + (optional) reusable artifact | Logged learning, correction, request | | Verify | Mandatory — no persist without proof | Not required | | Files | .learnings/HEALS.md + .learnings/heals/<HEAL-ID>/ (lazy) | .learnings/ERRORS.md, LEARNINGS.md, FEATURE_REQUESTS.md | | Trigger | Failure observed mid-task | Correction, knowledge gap, feature request, recurrence |

Boundary rule: if you're capturing a fact, a correction, or a wish — that's self-improvement. If you're applying and verifying a fix to a live failure — that's self-healing.

The Heal Loop

  ● failure observed
  │
  ● 1. DIAGNOSE  capture context — command, error, env, what was attempted
  │              search HEALS.md for the same Pattern-Key first
  │              (most heals are recurrences; don't reinvent)
  │
  ● 2. PATCH     write the fix — script, helper, env tweak, alt command
  │              artifacts → .learnings/heals/<HEAL-ID>/  (only if needed)
  │
  ● 3. VERIFY    re-run the failing op — must succeed
  │              ↻ if still failing: refine and retry, cap at 3 attempts
  │              ✗ if uncrackable: file Status: abandoned with notes
  │
  ● 4. FILE      write HEAL-YYYYMMDD-XXX to .learnings/HEALS.md
  │              with Pattern-Key, status, verification proof
  │
  ✓ working state restored, heal persisted

  (conditional) PROMOTE  if Pattern-Key recurrence ≥ 3 across distinct tasks,
                          append a Handoff block → self-improvement promotes to memory

If you abandon a heal mid-loop, don't pretend it succeeded. File a HEAL- entry with Status: abandoned and notes on what didn't work. The next agent learns from the dead end too.

When to trigger

Self-healing fires on active failures during execution — the agent has just observed something not working and needs to make it work to continue. Five shapes:

1. Tool failure (command / test / build / lint)

Any invocation exits non-zero or produces wrong output. Don't acknowledge and retry verbatim — diagnose, patch, verify.

Examples: npm install errors when a pnpm-lock.yaml is present (switch tool); pytest fails with ModuleNotFoundError (activate the venv); tsc flags a stale type (regenerate the client); eslint reports a config error (install the missing parser).

2. Missing capability / tool gap

The agent needs something that doesn't exist yet — a script, a helper, a wrapper, a glue function. Write it in the moment. This is the closest analog to browser-harness's agent_helpers.py.

Examples: dedupe a CSV by custom key (write a small Python helper); bootstrap 12 microservices the same way (write scripts/bootstrap-all.sh); bulk-rename branches matching a pattern (write a gh-based shell helper).

3. Environment issue

The local environment isn't what the project expects. Detect, patch, verify.

Examples: runtime version mismatch (nvm use, pyenv local, rustup override); stale dependency cache after a branch switch; dirty git state blocking a checkout; missing .env (copy from .env.example and surface gaps).

4. External service / API change

A service the agent depends on returns something unexpected. Find a workaround and capture it.

Examples: an MCP tool returns InputValidationError because the schema changed (patch the call shape); a public API hits a rate limit (back off, switch endpoint, batch); an upstream lib bumped a default and broke a script (pin the version).

5. About-to-retry-the-same-broken-approach

The agent catches itself about to redo the failing step. That self-recognition is a heal forming — capture the alternate approach as the patch.

Detection signals to watch for

Non-zero exit codes
Stack traces in tool output
The same operation failing twice with the same error
"I'll try a different approach" — capture it as a heal
command not found / module not found / permission denied
Stale assertions, snapshot mismatches, type errors that weren't there before
"Weird" output that suggests environmental rather than logical bugs

HEAL Entry Format

Append to .learnings/HEALS.md (create if missing):

## [HEAL-YYYYMMDD-XXX] short_kebab_name

**Logged**: ISO-8601 timestamp
**Status**: verified | pending-verify | abandoned
**Trigger**: tool-failure | missing-capability | env-issue | external-change | <free-form>
**Active-Context**: (optional) — current skill, task phase, or workflow stage; omit if not applicable
**Area**: free-form tag — what part of the system (`build`, `tests`, `ci`, `auth`, `data-pipeline`, `mobile`, ...)
**Priority**: low | medium | high | critical

### Failure
What broke — concrete: the command, the error message, the action that was blocked. Include exit codes and verbatim error lines.

### Diagnosis
The root cause as understood after investigation. Why the obvious approach didn't work. Not a guess — what was actually verified during the heal.

### Fix
The patch that was applied. Verbatim commands, code snippets, or pointers to files under `.learnings/heals/<HEAL-ID>/`. Keep it minimal — just enough to reproduce.

### Verification
What was run after the fix and what it returned. Exit code, output snippet, test pass count. **This is the proof.** Without it, the entry is `pending-verify` or `abandoned`.

### Artifacts
(omit this section if no files were generated; otherwise list relative paths under `.learnings/heals/<HEAL-ID>/`)

### Metadata
- Related Files: path/to/file.ext
- See Also: HEAL-... | LRN-... | ERR-... (related entries)
- Pattern-Key: lower.snake.case key for recurrence detection (e.g. `env.lockfile_mismatch`)
- Recurrence-Count: 1
- First-Seen / Last-Seen: YYYY-MM-DD

---

Field guidance

Status — verified = the verify step passed. pending-verify = patch applied but couldn't be fully proven (sandboxed/offline/CI-only) — surface to the user. abandoned = patch didn't work or diagnosis was wrong — document what was tried.
Trigger — free-form is fine. The listed values are common shapes; what matters is that the failure shape is described enough for future agents to match against.
Active-Context — optional. Use it if your environment has a meaningful "what was I doing" tag (an active skill, a current task phase, a build stage, an agent role). Skip if not applicable. The browser-harness analog is the per-domain scoping of domain-skills/<site>/.
Area — free-form. Pick whatever helps future agents find this. frontend, data-pipeline, ci, auth, terraform, mobile, embedded — anything that fits your project shape.
Pattern-Key — lower.snake.case, stable, reusable across projects. Two heals with the same key are recurrences. env.lockfile_mismatch is good; fixed_thing_tuesday isn't.

ID generation

Format: HEAL-YYYYMMDD-XXX. XXX is sequential 3-digit or 3-char random alphanumeric. Examples: HEAL-20260524-001, HEAL-20260524-A7B.

Artifacts directory (lazy)

Only create .learnings/heals/<HEAL-ID>/ when the heal generated something worth preserving. One-line fixes don't need a folder; the HEAL entry text is enough. Abandoned heals with no applied patch also skip the folder.

.learnings/
├── HEALS.md
├── ERRORS.md / LEARNINGS.md / FEATURE_REQUESTS.md  (self-improvement)
└── heals/
    └── HEAL-20260524-001/
        ├── helper.sh
        ├── patch.diff
        └── notes.md

Put here: generated scripts/helpers, patch files, supplementary notes, output captures that document the diagnosis. Don't put here: project source changes (those go in the project tree, referenced via Related Files); secrets; output already captured in the HEAL text.

Verification rules

Verify is the load-bearing wall. The whole point of self-healing over self-improvement is that the fix is proven, not theorized.

What counts as proof

| Failure shape | Verification | | ------------------------------------- | ------------------------------------------------------------------ | | Tool / command / test / build / lint | Re-run the original invocation; expect exit 0 / pass | | Missing capability | Invoke the helper end-to-end on a real input; expect the intent | | Environment drift | Re-run the operation that triggered the diagnosis | | External service workaround | Re-run the failed call with the patch; expect a usable response |

Sandboxed / offline / CI-only failures

When you genuinely can't run the verify step (no network, no real remote, sandboxed shell, CI-only reproduction), file Status: pending-verify with:

The exact command the user / CI should run
The acceptance criteria — what counts as proof
A simulated proof if you can construct one (e.g. a dry-run mode, a stub of the failing call, a sandbox script)

pending-verify is honest. Faking verified is the failure mode this skill exists to prevent.

When to invest in a proof script

Most heals don't need a separate proof script — the verify step is just re-running the failing thing. Build a proper proof script when:

The heal generates a reusable helper that needs to be exercised across cases
The failure can't be reproduced live but can be reproduced in a sandbox (clean git repo, mock service, fake input)
You expect the heal to be re-applied across projects — the proof script then doubles as a regression check

If verification fails

Once — refine the patch and retry. First diagnosis is often wrong.
Twice — step back and reconsider the diagnosis. Maybe the root cause is elsewhere.
Three times — stop. File Status: abandoned with notes on what you tried. Surface to the user. Don't flail.

What does NOT count as verification

"It looks right" / "I think this should work"
Re-running a different command than the one that originally failed
Suppressing the failure (|| true, --ignore-errors) — that's hiding
Skipping or deleting the failing test — that's regression
Passing because the cache was warm from before the fix

Reversibility

Prefer reversible patches. If your heal modifies project files, capture the diff in patch.diff. If the heal is destructive (deletes generated files, rewrites locks), note it explicitly — a future agent reading the HEAL needs to know what was destroyed.

Recurrence and promotion

Most heals are recurrences. Before filing a new HEAL, search:

grep -n "Pattern-Key: <your-pattern-key>" .learnings/HEALS.md

If found:

Increment Recurrence-Count
Update Last-Seen
Add the current occurrence as a See Also link
Do not create a duplicate entry

Promotion threshold

Add a Handoff block to an existing entry when all are true:

Recurrence-Count >= 3
Seen across at least 2 distinct tasks
The fix is generalizable (not project-specific in a way that's already in a memory file)

### Handoff
- **Promoted To**: self-improvement at YYYY-MM-DD
- **Promotion Target**: CLAUDE.md | AGENTS.md | .github/copilot-instructions.md | new-skill
- **Distilled Rule**: One-line prevention guidance derived from the heal

Then self-improvement (or a learning aggregator) takes over: distills the rule, writes it into the right context file, or extracts a reusable skill. The HEAL stays for traceability.

Anti-patterns

Logging without verifying. A HEAL filed before the fix is proven turns this into noisier self-improvement. If verify hasn't passed, the entry is pending-verify or abandoned.
Healing the symptom, not the cause. A failing test isn't healed by skipping it (pytest.skip, it.skip, xit). A flaky CI isn't healed by --retry. Find the root cause; if you can't, abandon honestly.
Generating a new fix without trying existing ones first. Search HEALS.md by Pattern-Key. Most heals are recurrences.
Inventing helpers when the project already has them. Look in scripts/, Makefile, justfile, package.json, pyproject.toml first. Heal = write what's missing, not what's there.
Scope creep. A heal is scoped to one failure. Cleanup belongs in a quality pass; refactors are features. Scope creep makes heals unreviewable.
Empty artifact folders. Don't create .learnings/heals/<HEAL-ID>/ if nothing goes in it.

Best practices

Heal eagerly, file always. Even abandoned heals teach the next agent what doesn't work.
Verify before persist. The non-negotiable rule.
Minimal and reversible patches. A 3-line fix is a heal; a 300-line refactor is a feature.
Stable Pattern-Keys. env.node_version_mismatch is reusable; fixed_the_thing_on_tuesday isn't.
Reference, don't duplicate. Cross-link related HEAL/LRN/ERR via See Also.
Hand off recurrences. A heal seen 3 times deserves to be in the project's permanent memory.
Don't gate the main tree on heal artifacts. Files under .learnings/heals/ are reference material; if a script becomes load-bearing, promote it to scripts/.

Setup

mkdir -p .learnings        # heals/ is lazy — created only when artifacts exist
touch .learnings/HEALS.md

Gitignore choices match self-improvement. Keep heals local (.learnings/ in .gitignore) or share them as team knowledge (don't gitignore — they become reviewable durable context).

Hook integration

Automatic triggering on command failures is optional and agent-specific. See references/hooks.md for Claude Code / Codex configuration.

Multi-agent use

The skill is agent-agnostic. The .learnings/HEALS.md format is plain markdown — any agent (Claude Code, Codex CLI, Copilot, Cursor, Aider, ...) can read and write it. Agents without hook support can be reminded via their instruction file (e.g. .github/copilot-instructions.md). See references/hooks.md for examples.

Pipeline integration

How self-healing slots into a larger skill pipeline (with upstream surfacing of past heals, downstream promotion of recurrences, and machine-verification gates) is documented in references/pipeline-integration.md. Not required to use this skill — it stands alone.

Self-Healing

What this skill is for

When a coding agent hits a wall mid-task, the default failure modes are:

Paper over it — "let me try a different approach" — and lose the recovery
Pretend the fix worked — without re-running the broken thing
Symptom-fix — skip the test, swallow the error, retry until green

All three turn a one-time failure into a recurrence. The next agent on the same project hits the same wall.

Relationship to self-improvement

These two skills are deliberately split. Run both — they feed each other but don't overlap.

Boundary rule: if you're capturing a fact, a correction, or a wish — that's self-improvement. If you're applying and verifying a fix to a live failure — that's self-healing.

The Heal Loop

  ● failure observed
  │
  ● 1. DIAGNOSE  capture context — command, error, env, what was attempted
  │              search HEALS.md for the same Pattern-Key first
  │              (most heals are recurrences; don't reinvent)
  │
  ● 2. PATCH     write the fix — script, helper, env tweak, alt command
  │              artifacts → .learnings/heals/<HEAL-ID>/  (only if needed)
  │
  ● 3. VERIFY    re-run the failing op — must succeed
  │              ↻ if still failing: refine and retry, cap at 3 attempts
  │              ✗ if uncrackable: file Status: abandoned with notes
  │
  ● 4. FILE      write HEAL-YYYYMMDD-XXX to .learnings/HEALS.md
  │              with Pattern-Key, status, verification proof
  │
  ✓ working state restored, heal persisted

  (conditional) PROMOTE  if Pattern-Key recurrence ≥ 3 across distinct tasks,
                          append a Handoff block → self-improvement promotes to memory

If you abandon a heal mid-loop, don't pretend it succeeded. File a HEAL- entry with Status: abandoned and notes on what didn't work. The next agent learns from the dead end too.

When to trigger

Self-healing fires on active failures during execution — the agent has just observed something not working and needs to make it work to continue. Five shapes:

1. Tool failure (command / test / build / lint)

Any invocation exits non-zero or produces wrong output. Don't acknowledge and retry verbatim — diagnose, patch, verify.

2. Missing capability / tool gap

The agent needs something that doesn't exist yet — a script, a helper, a wrapper, a glue function. Write it in the moment. This is the closest analog to browser-harness's agent_helpers.py.

3. Environment issue

The local environment isn't what the project expects. Detect, patch, verify.

4. External service / API change

A service the agent depends on returns something unexpected. Find a workaround and capture it.

5. About-to-retry-the-same-broken-approach

The agent catches itself about to redo the failing step. That self-recognition is a heal forming — capture the alternate approach as the patch.

Detection signals to watch for

Non-zero exit codes
Stack traces in tool output
The same operation failing twice with the same error
"I'll try a different approach" — capture it as a heal
command not found / module not found / permission denied
Stale assertions, snapshot mismatches, type errors that weren't there before
"Weird" output that suggests environmental rather than logical bugs

HEAL Entry Format

Append to .learnings/HEALS.md (create if missing):

## [HEAL-YYYYMMDD-XXX] short_kebab_name

**Logged**: ISO-8601 timestamp
**Status**: verified | pending-verify | abandoned
**Trigger**: tool-failure | missing-capability | env-issue | external-change | <free-form>
**Active-Context**: (optional) — current skill, task phase, or workflow stage; omit if not applicable
**Area**: free-form tag — what part of the system (`build`, `tests`, `ci`, `auth`, `data-pipeline`, `mobile`, ...)
**Priority**: low | medium | high | critical

### Failure
What broke — concrete: the command, the error message, the action that was blocked. Include exit codes and verbatim error lines.

### Diagnosis
The root cause as understood after investigation. Why the obvious approach didn't work. Not a guess — what was actually verified during the heal.

### Fix
The patch that was applied. Verbatim commands, code snippets, or pointers to files under `.learnings/heals/<HEAL-ID>/`. Keep it minimal — just enough to reproduce.

### Verification
What was run after the fix and what it returned. Exit code, output snippet, test pass count. **This is the proof.** Without it, the entry is `pending-verify` or `abandoned`.

### Artifacts
(omit this section if no files were generated; otherwise list relative paths under `.learnings/heals/<HEAL-ID>/`)

### Metadata
- Related Files: path/to/file.ext
- See Also: HEAL-... | LRN-... | ERR-... (related entries)
- Pattern-Key: lower.snake.case key for recurrence detection (e.g. `env.lockfile_mismatch`)
- Recurrence-Count: 1
- First-Seen / Last-Seen: YYYY-MM-DD

---

Field guidance

Status — verified = the verify step passed. pending-verify = patch applied but couldn't be fully proven (sandboxed/offline/CI-only) — surface to the user. abandoned = patch didn't work or diagnosis was wrong — document what was tried.
Trigger — free-form is fine. The listed values are common shapes; what matters is that the failure shape is described enough for future agents to match against.
Active-Context — optional. Use it if your environment has a meaningful "what was I doing" tag (an active skill, a current task phase, a build stage, an agent role). Skip if not applicable. The browser-harness analog is the per-domain scoping of domain-skills/<site>/.
Area — free-form. Pick whatever helps future agents find this. frontend, data-pipeline, ci, auth, terraform, mobile, embedded — anything that fits your project shape.
Pattern-Key — lower.snake.case, stable, reusable across projects. Two heals with the same key are recurrences. env.lockfile_mismatch is good; fixed_thing_tuesday isn't.

ID generation

Format: HEAL-YYYYMMDD-XXX. XXX is sequential 3-digit or 3-char random alphanumeric. Examples: HEAL-20260524-001, HEAL-20260524-A7B.

Artifacts directory (lazy)

.learnings/
├── HEALS.md
├── ERRORS.md / LEARNINGS.md / FEATURE_REQUESTS.md  (self-improvement)
└── heals/
    └── HEAL-20260524-001/
        ├── helper.sh
        ├── patch.diff
        └── notes.md

Verification rules

Verify is the load-bearing wall. The whole point of self-healing over self-improvement is that the fix is proven, not theorized.

What counts as proof

Sandboxed / offline / CI-only failures

When you genuinely can't run the verify step (no network, no real remote, sandboxed shell, CI-only reproduction), file Status: pending-verify with:

The exact command the user / CI should run
The acceptance criteria — what counts as proof
A simulated proof if you can construct one (e.g. a dry-run mode, a stub of the failing call, a sandbox script)

pending-verify is honest. Faking verified is the failure mode this skill exists to prevent.

When to invest in a proof script

Most heals don't need a separate proof script — the verify step is just re-running the failing thing. Build a proper proof script when:

The heal generates a reusable helper that needs to be exercised across cases
The failure can't be reproduced live but can be reproduced in a sandbox (clean git repo, mock service, fake input)
You expect the heal to be re-applied across projects — the proof script then doubles as a regression check

If verification fails

Once — refine the patch and retry. First diagnosis is often wrong.
Twice — step back and reconsider the diagnosis. Maybe the root cause is elsewhere.
Three times — stop. File Status: abandoned with notes on what you tried. Surface to the user. Don't flail.

What does NOT count as verification

"It looks right" / "I think this should work"
Re-running a different command than the one that originally failed
Suppressing the failure (|| true, --ignore-errors) — that's hiding
Skipping or deleting the failing test — that's regression
Passing because the cache was warm from before the fix

Reversibility

Recurrence and promotion

Most heals are recurrences. Before filing a new HEAL, search:

grep -n "Pattern-Key: <your-pattern-key>" .learnings/HEALS.md

If found:

Increment Recurrence-Count
Update Last-Seen
Add the current occurrence as a See Also link
Do not create a duplicate entry

Promotion threshold

Add a Handoff block to an existing entry when all are true:

Recurrence-Count >= 3
Seen across at least 2 distinct tasks
The fix is generalizable (not project-specific in a way that's already in a memory file)

### Handoff
- **Promoted To**: self-improvement at YYYY-MM-DD
- **Promotion Target**: CLAUDE.md | AGENTS.md | .github/copilot-instructions.md | new-skill
- **Distilled Rule**: One-line prevention guidance derived from the heal

Then self-improvement (or a learning aggregator) takes over: distills the rule, writes it into the right context file, or extracts a reusable skill. The HEAL stays for traceability.

Anti-patterns

Logging without verifying. A HEAL filed before the fix is proven turns this into noisier self-improvement. If verify hasn't passed, the entry is pending-verify or abandoned.
Healing the symptom, not the cause. A failing test isn't healed by skipping it (pytest.skip, it.skip, xit). A flaky CI isn't healed by --retry. Find the root cause; if you can't, abandon honestly.
Generating a new fix without trying existing ones first. Search HEALS.md by Pattern-Key. Most heals are recurrences.
Inventing helpers when the project already has them. Look in scripts/, Makefile, justfile, package.json, pyproject.toml first. Heal = write what's missing, not what's there.
Scope creep. A heal is scoped to one failure. Cleanup belongs in a quality pass; refactors are features. Scope creep makes heals unreviewable.
Empty artifact folders. Don't create .learnings/heals/<HEAL-ID>/ if nothing goes in it.

Best practices

Heal eagerly, file always. Even abandoned heals teach the next agent what doesn't work.
Verify before persist. The non-negotiable rule.
Minimal and reversible patches. A 3-line fix is a heal; a 300-line refactor is a feature.
Stable Pattern-Keys. env.node_version_mismatch is reusable; fixed_the_thing_on_tuesday isn't.
Reference, don't duplicate. Cross-link related HEAL/LRN/ERR via See Also.
Hand off recurrences. A heal seen 3 times deserves to be in the project's permanent memory.
Don't gate the main tree on heal artifacts. Files under .learnings/heals/ are reference material; if a script becomes load-bearing, promote it to scripts/.

Setup

mkdir -p .learnings        # heals/ is lazy — created only when artifacts exist
touch .learnings/HEALS.md

Gitignore choices match self-improvement. Keep heals local (.learnings/ in .gitignore) or share them as team knowledge (don't gitignore — they become reviewable durable context).

Hook integration

Automatic triggering on command failures is optional and agent-specific. See references/hooks.md for Claude Code / Codex configuration.

Adoption

pskoett/self-healing

$ install --global

Security Scan Results

SKILL.md

Self-Healing

What this skill is for

Relationship to self-improvement

The Heal Loop

When to trigger

1. Tool failure (command / test / build / lint)

2. Missing capability / tool gap

3. Environment issue

4. External service / API change

5. About-to-retry-the-same-broken-approach

Detection signals to watch for

HEAL Entry Format

Field guidance

ID generation

Artifacts directory (lazy)

Verification rules

What counts as proof

Sandboxed / offline / CI-only failures

When to invest in a proof script

If verification fails

What does NOT count as verification

Reversibility

Recurrence and promotion

Promotion threshold

Anti-patterns

Best practices

Setup

Hook integration

Multi-agent use

Pipeline integration

See also

Related Skills

pskoett/agent-teams-simplify-and-harden