Self-Healing Codebase

Automated fault detection and repair. When a test fails, a build breaks, or a runtime error is logged, the healing pipeline wakes up, finds the root cause, generates a minimal fix, validates it, and applies it -- all without human intervention, as long as confidence is above the safety threshold.

Trigger Conditions

The healing pipeline activates on any of these events:

| Trigger | Signal | Source | |---------|--------|--------| | Test failure | Any test exits non-zero | CI logs, npm test output | | Build break | Compiler/bundler error | CI logs, npm run build output | | Runtime error | Unhandled exception, 5xx response | Application logs, Vercel logs | | Type error | tsc --noEmit fails | Pre-commit hook, CI | | Lint failure | ESLint/Prettier error blocking CI | CI logs |

Signals NOT Handled

The following always require human review -- do NOT auto-heal:

Security vulnerabilities (auth bypass, injection, XSS)
Database migrations (schema changes, data loss risk)
Dependency version conflicts (breaking API changes)
Flaky tests (non-deterministic failures -- use replay agent instead)

The 4-Phase Healing Pipeline

DETECT    ->    DIAGNOSE    ->    FIX    ->    VALIDATE
  |                |               |               |
Trigger        Root cause       Minimal         Re-run
fires          identified       patch           test suite
               confidence       applied         Pass? DONE
               scored           to branch       Fail? ROLLBACK

Phase 1: Detect

Capture the full failure context:

- Error message (exact, not summarized)
- Stack trace (all frames, not truncated)
- File and line number of origin
- Recent git changes touching that file (last 3 commits)
- Environment (Node version, OS, CI vs local)

Pass all context to the Diagnose phase. Incomplete context = low confidence = no auto-fix.

Phase 2: Diagnose

The sleuth agent performs root cause analysis:

Read the failing file at the reported line
Trace the call stack upward (max 3 frames)
Check if a recent git commit introduced the regression (git log -5 -- <file>)
Classify the failure type:

| Type | Examples | Confidence Boost | |------|----------|-----------------| | Regression | Line added in last commit causes error | +20% | | Null reference | Cannot read property X of undefined | +10% | | Import error | Module not found, wrong path | +15% | | Type mismatch | TS type error at known location | +15% | | Config error | Missing env var, wrong format | +10% | | Logic error | Wrong condition, off-by-one | 0% (harder to auto-fix) |

Confidence score = base 50% + type bonus + evidence quality bonus (0-30%).

If final confidence < 80%: log the diagnosis, notify in thoughts/HEALING.md, stop. Do not attempt fix.

Phase 3: Fix

The spark agent generates the minimal fix:

Before touching any file:

git stash push -u -m "self-healing: pre-fix stash for <error-id>"

Fix rules:

Change the minimum number of lines necessary
Do not refactor surrounding code
Do not add features
Do not change function signatures
Add a comment: // self-healing fix: <short reason>

Fix size limits:

Maximum 20 lines changed per fix
Maximum 3 files touched per fix
If fix requires more than this, confidence is forced to 0 -- escalate to human

Phase 4: Validate

The verifier agent runs the relevant test suite:

# For test failures: run the specific failing test + its file's full suite
npm test -- --testPathPattern="<failing-file>"

# For build breaks: re-run the build
npm run build

# For type errors: re-run type check
npx tsc --noEmit

# For runtime errors: run integration tests if available
npm run test:integration

Decision:

Validation PASS:
  -> git stash drop (discard the safety stash)
  -> Commit the fix: "fix: self-healing repair for <error-id>"
  -> Log to thoughts/HEALING.md
  -> Update MTTR metric

Validation FAIL:
  -> git stash pop --index (restore pre-fix state exactly)
  -> Log failure to thoughts/HEALING.md
  -> Escalate to human with full diagnosis report
  -> Do NOT retry automatically (prevents loop)

Safety Rules

Confidence gate: Never apply a fix when confidence < 80%
Scope gate: Never touch security code, DB migrations, or auth logic
Size gate: Never change more than 20 lines or 3 files in one heal
No retry loop: If validation fails, stop. Human takes over.
Always stash: Every fix attempt must be preceded by a git stash
One fix per trigger: One error = one heal attempt. If the error recurs after a successful fix, it is a different problem -- do not re-heal automatically.

Agent Roles

| Agent | Phase | Responsibility | |-------|-------|----------------| | sleuth | Diagnose | Root cause analysis, confidence scoring | | spark | Fix | Minimal patch generation | | verifier | Validate | Test suite execution, pass/fail decision | | self-learner | Post-heal | Record the pattern for future prevention | | coroner | Post-heal | Check if the same bug exists elsewhere in the codebase |

Healing Log Format

Append each healing event to thoughts/HEALING.md:

## Healing Event: 2026-04-07T10:15:00Z

### Trigger
- Type: Test failure
- File: src/api/users.test.ts
- Error: "Cannot read properties of undefined (reading 'id')"
- Stack: users.test.ts:42 -> users.ts:87 -> db/query.ts:31

### Diagnosis (sleuth)
- Root cause: db/query.ts:31 returns `null` when no row found; callers assume non-null
- Regression: commit abc123 (2026-04-06) removed the null guard
- Classification: Null reference (confidence: +10%)
- Evidence: stack trace + git blame confirm exact line + recent commit
- Final confidence: 85%

### Fix (spark)
- File: src/api/users.ts line 87
- Change: Added null check before accessing .id
- Lines changed: 3
- Stash: self-healing-20260407-101500

### Validation (verifier)
- Command: npm test -- --testPathPattern="src/api/users"
- Result: PASS (14 tests, 0 failures)
- Decision: KEEP

### Outcome
- Status: HEALED
- Time to repair: 4 minutes
- MTTR contribution: recorded

---

## Healing Event: 2026-04-07T14:22:00Z

### Trigger
- Type: Build break
- Error: Module '"./config"' has no exported member 'DATABASE_URL'

### Diagnosis (sleuth)
- Root cause: config.ts was refactored, export renamed to DB_URL
- Confidence: 91%

### Fix (spark)
- File: src/api/db.ts line 3
- Change: import { DB_URL as DATABASE_URL } from './config'
- Lines changed: 1

### Validation (verifier)
- Command: npm run build
- Result: PASS

### Outcome
- Status: HEALED
- Time to repair: 2 minutes

Metrics

Track in thoughts/HEALING.md header section (update after each event):

# Self-Healing Metrics (updated: 2026-04-07)

| Metric | Value |
|--------|-------|
| Total healing events | 23 |
| Successful auto-fixes | 19 (82.6%) |
| Escalated to human | 4 (17.4%) |
| Average MTTR (auto-fixed) | 3.8 minutes |
| Average MTTR (escalated) | 47 minutes |
| Most common trigger | Test failure (61%) |
| Most common fix type | Null reference guard |

Key Metrics Defined

MTTR (Mean Time To Repair): Time from trigger detection to validation PASS
Auto-fix success rate: Healed / (Healed + Escalated)
False fix rate: Fixes that passed validation but broke something else (caught by post-deploy monitoring)

Scope Limits (Non-Negotiable)

The healing pipeline will NEVER auto-fix:

- Files matching: **/auth/**, **/security/**, **/middleware/auth*
- Files matching: **/migrations/**, **/schema.*
- Environment variables or secrets
- Package versions in package.json / requirements.txt
- CI/CD configuration (.github/**, vercel.json, Dockerfile)
- Any file touched by a security-related commit message

If the failing code is in a scoped-out path, the pipeline logs the trigger and immediately escalates with the full diagnosis -- no fix attempt.

Activation

This skill is referenced by:

PostToolUse hooks when test/build commands exit non-zero
The sentinel agent during incident response
The verifier agent when a final validation fails

To invoke manually:

Use self-healing to diagnose and fix the failing test in src/api/users.test.ts.
Error: "Cannot read properties of undefined (reading 'id')"
Stack trace: [paste full trace]

Self-Healing Codebase

Trigger Conditions

The healing pipeline activates on any of these events:

Signals NOT Handled

The following always require human review -- do NOT auto-heal:

Security vulnerabilities (auth bypass, injection, XSS)
Database migrations (schema changes, data loss risk)
Dependency version conflicts (breaking API changes)
Flaky tests (non-deterministic failures -- use replay agent instead)

The 4-Phase Healing Pipeline

DETECT    ->    DIAGNOSE    ->    FIX    ->    VALIDATE
  |                |               |               |
Trigger        Root cause       Minimal         Re-run
fires          identified       patch           test suite
               confidence       applied         Pass? DONE
               scored           to branch       Fail? ROLLBACK

Phase 1: Detect

Capture the full failure context:

- Error message (exact, not summarized)
- Stack trace (all frames, not truncated)
- File and line number of origin
- Recent git changes touching that file (last 3 commits)
- Environment (Node version, OS, CI vs local)

Pass all context to the Diagnose phase. Incomplete context = low confidence = no auto-fix.

Phase 2: Diagnose

The sleuth agent performs root cause analysis:

Read the failing file at the reported line
Trace the call stack upward (max 3 frames)
Check if a recent git commit introduced the regression (git log -5 -- <file>)
Classify the failure type:

Confidence score = base 50% + type bonus + evidence quality bonus (0-30%).

If final confidence < 80%: log the diagnosis, notify in thoughts/HEALING.md, stop. Do not attempt fix.

Phase 3: Fix

The spark agent generates the minimal fix:

Before touching any file:

git stash push -u -m "self-healing: pre-fix stash for <error-id>"

Fix rules:

Change the minimum number of lines necessary
Do not refactor surrounding code
Do not add features
Do not change function signatures
Add a comment: // self-healing fix: <short reason>

Fix size limits:

Maximum 20 lines changed per fix
Maximum 3 files touched per fix
If fix requires more than this, confidence is forced to 0 -- escalate to human

Phase 4: Validate

The verifier agent runs the relevant test suite:

# For test failures: run the specific failing test + its file's full suite
npm test -- --testPathPattern="<failing-file>"

# For build breaks: re-run the build
npm run build

# For type errors: re-run type check
npx tsc --noEmit

# For runtime errors: run integration tests if available
npm run test:integration

Decision:

Validation PASS:
  -> git stash drop (discard the safety stash)
  -> Commit the fix: "fix: self-healing repair for <error-id>"
  -> Log to thoughts/HEALING.md
  -> Update MTTR metric

Validation FAIL:
  -> git stash pop --index (restore pre-fix state exactly)
  -> Log failure to thoughts/HEALING.md
  -> Escalate to human with full diagnosis report
  -> Do NOT retry automatically (prevents loop)

Safety Rules

Confidence gate: Never apply a fix when confidence < 80%
Scope gate: Never touch security code, DB migrations, or auth logic
Size gate: Never change more than 20 lines or 3 files in one heal
No retry loop: If validation fails, stop. Human takes over.
Always stash: Every fix attempt must be preceded by a git stash
One fix per trigger: One error = one heal attempt. If the error recurs after a successful fix, it is a different problem -- do not re-heal automatically.

Agent Roles

Healing Log Format

Append each healing event to thoughts/HEALING.md:

## Healing Event: 2026-04-07T10:15:00Z

### Trigger
- Type: Test failure
- File: src/api/users.test.ts
- Error: "Cannot read properties of undefined (reading 'id')"
- Stack: users.test.ts:42 -> users.ts:87 -> db/query.ts:31

### Diagnosis (sleuth)
- Root cause: db/query.ts:31 returns `null` when no row found; callers assume non-null
- Regression: commit abc123 (2026-04-06) removed the null guard
- Classification: Null reference (confidence: +10%)
- Evidence: stack trace + git blame confirm exact line + recent commit
- Final confidence: 85%

### Fix (spark)
- File: src/api/users.ts line 87
- Change: Added null check before accessing .id
- Lines changed: 3
- Stash: self-healing-20260407-101500

### Validation (verifier)
- Command: npm test -- --testPathPattern="src/api/users"
- Result: PASS (14 tests, 0 failures)
- Decision: KEEP

### Outcome
- Status: HEALED
- Time to repair: 4 minutes
- MTTR contribution: recorded

---

## Healing Event: 2026-04-07T14:22:00Z

### Trigger
- Type: Build break
- Error: Module '"./config"' has no exported member 'DATABASE_URL'

### Diagnosis (sleuth)
- Root cause: config.ts was refactored, export renamed to DB_URL
- Confidence: 91%

### Fix (spark)
- File: src/api/db.ts line 3
- Change: import { DB_URL as DATABASE_URL } from './config'
- Lines changed: 1

### Validation (verifier)
- Command: npm run build
- Result: PASS

### Outcome
- Status: HEALED
- Time to repair: 2 minutes

Metrics

Track in thoughts/HEALING.md header section (update after each event):

# Self-Healing Metrics (updated: 2026-04-07)

| Metric | Value |
|--------|-------|
| Total healing events | 23 |
| Successful auto-fixes | 19 (82.6%) |
| Escalated to human | 4 (17.4%) |
| Average MTTR (auto-fixed) | 3.8 minutes |
| Average MTTR (escalated) | 47 minutes |
| Most common trigger | Test failure (61%) |
| Most common fix type | Null reference guard |

Key Metrics Defined

MTTR (Mean Time To Repair): Time from trigger detection to validation PASS
Auto-fix success rate: Healed / (Healed + Escalated)
False fix rate: Fixes that passed validation but broke something else (caught by post-deploy monitoring)

Scope Limits (Non-Negotiable)

The healing pipeline will NEVER auto-fix:

- Files matching: **/auth/**, **/security/**, **/middleware/auth*
- Files matching: **/migrations/**, **/schema.*
- Environment variables or secrets
- Package versions in package.json / requirements.txt
- CI/CD configuration (.github/**, vercel.json, Dockerfile)
- Any file touched by a security-related commit message

If the failing code is in a scoped-out path, the pipeline logs the trigger and immediately escalates with the full diagnosis -- no fix attempt.

Activation

This skill is referenced by:

PostToolUse hooks when test/build commands exit non-zero
The sentinel agent during incident response
The verifier agent when a final validation fails

To invoke manually:

Use self-healing to diagnose and fix the failing test in src/api/users.test.ts.
Error: "Cannot read properties of undefined (reading 'id')"
Stack trace: [paste full trace]

Adoption

rubicanjr/self-healing

$ install --global

Security Scan Results

SKILL.md

Self-Healing Codebase

Trigger Conditions

Signals NOT Handled

The 4-Phase Healing Pipeline

Phase 1: Detect

Phase 2: Diagnose

Phase 3: Fix

Phase 4: Validate

Safety Rules

Agent Roles

Healing Log Format

Metrics

Key Metrics Defined

Scope Limits (Non-Negotiable)

Activation

Related Skills

rubicanjr/workflow-router

rubicanjr/wiring

rubicanjr/websocket-patterns

rubicanjr/visual-verdict

rubicanjr/self-healing

$ install --global

Security Scan Results

SKILL.md

Self-Healing Codebase

Trigger Conditions

Signals NOT Handled

The 4-Phase Healing Pipeline

Phase 1: Detect

Phase 2: Diagnose

Phase 3: Fix

Phase 4: Validate

Safety Rules

Agent Roles

Healing Log Format

Metrics

Key Metrics Defined

Scope Limits (Non-Negotiable)

Activation

Related Skills

rubicanjr/workflow-router

rubicanjr/wiring

rubicanjr/websocket-patterns

rubicanjr/visual-verdict