skills/self-healing-ci/SKILL.md
CI-only self-healing workflow using gh-aw (GitHub Agentic Workflows) for active runtime recovery on pull requests and scheduled runs. When a CI check fails (test, build, lint, deploy, scan), this skill diagnoses the failure from CI logs, proposes a verified patch as a PR comment or follow-up commit, and commits a HEAL entry to `.learnings/HEALS.md`. Verify-before-persist discipline preserved: a HEAL is only `verified` if a re-run check passes in the same workflow; otherwise it ships as `pending-verify` for human follow-up. Recurrent heal patterns across PRs accumulate `Recurrence-Count` and append a `Handoff` block at ≥3 to flag promotion via self-improvement-ci. Use this skill when: you want headless heal-loop execution in CI/scheduled pipelines, you want recurring failure patterns captured automatically, or you want PRs that surface non-obvious environmental / tooling fixes without human triage. For interactive/local sessions, use `self-healing` instead.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add pskoett/pskoett-ai-skills self-healing-ci
CI-only variant of self-healing. Runs the diagnose → patch → verify → file loop headlessly against pull-request and scheduled workflow events.
gh skill install pskoett/pskoett-skills self-healing-ci
Fallback using the Agent Skills CLI:
npx skills add pskoett/pskoett-skills/skills/self-healing-ci
Run self-healing in CI without interactive chat loops:
- File a `HEAL-` entry to `.learnings/HEALS.md` with verification proof (or `pending-verify` if the workflow can't re-run the check)
- Search by `Pattern-Key` before filing new ones, to deduplicate recurrences
- Append a `Handoff` block at `Recurrence-Count >= 3` for promotion via `self-improvement-ci`

Use `self-healing` for interactive/local sessions.
CI agents do not have peak task context from the original implementation session. The agent is reading CI logs and code, not riding peak context after a focused implementation. Implications:
- Lean toward `pending-verify` and surface the heal to the PR author
- Earn `verified` by re-running the failing check in the same workflow run

Prerequisites:

- GitHub CLI authenticated (`gh auth status`)
- `gh-aw` installed for authoring/validation: `gh extension install github/gh-aw`
- `.learnings/HEALS.md` committed to the repo (or created on first run; see `references/workflow-example.md` for the bootstrap pattern)

The CI skill must:
- Read only CI logs, the code, and `.learnings/HEALS.md`; nothing else from the PR author's machine
- Re-run the failing check after applying the patch: `verified` requires this; `pending-verify` is honest if it cannot
- Commit a `verified` `HEAL-` entry only on a successful re-run; abandoned heals are still filed, but in a separate, clearly labeled commit

self_healing_ci:
  source:
    pr_number: 123
    commit_sha: "abc123def"
    failed_check: "test (node 20)"
    workflow_run_id: 4567891234
  heal:
    heal_id: "HEAL-20260524-001"
    status: "verified" # verified | pending-verify | abandoned
    trigger: "tool-failure" # free-form
    active_context: "ci" # optional
    area: "tests" # free-form
    pattern_key: "env.lockfile_mismatch"
    diagnosis: "Project uses pnpm; CI workflow ran `npm ci`."
    fix:
      summary: "Switch the CI install step from `npm ci` to `pnpm install --frozen-lockfile`."
      diff_path: ".learnings/heals/HEAL-20260524-001/patch.diff" # only if files generated
    verification:
      command: "pnpm install --frozen-lockfile"
      exit_code: 0
      output_excerpt: "Lockfile is up to date, resolution step is skipped"
    recurrence_count: 1
    promotion_ready: false # true at recurrence_count >= 3
  summary:
    heals_filed: 1
    verified: 1
    pending_verify: 0
    abandoned: 0
    promotion_candidates: 0
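
The `recurrence_count` and `promotion_ready` fields in the contract above could be maintained along these lines. This is a minimal Python sketch under stated assumptions, not the skill's actual implementation: `HealRecord`, `record_occurrence`, and the in-memory index keyed by `pattern_key` are hypothetical names, and reading or writing `.learnings/HEALS.md` is left out because its on-disk format is not shown here.

```python
from dataclasses import dataclass, field
from typing import Dict, List

# Threshold taken from the contract comment: promotion_ready is true at recurrence_count >= 3.
PROMOTION_THRESHOLD = 3


@dataclass
class HealRecord:
    """Hypothetical in-memory view of one HEAL entry."""
    heal_id: str
    pattern_key: str
    recurrence_count: int = 1
    last_seen: str = ""
    see_also: List[str] = field(default_factory=list)
    promotion_ready: bool = False


def record_occurrence(
    index: Dict[str, HealRecord],
    heal_id: str,
    pattern_key: str,
    seen_on: str,
    source_ref: str,
) -> HealRecord:
    """File a new heal, or fold a recurrence into the existing one."""
    existing = index.get(pattern_key)
    if existing is None:
        # First sighting of this pattern: file a fresh record.
        record = HealRecord(heal_id=heal_id, pattern_key=pattern_key, last_seen=seen_on)
        index[pattern_key] = record
        return record

    # Recurrence: update the existing record instead of filing a duplicate.
    existing.recurrence_count += 1
    existing.last_seen = seen_on
    existing.see_also.append(source_ref)
    existing.promotion_ready = existing.recurrence_count >= PROMOTION_THRESHOLD
    return existing
```

Keying the index by `pattern_key` makes deduplication a single dictionary lookup. The cost is that keys which are too coarse will merge unrelated failures into one record.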
In CI the verify step is operationalized as re-running the failed check inside the same workflow run after applying the proposed patch:
| Original failure | Verify step in CI |
|------------------|-------------------|
| pnpm test failed | Re-run pnpm test after the patch |
| Build (tsc, cargo build) failed | Re-run the build step |
| Lint (eslint, ruff) failed | Re-run the lint step |
| Deploy preview failed | Re-run the deploy step (if the workflow allows) |
| Snapshot diff | Re-run with deterministic stubs if applicable |
If the re-run isn't feasible (the check requires secrets only available in production workflows; the failure is transient; the patch needs human review before commit), the HEAL ships as pending-verify with explicit notes on what would prove it.
Never fake verified. Faking is the exact failure mode this skill exists to prevent — and in CI, the consequences propagate further than in interactive sessions because future PRs may apply the unverified "fix" automatically.
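
The verify-before-persist rule can be sketched as a small status decision. This is a hedged Python illustration, not the skill's own code: `Verification` and `verify_heal` are hypothetical names, passing `None` stands in for "the re-run isn't feasible in this workflow", and mapping a single failed re-run to `abandoned` is an assumption (a real heal loop might try another patch first).

```python
import subprocess
import sys
from dataclasses import dataclass
from typing import Optional, Sequence


@dataclass
class Verification:
    """Hypothetical result record mirroring the `verification` fields."""
    status: str                 # "verified" | "pending-verify" | "abandoned"
    command: str
    exit_code: Optional[int]
    output_excerpt: str


def verify_heal(command: Optional[Sequence[str]]) -> Verification:
    """Re-run the failed check and derive the heal status from its exit code."""
    if command is None:
        # No re-run possible: report that honestly instead of claiming success.
        return Verification("pending-verify", "", None, "re-run not feasible in this workflow")

    proc = subprocess.run(list(command), capture_output=True, text=True)
    excerpt = (proc.stdout + proc.stderr).strip()[-200:]  # keep only the tail of the log
    # Assumption: one failed re-run marks the heal as abandoned.
    status = "verified" if proc.returncode == 0 else "abandoned"
    return Verification(status, " ".join(command), proc.returncode, excerpt)


if __name__ == "__main__":
    # Stand-in for re-running a real check such as the test or build step.
    result = verify_heal([sys.executable, "-c", "print('check passed')"])
    print(result.status, result.exit_code)
```

The only path to `verified` in this sketch runs through an actual exit code of 0, so there is no branch that can claim it without a re-run.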
1. Search `.learnings/HEALS.md` by `Pattern-Key` before filing new heals
2. On a match, increment `Recurrence-Count`, update `Last-Seen`, and append the new occurrence to `See Also`
3. When `Recurrence-Count >= 3`, append a `Handoff` block to the existing HEAL with a `Promotion Target` (`CLAUDE.md` / `AGENTS.md` / `.github/copilot-instructions.md` / new-skill) and a one-line `Distilled Rule`
4. `self-improvement-ci` consumes the `Handoff` blocks and proposes the promotion as a PR

| Trigger | Use case |
|---------|----------|
| workflow_run (completed, conclusion: failure) | Most common — react to other workflows failing |
| pull_request (with if: guard on check status) | Run on every PR but skip if all checks passed |
| schedule (nightly) | Look for stale flakes, surface patterns the per-PR runs missed |
| workflow_dispatch | Manual replay against a specific PR or commit |
Authoring patterns and example .github/workflows/*.lock.yml files live in references/workflow-example.md. Keep example workflows out of .github/workflows until you've explicitly decided to enable CI automation.
The interactive skill's anti-patterns all apply. CI-specific ones to watch:
- Infinite loops. Add the guard `if: github.actor != 'github-actions[bot]'` to prevent them.
- `Pattern-Key` collisions. If two PRs hit the same `Pattern-Key` with different root causes, the keys are too coarse; refine them rather than letting them merge.

Related skills and references:

- `self-healing`: the interactive skill this mirrors; same file format, same verify discipline
- `self-improvement-ci`: receives heal `Handoff` blocks; proposes promotion to memory files
- `simplify-and-harden-ci`: quality pass after heals stabilize the PR
- `verify-gate`: the interactive verify gate; self-healing-ci's verify is the CI workflow re-run
- `references/workflow-example.md`: gh-aw workflow templates and authoring notes