skills/self-healing/SKILL.md
Detect service failures and system anomalies. Diagnose root causes. Auto-remediate using NixOS rollback, service restart, or config repair. Every action logged to the hash-chained audit ledger.
npx skillsauth add bolivian-peru/os-moda self-healingInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You can detect, diagnose, and fix system problems automatically.
When you detect a service failure or anomaly during a heartbeat check:
Check teachd for historical patterns first — teachd observes the system 24/7 between conversations
teach_context({ context: "nginx failure service down" })
teach_patterns({ min_confidence: 0.5 })
This surfaces slow-burn issues (memory leaks, recurring failures, correlated events) that you wouldn't catch in a single conversation.
Confirm the failure — don't act on a single check
service_status({ service: "nginx" })
If the service is down, check again after 10 seconds. If still down, proceed.
Check journal logs for why it failed
journal_logs({ unit: "nginx", lines: 30, priority: "err" })
Recall past incidents — have we seen this before?
memory_recall({ query: "nginx failure", timeframe: "30d" })
Determine the root cause before acting:
shell_exec({ command: "systemctl restart nginx" })
Safe, fast, fixes 80% of issues.
If restart doesn't fix it, or if config is corrupted:
shell_exec({ command: "nixos-rebuild switch --rollback" })
This atomically reverts to the last known-good NixOS generation.
If you can identify the exact config issue:
file_read({ path: "/etc/nixos/configuration.nix" })
Fix the config, then rebuild:
shell_exec({ command: "nixos-rebuild switch" })
If disk/memory is the problem:
shell_exec({ command: "nix-collect-garbage -d" })
shell_exec({ command: "journalctl --vacuum-size=500M" })
ALWAYS do these three things:
Verify the fix worked
service_status({ service: "nginx" })
Store in memory
memory_store({
summary: "nginx failure: config missing, rolled back to gen 47",
detail: "Full diagnosis and fix details...",
category: "diagnosis",
tags: "self-healing,nginx,rollback"
})
Log to audit ledger — this happens automatically through agentd
After fixing an issue, compose a clear message for the user:
I detected nginx was down at 03:17 UTC.
Cause: The configuration file /etc/nginx/nginx.conf referenced
a missing upstream. This was introduced in NixOS generation 48
(rebuilt 2 hours ago).
Action: Rolled back to generation 47. nginx is running again.
All health checks pass.
Audit entry: #52 (hash: a7f3b2...)
Keep it factual, concise, and include the audit reference.
devops
Multi-perspective risk analysis using structured persona debate before deploying changes
development
Build software via spec-driven development (github/spec-kit). Whenever the user asks for a feature larger than a one-line tweak, scaffold a spec-kit project, capture WHAT + WHY, declare tech stack, break into tasks, then iterate the implementation until tests pass.
development
Manage NixOS packages declaratively. Search, install (via configuration.nix rebuild), remove, rollback, and list generations. Understands the NixOS declarative model.
data-ai
Monitor system health: CPU, memory, disk, network, processes, services, and logs. Present data naturally. Correlate issues across subsystems. Alert on thresholds. Diagnose root causes.