skills/incident-response/SKILL.md
Structured incident response workflow — severity classification, diagnosis delegation, auto-fix decisions, notification, and post-incident review. Use when an automated failure event is received or when the user reports a production incident.
npx skillsauth add nixopus/agent incident-responseInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Classify the incident by severity before acting:
| Severity | Criteria | Response time | Action | |---|---|---|---| | Critical | App completely down, all users affected | Immediate | Diagnose + attempt auto-fix + notify | | High | App degraded, errors for some users | Within minutes | Diagnose + attempt auto-fix + notify | | Medium | Non-user-facing failure (build failed, deploy failed) | Within session | Diagnose + fix suggestion + notify | | Low | Warning, non-critical issue detected | Informational | Notify only |
| Signal | Severity |
|---|---|
| Container exited, restart_count > 3 | Critical |
| HTTP probe returns 502/503/504 | Critical |
| Build failed | Medium |
| Container OOM-killed once | High |
| Health endpoint returns unhealthy | High |
| Deployment succeeded but no traffic | High |
| SSL certificate expiring soon | Medium |
| Disk usage > 85% | Medium |
Collect context about the affected resource:
get_application — app details, current deployment, configured portget_application_deployments — recent deployment historyget_deployment_logs — if build/deploy failedlist_containers → get_container — container statusget_container_logs — runtime errorsDelegate to the diagnostic agent with full context:
failure-diagnosis skill for pattern matchingBased on diagnosis:
| Root cause type | Action | |---|---| | Code error (syntax, missing dep, config) | Auto-fix via PR | | Dockerfile issue (wrong base image, missing file) | Auto-fix via PR | | Environment variable missing or wrong | Notify user — env vars need manual input | | Infrastructure (server resources, Docker daemon) | Notify user — requires manual intervention | | Database connection failed | Notify user — check database status and credentials | | External service down | Notify user — nothing to fix on our side | | Unknown | Notify user with gathered evidence |
Delegate to the GitHub agent:
auto-fix/<short-description> (e.g. auto-fix/missing-prisma-schema)fix: <description of what was fixed>Send to all configured notification channels:
If fix PR created:
Failure detected for [app name].
Root cause: [one-line summary].
Auto-fix PR: [pr_url]
Review and merge to trigger redeploy.
If no fix possible:
Failure detected for [app name].
Root cause: [one-line summary].
Recommended action: [specific next step].
If diagnosis inconclusive:
Issue detected for [app name].
Findings: [what was observed].
Unable to determine root cause automatically.
Please investigate: [specific things to check].
If the fix PR is merged and a new deployment triggers:
failure-diagnosis — Pattern tables for identifying root causesrollback-strategy — When to rollback vs fix forwardpost-deploy-verification — Verify fix worked after mergeYour prompt contains the full incident context formatted by the event pipeline. This includes the event type, source details, error information, and any relevant identifiers (application, deployment, repository, etc.). Use all provided context to drive your investigation.
tools
Compressed catalog of all Nixopus API operations for the nixopus_api() tool
development
Deploy static file sites — Caddy/nginx serving, Staticfile config, and Dockerfile patterns. Use when deploying a static HTML site with no server-side runtime, or when index.html or a Staticfile is detected at the project root.
devops
Deploy shell script applications — interpreter detection, setup scripts, and Dockerfile patterns. Use when deploying a shell script project, or when start.sh is detected.
development
Self-healing loop for failed deployments — diagnose, fix, redeploy up to 3 attempts, then escalate or rollback. Load when a deployment fails or build errors occur.