areas/software/platform/skills/incident-response/SKILL.md
# Skill: Incident Response Runbooks

## When to load

When responding to a production alert, diagnosing an outage, or writing a postmortem.

## Severity Classification

| Severity | Definition | Response Time |
|:---|:---|:---|
| P0 | Complete outage, data loss | Immediate |
| P1 | Significant degradation, key feature broken | 15 min |
| P2 | Minor degradation, workaround exists | 1 hour |
| P3 | Non-user-facing | Next business day |
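For alert tooling or chat-ops scripts, the severity table can be encoded as a small lookup. The following is a minimal sketch only; the function name `response_time` is hypothetical and not part of this skill.

```bash
#!/usr/bin/env bash
# Hypothetical helper: map a severity label to the response-time target
# listed in the Severity Classification table.
response_time() {
  case "$1" in
    P0) echo "Immediate" ;;
    P1) echo "15 min" ;;
    P2) echo "1 hour" ;;
    P3) echo "Next business day" ;;
    *)  echo "unknown severity: $1" >&2; return 1 ;;
  esac
}
```

For example, `response_time P1` prints `15 min`, and an unrecognized label returns a non-zero exit status so a calling script can fail loudly instead of guessing.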
## P0 Response Playbook

```
T+0:   ACKNOWLEDGE — "I'm on this"
T+5:   SCOPE — What's broken? Who's affected? Since when?
T+10:  COMMUNICATE — Post status page update; notify stakeholders
T+15:  MITIGATE — Rollback > fix. Prefer reversible actions.
       Order: rollback deploy → feature flag off → scale up → redirect traffic
T+30:  STABILIZE — Confirm metrics returning to normal
T+60:  DOCUMENT — Write preliminary postmortem
T+24h: POSTMORTEM — Full 5-whys analysis, action items
```
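The T+15 mitigation order can be sketched as a loop that tries each step in turn and stops at the first one that succeeds. This is an illustration under stated assumptions: `try_mitigations` and the step names in the usage comment are hypothetical, each step is a function or command you define yourself, and "succeeds" here only means the command exited with status 0. Confirming that metrics actually recover is still the T+30 step.

```bash
#!/usr/bin/env bash
# Sketch: walk the mitigation order, stopping at the first step that succeeds.
# Each argument is the name of a caller-defined function or command
# (hypothetical; nothing here is provided by this skill).
try_mitigations() {
  local step
  for step in "$@"; do
    echo "trying: $step" >&2      # progress goes to stderr
    if "$step"; then
      echo "$step"                # stdout: the step that exited 0
      return 0
    fi
  done
  return 1                        # every step failed; escalate
}

# Usage, with placeholder step functions you would define:
#   rollback_deploy() { kubectl rollout undo deployment/api; }
#   try_mitigations rollback_deploy feature_flag_off scale_up redirect_traffic
```

Passing the steps as arguments keeps the order visible at the call site, so the most reversible action is always attempted first.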
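```bash
# High error rate: check recent deploys
kubectl rollout history deployment/api
# --tail=-1 is required: with a label selector, kubectl logs returns only
# the last 10 lines per pod by default
kubectl logs -l app=api --since=10m --tail=-1 | grep ERROR | tail -50
kubectl rollout undo deployment/api  # Only if a recent deploy is the likely cause

# DB connection exhaustion: count connections by state
psql -c "SELECT count(*), state FROM pg_stat_activity GROUP BY state;"
# Terminate connections that have been idle for more than 10 minutes
psql -c "SELECT pg_terminate_backend(pid) FROM pg_stat_activity
  WHERE state = 'idle' AND age(clock_timestamp(), state_change) > interval '10 minutes';"
```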
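Because terminating backends is not reversible, it can help to preview which sessions match before running `pg_terminate_backend`. The sketch below builds a read-only query with the same idle filter; the function name `idle_sessions_sql` and its threshold argument are assumptions for illustration, not part of this skill. The columns it selects (`pid`, `usename`, `application_name`, `state_change`) are standard `pg_stat_activity` columns.

```bash
#!/usr/bin/env bash
# Hypothetical helper: print a read-only query listing sessions that have
# been idle longer than N minutes (default 10). It terminates nothing.
idle_sessions_sql() {
  local minutes="${1:-10}"
  # Accept digits only, so the value is safe to interpolate into SQL.
  case "$minutes" in
    ''|*[!0-9]*) echo "minutes must be a whole number" >&2; return 1 ;;
  esac
  cat <<SQL
SELECT pid, usename, application_name, state_change
FROM pg_stat_activity
WHERE state = 'idle'
  AND age(clock_timestamp(), state_change) > interval '${minutes} minutes'
ORDER BY state_change;
SQL
}

# Usage (assumes psql is already configured to reach the database):
#   psql -c "$(idle_sessions_sql 10)"
```

Reviewing this list first makes it easier to spot connection pools or long-lived admin sessions that should be left alone.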