plugins/incident-response/skills/incident-runbook-templates/SKILL.md
Create structured incident response runbooks with step-by-step procedures, escalation paths, and recovery actions. Use this skill when building a service outage runbook for a payment processing system; creating database incident procedures covering connection pool exhaustion, replication lag, and disk space alerts; onboarding new on-call engineers who need step-by-step recovery guides written for a 3 AM brain; or standardizing escalation matrices across multiple engineering teams.
npx skillsauth add wshobson/agents incident-runbook-templatesInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Production-ready templates for incident response runbooks covering detection, triage, mitigation, resolution, and communication.
| Severity | Impact | Response Time | Example | | -------- | -------------------------- | ----------------- | ----------------------- | | SEV1 | Complete outage, data loss | 15 min | Production down | | SEV2 | Major degradation | 30 min | Critical feature broken | | SEV3 | Minor impact | 2 hours | Non-critical bug | | SEV4 | Minimal impact | Next business day | Cosmetic issue |
1. Overview & Impact
2. Detection & Alerts
3. Initial Triage
4. Mitigation Steps
5. Root Cause Investigation
6. Resolution Procedures
7. Verification & Rollback
8. Communication Templates
9. Escalation Matrix
Detailed pattern documentation lives in references/details.md. Read that file when the navigation tier above is insufficient.
Steps often assume preconditions that are true in a healthy environment but not during an outage. For each command in your runbook, add a prerequisite check and a "what to do if this command fails" note:
# Step: Check pod status
kubectl get pods -n payments
# Prerequisites: kubectl configured, kubeconfig points to correct cluster
# If this fails: run `aws eks update-kubeconfig --name prod-cluster --region us-east-1`
# Expected output: pods in Running state
Add a numbered checklist at the top of the runbook that mirrors the section numbers, so responders can track progress under stress without reading the full document:
## Quick Checklist
- [ ] 1. Declare incident severity and open war room
- [ ] 2. Check service health (Section 4.1)
- [ ] 3. Check recent deployments (Section 4.1)
- [ ] 4. Roll back if deploy is suspect (Section 4.1)
- [ ] 5. Post initial notification to #payments-incidents
- [ ] 6. Escalate if > 15 min unresolved
Runbooks rot because they're updated manually. Include a "Last Verified" date and owner at the top, and add a CI check that validates all curl endpoints and kubectl context names are still valid:
## Runbook Metadata
| Field | Value |
|---|---|
| Last verified | 2024-11-15 |
| Owner | @platform-team |
| Review cadence | After every SEV1/SEV2 |
Assign a dedicated incident communicator role (separate from the incident commander) whose only job is to post status updates. Add a standing agenda in the communication template:
Update every 15 minutes (even if no new information):
- Current status (Investigating / Mitigating / Monitoring)
- Impact (what is broken, who is affected, % of traffic)
- What we are doing right now
- Next update in: 15 minutes
Add explicit warnings before destructive SQL commands and require a dry-run output check before executing:
-- WARNING: This terminates active connections. Verify count first.
-- DRY RUN (check count before terminating):
SELECT count(*) FROM pg_stat_activity WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
-- EXECUTE only after verifying count is reasonable (< 50):
SELECT pg_terminate_backend(pid) FROM pg_stat_activity
WHERE state = 'idle' AND query_start < now() - interval '10 minutes';
postmortem-writing - After resolving an incident, use postmortem templates to capture root cause and preventive actionson-call-handoff-patterns - Structure shift handoffs so the incoming responder has full context on active incidentsdevelopment
This skill should be used when the user asks to "optimize a prompt", "improve prompt performance", "design a prompt template", "write better prompts", "debug prompt issues", "use chain-of-thought", "structured prompting", "few-shot prompting", or wants to apply advanced prompt engineering patterns for production LLM applications.
development
Schedule and publish social media posts across 13 platforms (X, LinkedIn, Instagram, Facebook Pages, TikTok, Discord, Telegram, YouTube, Reddit, WordPress, Pinterest) via the SocialClaw API. Use when the user wants to publish, schedule, or manage social media content programmatically. Requires SOCIALCLAW_API_KEY.
development
Implement modern responsive layouts using container queries, fluid typography, CSS Grid, and mobile-first breakpoint strategies. Use when building adaptive interfaces, implementing fluid layouts, or creating component-level responsive behavior.
development
Master React Native styling, navigation, and Reanimated animations for cross-platform mobile development. Use when building React Native apps, implementing navigation patterns, or creating performant animations.