plugins/vibeworks-library/skills/incident-response/SKILL.md
Incident response procedures — triage, communication, investigation, mitigation, and post-incident review. Use when handling production incidents or writing runbooks.
npx skillsauth add claude-code-community-ireland/claude-code-resources incident-responseInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.
| Severity | Name | Description | Examples | Response Time | Update Cadence | Responders | |----------|------|-------------|----------|---------------|----------------|------------| | P0 | Total Outage | Complete service unavailability. All users affected. Revenue-impacting. | Site completely down, data corruption, security breach with active exploitation | < 15 minutes | Every 15 minutes | All hands on deck, executive notification | | P1 | Major Degradation | Core functionality severely impaired. Large portion of users affected. | Payment processing broken, authentication failing for most users, major data pipeline stalled | < 30 minutes | Every 30 minutes | On-call engineer + team lead, stakeholder notification | | P2 | Partial Impact | Non-core functionality broken or core functionality degraded for a subset of users. | Search feature down, slow responses in one region, intermittent errors for some users | < 2 hours | Every 2 hours | On-call engineer | | P3 | Minor Issue | Cosmetic issues, minor bugs, or issues with workarounds available. | UI glitch, non-critical background job delayed, minor data inconsistency | Next business day | Daily (if ongoing) | Assigned engineer |
When an incident is detected (alert, user report, or monitoring), follow this triage sequence:
Answer these questions within the first 5 minutes:
Based on the impact assessment, assign a severity level using the table above. Document the severity and reasoning.
| Severity | Who to Notify | |----------|--------------| | P0 | On-call engineer, engineering manager, VP Engineering, customer support lead, communications | | P1 | On-call engineer, engineering manager, customer support lead | | P2 | On-call engineer, relevant team lead | | P3 | Assigned engineer (via ticket) |
For P0 and P1 incidents, explicitly assign these roles:
| Role | Responsibility | |------|---------------| | Incident Commander (IC) | Coordinates response, makes decisions, delegates tasks. Does NOT debug. | | Technical Lead | Leads investigation and mitigation. Communicates findings to IC. | | Communications Lead | Drafts and sends status updates to stakeholders and customers. | | Scribe | Documents timeline, actions taken, and decisions in real time. |
Create a dedicated communication channel (Slack channel, incident bridge call):
#incident-2024-03-15-payments-downINCIDENT DECLARED: [P0/P1/P2] - [Brief description]
Impact: [Who is affected and how]
Start time: [When it began, UTC]
Current status: [Investigating / Identified / Mitigating]
Incident Commander: [Name]
Channel: #incident-[date]-[topic]
Next update in [15/30] minutes.
INCIDENT UPDATE: [P0/P1/P2] - [Brief description]
Status: [Investigating / Identified / Mitigating / Resolved]
Duration: [Time since start]
Current state: [What is happening now]
Actions taken: [What has been tried]
Next steps: [What will be done next]
Next update in [15/30] minutes.
INCIDENT RESOLVED: [P0/P1/P2] - [Brief description]
Duration: [Total time from detection to resolution]
Root cause: [One-sentence summary]
Resolution: [What fixed it]
Customer impact: [Summary of user-facing impact]
Follow-up: Post-incident review scheduled for [date/time].
Monitoring for recurrence.
[Service Name] Status Update
We are aware of an issue affecting [description of user impact].
Our team is actively working to resolve this.
Started: [Time, timezone]
Current status: [Brief, non-technical description]
We will provide updates every [30 minutes / 1 hour].
We apologize for the inconvenience.
Follow this structured approach rather than randomly checking things. Work from symptoms toward root cause.
Start with the service overview dashboard. Look for:
Most incidents are caused by recent changes. Check:
# What was deployed recently?
git log --oneline --since="2 hours ago" origin/main
# Any recent infrastructure changes?
# Check deployment pipeline history, Terraform runs, config changes
Questions to answer:
Search logs filtered to the incident timeframe:
# Example queries (adapt to your log aggregation platform)
# ELK/Kibana: level:ERROR AND service:order-service AND @timestamp >= "2024-03-15T14:00:00"
# Datadog: service:order-service status:error
# CloudWatch: filter @message like /ERROR/ | sort @timestamp desc
Look for:
Based on data gathered, form a hypothesis and test it:
| Hypothesis | How to Test | |------------|------------| | Bad deploy caused it | Compare error timeline with deploy timestamp. Roll back and observe. | | Database is overloaded | Check connection pool, slow query log, lock contention. | | External dependency is down | Check dependency status page, test connectivity, check timeout rates. | | Traffic spike overwhelmed the service | Check request rate, compare to normal baseline, check auto-scaling. | | DNS or certificate issue | Test DNS resolution, check certificate expiry, verify SSL handshake. | | Memory leak | Check memory usage trend, look for OOM kills in system logs. | | Data corruption | Query for inconsistent data, check recent migration or backfill jobs. |
After applying a fix:
When the root cause is identified (or even before, to reduce impact), apply the appropriate mitigation:
| Action | When to Use | How | Risk |
|--------|------------|-----|------|
| Rollback | Bad deploy identified | Revert to previous known-good version via deployment pipeline | May lose new features; verify database compatibility |
| Feature flag toggle | New feature causing issues | Disable the flag in your feature management system | Requires feature flags to be in place |
| Horizontal scaling | Service overwhelmed by traffic | Increase instance count via auto-scaler or manual scaling | Increased cost; may not help if bottleneck is downstream |
| Cache clear | Stale or corrupted cached data | Flush application cache (Redis FLUSHDB, CDN purge) | Temporary increase in origin load after flush |
| Circuit breaker | Failing dependency cascading | Activate circuit breaker to fail fast instead of waiting | Gracefully degraded experience for users |
| Traffic shedding | Total overload | Rate limit or redirect traffic, enable maintenance page | Users see errors or degraded service |
| Database failover | Primary database unresponsive | Promote replica to primary (if configured) | Brief downtime during promotion; verify replication lag |
| DNS redirect | Entire region or provider down | Update DNS to point to backup region or provider | Propagation delay (use low TTL proactively) |
| Restart | Process stuck, memory leak | Rolling restart of application instances | Brief capacity reduction during restart |
| Hotfix | Small targeted code fix needed | Fast-track a minimal change through deployment pipeline | Bypasses normal review; must be reviewed post-incident |
# Verify the last known-good version
git log --oneline -10 origin/main
# Tag the rollback point
git tag -a incident-rollback-2024-03-15 -m "Rolling back due to P1 incident"
# Trigger deployment of previous version
# (Adapt to your deployment pipeline)
# Example: Kubernetes
kubectl rollout undo deployment/order-service
# Verify rollback is deployed
kubectl rollout status deployment/order-service
# Monitor error rate and confirm reduction
Conduct a post-incident review (PIR) within 48 hours of resolution for P0/P1 incidents and within one week for P2 incidents.
# Post-Incident Review: [Incident Title]
**Date**: [date] | **Severity**: [P0/P1/P2] | **Duration**: [time] | **Author**: [name]
## Summary
[2-3 sentences: what happened, the impact, and the resolution.]
## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:00 | Alert fires: order error rate > 5% |
| 14:05 | P1 declared, incident channel created |
| 14:15 | Recent deploy at 13:45 identified as suspect |
| 14:25 | Rollback deployed |
| 14:45 | Resolved, monitoring for recurrence |
## Impact
Users affected, duration, revenue/SLA impact, data impact.
## Root Cause
[Detailed technical explanation of what went wrong and why.]
## Five Whys
1. **Why** did orders fail? -> Payment validation threw an exception.
2. **Why** did validation throw? -> Null value for a non-nullable field.
3. **Why** was the field null? -> Migration added column but did not backfill.
4. **Why** was backfill missed? -> No checklist step for backfill verification.
5. **Why** no checklist step? -> Migration procedures were undocumented.
## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add integration test for null field validation | @alice | High | YYYY-MM-DD | TODO |
| Lower alert threshold from 5% to 2% | @bob | High | YYYY-MM-DD | TODO |
| Add feature flag to payment flow | @carol | Medium | YYYY-MM-DD | TODO |
## Lessons Learned
- What went well: [e.g., quick detection, rapid team assembly]
- What could improve: [e.g., rollback automation, test coverage]
Post-incident reviews are learning opportunities, not blame sessions. Adhere to these principles:
| Principle | Practice | |-----------|----------| | Assume good intent | People made the best decisions they could with the information they had at the time. | | Focus on systems, not individuals | Ask "what allowed this to happen?" not "who caused this?" | | Separate the what from the who | Describe actions taken without naming individuals in the root cause. Use role titles if context is needed. | | Reward transparency | Publicly thank people who report incidents, share mistakes, or identify risks. | | Follow through on action items | PIR action items are tracked and completed. Unfixed systemic issues lead to repeat incidents. | | Share learnings broadly | Publish PIR summaries (redacted if needed) so other teams learn too. |
A runbook is a step-by-step guide for responding to a specific alert or operational scenario.
Every runbook follows this structure:
# Runbook: [Alert or Scenario Name]
## When to Use
[Describe the alert, symptom, or scenario that triggers this runbook.]
## Prerequisites
- Access to [systems, dashboards, tools]
- Permissions: [required roles or access levels]
## Steps
### 1. Verify the Problem
```bash
curl -s https://monitoring.example.com/api/v1/query \
--data-urlencode 'query=rate(http_errors_total{service="order-service"}[5m])'
Expected: Error rate below 0.01. If above, continue. If normal, check thresholds and close.
# Option A: Restart
kubectl rollout restart deployment/order-service
# Option B: Rollback
kubectl rollout undo deployment/order-service
Expected: Error rate drops below 0.01 within 5 minutes.
If unresolved: escalate to [team], contact [channel/phone], provide [context].
### Runbook Best Practices
| Practice | Reason |
|----------|--------|
| Use exact commands, not descriptions | Under stress, responders should copy-paste, not interpret |
| Include expected output | So responders know if the command worked |
| Provide verification after each step | Catch issues early, do not proceed blindly |
| Include a rollback for each step | If a mitigation step makes things worse |
| Test runbooks regularly | Outdated runbooks cause confusion during real incidents |
| Date-stamp and version runbooks | Know when it was last verified |
| Link from alert to runbook | Reduce time-to-runbook to one click |
## Incident Response Checklist
Quick reference during an active incident:
- [ ] Impact assessed (who, what, when, trending).
- [ ] Severity assigned and documented.
- [ ] Incident channel or bridge opened.
- [ ] Roles assigned (IC, Technical Lead, Comms, Scribe).
- [ ] Initial stakeholder notification sent.
- [ ] Timeline being recorded in real time.
- [ ] Investigation following structured methodology (dashboards, deploys, logs, hypothesize, test).
- [ ] Mitigation applied and impact reducing.
- [ ] Resolution verified with monitoring data.
- [ ] All-clear communication sent.
- [ ] Post-incident review scheduled (within 48 hours for P0/P1).
- [ ] Action items created with owners and due dates.
tools
Automate changelog generation from commits, PRs, and releases following Keep a Changelog format. Use when setting up release workflows, generating release notes, or standardizing commit conventions.
documentation
Project Guidelines Skill (Example)
development
Development skill from everything-claude-code
documentation
Create beautiful visual art in .png and .pdf documents using design philosophy. You should use this skill when the user asks to create a poster, piece of art, design, or other static piece. Create original visual designs, never copying existing artists' work to avoid copyright violations.