Incident Response

Severity Levels

Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.

| Severity | Name | Description | Examples | Response Time | Update Cadence | Responders | |----------|------|-------------|----------|---------------|----------------|------------| | P0 | Total Outage | Complete service unavailability. All users affected. Revenue-impacting. | Site completely down, data corruption, security breach with active exploitation | < 15 minutes | Every 15 minutes | All hands on deck, executive notification | | P1 | Major Degradation | Core functionality severely impaired. Large portion of users affected. | Payment processing broken, authentication failing for most users, major data pipeline stalled | < 30 minutes | Every 30 minutes | On-call engineer + team lead, stakeholder notification | | P2 | Partial Impact | Non-core functionality broken or core functionality degraded for a subset of users. | Search feature down, slow responses in one region, intermittent errors for some users | < 2 hours | Every 2 hours | On-call engineer | | P3 | Minor Issue | Cosmetic issues, minor bugs, or issues with workarounds available. | UI glitch, non-critical background job delayed, minor data inconsistency | Next business day | Daily (if ongoing) | Assigned engineer |

Escalation Rules

If a P2 is not resolved within 4 hours, escalate to P1.
If a P1 is not resolved within 2 hours, escalate to P0.
Any incident involving data breach or security compromise is automatically P0.
When in doubt, over-classify. It is better to downgrade than to under-respond.

Triage Procedure

When an incident is detected (alert, user report, or monitoring), follow this triage sequence:

Step 1: Assess Impact

Answer these questions within the first 5 minutes:

[ ] Who is affected? (All users, subset, internal only)
[ ] What is broken? (Specific feature, entire service, data integrity)
[ ] When did it start? (Check monitoring for onset time)
[ ] Is it getting worse? (Error rate trending up, stable, or recovering)

Step 2: Assign Severity

Based on the impact assessment, assign a severity level using the table above. Document the severity and reasoning.

Step 3: Assemble the Response Team

| Severity | Who to Notify | |----------|--------------| | P0 | On-call engineer, engineering manager, VP Engineering, customer support lead, communications | | P1 | On-call engineer, engineering manager, customer support lead | | P2 | On-call engineer, relevant team lead | | P3 | Assigned engineer (via ticket) |

Step 4: Designate Roles

For P0 and P1 incidents, explicitly assign these roles:

| Role | Responsibility | |------|---------------| | Incident Commander (IC) | Coordinates response, makes decisions, delegates tasks. Does NOT debug. | | Technical Lead | Leads investigation and mitigation. Communicates findings to IC. | | Communications Lead | Drafts and sends status updates to stakeholders and customers. | | Scribe | Documents timeline, actions taken, and decisions in real time. |

Step 5: Open Incident Channel

Create a dedicated communication channel (Slack channel, incident bridge call):

Name it clearly: #incident-2024-03-15-payments-down
Pin the initial assessment and severity level.
All discussion and decisions happen in this channel.

Communication Templates

Initial Alert

INCIDENT DECLARED: [P0/P1/P2] - [Brief description]

Impact: [Who is affected and how]
Start time: [When it began, UTC]
Current status: [Investigating / Identified / Mitigating]
Incident Commander: [Name]
Channel: #incident-[date]-[topic]

Next update in [15/30] minutes.

Status Update

INCIDENT UPDATE: [P0/P1/P2] - [Brief description]

Status: [Investigating / Identified / Mitigating / Resolved]
Duration: [Time since start]
Current state: [What is happening now]
Actions taken: [What has been tried]
Next steps: [What will be done next]

Next update in [15/30] minutes.

Resolution Notification

INCIDENT RESOLVED: [P0/P1/P2] - [Brief description]

Duration: [Total time from detection to resolution]
Root cause: [One-sentence summary]
Resolution: [What fixed it]
Customer impact: [Summary of user-facing impact]
Follow-up: Post-incident review scheduled for [date/time].

Monitoring for recurrence.

Customer-Facing Communication

[Service Name] Status Update

We are aware of an issue affecting [description of user impact].
Our team is actively working to resolve this.

Started: [Time, timezone]
Current status: [Brief, non-technical description]

We will provide updates every [30 minutes / 1 hour].
We apologize for the inconvenience.

Investigation Methodology

Follow this structured approach rather than randomly checking things. Work from symptoms toward root cause.

Step 1: Check Dashboards

Start with the service overview dashboard. Look for:

Error rate spikes (when exactly did they start?)
Latency increases (gradual degradation or sudden jump?)
Traffic anomalies (unexpected spike or drop?)
Resource utilization (CPU, memory, disk, connections)

Step 2: Check Recent Deploys

Most incidents are caused by recent changes. Check:

# What was deployed recently?
git log --oneline --since="2 hours ago" origin/main

# Any recent infrastructure changes?
# Check deployment pipeline history, Terraform runs, config changes

Questions to answer:

Was anything deployed in the last 2 hours?
Was there a configuration change (feature flags, environment variables)?
Was there an infrastructure change (scaling, migration, certificate renewal)?

Step 3: Examine Logs

Search logs filtered to the incident timeframe:

# Example queries (adapt to your log aggregation platform)
# ELK/Kibana: level:ERROR AND service:order-service AND @timestamp >= "2024-03-15T14:00:00"
# Datadog: service:order-service status:error
# CloudWatch: filter @message like /ERROR/ | sort @timestamp desc

Look for:

Error messages with stack traces.
Repeated error patterns (same error thousands of times).
New error types that were not present before the incident.
Correlation between errors and the timeline.

Step 4: Hypothesize and Test

Based on data gathered, form a hypothesis and test it:

| Hypothesis | How to Test | |------------|------------| | Bad deploy caused it | Compare error timeline with deploy timestamp. Roll back and observe. | | Database is overloaded | Check connection pool, slow query log, lock contention. | | External dependency is down | Check dependency status page, test connectivity, check timeout rates. | | Traffic spike overwhelmed the service | Check request rate, compare to normal baseline, check auto-scaling. | | DNS or certificate issue | Test DNS resolution, check certificate expiry, verify SSL handshake. | | Memory leak | Check memory usage trend, look for OOM kills in system logs. | | Data corruption | Query for inconsistent data, check recent migration or backfill jobs. |

Step 5: Verify the Fix

After applying a fix:

[ ] Error rate returning to baseline.
[ ] Latency returning to normal.
[ ] No new error patterns appearing.
[ ] Affected functionality manually verified.
[ ] Monitor for at least 15 minutes (P0/P1) or 30 minutes (P2) before declaring resolved.

Common Mitigation Actions

When the root cause is identified (or even before, to reduce impact), apply the appropriate mitigation:

| Action | When to Use | How | Risk | |--------|------------|-----|------| | Rollback | Bad deploy identified | Revert to previous known-good version via deployment pipeline | May lose new features; verify database compatibility | | Feature flag toggle | New feature causing issues | Disable the flag in your feature management system | Requires feature flags to be in place | | Horizontal scaling | Service overwhelmed by traffic | Increase instance count via auto-scaler or manual scaling | Increased cost; may not help if bottleneck is downstream | | Cache clear | Stale or corrupted cached data | Flush application cache (Redis FLUSHDB, CDN purge) | Temporary increase in origin load after flush | | Circuit breaker | Failing dependency cascading | Activate circuit breaker to fail fast instead of waiting | Gracefully degraded experience for users | | Traffic shedding | Total overload | Rate limit or redirect traffic, enable maintenance page | Users see errors or degraded service | | Database failover | Primary database unresponsive | Promote replica to primary (if configured) | Brief downtime during promotion; verify replication lag | | DNS redirect | Entire region or provider down | Update DNS to point to backup region or provider | Propagation delay (use low TTL proactively) | | Restart | Process stuck, memory leak | Rolling restart of application instances | Brief capacity reduction during restart | | Hotfix | Small targeted code fix needed | Fast-track a minimal change through deployment pipeline | Bypasses normal review; must be reviewed post-incident |

Rollback Procedure

# Verify the last known-good version
git log --oneline -10 origin/main

# Tag the rollback point
git tag -a incident-rollback-2024-03-15 -m "Rolling back due to P1 incident"

# Trigger deployment of previous version
# (Adapt to your deployment pipeline)
# Example: Kubernetes
kubectl rollout undo deployment/order-service

# Verify rollback is deployed
kubectl rollout status deployment/order-service

# Monitor error rate and confirm reduction

Post-Incident Review

Conduct a post-incident review (PIR) within 48 hours of resolution for P0/P1 incidents and within one week for P2 incidents.

Post-Incident Review Template

# Post-Incident Review: [Incident Title]

**Date**: [date] | **Severity**: [P0/P1/P2] | **Duration**: [time] | **Author**: [name]

## Summary
[2-3 sentences: what happened, the impact, and the resolution.]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:00 | Alert fires: order error rate > 5% |
| 14:05 | P1 declared, incident channel created |
| 14:15 | Recent deploy at 13:45 identified as suspect |
| 14:25 | Rollback deployed |
| 14:45 | Resolved, monitoring for recurrence |

## Impact
Users affected, duration, revenue/SLA impact, data impact.

## Root Cause
[Detailed technical explanation of what went wrong and why.]

## Five Whys
1. **Why** did orders fail? -> Payment validation threw an exception.
2. **Why** did validation throw? -> Null value for a non-nullable field.
3. **Why** was the field null? -> Migration added column but did not backfill.
4. **Why** was backfill missed? -> No checklist step for backfill verification.
5. **Why** no checklist step? -> Migration procedures were undocumented.

## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add integration test for null field validation | @alice | High | YYYY-MM-DD | TODO |
| Lower alert threshold from 5% to 2% | @bob | High | YYYY-MM-DD | TODO |
| Add feature flag to payment flow | @carol | Medium | YYYY-MM-DD | TODO |

## Lessons Learned
- What went well: [e.g., quick detection, rapid team assembly]
- What could improve: [e.g., rollback automation, test coverage]

Blameless Culture Principles

Post-incident reviews are learning opportunities, not blame sessions. Adhere to these principles:

| Principle | Practice | |-----------|----------| | Assume good intent | People made the best decisions they could with the information they had at the time. | | Focus on systems, not individuals | Ask "what allowed this to happen?" not "who caused this?" | | Separate the what from the who | Describe actions taken without naming individuals in the root cause. Use role titles if context is needed. | | Reward transparency | Publicly thank people who report incidents, share mistakes, or identify risks. | | Follow through on action items | PIR action items are tracked and completed. Unfixed systemic issues lead to repeat incidents. | | Share learnings broadly | Publish PIR summaries (redacted if needed) so other teams learn too. |

Runbook Authoring Guide

A runbook is a step-by-step guide for responding to a specific alert or operational scenario.

Runbook Structure

Every runbook follows this structure:

# Runbook: [Alert or Scenario Name]

## When to Use
[Describe the alert, symptom, or scenario that triggers this runbook.]

## Prerequisites
- Access to [systems, dashboards, tools]
- Permissions: [required roles or access levels]

## Steps

### 1. Verify the Problem
```bash
curl -s https://monitoring.example.com/api/v1/query \
  --data-urlencode 'query=rate(http_errors_total{service="order-service"}[5m])'

Expected: Error rate below 0.01. If above, continue. If normal, check thresholds and close.

2. Apply Mitigation

# Option A: Restart
kubectl rollout restart deployment/order-service
# Option B: Rollback
kubectl rollout undo deployment/order-service

3. Verify Resolution

Expected: Error rate drops below 0.01 within 5 minutes.

4. Escalation

If unresolved: escalate to [team], contact [channel/phone], provide [context].


### Runbook Best Practices

| Practice | Reason |
|----------|--------|
| Use exact commands, not descriptions | Under stress, responders should copy-paste, not interpret |
| Include expected output | So responders know if the command worked |
| Provide verification after each step | Catch issues early, do not proceed blindly |
| Include a rollback for each step | If a mitigation step makes things worse |
| Test runbooks regularly | Outdated runbooks cause confusion during real incidents |
| Date-stamp and version runbooks | Know when it was last verified |
| Link from alert to runbook | Reduce time-to-runbook to one click |

## Incident Response Checklist

Quick reference during an active incident:

- [ ] Impact assessed (who, what, when, trending).
- [ ] Severity assigned and documented.
- [ ] Incident channel or bridge opened.
- [ ] Roles assigned (IC, Technical Lead, Comms, Scribe).
- [ ] Initial stakeholder notification sent.
- [ ] Timeline being recorded in real time.
- [ ] Investigation following structured methodology (dashboards, deploys, logs, hypothesize, test).
- [ ] Mitigation applied and impact reducing.
- [ ] Resolution verified with monitoring data.
- [ ] All-clear communication sent.
- [ ] Post-incident review scheduled (within 48 hours for P0/P1).
- [ ] Action items created with owners and due dates.

Incident Response

Severity Levels

Assign a severity level immediately upon incident detection. Severity determines response urgency, communication cadence, and escalation path.

Escalation Rules

If a P2 is not resolved within 4 hours, escalate to P1.
If a P1 is not resolved within 2 hours, escalate to P0.
Any incident involving data breach or security compromise is automatically P0.
When in doubt, over-classify. It is better to downgrade than to under-respond.

Triage Procedure

When an incident is detected (alert, user report, or monitoring), follow this triage sequence:

Step 1: Assess Impact

Answer these questions within the first 5 minutes:

[ ] Who is affected? (All users, subset, internal only)
[ ] What is broken? (Specific feature, entire service, data integrity)
[ ] When did it start? (Check monitoring for onset time)
[ ] Is it getting worse? (Error rate trending up, stable, or recovering)

Step 2: Assign Severity

Based on the impact assessment, assign a severity level using the table above. Document the severity and reasoning.

Step 3: Assemble the Response Team

Step 4: Designate Roles

For P0 and P1 incidents, explicitly assign these roles:

Step 5: Open Incident Channel

Create a dedicated communication channel (Slack channel, incident bridge call):

Name it clearly: #incident-2024-03-15-payments-down
Pin the initial assessment and severity level.
All discussion and decisions happen in this channel.

Communication Templates

Initial Alert

INCIDENT DECLARED: [P0/P1/P2] - [Brief description]

Impact: [Who is affected and how]
Start time: [When it began, UTC]
Current status: [Investigating / Identified / Mitigating]
Incident Commander: [Name]
Channel: #incident-[date]-[topic]

Next update in [15/30] minutes.

Status Update

INCIDENT UPDATE: [P0/P1/P2] - [Brief description]

Status: [Investigating / Identified / Mitigating / Resolved]
Duration: [Time since start]
Current state: [What is happening now]
Actions taken: [What has been tried]
Next steps: [What will be done next]

Next update in [15/30] minutes.

Resolution Notification

INCIDENT RESOLVED: [P0/P1/P2] - [Brief description]

Duration: [Total time from detection to resolution]
Root cause: [One-sentence summary]
Resolution: [What fixed it]
Customer impact: [Summary of user-facing impact]
Follow-up: Post-incident review scheduled for [date/time].

Monitoring for recurrence.

Customer-Facing Communication

[Service Name] Status Update

We are aware of an issue affecting [description of user impact].
Our team is actively working to resolve this.

Started: [Time, timezone]
Current status: [Brief, non-technical description]

We will provide updates every [30 minutes / 1 hour].
We apologize for the inconvenience.

Investigation Methodology

Follow this structured approach rather than randomly checking things. Work from symptoms toward root cause.

Step 1: Check Dashboards

Start with the service overview dashboard. Look for:

Error rate spikes (when exactly did they start?)
Latency increases (gradual degradation or sudden jump?)
Traffic anomalies (unexpected spike or drop?)
Resource utilization (CPU, memory, disk, connections)

Step 2: Check Recent Deploys

Most incidents are caused by recent changes. Check:

# What was deployed recently?
git log --oneline --since="2 hours ago" origin/main

# Any recent infrastructure changes?
# Check deployment pipeline history, Terraform runs, config changes

Questions to answer:

Was anything deployed in the last 2 hours?
Was there a configuration change (feature flags, environment variables)?
Was there an infrastructure change (scaling, migration, certificate renewal)?

Step 3: Examine Logs

Search logs filtered to the incident timeframe:

# Example queries (adapt to your log aggregation platform)
# ELK/Kibana: level:ERROR AND service:order-service AND @timestamp >= "2024-03-15T14:00:00"
# Datadog: service:order-service status:error
# CloudWatch: filter @message like /ERROR/ | sort @timestamp desc

Look for:

Error messages with stack traces.
Repeated error patterns (same error thousands of times).
New error types that were not present before the incident.
Correlation between errors and the timeline.

Step 4: Hypothesize and Test

Based on data gathered, form a hypothesis and test it:

Step 5: Verify the Fix

After applying a fix:

[ ] Error rate returning to baseline.
[ ] Latency returning to normal.
[ ] No new error patterns appearing.
[ ] Affected functionality manually verified.
[ ] Monitor for at least 15 minutes (P0/P1) or 30 minutes (P2) before declaring resolved.

Common Mitigation Actions

When the root cause is identified (or even before, to reduce impact), apply the appropriate mitigation:

Rollback Procedure

# Verify the last known-good version
git log --oneline -10 origin/main

# Tag the rollback point
git tag -a incident-rollback-2024-03-15 -m "Rolling back due to P1 incident"

# Trigger deployment of previous version
# (Adapt to your deployment pipeline)
# Example: Kubernetes
kubectl rollout undo deployment/order-service

# Verify rollback is deployed
kubectl rollout status deployment/order-service

# Monitor error rate and confirm reduction

Post-Incident Review

Conduct a post-incident review (PIR) within 48 hours of resolution for P0/P1 incidents and within one week for P2 incidents.

Post-Incident Review Template

# Post-Incident Review: [Incident Title]

**Date**: [date] | **Severity**: [P0/P1/P2] | **Duration**: [time] | **Author**: [name]

## Summary
[2-3 sentences: what happened, the impact, and the resolution.]

## Timeline (UTC)
| Time | Event |
|------|-------|
| 14:00 | Alert fires: order error rate > 5% |
| 14:05 | P1 declared, incident channel created |
| 14:15 | Recent deploy at 13:45 identified as suspect |
| 14:25 | Rollback deployed |
| 14:45 | Resolved, monitoring for recurrence |

## Impact
Users affected, duration, revenue/SLA impact, data impact.

## Root Cause
[Detailed technical explanation of what went wrong and why.]

## Five Whys
1. **Why** did orders fail? -> Payment validation threw an exception.
2. **Why** did validation throw? -> Null value for a non-nullable field.
3. **Why** was the field null? -> Migration added column but did not backfill.
4. **Why** was backfill missed? -> No checklist step for backfill verification.
5. **Why** no checklist step? -> Migration procedures were undocumented.

## Action Items
| Action | Owner | Priority | Due Date | Status |
|--------|-------|----------|----------|--------|
| Add integration test for null field validation | @alice | High | YYYY-MM-DD | TODO |
| Lower alert threshold from 5% to 2% | @bob | High | YYYY-MM-DD | TODO |
| Add feature flag to payment flow | @carol | Medium | YYYY-MM-DD | TODO |

## Lessons Learned
- What went well: [e.g., quick detection, rapid team assembly]
- What could improve: [e.g., rollback automation, test coverage]

Blameless Culture Principles

Post-incident reviews are learning opportunities, not blame sessions. Adhere to these principles:

Runbook Authoring Guide

A runbook is a step-by-step guide for responding to a specific alert or operational scenario.

Runbook Structure

Every runbook follows this structure:

# Runbook: [Alert or Scenario Name]

## When to Use
[Describe the alert, symptom, or scenario that triggers this runbook.]

## Prerequisites
- Access to [systems, dashboards, tools]
- Permissions: [required roles or access levels]

## Steps

### 1. Verify the Problem
```bash
curl -s https://monitoring.example.com/api/v1/query \
  --data-urlencode 'query=rate(http_errors_total{service="order-service"}[5m])'

Expected: Error rate below 0.01. If above, continue. If normal, check thresholds and close.

2. Apply Mitigation

# Option A: Restart
kubectl rollout restart deployment/order-service
# Option B: Rollback
kubectl rollout undo deployment/order-service

3. Verify Resolution

Expected: Error rate drops below 0.01 within 5 minutes.

4. Escalation

If unresolved: escalate to [team], contact [channel/phone], provide [context].


### Runbook Best Practices

| Practice | Reason |
|----------|--------|
| Use exact commands, not descriptions | Under stress, responders should copy-paste, not interpret |
| Include expected output | So responders know if the command worked |
| Provide verification after each step | Catch issues early, do not proceed blindly |
| Include a rollback for each step | If a mitigation step makes things worse |
| Test runbooks regularly | Outdated runbooks cause confusion during real incidents |
| Date-stamp and version runbooks | Know when it was last verified |
| Link from alert to runbook | Reduce time-to-runbook to one click |

## Incident Response Checklist

Quick reference during an active incident:

- [ ] Impact assessed (who, what, when, trending).
- [ ] Severity assigned and documented.
- [ ] Incident channel or bridge opened.
- [ ] Roles assigned (IC, Technical Lead, Comms, Scribe).
- [ ] Initial stakeholder notification sent.
- [ ] Timeline being recorded in real time.
- [ ] Investigation following structured methodology (dashboards, deploys, logs, hypothesize, test).
- [ ] Mitigation applied and impact reducing.
- [ ] Resolution verified with monitoring data.
- [ ] All-clear communication sent.
- [ ] Post-incident review scheduled (within 48 hours for P0/P1).
- [ ] Action items created with owners and due dates.

Adoption

claude-code-community-ireland/incident-response

$ install --global

Security Scan Results

SKILL.md

Incident Response

Severity Levels

Escalation Rules

Triage Procedure

Step 1: Assess Impact

Step 2: Assign Severity

Step 3: Assemble the Response Team

Step 4: Designate Roles

Step 5: Open Incident Channel

Communication Templates

Initial Alert

Status Update

Resolution Notification

Customer-Facing Communication

Investigation Methodology

Step 1: Check Dashboards

Step 2: Check Recent Deploys

Step 3: Examine Logs

Step 4: Hypothesize and Test

Step 5: Verify the Fix

Common Mitigation Actions

Rollback Procedure

Post-Incident Review

Post-Incident Review Template

Blameless Culture Principles

Runbook Authoring Guide

Runbook Structure

2. Apply Mitigation

3. Verify Resolution

4. Escalation

Related Skills

claude-code-community-ireland/changelog-automation

claude-code-community-ireland/cc-skill-project-guidelines-example

claude-code-community-ireland/cc-skill-continuous-learning

claude-code-community-ireland/canvas-design

claude-code-community-ireland/incident-response

$ install --global

Security Scan Results

SKILL.md

Incident Response

Severity Levels

Escalation Rules

Triage Procedure

Step 1: Assess Impact

Step 2: Assign Severity

Step 3: Assemble the Response Team

Step 4: Designate Roles

Step 5: Open Incident Channel

Communication Templates

Initial Alert

Status Update

Resolution Notification

Customer-Facing Communication

Investigation Methodology

Step 1: Check Dashboards

Step 2: Check Recent Deploys

Step 3: Examine Logs

Step 4: Hypothesize and Test

Step 5: Verify the Fix

Common Mitigation Actions

Rollback Procedure

Post-Incident Review

Post-Incident Review Template

Blameless Culture Principles

Runbook Authoring Guide

Runbook Structure

2. Apply Mitigation

3. Verify Resolution

4. Escalation

Related Skills

claude-code-community-ireland/changelog-automation

claude-code-community-ireland/cc-skill-project-guidelines-example

claude-code-community-ireland/cc-skill-continuous-learning

claude-code-community-ireland/canvas-design