Incident Response Workflow

This skill defines the structured process for handling production incidents. It MUST be followed for all SEV1, SEV2, and SEV3 incidents.

See shared-patterns/incident-severity.md for severity definitions.

Incident Response Phases

| Phase | Focus | Owner |
|-------|-------|-------|
| 1. Detection | Identify and confirm incident | Monitoring/On-call |
| 2. Declaration | Assess severity, declare incident | Incident Commander |
| 3. Triage | Identify impact and initial hypothesis | Response Team |
| 4. Mitigation | Restore service, implement workaround | Engineering Team |
| 5. Resolution | Permanent fix, verification | Engineering Team |
| 6. Post-Incident | RCA, action items, documentation | Incident Commander |

Phase 1: Detection

Trigger: Alert fires or user report received.

Required Actions

1. Acknowledge alert within SLA (see severity matrix)
2. Initial assessment:
   - What is the symptom?
   - What is affected?
   - When did it start?
3. Check for related alerts: is this isolated or part of a larger issue?

Detection Checklist

[ ] Alert acknowledged in monitoring system
[ ] Initial symptom documented
[ ] Related alerts checked
[ ] Recent deployments checked
[ ] Known issue list checked

Phase 2: Declaration

Owner: First responder declares incident, assigns severity.

Severity Assignment

| Criteria | SEV1 | SEV2 | SEV3 |
|----------|------|------|------|
| Complete outage | X | | |
| Data loss risk | X | | |
| >50% users affected | | X | |
| <50% users affected | | | X |
| Workaround available | | | X |

See shared-patterns/incident-severity.md for complete definitions.
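The severity table above can be read as a small decision function. The following is a minimal sketch, not the definition in shared-patterns/incident-severity.md: the `ImpactAssessment` type and `assign_severity` function are hypothetical names, and two details are assumptions because the table does not settle them. The highest matching severity wins, and exactly 50% of users affected falls through to SEV3. The "Workaround available" row is not modelled, since the table does not say how it combines with the other criteria.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ImpactAssessment:
    """Hypothetical container for the criteria in the severity table."""
    complete_outage: bool = False
    data_loss_risk: bool = False
    users_affected_pct: float = 0.0  # percentage, 0-100


def assign_severity(impact: ImpactAssessment) -> str:
    """Return the highest severity whose criteria are met (assumed tie-break)."""
    if impact.complete_outage or impact.data_loss_risk:
        return "SEV1"
    if impact.users_affected_pct > 50:
        return "SEV2"
    # Everything else, including exactly 50% (assumption), is treated as SEV3.
    return "SEV3"


print(assign_severity(ImpactAssessment(users_affected_pct=60)))
```

When in doubt between two levels, the authoritative definitions in the shared severity document take precedence over this sketch.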

Declaration Actions

1. Create incident channel (if SEV1/SEV2):
   - Format: #incident-YYYY-MM-DD-brief-description
   - Post initial summary
2. Assign Incident Commander (IC):
   - SEV1: Senior on-call or escalate to manager
   - SEV2/SEV3: Primary on-call
3. Update status page (if customer-facing):
   - Acknowledge incident
   - Set appropriate severity
   - Estimated update time
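The channel naming format above is easy to get inconsistent when typed by hand during an incident. Here is a minimal sketch of a helper that builds the name; `incident_channel_name` and `needs_incident_channel` are hypothetical names, and the slug rule (lowercase, runs of non-alphanumerics collapsed to a single hyphen) is an assumption, since the source only says "brief-description".

```python
import re
from datetime import date


def needs_incident_channel(severity: str) -> bool:
    """A dedicated channel is created for SEV1 and SEV2 only."""
    return severity in ("SEV1", "SEV2")


def incident_channel_name(day: date, description: str) -> str:
    """Build '#incident-YYYY-MM-DD-brief-description' from a free-text title."""
    # Assumed slug rule: lowercase, non-alphanumeric runs become one hyphen.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"


print(incident_channel_name(date(2024, 3, 5), "Checkout API 5xx errors"))
# prints "#incident-2024-03-05-checkout-api-5xx-errors"
```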

Declaration Template

**INCIDENT DECLARED**

**Severity:** SEV[1/2/3]
**Title:** [Brief description]
**Incident Commander:** @[name]
**Channel:** #incident-[date]-[slug]

**Impact:**
- Services affected: [list]
- Users affected: [count/percentage]
- Started: [timestamp UTC]

**Current Status:**
[Brief description of current state]

**Next Update:** [timestamp]

Phase 3: Triage

Owner: Incident Commander coordinates, engineering investigates.

Triage Questions (5 Whys Approach)

1. What is the exact symptom?
2. What changed recently? (deployments, config, traffic)
3. What is the blast radius?
4. What is the root cause hypothesis?
5. What is the quickest path to mitigation?

Triage Checklist

[ ] Service dependencies mapped
[ ] Recent changes identified
[ ] Error patterns analyzed
[ ] Resource utilization checked
[ ] Initial hypothesis formed

Communication During Triage

Update frequency by severity:

| Severity | Internal Update | External Update |
|----------|-----------------|-----------------|
| SEV1 | Every 10 min | Every 15 min |
| SEV2 | Every 15 min | Every 30 min |
| SEV3 | Every 30 min | As needed |
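The update cadence above can drive the "Next Update" field in status posts. The sketch below encodes the intervals as data and computes when the next update is due; `UPDATE_INTERVALS` and `next_update_due` are hypothetical names, and "As needed" is represented as `None` (an assumption about how to model a cadence with no fixed interval).

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Intervals from the update-frequency table; None models "As needed".
UPDATE_INTERVALS = {
    "SEV1": {"internal": timedelta(minutes=10), "external": timedelta(minutes=15)},
    "SEV2": {"internal": timedelta(minutes=15), "external": timedelta(minutes=30)},
    "SEV3": {"internal": timedelta(minutes=30), "external": None},
}


def next_update_due(severity: str, audience: str, last_update: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None if there is no fixed cadence."""
    interval = UPDATE_INTERVALS[severity][audience]
    return None if interval is None else last_update + interval


last = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(next_update_due("SEV1", "internal", last))
```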

Phase 4: Mitigation

Owner: Engineering implements fix, IC coordinates.

Mitigation Options (in order of preference)

1. Rollback - If recent deployment caused issue
2. Scale - If capacity related
3. Restart - If state corruption
4. Failover - If regional/AZ issue
5. Feature disable - If specific feature causes issue
6. Hotfix - If rollback not possible
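Because the options are listed in order of preference, selection amounts to taking the first option whose condition holds. A minimal sketch follows, assuming the triage findings are available as a set of condition labels; the labels and the `choose_mitigation` function are hypothetical, not part of this skill.

```python
from typing import Optional, Set

# (option, condition) pairs in the stated order of preference.
# The condition labels are illustrative names for the "If ..." clauses.
MITIGATION_PREFERENCE = [
    ("rollback", "recent_deployment"),
    ("scale", "capacity"),
    ("restart", "state_corruption"),
    ("failover", "regional_or_az"),
    ("feature_disable", "specific_feature"),
    ("hotfix", "rollback_not_possible"),
]


def choose_mitigation(conditions: Set[str]) -> Optional[str]:
    """Return the most preferred option whose condition was observed."""
    for option, condition in MITIGATION_PREFERENCE:
        if condition in conditions:
            return option
    return None  # nothing matched; keep triaging


print(choose_mitigation({"capacity", "recent_deployment"}))  # prints "rollback"
```

The result is only a starting point: the mitigation checklist still requires recording the rationale for the chosen option.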

Mitigation Checklist

[ ] Mitigation option selected with rationale
[ ] Change approved (SEV1: skip formal approval; document the change afterward)
[ ] Implementation tracked in incident channel
[ ] Verification criteria defined
[ ] Rollback plan ready

Mitigation Template

**MITIGATION IN PROGRESS**

**Action:** [description]
**Owner:** @[name]
**Started:** [timestamp]

**Verification:**
- [ ] [criterion 1]
- [ ] [criterion 2]

**Rollback Plan:**
[If mitigation fails, do X]

Phase 5: Resolution

Owner: Engineering confirms fix, IC verifies resolution.

Resolution Criteria

ALL must be true before marking resolved:

1. Primary symptom resolved - Users no longer affected
2. Monitoring confirms - Metrics returned to baseline
3. No related alerts - All triggered alerts cleared
4. Verification period passed - 15 min stability for SEV1/2
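The "ALL must be true" rule above maps directly onto a single boolean gate. The sketch below is illustrative only: `ResolutionState` and `can_mark_resolved` are hypothetical names, and because the source states the 15-minute verification period only for SEV1/2, the sketch applies no stability window to SEV3 (an assumption).

```python
from dataclasses import dataclass
from datetime import timedelta

VERIFICATION_PERIOD = timedelta(minutes=15)  # stated for SEV1/SEV2


@dataclass
class ResolutionState:
    """Hypothetical snapshot of the four resolution criteria."""
    severity: str
    symptom_resolved: bool
    metrics_at_baseline: bool
    open_related_alerts: int
    stable_for: timedelta


def can_mark_resolved(state: ResolutionState) -> bool:
    """True only when every resolution criterion holds."""
    checks = [
        state.symptom_resolved,
        state.metrics_at_baseline,
        state.open_related_alerts == 0,
    ]
    if state.severity in ("SEV1", "SEV2"):
        checks.append(state.stable_for >= VERIFICATION_PERIOD)
    return all(checks)


early = ResolutionState("SEV1", True, True, 0, timedelta(minutes=10))
print(can_mark_resolved(early))  # prints "False": verification period not yet passed
```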

Resolution Checklist

[ ] Primary symptom verified resolved
[ ] Metrics returned to normal
[ ] All related alerts resolved
[ ] Verification period completed
[ ] Customer communication sent (if applicable)
[ ] Status page updated to resolved

Resolution Template

**INCIDENT RESOLVED**

**Duration:** [X hours Y minutes]
**Resolution Time:** [timestamp UTC]

**Root Cause:**
[Brief description of what caused the incident]

**Fix Applied:**
[What was done to resolve]

**Next Steps:**
- [ ] RCA scheduled for [date]
- [ ] Action items tracked in [location]

**Retrospective:** [date/time]
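The Duration field in the template above is simple arithmetic on the start and resolution timestamps. A minimal sketch, assuming both timestamps are timezone-aware UTC values; `format_duration` is a hypothetical helper name, and partial minutes are rounded down (an assumption).

```python
from datetime import datetime, timezone


def format_duration(started: datetime, resolved: datetime) -> str:
    """Render elapsed time in the template's 'X hours Y minutes' form."""
    total_minutes = int((resolved - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"


started = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
resolved = datetime(2024, 1, 1, 14, 37, tzinfo=timezone.utc)
print(format_duration(started, resolved))  # prints "2 hours 37 minutes"
```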

Phase 6: Post-Incident

Owner: Incident Commander schedules RCA, tracks action items.

RCA Requirements

| Severity | RCA Required | Timeline |
|----------|--------------|----------|
| SEV1 | MANDATORY | 48 hours |
| SEV2 | MANDATORY | 1 week |
| SEV3 | Optional | 2 weeks |
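The RCA timelines above can be turned into a concrete due date when the incident is resolved. This is a sketch under one stated assumption: the source does not say when the clock starts, so the window is counted from the resolution timestamp. `RCA_POLICY` and `rca_deadline` are hypothetical names.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# (mandatory?, window) per severity, taken from the RCA requirements table.
RCA_POLICY = {
    "SEV1": (True, timedelta(hours=48)),
    "SEV2": (True, timedelta(weeks=1)),
    "SEV3": (False, timedelta(weeks=2)),
}


def rca_deadline(severity: str, resolved_at: datetime) -> Tuple[bool, datetime]:
    """Return whether the RCA is mandatory and when it is due.

    Assumes the window starts at resolution time.
    """
    mandatory, window = RCA_POLICY[severity]
    return mandatory, resolved_at + window


resolved_at = datetime(2024, 1, 1, 12, 0, tzinfo=timezone.utc)
print(rca_deadline("SEV1", resolved_at))
```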

RCA Template

# Incident Post-Mortem: [Title]

**Incident ID:** INC-YYYY-NNNN
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV[1/2/3]
**Author:** @[incident commander]

## Summary
[2-3 sentence summary of what happened]

## Impact
- **Users Affected:** [count/percentage]
- **Revenue Impact:** [if applicable]
- **SLA Impact:** [if applicable]

## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | [event] |

## Root Cause
[Technical description of the root cause]

## Contributing Factors
1. [Factor 1]
2. [Factor 2]

## What Went Well
1. [Item 1]
2. [Item 2]

## What Could Be Improved
1. [Item 1]
2. [Item 2]

## Action Items
| Item | Owner | Due Date | Status |
|------|-------|----------|--------|
| [action] | @[name] | YYYY-MM-DD | Open |

## Lessons Learned
[Key takeaways for the team]

Post-Incident Checklist

[ ] RCA document created
[ ] Blameless retrospective held
[ ] Action items assigned and tracked
[ ] Runbook updated (if applicable)
[ ] Monitoring improved (if gaps found)
[ ] Incident documented in knowledge base

Anti-Rationalization Table

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "Document later, fix first" | Memory fades in hours | Document AS you fix |
| "Small incident, skip RCA" | Small incidents reveal systemic issues | RCA for SEV1/SEV2 minimum |
| "Root cause is obvious" | Obvious != correct | Investigate with data |
| "Skip verification period" | Premature resolution = reopen | Wait full verification period |

Pressure Resistance

| User Says | Your Response |
|-----------|---------------|
| "Mark resolved now, verify later" | "Cannot mark resolved until verification complete. This prevents reopened incidents." |
| "Skip the RCA, we know what happened" | "RCA is mandatory for this severity. Schedule within required timeline." |
| "No time for documentation" | "Real-time documentation takes 30 seconds per event. Memory loss causes worse rework." |

Dispatch Specialist

For complex incidents, dispatch the incident-responder agent:

Task tool:
  subagent_type: "ring:incident-responder"
  prompt: |
    INCIDENT: [description]
    SEVERITY: SEV[X]
    CURRENT STATUS: [state]
    REQUEST: [specific assistance needed]
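If the dispatch prompt is assembled programmatically, the four fields above can be filled from incident data. The sketch below only builds the prompt text; `build_dispatch_prompt` is a hypothetical helper, and how the resulting string is passed to the Task tool is outside its scope.

```python
def build_dispatch_prompt(description: str, severity: str, status: str, request: str) -> str:
    """Fill the four-field prompt used when dispatching the incident-responder agent."""
    return "\n".join([
        f"INCIDENT: {description}",
        f"SEVERITY: {severity}",
        f"CURRENT STATUS: {status}",
        f"REQUEST: {request}",
    ])


print(build_dispatch_prompt(
    "Checkout API returning 5xx",
    "SEV2",
    "Rollback in progress",
    "Help correlate errors with the latest deployment",
))
```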

