.archive/ops-team/skills/ops-incident-response/SKILL.md
Structured workflow for production incident management following SRE best practices. Covers incident declaration, triage, coordination, resolution, and post-mortem.
Install this skill globally with one command: `npx skillsauth add lerianstudio/ring ops-incident-response`. Works with Claude Code, Cursor, and Windsurf.
This skill defines the structured process for handling production incidents. It MUST be followed for all SEV1, SEV2, and SEV3 incidents.
See shared-patterns/incident-severity.md for severity definitions.
| Phase | Focus | Owner |
|-------|-------|-------|
| 1. Detection | Identify and confirm incident | Monitoring/On-call |
| 2. Declaration | Assess severity, declare incident | Incident Commander |
| 3. Triage | Identify impact and initial hypothesis | Response Team |
| 4. Mitigation | Restore service, implement workaround | Engineering Team |
| 5. Resolution | Permanent fix, verification | Engineering Team |
| 6. Post-Incident | RCA, action items, documentation | Incident Commander |
Trigger: Alert fires or user report received.
Owner: First responder declares incident, assigns severity.
| Criteria | SEV1 | SEV2 | SEV3 |
|----------|------|------|------|
| Complete outage | X | | |
| Data loss risk | X | | |
| >50% users affected | | X | |
| <50% users affected | | | X |
| Workaround available | | | X |
See shared-patterns/incident-severity.md for complete definitions.
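The criteria table can be read as a simple decision function. The following is a minimal sketch, not part of the skill itself: the `IncidentSignals` type and `classify_severity` name are hypothetical, and two gaps in the table are filled by assumption (exactly 50% of users affected falls to SEV3, and the "workaround available" signal is not used to downgrade a SEV1 or SEV2). The shared severity definitions remain the authority.

```python
from dataclasses import dataclass


@dataclass
class IncidentSignals:
    """Hypothetical container for the signals named in the criteria table."""
    complete_outage: bool = False
    data_loss_risk: bool = False
    users_affected_pct: float = 0.0  # percentage in the range 0-100


def classify_severity(signals: IncidentSignals) -> str:
    """Map observed signals to SEV1/SEV2/SEV3 following the criteria table."""
    # Complete outage or data loss risk is SEV1 regardless of user count.
    if signals.complete_outage or signals.data_loss_risk:
        return "SEV1"
    # More than half of users affected is SEV2.
    if signals.users_affected_pct > 50:
        return "SEV2"
    # Assumption: everything else, including exactly 50% (which the table
    # does not cover), is treated as SEV3.
    return "SEV3"
```

For example, `classify_severity(IncidentSignals(users_affected_pct=70))` yields `"SEV2"`, while the same percentage combined with `data_loss_risk=True` yields `"SEV1"`.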
Create incident channel (if SEV1/SEV2):
`#incident-YYYY-MM-DD-brief-description`
Assign Incident Commander (IC):
Update status page (if customer-facing):
**INCIDENT DECLARED**
**Severity:** SEV[1/2/3]
**Title:** [Brief description]
**Incident Commander:** @[name]
**Channel:** #incident-[date]-[slug]
**Impact:**
- Services affected: [list]
- Users affected: [count/percentage]
- Started: [timestamp UTC]
**Current Status:**
[Brief description of current state]
**Next Update:** [timestamp]
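The channel naming pattern `#incident-YYYY-MM-DD-brief-description` is easy to generate consistently. Here is a small sketch of one way to do it; the function name `incident_channel` and the slug rules (lowercase, runs of non-alphanumeric characters collapsed to a single hyphen) are assumptions, since the skill only specifies the overall shape of the name.

```python
import re
from datetime import date


def incident_channel(day: date, description: str) -> str:
    """Build a channel name shaped like #incident-YYYY-MM-DD-brief-description."""
    # Assumed slug rule: lowercase, then collapse every run of characters
    # other than a-z and 0-9 into one hyphen, trimming hyphens at the ends.
    slug = re.sub(r"[^a-z0-9]+", "-", description.lower()).strip("-")
    return f"#incident-{day.isoformat()}-{slug}"
```

Calling `incident_channel(date(2024, 3, 5), "API Gateway 5xx spike!")` returns `"#incident-2024-03-05-api-gateway-5xx-spike"`, which can be pasted into the **Channel** field of the declaration template.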
Owner: Incident Commander coordinates, engineering investigates.
Update frequency by severity:

| Severity | Internal Update | External Update |
|----------|-----------------|-----------------|
| SEV1 | Every 10 min | Every 15 min |
| SEV2 | Every 15 min | Every 30 min |
| SEV3 | Every 30 min | As needed |
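To fill in the **Next Update** field without guesswork, the update intervals can be encoded as a lookup. This is an illustrative sketch only: the names `INTERNAL_UPDATE_MINUTES`, `EXTERNAL_UPDATE_MINUTES`, and `next_update` are hypothetical, and "as needed" is modelled as `None` because the skill gives no fixed interval for it.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Intervals in minutes, taken from the update frequency table.
INTERNAL_UPDATE_MINUTES = {"SEV1": 10, "SEV2": 15, "SEV3": 30}
# SEV3 external updates are "as needed", represented here as None.
EXTERNAL_UPDATE_MINUTES = {"SEV1": 15, "SEV2": 30, "SEV3": None}


def next_update(severity: str, last_update: datetime,
                external: bool = False) -> Optional[datetime]:
    """Return when the next update is due, or None if there is no fixed cadence."""
    table = EXTERNAL_UPDATE_MINUTES if external else INTERNAL_UPDATE_MINUTES
    minutes = table[severity]  # raises KeyError for an unknown severity
    if minutes is None:
        return None
    return last_update + timedelta(minutes=minutes)
```

Passing timezone-aware UTC timestamps keeps the result consistent with the "timestamp UTC" convention used in the templates.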
Owner: Engineering implements fix, IC coordinates.
**MITIGATION IN PROGRESS**
**Action:** [description]
**Owner:** @[name]
**Started:** [timestamp]
**Verification:**
- [ ] [criterion 1]
- [ ] [criterion 2]
**Rollback Plan:**
[If mitigation fails, do X]
Owner: Engineering confirms fix, IC verifies resolution.
ALL must be true before marking resolved:
**INCIDENT RESOLVED**
**Duration:** [X hours Y minutes]
**Resolution Time:** [timestamp UTC]
**Root Cause:**
[Brief description of what caused the incident]
**Fix Applied:**
[What was done to resolve]
**Next Steps:**
- [ ] RCA scheduled for [date]
- [ ] Action items tracked in [location]
**Retrospective:** [date/time]
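The **Duration** field in the resolution template uses the form "X hours Y minutes". A minimal sketch of computing it from the start and resolution timestamps follows; the helper name `format_duration` is hypothetical, and rounding down to whole minutes is an assumption.

```python
from datetime import datetime, timezone


def format_duration(started: datetime, resolved: datetime) -> str:
    """Render elapsed incident time as 'X hours Y minutes'."""
    if resolved < started:
        raise ValueError("resolved timestamp precedes start timestamp")
    # Assumption: partial minutes are rounded down.
    total_minutes = int((resolved - started).total_seconds() // 60)
    hours, minutes = divmod(total_minutes, 60)
    return f"{hours} hours {minutes} minutes"
```

For an incident that started at 09:15 UTC and resolved at 11:47 UTC on the same day, this returns `"2 hours 32 minutes"`.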
Owner: Incident Commander schedules RCA, tracks action items.
| Severity | RCA Required | Timeline |
|----------|--------------|----------|
| SEV1 | MANDATORY | 48 hours |
| SEV2 | MANDATORY | 1 week |
| SEV3 | Optional | 2 weeks |
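The RCA timelines can be turned into concrete due dates when scheduling the retrospective. The sketch below is illustrative: `RCA_POLICY` and `rca_deadline` are hypothetical names, and it assumes the clock starts at the resolution timestamp, which the skill does not state explicitly.

```python
from datetime import datetime, timedelta, timezone
from typing import Tuple

# (RCA mandatory?, time allowed) per severity, from the RCA requirements table.
RCA_POLICY = {
    "SEV1": (True, timedelta(hours=48)),
    "SEV2": (True, timedelta(weeks=1)),
    "SEV3": (False, timedelta(weeks=2)),
}


def rca_deadline(severity: str, resolved_at: datetime) -> Tuple[bool, datetime]:
    """Return whether an RCA is mandatory and the date it is due by."""
    required, window = RCA_POLICY[severity]
    # Assumption: the timeline is measured from the moment of resolution.
    return required, resolved_at + window
```

A SEV1 resolved at 2024-01-01 12:00 UTC would give `(True, 2024-01-03 12:00 UTC)`; for SEV3 the first element is `False`, signalling that the RCA is optional.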
# Incident Post-Mortem: [Title]
**Incident ID:** INC-YYYY-NNNN
**Date:** YYYY-MM-DD
**Duration:** X hours Y minutes
**Severity:** SEV[1/2/3]
**Author:** @[incident commander]
## Summary
[2-3 sentence summary of what happened]
## Impact
- **Users Affected:** [count/percentage]
- **Revenue Impact:** [if applicable]
- **SLA Impact:** [if applicable]
## Timeline
| Time (UTC) | Event |
|------------|-------|
| HH:MM | [event] |
## Root Cause
[Technical description of the root cause]
## Contributing Factors
1. [Factor 1]
2. [Factor 2]
## What Went Well
1. [Item 1]
2. [Item 2]
## What Could Be Improved
1. [Item 1]
2. [Item 2]
## Action Items
| Item | Owner | Due Date | Status |
|------|-------|----------|--------|
| [action] | @[name] | YYYY-MM-DD | Open |
## Lessons Learned
[Key takeaways for the team]
| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "Document later, fix first" | Memory fades in hours | Document AS you fix |
| "Small incident, skip RCA" | Small incidents reveal systemic issues | RCA for SEV1/SEV2 minimum |
| "Root cause is obvious" | Obvious != correct | Investigate with data |
| "Skip verification period" | Premature resolution = reopen | Wait full verification period |

| User Says | Your Response |
|-----------|---------------|
| "Mark resolved now, verify later" | "Cannot mark resolved until verification complete. This prevents reopened incidents." |
| "Skip the RCA, we know what happened" | "RCA is mandatory for this severity. Schedule within required timeline." |
| "No time for documentation" | "Real-time documentation takes 30 seconds per event. Memory loss causes worse rework." |
For complex incidents, dispatch the incident-responder agent:
Task tool:
  subagent_type: "ring:incident-responder"
  prompt: |
    INCIDENT: [description]
    SEVERITY: SEV[X]
    CURRENT STATUS: [state]
    REQUEST: [specific assistance needed]