compliance/continuity/incident-management/SKILL.md
Implement incident management processes and escalation procedures. Configure on-call schedules and post-incident reviews. Use when managing production incidents.
npx skillsauth add bagelhole/devops-security-agent-skills incident-managementInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Implement effective incident management processes including severity definitions, escalation matrices, war room procedures, and blameless post-mortem templates.
severity_definitions:
SEV1_critical:
impact: "Complete service outage or data breach affecting all/most customers"
examples:
- Production site completely down
- Data breach confirmed or suspected
- Complete loss of a critical business function
- Security incident with active exploitation
response_time: "Immediate (within 5 minutes)"
update_frequency: "Every 15-30 minutes"
who_is_paged: "On-call engineer, engineering manager, incident commander, executive on-call"
communication: "Status page update, customer email, executive notification"
resolution_target: "< 1 hour to mitigate"
SEV2_major:
impact: "Major feature broken or severe degradation affecting many customers"
examples:
- Key feature completely non-functional
- Significant performance degradation (>5x latency)
- Data processing pipeline completely stalled
- Partial outage affecting a region or segment
response_time: "Within 15 minutes"
update_frequency: "Every 30-60 minutes"
who_is_paged: "On-call engineer, engineering manager"
communication: "Status page update if customer-facing"
resolution_target: "< 4 hours to mitigate"
SEV3_moderate:
impact: "Minor feature impaired or degradation affecting some customers"
examples:
- Non-critical feature broken
- Moderate performance degradation
- Elevated error rate (below threshold for SEV2)
- Single-customer impact on non-critical function
response_time: "Within 1 hour during business hours"
update_frequency: "Every 2-4 hours"
who_is_paged: "On-call engineer"
communication: "Internal only unless customer inquires"
resolution_target: "< 1 business day"
SEV4_low:
impact: "Cosmetic issue, minor inconvenience, or non-customer-facing problem"
examples:
- UI cosmetic bug
- Non-critical monitoring gap
- Internal tool degradation
- Documentation inaccuracy in production
response_time: "Next business day"
update_frequency: "As needed"
who_is_paged: "None (ticket created)"
communication: "None"
resolution_target: "Within sprint planning cycle"
escalation_matrix:
tier_1_on_call_engineer:
reached_via: "PagerDuty / OpsGenie alert"
responsibilities:
- Acknowledge alert within 5 minutes
- Assess severity and impact
- Begin troubleshooting
- Escalate to Tier 2 if unable to resolve within 30 minutes (SEV1/2)
escalation_trigger: "Cannot resolve, needs additional expertise, or severity upgrade"
tier_2_team_lead_or_sme:
reached_via: "PagerDuty escalation or direct page"
responsibilities:
- Provide subject matter expertise
- Assist with diagnosis and resolution
- Coordinate with other teams if cross-service issue
- Escalate to Tier 3 if broader coordination needed
escalation_trigger: "Multi-service issue, needs executive decision, or customer-facing SEV1"
tier_3_engineering_management:
reached_via: "PagerDuty escalation or direct call"
responsibilities:
- Assign incident commander (if not already)
- Allocate additional resources
- Make business decisions (feature disable, rollback, etc.)
- Coordinate external communication
escalation_trigger: "Business impact decision, extended outage, or PR/legal concern"
tier_4_executive:
reached_via: "Direct phone call"
responsibilities:
- Authorize extraordinary measures
- Manage board/investor communication
- Approve public statements
- Engage external resources (vendors, consultants)
escalation_trigger: "Major breach, extended SEV1, regulatory or legal implication"
time_based_escalation:
sev1:
"15 min no ack": "Re-page on-call + backup on-call"
"30 min unresolved": "Page team lead"
"1 hour unresolved": "Page engineering manager + executive on-call"
"2 hours unresolved": "All-hands engineering involvement"
sev2:
"30 min no ack": "Re-page on-call + backup on-call"
"1 hour unresolved": "Page team lead"
"4 hours unresolved": "Page engineering manager"
war_room:
activation: "Automatically for SEV1, on-demand for SEV2"
setup:
communication_channel:
primary: "Dedicated Slack channel (#incident-YYYY-MM-DD-brief-name)"
voice: "Zoom/Google Meet bridge (persistent link)"
backup: "Phone conference bridge"
channel_rules:
- "Only incident-related communication in the channel"
- "Use threads for side discussions"
- "Prefix messages with role (IC:, COMMS:, ENG:)"
roles:
incident_commander:
responsibilities:
- Own the incident from declaration to resolution
- Coordinate all response activities
- Make decisions on response actions
- Assign tasks to responders
- Determine when incident is resolved
- Schedule post-mortem
selection: "On-call IC roster, or senior engineer who declares the incident"
communications_lead:
responsibilities:
- Draft and publish status page updates
- Coordinate customer notifications
- Handle internal stakeholder updates
- Manage executive communication
- Document timeline in real-time
selection: "Designated from on-call comms roster or engineering manager"
technical_lead:
responsibilities:
- Lead technical diagnosis and troubleshooting
- Coordinate technical responders
- Recommend mitigation and resolution actions
- Verify fix effectiveness
selection: "Senior engineer with relevant system expertise"
scribe:
responsibilities:
- Document all actions, decisions, and findings
- Maintain real-time timeline
- Record who did what and when
- Capture screenshots and log excerpts
selection: "Any available team member (can be rotated)"
workflow:
1_declare:
- "IC declares incident with severity level"
- "War room channel and bridge created"
- "Roles assigned"
- "First status update posted"
2_assess:
- "Determine scope and customer impact"
- "Identify affected systems and services"
- "Establish working hypothesis"
3_mitigate:
- "Focus on restoring service first, root cause second"
- "IC approves all changes to production"
- "Changes documented in real-time"
- "Rollback if mitigation makes things worse"
4_resolve:
- "Confirm service restored to normal"
- "Verify monitoring shows healthy metrics"
- "IC declares incident resolved"
- "Final status page update"
5_follow_up:
- "Schedule post-mortem within 48 hours"
- "Assign action items from immediate findings"
- "Send internal summary"
on_call_schedule:
rotation_structure:
primary:
rotation: "Weekly"
handoff: "Monday 10:00 AM local time"
team_size: "Minimum 5 engineers in rotation"
secondary:
rotation: "Weekly (offset from primary)"
activation: "If primary does not acknowledge within 10 minutes"
expectations:
response_time: "Acknowledge alert within 5 minutes"
availability: "Reachable by phone and laptop within 15 minutes"
handoff: "Document any ongoing issues during handoff"
compensation: "Per company on-call compensation policy"
health:
max_consecutive_weeks: 2
minimum_gap_between_rotations: "2 weeks"
post_incident_rest: "If engaged for 4+ hours overnight, late start next day"
burnout_monitoring: "Track pages per person per week, rebalance if needed"
pagerduty_configuration:
escalation_policy:
- level_1:
target: "Primary on-call"
timeout: "5 minutes"
- level_2:
target: "Secondary on-call"
timeout: "10 minutes"
- level_3:
target: "Engineering manager"
timeout: "15 minutes"
notification_rules:
high_urgency:
- "Push notification immediately"
- "Phone call after 1 minute"
- "SMS after 2 minutes"
low_urgency:
- "Push notification"
- "Email after 5 minutes"
# Post-Incident Review: [Incident Title]
**Date:** YYYY-MM-DD
**Severity:** SEV[1-4]
**Duration:** [Start time] to [End time] ([X hours Y minutes])
**Incident Commander:** [Name]
**Author:** [Name]
**Status:** Draft / In Review / Final
## Executive Summary
[2-3 sentence summary of what happened, the impact, and the resolution]
## Impact
- **Customer impact:** [Number/percentage of customers affected, what they experienced]
- **Duration of impact:** [How long customers were affected]
- **Revenue impact:** [Estimated financial impact, if applicable]
- **Data impact:** [Any data loss or corruption]
- **SLA impact:** [Any SLA breaches]
## Timeline (all times UTC)
| Time | Event |
|------|-------|
| HH:MM | [First anomaly detected by monitoring] |
| HH:MM | [Alert fired / customer report received] |
| HH:MM | [On-call engineer acknowledged] |
| HH:MM | [Incident declared at SEV level] |
| HH:MM | [War room established] |
| HH:MM | [Root cause identified] |
| HH:MM | [Mitigation applied] |
| HH:MM | [Service restored] |
| HH:MM | [Incident resolved] |
## Root Cause
[Detailed technical explanation of what caused the incident]
## Detection
- **How was the incident detected?** [Monitoring alert / customer report / manual observation]
- **Time to detect:** [Time from first anomaly to detection]
- **Could we have detected sooner?** [Yes/No, with explanation]
## Response
- **What went well:**
- [List things that worked effectively during response]
- [E.g., "Runbook for database failover was accurate and followed successfully"]
- [E.g., "Communication to customers was timely and clear"]
- **What could be improved:**
- [List things that slowed or hindered response]
- [E.g., "Took 20 minutes to identify the correct service owner"]
- [E.g., "Monitoring did not alert on the specific failure mode"]
## Contributing Factors
[List all factors that contributed to the incident occurring or being worse than it could have been. This is not about blame - it is about understanding the system.]
1. [Factor 1: e.g., "Configuration change was not tested in staging"]
2. [Factor 2: e.g., "Alert threshold was too high to catch gradual degradation"]
3. [Factor 3: e.g., "No circuit breaker between Service A and Service B"]
## Action Items
| ID | Action | Owner | Priority | Due Date | Status |
|----|--------|-------|----------|----------|--------|
| 1 | [Preventive action] | [Name] | P1 | YYYY-MM-DD | Open |
| 2 | [Detection improvement] | [Name] | P2 | YYYY-MM-DD | Open |
| 3 | [Process improvement] | [Name] | P2 | YYYY-MM-DD | Open |
| 4 | [Runbook update] | [Name] | P3 | YYYY-MM-DD | Open |
## Lessons Learned
[Key takeaways that should be shared broadly]
## Appendix
- [Link to monitoring dashboards during incident]
- [Link to relevant log queries]
- [Link to war room channel archive]
post_mortem_process:
scheduling:
sev1: "Within 48 hours of resolution"
sev2: "Within 1 week of resolution"
sev3: "Within 2 weeks (optional, based on learning potential)"
sev4: "Not required"
meeting_format:
duration: "60-90 minutes"
attendees:
required: "IC, technical lead, scribe, involved engineers"
optional: "Engineering manager, product manager, affected team leads"
agenda:
- "5 min: Review timeline and facts"
- "15 min: Walk through root cause and contributing factors"
- "15 min: Discuss what went well"
- "15 min: Discuss what could be improved"
- "15 min: Define and assign action items"
- "5 min: Identify lessons learned and sharing plan"
principles:
- "Blameless: Focus on systems and processes, not individuals"
- "Factual: Base discussion on data, logs, and observations"
- "Forward-looking: Prioritize preventive actions over assigning fault"
- "Complete: Address detection, response, and prevention"
- "Actionable: Every finding should produce a tracked action item"
action_item_tracking:
- "All action items entered into issue tracker (Jira, GitHub Issues)"
- "Priority assigned based on risk reduction potential"
- "Owner assigned with due date"
- "Reviewed in team standups and sprint planning"
- "Tracked to completion"
- "Monthly review of open post-mortem action items"
incident_metrics:
mttr:
name: "Mean Time to Resolve"
definition: "Average time from incident detection to resolution"
target: "SEV1: <1h, SEV2: <4h"
trending: "Track monthly, aim for improvement"
mttd:
name: "Mean Time to Detect"
definition: "Average time from incident start to detection"
target: "< 5 minutes for SEV1/2"
trending: "Monitors effectiveness of alerting"
mtta:
name: "Mean Time to Acknowledge"
definition: "Average time from alert to engineer acknowledgment"
target: "< 5 minutes"
trending: "Monitors on-call responsiveness"
incident_frequency:
name: "Incidents per week/month by severity"
target: "Trending downward"
trending: "Monitors system reliability improvement"
action_item_completion:
name: "Post-mortem action item completion rate"
target: "> 90% completed on time"
trending: "Monitors follow-through on improvements"
recurring_incidents:
name: "Percentage of incidents with same root cause as previous incident"
target: "< 10%"
trending: "Monitors effectiveness of preventive actions"
incident_management_checklist:
process_setup:
- [ ] Severity levels defined with clear criteria
- [ ] Escalation matrix documented
- [ ] On-call schedule established and staffed
- [ ] War room procedures documented
- [ ] Post-mortem template created
- [ ] Communication templates prepared (status page, email)
- [ ] Incident management tool configured (PagerDuty, OpsGenie)
per_incident:
- [ ] Incident declared with severity level
- [ ] War room established (SEV1/2)
- [ ] Roles assigned (IC, comms, technical lead, scribe)
- [ ] Timeline maintained in real-time
- [ ] Status page updated (customer-facing impact)
- [ ] Stakeholders notified per communication plan
- [ ] Resolution verified with monitoring
- [ ] Post-mortem scheduled
- [ ] Post-mortem conducted and published
- [ ] Action items tracked to completion
compliance:
- [ ] All SEV1/2 incidents have post-mortems
- [ ] Incident log maintained for audit evidence
- [ ] Metrics reported monthly
- [ ] On-call health monitored (pages per person)
- [ ] Annual incident response training conducted
- [ ] Annual incident response plan test completed
development
Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
testing
Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.
devops
Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.
testing
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.