assets/skills/fix/incident/SKILL.md
Incident response workflow for handling production issues. Use when there's an incident, outage, production down, or when user says "incident", "outage", "production down", "service down", "emergency".
npx skillsauth add phuthuycoding/moicle fix-incident

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Structured workflow for handling production incidents from triage to postmortem. Speed > completeness during P1/P2 — mitigation first, root cause after.
/fix-hotfix
/fix-root-cause

TRIAGE → INVESTIGATE → MITIGATE → RESOLVE → POSTMORTEM
  ↓           ↓           ↓          ↓          ↑
comms       comms       comms      comms      learn
Before responding:
- ~/.claude/architecture/ddd-architecture.md
- docs/runbooks/ or wiki)

Use ~/.claude/architecture/_shared/severity-levels.md (incident severity table — P1-P4 with response times).
Decide UP if any apply:
Otherwise: P3.
Page or escalate when:
| Trigger | Action |
|---------|--------|
| P1 on customer-facing system | Page CTO / engineering lead within 15 min |
| P1 + data loss / breach | Page legal + security lead immediately |
| P2 lasting >2 hours without mitigation | Page engineering lead, consider promoting to P1 |
| P1/P2 spanning >1 shift | Schedule formal handoff (see Phase 1.5) |
| Customer-facing comms required | Notify support + comms lead within 30 min |
| Public disclosure may be needed (security) | Notify legal + comms + leadership |
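The escalation triggers above can be encoded as plain rules so a chat bot or script can suggest who to page. This is a minimal sketch, not part of the skill itself: the `Incident` record, its field names, and `escalation_actions` are hypothetical names introduced for illustration.

```python
from dataclasses import dataclass


@dataclass
class Incident:
    """Hypothetical incident record; field names are illustrative assumptions."""
    severity: str                          # "P1" .. "P4"
    customer_facing: bool = False
    data_loss_or_breach: bool = False
    hours_without_mitigation: float = 0.0
    spans_multiple_shifts: bool = False
    customer_comms_required: bool = False
    public_disclosure_possible: bool = False


def escalation_actions(inc: Incident) -> list[str]:
    """Return every action from the trigger table that applies to the incident."""
    actions = []
    if inc.severity == "P1" and inc.customer_facing:
        actions.append("Page CTO / engineering lead within 15 min")
    if inc.severity == "P1" and inc.data_loss_or_breach:
        actions.append("Page legal + security lead immediately")
    if inc.severity == "P2" and inc.hours_without_mitigation > 2:
        actions.append("Page engineering lead, consider promoting to P1")
    if inc.severity in ("P1", "P2") and inc.spans_multiple_shifts:
        actions.append("Schedule formal handoff (see Phase 1.5)")
    if inc.customer_comms_required:
        actions.append("Notify support + comms lead within 30 min")
    if inc.public_disclosure_possible:
        actions.append("Notify legal + comms + leadership")
    return actions
```

Several triggers can fire at once, so the function returns a list rather than a single action; for example, a customer-facing P1 with suspected data loss yields both paging actions.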
Goal: classify severity, assemble response, start comms.
#incident-{date}-{short-name})

[INCIDENT] {severity} {service} — {one-line impact}
Detected: {timestamp}
Status: investigating
Channel: #incident-{slug}
Next update: {time}
For incidents spanning >1 shift, do a formal handoff. Skip if same team handles end-to-end.
## Incident Handoff: {service} {date}
**From:** {outgoing IC} → **To:** {incoming IC}
**Channel:** #incident-{slug}
**Severity:** {current}
### Current state
- Impact: {what users see right now}
- Mitigations applied: {list with timestamps}
- What worked: {list}
- What didn't: {list}
### Active investigations
- {open question + who was looking into it}
### Active hypotheses
- {hypothesis + evidence for/against}
### Open decisions
- {decision needed + options + recommended choice}
### Next steps
- {action + owner + ETA}
### People paged / informed
- {list with role + when paged}
### Pending comms
- {next update due at HH:MM}
- {customer-facing status due}
Live walkthrough (5-10 min): outgoing IC walks incoming IC through the timeline + open questions in the incident channel. Both confirm understanding before handoff completes.
Goal: find what's broken — don't fix yet, identify it.
| Symptom | Likely cause |
|---------|-------------|
| Errors only on new code path | Bad deploy → rollback |
| Errors across all paths | Infra (DB, cache, network) |
| Slow but not erroring | Resource saturation / external API slowdown |
| Intermittent | Cache, race condition, flapping dependency |
| Started exactly on a cron tick | Scheduled job |
Goal: stop the bleeding. Don't fix root cause yet.
| Option | When | Reversibility |
|--------|------|---------------|
| Rollback deploy | Recent deploy is suspect | Easy |
| Feature flag off | Bad feature behind flag | Easy |
| Scale up | Resource saturation | Easy |
| Failover | Region / instance down | Medium |
| Block traffic | Bot / DDoS | Medium |
| DB restore | Data corruption | Hard — last resort |
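When more than one mitigation looks applicable, one reasonable reading of the table is to try the most easily reversed option first, which the "last resort" note on DB restore suggests. The sketch below assumes that ordering policy; `MITIGATIONS`, `REVERSIBILITY_ORDER`, and `rank_mitigations` are hypothetical names, not something the skill defines.

```python
# Rows copied from the mitigation table: (option, reversibility).
MITIGATIONS = [
    ("Rollback deploy", "Easy"),
    ("Feature flag off", "Easy"),
    ("Scale up", "Easy"),
    ("Failover", "Medium"),
    ("Block traffic", "Medium"),
    ("DB restore", "Hard"),
]

# Assumed policy: lower rank means try it earlier.
REVERSIBILITY_ORDER = {"Easy": 0, "Medium": 1, "Hard": 2}


def rank_mitigations(applicable: set[str]) -> list[str]:
    """Order the applicable options from easiest to hardest to reverse.

    list.sort is stable, so options with equal reversibility keep table order.
    """
    candidates = [row for row in MITIGATIONS if row[0] in applicable]
    candidates.sort(key=lambda row: REVERSIBILITY_ORDER[row[1]])
    return [option for option, _ in candidates]
```

Calling `rank_mitigations({"DB restore", "Failover", "Scale up"})` puts scaling up first and the restore last, so the hard-to-undo step is only reached after the cheaper ones have been considered.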
Goal: apply the permanent fix and verify.
/fix-root-cause workflow if not obvious)

Goal: learn from this so it doesn't happen again. Blameless.
# Postmortem: {service} {date}
## Summary
{1-paragraph: what happened, impact, duration}
## Impact
- Users affected: {N or %}
- Duration: {detected → mitigated → resolved}
- Revenue / SLA impact: {if applicable}
## Timeline (UTC)
| Time | Event |
|------|-------|
| HH:MM | Deploy of {sha} |
| HH:MM | First alert fired |
| HH:MM | IC paged |
| HH:MM | Mitigation: rollback to {sha} |
| HH:MM | Error rate normal |
| HH:MM | Permanent fix deployed |
## Root Cause
{Technical cause — what code / config / data caused this}
## Contributing Factors
- {Factor 1, e.g., test gap}
- {Factor 2, e.g., missing alert}
## What Went Well
- {e.g., fast detection, clear rollback path}
## What Didn't Go Well
- {e.g., took 20 min to identify suspect deploy}
## Action Items
| # | Action | Owner | Due | Severity |
|---|--------|-------|-----|----------|
| 1 | {add alert for X} | @alice | {date} | high |
| 2 | {add regression test} | @bob | {date} | medium |
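The "Duration" line in the postmortem template (detected → mitigated → resolved) is simple timestamp arithmetic over the timeline. A minimal sketch, assuming timestamps are recorded in UTC as `YYYY-MM-DD HH:MM` strings; the `incident_durations` helper and that string format are illustrative assumptions, not part of the template.

```python
from datetime import datetime

# Assumed timestamp format for timeline entries (UTC, minute resolution).
TIMESTAMP_FORMAT = "%Y-%m-%d %H:%M"


def incident_durations(detected: str, mitigated: str, resolved: str) -> dict[str, int]:
    """Return minutes from detection to mitigation and from detection to resolution."""
    t_detected, t_mitigated, t_resolved = (
        datetime.strptime(stamp, TIMESTAMP_FORMAT)
        for stamp in (detected, mitigated, resolved)
    )
    return {
        "time_to_mitigate_min": int((t_mitigated - t_detected).total_seconds() // 60),
        "time_to_resolve_min": int((t_resolved - t_detected).total_seconds() // 60),
    }
```

For an incident detected at 10:00, mitigated at 10:45, and resolved at 14:30 on the same day, this gives 45 minutes to mitigate and 270 minutes to resolve.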
| Severity | Update cadence | Channels |
|----------|---------------|----------|
| P1 | Every 15 min while investigating; every 30 min after mitigation | Incident channel + status page + customer email if >30 min outage |
| P2 | Every 30 min while investigating; hourly after mitigation | Incident channel + status page |
| P3 | Hourly | Incident channel |
| P4 | At start, on mitigation, on resolution | Incident channel |
If nothing new to report, post anyway: "No update — still investigating, next update at HH:MM." Silence creates panic.
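The cadence table maps directly onto a lookup keyed by severity and mitigation state, which makes it easy to compute the "next update at HH:MM" value for each post. A minimal sketch; `CADENCE_MINUTES` and `next_update` are hypothetical names, and P4 returns `None` because its updates are tied to events rather than a timer.

```python
from datetime import datetime, timedelta, timezone
from typing import Optional

# Minutes between updates, keyed by (severity, mitigated), from the cadence table.
CADENCE_MINUTES = {
    ("P1", False): 15,
    ("P1", True): 30,
    ("P2", False): 30,
    ("P2", True): 60,
    ("P3", False): 60,   # P3 is hourly regardless of mitigation state
    ("P3", True): 60,
}


def next_update(severity: str, mitigated: bool, now: datetime) -> Optional[datetime]:
    """Return when the next update is due, or None for event-driven P4 incidents."""
    if severity == "P4":
        return None
    return now + timedelta(minutes=CADENCE_MINUTES[(severity, mitigated)])


if __name__ == "__main__":
    due = next_update("P1", False, datetime.now(timezone.utc))
    print(f"Next update: {due:%H:%M} UTC")
```

Computing the time up front keeps the promise made in each post honest, including the "no update" posts.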
Status update
[UPDATE {time} UTC] {severity} {service}
Impact: {what users see — specific, not "issues"}
Action: {what we're doing right now}
ETA mitigation: {time or "unknown"}
Next update: {time}
Mitigation applied
[MITIGATED {time} UTC] {severity} {service}
Impact: {reduced from X to Y}
Mitigation: {what we did}
Permanent fix: {ETA}
Next update: {time}
Resolved
[RESOLVED {time} UTC] {service}
Total duration: {detected → resolved}
Root cause: {one line}
Mitigation: {what we did}
Permanent fix: {deployed at HH:MM | scheduled by DATE}
Postmortem: {date} — link to come
| When | Use |
|------|-----|
| Bug reported but no active incident | /fix-hotfix |
| Root cause unclear, need deep investigation | /fix-root-cause |
| Need to roll back deployment | (use ops runbook, not a skill) |
| Documenting postmortem as a doc | /docs-write |
| Phase | Agent | Purpose |
|-------|-------|---------|
| TRIAGE | @devops | Infra + monitoring context |
| INVESTIGATE | Stack-specific dev agent | Code-level debugging |
| INVESTIGATE | @security-audit | If suspected breach / leak |
| INVESTIGATE | @db-designer | If DB / data issue |
| MITIGATE | @devops | Rollback / scale / failover |
| RESOLVE | Stack-specific dev agent | Permanent fix |
| POSTMORTEM | @docs-writer | Write postmortem doc |