plugins/devops-sre/skills/incident-response/root-cause-analysis/SKILL.md
Systematic root cause analysis using 5 Whys, fishbone diagrams, and fault tree analysis. Use this skill when investigating why an incident happened, performing RCA, or writing postmortems. Activate when: root cause, why did this happen, 5 whys, incident analysis, postmortem investigation, how did this happen, what caused, failure analysis.
Find the real cause, not just the symptoms, to prevent recurrence.
Keep asking "why" until you reach an actionable root cause.
Problem: API returned 500 errors for 45 minutes
Why #1: Why did the API return 500 errors?
→ The database connection pool was exhausted
Why #2: Why was the connection pool exhausted?
→ Connections weren't being released after queries
Why #3: Why weren't connections being released?
→ A code change introduced a bug that skipped connection.close()
Why #4: Why wasn't this caught before production?
→ Our integration tests don't check for connection leaks
Why #5: Why don't integration tests check for connection leaks?
→ We haven't implemented connection pool monitoring in tests
ROOT CAUSE: Missing connection leak detection in test suite
ACTION: Add connection pool assertions to integration tests
| Do | Don't |
|----|-------|
| Use data, not assumptions | Stop at "human error" |
| Consider multiple branches | Accept vague answers |
| Verify each "because" | Skip to conclusions |
| Look for systemic issues | Blame individuals |
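The action from the 5 Whys example ("Add connection pool assertions to integration tests") could look roughly like the sketch below. `FakePool`, `assert_no_leak`, `good_query`, and `leaky_query` are hypothetical names invented for illustration; a real pool library will expose its checkout count differently, so treat this as a pattern rather than a drop-in test.

```python
import contextlib


class FakePool:
    """Hypothetical stand-in for a real connection pool; it only counts checkouts."""

    def __init__(self, size: int) -> None:
        self.size = size
        self.in_use = 0

    def acquire(self) -> object:
        if self.in_use >= self.size:
            raise RuntimeError("connection pool exhausted")
        self.in_use += 1
        return object()

    def release(self, conn: object) -> None:
        self.in_use -= 1


@contextlib.contextmanager
def assert_no_leak(pool: FakePool):
    """Fail if the wrapped code leaves more connections checked out than before."""
    before = pool.in_use
    yield
    leaked = pool.in_use - before
    assert leaked == 0, f"{leaked} connection(s) were not released"


def good_query(pool: FakePool) -> None:
    conn = pool.acquire()
    try:
        pass  # the query would run here
    finally:
        pool.release(conn)  # released even if the query raises


def leaky_query(pool: FakePool) -> None:
    pool.acquire()  # bug: the connection is never released


pool = FakePool(size=2)
with assert_no_leak(pool):
    good_query(pool)  # passes: the checkout count returns to its starting value
```

Wrapping `leaky_query(pool)` in the same `with` block raises `AssertionError`, which is the failure the integration suite in the example was missing.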
Most incidents have multiple contributing factors.
┌─────────────────────────────────────────────────────────────┐
│ INCIDENT: API OUTAGE │
├─────────────────────────────────────────────────────────────┤
│ │
│ Direct Cause: │
│ └─ Database connection pool exhaustion │
│ │
│ Contributing Factors: │
│ ├─ [Code] Connection leak bug in PR #1234 │
│ ├─ [Process] Code review didn't catch the bug │
│ ├─ [Testing] No connection leak tests │
│ ├─ [Monitoring] No alert for connection pool usage │
│ ├─ [Deploy] Deployed during high-traffic period │
│ └─ [Recovery] Runbook for this scenario was outdated │
│ │
│ Environmental Factors: │
│ ├─ Team was understaffed (vacation season) │
│ └─ Similar incident 6 months ago, action items incomplete │
│ │
└─────────────────────────────────────────────────────────────┘
Work backwards from failure to identify all paths.
[API Outage]
├─ [DB Connections Exhausted]
│  ├─ [Connection Leak]
│  │  └─ [Bug in Code]
│  └─ [Too Many Requests]
│     ├─ [Traffic Spike]
│     │  └─ [Marketing Campaign]
│     └─ [Missing Rate Limit]
└─ [App Server Crashed]
   └─ [OOM Error]
      └─ [Memory Leak]
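Working backwards through the tree above can be done mechanically by listing every path from the top event down to a basic cause. The sketch below is one way to do that; the node names come from the diagram, while `FAULT_TREE` and `cause_paths` are hypothetical names. It assumes that "Marketing Campaign" sits under "Traffic Spike" and that every branch is an OR gate, which a full fault tree analysis would not take for granted.

```python
# Each event maps to the events that can cause it (all treated as OR branches).
FAULT_TREE = {
    "API Outage": ["DB Connections Exhausted", "App Server Crashed"],
    "DB Connections Exhausted": ["Connection Leak", "Too Many Requests"],
    "App Server Crashed": ["OOM Error"],
    "Connection Leak": ["Bug in Code"],
    "Too Many Requests": ["Traffic Spike", "Missing Rate Limit"],
    "OOM Error": ["Memory Leak"],
    "Traffic Spike": ["Marketing Campaign"],
}


def cause_paths(tree: dict, event: str) -> list:
    """Return every path from `event` down to a leaf (a basic cause)."""
    children = tree.get(event, [])
    if not children:
        return [[event]]
    return [[event] + path for child in children for path in cause_paths(tree, child)]


for path in cause_paths(FAULT_TREE, "API Outage"):
    print(" <- ".join(path))
```

Each printed line is one candidate causal chain to confirm or rule out with evidence.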
A detailed timeline helps identify the chain of events.
Timeline: API Outage - 2026-01-15
Time (UTC) | Event | Source
------------|------------------------------|--------
09:00 | Deploy v2.3.4 started | GitHub
09:15 | Deploy completed | K8s
09:45 | Marketing email sent (50k) | Marketing
10:02 | Traffic spike begins | Datadog
10:15 | Connection pool at 80% | Metrics
10:23 | First 500 errors | Logs
10:25 | Alert fired | PagerDuty
10:27 | On-call acknowledged | PagerDuty
10:35 | Root cause identified | Slack
10:42 | Rollback initiated | K8s
10:48 | Service recovering | Datadog
11:00 | All clear declared | Slack
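Key Finding: 70 minutes between deploy completion (09:15) and the alert (10:25); 38 minutes between the marketing email (09:45) and the first 500 errors (10:23)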
Deploy + traffic spike = perfect storm
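Intervals quoted in a postmortem are easy to get wrong by hand, so it can help to compute them from the timeline itself. The sketch below copies the event times from the table above; the dictionary keys and the `minutes_between` helper are hypothetical names, and it assumes all events fall on the same day.

```python
from datetime import datetime

# Event times copied from the timeline above (UTC, 2026-01-15).
EVENTS = {
    "deploy_completed": "09:15",
    "marketing_email": "09:45",
    "first_errors": "10:23",
    "alert_fired": "10:25",
    "acknowledged": "10:27",
    "rollback_initiated": "10:42",
    "all_clear": "11:00",
}


def minutes_between(start: str, end: str) -> int:
    """Whole minutes from event `start` to event `end` (same-day HH:MM values)."""
    fmt = "%H:%M"
    delta = datetime.strptime(EVENTS[end], fmt) - datetime.strptime(EVENTS[start], fmt)
    return int(delta.total_seconds() // 60)


print(minutes_between("deploy_completed", "alert_fired"))  # 70
print(minutes_between("marketing_email", "first_errors"))  # 38
print(minutes_between("first_errors", "alert_fired"))      # 2
print(minutes_between("alert_fired", "acknowledged"))      # 2
print(minutes_between("first_errors", "all_clear"))        # 37
```

Naming both endpoints of every interval keeps statements like "time to detect" unambiguous in the written report.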
Good action items are SMART:
| Criteria | Bad Example | Good Example |
|----------|-------------|--------------|
| Specific | "Improve testing" | "Add connection pool leak test to CI" |
| Measurable | "Monitor better" | "Alert when pool > 80% for 5 min" |
| Assignable | "Team should fix" | "@jane owns implementation" |
| Realistic | "Rewrite entire system" | "Add circuit breaker to DB calls" |
| Time-bound | "Soon" | "Complete by 2026-02-01" |
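A measurable action item such as "Alert when pool > 80% for 5 min" is precise enough to turn directly into a check. The sketch below shows the logic only; `sustained_breach` and its sample format are assumptions made for illustration, and in practice the rule would normally live in the monitoring system's own alert configuration.

```python
def sustained_breach(samples, threshold=0.80, window_s=300):
    """Return True if usage stays above `threshold` for at least `window_s` seconds.

    `samples` is a list of (unix_seconds, usage_fraction) pairs sorted by time.
    """
    breach_start = None
    for ts, usage in samples:
        if usage > threshold:
            if breach_start is None:
                breach_start = ts  # first sample of a new breach
            if ts - breach_start >= window_s:
                return True
        else:
            breach_start = None  # usage recovered, so the clock resets
    return False


# One sample per minute at 85% usage for five minutes: the alert should fire.
steady = [(t, 0.85) for t in range(0, 301, 60)]
print(sustained_breach(steady))  # True

# A dip below the threshold at t=120 resets the window: no alert.
dipped = [(0, 0.85), (60, 0.90), (120, 0.50), (180, 0.90), (240, 0.90), (300, 0.90)]
print(sustained_breach(dipped))  # False
```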
## Root Cause Analysis
### Direct Cause
[What directly caused the incident]
### 5 Whys Analysis
1. Why? → [Answer]
2. Why? → [Answer]
3. Why? → [Answer]
4. Why? → [Answer]
5. Why? → [Root cause]
### Contributing Factors
- **Technical:** [List]
- **Process:** [List]
- **Organizational:** [List]
### Why Wasn't This Caught?
- In development: [Why]
- In code review: [Why]
- In testing: [Why]
- In staging: [Why]
- By monitoring: [Why]
### Action Items
| Priority | Action | Owner | Due | Prevents |
|----------|--------|-------|-----|----------|
| P0 | [Action] | @name | [Date] | Direct cause |
| P1 | [Action] | @name | [Date] | Detection |
| P2 | [Action] | @name | [Date] | Future risk |