latestaiagents/root-cause-analysis

Name: root-cause-analysis
Author: latestaiagents

plugins/devops-sre/skills/incident-response/root-cause-analysis/SKILL.md

npx skillsauth add latestaiagents/agent-skills root-cause-analysis

Root Cause Analysis (RCA)

Find the real cause, not just the symptoms, to prevent recurrence.

RCA Principles

Look for system failures, not human errors
Ask "why" until you find actionable causes
Multiple contributing factors are common
Prevention > blame

Method 1: 5 Whys

Keep asking "why" until you reach an actionable root cause.

Example: API Outage

Problem: API returned 500 errors for 45 minutes

Why #1: Why did the API return 500 errors?
→ The database connection pool was exhausted

Why #2: Why was the connection pool exhausted?
→ Connections weren't being released after queries

Why #3: Why weren't connections being released?
→ A code change introduced a bug that skipped connection.close()

Why #4: Why wasn't this caught before production?
→ Our integration tests don't check for connection leaks

Why #5: Why don't integration tests check for connection leaks?
→ We haven't implemented connection pool monitoring in tests

ROOT CAUSE: Missing connection leak detection in test suite
ACTION: Add connection pool assertions to integration tests
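The action above could look roughly like the following integration-style check. This is a minimal sketch, not the incident's actual code: `TinyPool`, `leaky_query`, `safe_query`, and `leaked_connections` are hypothetical names, and the toy pool stands in for whatever real pool the service uses.

```python
import queue
from contextlib import contextmanager


class TinyPool:
    """Hypothetical stand-in for a real connection pool."""

    def __init__(self, size):
        self.size = size
        self._idle = queue.Queue()
        for i in range(size):
            self._idle.put(f"conn-{i}")

    def acquire(self):
        # Raises queue.Empty when every connection is checked out.
        return self._idle.get_nowait()

    def release(self, conn):
        self._idle.put(conn)

    @property
    def in_use(self):
        return self.size - self._idle.qsize()

    @contextmanager
    def connection(self):
        conn = self.acquire()
        try:
            yield conn
        finally:
            # Always return the connection, even if the query raises.
            self.release(conn)


def leaky_query(pool):
    conn = pool.acquire()
    return f"result via {conn}"  # bug: the connection is never released


def safe_query(pool):
    with pool.connection() as conn:
        return f"result via {conn}"


def leaked_connections(pool, query, calls=3):
    """Run the query several times and report connections still checked out."""
    for _ in range(calls):
        query(pool)
    return pool.in_use
```

In a test suite, the assertion would be that `leaked_connections(...)` is zero after exercising each code path that touches the database.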

5 Whys Guidelines

| Do | Don't |
|----|-------|
| Use data, not assumptions | Stop at "human error" |
| Consider multiple branches | Accept vague answers |
| Verify each "because" | Skip to conclusions |
| Look for systemic issues | Blame individuals |

Method 2: Contributing Factors Analysis

Most incidents have multiple contributing factors.

┌─────────────────────────────────────────────────────────────┐
│                    INCIDENT: API OUTAGE                      │
├─────────────────────────────────────────────────────────────┤
│                                                              │
│  Direct Cause:                                               │
│  └─ Database connection pool exhaustion                     │
│                                                              │
│  Contributing Factors:                                       │
│  ├─ [Code] Connection leak bug in PR #1234                  │
│  ├─ [Process] Code review didn't catch the bug              │
│  ├─ [Testing] No connection leak tests                      │
│  ├─ [Monitoring] No alert for connection pool usage         │
│  ├─ [Deploy] Deployed during high-traffic period            │
│  └─ [Recovery] Runbook for this scenario was outdated       │
│                                                              │
│  Environmental Factors:                                      │
│  ├─ Team was understaffed (vacation season)                 │
│  └─ Similar incident 6 months ago, action items incomplete  │
│                                                              │
└─────────────────────────────────────────────────────────────┘
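One way to use a contributing-factors list is to check that every factor ends up with an action item. The sketch below assumes a simple one-action-per-category mapping; `FACTORS`, `ACTION_ITEMS`, and `uncovered_factors` are illustrative names, and the two actions shown are only examples.

```python
# Contributing factors from the box above, as (category, description) pairs.
FACTORS = [
    ("Code", "Connection leak bug in PR #1234"),
    ("Process", "Code review didn't catch the bug"),
    ("Testing", "No connection leak tests"),
    ("Monitoring", "No alert for connection pool usage"),
    ("Deploy", "Deployed during high-traffic period"),
    ("Recovery", "Runbook for this scenario was outdated"),
]

# Hypothetical state of the action list partway through the review.
ACTION_ITEMS = {
    "Testing": "Add connection pool leak test to CI",
    "Monitoring": "Alert when pool > 80% for 5 min",
}


def uncovered_factors(factors, action_items):
    """Return the factors whose category has no action item yet."""
    return [(cat, desc) for cat, desc in factors if cat not in action_items]


for category, description in uncovered_factors(FACTORS, ACTION_ITEMS):
    print(f"No action yet for [{category}] {description}")
```

A check like this helps avoid the situation listed under Environmental Factors, where action items from an earlier incident were left incomplete.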

Method 3: Fault Tree Analysis

Work backwards from failure to identify all paths.

                    [API Outage]
                         │
            ┌────────────┴────────────┐
            │                         │
    [DB Connections              [App Server
     Exhausted]                   Crashed]
            │                         │
    ┌───────┴───────┐                │
    │               │                │
[Connection    [Too Many         [OOM
 Leak]         Requests]          Error]
    │               │                │
    │         ┌─────┴─────┐          │
    │         │           │          │
[Bug in    [Traffic    [Missing   [Memory
 Code]      Spike]     Rate       Leak]
              │        Limit]
              │
        [Marketing
         Campaign]
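The tree above can also be written as nested data so that every failure path is enumerated mechanically. This is a minimal sketch: the nested-dict encoding and the `failure_paths` helper are assumptions for illustration, not part of any fault tree tool.

```python
# The fault tree above as nested dicts: each key is an event,
# each value holds the events that can cause it (empty for leaf causes).
FAULT_TREE = {
    "API Outage": {
        "DB Connections Exhausted": {
            "Connection Leak": {"Bug in Code": {}},
            "Too Many Requests": {
                "Traffic Spike": {"Marketing Campaign": {}},
                "Missing Rate Limit": {},
            },
        },
        "App Server Crashed": {
            "OOM Error": {"Memory Leak": {}},
        },
    }
}


def failure_paths(tree, prefix=()):
    """Yield every path from the top event down to a leaf cause."""
    for event, causes in tree.items():
        path = prefix + (event,)
        if causes:
            yield from failure_paths(causes, path)
        else:
            yield path


for path in failure_paths(FAULT_TREE):
    # Read "<-" as "caused by".
    print(" <- ".join(path))
```

Each printed line is one candidate explanation to confirm or rule out with evidence.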

Method 4: Timeline Reconstruction

A detailed timeline helps identify the chain of events.

Timeline: API Outage - 2026-01-15

Time (UTC)  | Event                        | Source
------------|------------------------------|--------
09:00       | Deploy v2.3.4 started        | GitHub
09:15       | Deploy completed             | K8s
09:45       | Marketing email sent (50k)   | Marketing
10:02       | Traffic spike begins         | Datadog
10:15       | Connection pool at 80%       | Metrics
10:23       | First 500 errors             | Logs
10:25       | Alert fired                  | PagerDuty
10:27       | On-call acknowledged         | PagerDuty
10:35       | Root cause identified        | Slack
10:42       | Rollback initiated           | K8s
10:48       | Service recovering           | Datadog
11:00       | All clear declared           | Slack

Key Finding: 70 minutes between deploy completion (09:15) and the first alert (10:25)
             Deploy + traffic spike = perfect storm
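Intervals such as time-to-detect can be derived directly from the timestamps in the table rather than estimated by eye. The sketch below assumes all events fall on the same UTC day; the `EVENTS` keys and the `minutes_between` helper are illustrative names.

```python
from datetime import datetime

# Timestamps copied from the timeline table (UTC, same day).
EVENTS = {
    "deploy_completed": "09:15",
    "first_500_errors": "10:23",
    "alert_fired": "10:25",
    "acknowledged": "10:27",
    "rollback_initiated": "10:42",
    "all_clear": "11:00",
}


def minutes_between(start, end, events=EVENTS):
    """Whole minutes elapsed between two named events."""
    fmt = "%H:%M"
    delta = datetime.strptime(events[end], fmt) - datetime.strptime(events[start], fmt)
    return int(delta.total_seconds() // 60)


print("deploy -> first errors:", minutes_between("deploy_completed", "first_500_errors"))  # 68
print("deploy -> alert:", minutes_between("deploy_completed", "alert_fired"))  # 70
print("errors -> alert:", minutes_between("first_500_errors", "alert_fired"))  # 2
print("alert -> all clear:", minutes_between("alert_fired", "all_clear"))  # 35
```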

Common Root Cause Categories

Technical

Code bugs
Configuration errors
Infrastructure failures
Dependency failures
Capacity limits

Process

Inadequate testing
Missed code review
Incomplete runbooks
Poor change management
Insufficient monitoring

Organizational

Understaffing
Knowledge silos
Communication gaps
Incomplete training
Technical debt

Action Item Quality

Good action items are SMART:

| Criteria | Bad Example | Good Example |
|----------|-------------|--------------|
| Specific | "Improve testing" | "Add connection pool leak test to CI" |
| Measurable | "Monitor better" | "Alert when pool > 80% for 5 min" |
| Assignable | "Team should fix" | "@jane owns implementation" |
| Realistic | "Rewrite entire system" | "Add circuit breaker to DB calls" |
| Time-bound | "Soon" | "Complete by 2026-02-01" |
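A measurable action item such as "Alert when pool > 80% for 5 min" translates into a precise condition. The sketch below is one possible reading, assuming one utilization sample per minute; `should_alert` and its parameters are hypothetical and not tied to any monitoring product.

```python
def should_alert(samples, threshold=0.80, window=5):
    """Return True if utilization stayed above threshold for `window` samples in a row.

    `samples` is a list of pool utilization readings (0.0 to 1.0), oldest first,
    assumed to be taken once per minute.
    """
    streak = 0
    for utilization in samples:
        # Extend the streak while above threshold, reset it otherwise.
        streak = streak + 1 if utilization > threshold else 0
        if streak >= window:
            return True
    return False


brief_spike = [0.50, 0.85, 0.90, 0.60, 0.85, 0.70]
sustained = [0.50, 0.82, 0.85, 0.88, 0.91, 0.95]

print(should_alert(brief_spike))  # False: never 5 consecutive readings above 80%
print(should_alert(sustained))    # True: 5 consecutive readings above 80%
```

Requiring a sustained breach rather than a single reading is what makes the "for 5 min" part of the action item testable.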

RCA Template

## Root Cause Analysis

### Direct Cause
[What directly caused the incident]

### 5 Whys Analysis
1. Why? → [Answer]
2. Why? → [Answer]
3. Why? → [Answer]
4. Why? → [Answer]
5. Why? → [Root cause]

### Contributing Factors
- **Technical:** [List]
- **Process:** [List]
- **Organizational:** [List]

### Why Wasn't This Caught?
- In development: [Why]
- In code review: [Why]
- In testing: [Why]
- In staging: [Why]
- By monitoring: [Why]

### Action Items
| Priority | Action | Owner | Due | Prevents |
|----------|--------|-------|-----|----------|
| P0 | [Action] | @name | [Date] | Direct cause |
| P1 | [Action] | @name | [Date] | Detection |
| P2 | [Action] | @name | [Date] | Future risk |

Anti-Patterns to Avoid

"Human error" - The human made an error, but the system allowed it
"Lack of attention" - Why did the system require such attention?
"Should have known" - How could they have known?
"Didn't follow procedure" - Why was the procedure not followed?
Single root cause - Usually there are multiple contributing factors

Systematic root cause analysis using 5 Whys, fishbone diagrams, and fault tree analysis. Use this skill when investigating why an incident happened, performing RCA, or writing postmortems. Activate when: root cause, why did this happen, 5 whys, incident analysis, postmortem investigation, how did this happen, what caused, failure analysis.

2 stars

testing

Updated Apr 23, 2026

npx skillsauth add latestaiagents/agent-skills root-cause-analysis

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 2:54 AM (13.7s, 1 file scanned)

Install manually

# Clone the repo
git clone https://github.com/latestaiagents/agent-skills.git

# Copy into Claude Code skills folder (global); create the folder first if it doesn't exist
mkdir -p ~/.claude/skills
cp -r agent-skills/plugins/devops-sre/skills/incident-response/root-cause-analysis ~/.claude/skills/

See the official Claude Code Skills documentation for the skills folder location.

Repository: latestaiagents/agent-skills (2 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT