harsh040506/root-cause-analysis

Name: root-cause-analysis
Author: harsh040506

engineering/debugging/skills/root-cause-analysis/SKILL.md

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library root-cause-analysis

Root Cause Analysis

Systematic, hypothesis-driven debugging to find the true root cause of any software problem.

The Core Debugging Mindset

The bug is telling you the truth. Bugs are deterministic — the same inputs always produce the same outputs. When behavior seems random, it's because you don't yet understand all the inputs (timing, state, environment, data).

Never assume — verify. Every hypothesis must be tested against evidence. The fastest way to debug is to disprove hypotheses quickly, not to stare at code hoping for insight.

The most recent change is the most likely cause. If it was working before, something changed. Start there.

Debugging Framework: CLEAR

C: Characterize the problem precisely
L: Locate the failure boundary
E: Enumerate hypotheses
A: Acquire evidence to test each hypothesis
R: Resolve with minimal, targeted fix

Step 1: Characterize

Before touching anything, write down:

Observed behavior: What is actually happening?
Expected behavior: What should happen?
Reproducibility: Always? Intermittent? Only in certain conditions?
Scope: Who/what is affected? One user/endpoint/region, or all?
Timeline: When did it start? Was it ever working?
Changed recently: Deploy, config, dependency, data, traffic pattern?

The difference between the observed behavior and the expected behavior is the bug. Be precise. "It's broken" is not a characterization. "The API returns HTTP 500 with the message 'relation orders does not exist' on all checkout requests since 14:32 UTC today" is.
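
One way to keep a characterization honest is to record it as structured data and check that no field is empty or vague before moving on. The sketch below is illustrative only: the `Characterization` type, the `vagueFields` helper, and the list of vague phrases are assumptions made for this example, not part of the skill.

```typescript
// Hypothetical record of the six items listed above.
interface Characterization {
  observed: string;
  expected: string;
  reproducibility: string;
  scope: string;
  timeline: string;
  recentChanges: string;
}

// Assumed, illustrative list of phrases that say nothing specific.
const VAGUE_PHRASES = ["it's broken", "doesn't work", "not working", "something is wrong"];

// Returns the names of fields that are empty or consist only of a vague phrase.
function vagueFields(c: Characterization): string[] {
  const keys = Object.keys(c) as (keyof Characterization)[];
  return keys.filter((key) => {
    const value = c[key].trim().toLowerCase();
    return value.length === 0 || VAGUE_PHRASES.indexOf(value) !== -1;
  });
}

const report: Characterization = {
  observed: "API returns HTTP 500 'relation orders does not exist' on all checkout requests",
  expected: "Checkout returns HTTP 200 and creates an order",
  reproducibility: "Always",
  scope: "All users, checkout endpoint only",
  timeline: "Since 14:32 UTC today",
  recentChanges: "", // not filled in yet
};

console.log(vagueFields(report)); // only "recentChanges" is flagged
```

A check like this only catches empty or boilerplate answers; whether a filled-in answer is precise enough still needs human judgment.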

Step 2: Locate the Failure Boundary

Find the exact line where correct becomes incorrect. Use binary search on the call stack:

Request → API Handler → Service Layer → Repository → Database Driver → Database
         ↑
    Where does correct stop and wrong begin?

Techniques:

Stack trace — read the innermost frame first (that's where the error was thrown)
Add logging — temporary console.log / print to narrow the location
REPL/notebook — test the suspect function in isolation
Remove code — comment out sections to find which one causes the issue

Git bisect — if you have a commit range, binary search finds the bad commit:

git bisect start
git bisect bad HEAD
git bisect good v1.4.0  # Last known good version
# Git checks out commits, you test and run:
git bisect good  # or bad
# Repeats until it finds the culprit commit
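
The binary search behind this workflow can be sketched in a few lines. The example assumes a linear list of commits ordered oldest first, a known-good first entry, a known-bad last entry, and that every commit after the culprit is also bad. `findFirstBad`, the commit names, and the stand-in test are hypothetical; real `git bisect` also handles branching history, which this sketch does not.

```typescript
// Returns the index of the first bad item, assuming items[0] is good,
// the last item is bad, and every item after the first bad one is also bad.
function findFirstBad<T>(items: T[], isBad: (item: T) => boolean): number {
  let good = 0;               // highest index known to be good
  let bad = items.length - 1; // lowest index known to be bad
  while (bad - good > 1) {
    const mid = Math.floor((good + bad) / 2);
    if (isBad(items[mid])) {
      bad = mid;
    } else {
      good = mid;
    }
  }
  return bad;
}

// Hypothetical commit list, oldest first; "c5" is the made-up culprit.
const commits = ["v1.4.0", "c1", "c2", "c3", "c4", "c5", "c6", "c7", "HEAD"];
let testsRun = 0;
const culprit = findFirstBad(commits, (commit) => {
  testsRun += 1;
  // Stand-in for "check out this commit and run the failing test".
  return commits.indexOf(commit) >= commits.indexOf("c5");
});

console.log(commits[culprit], testsRun); // c5 found after 3 test runs
```

The same search applies to any ordered chain where everything before some point is correct and everything after it is wrong, such as the layers in the request path above.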

Step 3: Read the Error Message and Stack Trace

Read the entire thing. Most people read the first line and stop, but the relevant information is often further in: for example, the first frame that points at your own code rather than at a library or the runtime.

How to Read a Stack Trace

ERROR: TypeError: Cannot read properties of undefined (reading 'id')
    at OrderService.createOrder (/app/services/order.service.ts:47:28)
    at async CheckoutController.checkout (/app/controllers/checkout.ts:83:5)
    at async processTicksAndRejections (node:internal/process/task_queues:95:5)

Reading order:

Error type and message (TypeError: Cannot read properties of undefined (reading 'id')): what went wrong
Innermost frame (order.service.ts:47): where it went wrong
Call chain: how you got there (work outward from the innermost frame, which in this trace means reading down the list)

Go to line 47 of order.service.ts. Something that's expected to be an object is undefined there. Ask: what variable on that line could be undefined? Why would it be undefined?
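
The reading order can also be applied mechanically. The following sketch parses V8-style frames of the shape shown in the trace above and picks out the innermost one. `parseFrame` and `parseTrace` are hypothetical helpers, and the regular expression is an assumption that only covers frames written as `at name (file:line:column)`; other frame shapes are skipped.

```typescript
interface Frame {
  fn: string;
  file: string;
  line: number;
  column: number;
}

// Parses one frame such as
// "    at async CheckoutController.checkout (/app/controllers/checkout.ts:83:5)".
// Returns null for lines that do not match this shape (for example the message line).
function parseFrame(text: string): Frame | null {
  const match = /^at (?:async )?(.+?) \((.+):(\d+):(\d+)\)$/.exec(text.trim());
  if (match === null) {
    return null;
  }
  return { fn: match[1], file: match[2], line: Number(match[3]), column: Number(match[4]) };
}

// Returns the frames in the order they appear: innermost first.
function parseTrace(trace: string): Frame[] {
  const frames: Frame[] = [];
  for (const text of trace.split("\n")) {
    const frame = parseFrame(text);
    if (frame !== null) {
      frames.push(frame);
    }
  }
  return frames;
}

const trace = [
  "ERROR: TypeError: Cannot read properties of undefined (reading 'id')",
  "    at OrderService.createOrder (/app/services/order.service.ts:47:28)",
  "    at async CheckoutController.checkout (/app/controllers/checkout.ts:83:5)",
  "    at async processTicksAndRejections (node:internal/process/task_queues:95:5)",
].join("\n");

const frames = parseTrace(trace);
const innermost = frames[0];
console.log(`${innermost.file}:${innermost.line} in ${innermost.fn}`);
// /app/services/order.service.ts:47 in OrderService.createOrder
```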

Common Error Patterns

| Error | What it means | Where to look |
|-------|-------------|--------------|
| NullPointerException / TypeError: undefined | Expected object is null/undefined | Check what's being returned from the call above |
| Connection refused | Server isn't running or wrong host/port | Check env config, service health |
| Timeout | Operation exceeded time limit | Network latency, slow query, blocking I/O |
| Out of memory | Heap exhausted | Memory leak, too-large data in memory |
| Permission denied | Insufficient privileges | File/directory permissions, IAM roles |
| Deadlock detected | Two transactions blocking each other | Transaction scope and lock ordering |
| Unique constraint violated | Duplicate primary/unique key insert | Race condition, missing idempotency |
| CORS error | Browser blocking cross-origin request | Backend missing Access-Control-Allow-Origin header |

Step 4: Enumerate and Test Hypotheses

Generate 3–5 hypotheses and rank them by probability (most likely first). Among the likely candidates, start with the one that is fastest to disprove.

Hypothesis Template

Hypothesis #1: [What you think is causing it]
Evidence for: [Why you think this is likely]
Evidence against: [Why it might not be this]
Test: [The cheapest experiment that confirms or denies this]
Expected if true: [What you'll observe]
Expected if false: [What you'll observe instead]
Result: [Fill in after testing]
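
The template maps naturally onto a small record, which also makes the test order explicit. In this sketch the `Hypothesis` fields, the example claims and numbers, and the ordering heuristic (likelihood divided by test cost) are all assumptions chosen for illustration; the skill itself does not prescribe a formula.

```typescript
// Hypothetical record mirroring the template above.
interface Hypothesis {
  claim: string;
  likelihood: number;  // rough guess between 0 and 1
  testMinutes: number; // estimated cost of the cheapest test, must be > 0
  result?: "confirmed" | "disproven";
}

// Orders untested hypotheses by likelihood per minute of test effort (an assumed heuristic).
// filter() returns a new array, so the caller's list is not reordered.
function testOrder(hypotheses: Hypothesis[]): Hypothesis[] {
  return hypotheses
    .filter((h) => h.result === undefined)
    .sort((a, b) => b.likelihood / b.testMinutes - a.likelihood / a.testMinutes);
}

// Made-up candidates for the "relation orders does not exist" example.
const candidates: Hypothesis[] = [
  { claim: "Migration for the orders table did not run", likelihood: 0.5, testMinutes: 2 },
  { claim: "Connection string points at the wrong database", likelihood: 0.3, testMinutes: 1 },
  { claim: "ORM cache holds a stale schema", likelihood: 0.1, testMinutes: 30 },
];

for (const h of testOrder(candidates)) {
  console.log(h.claim);
}
```

Here the second candidate is tested first: it is less likely than the first, but checking it takes half the time.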

The Scientific Method Applied

Test one hypothesis at a time. If you change multiple things simultaneously, you don't know which one fixed it — or which one broke something else.

When a hypothesis is disproven, update your mental model and generate new hypotheses based on the new information. Don't keep testing the same wrong idea.

Step 5: Hypothesis-Specific Investigation

Suspect: Recent Code Change

# What changed in the last week?
git log --oneline --since="7 days ago" -- <relevant-path>

# Show what changed in a specific commit
git show <commit-hash> -- <file>

# Compare current to tag
git diff v1.4.0 HEAD -- src/

Suspect: Configuration / Environment

# Which relevant environment variables are set? (The output may include secret values, so avoid pasting it into tickets.)
printenv | grep -iE "database|api|secret|port" | sort

# Are environment variables the same between working and broken environments?
# Compare: diff <(ssh staging "env | sort") <(ssh production "env | sort")
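
If the two environments are available as key-value maps (for example after loading two `.env` files), the comparison can report variable names only, so that secret values never reach the output. This is a minimal sketch; `diffEnv`, the `Env` type, and the sample values are hypothetical.

```typescript
type Env = Record<string, string>;

interface EnvDiff {
  onlyInFirst: string[];
  onlyInSecond: string[];
  differing: string[];
}

// Compares two environments by key and reports names only, never values.
function diffEnv(first: Env, second: Env): EnvDiff {
  const has = (env: Env, key: string): boolean =>
    Object.prototype.hasOwnProperty.call(env, key);
  const firstKeys = Object.keys(first).sort();
  const secondKeys = Object.keys(second).sort();
  return {
    onlyInFirst: firstKeys.filter((key) => !has(second, key)),
    onlyInSecond: secondKeys.filter((key) => !has(first, key)),
    differing: firstKeys.filter((key) => has(second, key) && first[key] !== second[key]),
  };
}

// Made-up values for illustration.
const staging: Env = {
  DATABASE_URL: "postgres://staging-db/app",
  API_PORT: "8080",
  FEATURE_X: "on",
};
const production: Env = {
  DATABASE_URL: "postgres://prod-db/app",
  API_PORT: "8080",
  API_TIMEOUT_MS: "500",
};

console.log(diffEnv(staging, production));
```

For these inputs, `FEATURE_X` exists only in staging, `API_TIMEOUT_MS` only in production, and `DATABASE_URL` differs between the two.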

Suspect: Database

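-- Is the table there?
-- PostgreSQL (psql meta-command; keep comments off this line, since psql reads the rest of it as arguments):
\dt orders
-- MySQL:
SHOW TABLES LIKE 'orders';

-- Does the schema match expectations?
-- PostgreSQL (psql meta-command):
\d orders
-- MySQL:
DESCRIBE orders;

-- Are there locks?
-- PostgreSQL: lock requests that are still waiting
SELECT * FROM pg_locks WHERE NOT granted;
-- MySQL: running threads and their current state
SHOW PROCESSLIST;

-- Most expensive queries by cumulative execution time
-- (PostgreSQL; needs the pg_stat_statements extension; these column names apply to PostgreSQL 13 and later)
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;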

Suspect: External API / Service

# Is it reachable?
curl -v https://api.external-service.com/health

# Check response times
curl -s -w "Total: %{time_total}s\n" -o /dev/null https://api.external-service.com/endpoint

# Check their status page (don't waste time debugging your code if it's them)

Suspect: Race Condition

Signs: intermittent, frequency increases with load, related to async code or concurrent requests.

Look for:

Shared mutable state accessed without synchronization
Time-of-check-to-time-of-use (TOCTOU) patterns
Missing database transactions where atomicity is needed
Shared counters or accumulators without atomic operations
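
A time-of-check-to-time-of-use bug can be reproduced deterministically by splitting each request into its check step and its update step and interleaving them by hand. The sketch below does this for a made-up inventory with one item left. All names are hypothetical, and the single-function "atomic" version only models what a transaction, a conditional update, or a lock would provide in a real system.

```typescript
// Shared mutable state: one item left in stock (made-up scenario).
interface Inventory {
  stock: number;
}

// Racy version: the check and the update are separate steps, so another
// request can run in between them.
function racyPurchase(inv: Inventory): { check: () => boolean; commit: () => void } {
  let allowed = false;
  return {
    check: () => {
      allowed = inv.stock > 0; // time of check
      return allowed;
    },
    commit: () => {
      if (allowed) {
        inv.stock -= 1; // time of use: relies on a possibly stale check
      }
    },
  };
}

// Atomic version: check and update happen in one indivisible step.
function atomicPurchase(inv: Inventory): boolean {
  if (inv.stock > 0) {
    inv.stock -= 1;
    return true;
  }
  return false;
}

// Interleave two requests by hand: A checks, B checks, A commits, B commits.
const racy: Inventory = { stock: 1 };
const requestA = racyPurchase(racy);
const requestB = racyPurchase(racy);
const aAllowed = requestA.check();
const bAllowed = requestB.check();
requestA.commit();
requestB.commit();
console.log(aAllowed, bAllowed, racy.stock); // true true -1: the item was sold twice

const safe: Inventory = { stock: 1 };
const firstSale = atomicPurchase(safe);
const secondSale = atomicPurchase(safe);
console.log(firstSale, secondSale, safe.stock); // true false 0
```

Forcing the interleaving like this turns an intermittent failure into one that fails every time, which is what a regression test for a race needs.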

Step 6: Write a Regression Test First

Before fixing, write a test that reproduces the bug. This:

Confirms you understand the bug
Verifies your fix actually works
Prevents the bug from returning

// Test that reproduces the bug
test('should not throw when user has no orders', async () => {
  const user = await createUser({ hasOrders: false });
  // This was previously throwing TypeError: Cannot read properties of undefined
  await expect(orderService.createOrder(user.id)).resolves.toBeDefined();
});

Step 7: Minimal Fix

Apply the smallest possible change that addresses the root cause.

Root cause fix vs. workaround:

| Workaround | Root cause fix |
|-----------|---------------|
| if (!user) return null | Fix the code that should always return a user to actually return a user |
| Catch and swallow the exception | Fix the condition that causes the exception |
| Increase timeout | Fix the slow operation |
| Restart service daily | Fix the memory leak |

Workarounds hide problems and accumulate technical debt. Fix root causes.
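
The first row of the table can be made concrete with a small, invented example: a lookup that fails because request parameters arrive as strings while stored ids are numbers. Every name here (`User`, `findUserBuggy`, `findUser`, the greeting functions) is hypothetical and only illustrates the contrast.

```typescript
interface User {
  id: number;
  name: string;
}

const users: User[] = [{ id: 42, name: "Ada" }];

// Buggy lookup: request parameters arrive as strings, and "42" === 42 is never true.
function findUserBuggy(id: string | number): User | undefined {
  for (const user of users) {
    if (user.id === id) {
      return user;
    }
  }
  return undefined;
}

// Workaround: the caller hides the failure, so an existing user silently becomes a guest.
function greetingWorkaround(id: string | number): string {
  const user = findUserBuggy(id);
  if (!user) {
    return "Hello, guest";
  }
  return `Hello, ${user.name}`;
}

// Root cause fix: normalize the id at the boundary so the lookup itself is correct.
function findUser(id: string | number): User | undefined {
  const numericId = Number(id);
  for (const user of users) {
    if (user.id === numericId) {
      return user;
    }
  }
  return undefined;
}

// With the lookup fixed, a missing user is a real error and is reported loudly.
function greeting(id: string | number): string {
  const user = findUser(id);
  if (!user) {
    throw new Error(`No user with id ${id}`);
  }
  return `Hello, ${user.name}`;
}

console.log(greetingWorkaround("42")); // "Hello, guest": the bug is hidden, not fixed
console.log(greeting("42"));           // "Hello, Ada"
```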

Debugging Anti-Patterns

Shotgun debugging: Making random changes hoping one fixes it. This wastes time and introduces new bugs. Always test a specific hypothesis.

Debugging without reproducing: If you can't reproduce the bug, you can't verify the fix. Spend time making it reproducible before writing any fix.

Fixing symptoms: Adding null checks around a crash when the real issue is that a function is returning null when it shouldn't. The null check hides the bug.

Tunnel vision: Getting fixated on one hypothesis and forcing evidence to fit it. If evidence contradicts your hypothesis, update your hypothesis — don't rationalize the evidence away.

Not reading the error message: Error messages almost always tell you what happened. Read them fully, including the complete stack trace.

Deeper Reference

For a structured catalog of recurring error patterns and their root causes, see:

references/error-pattern-catalog.md — web app error patterns (HTTP 500/503/timeout), distributed system failures (cascading failures, split-brain, clock skew), and log signature library for fast pattern identification

This skill should be used when the user describes a bug, error, crash, or unexpected behavior and asks why it's happening. Trigger on "why is this happening", "what's causing this", "root cause", "5 whys", "RCA", "debug this error", "investigate this bug", "figure out why", "trace this issue", or when the user pastes an error message, stack trace, exception, or describes incorrect behavior. Also trigger for "how do I fix this", "my code isn't working", "something is wrong with", or "this is broken". This skill applies to any language, framework, or system.

2 stars

development

Updated Apr 5, 2026

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library root-cause-analysis

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage shown |
|---------|-------------|--------|------------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 5, 2026, 5:10 PM · 3.9s · 2 files scanned

Install manually

# Clone the repo
git clone https://github.com/harsh040506/claude-code-unified-skill-plugin-library.git

# Copy into Claude Code skills folder (global)
cp -r claude-code-unified-skill-plugin-library/engineering/debugging/skills/root-cause-analysis ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

harsh040506/claude-code-unified-skill-plugin-library

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT