engineering/debugging/skills/root-cause-analysis/SKILL.md
This skill should be used when the user describes a bug, error, crash, or unexpected behavior and asks why it's happening. Trigger on "why is this happening", "what's causing this", "root cause", "5 whys", "RCA", "debug this error", "investigate this bug", "figure out why", "trace this issue", or when the user pastes an error message, stack trace, exception, or describes incorrect behavior. Also trigger for "how do I fix this", "my code isn't working", "something is wrong with", or "this is broken". This skill applies to any language, framework, or system.
Systematic, hypothesis-driven debugging to find the true root cause of any software problem.
The bug is telling you the truth. Bugs are deterministic — the same inputs always produce the same outputs. When behavior seems random, it's because you don't yet understand all the inputs (timing, state, environment, data).
Never assume — verify. Every hypothesis must be tested against evidence. The fastest way to debug is to disprove hypotheses quickly, not to stare at code hoping for insight.
The most recent change is the most likely cause. If it was working before, something changed. Start there.
C: Characterize the problem precisely
L: Locate the failure boundary
E: Enumerate hypotheses
A: Acquire evidence to test each hypothesis
R: Resolve with a minimal, targeted fix
Before touching anything, write down:
1. What you expected to happen.
2. What actually happened.
The difference between your answers to 1 and 2 IS the bug. Be precise. "It's broken" is not a characterization. "The API returns HTTP 500 with the message 'relation orders does not exist' on all checkout requests since 14:32 UTC today" is.
Find the exact line where correct becomes incorrect. Use binary search on the call stack:
Request → API Handler → Service Layer → Repository → Database Driver → Database
↑
Where does correct stop and wrong begin?
Techniques:
console.log / print to narrow the location
git bisect start
git bisect bad HEAD
git bisect good v1.4.0 # Last known good version
# Git checks out commits, you test and run:
git bisect good # or bad
# Repeats until it finds the culprit commit
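Narrowing the call path and running git bisect are the same search: over an ordered sequence (layers, or commits) that starts correct and ends wrong, find the first wrong element. The following is a minimal sketch of that idea, assuming the sequence flips from good to bad exactly once. `findFirstBad`, `looksCorrectAt`, and the choice of failing layer are hypothetical names and values for illustration, not part of git or any library.

```javascript
// Binary search for the first "bad" item in an ordered list.
// Assumes items[0] is good, the last item is bad, and that once
// an item is bad every later item is bad too (a single flip).
function findFirstBad(items, isGood) {
  let lo = 0;                 // highest index known to be good
  let hi = items.length - 1;  // lowest index known to be bad
  let probes = 0;
  while (hi - lo > 1) {
    const mid = Math.floor((lo + hi) / 2);
    probes += 1;
    if (isGood(items[mid])) {
      lo = mid;
    } else {
      hi = mid;
    }
  }
  return { lastGood: items[lo], firstBad: items[hi], probes };
}

// Layers from the diagram above. The probe stands in for
// "log the value at this layer and check it by hand".
const layers = [
  "Request", "API Handler", "Service Layer",
  "Repository", "Database Driver", "Database",
];
// Hypothetical scenario: the data is wrong from Repository onward.
const looksCorrectAt = (layer) => layers.indexOf(layer) < 3;

const boundary = findFirstBad(layers, looksCorrectAt);
console.log(boundary);
```

With six layers the boundary is found after two probes instead of up to four sequential checks; the saving grows with the length of the sequence, which is why bisecting a long commit range is practical.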
Read the entire thing. Most people read the first line and stop. The relevant information is usually in the middle.
ERROR: TypeError: Cannot read properties of undefined (reading 'id')
at OrderService.createOrder (/app/services/order.service.ts:47:28)
at async CheckoutController.checkout (/app/controllers/checkout.ts:83:5)
at async processTicksAndRejections (node:internal/process/task_queues:95:5)
Reading order:
1. The first line (TypeError: Cannot read properties of undefined (reading 'id')): what went wrong
2. The topmost frame (order.service.ts:47): where it went wrong
Go to line 47 of order.service.ts. Something that's expected to be an object is undefined there. Ask: what variable on that line could be undefined? Why would it be undefined?
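When traces are long, the frame that matters is usually the first one pointing at your own code rather than at runtime internals or dependencies. The sketch below picks that frame out of a Node.js-style trace. `firstAppFrame` is a hypothetical helper, and the regular expression only assumes the `at fn (file:line:col)` shape shown in the example trace; other runtimes format frames differently.

```javascript
// Returns the first stack frame that points at application code,
// skipping Node internals ("node:...") and third-party packages.
function firstAppFrame(stack) {
  const frameRe = /^\s*at (?:async )?(?:(.+?) \()?(.+?):(\d+):(\d+)\)?$/;
  for (const line of stack.split("\n")) {
    const m = frameRe.exec(line);
    if (!m) continue; // the message line and blank lines do not match
    const [, fn, file, lineNo, col] = m;
    if (file.startsWith("node:") || file.includes("node_modules")) continue;
    return {
      fn: fn ?? "<anonymous>",
      file,
      line: Number(lineNo),
      column: Number(col),
    };
  }
  return null; // no application frame found
}

// The trace from the example above.
const trace = [
  "TypeError: Cannot read properties of undefined (reading 'id')",
  "    at OrderService.createOrder (/app/services/order.service.ts:47:28)",
  "    at async CheckoutController.checkout (/app/controllers/checkout.ts:83:5)",
  "    at async processTicksAndRejections (node:internal/process/task_queues:95:5)",
].join("\n");

console.log(firstAppFrame(trace));
```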
| Error | What it means | Where to look |
|-------|-------------|--------------|
| NullPointerException / TypeError: undefined | Expected object is null/undefined | Check what's being returned from the call above |
| Connection refused | Server isn't running or wrong host/port | Check env config, service health |
| Timeout | Operation exceeded time limit | Network latency, slow query, blocking I/O |
| Out of memory | Heap exhausted | Memory leak, too-large data in memory |
| Permission denied | Insufficient privileges | File/directory permissions, IAM roles |
| Deadlock detected | Two transactions blocking each other | Transaction scope and lock ordering |
| Unique constraint violated | Duplicate primary/unique key insert | Race condition, missing idempotency |
| CORS error | Browser blocking cross-origin request | Backend missing Access-Control-Allow-Origin header |
Generate 3–5 hypotheses. Rank them by probability, most likely first. Among hypotheses of similar likelihood, start with the one that is fastest to disprove.
Hypothesis #1: [What you think is causing it]
Evidence for: [Why you think this is likely]
Evidence against: [Why it might not be this]
Test: [The cheapest experiment that confirms or denies this]
Expected if true: [What you'll observe]
Expected if false: [What you'll observe instead]
Result: [Fill in after testing]
Test one hypothesis at a time. If you change multiple things simultaneously, you don't know which one fixed it — or which one broke something else.
When a hypothesis is disproven, update your mental model and generate new hypotheses based on the new information. Don't keep testing the same wrong idea.
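Writing the ranking down makes the trade-off between likelihood and test cost explicit. The sketch below orders hypotheses by likelihood per minute of testing. That scoring rule is an assumption of this example rather than something the method prescribes, and the hypotheses, likelihoods, and time estimates are invented for illustration.

```javascript
// Each hypothesis carries a rough likelihood (0..1) and the estimated
// minutes needed to run its cheapest disproving test. Both numbers are
// guesses; the point is to make the ordering explicit.
const hypotheses = [
  { name: "Migration not applied in production", likelihood: 0.6, testMinutes: 5 },
  { name: "Wrong database URL in env", likelihood: 0.3, testMinutes: 1 },
  { name: "ORM cache holds stale schema", likelihood: 0.1, testMinutes: 30 },
];

// Order by likelihood per minute of testing, highest first.
// Returns a new array and leaves the input untouched.
function testOrder(list) {
  return [...list].sort(
    (a, b) => b.likelihood / b.testMinutes - a.likelihood / a.testMinutes
  );
}

const ordered = testOrder(hypotheses);
for (const h of ordered) {
  console.log(`${h.name} (p=${h.likelihood}, ~${h.testMinutes} min)`);
}
```

Here the one-minute environment check runs before the more likely migration hypothesis because it is cheap enough to rule out first.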
# What changed in the last week?
git log --oneline --since="7 days ago" -- <relevant-path>
# Show what changed in a specific commit
git show <commit-hash> -- <file>
# Compare current to tag
git diff v1.4.0 HEAD -- src/
# What environment variables are set?
printenv | grep -iE "database|api|secret|port" | sort
# Are environment variables the same between working and broken environments?
# Compare: diff <(ssh staging "env | sort") <(ssh production "env | sort")
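The same comparison can be done in code when the two environments are available as key/value maps, for example after parsing the output of `env`. This is a minimal sketch: `diffEnv` is a hypothetical helper, and it deliberately reports only variable names, not values, so that secrets do not end up in logs or tickets.

```javascript
// Compares two environments (plain objects of NAME -> value) and reports
// which variables differ, without printing the values themselves.
function diffEnv(working, broken) {
  const has = (obj, key) => Object.prototype.hasOwnProperty.call(obj, key);
  const names = [
    ...new Set([...Object.keys(working), ...Object.keys(broken)]),
  ].sort();
  const report = [];
  for (const name of names) {
    if (!has(broken, name)) {
      report.push({ name, status: "only in working" });
    } else if (!has(working, name)) {
      report.push({ name, status: "only in broken" });
    } else if (working[name] !== broken[name]) {
      report.push({ name, status: "different" });
    }
  }
  return report;
}

// Hypothetical values for illustration.
const staging = { DATABASE_URL: "postgres://staging/db", PORT: "3000", API_KEY: "abc" };
const production = { DATABASE_URL: "postgres://prod/db", PORT: "3000", FEATURE_X: "on" };

console.log(diffEnv(staging, production));
```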
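-- Is the table there?
-- PostgreSQL (psql meta-command; keep it on its own line, without a trailing comment):
\dt orders
-- MySQL:
SHOW TABLES LIKE 'orders';
-- Does the schema match expectations?
-- PostgreSQL (psql):
\d orders
-- MySQL:
DESCRIBE orders;
-- Are there locks?
-- PostgreSQL: lock requests that are still waiting
SELECT * FROM pg_locks WHERE NOT granted;
-- MySQL: running threads and their current state
SHOW PROCESSLIST;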
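-- Slowest queries by cumulative time (PostgreSQL; requires the pg_stat_statements extension, column names as of PostgreSQL 13)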
SELECT query, calls, mean_exec_time, total_exec_time
FROM pg_stat_statements
ORDER BY total_exec_time DESC LIMIT 10;
# Is it reachable?
curl -v https://api.external-service.com/health
# Check response times
curl -s -o /dev/null -w "Total: %{time_total}s\n" https://api.external-service.com/endpoint
# Check their status page (don't waste time debugging your code if it's them)
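The curl checks can also be scripted so the result is recorded alongside other evidence. The sketch below assumes Node.js 18 or later, where `fetch` and `AbortSignal.timeout` are available as globals. `probe` and its `fetchImpl` option are hypothetical names; the option exists only so the function can be exercised without a network.

```javascript
// Times a single request to a dependency and classifies the outcome.
// `fetchImpl` is injectable so the function can be run without a network.
async function probe(url, { timeoutMs = 5000, fetchImpl = fetch } = {}) {
  const started = performance.now();
  const elapsed = () => Math.round(performance.now() - started);
  try {
    const res = await fetchImpl(url, { signal: AbortSignal.timeout(timeoutMs) });
    return { reachable: true, status: res.status, ms: elapsed() };
  } catch (err) {
    // Covers DNS failure, connection refused, and the timeout abort.
    return { reachable: false, error: err.name, ms: elapsed() };
  }
}

// Usage against the endpoint from the curl example:
// probe("https://api.external-service.com/health").then(console.log);
```

A response with `reachable: true` and a 5xx status points at the provider; `reachable: false` points at the network path or at configuration on your side.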
Signs of a race condition or other concurrency bug: the failure is intermittent, its frequency increases with load, and it involves async code or concurrent requests.
Look for:
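One pattern that matches these signs is a check-then-act sequence split by an `await`: the value is read, control is yielded, and the write then uses a stale read. The sketch below reproduces it deterministically with a hypothetical in-memory account, followed by one possible fix that serializes operations per account. All names are illustrative, and the queueing approach is only one option among several (database transactions and locks are others).

```javascript
// Simulates an async I/O hop such as a database round trip.
const ioDelay = () => new Promise((resolve) => setTimeout(resolve, 0));

// Racy: the check and the write are separated by an await.
async function withdrawRacy(account, amount) {
  const current = account.balance;       // check
  if (current < amount) return false;
  await ioDelay();                       // another request can run here
  account.balance = current - amount;    // act on a stale read
  return true;
}

// Serialized: each operation waits for the previous one on the same account.
// Sketch only: a rejected operation would break the chain for later callers.
function withdrawSerialized(account, amount) {
  const run = async () => {
    if (account.balance < amount) return false;
    await ioDelay();
    account.balance -= amount;
    return true;
  };
  account.queue = (account.queue ?? Promise.resolve()).then(run);
  return account.queue;
}

async function demo() {
  const a = { balance: 100 };
  const racy = await Promise.all([withdrawRacy(a, 70), withdrawRacy(a, 70)]);
  // Both calls succeed: 140 is withdrawn from an account holding 100.
  console.log("racy:", racy, "balance:", a.balance);

  const b = { balance: 100 };
  const safe = await Promise.all([withdrawSerialized(b, 70), withdrawSerialized(b, 70)]);
  // The second call sees the updated balance and is refused.
  console.log("serialized:", safe, "balance:", b.balance);
}
demo();
```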
Before fixing, write a test that reproduces the bug. This:
// Test that reproduces the bug
test('should not throw when user has no orders', async () => {
  const user = await createUser({ hasOrders: false });
  // This was previously throwing TypeError: Cannot read properties of undefined
  await expect(orderService.createOrder(user.id)).resolves.toBeDefined();
});
Apply the smallest possible change that addresses the root cause.
Root cause fix vs. workaround:
| Workaround | Root cause fix |
|-----------|---------------|
| if (!user) return null | Fix the code that should always return a user to actually return a user |
| Catch and swallow the exception | Fix the condition that causes the exception |
| Increase timeout | Fix the slow operation |
| Restart service daily | Fix the memory leak |
Workarounds hide problems and accumulate technical debt. Fix root causes.
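The first row of the table can be made concrete. In the sketch below, a lookup returns `undefined` because ids are stored as numbers while callers pass strings. The workaround silences the crash but silently drops valid users; the root cause fix repairs the lookup. The data, function names, and the specific bug are all hypothetical.

```javascript
// Hypothetical data. Ids are stored as numbers.
const users = [{ id: 1, name: "Ada" }, { id: 2, name: "Lin" }];

// Bug: callers pass the id as a string (for example from a URL),
// so the strict comparison never matches.
function findUserBuggy(id) {
  return users.find((u) => u.id === id);
}

// Workaround: the caller hides the undefined and moves on.
function greetWithWorkaround(id) {
  const user = findUserBuggy(id);
  if (!user) return null; // the crash is gone, but valid users are dropped
  return `Hello, ${user.name}`;
}

// Root cause fix: normalize the id where the lookup happens.
function findUser(id) {
  const numericId = Number(id);
  return users.find((u) => u.id === numericId);
}

function greet(id) {
  const user = findUser(id);
  // A missing user is now a real condition, so surface it loudly.
  if (!user) throw new Error(`No user with id ${id}`);
  return `Hello, ${user.name}`;
}

console.log(greetWithWorkaround("1")); // null
console.log(greet("1"));               // Hello, Ada
```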
Shotgun debugging: Making random changes hoping one fixes it. This wastes time and introduces new bugs. Always test a specific hypothesis.
Debugging without reproducing: If you can't reproduce the bug, you can't verify the fix. Spend time making it reproducible before writing any fix.
Fixing symptoms: Adding null checks around a crash when the real issue is that a function is returning null when it shouldn't. The null check hides the bug.
Tunnel vision: Getting fixated on one hypothesis and forcing evidence to fit it. If evidence contradicts your hypothesis, update your hypothesis — don't rationalize the evidence away.
Not reading the error message: Error messages almost always tell you what happened. Read them fully, including the complete stack trace.
For a structured catalog of recurring error patterns and their root causes, see:
references/error-pattern-catalog.md: web app error patterns (HTTP 500/503/timeout), distributed system failures (cascading failures, split-brain, clock skew), and a log signature library for fast pattern identification