engineering/debugging/skills/performance-profiling/SKILL.md
This skill should be used when the user asks about "performance", "slow", "latency", "bottleneck", "profiling", "flamegraph", "CPU usage", "memory usage", "memory leak", "garbage collection", "GC pressure", "N+1 query", "database slow", "slow query", "index", "EXPLAIN", "benchmark", "load test", "throughput", "RPS", "requests per second", "P99", "percentile latency", "caching", "Redis", "Memcached", "async performance", "event loop lag", "blocking", "thread pool", "connection pool saturation", "heap dump", "heap snapshot", or "optimize". Also trigger for "this is too slow", "users are complaining about speed", "page load is slow", "API is timing out", "why is CPU at 100%", or "memory keeps growing".
Systematically identify and eliminate performance bottlenecks using measurement-first principles.
1. Measure → establish baseline
2. Identify → which layer is slow?
3. Profile → what is causing it within that layer?
4. Optimize → targeted fix
5. Verify → measure again, compare to baseline
6. Don't guess → return to step 1 if uncertain
The rule: measure before and after every change. Intuition about performance is usually wrong.
Establish quantitative baselines before touching anything:
# wrk (full featured; --latency prints the percentile distribution)
wrk -t4 -c50 -d30s --latency \
-H "Authorization: Bearer $TOKEN" \
http://localhost:3000/api/endpoint
# hey (simpler, good output)
hey -n 10000 -c 100 -m GET http://localhost:3000/api/endpoint
# k6 (scriptable, integrates with CI/CD)
k6 run script.js
# Apache Bench
ab -n 10000 -c 100 http://localhost:3000/api/endpoint
Record all of: P50, P95, P99, RPS, error rate.
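Load-test tools report these numbers in different shapes, so it can help to compute them the same way for every run. The following is a minimal sketch, assuming you can export per-request latencies in milliseconds; `percentile` and `summarize` are hypothetical helper names, and the nearest-rank method is one of several common percentile definitions.

```python
import math

def percentile(samples_ms, pct):
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples_ms:
        raise ValueError("no samples")
    ordered = sorted(samples_ms)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]

def summarize(samples_ms, errors, duration_s):
    """Collapse one load-test run into the baseline numbers worth recording."""
    total = len(samples_ms) + errors
    return {
        "p50_ms": percentile(samples_ms, 50),
        "p95_ms": percentile(samples_ms, 95),
        "p99_ms": percentile(samples_ms, 99),
        "rps": total / duration_s,
        "error_rate": errors / total,
    }

# Hypothetical run: 100 successful requests taking 1..100 ms, no errors, over 2 seconds.
baseline = summarize(list(range(1, 101)), errors=0, duration_s=2.0)
print(baseline)
```

Storing the resulting dictionary next to the commit hash gives the Verify step something concrete to compare against.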
-- PostgreSQL: measure a specific query
EXPLAIN (ANALYZE, BUFFERS, FORMAT TEXT)
SELECT * FROM orders WHERE customer_id = 12345 ORDER BY created_at DESC LIMIT 20;
-- Look for: total execution time, number of rows, Seq Scan vs Index Scan, buffer cache hits
-- Target: execution time < 100ms for single queries, < 10ms for hot paths
-- Find slowest queries (requires pg_stat_statements extension;
-- on PostgreSQL 12 and older the columns are mean_time and total_time)
SELECT
round(mean_exec_time::numeric, 2) AS avg_ms,
calls,
round(total_exec_time::numeric, 0) AS total_ms,
left(query, 100) AS query
FROM pg_stat_statements
ORDER BY mean_exec_time DESC
LIMIT 20;
# Unix time
time python script.py
# Detailed system resource usage
/usr/bin/time -v python script.py # Linux
/usr/bin/time -l python script.py # macOS
# Wall time vs CPU time — if wall >> CPU, you're waiting on I/O
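The wall-versus-CPU comparison can also be made inside a process. This is a rough sketch using only the standard library; `measure`, `io_like`, and `cpu_like` are illustrative names, and the sleep merely stands in for a real database or network wait.

```python
import time

def measure(func, *args, **kwargs):
    """Return (result, wall_seconds, cpu_seconds) for a single call."""
    wall_start = time.perf_counter()
    cpu_start = time.process_time()
    result = func(*args, **kwargs)
    cpu = time.process_time() - cpu_start
    wall = time.perf_counter() - wall_start
    return result, wall, cpu

def io_like():
    # Stand-in for waiting on a database or network call.
    time.sleep(0.2)

def cpu_like():
    # Stand-in for pure computation.
    return sum(i * i for i in range(300_000))

_, io_wall, io_cpu = measure(io_like)
_, cpu_wall, cpu_cpu = measure(cpu_like)
print(f"io_like:  wall={io_wall:.3f}s cpu={io_cpu:.3f}s")
print(f"cpu_like: wall={cpu_wall:.3f}s cpu={cpu_cpu:.3f}s")
```

For `io_like` the wall time should dwarf the CPU time, which is the signature of waiting on I/O; for `cpu_like` the two should be close.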
Before profiling application code, determine which layer is responsible.
User browser / client
↓ DNS + TLS + network
Load balancer
↓
Application server (CPU / memory / event loop)
↓
Database (disk I/O / CPU / indexes / locks)
↓
External APIs (network + their performance)
↓
Cache layer (Redis / Memcached)
# Is it the network?
# Compare: curl timing from close server vs. far
curl -o /dev/null -s -w "DNS: %{time_namelookup}s | Connect: %{time_connect}s | TLS: %{time_appconnect}s | First byte: %{time_starttransfer}s | Total: %{time_total}s\n" https://api.example.com/endpoint
# Is it CPU?
# Check CPU during load test
top -b -n 3 | grep "Cpu\|%CPU"
# >80% CPU → application CPU bound, profile the code
# <30% CPU but slow → I/O bound (database, network, disk)
# Is it memory / GC?
# Node.js
node --max-old-space-size=4096 --expose-gc --trace-gc server.js
# --trace-gc logs each collection with its pause time; --expose-gc lets you trigger global.gc() manually
# Is it database?
# Add timing around DB calls in application code
# Look for: individual queries > 100ms, many small queries (N+1)
# Is it an external API?
# Add per-call timing with a span or log statement
const start = Date.now();
const result = await externalApi.call(params);
logger.info({ duration_ms: Date.now() - start, api: 'payment-provider' }, 'API call completed');
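The "many small queries (N+1)" symptom from the database check is easiest to spot by counting statements per request. Below is a minimal sketch using Python's built-in sqlite3 module as a stand-in database; the schema, the `count_queries` helper, and both loader functions are hypothetical, and a real application would hook its own driver or ORM instead.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE customers (id INTEGER PRIMARY KEY, name TEXT);
    CREATE TABLE orders (id INTEGER PRIMARY KEY, customer_id INTEGER, total REAL);
""")
conn.executemany("INSERT INTO customers VALUES (?, ?)",
                 [(i, f"c{i}") for i in range(1, 51)])
conn.executemany("INSERT INTO orders (customer_id, total) VALUES (?, ?)",
                 [(i, 10.0 * i) for i in range(1, 51)])
conn.commit()

statements = []
conn.set_trace_callback(statements.append)  # record every SQL statement executed

def count_queries(fn):
    """Run fn and return (result, number of SQL statements it issued)."""
    statements.clear()
    result = fn()
    return result, len(statements)

def n_plus_one():
    # One query for the list, then one more query per customer.
    totals = {}
    for cid, name in conn.execute("SELECT id, name FROM customers").fetchall():
        row = conn.execute(
            "SELECT SUM(total) FROM orders WHERE customer_id = ?", (cid,)
        ).fetchone()
        totals[name] = row[0]
    return totals

def batched():
    # Same data in a single round trip.
    rows = conn.execute("""
        SELECT c.name, SUM(o.total)
        FROM customers c JOIN orders o ON o.customer_id = c.id
        GROUP BY c.id, c.name
    """).fetchall()
    return dict(rows)

slow_result, slow_count = count_queries(n_plus_one)
fast_result, fast_count = count_queries(batched)
print(f"N+1: {slow_count} statements, batched: {fast_count} statement(s)")
```

Both functions return the same totals; only the statement count differs, and that count is the number to log per request.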
CPU profiling:
# Built-in profiler (good for getting started)
node --prof server.js
# Load test → Ctrl+C → writes an isolate-0x*.log file in the working directory
node --prof-process isolate-0x*.log > processed.txt
# Converts the .log to readable text
# 0x (flamegraph — best visual tool)
npx 0x --output-dir ./profile -- node server.js
# Load test → Ctrl+C → opens flamegraph in browser
# Clinic.js (comprehensive diagnostics)
npx clinic flame -- node server.js
npx clinic doctor -- node server.js # Identifies event loop lag, GC issues, I/O
Event loop lag (the Node.js killer):
// Detect event loop lag
const { monitorEventLoopDelay } = require('perf_hooks');
const histogram = monitorEventLoopDelay({ resolution: 10 });
histogram.enable();
setInterval(() => {
const p99Lag = histogram.percentile(99) / 1e6; // Convert ns to ms
if (p99Lag > 10) {
logger.warn({ event_loop_lag_p99_ms: p99Lag }, 'Event loop lag detected');
}
histogram.reset();
}, 5000);
Common Node.js performance issues:
- Synchronous file system calls (fs.readFileSync, fs.writeFileSync) in request handlers
- Using dns.lookup instead of dns.resolve
# cProfile: find which functions are slow
import cProfile, pstats, io
def profile_function(func, *args, **kwargs):
pr = cProfile.Profile()
pr.enable()
result = func(*args, **kwargs)
pr.disable()
s = io.StringIO()
stats = pstats.Stats(pr, stream=s)
stats.sort_stats('cumulative')
stats.print_stats(30)
print(s.getvalue())
return result
# line_profiler: find which LINES are slow
# pip install line_profiler
# Run with: kernprof -l -v script.py
@profile # Decorator injected by kernprof at runtime; no import needed
def slow_function(data):
results = []
for item in data:
results.append(process(item)) # Is this the slow line?
return results
# py-spy — attach to running process without restarts
# (Production-safe profiler)
py-spy top --pid <PID> # Like 'top' but for Python functions
py-spy record --output flame.svg --pid <PID> # Flamegraph of running process
Common Python performance issues:
- Not using __slots__ for classes with many instances
- String concatenation with + in loops (use join())
- Not using asyncio for I/O-bound code
// Add pprof to your server
import (
"net/http"
_ "net/http/pprof" // This is all you need
)
// In main:
go http.ListenAndServe("localhost:6060", nil)
# Capture 30-second CPU profile
go tool pprof "http://localhost:6060/debug/pprof/profile?seconds=30"
# Check heap (memory usage)
go tool pprof http://localhost:6060/debug/pprof/heap
# Goroutine leaks (goroutines that never stop)
go tool pprof http://localhost:6060/debug/pprof/goroutine
# Interactive flamegraph
go tool pprof -http=:8080 cpu.prof
Common Go performance issues:
- Run go build -gcflags="-m" to see the compiler's escape analysis decisions
- Use sync/atomic for simple counters
- fmt.Sprintf in hot paths: use strings.Builder instead
EXPLAIN (ANALYZE, BUFFERS) SELECT * FROM orders WHERE customer_id = 5;
Key things to look for:
| Term | What it means | Action if bad |
|------|-------------|--------------|
| Seq Scan on large table | Reading entire table (no index) | Add index |
| Index Scan | Using index — good | Fine |
| Rows Removed by Filter: 50000 | Index exists but not selective enough | Better index or composite index |
| Buffers: shared hit=0 read=5000 | Most blocks read from disk | Increase shared_buffers, add index |
| Actual rows far from estimated rows | Stale statistics | ANALYZE table_name |
-- Equality first, range second, highest cardinality first
-- Query: WHERE user_id = ? AND status = 'active' AND created_at > ? AND deleted_at IS NULL
CREATE INDEX idx_orders_user_status_date
ON orders(user_id, status, created_at)
WHERE deleted_at IS NULL; -- Partial index: smaller and faster, but only used when the query also filters on deleted_at IS NULL
-- Use CONCURRENTLY to avoid table lock in production
CREATE INDEX CONCURRENTLY idx_name ON table(column);
Too few connections: requests queue up, latency spikes under load
Too many connections: database thrashes, memory exhaustion
PostgreSQL max_connections × 0.8 = usable connections
Distribute across all app instances
Example: 100 max_connections × 0.8 = 80 usable
With 4 app instances: pool_size = 20 per instance
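The sizing arithmetic above fits in a few lines. This is a sketch of that rule of thumb only; `pool_size_per_instance` is a hypothetical helper, and the 0.8 factor is the headroom figure from the text, not a PostgreSQL setting.

```python
def pool_size_per_instance(max_connections, app_instances, usable_fraction=0.8):
    """Split the usable share of max_connections evenly across app instances."""
    if app_instances < 1:
        raise ValueError("need at least one app instance")
    usable = int(max_connections * usable_fraction)  # round down to stay under the limit
    return max(1, usable // app_instances)

print(pool_size_per_instance(100, 4))  # 20, matching the worked example
```

Re-run the calculation whenever the number of app instances changes, since autoscaling can otherwise push the total past max_connections.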
// node-postgres connection pool
const pool = new Pool({
max: 20, // Max connections per pool
idleTimeoutMillis: 30000, // Close idle connections after 30s
connectionTimeoutMillis: 2000, // Fail fast if pool is full
});
Cache when:
Don't cache when:
Browser cache → CDN → Application cache (in-process) → Redis → Database
Use the closest cache that can serve the request with acceptable freshness.
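As an illustration of the in-process layer in that chain, here is a minimal TTL cache sketch. `TTLCache` and `load_from_database` are hypothetical names; the clock is injectable so expiry can be exercised without waiting, and the sketch deliberately omits size limits and eviction, which a real in-process cache would need.

```python
import time

class TTLCache:
    """Tiny in-process cache: entries expire ttl_seconds after being stored."""

    def __init__(self, ttl_seconds, clock=time.monotonic):
        self._ttl = ttl_seconds
        self._clock = clock
        self._entries = {}  # key -> (expires_at, value)

    def get_or_load(self, key, loader):
        """Return the cached value, calling loader(key) on a miss or after expiry."""
        now = self._clock()
        entry = self._entries.get(key)
        if entry is not None and entry[0] > now:
            return entry[1]
        value = loader(key)
        self._entries[key] = (now + self._ttl, value)
        return value

    def invalidate(self, key):
        """Drop a key so the next read goes back to the source."""
        self._entries.pop(key, None)

fake_now = [0.0]
calls = []

def load_from_database(key):
    # Hypothetical slow lookup; records each call so misses can be counted.
    calls.append(key)
    return f"row-for-{key}"

cache = TTLCache(ttl_seconds=300, clock=lambda: fake_now[0])
cache.get_or_load("user:1", load_from_database)  # miss: loads from source
cache.get_or_load("user:1", load_from_database)  # hit: served from memory
fake_now[0] = 301.0
cache.get_or_load("user:1", load_from_database)  # expired: loads again
print(len(calls))
```

An in-process cache is per instance, so with several app instances each one can hold a different version until its TTL expires; that is the freshness trade-off to weigh against a shared cache such as Redis.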
const CACHE_TTL = 300; // 5 minutes
async function getCachedData(key: string): Promise<Data> {
// Check cache
const cached = await redis.get(key);
if (cached !== null) {
return JSON.parse(cached);
}
// Cache miss — fetch from source
const data = await fetchFromDatabase(key);
// Store with TTL
await redis.setex(key, CACHE_TTL, JSON.stringify(data));
return data;
}
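// Cache invalidation on write
async function updateData(id: string, updates: Partial<Data>): Promise<void> {
  await db.update(id, updates);
  await redis.del(`data:${id}`); // Invalidate specific key
  // DEL does not expand glob patterns, so walk matching list keys with SCAN
  // (ioredis-style scan signature assumed; adjust for your client)
  let cursor = '0';
  do {
    const [next, keys] = await redis.scan(cursor, 'MATCH', 'data:list:*', 'COUNT', 100);
    if (keys.length > 0) await redis.del(...keys);
    cursor = next;
  } while (cursor !== '0');
}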
| Issue | Symptom | Fix | Expected gain |
|-------|---------|-----|--------------|
| N+1 queries | Many small DB queries per request | Eager loading, batch queries | 10–100× |
| Missing index | Seq scan on large table | Add targeted index | 10–1000× |
| Synchronous I/O in async code | Event loop lag | Move to worker thread | 5–50× |
| No caching of expensive data | Same query repeatedly | Add Redis cache | 10–100× |
| Serialized I/O | Sequential awaits that could be parallel | Promise.all, goroutines | Proportional to I/O count |
| CPU-bound in interpreted language | High CPU, slow loops | Compiled extension, SIMD, batch | 10–100× |
| GC pressure | Periodic latency spikes | Reduce allocations, tune GC | 2–10× |
| Connection pool exhaustion | Requests queuing | Increase pool size (within DB limits) | Eliminates queuing |
| Large payload | Slow network transfer | Pagination, compression, CDN | Proportional to payload size |
| DNS lookups per request | Consistent added latency | Connection pooling / keep-alive | 10–200ms |
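The "Serialized I/O" row is often the cheapest one to fix. The sketch below shows the Python counterpart of the Promise.all fix using asyncio.gather; `fetch` is a hypothetical I/O call whose sleep stands in for network or database latency, and the timings it prints will vary by machine.

```python
import asyncio
import time

async def fetch(name, delay_s):
    """Hypothetical I/O call; the sleep stands in for network or database latency."""
    await asyncio.sleep(delay_s)
    return name

async def sequential(names, delay_s):
    # Each await finishes before the next call starts.
    return [await fetch(n, delay_s) for n in names]

async def concurrent(names, delay_s):
    # All calls are in flight at once; results come back in input order.
    return await asyncio.gather(*(fetch(n, delay_s) for n in names))

def timed(coro):
    start = time.perf_counter()
    result = asyncio.run(coro)
    return result, time.perf_counter() - start

names = ["users", "orders", "inventory", "pricing"]
seq_result, seq_s = timed(sequential(names, 0.05))
con_result, con_s = timed(concurrent(names, 0.05))
print(f"sequential: {seq_s:.2f}s  concurrent: {con_s:.2f}s")
```

This only helps when the calls are independent; if one call needs another's result, they have to stay sequential.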
For ready-to-run profiling commands across languages and runtimes, see:
references/profiling-commands.md: copy-paste profiling commands for Python (cProfile, py-spy), Go (pprof), Node.js (clinic.js, 0x), PostgreSQL slow query analysis, and Linux system profiling (perf, flamegraph)