
jaykim88/performance-profiling

Name: performance-profiling
Author: jaykim88

plugins/backend-toolkit/skills/performance-profiling/SKILL.md

npx skillsauth add jaykim88/claude-ai-engineering performance-profiling


Performance Profiling

Purpose

Locate the actual backend bottleneck with data (not intuition), fix it, and prove the fix with load tests gated on tail latency — so the system scales predictably.

These principles are universal: measure-first profiling, p95/p99 over averages, connection pooling, and SLO-gated load testing apply to any backend. Only the profiler and load-testing tool differ by stack.

Procedure

1. Measure before optimizing: find the real bottleneck
   - Use observability metrics (observability-setup) to see where time goes: DB, external call, CPU, or lock contention
   - The bottleneck is usually DB query time (→ query-optimization), not app CPU; confirm before profiling app code
2. Right-size connection pooling
   - DB connections are scarce: an unbounded or oversized pool exhausts the DB, while an undersized one starves the app
   - Use a transaction-mode connection pooler for serverless or high-concurrency workloads (specific poolers in Implementation)
   - Pool size per instance ≈ DB max connections ÷ number of app instances, not "bigger is better"
3. Load test with SLO-aligned thresholds
   - Define p(95) / p(99) latency and error-rate thresholds in the test itself
   - Tail latency (p99), not the average, defines API health; averages hide the slow requests users feel
   - Test realistic scenarios (ramp-up, sustained, spike), not just a flat hammer
4. CPU profiling when app code is the bottleneck
   - Capture a flame graph and find the hot path (sync work blocking the event loop in Node, GIL contention in Python)
   - Move CPU-heavy work off the request path (worker / background-jobs) or optimize the hot function
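
As a rough illustration of the pool-sizing rule above (DB max connections ÷ instances), the sketch below splits a connection budget across app instances. The function name `poolSizePerInstance`, the `reservedConnections` headroom parameter, and the example numbers are assumptions made for illustration; they do not come from the skill itself.

```typescript
// Hypothetical helper: splits a database's connection budget across app instances.
// `reservedConnections` (headroom for migrations, admin sessions, etc.) is an assumption.
function poolSizePerInstance(
  dbMaxConnections: number,
  instances: number,
  reservedConnections = 0,
): number {
  if (instances < 1) {
    throw new RangeError("instances must be at least 1");
  }
  const usable = dbMaxConnections - reservedConnections;
  if (usable < instances) {
    throw new RangeError("not enough connections for one per instance");
  }
  // Round down so the sum across all instances never exceeds the usable budget.
  return Math.floor(usable / instances);
}

// Example: a database capped at 100 connections, 10 held back, 6 app instances.
const perInstance = poolSizePerInstance(100, 6, 10);
console.log(perInstance); // 15
```

Rounding down is deliberate: if autoscaling adds instances, the per-instance limit has to shrink, which is one reason a transaction-mode pooler in front of the database is often easier to operate than per-app limits alone.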

4b. Memory profiling: leaks and unbounded growth

- "Slow over time" is often a leak, not a CPU issue: take two heap snapshots at different points in the process's uptime and diff them (Retainers view in Chrome DevTools; node --heapsnapshot-signal=<signal> to capture)
- Watch RSS / heapUsed trends in observability; a steady climb between deploys indicates a leak
- Common Node leaks: event-listener accumulation, closures over large objects in a module-level cache, unbounded Map/array
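
One way to turn "a steady climb" into a check is to compare the heap floor (the lowest reading, typically just after a garbage collection) early and late in a sampling window. The sketch below is a simple heuristic under that assumption. The names `heapFloorGrowth` and `looksLikeLeak`, the 10 MB threshold, and the sample values are all hypothetical.

```typescript
// Difference between the lowest heap reading in the second half of the window
// and the lowest reading in the first half. A rising floor suggests retained memory.
function heapFloorGrowth(samples: number[]): number {
  if (samples.length < 2) return 0;
  const mid = Math.floor(samples.length / 2);
  const firstFloor = Math.min(...samples.slice(0, mid));
  const secondFloor = Math.min(...samples.slice(mid));
  return secondFloor - firstFloor;
}

// Hypothetical rule: flag a leak when the floor rises by more than `maxGrowthBytes`.
function looksLikeLeak(samples: number[], maxGrowthBytes: number): boolean {
  return heapFloorGrowth(samples) > maxGrowthBytes;
}

// In a real service the samples would be process.memoryUsage().heapUsed readings
// taken at a fixed interval; fixed numbers keep this sketch deterministic.
const MB = 1024 * 1024;
const sawtooth = [50, 80, 52, 79, 51, 81, 50, 80].map((m) => m * MB); // normal GC cycles
const climbing = [50, 58, 66, 75, 83, 90, 99, 107].map((m) => m * MB); // floor keeps rising

console.log(looksLikeLeak(sawtooth, 10 * MB)); // false
console.log(looksLikeLeak(climbing, 10 * MB)); // true
```

A check like this only says that something is retained; the heap snapshot diff is still what identifies the retaining object.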

4c. Event-loop lag as a first-class signal (Node-specific, conceptually general)

- When the event loop blocks, every concurrent request stalls (not just the slow one). Monitor eventLoopUtilization() / monitorEventLoopDelay() from perf_hooks and alert when they cross a threshold
- Equivalent in other runtimes: GC pause percentage, or garbage-collection time per minute
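
A minimal sketch of the event-loop monitoring described above, using Node's `monitorEventLoopDelay` histogram from `perf_hooks` (which reports values in nanoseconds). The helper names `startLagMonitor` and `lagBreached`, the 20 ms resolution, and any threshold passed in are assumptions for illustration, not recommendations from the skill.

```typescript
import { monitorEventLoopDelay } from "node:perf_hooks";

const NS_PER_MS = 1e6;

// Hypothetical alert rule: p99 event-loop delay above the threshold counts as a breach.
function lagBreached(p99Ns: number, thresholdMs: number): boolean {
  return p99Ns / NS_PER_MS > thresholdMs;
}

// Starts sampling event-loop delay and returns a reader function.
// Each call to the reader reports p99 delay in ms for the window since the last call.
function startLagMonitor(resolutionMs = 20): () => number {
  const histogram = monitorEventLoopDelay({ resolution: resolutionMs });
  histogram.enable();
  return () => {
    const p99Ms = histogram.percentile(99) / NS_PER_MS;
    histogram.reset(); // start a fresh window for the next reading
    return p99Ms;
  };
}
```

In a service, the reader returned by `startLagMonitor()` would typically be called from the existing metrics loop, with the value exported as a gauge and the alert threshold defined alongside the latency SLO.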

5. Address tail latency causes
   - GC pauses, cold caches, connection acquisition waits, N+1 (→ query-optimization), lock contention (→ transaction-management)
6. Validate (validation loop)
   - Re-run the load test after each fix; if p99 still breaches the SLO threshold, the bottleneck has moved: re-profile and fix the new top contributor
   - Gate the load test in CI (fail on threshold breach) so regressions are caught
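
To make the "tail latency, not average" point in the procedure concrete, the sketch below computes the mean and nearest-rank percentiles for a made-up latency sample. The `percentile` and `mean` helpers and the latency values are illustrative assumptions; load-testing tools may use a different percentile interpolation.

```typescript
// Nearest-rank percentile: the smallest sample with at least p% of samples at or below it.
function percentile(values: number[], p: number): number {
  if (values.length === 0) throw new RangeError("no samples");
  const sorted = [...values].sort((a, b) => a - b);
  const rank = Math.ceil((p * sorted.length) / 100);
  return sorted[Math.max(rank, 1) - 1];
}

function mean(values: number[]): number {
  return values.reduce((sum, v) => sum + v, 0) / values.length;
}

// Illustrative latencies in ms: 95 fast requests and 5 very slow ones (made-up numbers).
const latencies: number[] = [
  ...Array.from({ length: 95 }, () => 40),
  ...Array.from({ length: 5 }, () => 2000),
];

console.log(mean(latencies));           // 138
console.log(percentile(latencies, 95)); // 40
console.log(percentile(latencies, 99)); // 2000
```

Here the mean (138 ms) and even p95 (40 ms) look healthy, while p99 shows that one request in twenty takes two seconds, which is why the thresholds are defined on both p95 and p99.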

Anti-patterns

| ❌ Anti-pattern | ✅ Correct |
|---|---|
| Optimizing by guessing | Profile/measure first; fix the proven bottleneck |
| Tuning on average latency | Gate on p95/p99 tail latency |
| Unbounded / "bigger" connection pool | Right-sized pool + transaction-mode pooler |
| CPU-heavy work on the request thread (blocks event loop) | Offload to a worker / background job |
| Load test = flat hammer | Realistic ramp/sustained/spike scenarios |
| "Slow over time" treated as CPU when it's a leak | Heap snapshot diff between two timepoints; track RSS trend |
| Event-loop lag unmonitored (silent stalls) | Track eventLoopUtilization / GC pause % with alert thresholds |

Severity tiers

| Tier | Examples | Action SLA |
|---|---|---|
| Critical | Connection pool exhaustion causing outages; p99 > SLO by multiples; event loop blocked by sync CPU work | Block release; fix immediately |
| Major | No load test before a traffic event; pool size guessed; tail latency unmeasured | Fix this sprint |
| Minor | Load test not in CI; minor GC tuning opportunity | Schedule within 2 sprints |

Completion Criteria

[ ] Bottleneck identified with metrics/profiler (not guessed)
[ ] Connection pool right-sized + pooler in place
[ ] Load test defines + passes p95/p99 + error-rate thresholds
[ ] CPU hot paths offloaded or optimized
[ ] Load test gated in CI

Output

Load test scripts: k6 with SLO thresholds (exit-code gating)
Profiling report: docs/perf-profile-YYYY-MM-DD.md — bottleneck, fix, before/after p95/p99
Commit format: perf(backend): right-size connection pool / perf(backend): offload <work> to worker

Implementation

TypeScript + Node + Postgres (default)

Pooling: PgBouncer or Supabase pooler (transaction mode); Prisma connection limit tuned to pool
Load test: Grafana k6 — thresholds: { http_req_duration: ['p(95)<300','p(99)<800'] }; CI fails on breach (exit 99)
Profiling: node --prof or 0x for flame graphs (note: clinic.js is no longer actively maintained; use node --prof or 0x instead)
Watch: event-loop lag (don't block the loop with synchronous CPU work)
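
The k6 thresholds quoted above could sit in an `options` object along these lines. This is a configuration sketch only: the stage durations, virtual-user targets, and the `http_req_failed` error-rate limit are assumptions chosen to show a ramp-up, sustained, and spike shape. A complete k6 script also needs a default exported function that issues the requests, which is omitted here.

```typescript
// Sketch of a k6 `options` export. Durations, targets, and the error-rate
// limit are illustrative assumptions; the latency thresholds are the ones
// quoted in the Implementation notes.
export const options = {
  stages: [
    { duration: "2m", target: 50 },   // ramp-up
    { duration: "5m", target: 50 },   // sustained load
    { duration: "30s", target: 200 }, // spike
    { duration: "1m", target: 0 },    // ramp-down
  ],
  thresholds: {
    // Tail-latency gates in milliseconds; a breach makes k6 exit non-zero.
    http_req_duration: ["p(95)<300", "p(99)<800"],
    // Assumed error budget: fewer than 1% failed requests.
    http_req_failed: ["rate<0.01"],
  },
};
```

Because k6 exits with a non-zero code when a threshold fails, running the script as a CI step is enough to fail the pipeline on a breach.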

Other stacks

Python / FastAPI: py-spy / cProfile flame graphs; gunicorn/uvicorn worker tuning; PgBouncer
Go: pprof (built-in, excellent); goroutine/heap profiles; database/sql SetMaxOpenConns
Universal: k6 (Go engine, JS scripts) is stack-agnostic for load testing; p95/p99 + pooling are universal; flame graphs exist for every runtime

Related skills

query-optimization — DB query time is usually the top bottleneck
caching-strategy — cache after optimizing, to shed read load
observability-setup — metrics tell you WHERE to profile

Reference

Key insight encoded: Define SLO-aligned p(95)/p(99) thresholds in the load test so a breach fails CI automatically — tail latency, not averages, defines API health.
Caveat: clinic.js is no longer actively maintained. Cite it for flame-graph concepts, but recommend node --prof / 0x for profiling and k6 for load testing as the live toolchain.


Find and fix backend bottlenecks — connection pooling, p95/p99 latency, load testing with SLO-aligned thresholds (k6), and CPU profiling (flame graphs). Use when latency is high, throughput plateaus, or before scaling traffic. Not for DB query specifics (use query-optimization) or read-load shedding (use caching-strategy).

development

Updated Jun 9, 2026


npx skillsauth add jaykim88/claude-ai-engineering performance-profiling

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status |
|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean (95%) |
| Semgrep | Static code analysis for vulnerabilities | Clean (95%) |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean (95%) |
| Snyk (dep) | Open source security scanning | Skipped (50%) |
| Socket.dev | Supply chain security analysis | Skipped (50%) |
| VirusTotal | Multi-engine malware detection | Skipped (50%) |
| CrowdStrike | Advanced threat intelligence | Skipped (50%) |
| OSV-Scanner | Open Source Vulnerability database check | Skipped (50%) |
| OWASP Dep-Check | | Skipped (50%) |

Last scanned: Jun 9, 2026, 8:27 AM · 148.0s · 1 file scanned

SKILL.md

name: performance-profiling
description: Find and fix backend bottlenecks - connection pooling, p95/p99 latency, load testing with SLO-aligned thresholds (k6), and CPU profiling (flame graphs). Use when latency is high, throughput plateaus, or before scaling traffic. Not for DB query specifics (use query-optimization) or read-load shedding (use caching-strategy).
license: MIT


Related Skills

jaykim88/webhook-design

development

Design webhooks correctly on both sides: sending (HMAC signing, retries with backoff, at-least-once) and receiving (verify signature on raw body, enqueue + 200 fast, dedupe on event id). Use when adding webhook delivery or consuming a provider's webhooks. Not for internal service-to-service events (use async-messaging) or general outbound-call retry policy (use resilience-patterns).

SKILL.md · Updated Jun 9, 2026

jaykim88/transaction-management

testing

Use transactions and isolation levels correctly: keep them short, no network calls inside, explicit isolation, retry on serialization conflicts, and choose optimistic vs pessimistic locking. Use when a write spans multiple tables, when concurrent updates corrupt data, or when designing money/inventory flows. Not for cross-service event delivery (use async-messaging Outbox) or schema-level constraints (use schema-design).

SKILL.md · Updated Jun 9, 2026

jaykim88/test-strategy

development

Backend testing pyramid: unit for pure logic, integration against a real DB (Testcontainers), and consumer-driven contract testing (Pact) for service boundaries. Use before a feature, after a bug fix, or when services break each other on deploy. Not for load testing (use performance-profiling) or security testing (use backend-security-audit).

SKILL.md · Updated Jun 9, 2026

jaykim88/schema-design

data-ai

Design a relational schema: normalize to 3NF then denormalize with justification, choose the right Postgres index type per data shape, enforce constraints at the DB. Use when modeling a new domain, when queries are slow, or before a migration. Not for diagnosing slow queries (use query-optimization) or shipping the change without downtime (use migration-strategy).

SKILL.md · Updated Jun 9, 2026


Install manually


# Clone the repo
git clone https://github.com/jaykim88/claude-ai-engineering.git

# Copy into Claude Code skills folder (global)
cp -r claude-ai-engineering/plugins/backend-toolkit/skills/performance-profiling ~/.claude/skills/


Repository

jaykim88/claude-ai-engineering

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT