plugins/backend-toolkit/skills/resilience-patterns/SKILL.md
Apply reliability primitives — capped exponential backoff with jitter, circuit breakers, timeouts, and idempotency keys — to every outbound call and mutating endpoint. Use when integrating an external service, when retries cause duplicate effects, or before shipping a payment/order flow. Not for job-runner retry config specifically (use background-jobs) or webhook-delivery specifics (use webhook-design, which reuses these primitives).
`npx skillsauth add jaykim88/claude-ai-engineering resilience-patterns`
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Make every outbound call and every mutating endpoint survive partial failure without causing duplicate effects or cascading outages. Retries, timeouts, breakers, and idempotency keys are the primitives — apply them deliberately, not reflexively.
Universal: capped exponential backoff + jitter, circuit breakers, timeouts derived from p99.9 latency, and idempotency keys are distributed-systems primitives independent of language; only the library differs.
1. Set timeouts from p99.9 latency; never infinite
2. Retry ONLY idempotent operations
3. Use capped exponential backoff + full jitter: `delay = random(0, min(cap, base * 2^attempt))`
4. Idempotency keys for mutations: `Idempotency-Key: <uuid>` header on POST
5. Circuit breaker for repeatedly-failing dependencies
5b. Bulkhead: isolate critical paths from one bad dependency
5c. Define a fallback chain for high-availability paths
5d. Propagate deadlines across hops
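
The backoff formula and the retry rules in the steps above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: `TransientError`, `full_jitter_delay`, and `call_with_retry` are hypothetical names, and the `base`, `cap`, and `max_attempts` defaults are placeholder values to be tuned per dependency.

```python
import random
import time


class TransientError(Exception):
    """Hypothetical marker for retryable failures (e.g. a timeout or HTTP 503)."""


def full_jitter_delay(attempt, base=0.1, cap=5.0, rng=random.random):
    """Return random(0, min(cap, base * 2^attempt)) in seconds."""
    return rng() * min(cap, base * (2 ** attempt))


def call_with_retry(operation, max_attempts=4, sleep=time.sleep, **delay_kwargs):
    """Run an idempotent operation, retrying only TransientError, at most max_attempts times."""
    for attempt in range(max_attempts):
        try:
            return operation()
        except TransientError:
            if attempt == max_attempts - 1:
                raise  # attempts exhausted: the caller falls back or dead-letters
            sleep(full_jitter_delay(attempt, **delay_kwargs))
```

Any exception other than `TransientError` propagates immediately, which mirrors the rule of not retrying 4xx-style errors. The `rng` and `sleep` parameters exist only so the behaviour can be tested without real waiting.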
| ❌ Anti-pattern | ✅ Correct |
|---|---|
| Retry a POST with no idempotency key | Idempotency-Key header + stored result |
| Exponential backoff without jitter | Capped exponential + full jitter |
| No timeout on outbound call | Timeout from dependency p99.9 |
| Infinite retries | Cap attempts → fallback / DLQ |
| Catch-all retry on every error | Retry only transient errors (timeout/503), not 4xx |
| One bad dependency drains the shared connection pool | Bulkhead per dependency (separate pool / concurrency limit) |
| Improvising the fallback during the incident | Predeclared fallback chain (primary → secondary → degraded) |
| 2s downstream timeout under an 800ms caller deadline | Propagate the deadline; downstream timeout = min(local, remaining) |
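
The last row's rule, downstream timeout = min(local, remaining), might look like the sketch below. It assumes the caller passes an absolute deadline measured on a monotonic clock; `remaining_budget` and `downstream_timeout` are hypothetical helper names, and how the deadline travels between services (header, RPC metadata) is left open.

```python
import time


def remaining_budget(deadline, now=time.monotonic):
    """Seconds left before the caller's absolute deadline (monotonic clock)."""
    return deadline - now()


def downstream_timeout(local_timeout, deadline, now=time.monotonic):
    """Timeout for the next hop: min(local, remaining). Fails fast if the budget is spent."""
    remaining = remaining_budget(deadline, now)
    if remaining <= 0:
        raise TimeoutError("caller deadline already exceeded")
    return min(local_timeout, remaining)
```

With a 2s local timeout and 800ms left on the caller's deadline, the downstream call gets roughly 0.8s rather than 2s, so it cannot outlive the request that triggered it.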
| Tier | Examples | Action SLA |
|---|---|---|
| Critical | Payment/order POST retried without idempotency key (duplicate charges); no timeout on a synchronous external call | Block release; fix immediately |
| Major | Backoff without jitter; missing circuit breaker on a known-flaky dependency | Fix this sprint |
| Minor | Retry cap slightly high; breaker cooldown untuned | Schedule within 2 sprints |
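
For the circuit breaker that the tiers above refer to, a minimal single-threaded sketch follows. `CircuitBreaker`, `CircuitOpenError`, and the `failure_threshold` and `cooldown` defaults are illustrative assumptions, not values from this skill. A production setup would more likely use one of the breaker libraries and add locking.

```python
import time


class CircuitOpenError(Exception):
    """Raised when the breaker rejects a call without contacting the dependency."""


class CircuitBreaker:
    def __init__(self, failure_threshold=5, cooldown=30.0, clock=time.monotonic):
        self.failure_threshold = failure_threshold
        self.cooldown = cooldown
        self.clock = clock
        self.failures = 0
        self.opened_at = None  # None means the circuit is closed

    def call(self, operation):
        if self.opened_at is not None:
            if self.clock() - self.opened_at < self.cooldown:
                raise CircuitOpenError("dependency is cooling down")
            # Cooldown elapsed: half-open, so this one trial call goes through.
        try:
            result = operation()
        except Exception:
            self.failures += 1
            if self.opened_at is not None or self.failures >= self.failure_threshold:
                self.opened_at = self.clock()  # open, or re-open after a failed trial
            raise
        # Success closes the circuit and clears the failure count.
        self.failures = 0
        self.opened_at = None
        return result
```

While the circuit is open, callers get `CircuitOpenError` immediately instead of waiting on a timeout, which is the point at which a predeclared fallback can take over.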
- `feat(resilience): add idempotency key to <endpoint>` / `fix(resilience): cap backoff + jitter on <client>`

- `idempotency_key → response` in Postgres, checked in the same Prisma `$transaction` as the write
- `fetch`/`axios` calls in a Cockatiel policy
- `tenacity` (retry/backoff) + `pybreaker` (circuit breaker)
- `cenkalti/backoff` + `sony/gobreaker`
- … `Idempotency-Key` is the reference); jitter formula is math, not library-specific

- `webhook-design`: webhook delivery uses the same retry + idempotency primitives
- `background-jobs`: job retries must be idempotent
- `transaction-management`: idempotency keys often stored in the same transaction as the effect

… `Idempotency-Key` for POST). Set timeouts from p99.9, not guesses.
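
The idea of storing `idempotency_key → response` in the same transaction as the write can be illustrated as follows. This sketch uses Python's built-in `sqlite3` with an in-memory database as a stand-in for Postgres and Prisma. The table names, `open_db`, and `create_order` are hypothetical.

```python
import json
import sqlite3


def open_db():
    """In-memory SQLite stands in for Postgres in this sketch."""
    db = sqlite3.connect(":memory:")
    db.execute("CREATE TABLE orders (id INTEGER PRIMARY KEY, amount_cents INTEGER)")
    db.execute("CREATE TABLE idempotency_keys (key TEXT PRIMARY KEY, response TEXT)")
    return db


def create_order(db, idempotency_key, amount_cents):
    """Create an order once per key; a replayed key returns the stored response."""
    with db:  # the two INSERTs below commit together or roll back together
        row = db.execute(
            "SELECT response FROM idempotency_keys WHERE key = ?", (idempotency_key,)
        ).fetchone()
        if row is not None:
            return json.loads(row[0])  # replay: no second order is written
        cur = db.execute("INSERT INTO orders (amount_cents) VALUES (?)", (amount_cents,))
        response = {"order_id": cur.lastrowid, "amount_cents": amount_cents}
        db.execute(
            "INSERT INTO idempotency_keys (key, response) VALUES (?, ?)",
            (idempotency_key, json.dumps(response)),
        )
        return response
```

Calling `create_order(db, "key-1", 500)` twice returns the same response and leaves a single row in `orders`. The primary key on `idempotency_keys` is what would guard against two concurrent requests with the same key: the second insert fails and the order write rolls back with it.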