skills/circuit-breakers-and-retries/SKILL.md
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (three layers each making 4 attempts = 4³ = 64× amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
TL;DR: Retries amplify exponentially with the number of layers (3 retries, i.e. 4 attempts, at each of three layers = 4³ = 64× origin load). Always pair retries with a circuit breaker, full-jitter exponential backoff (`random(0, min(cap, base * 2^attempt))`), a per-request attempt cap (3), and a per-client retry-ratio budget (10% per Google SRE). Set absolute deadlines at the edge and propagate them: "you don't get credit for late assignments with RPCs." Hystrix has been in maintenance mode since 2018; use Resilience4j (JVM) or Polly (.NET).
| Symptom | Section |
|---|---|
| "Cascading failures took down everything" | Retry amplification math |
| "What state machine should the breaker have?" | Circuit breaker states |
| "How long to wait between retries?" | Backoff with jitter |
| "When NOT to retry?" | Retry decision rules |
| "Deadline propagation" | Deadlines |
| "Slow calls dragging down the system" | Slow-call rate |
flowchart TD
A[Outbound call to dependency] --> B{Circuit breaker state?}
B -->|OPEN| C[Fail fast<br/>CallNotPermittedException]
B -->|HALF_OPEN| D{Within probe-call quota?}
D -->|No| C
D -->|Yes| E[Allow call]
B -->|CLOSED| E
E --> F{Call succeeds?}
F -->|Yes| G[Record success<br/>respect deadline]
F -->|No, retriable error| H{Retry budget remaining?<br/>per-request <3 AND<br/>per-client retry-ratio <10%}
H -->|No| I[Return failure to caller<br/>'overloaded; don't retry']
H -->|Yes| J[Wait full-jitter delay<br/>random 0..min cap, base*2^n]
J --> K{Time left in deadline?}
K -->|No| I
K -->|Yes| E
F -->|No, non-retriable<br/>4xx malformed, auth| I
G --> L[Update breaker stats<br/>failure rate / slow call rate]
C --> L
From Google SRE Book Ch. 22 (Addressing Cascading Failures):
Avoid amplifying retries by issuing retries at multiple levels: a single request at the highest layer may produce a number of attempts as large as the product of the number of attempts at each layer to the lowest layer. If the database can't service requests because it's overloaded, and the backend, frontend, and JavaScript layers all issue 3 retries (4 attempts), then a single user action may create 64 attempts (4³) on the database.
The rule that follows: retry only at the layer immediately above the failing dependency. Lower layers should return "overloaded; don't retry" — typically HTTP 503 with Retry-After, or a domain-specific non-retriable error.
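The arithmetic in the quote can be checked with a short sketch. `worstCaseAttempts` is a hypothetical helper name introduced here for illustration; it multiplies the per-layer attempt counts (retries + 1), which is the product the SRE Book describes.

```javascript
// Worst-case number of attempts that reach the lowest dependency when
// every layer above it retries. One array entry per retrying layer.
function worstCaseAttempts(retriesPerLayer) {
  return retriesPerLayer.reduce((product, retries) => product * (retries + 1), 1)
}

// JavaScript, frontend, and backend layers each issue 3 retries (4 attempts).
console.log(worstCaseAttempts([3, 3, 3])) // 64

// Retrying only at the layer immediately above the database.
console.log(worstCaseAttempts([0, 0, 3])) // 4
```

Moving all retries to a single layer brings the worst case down from 64 attempts to 4 for the same user action.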
From Google SRE Book Ch. 21 (Handling Overload):
First, we implement a per-request retry budget of up to three attempts. If a request has already failed three times, we let the failure bubble up to the caller.
Secondly, we implement a per-client retry budget. Each client keeps track of the ratio of requests that correspond to retries. A request will only be retried as long as this ratio is below 10%.
Although we've effectively capped the growth caused by retries, a threefold increase in requests is significant... However, layering on the per-client retry budget (a 10% retry ratio) reduces the growth to just 1.1x in the general case—a significant improvement.
Together, the 3-attempt cap and the 10% retry ratio turn a 3× worst case into roughly 1.1× in the typical case.
From Resilience4j docs:
The CircuitBreaker is implemented via a finite state machine with three normal states: CLOSED, OPEN and HALF_OPEN and three special states METRICS_ONLY, DISABLED and FORCED_OPEN.
The state of the CircuitBreaker changes from CLOSED to OPEN when the failure rate is equal or greater than a configurable threshold... The CircuitBreaker also changes from CLOSED to OPEN when the percentage of slow calls is equal or greater than a configurable threshold. This helps to reduce the load on an external system before it is actually unresponsive.
The CircuitBreaker rejects calls with a `CallNotPermittedException` when it is OPEN. After a wait time duration has elapsed, the CircuitBreaker state changes from OPEN to HALF_OPEN and permits a configurable number of calls to see if the backend is still unavailable or has become available again... If the failure rate or slow call rate is then equal or greater than the configured threshold, the state changes back to OPEN. If [...] is below the threshold, the state changes back to CLOSED.
| Property | Default | Meaning |
|---|---|---|
| failureRateThreshold | 50 (%) | CLOSED→OPEN trigger |
| slowCallRateThreshold | 100 (%) | Effectively off by default (trips only if every call is slow); lower it for latency-driven trips |
| slowCallDurationThreshold | 60_000 ms | Above this = "slow" |
| permittedNumberOfCallsInHalfOpenState | 10 | Probe count |
| slidingWindowType | COUNT_BASED | Or TIME_BASED (per-second buckets) |
| slidingWindowSize | 100 | calls (count) or seconds (time) |
| minimumNumberOfCalls | 100 | Floor before rate is computed |
| waitDurationInOpenState | 60_000 ms | OPEN→HALF_OPEN delay |
The slow-call rate is Resilience4j's improvement over Hystrix: trip on latency, not just errors. By the time errors start, the dependency is already over the cliff. Slow-call detection gives you a leading indicator.
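The state machine and thresholds above can be sketched as a small count-based breaker. This is an illustrative sketch, not Resilience4j's implementation: `SketchBreaker`, `tryAcquire`, `record`, and the shortened option names are hypothetical, and the window handling is deliberately simplified (the window is cleared on every state change).

```javascript
const DEFAULTS = {
  failureRateThreshold: 50,        // percent
  slowCallRateThreshold: 100,      // percent
  slowCallDurationMs: 60_000,
  slidingWindowSize: 100,          // calls
  minimumNumberOfCalls: 100,
  waitDurationInOpenStateMs: 60_000,
  permittedCallsInHalfOpen: 10,
  now: () => Date.now(),           // injectable clock for tests
}

class SketchBreaker {
  constructor(options = {}) {
    Object.assign(this, DEFAULTS, options)
    this.state = 'CLOSED'
    this.window = []   // recent outcomes: { failed, slow }
    this.openedAt = 0
    this.probes = 0
  }

  // Returns true if a call may proceed; false means fail fast.
  tryAcquire() {
    if (this.state === 'OPEN') {
      if (this.now() - this.openedAt < this.waitDurationInOpenStateMs) return false
      this.state = 'HALF_OPEN'
      this.window = []
      this.probes = 0
    }
    if (this.state === 'HALF_OPEN') {
      if (this.probes >= this.permittedCallsInHalfOpen) return false
      this.probes++
    }
    return true
  }

  // Record the outcome of a permitted call and update the state.
  record({ failed, durationMs }) {
    if (this.state === 'OPEN') return // late completions are ignored in this sketch
    this.window.push({ failed, slow: durationMs > this.slowCallDurationMs })
    if (this.window.length > this.slidingWindowSize) this.window.shift()

    const needed = this.state === 'HALF_OPEN'
      ? this.permittedCallsInHalfOpen
      : this.minimumNumberOfCalls
    if (this.window.length < needed) return

    const percent = (pred) => (100 * this.window.filter(pred).length) / this.window.length
    const tripped =
      percent((c) => c.failed) >= this.failureRateThreshold ||
      percent((c) => c.slow) >= this.slowCallRateThreshold

    if (tripped) {
      this.state = 'OPEN'
      this.openedAt = this.now()
      this.window = []
    } else if (this.state === 'HALF_OPEN') {
      this.state = 'CLOSED'
      this.window = []
    }
  }
}
```

A caller would check `tryAcquire()` before each outbound call, fail fast when it returns false, and pass the measured duration to `record()` so that slow calls count toward the trip decision even when they succeed.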
From Microsoft's Cloud Design Patterns (Circuit Breaker pattern):
The Half-Open state helps prevent a recovering service from suddenly being flooded with requests. As a service recovers, it might be able to support a limited volume of requests until the recovery is complete.
The Circuit Breaker pattern serves a different purpose than the Retry pattern. The Retry pattern enables an application to retry an operation with the expectation that it eventually succeeds. The Circuit Breaker pattern prevents an application from performing an operation that's likely to fail.
The right combo: retry inside a circuit breaker. Retry handles transient failures; the breaker handles persistent failures. The retry logic must check for CallNotPermittedException and not retry through it.
From Marc Brooker's AWS Architecture Blog post "Exponential Backoff And Jitter":
The three formulas:
# Full Jitter (recommended for most cases)
sleep = random_between(0, min(cap, base * 2 ** attempt))
# Equal Jitter
temp = min(cap, base * 2 ** attempt)
sleep = temp/2 + random_between(0, temp/2)
# Decorrelated Jitter
sleep = min(cap, random_between(base, sleep_prev * 3))
Brooker's verdict (verbatim):
The no-jitter exponential backoff approach is the clear loser. It not only takes more work, but also takes more time than the jittered approaches.
Of the jittered approaches, 'Equal Jitter' is the loser. It does slightly more work than 'Full Jitter', and takes much longer. The decision between 'Decorrelated Jitter' and 'Full Jitter' is less clear.
The return on implementation complexity of using jittered backoff is huge, and it should be considered a standard approach for remote clients.
Default recipe: Full Jitter, base = 50ms to 200ms, cap = 30s to 60s.
// Full Jitter: uniform random delay in [0, min(cap, base * 2^attempt)) milliseconds
function delayMs(attempt, base = 100, cap = 30_000) {
  return Math.floor(Math.random() * Math.min(cap, base * 2 ** attempt))
}
Why no-jitter is the killer: every retry from every client lands at the same instant after the upstream blip. Exponential backoff without jitter doesn't spread retries — it synchronizes them.
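For comparison, the other two variants from the formulas above can be sketched in the same style as `delayMs`. The function names and the injectable `rng` parameter are assumptions added here so the behavior can be checked deterministically; they are not from Brooker's post.

```javascript
// Equal Jitter: half of the capped exponential delay is fixed, half is random.
function equalJitterMs(attempt, base = 100, cap = 30_000, rng = Math.random) {
  const temp = Math.min(cap, base * 2 ** attempt)
  return Math.floor(temp / 2 + rng() * (temp / 2))
}

// Decorrelated Jitter: the next delay depends on the previous delay rather
// than on the attempt number. Pass `base` as prevSleepMs for the first retry.
function decorrelatedJitterMs(prevSleepMs, base = 100, cap = 30_000, rng = Math.random) {
  const upper = prevSleepMs * 3
  return Math.floor(Math.min(cap, base + rng() * (upper - base)))
}

// Example: three successive decorrelated delays.
let sleepMs = 100
for (let i = 0; i < 3; i++) {
  sleepMs = decorrelatedJitterMs(sleepMs)
  console.log(sleepMs)
}
```

Equal Jitter never sleeps less than half the exponential delay, which is why it takes longer to complete than Full Jitter in Brooker's comparison.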
The most commonly broken retry implementation is "retry on any error." From Google SRE Book Ch. 22:
Use clear response codes... separate retriable and nonretriable error conditions. Don't retry permanent errors or malformed requests in a client, because neither will ever succeed. Return a specific status when overloaded so that clients and other layers back off and do not retry.
| HTTP status | Retry? | Why |
|---|---|---|
| 200/2xx | n/a | Success |
| 400 Bad Request | No | Malformed; never succeeds |
| 401 Unauthorized | No | Need new credentials, not a retry |
| 403 Forbidden | No | Auth issue, not transient |
| 404 Not Found | No | Doesn't exist |
| 409 Conflict (idempotency in progress) | Sometimes | Wait then check final state, not blind retry |
| 422 Unprocessable | No | Malformed input |
| 429 Too Many Requests | Yes | Honor Retry-After header |
| 500 Internal Server Error | Cautiously | May be permanent (bug); limit attempts |
| 502 Bad Gateway | Yes | Upstream blip |
| 503 Service Unavailable | Yes | Honor Retry-After; expect overload |
| 504 Gateway Timeout | Cautiously | Already cost time; deadline may be up |
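The table can be turned into a small decision helper. This is a sketch under assumptions: `isRetriableStatus` and `retryAfterMs` are hypothetical names, and treating the "Cautiously" rows (500, 504) as retriable is a policy choice that only makes sense together with the attempt cap and deadline check. 409 is left out because the table calls for checking final state rather than a blind retry.

```javascript
// Status codes the table marks "Yes" or "Cautiously".
const RETRIABLE_STATUSES = new Set([429, 500, 502, 503, 504])

function isRetriableStatus(status) {
  return RETRIABLE_STATUSES.has(status)
}

// Parse a Retry-After header value into a delay in milliseconds.
// The header carries either delay-seconds ("120") or an HTTP-date.
// Returns null when the header is absent or cannot be parsed.
function retryAfterMs(headerValue, now = Date.now()) {
  if (headerValue == null) return null
  const trimmed = String(headerValue).trim()
  if (/^\d+$/.test(trimmed)) return Number(trimmed) * 1000
  const date = Date.parse(trimmed)
  if (Number.isNaN(date)) return null
  return Math.max(0, date - now)
}

console.log(isRetriableStatus(503)) // true
console.log(isRetriableStatus(400)) // false
console.log(retryAfterMs('120'))    // 120000
```

When `retryAfterMs` returns a value, a client would wait at least that long (and at least its own jittered delay) before the next attempt, provided the deadline still allows it.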
Network-layer rules:
From Google SRE Book Ch. 22:
A common theme in many cascading outages is that servers spend resources handling requests that will exceed their deadlines on the client. As a result, resources are spent while no progress is made: you don't get credit for late assignments with RPCs.
With deadline propagation, a deadline is set high in the stack (e.g., in the frontend). The tree of RPCs emanating from an initial request will all have the same absolute deadline. For example, if server A selects a 30-second deadline, and processes the request for 7 seconds before sending an RPC to server B, the RPC from A to B will have a 23-second deadline.
The shape:
async function handle(req, ctx) {
  // ctx.deadline is an absolute timestamp set at the edge
  const remaining = ctx.deadline - Date.now()
  if (remaining <= 0) throw new Error('deadline exceeded before work started')
  // Pass remaining as the timeout to downstream, NOT a fresh 30s
  const downstream = await fetchWithTimeout(url, { timeout: remaining })
  // ...
}
In gRPC, deadlines are first-class on the context. In HTTP, propagate via header (X-Request-Deadline-Ms, grpc-timeout if you've adopted that convention) or via tracing baggage.
The rule: never set a fresh timeout downstream. Always pass min(local_budget, remaining_deadline).
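The min(local_budget, remaining_deadline) rule can be sketched as a pure function, using the SRE Book's 30-second example. `downstreamTimeoutMs` and `deadlineHeaders` are hypothetical helpers, and carrying an absolute epoch-millisecond timestamp in `X-Request-Deadline-Ms` is an assumed convention; absolute timestamps only work across hosts whose clocks are reasonably synchronized.

```javascript
// Timeout for a downstream call: never more than the time left before the
// absolute deadline, and never more than the local per-call budget.
function downstreamTimeoutMs(deadlineMs, localBudgetMs, now = Date.now()) {
  const remaining = deadlineMs - now
  if (remaining <= 0) throw new Error('deadline exceeded')
  return Math.min(localBudgetMs, remaining)
}

// Headers to forward so the next hop sees the same absolute deadline.
// (Assumed convention: epoch milliseconds in X-Request-Deadline-Ms.)
function deadlineHeaders(deadlineMs) {
  return { 'X-Request-Deadline-Ms': String(deadlineMs) }
}

// Server A: 30 s deadline set at the edge, 7 s already spent on the request.
const edgeStart = 1_000_000
const edgeDeadline = edgeStart + 30_000
console.log(downstreamTimeoutMs(edgeDeadline, 60_000, edgeStart + 7_000)) // 23000
console.log(downstreamTimeoutMs(edgeDeadline, 5_000, edgeStart + 7_000))  // 5000
```

The first call shows the 23-second remainder from the quote; the second shows a tighter local budget winning over the remaining deadline.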
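// Pseudocode, language-agnostic. circuitBreaker, retryBudget, callWithTimeout,
// isRetriable, sleep, and DeadlineExceeded are placeholders.
async function callWithResilience(req, ctx) {
  // A breaker rejection is thrown by execute() before the loop runs,
  // so it is never retried from inside this function.
  return circuitBreaker.execute(async () => {
    let lastErr
    for (let attempt = 0; attempt < 3; attempt++) { // per-request budget
      const remaining = ctx.deadline - Date.now()
      if (remaining <= 0) throw new DeadlineExceeded()
      try {
        return await callWithTimeout(req, Math.min(remaining, 5000))
      } catch (err) {
        lastErr = err
        if (!isRetriable(err)) throw err           // non-retriable: stop
        if (attempt === 2) break                   // last attempt: don't spend budget or sleep
        if (!retryBudget.tryConsume()) throw err   // per-client 10% budget
        const delay = Math.floor(Math.random() * Math.min(30_000, 100 * 2 ** attempt))
        // Never sleep past the deadline (or for a negative duration)
        await sleep(Math.max(0, Math.min(delay, ctx.deadline - Date.now())))
      }
    }
    throw lastErr
  })
}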
The retryBudget is a token bucket over the last 2 minutes that allows requests * 0.10 retries per window. When it is exhausted, the retry is skipped and the failure is returned to the caller.
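One way to sketch such a budget is with two sliding-window lists of timestamps. This is a simplified illustration, not a production design: the class shape, `recordRequest`, and the injectable clock are assumptions, and a real implementation would likely use bucketed counters instead of per-request timestamps.

```javascript
class RetryBudget {
  constructor({ ratio = 0.1, windowMs = 120_000, now = () => Date.now() } = {}) {
    this.ratio = ratio
    this.windowMs = windowMs
    this.now = now
    this.requests = [] // timestamps of original (non-retry) requests
    this.retries = []  // timestamps of retries that were allowed
  }

  // Drop entries that have fallen out of the window.
  prune() {
    const cutoff = this.now() - this.windowMs
    while (this.requests.length && this.requests[0] <= cutoff) this.requests.shift()
    while (this.retries.length && this.retries[0] <= cutoff) this.retries.shift()
  }

  // Call once per original request.
  recordRequest() {
    this.prune()
    this.requests.push(this.now())
  }

  // Returns true and spends one retry if the window's retry count stays
  // within ratio * requests; returns false otherwise.
  tryConsume() {
    this.prune()
    if (this.retries.length + 1 > this.requests.length * this.ratio) return false
    this.retries.push(this.now())
    return true
  }
}

// 20 requests in the window allow 2 retries at a 10% ratio.
const budget = new RetryBudget()
for (let i = 0; i < 20; i++) budget.recordRequest()
console.log(budget.tryConsume(), budget.tryConsume(), budget.tryConsume()) // true true false
```

Because the budget is shared across all requests from one client, a broad outage exhausts it quickly and most failures are returned without a retry.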
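Client-side adaptive throttling (Google SRE Book Ch. 21): each client tracks two counters over the last 2 minutes, requests (attempts made by the application layer) and accepts (requests the backend accepted). The client-side rejection probability is:
max(0, (requests - K * accepts) / (requests + 1))
K = 2 is the recommended default ("we generally prefer the 2× multiplier"). When the upstream is rejecting too many requests, the client itself drops requests before they reach the wire.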
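The formula translates directly into code. The function names and the injectable `rng` are assumptions for illustration; the counters are expected to come from whatever 2-minute window the client maintains.

```javascript
// Probability that the client rejects a request locally.
function rejectionProbability(requests, accepts, k = 2) {
  return Math.max(0, (requests - k * accepts) / (requests + 1))
}

// Decide whether to drop this request before it reaches the wire.
function shouldRejectLocally(requests, accepts, k = 2, rng = Math.random) {
  return rng() < rejectionProbability(requests, accepts, k)
}

// Healthy backend: every request accepted, nothing is dropped locally.
console.log(rejectionProbability(100, 100)) // 0

// Overloaded backend: only 100 of 1000 requests accepted.
console.log(rejectionProbability(1000, 100).toFixed(3)) // "0.799"
```

With K = 2 the client starts dropping requests locally only once fewer than half of its recent requests were accepted.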
| Anti-pattern | Why it bites | Fix |
|---|---|---|
| Retry without circuit breaker | When upstream is down, retries hammer it | Wrap retries in a breaker; respect OPEN state |
| Circuit breaker without slow-call detection | Trips only after errors start; too late | Enable slow-call rate threshold (Resilience4j default off — turn on) |
| Retry at every layer | 3 retries (4 attempts) at each of 3 layers = 4³ = 64× amplification | Retry only at the layer immediately above the failing dependency |
| No-jitter exponential backoff | Synchronizes retry storms | Full Jitter |
| Retrying 4xx errors | Will never succeed | Whitelist retriable status codes (5xx + 429) only |
| Fresh timeout per hop | Total wallclock balloons; deadline ignored | Propagate deadlines; pass min(budget, remaining) |
| maxAttempts: 100 | Useless — by attempt 10 the upstream is gone | Cap at 3 (or 5 max for highly transient infra) |
| Hystrix in 2026 | Netflix moved it to maintenance mode in 2018; no new development | Resilience4j (JVM), Polly (.NET), or hand-rolled per-language |
| Retrying non-idempotent POST without idempotency key | Risk of duplicate side effects | Add idempotency key OR don't retry |
| Open-state ignored by retry loop | Retries through CallNotPermittedException defeating the breaker | Retry loop must short-circuit on breaker rejection |
| | Novice | Expert |
|---|---|---|
| First retry | for (let i=0; i<5; i++) await call() | Full Jitter + 3-attempt cap + 10% client budget |
| Adds circuit breaker | Custom counter, no half-open | Resilience4j with slow-call threshold + half-open probes |
| Cascading failure postmortem | "Add more retries" | "Remove retries from N-2 layers; add deadline propagation" |
| On 503 with Retry-After | Ignores header | Honors header; backs off for at least the suggested duration |
| Chooses backoff | Constant 1s | Full jitter exponential capped at 30s |
Timeline test: how long after a downstream blip does retry traffic from your service stop? Expert answer: bounded by cap (e.g., 30s). Novice answer: until the blip ends — meaning retries kept the dependency overloaded longer than the original incident.
A resilience change ships when:
- [...] `CallNotPermittedException` (or breaker-equivalent); verified by integration test that opens the breaker and confirms zero downstream calls.
- [...] `slowCallRateThreshold` is configured (Resilience4j default disabled); verify in config.
- [...] `Retry-After` header (when present) is parsed and honored.

Related skills: `bulkhead-isolation-design`, `hedged-request-design`, `rate-limiting-strategy`, `backpressure-design`, `istio-resilience-config` or `linkerd-retry-policies`, `postgres-isolation-and-retries`.