skills/nygard-2018-release-it-2nd-edition/SKILL.md
Production stability patterns including circuit breakers, bulkheads, and resilience strategies for real-world systems
npx skillsauth add curiositech/windags-skills nygard-2018-release-it-2nd-editionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
license: Apache-2.0
Load this skill when:
The book's core insight: Production systems fail in ways that are mechanically invisible during development. If the conversation involves the gap between test environments and production reality, this skill applies.
Production failures follow mechanical laws based on architectural coupling. A single uncaught exception doesn't stay isolated—it propagates through:
Key principle: Design for failure isolation. Every integration point is a potential failure amplifier. The failure propagation speed is determined by timeout values, pool sizes, and synchronization points—not business logic.
Testing environments are deliberately unrealistic and cannot reveal production-scale failure modes:
Key principle: Antipatterns are emergent properties of scale, time, and adversarial conditions. You cannot test your way into production confidence—you must design for assumed failure.
Early choices about communication patterns, failure handling, and resource management determine operational costs for years:
Key principle: The cheapest time to fix operational problems is during initial architecture. Every integration point should explicitly answer: "What happens when this fails? How fast? Where does the failure go?"
Each stability pattern targets a mechanical failure mode:
Key principle: Patterns are not abstract "best practices"—they are physics-based countermeasures. Apply patterns where failure mechanics predict problems (high fan-out, shared resources, external dependencies).
Zero-downtime deployment requires application participation at every layer:
Key principle: If deployment requires downtime windows or multi-hour ceremonies, the architecture is wrong. Treat continuous deployment as a non-functional requirement from day one.
IF introducing a synchronous call to another service
THEN
IF the integration is to a third-party vendor or legacy system
THEN expect byzantine failures (slow responses without errors, malformed data, connection leaks)
IF system hangs rather than crashes
THEN → Load references/failure-propagation-mechanics.md — likely blocked threads or resource pool exhaustion
IF problem only appears under sustained load or after hours/days
THEN → Load references/production-scale-antipatterns.md — emergent failure from memory leaks, unbounded result sets, or scaling effects
IF deployment causes outages or requires downtime windows
THEN → Load references/zero-downtime-deployment-architecture.md — deployment is not being treated as architectural feature
IF choosing between synchronous and asynchronous communication
THEN async is safer (failures don't propagate backward) but harder to reason about; sync requires timeout/circuit breaker discipline
IF evaluating shared vs. partitioned resources (thread pools, DB connections)
THEN partitioning (bulkheads) prevents cascading failures at cost of lower resource efficiency
IF deciding whether to cache integration point responses
THEN caching provides resilience when dependency fails but adds staleness and invalidation complexity
IF code makes network calls
THEN verify explicit timeouts, connection pooling with size limits, error handling that doesn't crash threads
IF code queries database without LIMIT clause
THEN reject—unbounded result sets are a production time bomb
IF code holds locks or resources while calling external systems
THEN redesign—this creates cascading failure coupling
| File | Load When | Contents |
|------|-----------|----------|
| references/failure-propagation-mechanics.md | Investigating cascading failures, system hangs, or resource exhaustion; designing integration patterns between services | Detailed mechanics of how failures propagate through architectural coupling: blocked threads, resource pool exhaustion, synchronous call chains, shared contention points. Includes real case studies (airline outage, AWS S3 incident) showing failure amplification. |
| references/production-scale-antipatterns.md | Problem only appears in production; debugging emergent behavior at scale; reviewing architecture for latent failure modes | The 8 core antipatterns (Integration Points, Blocked Threads, Cascading Failures, Chain Reactions, Users, Self-Denial, Scaling Effects, Unbounded Result Sets) with mechanical explanations and specific countermeasures. Explains why testing cannot reveal these problems. |
| references/zero-downtime-deployment-architecture.md | Planning deployment strategy; deployment currently requires downtime; implementing CI/CD; discussing immutable infrastructure or 12-factor apps | Architectural requirements for treating deployment as a feature: health checks, schema migration patterns, configuration injection, service discovery, protocol versioning. Covers immutable infrastructure and the operations/development contract. |
"It worked in testing" — Testing environments cannot reveal production-scale failure modes. Assuming test coverage equals production confidence is the original sin.
Defaulting to library/framework timeout values — Most libraries default to infinite timeouts or values tuned for LAN latency, not internet-scale unreliability. Always set explicit, short timeouts.
Failing hard instead of failing soft — Crashing on error propagates failures upward. Graceful degradation (returning cached data, default values, or partial results) contains failures.
Shared resource pools without bulkheads — A single slow integration point can exhaust a shared thread pool, taking down the entire application. Partition resources by failure domain.
Synchronous calls without circuit breakers — Retry storms and cascading failures result. Every sync integration needs a circuit breaker.
"We'll add monitoring/deployment/security later" — These are architectural concerns. Bolting them on after initial design is 10-100x more expensive.
Treating instances as pets instead of cattle — Snowflake servers with manual configuration don't scale. Embrace disposable, immutable infrastructure.
Unbounded result sets — Queries without LIMIT clauses, pagination, or streaming are production time bombs that work fine in development with small datasets.
Someone has internalized Release It! if they:
Ask "what happens when this fails?" during architecture discussions — not as pessimism, but as mechanical analysis of failure propagation paths
Specify timeout values in milliseconds during design — and can explain the math: (timeout × pool_size × request_rate) = memory consumption
Distinguish between liveness and readiness in health check discussions — they understand the difference between "process is running" and "ready to serve traffic"
Describe failures using mechanical terms — "blocked threads," "cascading failures," "circuit breaker," not vague terms like "broken" or "down"
Treat deployment as a feature requirement — it appears in architecture diagrams and design reviews, not just ops runbooks
Argue against shared resource pools — they instinctively partition (bulkhead) resources even when it seems inefficient
Assume integration points will exhibit byzantine failures — they don't trust vendor libraries, network reliability, or graceful degradation of dependencies
Someone has only read the summary if they:
Core philosophy: Production systems exist in a hostile environment where failure is continuous. Your job is not to prevent failures (impossible) but to design systems where failures are isolated, observable, and non-catastrophic. Every architectural decision either amplifies or dampens failure propagation—choose deliberately.
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.