skills/kleppmann-data-intensive/SKILL.md
--- --- license: Apache-2.0 name: kleppmann-data-intensive description: Comprehensive guide to designing reliable, scalable data systems covering databases, streaming, and consistency category: Research & Academic tags: - data-systems - distributed-systems - databases - streaming - consistency --- # SKILL: Designing Data-Intensive Systems (Kleppmann) **Source**: *Designing Data-Intensive Applications* by Martin Kleppmann **Domain**: Distributed systems, data architecture, reliability
npx skillsauth add curiositech/windags-skills skills/kleppmann-data-intensiveInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Source: Designing Data-Intensive Applications by Martin Kleppmann Domain: Distributed systems, data architecture, reliability engineering Applies to: Building systems where data complexity (not computation) is the bottleneck
IF: Strong consistency required (banking, inventory)
AND: Can tolerate higher latency + coordination overhead
THEN: Use linearizability with synchronous replication
IF: Operations need ordering but not global agreement
AND: User experience matters more than strict consistency
THEN: Use causal consistency (preserve cause-effect, allow concurrent ops)
IF: High availability required during network partitions
AND: Can resolve conflicts application-side
THEN: Accept eventual consistency with conflict resolution
IF: Read-your-writes is critical but global consistency isn't
AND: Users mostly operate on their own data
THEN: Route user reads to leader/use session consistency
IF: tail latency > 500ms + replication lag > 1s
AND: Cache hit rate < 80%
THEN: Switch to local quorum reads, accept bounded staleness
IF: Write throughput bottleneck identified
AND: Operations can be partitioned by key
THEN: Implement horizontal partitioning with partition-local transactions
IF: Cross-partition queries frequent
AND: Eventual consistency acceptable for derived data
THEN: Use CQRS pattern (separate write/read paths)
IF: Coordination overhead dominates response time
AND: Operations can be made idempotent
THEN: Replace distributed locks with compare-and-set operations
IF: Component failure detected (timeout/error)
AND: Operation might have succeeded
THEN: Make operation idempotent, use unique request IDs for retry
IF: Distributed resource coordination required
AND: Process pauses/network delays possible
THEN: Implement fencing tokens (resource rejects lower-numbered tokens)
IF: Multi-step workflow spans services
AND: Atomic rollback needed
THEN: Use saga pattern with compensating transactions, not 2PC
IF: Service dependency causing tail latency spikes
THEN: Implement circuit breaker + hedged requests after timeout threshold
Symptoms: Two nodes both believe they're the leader, conflicting writes accepted Root Causes: Network partition + inadequate quorum checking + lease expiry race conditions Detection Rule: If you see duplicate primary keys or "impossible" data states after network events Fixes (ranked by speed/safety):
Symptoms: Single component failure causes system-wide outage, p99 latency spike across all services Root Causes: Synchronous dependencies + no circuit breakers + retry storms + unbounded queues Detection Rule: If failure rate increases exponentially rather than linearly with initial fault Fixes (ranked by speed/safety):
Symptoms: User writes data, immediately reads and sees old value, claims "data was lost" Root Causes: Async replication lag + load balancer routes read to stale replica + no session affinity Detection Rule: If user complaints about "lost data" correlate with write-then-read patterns Fixes (ranked by speed/safety):
Symptoms: Multiple processes acquire same lock simultaneously, resource corruption occurs Root Causes: Lock service uses timeouts without fencing + GC pauses + network delays exceed lease time Detection Rule: If you see lock violation errors or concurrent modification of "protected" resources Fixes (ranked by speed/safety):
Symptoms: Cache expires, all requests hit database simultaneously, database overloads Root Causes: Cache expiry + no request deduplication + synchronous cache population + high concurrency Detection Rule: If database load spikes correlate with cache miss events Fixes (ranked by speed/safety):
Scenario: User adds last item to cart, another user tries same item simultaneously
Decision Process:
Implementation:
-- Atomic inventory check with compare-and-set
UPDATE inventory
SET quantity = quantity - 1, version = version + 1
WHERE product_id = ? AND quantity >= 1 AND version = ?
Novice would miss: Using SELECT then UPDATE (race condition window) Expert catches: Version field prevents lost updates, quantity check prevents overselling
Fallback handling:
Scenario: User posts update, immediately checks feed, doesn't see their post
Decision Process:
Implementation Strategy:
Quality validation:
This skill should NOT be used for:
Delegate to other skills:
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.