skills/ongaro-ousterhout-2014-raft/SKILL.md
Apply Raft's design principles for building understandable, reliable distributed and multi-agent systems
license: Apache-2.0
Activation Triggers:
Load this skill when you need to:
This skill is NOT for: Systems where theoretical elegance matters more than implementation correctness, or where perfect optimization trumps maintainability.
Principle: Optimize for the cognitive limitations of implementers, not the aesthetic preferences of designers.
When faced with design alternatives that are theoretically equivalent:
Why it matters: Complex systems fail not because the theory is wrong but because humans misunderstand the theory. A "simpler" algorithm that's harder to explain produces more bugs than a "complex" algorithm with clear component boundaries.
Application: Before adding abstraction, generality, or optimization, ask if the complexity cost exceeds the capability gain. Make the common case trivial even if it makes edge cases more explicit.
Principle: The joints between components matter more than the components themselves.
Good decomposition separates concerns such that:
Raft's example: Leader election, log replication, and safety are nearly independent subproblems. You can understand log replication without understanding election mechanics.
Bad decomposition signs:
Principle: Adding constraints makes systems more reliable by eliminating states humans can't reason about.
Raft's constraints:
Each constraint eliminates combinatorial possibilities. The Leader Append-Only Property alone prevents entire classes of consistency violations.
How to apply:
Trade-off: Strategic inflexibility reduces capability less than it reduces complexity. Raft is "less general" than Paxos but handles the practical case more reliably.
Principle: Hierarchical coordination with clear authority is simpler than democratic consensus, even when both achieve the same outcome.
Raft's strong leader model:
When to use strong leadership:
When NOT to use it:
Principle: Reducing guarantees can paradoxically increase reliability by eliminating pathological cases humans can't anticipate.
Raft's randomized election timeouts: Each server picks a random timeout value from a fixed interval. If timeouts were deterministic, certain timing scenarios would make split votes likely. Randomization handles all possible timings uniformly: "Pick any; it doesn't matter."
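A minimal sketch of the timeout choice described above, not Raft's reference implementation. The 150-300 ms interval follows the example range given in the Raft paper; the function and variable names are hypothetical.

```python
import random
from typing import Dict

# Assumed interval: the Raft paper uses 150-300 ms as an example range.
ELECTION_TIMEOUT_MIN_MS = 150.0
ELECTION_TIMEOUT_MAX_MS = 300.0


def pick_election_timeout(rng: random.Random) -> float:
    """Draw a fresh timeout uniformly from the fixed interval."""
    return rng.uniform(ELECTION_TIMEOUT_MIN_MS, ELECTION_TIMEOUT_MAX_MS)


def first_to_time_out(timeouts: Dict[str, float]) -> str:
    """Return the server whose timer expires first and so becomes a candidate."""
    return min(timeouts, key=timeouts.get)


rng = random.Random(42)  # seeded only so the sketch is repeatable
timeouts = {server: pick_election_timeout(rng) for server in ("s1", "s2", "s3")}
candidate = first_to_time_out(timeouts)
```

No server is special here: whichever timer happens to fire first starts the election, which is the "pick any" property in practice.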
Application pattern:
Examples beyond Raft:
When decomposing a system into components, evaluate:
IF explanation requires saying "but first you need to understand..."
THEN decomposition boundaries are wrong
IF changing component A forces changes in component B
THEN they share hidden state or assumptions
IF the interface between components is complex
THEN consider merging them or finding a different boundary
IF failure in component A cascades to component B
THEN they're not truly independent subproblems
Good test: Can you implement components in any order? If implementation order is forced, your decomposition creates artificial dependencies.
When a system has too many edge cases:
1. Enumerate the states the system actually occupies during correct operation
2. Identify theoretical states that the implementation allows but never needs
3. For each unnecessary state, ask: "What constraint would make this impossible?"
4. Add constraints that maximize (edge cases eliminated / functionality reduced)
5. Enforce through types, invariants, or architectural patterns, not validation
Example: If logs can have holes, you must handle: holes during replication, holes during recovery, hole compaction, hole replication retry. Constraint "no holes allowed" eliminates all these cases.
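One way to enforce "no holes allowed" structurally is to give the log a single write path. The sketch below is illustrative only: `Entry` and `HoleFreeLog` are hypothetical names, and real Raft logs also handle truncation and persistence, which are left out here.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class Entry:
    term: int
    command: str


class HoleFreeLog:
    """Log whose only write path is append, so indices are always contiguous."""

    def __init__(self) -> None:
        self._entries: List[Entry] = []

    def append(self, entry: Entry) -> int:
        """Store the entry at the next index and return that 1-based index."""
        self._entries.append(entry)
        return len(self._entries)

    def get(self, index: int) -> Entry:
        """Read an existing entry; indices outside 1..len raise IndexError."""
        if not 1 <= index <= len(self._entries):
            raise IndexError(f"no entry at index {index}")
        return self._entries[index - 1]

    def __len__(self) -> int:
        return len(self._entries)


log = HoleFreeLog()
first = log.append(Entry(term=1, command="set x=1"))
second = log.append(Entry(term=1, command="set x=2"))
```

Because there is no method that writes at an arbitrary index, callers cannot create a gap, so no code for detecting or repairing gaps is needed.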
When choosing a coordination model:
PREFER strong leader (asymmetric) IF:
- Common case is 100x more frequent than failures
- Decision latency matters more than optimal correctness
- Single agent has sufficient context for good decisions
- Coordination overhead exceeds cost of occasional suboptimal choice
PREFER symmetric peer coordination IF:
- No agent has privileged information
- System must tolerate arbitrary leader failure
- Decision quality matters more than latency
- Trust should be distributed
Hybrid option: Leader for common case, consensus for critical decisions (this is Raft's actual pattern).
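The hybrid option can be sketched as a router that sends routine proposals to the leader and critical ones to a majority vote. This is a simplified illustration under assumed names (`decide`, `leader_accepts`, `peer_votes`), not Raft's actual RPC flow.

```python
from typing import Callable, Sequence


def majority(cluster_size: int) -> int:
    """Smallest number of votes that is more than half the cluster."""
    return cluster_size // 2 + 1


def decide(
    proposal: str,
    critical: bool,
    leader_accepts: Callable[[str], bool],
    peer_votes: Sequence[Callable[[str], bool]],
) -> bool:
    """Route routine proposals to the leader and critical ones to a vote."""
    if not critical:
        return leader_accepts(proposal)  # one decision, no voting round
    granted = sum(1 for vote in peer_votes if vote(proposal))
    return granted >= majority(len(peer_votes))


# Hypothetical three-peer cluster: two peers grant, one refuses.
peers = [lambda p: True, lambda p: True, lambda p: False]
split_peers = [lambda p: True, lambda p: False, lambda p: False]

routine = decide("append entry", False, lambda p: True, peers)
elected = decide("elect s2", True, lambda p: True, peers)
rejected = decide("elect s3", True, lambda p: True, split_peers)
```

The routine path costs a single call, while the critical path pays for a full round of votes, which is why the split is worth making only when critical decisions are rare.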
When facing coordination deadlocks or pathological timing:
USE randomization IF:
- All adequate choices are roughly equivalent
- Deterministic approach creates timing dependencies
- Corner cases proliferate despite attempts to handle them
- System needs to break symmetry quickly
STAY deterministic IF:
- Debugging requires reproducibility
- Choices have irreversible consequences
- Compliance requires audit trails
- Performance optimization depends on predictability
Warning sign: If you keep finding "one more edge case" in deterministic design, randomization may eliminate the entire class.
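The effect of breaking symmetry can be seen in a toy simulation. It assumes a deliberately crude model, not Raft itself: a round produces a leader only if the earliest timer beats the runner-up by more than one broadcast time. All names and the 10 ms broadcast time are illustrative assumptions.

```python
import random
from typing import Callable, List


def election_succeeds(timeouts: List[float], broadcast_ms: float) -> bool:
    """A round yields a leader if the earliest timer leads the runner-up
    by more than one broadcast time, so its vote requests arrive first."""
    ordered = sorted(timeouts)
    return ordered[1] - ordered[0] > broadcast_ms


def success_rate(
    draw_timeout: Callable[[], float],
    servers: int = 5,
    rounds: int = 10_000,
    broadcast_ms: float = 10.0,
) -> float:
    """Fraction of simulated rounds that produce a leader."""
    wins = sum(
        election_succeeds([draw_timeout() for _ in range(servers)], broadcast_ms)
        for _ in range(rounds)
    )
    return wins / rounds


rng = random.Random(7)  # seeded only so the sketch is repeatable
fixed_rate = success_rate(lambda: 150.0)
random_rate = success_rate(lambda: rng.uniform(150.0, 300.0))
```

With identical fixed timeouts every server times out together, so no round produces a leader in this model. With randomized timeouts most rounds do, and a failed round is simply retried with fresh draws.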
When documenting systems or designing agents:
A specification is incomplete if the implementer must guess:
- What happens during partial failures
- Resource limits and timing assumptions
- Recovery procedures after crashes
- How to handle concurrent operations
Complete specification includes:
- Normal operation (the algorithm)
- All failure modes and recovery paths
- Timing assumptions and resource limits
- Invariants that must be maintained
- Extension points and what they can assume
Test: Can someone implement correctly without asking clarifying questions? Raft's completeness helped make many independent implementations possible.
| Reference File | When to Load | Key Content |
|---------------|--------------|-------------|
| decomposition-as-coordination-strategy.md | Designing multi-component systems; experiencing tight coupling between modules | How Raft's three-way split (election/replication/safety) reduces interdependency; criteria for good vs. bad decomposition boundaries |
| strong-leadership-vs-democratic-coordination.md | Choosing between centralized orchestrator and peer coordination; designing leader election | Deep analysis of strong leader trade-offs; when asymmetric coordination simplifies systems; failure scenario handling |
| state-space-reduction-through-constraints.md | System has too many edge cases; debugging mysterious state corruption; designing invariants | Raft's specific constraints (no holes, append-only, etc.); how restrictions eliminate bug classes; constraint selection methodology |
| randomization-and-coordination-simplicity.md | Facing split-brain or coordination deadlocks; deterministic approach creating corner cases | Why Raft uses randomized timeouts; when nondeterminism reduces complexity; trade-offs vs. reproducibility |
| failure-modes-and-recovery-design.md | Implementing fault tolerance; system experiencing undefined failure states; designing recovery logic | Raft's failure taxonomy; how to design for failure from the start; making recovery paths first-class design elements |
| understandability-as-measurable-goal.md | Evaluating design alternatives; system is "theoretically correct" but brittle in practice; measuring design quality | Raft's empirical study comparing comprehension; metrics for understandability; how to make it a design objective not afterthought |
| implementing-vs-specifying-distributed-systems.md | Writing specifications or documentation; gap between design and implementation; distributed system design | The completeness gap in academic literature; what Raft includes that others omit; how to write implementable specs |
Mistake: Adding performance optimizations or generalizations before the core algorithm is fully understood and correctly implemented.
Why it fails: Optimization introduces special cases. If the base case isn't crystal clear, optimizations create compound complexity that's exponentially harder to debug.
Raft's lesson: The paper explicitly omits optimizations to maintain focus on understandability. Add performance later, after correctness is proven.
Mistake: Using consensus/voting for every decision because it's "more distributed" or "more fault-tolerant."
Why it fails: Coordination overhead dominates when most decisions are uncontroversial. Symmetric protocols require multiple round-trips even when all agents would agree immediately.
Raft's lesson: Use strong leader for common case, consensus only for critical decisions (like leader election itself).
Mistake: Writing validation and error-handling logic for states that shouldn't be possible.
Why it fails: Error handlers become part of the state space, introducing their own edge cases. Recovery logic is rarely tested and often wrong.
Raft's lesson: Make invalid states impossible through constraints. The system should structurally prevent holes in logs, not detect and repair them.
Mistake: Splitting a system into components based on performance characteristics, code size, or team boundaries rather than semantic independence.
Why it fails: Creates components with high semantic coupling and low cohesion. Changes ripple across boundaries; understanding one requires understanding others.
Raft's lesson: Decompose by failure modes and logical independence. Raft's three components fail independently and can be reasoned about separately.
Mistake: Describing the algorithm's correctness properties without timing assumptions, resource limits, or failure handling.
Why it fails: Implementers must guess about edge cases, leading to divergent implementations that don't interoperate.
Raft's lesson: Include timing (election timeouts), resources (log compaction), and failure modes (network partitions, crashes) in the spec itself.
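The Raft paper states its timing requirement as broadcastTime ≪ electionTimeout ≪ MTBF. A spec can carry that assumption as a checkable value, as in the sketch below. The class name, field names, and the default `margin` of 10 (one order of magnitude) are assumptions for illustration; the paper asks for a larger gap between the election timeout and MTBF.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class TimingAssumptions:
    broadcast_ms: float         # average time to send RPCs to every server and get replies
    election_timeout_ms: float  # lower bound of the election timeout interval
    mtbf_ms: float              # mean time between failures for a single server

    def violations(self, margin: float = 10.0) -> List[str]:
        """List which parts of broadcast << election timeout << MTBF fail."""
        problems = []
        if self.broadcast_ms * margin > self.election_timeout_ms:
            problems.append("broadcast time is not well below election timeout")
        if self.election_timeout_ms * margin > self.mtbf_ms:
            problems.append("election timeout is not well below MTBF")
        return problems


MONTH_MS = 30 * 24 * 3600 * 1000  # illustrative MTBF of about one month

ok = TimingAssumptions(broadcast_ms=10, election_timeout_ms=150, mtbf_ms=MONTH_MS)
bad = TimingAssumptions(broadcast_ms=100, election_timeout_ms=150, mtbf_ms=MONTH_MS)
```

Here `ok.violations()` returns an empty list, while `bad.violations()` reports that a 100 ms broadcast time is too close to a 150 ms election timeout.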
Mistake: Insisting on deterministic behavior even when it creates cascading corner cases.
Why it fails: Deterministic coordination in asynchronous systems requires complex tie-breaking, ranking, or additional rounds. Corner cases multiply.
Raft's lesson: The authors first planned a deterministic ranking scheme for elections, found that it created "subtle issues" whose fixes kept exposing new corner cases, and switched to randomized retry. Sometimes nondeterminism is the simpler choice.
1. They explain constraints before algorithms
2. They evaluate designs by understandability
3. They recognize decomposition quality
4. They see the tension between specification completeness and abstraction
5. They know when NOT to apply Raft's lessons
Ask: "Why did the Raft authors conduct an empirical study comparing comprehension with Paxos?"
Surface answer: "To prove Raft is simpler."
Deep answer: "To demonstrate that understandability can be operationalized and measured, which makes it a legitimate design objective rather than a subjective preference. The study's real contribution is showing that design-for-comprehension produces measurably better outcomes—it's not just about aesthetics, it's about reliability. This challenges the assumption that theoretical elegance and practical understandability are aligned."
Note: This skill emphasizes design principles over implementation details. Load reference files for deeper dives into specific mechanisms, trade-offs, and applications.