skills/distributed-algorithms/SKILL.md
Foundational distributed computing algorithms for consensus, coordination, and fault tolerance
npx skillsauth add curiositech/windags-skills distributed-algorithmsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
license: Apache-2.0
name: distributed-algorithms-lynch
description: >
Reasoning framework for distributed systems design, correctness, and
impossibility — grounded in Lynch's formal automata-theoretic treatment.
Covers synchronous/asynchronous/partially-synchronous models, failure
models, consensus, and impossibility results as engineering constraints.
version: 1.0
activation_triggers:
- distributed systems design or architecture questions
- consensus, leader election, or coordination problems
- "is this possible?" questions about distributed guarantees
- fault tolerance and replication design
- debugging liveness or safety violations in distributed systems
- CAP theorem, FLP, Byzantine fault tolerance mentions
- questions about consistency models (linearizability, serializability)
- concurrent/shared-memory synchronization hierarchy questions
- formal verification of distributed protocols
- questions about timeouts, clocks, and timing assumptions
Load this skill when someone is:
Signal phrases: "eventually consistent," "we can't get all nodes to agree," "our leader election is flaky," "we need linearizability," "how many failures can we tolerate," "is this CAP-related?"
Every distributed algorithm lives at a point defined by:
| Dimension | Options | Effect on Solvability | |-----------|---------|----------------------| | Communication | Shared memory / Message passing | Changes primitive power; most results translate via simulation | | Timing | Synchronous / Asynchronous / Partially synchronous | Most consequential — determines what's provably achievable | | Failure model | None / Crash / Omission / Byzantine | Determines n > kf thresholds and algorithm complexity | | Problem class | Consensus / Election / Broadcast / Allocation | Different hardness in each timing+failure combination |
Practical implication: Before writing a line of code, identify where you are in this space. Moving even one dimension can flip a problem from unsolvable to trivial — or vice versa.
The foundational negative results are not academic — they are the most useful theorems in the book:
These mean: if you are in that regime and your protocol isn't working, no amount of debugging will fix it. The correct response is to relax an assumption — add a timeout (partial synchrony), strengthen failure detection, reduce the failure tolerance requirement.
Informal reasoning about concurrent systems fails — systematically and silently. Lynch's warning:
"Without such care, it is difficult to avoid mistakes."
The tools that work: state machines (I/O automata), invariant assertions, simulation relations, execution traces. Informal prose about distributed behavior is unreliable at the intersection of interleaving, timing, and failure modes. Use formal reasoning whenever correctness matters — not as gold-plating but as the minimum that actually works.
The atomicity composition theorem (Theorem 13.7): if each layer atomically implements its interface, the composed system is atomic. This enables:
This is why atomicity/linearizability is the right correctness condition for building composable systems. Weaker conditions (sequential consistency, eventual consistency) don't compose cleanly.
Relaxations have measurable, provable benefits:
When someone says "we'll just relax consistency a bit," the Lynch framework demands: which relaxation, by exactly how much, and what do you gain?
1. Identify timing model: Is there a bound on message delay? Process speed?
→ Synchronous: most problems solvable
→ Asynchronous: check impossibility results FIRST
→ Partially synchronous: consensus achievable, with caveats
2. Identify failure model: What failures must be tolerated?
→ Crash: n > 2f for most consensus variants
→ Byzantine: n > 3f required, no exceptions
3. Check known impossibility results:
→ Async + any crash failure + consensus = IMPOSSIBLE (FLP)
→ Byzantine + n ≤ 3f = IMPOSSIBLE
→ If impossible, ask: which assumption can we relax?
4. If possible: what are the complexity costs?
→ Time complexity: rounds or real-time bounds?
→ Message complexity: O(n²) or O(n log n)?
→ Is that acceptable for the use case?
If the system seems correct but non-terminating (liveness issue):
→ Check: are you in an asynchronous model with failures? (→ FLP applies)
→ Check: is your failure detector strong enough? (eventually perfect ◇P?)
→ Check: are timing assumptions actually met in deployment?
If the system terminates but produces wrong values (safety issue):
→ Check: is your failure threshold n > 2f (crash) or n > 3f (Byzantine)?
→ Check: are atomic operations actually atomic? (linearizability)
→ Formalize executions as state traces; find the invariant violation
If behavior is inconsistent across nodes:
→ Check: which consistency model are you actually implementing?
→ Linearizability ≠ sequential consistency ≠ eventual consistency
→ These do not compose interchangeably
Need strong composability? → Linearizability (atomic objects)
Need only single-object correctness? → Sequential consistency (weaker, cheaper)
Tolerating staleness? → Eventual consistency (but: no FLP escape
without also giving up something else)
Need consensus specifically? → Must be partially synchronous at minimum;
identify failure detector class needed
f failures to tolerate:
Crash stop → need n > 2f, use crash-tolerant consensus (Paxos-family)
Byzantine → need n > 3f, use BFT protocol (cost: O(n) factor overhead)
Failure detector availability:
Perfect P → synchronous system or unrealistic assumption
Eventually ◇P → achievable in partially synchronous systems; sufficient for consensus
If you can't afford n > 3f:
→ Reduce f (tolerate fewer failures)
→ Use cryptographic authentication to convert Byzantine → authenticated Byzantine
(reduces to n > 2f with authentication)
Load these on demand based on what's needed:
| File | When to Load |
|------|-------------|
| references/impossibility-results-as-engineering-constraints.md | Someone is attempting something that might be provably impossible; FLP, Byzantine thresholds, or lower bounds are relevant; need to justify stopping or changing approach |
| references/timing-models-and-their-consequences.md | Choosing between synchronous/asynchronous/partially-synchronous architectures; timeouts and failure detectors are in scope; "why does consensus work here but not there?" |
| references/failure-models-and-fault-tolerance.md | Designing fault-tolerant systems; evaluating how many failures a system can handle; Byzantine vs. crash fault distinction matters |
| references/distributed-consensus-the-complete-picture.md | Consensus, Paxos, Raft, leader election, atomic broadcast, or any agreement protocol is under discussion; need the complete problem taxonomy |
| references/formal-proof-methods-for-distributed-systems.md | Proving correctness; finding subtle bugs via formal reasoning; invariant-based arguments; simulation relations needed |
| references/synchronization-primitives-and-atomic-objects.md | Shared-memory concurrency; compare-and-swap power hierarchy; wait-free vs. lock-free; building atomic objects from weaker primitives |
| references/algorithm-design-patterns-and-canonical-problems.md | Need a known algorithm for a canonical problem (broadcast, election, snapshot); looking for a design pattern; connecting a new problem to a known one |
These are mistakes Lynch's framework directly prevents:
1. Trying to solve consensus in a pure async system Spending weeks tuning a consensus protocol that will never terminate under failures. FLP says: relax the timing model, don't add more retries.
2. Under-counting processes needed for Byzantine tolerance Assuming n > 2f suffices when Byzantine failures are possible. The correct threshold is n > 3f. Systems with n = 3 tolerating 1 Byzantine fault are unsound.
3. Treating consistency models as interchangeable Mixing linearizable and sequentially consistent components assuming the result is linearizable. Atomicity composes; weaker conditions don't.
4. Informal reasoning about interleaving Writing prose arguments about why "this can't happen" in concurrent execution. Informal concurrent reasoning fails silently. The fix: state machines and invariant assertions.
5. Conflating safety and liveness failures Treating a liveness bug (system halts) the same as a safety bug (system produces wrong result). They have different causes, different impossibility results, and different fixes.
6. Ignoring the failure detector requirement Implementing consensus in a "partially synchronous" system without specifying what failure detector is available and whether it's sufficient (◇P is the minimum for consensus).
7. Assuming relaxed consistency is "free" Choosing eventual consistency to escape CAP/FLP constraints without accounting for what application-level invariants now become the programmer's problem.
How to tell if someone has actually internalized Lynch vs. just read the summary:
✓ Internalized: When told "our consensus is broken," their first question is "what's your timing model and failure model?" — not "what algorithm are you using?"
✓ Internalized: They know FLP is about asynchronous systems specifically, and immediately ask "so what synchrony assumption does your system actually make?" rather than concluding "consensus is impossible."
✓ Internalized: They distinguish between the possibility result (can we solve this at all?) and the complexity result (at what cost?) as two separate questions requiring separate answers.
✓ Internalized: When someone says "we use eventual consistency," they ask "eventual consistency of what? Under what conflict resolution rule? With what convergence guarantee?" — not just nod.
✓ Internalized: They know the synchronizer doesn't make async systems synchronous — it provides a simulation that preserves certain properties, and they know which properties are and aren't preserved.
How to spot someone who only read the summary:
✗ They cite FLP as a reason consensus is "basically impossible" without specifying the async model assumption.
✗ They say "Byzantine fault tolerance requires 3x the nodes" without knowing why 3f+1 and not 2f+1 (the triangle argument: can't distinguish which third is lying).
✗ They treat "linearizable" and "serializable" as synonyms.
✗ They believe adding more replicas always increases fault tolerance (it does for availability; for Byzantine tolerance it requires n > 3f which is a different constraint).
✗ They argue informally that "the race condition can't happen in practice" rather than proving it can't happen via invariant.
Load reference files from the table above when deeper treatment of a specific topic is needed.
tools
Building resilient distributed systems with circuit breakers, retries with full-jitter exponential backoff, retry budgets (per-request 3-attempt + per-client 10% ratio per Google SRE), deadline propagation, and the cascading-failure math (4 layers × 3 retries = 64x amplification). Grounded in Resilience4j, Microsoft Cloud Patterns, AWS Architecture Blog (Marc Brooker), and Google SRE Book.
testing
Designing HTTP cache headers that work correctly across browsers, CDNs, and shared proxies — `Cache-Control` directives per RFC 9111, `stale-while-revalidate` and `stale-if-error` per RFC 5861, the Vary header for varying responses, and surrogate keys for tag-based purging. Grounded in IETF RFCs and Cloudflare/Fastly docs.
development
Use when designing or fixing a Content Security Policy on a real site, choosing between nonce-based and hash-based CSP, adding strict-dynamic, debugging "Refused to execute inline script" errors, deploying CSP in report-only mode first, configuring report-to / report-uri, or auditing an existing policy for unsafe-inline / unsafe-eval / wildcards. Triggers: "CSP blocks legitimate inline script", strict-dynamic, nonce-{RANDOM}, sha256-{HASH}, object-src none, base-uri none, frame-ancestors, Trusted Types, X-Content-Security-Policy obsolete, report-only vs enforced. NOT for general HTTP security headers (HSTS, COOP/COEP), Trusted Types deep dive, CORS configuration, or building a WAF.
tools
Choosing and operating an HTTP API versioning strategy that doesn't break clients — Stripe's date-based pinned versions, the Deprecation/Sunset header pair (RFC 9745 + RFC 8594), URI vs header vs media-type approaches, and the version-transformer pattern. Grounded in Stripe's published architecture and IETF RFCs.