SKILL: FLP Impossibility and Distributed Consensus

DECISION POINTS

1. Coordination Strategy Selection

Given system constraints → Choose approach:

IF no timing assumptions allowed AND exact consensus required
├── THEN consensus is impossible
├── → Redesign to avoid coordination
└── → OR accept potential non-termination + add manual escape

IF timing assumptions acceptable (timeouts, heartbeats allowed)
├── AND need strong consistency
├── → Use Paxos/Raft family protocols
└── → Monitor timing assumption violations

IF approximate results acceptable
├── OR partial agreement sufficient
├── → Use best-effort protocols
└── → Trade precision for availability

IF static participant set AND known failure bound
├── → Use majority protocol
└── → (Consensus IS possible in this case)

2. Non-Termination Diagnosis

System stuck without deciding → Check cause:

IF cannot distinguish slow from crashed processes
├── → You've hit FLP boundary
├── → Add failure detector (timeout) OR accept non-termination
└── → Document synchrony assumptions if using timeouts

IF system has timing assumptions but still hangs
├── → Assumptions violated (network partition, load spike)
├── → Check: timeout too aggressive? GC pause? Network delay?
└── → Adjust assumptions OR add assumption violation handling

IF in bivalent state (can decide either way)
├── → Normal FLP scenario - not a bug
├── → Add escape hatch: manual override, quorum adjustment
└── → OR wait for network/timing conditions to improve

3. Timeout Setting Strategy

Setting failure detection timeout → Balance false positives vs negatives:

IF timeout too short
├── → False positives: declare live process dead
├── → Leads to: split decisions, unnecessary failovers
└── → SOLUTION: make adaptive, increase under load

IF timeout too long
├── → False negatives: don't detect real failures
├── → Leads to: long waits, poor user experience
└── → SOLUTION: expose partial results, manual override

ALWAYS:
├── → Log timeout violations to validate assumptions
├── → Design protocol to handle false positives gracefully
└── → Make timing assumptions explicit in documentation

4. Task Result Aggregation

Aggregating parallel task results → Check operation properties:

IF aggregation is commutative/associative (order doesn't matter)
├── → Process results as they arrive
├── → Minimal coordination needed
└── → Example: sum, max, set union

IF aggregation requires specific order OR all inputs
├── → Requires coordination protocol
├── → Apply coordination strategy selection (Decision Point 1)
└── → Example: consensus on final decision, ordered operations

IF some tasks might not complete
├── → Wait forever (pure asynchrony) OR timeout (synchrony assumption)
├── → Recommend: expose both modes
└── → Strict mode vs best-effort mode

FAILURE MODES

1. Speed Fallacy

Symptom: System still hangs despite hardware upgrades, faster networks, optimized code Diagnosis: Treating impossibility as performance problem Detection Rule: If you're saying "we'll make it fast enough that failures don't matter" Fix: Accept the trade-off, choose which guarantee to sacrifice (usually pure asynchrony via timeouts)

2. Hidden Synchrony

Symptom: System works in testing but fails unpredictably in production during network issues Diagnosis: Using timeouts/heartbeats without acknowledging synchrony assumptions Detection Rule: If your "asynchronous" system has any timeout values Fix: Document timing assumptions explicitly, monitor violations, design for assumption breaches

3. Infinite Wait Trap

Symptom: System appears frozen, users can't proceed, no error message or recovery option Diagnosis: Protocol blocks indefinitely waiting for consensus without escape hatch Detection Rule: If system can enter a state where it waits forever with no user recourse Fix: Add timeouts, manual override, quorum adjustment, or "decide with available data" mode

4. Bivalence Blindness

Symptom: Operators can't tell if system is making progress or stuck, debugging is impossible Diagnosis: Treating consensus as black box without exposing internal state Detection Rule: If you can't answer "is the system stuck or still working?" Fix: Expose bivalent/univalent state, show which processes voted, add progress indicators

5. Consensus Confusion

Symptom: Applying FLP reasoning to problems that don't need binary consensus Diagnosis: Over-applying impossibility result to leader election, ordering, or best-effort coordination Detection Rule: If you're using "FLP says it's impossible" for non-consensus problems Fix: Distinguish exact consensus from other coordination patterns, use appropriate guarantees

WORKED EXAMPLES

Example 1: Multi-Agent Code Review Decision

Scenario: 5 AI agents must reach consensus on whether to approve a pull request. Each agent analyzes different aspects (security, performance, style, tests, logic).

Expert Decision Process:

Identify coordination type: This requires exact consensus - all agents must agree on single approve/reject decision
Check FLP applicability: Agents operate asynchronously, any could crash during analysis
Apply Decision Point 1: Need timing assumptions for guaranteed termination
Choose protocol: Implement Paxos-style protocol with 30-second analysis timeout per agent

What novice misses:

Tries to implement "pure" asynchronous consensus (impossible)
Doesn't plan for agents that crash mid-analysis
No escape hatch when 2 agents approve, 2 reject, 1 crashes

Expert solution:

Explicit timeout: declare agent failed after 30sec
Escape hatch: manual review if consensus fails
Progress visibility: show which agents voted, current tally
Handle timing violations: increase timeout under load

Example 2: Distributed Task Result Aggregation

Scenario: 100 agents processing data chunks, need to aggregate results into final report. Some agents might fail.

Expert Decision Process:

Check operation properties: Result aggregation is commutative (order doesn't matter)
Apply Decision Point 4: Can process results as they arrive - minimal coordination
Handle failures: Use "best effort" with timeout rather than exact consensus
Design escape: After 90% agents report OR 5-minute timeout, generate report with available data

FLP implications:

Waiting for all 100 agents = potential infinite wait (pure asynchrony)
Using timeout = synchrony assumption that could be wrong
Trade-off: accept partial results for guaranteed termination

Expert solution:

Stream processing: incorporate results as they arrive
Dual mode: "strict" (wait for all) vs "best effort" (timeout)
Visibility: show completion percentage, which agents pending
Graceful degradation: report quality score based on coverage

Example 3: Debugging Consensus Hang

Scenario: Raft-based coordination system appears stuck during leader election.

Expert Diagnosis Process:

Check bivalence state: Are we in leader election phase? (bivalent for leader choice)
Apply Decision Point 2: System has timing assumptions but still hangs
Investigate timing violations: Network partition? GC pause? Load spike causing timeout violations?
Root cause: Network latency increased 10x, breaking heartbeat assumptions

What novice sees: "Consensus is broken, FLP proves this is impossible" What expert sees: "Synchrony assumptions violated, this is expected behavior"

Expert solution:

Monitor timing assumptions: alert when heartbeat intervals exceeded
Adaptive timeouts: increase under network stress
Manual escape: force leader election with quorum override
Document assumption: "requires <200ms network latency for guaranteed progress"

QUALITY GATES

Task completion checklist for applying FLP knowledge:

[ ] Identified whether exact consensus is actually required (vs. best-effort coordination)
[ ] Made synchrony assumptions explicit (documented timeout values and their justification)
[ ] Added escape hatch for legitimate non-termination scenarios (manual override, quorum adjustment, partial results)
[ ] Exposed system coordination state (bivalent vs committed, which processes responded)
[ ] Designed handling for timing assumption violations (network delays, GC pauses, load spikes)
[ ] Distinguished coordination problems that require consensus from those that don't
[ ] Added monitoring for coordination progress and timing assumption breaches
[ ] Tested failure scenarios during coordination (crash during decision phase)
[ ] Documented trade-offs made (which guarantee sacrificed: termination, fault tolerance, or asynchrony)
[ ] Verified system degrades gracefully when assumptions violated rather than hanging indefinitely

NOT-FOR BOUNDARIES

Do NOT use this skill for:

Performance optimization: FLP is about impossibility, not speed - use performance-analysis skill instead
Byzantine fault tolerance: FLP assumes crash failures only - use byzantine-consensus skill for adversarial scenarios
Eventual consistency patterns: These sidestep consensus entirely - use eventual-consistency skill
Single-process decision making: FLP only applies to distributed coordination
Best-effort coordination: When approximate/partial results acceptable - use coordination-patterns skill
Leader election specifics: Related but different problem - use leader-election skill
Causal consistency: Different consistency model - use causal-consistency skill

Delegate to other skills when:

Client asks about specific consensus algorithms → Use paxos-protocol or raft-consensus skills
Need to implement failure detectors → Use failure-detection skill
Designing for Byzantine attackers → Use byzantine-consensus skill
Optimizing consensus performance → Use consensus-optimization skill
Questions about linearizability → Use consistency-models skill

SKILL: FLP Impossibility and Distributed Consensus

DECISION POINTS

1. Coordination Strategy Selection

Given system constraints → Choose approach:

IF no timing assumptions allowed AND exact consensus required
├── THEN consensus is impossible
├── → Redesign to avoid coordination
└── → OR accept potential non-termination + add manual escape

IF timing assumptions acceptable (timeouts, heartbeats allowed)
├── AND need strong consistency
├── → Use Paxos/Raft family protocols
└── → Monitor timing assumption violations

IF approximate results acceptable
├── OR partial agreement sufficient
├── → Use best-effort protocols
└── → Trade precision for availability

IF static participant set AND known failure bound
├── → Use majority protocol
└── → (Consensus IS possible in this case)

2. Non-Termination Diagnosis

System stuck without deciding → Check cause:

IF cannot distinguish slow from crashed processes
├── → You've hit FLP boundary
├── → Add failure detector (timeout) OR accept non-termination
└── → Document synchrony assumptions if using timeouts

IF system has timing assumptions but still hangs
├── → Assumptions violated (network partition, load spike)
├── → Check: timeout too aggressive? GC pause? Network delay?
└── → Adjust assumptions OR add assumption violation handling

IF in bivalent state (can decide either way)
├── → Normal FLP scenario - not a bug
├── → Add escape hatch: manual override, quorum adjustment
└── → OR wait for network/timing conditions to improve

3. Timeout Setting Strategy

Setting failure detection timeout → Balance false positives vs negatives:

IF timeout too short
├── → False positives: declare live process dead
├── → Leads to: split decisions, unnecessary failovers
└── → SOLUTION: make adaptive, increase under load

IF timeout too long
├── → False negatives: don't detect real failures
├── → Leads to: long waits, poor user experience
└── → SOLUTION: expose partial results, manual override

ALWAYS:
├── → Log timeout violations to validate assumptions
├── → Design protocol to handle false positives gracefully
└── → Make timing assumptions explicit in documentation

4. Task Result Aggregation

Aggregating parallel task results → Check operation properties:

IF aggregation is commutative/associative (order doesn't matter)
├── → Process results as they arrive
├── → Minimal coordination needed
└── → Example: sum, max, set union

IF aggregation requires specific order OR all inputs
├── → Requires coordination protocol
├── → Apply coordination strategy selection (Decision Point 1)
└── → Example: consensus on final decision, ordered operations

IF some tasks might not complete
├── → Wait forever (pure asynchrony) OR timeout (synchrony assumption)
├── → Recommend: expose both modes
└── → Strict mode vs best-effort mode

FAILURE MODES

1. Speed Fallacy

2. Hidden Synchrony

3. Infinite Wait Trap

4. Bivalence Blindness

5. Consensus Confusion

WORKED EXAMPLES

Example 1: Multi-Agent Code Review Decision

Scenario: 5 AI agents must reach consensus on whether to approve a pull request. Each agent analyzes different aspects (security, performance, style, tests, logic).

Expert Decision Process:

Identify coordination type: This requires exact consensus - all agents must agree on single approve/reject decision
Check FLP applicability: Agents operate asynchronously, any could crash during analysis
Apply Decision Point 1: Need timing assumptions for guaranteed termination
Choose protocol: Implement Paxos-style protocol with 30-second analysis timeout per agent

What novice misses:

Tries to implement "pure" asynchronous consensus (impossible)
Doesn't plan for agents that crash mid-analysis
No escape hatch when 2 agents approve, 2 reject, 1 crashes

Expert solution:

Explicit timeout: declare agent failed after 30sec
Escape hatch: manual review if consensus fails
Progress visibility: show which agents voted, current tally
Handle timing violations: increase timeout under load

Example 2: Distributed Task Result Aggregation

Scenario: 100 agents processing data chunks, need to aggregate results into final report. Some agents might fail.

Expert Decision Process:

Check operation properties: Result aggregation is commutative (order doesn't matter)
Apply Decision Point 4: Can process results as they arrive - minimal coordination
Handle failures: Use "best effort" with timeout rather than exact consensus
Design escape: After 90% agents report OR 5-minute timeout, generate report with available data

FLP implications:

Waiting for all 100 agents = potential infinite wait (pure asynchrony)
Using timeout = synchrony assumption that could be wrong
Trade-off: accept partial results for guaranteed termination

Expert solution:

Stream processing: incorporate results as they arrive
Dual mode: "strict" (wait for all) vs "best effort" (timeout)
Visibility: show completion percentage, which agents pending
Graceful degradation: report quality score based on coverage

Example 3: Debugging Consensus Hang

Scenario: Raft-based coordination system appears stuck during leader election.

Expert Diagnosis Process:

Check bivalence state: Are we in leader election phase? (bivalent for leader choice)
Apply Decision Point 2: System has timing assumptions but still hangs
Investigate timing violations: Network partition? GC pause? Load spike causing timeout violations?
Root cause: Network latency increased 10x, breaking heartbeat assumptions

What novice sees: "Consensus is broken, FLP proves this is impossible" What expert sees: "Synchrony assumptions violated, this is expected behavior"

Expert solution:

Monitor timing assumptions: alert when heartbeat intervals exceeded
Adaptive timeouts: increase under network stress
Manual escape: force leader election with quorum override
Document assumption: "requires <200ms network latency for guaranteed progress"

QUALITY GATES

Task completion checklist for applying FLP knowledge:

[ ] Identified whether exact consensus is actually required (vs. best-effort coordination)
[ ] Made synchrony assumptions explicit (documented timeout values and their justification)
[ ] Added escape hatch for legitimate non-termination scenarios (manual override, quorum adjustment, partial results)
[ ] Exposed system coordination state (bivalent vs committed, which processes responded)
[ ] Designed handling for timing assumption violations (network delays, GC pauses, load spikes)
[ ] Distinguished coordination problems that require consensus from those that don't
[ ] Added monitoring for coordination progress and timing assumption breaches
[ ] Tested failure scenarios during coordination (crash during decision phase)
[ ] Documented trade-offs made (which guarantee sacrificed: termination, fault tolerance, or asynchrony)
[ ] Verified system degrades gracefully when assumptions violated rather than hanging indefinitely

NOT-FOR BOUNDARIES

Do NOT use this skill for:

Performance optimization: FLP is about impossibility, not speed - use performance-analysis skill instead
Byzantine fault tolerance: FLP assumes crash failures only - use byzantine-consensus skill for adversarial scenarios
Eventual consistency patterns: These sidestep consensus entirely - use eventual-consistency skill
Single-process decision making: FLP only applies to distributed coordination
Best-effort coordination: When approximate/partial results acceptable - use coordination-patterns skill
Leader election specifics: Related but different problem - use leader-election skill
Causal consistency: Different consistency model - use causal-consistency skill

Delegate to other skills when:

Client asks about specific consensus algorithms → Use paxos-protocol or raft-consensus skills
Need to implement failure detectors → Use failure-detection skill
Designing for Byzantine attackers → Use byzantine-consensus skill
Optimizing consensus performance → Use consensus-optimization skill
Questions about linearizability → Use consistency-models skill

Adoption

curiositech/fischer-lynch-paterson-1985-flp-impossibility

$ install --global

Security Scan Results

SKILL.md

SKILL: FLP Impossibility and Distributed Consensus

DECISION POINTS

1. Coordination Strategy Selection

2. Non-Termination Diagnosis

3. Timeout Setting Strategy

4. Task Result Aggregation

FAILURE MODES

1. Speed Fallacy

2. Hidden Synchrony

3. Infinite Wait Trap

4. Bivalence Blindness

5. Consensus Confusion

WORKED EXAMPLES

Example 1: Multi-Agent Code Review Decision

Example 2: Distributed Task Result Aggregation

Example 3: Debugging Consensus Hang

QUALITY GATES

NOT-FOR BOUNDARIES

Related Skills

curiositech/revisiting-interview-data-analysing-turn

curiositech/redis-patterns-expert

curiositech/react-server-components-boundary

curiositech/rate-limiting-strategy

curiositech/fischer-lynch-paterson-1985-flp-impossibility

$ install --global

Security Scan Results

SKILL.md

SKILL: FLP Impossibility and Distributed Consensus

DECISION POINTS

1. Coordination Strategy Selection

2. Non-Termination Diagnosis

3. Timeout Setting Strategy

4. Task Result Aggregation

FAILURE MODES

1. Speed Fallacy

2. Hidden Synchrony

3. Infinite Wait Trap

4. Bivalence Blindness

5. Consensus Confusion

WORKED EXAMPLES

Example 1: Multi-Agent Code Review Decision

Example 2: Distributed Task Result Aggregation

Example 3: Debugging Consensus Hang

QUALITY GATES

NOT-FOR BOUNDARIES

Related Skills

curiositech/revisiting-interview-data-analysing-turn

curiositech/redis-patterns-expert

curiositech/react-server-components-boundary

curiositech/rate-limiting-strategy