TLA+ Specification Correctness Guide

A phased methodology for writing correct TLA+ specifications. Every rule, pattern, and recommendation traces to established literature, Lamport's own specifications, or battle-tested industrial practice. No guessing.

This skill guides you through specification from foundations to model checking configuration. Follow the phases in order for new specs; jump to specific phases for targeted reviews.

Evidence Base

| ID | Source | Key Principle | |----|--------|---------------| | M1 | Lamport, Specifying Systems (2002) | Canonical form: Init /\ [][Next]_vars /\ Fairness | | M2 | Lamport, Specifying Systems Ch. 6 | Stuttering: [][Next]_vars permits stuttering steps; compositional refinement | | M3 | Lamport, Specifying Systems Ch. 8 | Liveness: temporal properties, fairness as specification component | | M4 | Lamport, The Temporal Logic of Actions (1994) | Weak vs strong fairness definitions and selection criteria | | M5 | Lamport, Specifying Systems Ch. 5 | UNCHANGED: every action must specify all unmodified variables | | M8 | Lamport, Specifying Systems Ch. 2-4 | Abstraction: model WHAT not HOW, separate interface from implementation | | M9 | Lamport, TLA+ community practice | TypeOK as first invariant constraining variable domains | | M10 | Newcombe et al., CACM 2015 | AWS experience: 3-5 processes sufficient for most design bugs | | M11 | TLC reference manual | TLC config options and their soundness guarantees | | M12 | Yu et al., Model Checking TLA+ Specifications (1999) | TLC implementation, symmetry reduction, state constraints | | M14 | Lamport, Specifying Systems; Raft TLA+ spec | Distributed systems patterns in TLA+ | | M16 | Lamport, A PlusCal User's Manual (2009) | PlusCal as algorithmic specification language compiling to TLA+ |

Confidence Tiers

Every rule is tagged with a confidence tier. Higher tiers have stronger evidence and should be followed more strictly.

| Tier | Meaning | Compliance | |------|---------|------------| | T1 | High confidence, strong evidence from multiple sources | Non-negotiable for correctness | | T2 | Good evidence, note caveats | Strong default, deviate with documented reason | | T3 | Use with caution, limited evidence | Useful guidance, verify in your context |

When to Use

Writing a new TLA+ specification for any protocol or algorithm
Reviewing an existing TLA+ specification for correctness
Adding liveness properties to a safety-only specification
Configuring TLC for model checking
Modeling distributed systems protocols (consensus, replication, leader election)
Deciding between PlusCal and TLA+ for a new specification

When NOT to Use

Writing implementation code (this is about specifications, not programs)
Working with other formal methods (Alloy, SPIN, Coq) — different tools, different rules
Simple sequential algorithms where testing suffices
When you need a TLA+ syntax tutorial (this assumes basic TLA+ literacy)

Phase 1: Specification Foundations

Establish the structural skeleton of a correct specification. Every TLA+ specification must satisfy these non-negotiable fundamentals before any properties are checked.

1.1 Canonical Form [T1, M1, P4.1.F1]

Use the canonical form as the default template for all specifications.

Safety-only specification (no liveness requirements):

---- MODULE MyProtocol ----
EXTENDS Integers, Sequences, FiniteSets, TLC

CONSTANTS
    Node,       \* Set of participating nodes
    MaxValue    \* Domain bound for model checking

VARIABLES
    state,      \* Per-node local state
    msgs        \* Set of in-flight messages (or Bag for realism)

vars == <<state, msgs>>

------------------------------------------------------------
\* Type invariant — ALWAYS define first [T1, M9]
TypeOK ==
    /\ state \in [Node -> {"idle", "active", "done"}]
    /\ msgs \subseteq MessageType

------------------------------------------------------------
\* Initial state
Init ==
    /\ state = [n \in Node |-> "idle"]
    /\ msgs = {}

------------------------------------------------------------
\* Actions — each MUST specify UNCHANGED for all unmodified variables [T1, M5]
ActionA(n) ==
    /\ state[n] = "idle"
    /\ state' = [state EXCEPT ![n] = "active"]
    /\ UNCHANGED <<msgs>>

ActionB(n) ==
    /\ state[n] = "active"
    /\ msgs' = msgs \cup {[type |-> "done", from |-> n]}
    /\ UNCHANGED <<state>>

------------------------------------------------------------
\* Next-state relation
Next ==
    \E n \in Node :
        \/ ActionA(n)
        \/ ActionB(n)

------------------------------------------------------------
\* Specification — canonical form
Spec == Init /\ [][Next]_vars

====

Safety + liveness specification (add fairness only when needed):

\* Add ONLY for liveness properties [T1, M1, M3]
Fairness ==
    /\ \A n \in Node : WF_vars(ActionA(n))
    /\ \A n \in Node : WF_vars(ActionB(n))

Spec == Init /\ [][Next]_vars /\ Fairness

Caveat [P4.1.F1]: Canonical form is a default template, not a law. Lamport himself deviates in some specifications (e.g., omitting fairness entirely for safety-only specs, using different temporal formulas for specific refinement mappings). If you deviate, document WHY with a reference to the specific pattern you are following.

1.2 Stuttering Invariance Checklist [T1, M2, M5]

Stuttering steps (where no variable changes) are fundamental to TLA+ and enable compositional refinement. The [][Next]_vars form explicitly permits them — this is not optional.

STUTTERING CHECKLIST
────────────────────
[ ] Specification uses [][Next]_vars (NOT [][Next])
    → [][Next] without _vars FORBIDS stuttering and breaks refinement.
    → The ONLY valid form is [][Next]_vars where vars is the
      tuple of ALL specification variables.

[ ] vars tuple contains ALL declared VARIABLES
    → If you add a variable, update the vars tuple immediately.
    → Missing a variable from vars silently permits arbitrary changes to it.

[ ] Every action specifies UNCHANGED for ALL unmodified variables
    → An action that mentions v1' but not v2' leaves v2 unconstrained.
    → TLC will explore states where v2 takes ANY value in its type domain.
    → This is the #1 source of spurious counterexamples.

Variable grouping pattern for specs with many variables:

\* Group related variables for manageable UNCHANGED clauses
nodeVars    == <<state, epoch>>
networkVars == <<msgs, delivered>>
configVars  == <<members, leader>>

vars == <<nodeVars, networkVars, configVars>>

\* Actions use group UNCHANGED
SendMessage(n) ==
    /\ ...
    /\ msgs' = msgs \cup {msg}
    /\ UNCHANGED <<nodeVars, configVars, delivered>>

1.3 TypeOK as First Invariant [T1, M9]

TypeOK constrains the domains of all variables. It is always the first invariant you write and the first invariant you check.

Why TypeOK matters:

Catches variable domain errors before expensive property checking
Acts as a sanity check that Init and Next are well-formed
Provides a foundation for TLAPS proofs (type correctness is a proof obligation)
Rapidly detects missing UNCHANGED clauses (unconstrained variables violate TypeOK)

TypeOK construction rules:

TYPEOK CONSTRUCTION
───────────────────
[ ] Every VARIABLE appears in TypeOK

[ ] TypeOK constrains variables to finite domains for TLC
    → "x \in Nat" is valid TLA+ but NOT model-checkable
    → Use "x \in 0..MaxBound" for model checking

[ ] TypeOK is checked BEFORE other invariants
    → In the TLC config: INVARIANT TypeOK, SafetyProp1, SafetyProp2
    → TypeOK violations are always the first thing to debug

[ ] Function-valued variables use domain constraints
    → state \in [Node -> StatusSet] not just "state is a function"
    → Check both the domain AND the range

TypeOK patterns by variable type:

| Variable Type | TypeOK Pattern | Example | |---------------|----------------|---------| | Simple value | v \in DomainSet | epoch \in Nat (or 0..MaxEpoch for TLC) | | Function | v \in [Domain -> Range] | state \in [Node -> {"idle","active"}] | | Set | v \subseteq BaseSet | msgs \subseteq MessageType | | Sequence | v \in Seq(ElemType) | log \in Seq(Entry) (use BSeq(Entry, MaxLen) for TLC) | | Record | v \in [field1: T1, field2: T2] | msg \in [type: MsgTypes, from: Node] | | Optional | v \in T \cup {None} | leader \in Node \cup {None} | | Bag/Multiset | Custom predicate | BagToSet(msgs) \subseteq MessageType |

Phase 2: Abstraction and Modeling

Select the right level of abstraction BEFORE writing specification logic. Wrong abstraction is the most expensive specification error — it wastes model-checking time on irrelevant detail or misses real bugs by over-abstracting.

2.1 Seven-Step Abstraction Framework [T2, P3.1, M8]

Follow these steps in order when starting a new specification.

Step 1: State the property first

Before modeling any behavior, write down the property you want to verify in plain English. This determines what MUST be in the specification and what can be abstracted away.

Example:
  Property: "No two nodes hold the lock simultaneously"
  → Need: lock state per node
  → Don't need: network packet format, serialization, retry timers

Step 2: Identify minimal state

List the minimum set of variables needed to express the property from Step 1. Every variable must appear in at least one property or be necessary for computing a variable that does.

Minimal state test: For each variable, ask:
  "If I remove this variable, can I still express all my properties?"
  If yes → remove it.

Step 3: Model the environment as nondeterminism

Do not model external systems (clients, networks, OS) in detail. Replace them with nondeterministic choices that capture all possible behaviors.

\* BAD: modeling a specific client
ClientSend == /\ clientState = "ready"
              /\ clientState' = "waiting"
              /\ ...

\* GOOD: nondeterministic environment
ReceiveRequest(n) == \* A request can arrive at any node, any time
    /\ state[n] = "idle"
    /\ state' = [state EXCEPT ![n] = "processing"]
    /\ UNCHANGED <<...>>

Step 4: Choose communication model

| Model | TLA+ Representation | Use When | Tradeoff | |-------|-------------------|----------|----------| | Set of messages | msgs \subseteq MsgType | Safety properties only | Simplest; loses ordering and duplicates | | Bag (multiset) | msgs \in BagType | Need duplicate detection | Realistic; moderate complexity | | Sequence per link | chan \in [Link -> Seq(MsgType)] | FIFO ordering required | Most complex; needed for TCP-like channels |

[T2, P3.4, M14]: For most distributed protocol specifications, start with a set model. Upgrade to bag/sequence only if a property depends on message ordering or duplicate counting.

Step 5: Separate WHAT from HOW

Specify the intended behavior, not the implementation mechanism.

\* BAD: specifying HOW (implementation detail)
AcquireLock(n) ==
    /\ Send([type |-> "lock_req", from |-> n])
    /\ WaitForReply(n)
    /\ ProcessReply(n)

\* GOOD: specifying WHAT (observable behavior)
AcquireLock(n) ==
    /\ lock = None
    /\ lock' = n
    /\ UNCHANGED <<...>>

Step 6: Diagnose over/under-specification

| Symptom | Diagnosis | Fix | |---------|-----------|-----| | TLC explores billions of states for a 3-node model | Over-specified (too much detail) | Remove implementation-level state | | Safety property passes but system is clearly wrong | Under-specified (missing behaviors) | Add actions for failure cases, concurrent operations | | Liveness property fails with trivial counterexample | Missing fairness or missing actions | Check that all intended behaviors have actions | | No errors but the spec "feels wrong" | Wrong abstraction level | Return to Step 1, re-examine the property |

Step 7: Refine incrementally

Start with the simplest correct spec and add detail only when needed to express properties that the simpler spec cannot. Each refinement step should be verifiable: the refined spec should refine (implement) the abstract spec.

2.2 Ten Distributed Systems Patterns [T2, P3.4, M14]

Common modeling patterns for distributed protocols. Each pattern includes the TLA+ idiom and notes on when to use it.

Pattern 1: Quorum

\* Abstract: CONSTANT set of quorums
CONSTANT Quorum
ASSUME QuorumAssumption ==
    /\ \A Q \in Quorum : Q \subseteq Node
    /\ \A Q1, Q2 \in Quorum : Q1 \cap Q2 /= {}

\* OR computed majority (simpler for symmetric systems)
Quorum == {S \in SUBSET Node : Cardinality(S) * 2 > Cardinality(Node)}

For model checking, the computed form is often better (no need to enumerate quorums in the config). For proof, the abstract CONSTANT form is more general.

Pattern 2: Epoch / Term (Monotonic Counter)

\* Per-node monotonic epoch
VARIABLE currentEpoch  \* currentEpoch \in [Node -> Nat]

\* Epoch advances are strictly increasing per node
AdvanceEpoch(n) ==
    /\ currentEpoch' = [currentEpoch EXCEPT ![n] = currentEpoch[n] + 1]
    /\ ...

\* Stale-epoch rejection
ProcessMessage(n, msg) ==
    /\ msg.epoch >= currentEpoch[n]  \* reject stale
    /\ ...

Pattern 3: Leader Election (Nondeterministic)

\* Abstract leader election — nondeterministic candidacy
BecomeLeader(n) ==
    /\ \A m \in Node : leader[m] /= n  \* not already leader
    /\ leader' = [leader EXCEPT ![n] = n]
    /\ currentEpoch' = [currentEpoch EXCEPT ![n] = currentEpoch[n] + 1]
    /\ UNCHANGED <<...>>

Do NOT model the election protocol (Raft, ZAB) unless you are verifying the election protocol itself. Abstract it as nondeterministic leader selection with appropriate constraints.

Pattern 4: Timeout (Always-Enabled Action, NO Clock)

\* BAD: modeling a clock for timeouts
Timeout(n) ==
    /\ clock[n] >= deadline[n]  \* NO — do not model clocks
    /\ ...

\* GOOD: timeout as an always-enabled action
Timeout(n) ==
    /\ state[n] = "waiting"
    /\ state' = [state EXCEPT ![n] = "timed_out"]
    /\ UNCHANGED <<...>>
\* Timeout can fire at ANY time a node is waiting — this captures
\* all possible timeout behaviors without modeling time.

[T1]: Never model real-time clocks in a TLA+ specification unless you are specifically verifying real-time properties. Timeouts are modeled as always-enabled actions that capture the nondeterminism of when they fire.

Pattern 5: Log Replication (Sequence with Truncate-Overwrite)

\* Log as sequence with append and truncate-overwrite
VARIABLE log  \* log \in [Node -> Seq(Entry)]

AppendEntry(n, entry) ==
    /\ log' = [log EXCEPT ![n] = Append(log[n], entry)]
    /\ UNCHANGED <<...>>

TruncateAndOverwrite(n, idx, entries) ==
    /\ log' = [log EXCEPT ![n] = SubSeq(log[n], 1, idx) \o entries]
    /\ UNCHANGED <<...>>

Pattern 6: Membership Change (Config as Log Entry)

\* Joint consensus: configuration transitions via log
VARIABLE config  \* config \in [Node -> SUBSET Node]

\* Configuration change is a special log entry
ProposeConfigChange(leader, newMembers) ==
    /\ AppendEntry(leader, [type |-> "config", members |-> newMembers])
    /\ config' = [config EXCEPT ![leader] = config[leader] \cup newMembers]
    /\ UNCHANGED <<...>>

Pattern 7: Idempotency (Accumulating Set)

\* Track processed operation IDs to ensure idempotency
VARIABLE processed  \* processed \in [Node -> SUBSET OpId]

ProcessOp(n, op) ==
    /\ op.id \notin processed[n]  \* skip if already processed
    /\ \* ... perform operation ...
    /\ processed' = [processed EXCEPT ![n] = processed[n] \cup {op.id}]
    /\ UNCHANGED <<...>>

Pattern 8: Two-Phase Protocol (Prepare-Collect-Decide)

\* Two-phase pattern: prepare → collect votes → decide
VARIABLE phase    \* [Node -> {"idle", "prepare", "decide"}]
VARIABLE votes    \* [Node -> SUBSET Node]

Prepare(n) ==
    /\ phase[n] = "idle"
    /\ phase' = [phase EXCEPT ![n] = "prepare"]
    /\ votes' = [votes EXCEPT ![n] = {}]
    /\ \* send prepare messages
    /\ UNCHANGED <<...>>

ReceiveVote(n, voter) ==
    /\ phase[n] = "prepare"
    /\ votes' = [votes EXCEPT ![n] = votes[n] \cup {voter}]
    /\ UNCHANGED <<...>>

Decide(n) ==
    /\ phase[n] = "prepare"
    /\ votes[n] \in Quorum  \* collected a quorum
    /\ phase' = [phase EXCEPT ![n] = "decide"]
    /\ UNCHANGED <<...>>

Pair with a refinement mapping to show this implements a simpler atomic specification. [T2]

Pattern 9: Failure and Restart (Durable vs Volatile State)

\* Failure: volatile state lost, durable state preserved
VARIABLE volatileState  \* lost on restart
VARIABLE durableState   \* preserved on restart

Crash(n) ==
    /\ volatileState' = [volatileState EXCEPT ![n] = InitVolatile]
    /\ UNCHANGED <<durableState, ...>>
    \* Durable state is UNCHANGED — this is the key invariant

Restart(n) ==
    /\ \* recover from durableState[n]
    /\ volatileState' = [volatileState EXCEPT
                           ![n] = RecoverFrom(durableState[n])]
    /\ UNCHANGED <<durableState, ...>>

Pattern 10: Ghost Variables (Append-Only History)

\* Ghost variable: records history for property checking
\* Does NOT affect system behavior — only used in invariants/properties
VARIABLE history  \* history \in Seq(Event)

\* Append to history in every action that is relevant
ActionWithHistory(n) ==
    /\ \* ... normal action logic ...
    /\ history' = Append(history, [type |-> "action", node |-> n, ...])
    /\ UNCHANGED <<...>>

\* Use ghost variable in properties
NoDoubleCommit ==
    \A i, j \in 1..Len(history) :
        /\ history[i].type = "commit"
        /\ history[j].type = "commit"
        /\ history[i].key = history[j].key
        => history[i].value = history[j].value

Ghost variables are specification-only artifacts. They are append-only (history never shrinks) and used solely to express properties that reference past states. Mark them clearly in comments.

Phase 3: Safety Verification

After establishing the specification structure (Phase 1) and modeling decisions (Phase 2), verify safety properties.

3.1 Safety Specification Checklist [T1, M9, M5]

SAFETY CHECKLIST
────────────────
[ ] TypeOK is defined and checked first
    → Every variable constrained to its domain [T1, M9]

[ ] All safety invariants expressed as state predicates
    → "P is always true" written as INVARIANT P in TLC config
    → For action properties, use [][ActionProp]_vars form

[ ] Deadlock-freedom checked between safety and liveness phases
    → TLC checks deadlock by default (all states have a Next step)
    → Intentional termination: add a Done action or use
      PROPERTY [][Next]_vars => <>[]done to distinguish
    → [P4.4.F6]: ALWAYS check deadlock-freedom after safety passes
      but BEFORE adding liveness. A deadlocked spec trivially
      satisfies some liveness properties (vacuous truth).

[ ] No CONSTRAINT in the TLC config for the final check
    → CONSTRAINT is acceptable during development to speed iteration
    → CONSTRAINT MUST be removed for the final safety verification
    → [T1, M11]: CONSTRAINT can hide states reachable via longer paths

[ ] Invariants are inductive (for TLAPS proofs)
    → An invariant I is inductive if: I /\ Next => I'
    → TypeOK is typically inductive; complex safety properties
      may need strengthening (adding auxiliary conjuncts)

3.2 Deadlock-Freedom Gate [T1, P4.4.F6]

This check is mandatory between the safety and liveness phases.

A specification deadlocks if some reachable state has no Next step enabled. TLC checks this by default. However:

If your protocol has intentional termination, you must distinguish it from unintentional deadlock:

\* Termination action — prevents false deadlock reports
Done ==
    /\ AllNodesFinished
    /\ UNCHANGED vars

Next == ActionA \/ ActionB \/ ... \/ Done

A deadlocked spec can vacuously satisfy liveness properties. If the system deadlocks in a state where the liveness property is already satisfied, TLC will report no error. This is not a real liveness guarantee.

DEADLOCK GATE
─────────────
[ ] TLC reports zero deadlock states
    OR
[ ] All "deadlock" states are intentional termination states
    AND they are documented in the spec
    AND a Done action is present

3.3 TLC Safety Configuration [T1, M11, M12]

Recommended TLC configuration for safety checking:

\* MyProtocol.cfg — safety checking configuration
SPECIFICATION Spec

\* Type invariant first, then safety properties
INVARIANT TypeOK
INVARIANT SafetyProperty1
INVARIANT SafetyProperty2

\* State constraint for development ONLY — remove for final check
\* CONSTRAINT StateConstraint

\* Symmetry sets — safe for safety properties [T1, M11]
SYMMETRY NodeSymmetry

\* Constants
CONSTANT Node = {n1, n2, n3}
CONSTANT MaxValue = 3

| TLC Option | Safety Checking | Notes | |------------|----------------|-------| | INVARIANT | Sound | Always use for state predicates | | CONSTRAINT | Sound but incomplete | May miss states reachable via longer paths; remove for final check | | SYMMETRY | Sound if symmetry set is correct | Incorrect symmetry sets produce unsound results silently | | VIEW | Sound | Reduces state space via abstraction | | ACTION_CONSTRAINT | Sound | Restricts which actions TLC explores |

Phase 4: Liveness Verification

Liveness properties ("something good eventually happens") require fairness assumptions and careful TLC configuration. This is where most soundness errors occur.

4.1 Six-Phase Liveness Methodology [T1, M3, P3.2, P3.3]

Follow these phases strictly in order.

Phase L1: Write safety first (complete Phase 3)

Never add liveness to a specification that has not passed safety checking. Liveness debugging on an incorrect safety spec is wasted effort.

Phase L2: Identify liveness requirements

Write down what MUST eventually happen in plain English:

Examples:
  - "Every request eventually receives a response"
  - "Every node eventually learns of every update"
  - "The leader eventually commits every proposed entry"
  - "The system eventually reaches a stable configuration"

Express each as a temporal formula:

\* "Every request eventually receives a response"
Liveness1 == \A r \in requests : <>(responded[r])

\* "The leader eventually commits" (conditional on leadership)
Liveness2 == []<>(committed = proposed)

\* Response: "always eventually" (repeated liveness)
Liveness3 == []<>(\E n \in Node : state[n] = "active")

Phase L3: Select fairness — WF first, SF only when needed [T2, M4, P3.3]

Use the fairness selection decision table:

| Condition | Fairness | Rationale | |-----------|----------|-----------| | Action is continuously enabled once enabled | WF_vars(Action) | WF suffices: if enabled forever, it eventually happens | | Action is repeatedly enabled then disabled (intermittent) | SF_vars(Action) | SF needed: must happen if repeatedly enabled, even if intermittently disabled | | Action represents environment/external input | No fairness | Never make the environment fair — it models adversarial behavior | | Action is the composite Next | No fairness on Next alone | WF_vars(Next) says "some action happens" but does not guarantee WHICH action happens; useless for most properties | | Timeout action | Usually no fairness | Timeouts model nondeterminism; fairness on timeouts constrains which timeouts fire |

WF vs SF decision procedure:

Start with WF_vars(A) for every system action A
Check if TLC finds a liveness counterexample
If the counterexample shows Action A repeatedly enabled then disabled without executing, upgrade to SF_vars(A) for that specific action
Never start with SF — it is a stronger assumption and may hide real liveness bugs

[T2, M4]: Lamport's convention: "Weak fairness is almost always what you want." Strong fairness is needed only for actions that are intermittently disabled by other actions (e.g., a receive action that is enabled only when a message is in transit, and messages can be repeatedly consumed and re-sent).

Phase L4: Verify machine closure (automatic in canonical form)

A specification Spec == Init /\ [][Next]_vars /\ Fairness is machine-closed if the safety part (Init /\ [][Next]_vars) does not constrain which infinite behaviors are possible (only fairness does that).

In the canonical form, machine closure is guaranteed [T1, M1]. You only need to worry about machine closure if you deviate from canonical form (e.g., adding raw temporal formulas to the spec).

Phase L5: Model-check liveness [T1, P3.2, M11, M12]

THIS IS THE CRITICAL STEP. TLC's liveness checking has soundness traps that do not exist in safety checking.

LIVENESS MODEL-CHECKING RULES
──────────────────────────────
[ ] REMOVE all CONSTRAINT directives from the TLC config
    → CONSTRAINT + liveness = UNSOUND [T1, M11]
    → CONSTRAINT can hide infinite behaviors that violate liveness
    → This is NON-NEGOTIABLE for liveness checking

[ ] REMOVE all SYMMETRY directives from the TLC config
    → SYMMETRY + liveness = DANGEROUS [T1, M11]
    → Symmetry reduction may cause TLC to miss liveness violations
    → It is sometimes safe but the conditions are subtle;
      removing it is always safe

[ ] Use -lncheck final for production liveness verification
    → -lncheck final performs the complete liveness check
    → Without -lncheck, TLC may use an incomplete algorithm

[ ] Liveness properties go in PROPERTY (not INVARIANT)
    → INVARIANT is for state predicates (safety)
    → PROPERTY is for temporal formulas (liveness)

Phase L6: Debug liveness failures [T1, P3.2]

TLC liveness counterexamples have a lasso structure: a finite prefix followed by a back-loop to a previously visited state. The back-loop represents an infinite behavior where the liveness property is never satisfied.

Lasso structure:
  s0 → s1 → s2 → ... → sk → ... → sj
                               ↑        |
                               └────────┘
                             (back-loop: infinite repetition)

Debugging procedure:
1. Read the finite prefix: how does the system reach state sk?
2. Read the back-loop: what actions repeat forever?
3. Check: is there an action that SHOULD fire but doesn't?
   → If yes: add or strengthen fairness for that action
4. Check: is the counterexample realistic?
   → Unrealistic: the environment acts adversarially in a way your
     system need not tolerate. Consider if your liveness requirement
     is too strong.
   → Realistic: the protocol has a liveness bug. Fix the protocol.
5. Check: does the back-loop involve only stutter steps?
   → If yes: your specification may be missing an action.

4.2 Complete TLC Soundness Matrix [T1, M11, M12]

This is the complete map of TLC configuration options versus soundness for safety and liveness checking. Reference this table before every TLC run.

| Config Option | Safety | Liveness | Risk | Recommendation | |---------------|--------|----------|------|----------------| | INVARIANT | Sound | N/A (wrong directive) | None for safety | Use for state predicates | | PROPERTY | N/A (wrong directive) | Sound (with caveats below) | None | Use for temporal formulas | | CONSTRAINT | Sound (but incomplete) | UNSOUND | High: hides reachable states from liveness check | Remove for final liveness check; acceptable during development for safety only | | SYMMETRY | Sound (if correct) | Dangerous | Medium: may miss counterexamples | Remove for liveness; use only for safety with verified symmetry set | | VIEW | Sound | Unverified | Medium: hash collisions possible | Use with caution; verify key results without VIEW | | ACTION_CONSTRAINT | Sound | Context-dependent | Medium: may prevent actions needed for liveness counterexample | Review carefully for liveness | | -lncheck final | N/A | Required for soundness | High if omitted: may use incomplete algorithm | Always use for production liveness checks | | -workers N (N>1) | Sound | Sound (TLC 2.18+) | Low: older TLC versions had bugs with multi-worker liveness | Use current TLC version |

[T1, P3.2]: The safe liveness workflow: (1) pass all safety checks with CONSTRAINT/SYMMETRY, (2) remove CONSTRAINT and SYMMETRY, (3) add PROPERTY directives, (4) run with -lncheck final, (5) accept the longer runtime.

Phase 5: Model Checking Configuration

5.1 Small Model Heuristics [T2, M10, P4.4.F3]

AWS's experience (Newcombe et al., CACM 2015): 3-5 processes are sufficient for finding most design bugs. Start small, increase only when needed.

| Model Size | When Sufficient | When Insufficient | |------------|----------------|-------------------| | 2 nodes | Mutual exclusion, simple leader-follower | Quorum-based protocols (need 3+) | | 3 nodes | Most consensus protocols, quorum systems | Protocols with f+1 requirements where f>1 | | 4-5 nodes | Complex fault-tolerance (f=1 with 2f+1) | Threshold-dependent bugs (e.g., f=2 requires 5+ nodes) | | >5 nodes | Byzantine fault tolerance (3f+1) | Rarely needed for design verification |

Caveat [P4.4.F3]: Small models miss:

Threshold-dependent bugs: Properties that only fail at specific quorum sizes (e.g., a bug that manifests only with exactly 2 of 5 nodes failing)

Symmetry-breaking bugs: Bugs that require specific asymmetric configurations

Scaling-related issues: Performance properties that degrade non-linearly

If your protocol has threshold-dependent behavior, test at least one configuration AT the threshold and one ABOVE it.

5.2 TLC Configuration Guide

Development configuration (fast iteration):

SPECIFICATION Spec
INVARIANT TypeOK
INVARIANT SafetyProperty
CONSTRAINT SmallStateConstraint   \* OK for safety development
SYMMETRY NodeSymmetry             \* OK for safety development
CONSTANT Node = {n1, n2, n3}     \* Start small

Final safety configuration (production):

SPECIFICATION Spec
INVARIANT TypeOK
INVARIANT SafetyProperty
\* NO CONSTRAINT — full state space
SYMMETRY NodeSymmetry             \* OK for safety (if correct)
CONSTANT Node = {n1, n2, n3}

Final liveness configuration (production):

SPECIFICATION Spec
INVARIANT TypeOK                  \* Keep safety checks active
PROPERTY LivenessProperty
\* NO CONSTRAINT — UNSOUND with liveness
\* NO SYMMETRY — DANGEROUS with liveness
CONSTANT Node = {n1, n2, n3}

TLC command-line flags:

| Flag | Purpose | When to Use | |------|---------|-------------| | -workers auto | Use all CPU cores | Always (production runs) | | -lncheck final | Complete liveness check | Always for liveness verification | | -deadlock | Disable deadlock checking | Only if spec has intentional terminal states AND Done action is present | | -depth N | Limit search depth | Development only; never for final verification | | -coverage N | Print coverage stats every N minutes | Development; helps find unreachable actions | | -dump format file | Dump state graph | Debugging counterexamples | | -simulate num=N | Random simulation instead of full BFS | Very large state spaces where exhaustive check is infeasible |

5.3 Multi-Tool Awareness [T3, P4.3.F6, P4.4.F2]

TLC is not the only TLA+ verification tool. Be aware of alternatives and their strengths.

| Tool | Technique | Strengths | Limitations | |------|-----------|-----------|-------------| | TLC | Explicit-state model checking | Complete for finite models; well-understood soundness; fast for small models | Exponential state explosion; finite models only | | Apalache | Symbolic model checking (SMT) | Handles larger state spaces; bounded model checking for unbounded domains; type checking | Slower per-state; less mature than TLC; limited liveness support | | TLAPS | Interactive theorem proving | Full mathematical proofs; unbounded verification; strongest guarantee | High manual effort; steep learning curve | | Quint | Emerging: TLA+-like language with better tooling | Modern syntax; integrated testing; simulation | Not yet production-proven; subset of TLA+ expressiveness |

TLC-specific advice is marked throughout this skill. Where behavior differs in Apalache, notes are included. However, this skill primarily targets TLC workflows. For Apalache-specific guidance, consult the Apalache documentation.

Apalache type annotations: Apalache requires type annotations that TLC ignores. If you plan to use both tools, add Apalache type annotations as comments:
\* @type: Set(NODE) => Bool;
IsQuorum(S) == Cardinality(S) * 2 > Cardinality(Node)

Phase 6: PlusCal Considerations [T3, M16, P4.4.F1]

PlusCal is a higher-level algorithmic language that compiles to TLA+. It is a valid choice for certain specifications.

6.1 PlusCal Decision Table

| Factor | PlusCal | Raw TLA+ | |--------|---------|----------| | Author familiarity | Imperative/sequential programming | Set theory, temporal logic | | Specification type | Sequential or simple concurrent algorithms | Complex concurrent/distributed protocols | | Expressiveness ceiling | Limited: sequential composition, limited nondeterminism | Full: arbitrary temporal logic, flexible composition | | Compilation | Compiles to TLA+ (can inspect/modify output) | Direct | | Fairness | Automatic per-process fairness | Manual fairness specification | | Best for | Teaching, prototyping, algorithm verification | Protocol verification, refinement, industrial use |

[T3, P4.4.F1]: A finished PlusCal specification is better than an unfinished TLA+ specification. If PlusCal gets you to a checkable spec faster, use it. But know the ceiling: PlusCal cannot express all TLA+ patterns (e.g., arbitrary interleaving of non-process actions, complex fairness, stuttering-sensitive refinement).

When to switch from PlusCal to TLA+:

You need fairness on individual actions (not just processes)

You need to compose specifications via refinement

You need ghost variables that participate in the temporal formula

PlusCal's process model does not match your system's concurrency model

Quick Reference

Common mistakes catalog, specification template, decision tables, evidence traceability matrix, and references are in references/quick-reference.md.

TLA+ Specification Correctness Guide

This skill guides you through specification from foundations to model checking configuration. Follow the phases in order for new specs; jump to specific phases for targeted reviews.

Evidence Base

Confidence Tiers

Every rule is tagged with a confidence tier. Higher tiers have stronger evidence and should be followed more strictly.

When to Use

Writing a new TLA+ specification for any protocol or algorithm
Reviewing an existing TLA+ specification for correctness
Adding liveness properties to a safety-only specification
Configuring TLC for model checking
Modeling distributed systems protocols (consensus, replication, leader election)
Deciding between PlusCal and TLA+ for a new specification

When NOT to Use

Writing implementation code (this is about specifications, not programs)
Working with other formal methods (Alloy, SPIN, Coq) — different tools, different rules
Simple sequential algorithms where testing suffices
When you need a TLA+ syntax tutorial (this assumes basic TLA+ literacy)

Phase 1: Specification Foundations

Establish the structural skeleton of a correct specification. Every TLA+ specification must satisfy these non-negotiable fundamentals before any properties are checked.

1.1 Canonical Form [T1, M1, P4.1.F1]

Use the canonical form as the default template for all specifications.

Safety-only specification (no liveness requirements):

---- MODULE MyProtocol ----
EXTENDS Integers, Sequences, FiniteSets, TLC

CONSTANTS
    Node,       \* Set of participating nodes
    MaxValue    \* Domain bound for model checking

VARIABLES
    state,      \* Per-node local state
    msgs        \* Set of in-flight messages (or Bag for realism)

vars == <<state, msgs>>

------------------------------------------------------------
\* Type invariant — ALWAYS define first [T1, M9]
TypeOK ==
    /\ state \in [Node -> {"idle", "active", "done"}]
    /\ msgs \subseteq MessageType

------------------------------------------------------------
\* Initial state
Init ==
    /\ state = [n \in Node |-> "idle"]
    /\ msgs = {}

------------------------------------------------------------
\* Actions — each MUST specify UNCHANGED for all unmodified variables [T1, M5]
ActionA(n) ==
    /\ state[n] = "idle"
    /\ state' = [state EXCEPT ![n] = "active"]
    /\ UNCHANGED <<msgs>>

ActionB(n) ==
    /\ state[n] = "active"
    /\ msgs' = msgs \cup {[type |-> "done", from |-> n]}
    /\ UNCHANGED <<state>>

------------------------------------------------------------
\* Next-state relation
Next ==
    \E n \in Node :
        \/ ActionA(n)
        \/ ActionB(n)

------------------------------------------------------------
\* Specification — canonical form
Spec == Init /\ [][Next]_vars

====

Safety + liveness specification (add fairness only when needed):

\* Add ONLY for liveness properties [T1, M1, M3]
Fairness ==
    /\ \A n \in Node : WF_vars(ActionA(n))
    /\ \A n \in Node : WF_vars(ActionB(n))

Spec == Init /\ [][Next]_vars /\ Fairness

Caveat [P4.1.F1]: Canonical form is a default template, not a law. Lamport himself deviates in some specifications (e.g., omitting fairness entirely for safety-only specs, using different temporal formulas for specific refinement mappings). If you deviate, document WHY with a reference to the specific pattern you are following.

1.2 Stuttering Invariance Checklist [T1, M2, M5]

Stuttering steps (where no variable changes) are fundamental to TLA+ and enable compositional refinement. The [][Next]_vars form explicitly permits them — this is not optional.

STUTTERING CHECKLIST
────────────────────
[ ] Specification uses [][Next]_vars (NOT [][Next])
    → [][Next] without _vars FORBIDS stuttering and breaks refinement.
    → The ONLY valid form is [][Next]_vars where vars is the
      tuple of ALL specification variables.

[ ] vars tuple contains ALL declared VARIABLES
    → If you add a variable, update the vars tuple immediately.
    → Missing a variable from vars silently permits arbitrary changes to it.

[ ] Every action specifies UNCHANGED for ALL unmodified variables
    → An action that mentions v1' but not v2' leaves v2 unconstrained.
    → TLC will explore states where v2 takes ANY value in its type domain.
    → This is the #1 source of spurious counterexamples.

Variable grouping pattern for specs with many variables:

\* Group related variables for manageable UNCHANGED clauses
nodeVars    == <<state, epoch>>
networkVars == <<msgs, delivered>>
configVars  == <<members, leader>>

vars == <<nodeVars, networkVars, configVars>>

\* Actions use group UNCHANGED
SendMessage(n) ==
    /\ ...
    /\ msgs' = msgs \cup {msg}
    /\ UNCHANGED <<nodeVars, configVars, delivered>>

1.3 TypeOK as First Invariant [T1, M9]

TypeOK constrains the domains of all variables. It is always the first invariant you write and the first invariant you check.

Why TypeOK matters:

Catches variable domain errors before expensive property checking
Acts as a sanity check that Init and Next are well-formed
Provides a foundation for TLAPS proofs (type correctness is a proof obligation)
Rapidly detects missing UNCHANGED clauses (unconstrained variables violate TypeOK)

TypeOK construction rules:

TYPEOK CONSTRUCTION
───────────────────
[ ] Every VARIABLE appears in TypeOK

[ ] TypeOK constrains variables to finite domains for TLC
    → "x \in Nat" is valid TLA+ but NOT model-checkable
    → Use "x \in 0..MaxBound" for model checking

[ ] TypeOK is checked BEFORE other invariants
    → In the TLC config: INVARIANT TypeOK, SafetyProp1, SafetyProp2
    → TypeOK violations are always the first thing to debug

[ ] Function-valued variables use domain constraints
    → state \in [Node -> StatusSet] not just "state is a function"
    → Check both the domain AND the range

TypeOK patterns by variable type:

Phase 2: Abstraction and Modeling

2.1 Seven-Step Abstraction Framework [T2, P3.1, M8]

Follow these steps in order when starting a new specification.

Step 1: State the property first

Before modeling any behavior, write down the property you want to verify in plain English. This determines what MUST be in the specification and what can be abstracted away.

Example:
  Property: "No two nodes hold the lock simultaneously"
  → Need: lock state per node
  → Don't need: network packet format, serialization, retry timers

Step 2: Identify minimal state

List the minimum set of variables needed to express the property from Step 1. Every variable must appear in at least one property or be necessary for computing a variable that does.

Minimal state test: For each variable, ask:
  "If I remove this variable, can I still express all my properties?"
  If yes → remove it.

Step 3: Model the environment as nondeterminism

Do not model external systems (clients, networks, OS) in detail. Replace them with nondeterministic choices that capture all possible behaviors.

\* BAD: modeling a specific client
ClientSend == /\ clientState = "ready"
              /\ clientState' = "waiting"
              /\ ...

\* GOOD: nondeterministic environment
ReceiveRequest(n) == \* A request can arrive at any node, any time
    /\ state[n] = "idle"
    /\ state' = [state EXCEPT ![n] = "processing"]
    /\ UNCHANGED <<...>>

Step 4: Choose communication model

[T2, P3.4, M14]: For most distributed protocol specifications, start with a set model. Upgrade to bag/sequence only if a property depends on message ordering or duplicate counting.

Step 5: Separate WHAT from HOW

Specify the intended behavior, not the implementation mechanism.

\* BAD: specifying HOW (implementation detail)
AcquireLock(n) ==
    /\ Send([type |-> "lock_req", from |-> n])
    /\ WaitForReply(n)
    /\ ProcessReply(n)

\* GOOD: specifying WHAT (observable behavior)
AcquireLock(n) ==
    /\ lock = None
    /\ lock' = n
    /\ UNCHANGED <<...>>

Step 6: Diagnose over/under-specification

Step 7: Refine incrementally

2.2 Ten Distributed Systems Patterns [T2, P3.4, M14]

Common modeling patterns for distributed protocols. Each pattern includes the TLA+ idiom and notes on when to use it.

Pattern 1: Quorum

\* Abstract: CONSTANT set of quorums
CONSTANT Quorum
ASSUME QuorumAssumption ==
    /\ \A Q \in Quorum : Q \subseteq Node
    /\ \A Q1, Q2 \in Quorum : Q1 \cap Q2 /= {}

\* OR computed majority (simpler for symmetric systems)
Quorum == {S \in SUBSET Node : Cardinality(S) * 2 > Cardinality(Node)}

For model checking, the computed form is often better (no need to enumerate quorums in the config). For proof, the abstract CONSTANT form is more general.

Pattern 2: Epoch / Term (Monotonic Counter)

\* Per-node monotonic epoch
VARIABLE currentEpoch  \* currentEpoch \in [Node -> Nat]

\* Epoch advances are strictly increasing per node
AdvanceEpoch(n) ==
    /\ currentEpoch' = [currentEpoch EXCEPT ![n] = currentEpoch[n] + 1]
    /\ ...

\* Stale-epoch rejection
ProcessMessage(n, msg) ==
    /\ msg.epoch >= currentEpoch[n]  \* reject stale
    /\ ...

Pattern 3: Leader Election (Nondeterministic)

\* Abstract leader election — nondeterministic candidacy
BecomeLeader(n) ==
    /\ \A m \in Node : leader[m] /= n  \* not already leader
    /\ leader' = [leader EXCEPT ![n] = n]
    /\ currentEpoch' = [currentEpoch EXCEPT ![n] = currentEpoch[n] + 1]
    /\ UNCHANGED <<...>>

Do NOT model the election protocol (Raft, ZAB) unless you are verifying the election protocol itself. Abstract it as nondeterministic leader selection with appropriate constraints.

Pattern 4: Timeout (Always-Enabled Action, NO Clock)

\* BAD: modeling a clock for timeouts
Timeout(n) ==
    /\ clock[n] >= deadline[n]  \* NO — do not model clocks
    /\ ...

\* GOOD: timeout as an always-enabled action
Timeout(n) ==
    /\ state[n] = "waiting"
    /\ state' = [state EXCEPT ![n] = "timed_out"]
    /\ UNCHANGED <<...>>
\* Timeout can fire at ANY time a node is waiting — this captures
\* all possible timeout behaviors without modeling time.

[T1]: Never model real-time clocks in a TLA+ specification unless you are specifically verifying real-time properties. Timeouts are modeled as always-enabled actions that capture the nondeterminism of when they fire.

Pattern 5: Log Replication (Sequence with Truncate-Overwrite)

\* Log as sequence with append and truncate-overwrite
VARIABLE log  \* log \in [Node -> Seq(Entry)]

AppendEntry(n, entry) ==
    /\ log' = [log EXCEPT ![n] = Append(log[n], entry)]
    /\ UNCHANGED <<...>>

TruncateAndOverwrite(n, idx, entries) ==
    /\ log' = [log EXCEPT ![n] = SubSeq(log[n], 1, idx) \o entries]
    /\ UNCHANGED <<...>>

Pattern 6: Membership Change (Config as Log Entry)

\* Joint consensus: configuration transitions via log
VARIABLE config  \* config \in [Node -> SUBSET Node]

\* Configuration change is a special log entry
ProposeConfigChange(leader, newMembers) ==
    /\ AppendEntry(leader, [type |-> "config", members |-> newMembers])
    /\ config' = [config EXCEPT ![leader] = config[leader] \cup newMembers]
    /\ UNCHANGED <<...>>

Pattern 7: Idempotency (Accumulating Set)

\* Track processed operation IDs to ensure idempotency
VARIABLE processed  \* processed \in [Node -> SUBSET OpId]

ProcessOp(n, op) ==
    /\ op.id \notin processed[n]  \* skip if already processed
    /\ \* ... perform operation ...
    /\ processed' = [processed EXCEPT ![n] = processed[n] \cup {op.id}]
    /\ UNCHANGED <<...>>

Pattern 8: Two-Phase Protocol (Prepare-Collect-Decide)

\* Two-phase pattern: prepare → collect votes → decide
VARIABLE phase    \* [Node -> {"idle", "prepare", "decide"}]
VARIABLE votes    \* [Node -> SUBSET Node]

Prepare(n) ==
    /\ phase[n] = "idle"
    /\ phase' = [phase EXCEPT ![n] = "prepare"]
    /\ votes' = [votes EXCEPT ![n] = {}]
    /\ \* send prepare messages
    /\ UNCHANGED <<...>>

ReceiveVote(n, voter) ==
    /\ phase[n] = "prepare"
    /\ votes' = [votes EXCEPT ![n] = votes[n] \cup {voter}]
    /\ UNCHANGED <<...>>

Decide(n) ==
    /\ phase[n] = "prepare"
    /\ votes[n] \in Quorum  \* collected a quorum
    /\ phase' = [phase EXCEPT ![n] = "decide"]
    /\ UNCHANGED <<...>>

Pair with a refinement mapping to show this implements a simpler atomic specification. [T2]

Pattern 9: Failure and Restart (Durable vs Volatile State)

\* Failure: volatile state lost, durable state preserved
VARIABLE volatileState  \* lost on restart
VARIABLE durableState   \* preserved on restart

Crash(n) ==
    /\ volatileState' = [volatileState EXCEPT ![n] = InitVolatile]
    /\ UNCHANGED <<durableState, ...>>
    \* Durable state is UNCHANGED — this is the key invariant

Restart(n) ==
    /\ \* recover from durableState[n]
    /\ volatileState' = [volatileState EXCEPT
                           ![n] = RecoverFrom(durableState[n])]
    /\ UNCHANGED <<durableState, ...>>

Pattern 10: Ghost Variables (Append-Only History)

\* Ghost variable: records history for property checking
\* Does NOT affect system behavior — only used in invariants/properties
VARIABLE history  \* history \in Seq(Event)

\* Append to history in every action that is relevant
ActionWithHistory(n) ==
    /\ \* ... normal action logic ...
    /\ history' = Append(history, [type |-> "action", node |-> n, ...])
    /\ UNCHANGED <<...>>

\* Use ghost variable in properties
NoDoubleCommit ==
    \A i, j \in 1..Len(history) :
        /\ history[i].type = "commit"
        /\ history[j].type = "commit"
        /\ history[i].key = history[j].key
        => history[i].value = history[j].value

Ghost variables are specification-only artifacts. They are append-only (history never shrinks) and used solely to express properties that reference past states. Mark them clearly in comments.

Phase 3: Safety Verification

After establishing the specification structure (Phase 1) and modeling decisions (Phase 2), verify safety properties.

3.1 Safety Specification Checklist [T1, M9, M5]

SAFETY CHECKLIST
────────────────
[ ] TypeOK is defined and checked first
    → Every variable constrained to its domain [T1, M9]

[ ] All safety invariants expressed as state predicates
    → "P is always true" written as INVARIANT P in TLC config
    → For action properties, use [][ActionProp]_vars form

[ ] Deadlock-freedom checked between safety and liveness phases
    → TLC checks deadlock by default (all states have a Next step)
    → Intentional termination: add a Done action or use
      PROPERTY [][Next]_vars => <>[]done to distinguish
    → [P4.4.F6]: ALWAYS check deadlock-freedom after safety passes
      but BEFORE adding liveness. A deadlocked spec trivially
      satisfies some liveness properties (vacuous truth).

[ ] No CONSTRAINT in the TLC config for the final check
    → CONSTRAINT is acceptable during development to speed iteration
    → CONSTRAINT MUST be removed for the final safety verification
    → [T1, M11]: CONSTRAINT can hide states reachable via longer paths

[ ] Invariants are inductive (for TLAPS proofs)
    → An invariant I is inductive if: I /\ Next => I'
    → TypeOK is typically inductive; complex safety properties
      may need strengthening (adding auxiliary conjuncts)

3.2 Deadlock-Freedom Gate [T1, P4.4.F6]

This check is mandatory between the safety and liveness phases.

A specification deadlocks if some reachable state has no Next step enabled. TLC checks this by default. However:

If your protocol has intentional termination, you must distinguish it from unintentional deadlock:

\* Termination action — prevents false deadlock reports
Done ==
    /\ AllNodesFinished
    /\ UNCHANGED vars

Next == ActionA \/ ActionB \/ ... \/ Done

A deadlocked spec can vacuously satisfy liveness properties. If the system deadlocks in a state where the liveness property is already satisfied, TLC will report no error. This is not a real liveness guarantee.

DEADLOCK GATE
─────────────
[ ] TLC reports zero deadlock states
    OR
[ ] All "deadlock" states are intentional termination states
    AND they are documented in the spec
    AND a Done action is present

3.3 TLC Safety Configuration [T1, M11, M12]

Recommended TLC configuration for safety checking:

\* MyProtocol.cfg — safety checking configuration
SPECIFICATION Spec

\* Type invariant first, then safety properties
INVARIANT TypeOK
INVARIANT SafetyProperty1
INVARIANT SafetyProperty2

\* State constraint for development ONLY — remove for final check
\* CONSTRAINT StateConstraint

\* Symmetry sets — safe for safety properties [T1, M11]
SYMMETRY NodeSymmetry

\* Constants
CONSTANT Node = {n1, n2, n3}
CONSTANT MaxValue = 3

Phase 4: Liveness Verification

Liveness properties ("something good eventually happens") require fairness assumptions and careful TLC configuration. This is where most soundness errors occur.

4.1 Six-Phase Liveness Methodology [T1, M3, P3.2, P3.3]

Follow these phases strictly in order.

Phase L1: Write safety first (complete Phase 3)

Never add liveness to a specification that has not passed safety checking. Liveness debugging on an incorrect safety spec is wasted effort.

Phase L2: Identify liveness requirements

Write down what MUST eventually happen in plain English:

Examples:
  - "Every request eventually receives a response"
  - "Every node eventually learns of every update"
  - "The leader eventually commits every proposed entry"
  - "The system eventually reaches a stable configuration"

Express each as a temporal formula:

\* "Every request eventually receives a response"
Liveness1 == \A r \in requests : <>(responded[r])

\* "The leader eventually commits" (conditional on leadership)
Liveness2 == []<>(committed = proposed)

\* Response: "always eventually" (repeated liveness)
Liveness3 == []<>(\E n \in Node : state[n] = "active")

Phase L3: Select fairness — WF first, SF only when needed [T2, M4, P3.3]

Use the fairness selection decision table:

WF vs SF decision procedure:

Start with WF_vars(A) for every system action A
Check if TLC finds a liveness counterexample
If the counterexample shows Action A repeatedly enabled then disabled without executing, upgrade to SF_vars(A) for that specific action
Never start with SF — it is a stronger assumption and may hide real liveness bugs

[T2, M4]: Lamport's convention: "Weak fairness is almost always what you want." Strong fairness is needed only for actions that are intermittently disabled by other actions (e.g., a receive action that is enabled only when a message is in transit, and messages can be repeatedly consumed and re-sent).

Phase L4: Verify machine closure (automatic in canonical form)

In the canonical form, machine closure is guaranteed [T1, M1]. You only need to worry about machine closure if you deviate from canonical form (e.g., adding raw temporal formulas to the spec).

Phase L5: Model-check liveness [T1, P3.2, M11, M12]

THIS IS THE CRITICAL STEP. TLC's liveness checking has soundness traps that do not exist in safety checking.

LIVENESS MODEL-CHECKING RULES
──────────────────────────────
[ ] REMOVE all CONSTRAINT directives from the TLC config
    → CONSTRAINT + liveness = UNSOUND [T1, M11]
    → CONSTRAINT can hide infinite behaviors that violate liveness
    → This is NON-NEGOTIABLE for liveness checking

[ ] REMOVE all SYMMETRY directives from the TLC config
    → SYMMETRY + liveness = DANGEROUS [T1, M11]
    → Symmetry reduction may cause TLC to miss liveness violations
    → It is sometimes safe but the conditions are subtle;
      removing it is always safe

[ ] Use -lncheck final for production liveness verification
    → -lncheck final performs the complete liveness check
    → Without -lncheck, TLC may use an incomplete algorithm

[ ] Liveness properties go in PROPERTY (not INVARIANT)
    → INVARIANT is for state predicates (safety)
    → PROPERTY is for temporal formulas (liveness)

Phase L6: Debug liveness failures [T1, P3.2]

Lasso structure:
  s0 → s1 → s2 → ... → sk → ... → sj
                               ↑        |
                               └────────┘
                             (back-loop: infinite repetition)

Debugging procedure:
1. Read the finite prefix: how does the system reach state sk?
2. Read the back-loop: what actions repeat forever?
3. Check: is there an action that SHOULD fire but doesn't?
   → If yes: add or strengthen fairness for that action
4. Check: is the counterexample realistic?
   → Unrealistic: the environment acts adversarially in a way your
     system need not tolerate. Consider if your liveness requirement
     is too strong.
   → Realistic: the protocol has a liveness bug. Fix the protocol.
5. Check: does the back-loop involve only stutter steps?
   → If yes: your specification may be missing an action.

4.2 Complete TLC Soundness Matrix [T1, M11, M12]

This is the complete map of TLC configuration options versus soundness for safety and liveness checking. Reference this table before every TLC run.

[T1, P3.2]: The safe liveness workflow: (1) pass all safety checks with CONSTRAINT/SYMMETRY, (2) remove CONSTRAINT and SYMMETRY, (3) add PROPERTY directives, (4) run with -lncheck final, (5) accept the longer runtime.

Phase 5: Model Checking Configuration

5.1 Small Model Heuristics [T2, M10, P4.4.F3]

AWS's experience (Newcombe et al., CACM 2015): 3-5 processes are sufficient for finding most design bugs. Start small, increase only when needed.

Caveat [P4.4.F3]: Small models miss:

Threshold-dependent bugs: Properties that only fail at specific quorum sizes (e.g., a bug that manifests only with exactly 2 of 5 nodes failing)

Symmetry-breaking bugs: Bugs that require specific asymmetric configurations

Scaling-related issues: Performance properties that degrade non-linearly

If your protocol has threshold-dependent behavior, test at least one configuration AT the threshold and one ABOVE it.

5.2 TLC Configuration Guide

Development configuration (fast iteration):

SPECIFICATION Spec
INVARIANT TypeOK
INVARIANT SafetyProperty
CONSTRAINT SmallStateConstraint   \* OK for safety development
SYMMETRY NodeSymmetry             \* OK for safety development
CONSTANT Node = {n1, n2, n3}     \* Start small

Final safety configuration (production):

SPECIFICATION Spec
INVARIANT TypeOK
INVARIANT SafetyProperty
\* NO CONSTRAINT — full state space
SYMMETRY NodeSymmetry             \* OK for safety (if correct)
CONSTANT Node = {n1, n2, n3}

Final liveness configuration (production):

SPECIFICATION Spec
INVARIANT TypeOK                  \* Keep safety checks active
PROPERTY LivenessProperty
\* NO CONSTRAINT — UNSOUND with liveness
\* NO SYMMETRY — DANGEROUS with liveness
CONSTANT Node = {n1, n2, n3}

TLC command-line flags:

5.3 Multi-Tool Awareness [T3, P4.3.F6, P4.4.F2]

TLC is not the only TLA+ verification tool. Be aware of alternatives and their strengths.

TLC-specific advice is marked throughout this skill. Where behavior differs in Apalache, notes are included. However, this skill primarily targets TLC workflows. For Apalache-specific guidance, consult the Apalache documentation.

Apalache type annotations: Apalache requires type annotations that TLC ignores. If you plan to use both tools, add Apalache type annotations as comments:
\* @type: Set(NODE) => Bool;
IsQuorum(S) == Cardinality(S) * 2 > Cardinality(Node)

Phase 6: PlusCal Considerations [T3, M16, P4.4.F1]

PlusCal is a higher-level algorithmic language that compiles to TLA+. It is a valid choice for certain specifications.

6.1 PlusCal Decision Table

[T3, P4.4.F1]: A finished PlusCal specification is better than an unfinished TLA+ specification. If PlusCal gets you to a checkable spec faster, use it. But know the ceiling: PlusCal cannot express all TLA+ patterns (e.g., arbitrary interleaving of non-process actions, complex fairness, stuttering-sensitive refinement).

When to switch from PlusCal to TLA+:

You need fairness on individual actions (not just processes)

You need to compose specifications via refinement

You need ghost variables that participate in the temporal formula

PlusCal's process model does not match your system's concurrency model

Quick Reference

Common mistakes catalog, specification template, decision tables, evidence traceability matrix, and references are in references/quick-reference.md.

Adoption

ahrav/tla-spec

$ install --global

Security Scan Results

SKILL.md

TLA+ Specification Correctness Guide

Evidence Base

Confidence Tiers

When to Use

When NOT to Use

Phase 1: Specification Foundations

1.1 Canonical Form [T1, M1, P4.1.F1]

1.2 Stuttering Invariance Checklist [T1, M2, M5]

1.3 TypeOK as First Invariant [T1, M9]

Phase 2: Abstraction and Modeling

2.1 Seven-Step Abstraction Framework [T2, P3.1, M8]

2.2 Ten Distributed Systems Patterns [T2, P3.4, M14]

Pattern 1: Quorum

Pattern 2: Epoch / Term (Monotonic Counter)

Pattern 3: Leader Election (Nondeterministic)

Pattern 4: Timeout (Always-Enabled Action, NO Clock)

Pattern 5: Log Replication (Sequence with Truncate-Overwrite)

Pattern 6: Membership Change (Config as Log Entry)

Pattern 7: Idempotency (Accumulating Set)

Pattern 8: Two-Phase Protocol (Prepare-Collect-Decide)

Pattern 9: Failure and Restart (Durable vs Volatile State)

Pattern 10: Ghost Variables (Append-Only History)

Phase 3: Safety Verification

3.1 Safety Specification Checklist [T1, M9, M5]

3.2 Deadlock-Freedom Gate [T1, P4.4.F6]

3.3 TLC Safety Configuration [T1, M11, M12]

Phase 4: Liveness Verification

4.1 Six-Phase Liveness Methodology [T1, M3, P3.2, P3.3]

4.2 Complete TLC Soundness Matrix [T1, M11, M12]

Phase 5: Model Checking Configuration

5.1 Small Model Heuristics [T2, M10, P4.4.F3]

5.2 TLC Configuration Guide

5.3 Multi-Tool Awareness [T3, P4.3.F6, P4.4.F2]

Phase 6: PlusCal Considerations [T3, M16, P4.4.F1]

6.1 PlusCal Decision Table

Quick Reference

Related Skills

ahrav/first-principles

ahrav/task-forge

ahrav/.claude/skills/autoresearch

ahrav/test-pipeline

ahrav/tla-spec

$ install --global

Security Scan Results

SKILL.md

TLA+ Specification Correctness Guide

Evidence Base

Confidence Tiers

When to Use

When NOT to Use

Phase 1: Specification Foundations

1.1 Canonical Form [T1, M1, P4.1.F1]

1.2 Stuttering Invariance Checklist [T1, M2, M5]

1.3 TypeOK as First Invariant [T1, M9]

Phase 2: Abstraction and Modeling

2.1 Seven-Step Abstraction Framework [T2, P3.1, M8]

2.2 Ten Distributed Systems Patterns [T2, P3.4, M14]

Pattern 1: Quorum

Pattern 2: Epoch / Term (Monotonic Counter)

Pattern 3: Leader Election (Nondeterministic)

Pattern 4: Timeout (Always-Enabled Action, NO Clock)

Pattern 5: Log Replication (Sequence with Truncate-Overwrite)

Pattern 6: Membership Change (Config as Log Entry)

Pattern 7: Idempotency (Accumulating Set)

Pattern 8: Two-Phase Protocol (Prepare-Collect-Decide)

Pattern 9: Failure and Restart (Durable vs Volatile State)

Pattern 10: Ghost Variables (Append-Only History)

Phase 3: Safety Verification

3.1 Safety Specification Checklist [T1, M9, M5]

3.2 Deadlock-Freedom Gate [T1, P4.4.F6]

3.3 TLC Safety Configuration [T1, M11, M12]

Phase 4: Liveness Verification

4.1 Six-Phase Liveness Methodology [T1, M3, P3.2, P3.3]

4.2 Complete TLC Soundness Matrix [T1, M11, M12]

Phase 5: Model Checking Configuration