Run Deterministic Simulation Tests

Execute simulation tests with progressive difficulty levels, inspired by TigerBeetle's VOPR three-level system. Manages seeds, selects workloads, injects faults, and reports invariant violations.

Evidence Base

| Source | Principle | |--------|-----------| | TigerBeetle VOPR | 3 levels: clean → faults → catastrophic; 3.3 sec sim = 39 min real-world | | FoundationDB (SIGMOD 2021) | Tens of thousands of sims nightly, seeded determinism | | sled (simulation.html) | Discrete-event simulation with priority queue | | Eaton (practical DST guide) | Seeds break on code changes — use continuous exploration | | CockroachDB (Jepsen lessons) | Workload design determines simulation effectiveness | | Yuan et al. (OSDI 2014) | Error handling paths are highest-value targets |

When to Use

Level 1 (Sunny Day): Before committing coordination changes (quick sanity check)
Level 2 (Stormy): Nightly CI or before merging coordination PRs
Level 3 (Radioactive): Before releases or after significant coordination changes
When a coordination bug is suspected and you want to explore the state space
When verifying a fix for a simulation-found bug

When NOT to Use

Code that doesn't touch coordination, gossip, or shard lifecycle
Detection engine changes (use /run-fuzz instead)
When simulation harness isn't implemented yet (use /sim-scaffold first)

Prerequisites

Before running simulations, verify:

SimContext exists in gossip-coordination/src/sim/ (P1 prerequisite)
At least one simulation harness exists (generated by /sim-scaffold)
The test-support feature gate is configured
proptest is available as a dev-dependency

If prerequisites are missing, report what needs to be created and suggest running /sim-scaffold first.

Procedure

Step 1: Select Level

| Level | Name | Faults | Seeds | Context | |-------|------|--------|-------|---------| | 1 | Sunny Day | None — perfect conditions | 10 | Per-PR, pre-commit, quick sanity | | 2 | Stormy | Network partitions, lease expiry, message loss/reorder, clock skew | 100 | Nightly CI, pre-merge | | 3 | Radioactive | Storage corruption, cascading failures, Byzantine faults, all Level 2 combined | 1000+ | Pre-release, deep exploration |

If the user doesn't specify a level:

Default to Level 1 for quick checks
Recommend Level 2 if the change touches lease/fence/split logic
Recommend Level 3 if the change is a significant architectural modification

Step 2: Select Workload Templates

Each workload template exercises a specific coordination scenario. Select all that are relevant to the changed code, or run all for comprehensive coverage.

| Template | Exercises | Priority | |----------|-----------|----------| | Full scan lifecycle | Acquire → checkpoint x N → complete | ALWAYS | | Lease expiry during checkpoint | Acquire → delay → checkpoint with stale lease | HIGH — etcd Jepsen 3.4.3 pattern | | Split during active scan | Acquire → checkpoint → split → verify children | HIGH — shard coverage invariant | | Concurrent acquisition | 2+ workers acquire same shard | HIGH — fence monotonicity | | Ambiguous failure | Network drop after mutation succeeds | MED — Jepsen Redis-Raft pattern | | Park and unpark | Active → Parked → Active (reacquire) | MED — state machine completeness | | Mass restart | All workers restart simultaneously | MED — Serf/memberlist edge case | | Asymmetric partition | A→B works, B→A drops | LOW — Lifeguard false positives |

Step 3: Run Simulation

Execute tests with the selected level and seeds:

# Level 1 (Sunny Day) — quick sanity check
cargo test --features test-support -- --test-threads=1 sim::level1

# Level 2 (Stormy) — fault injection
SIM_SEEDS=100 SIM_LEVEL=2 cargo test --features test-support -- --test-threads=1 sim::level2

# Level 3 (Radioactive) — maximum chaos
SIM_SEEDS=1000 SIM_LEVEL=3 cargo test --features test-support -- --test-threads=1 sim::level3

# Reproduce a specific failure
SIM_SEED=12345 cargo test --features test-support -- --test-threads=1 sim::specific_test

Key flags:

--test-threads=1: Required for deterministic execution (no thread interleaving)
SIM_SEEDS: Number of random seeds to explore
SIM_LEVEL: Fault injection intensity
SIM_SEED: Specific seed for reproduction

Step 4: Analyze Results

On success (all seeds pass):

SIMULATION REPORT — Level {N}
═════════════════════════════
Result:           PASS
Seeds tested:     {count}
Wall time:        {duration}
Sim time:         {total simulated time}
Time compression: {sim_time / wall_time}x

Workload coverage:
  [x] Full scan lifecycle          — {N} occurrences
  [x] Lease expiry during checkpoint — {N} occurrences
  [x] Split during active scan     — {N} occurrences
  [x] Concurrent acquisition       — {N} occurrences
  [ ] Asymmetric partition          — 0 occurrences (not exercised)

Invariants verified:
  - Mutual exclusion (lease)       — checked {N} times
  - Fence monotonicity             — checked {N} times
  - Shard coverage (no gaps)       — checked {N} times
  - Terminal irreversibility       — checked {N} times

On failure:

SIMULATION REPORT — Level {N}
═════════════════════════════
Result:           FAIL
Seeds tested:     {count} ({failures} failures)
Unique bugs:      {count}

Invariant violations:
  ┌─────────────────────────────┬───────┬─────────────────────┐
  │ Invariant                   │ Count │ Seeds               │
  ├─────────────────────────────┼───────┼─────────────────────┤
  │ Mutual exclusion (lease)    │ 3     │ 42, 1337, 99999     │
  │ Fence monotonicity          │ 1     │ 42                  │
  └─────────────────────────────┴───────┴─────────────────────┘

Reproduction commands:
  SIM_SEED=42 cargo test --features test-support -- sim::level2::lease_expiry_during_checkpoint
  SIM_SEED=1337 cargo test --features test-support -- sim::level2::concurrent_acquisition

Failure trace (seed=42):
  t=0    Worker-1: acquire_shard(shard-A) → Ok(epoch=1)
  t=5    Worker-2: acquire_shard(shard-A) → Ok(epoch=2)
  t=7    FAULT: pause Worker-2 for 100 ticks
  t=10   Worker-1: checkpoint(shard-A, epoch=1) → Ok  ← VIOLATION: stale epoch accepted
  t=107  Worker-2: resume
  ...

Step 5: Handle Failures

When a simulation failure is found:

Minimize: Use proptest shrinking to find the minimal reproduction case
Classify: Is this a real bug or a simulation artifact?
Root cause: Trace the failure to specific code
Fix: Apply the fix
Verify: Re-run with the failing seed AND fresh seeds
Do NOT save the specific seed as a regression test — seeds break on code changes (Eaton, Antithesis). Instead, ensure the invariant check catches the class of bug under continuous exploration.

Step 6: Coverage Assessment

After the run, assess workload coverage:

Well-covered: Workload generated sufficient occurrences across seeds
Under-covered: Workload generated fewer than 5 occurrences across all seeds → May need to tune workload generation probabilities
Not exercised: Workload never triggered → May need explicit workload scenario or higher seed count

If critical workloads are under-covered, recommend:

Increasing seed count
Adding targeted workload scenarios
Adjusting fault injection probabilities

Fault Injection Reference

Level 2 (Stormy) Faults

| Fault | Description | Implementation | |-------|-------------|----------------| | Network partition | Messages between node subsets are dropped | NetworkFaults.partitions | | Message loss | Random message drops (10-30%) | NetworkFaults.drop_rate | | Message reorder | Messages arrive out of order | Random delay in event queue | | Lease expiry | Force lease timeout mid-operation | Advance clock past lease deadline | | Process pause | Freeze a worker for N ticks | Skip node in simulation loop | | Clock skew | Nodes disagree on current time | Per-node clock offset |

Level 3 (Radioactive) Faults

All Level 2 faults plus:

| Fault | Description | Implementation | |-------|-------------|----------------| | Storage corruption | Backend returns garbage data | Mock backend returns corrupted state | | Cascading failure | Failure of one node triggers failures in others | Fault propagation rules | | Byzantine | Node sends conflicting messages to different peers | Duplicate + modify messages | | Split brain | Two partitions each believe they're the majority | Symmetric partition | | Repeated crash-restart | Node crashes and restarts multiple times | Reset node state periodically |

Invariant Catalog

These are the core invariants that every simulation run must verify. They map to the phase 2 spec invariant catalog.

| ID | Invariant | Check | |----|-----------|-------| | S1 | Mutual exclusion — at most one active lease per shard | Count active leases per shard ≤ 1 | | S2 | Fence monotonicity — epochs never decrease | Track max epoch per shard, assert non-decreasing | | S3 | Terminal irreversibility — Done/Failed shards never transition | Assert no transitions from terminal states | | S4 | Shard coverage — split children cover parent range exactly | Verify range algebra on split | | S5 | Idempotency — duplicate OpId returns same result | Replay operations, assert identical results | | S6 | Progress — under fair scheduling, work eventually completes | Assert completion within bounded time | | L1 | Lease exclusivity — stale-epoch checkpoints are rejected | Attempt checkpoint with old epoch, assert rejection | | L2 | Zombie rejection — restarted worker cannot use old lease | Kill and restart worker, assert old operations fail |

Seed Management Philosophy

Do NOT save specific seeds as regression tests.

Seeds break whenever code changes (Eaton survey, Antithesis blog). Instead:

Run many fresh seeds continuously (exploration, not regression)
When a bug is found, fix it and verify the fix with fresh seeds
Ensure the invariant assertion catches the class of bug, not just the specific sequence that triggered it
The invariant check IS the regression test — not the seed

This is counterintuitive but well-established: FoundationDB, TigerBeetle, and Antithesis all use continuous exploration over saved seeds.

Related Skills

/sim-review — Verify code is DST-compatible before running simulations
/sim-scaffold — Generate simulation harnesses
/jepsen-test — Complement DST with real-network cluster tests
/test-strategy — Choose between simulation and other test approaches

Run Deterministic Simulation Tests

Execute simulation tests with progressive difficulty levels, inspired by TigerBeetle's VOPR three-level system. Manages seeds, selects workloads, injects faults, and reports invariant violations.

Evidence Base

When to Use

Level 1 (Sunny Day): Before committing coordination changes (quick sanity check)
Level 2 (Stormy): Nightly CI or before merging coordination PRs
Level 3 (Radioactive): Before releases or after significant coordination changes
When a coordination bug is suspected and you want to explore the state space
When verifying a fix for a simulation-found bug

When NOT to Use

Code that doesn't touch coordination, gossip, or shard lifecycle
Detection engine changes (use /run-fuzz instead)
When simulation harness isn't implemented yet (use /sim-scaffold first)

Prerequisites

Before running simulations, verify:

SimContext exists in gossip-coordination/src/sim/ (P1 prerequisite)
At least one simulation harness exists (generated by /sim-scaffold)
The test-support feature gate is configured
proptest is available as a dev-dependency

If prerequisites are missing, report what needs to be created and suggest running /sim-scaffold first.

Procedure

Step 1: Select Level

If the user doesn't specify a level:

Default to Level 1 for quick checks
Recommend Level 2 if the change touches lease/fence/split logic
Recommend Level 3 if the change is a significant architectural modification

Step 2: Select Workload Templates

Each workload template exercises a specific coordination scenario. Select all that are relevant to the changed code, or run all for comprehensive coverage.

Step 3: Run Simulation

Execute tests with the selected level and seeds:

# Level 1 (Sunny Day) — quick sanity check
cargo test --features test-support -- --test-threads=1 sim::level1

# Level 2 (Stormy) — fault injection
SIM_SEEDS=100 SIM_LEVEL=2 cargo test --features test-support -- --test-threads=1 sim::level2

# Level 3 (Radioactive) — maximum chaos
SIM_SEEDS=1000 SIM_LEVEL=3 cargo test --features test-support -- --test-threads=1 sim::level3

# Reproduce a specific failure
SIM_SEED=12345 cargo test --features test-support -- --test-threads=1 sim::specific_test

Key flags:

--test-threads=1: Required for deterministic execution (no thread interleaving)
SIM_SEEDS: Number of random seeds to explore
SIM_LEVEL: Fault injection intensity
SIM_SEED: Specific seed for reproduction

Step 4: Analyze Results

On success (all seeds pass):

SIMULATION REPORT — Level {N}
═════════════════════════════
Result:           PASS
Seeds tested:     {count}
Wall time:        {duration}
Sim time:         {total simulated time}
Time compression: {sim_time / wall_time}x

Workload coverage:
  [x] Full scan lifecycle          — {N} occurrences
  [x] Lease expiry during checkpoint — {N} occurrences
  [x] Split during active scan     — {N} occurrences
  [x] Concurrent acquisition       — {N} occurrences
  [ ] Asymmetric partition          — 0 occurrences (not exercised)

Invariants verified:
  - Mutual exclusion (lease)       — checked {N} times
  - Fence monotonicity             — checked {N} times
  - Shard coverage (no gaps)       — checked {N} times
  - Terminal irreversibility       — checked {N} times

On failure:

SIMULATION REPORT — Level {N}
═════════════════════════════
Result:           FAIL
Seeds tested:     {count} ({failures} failures)
Unique bugs:      {count}

Invariant violations:
  ┌─────────────────────────────┬───────┬─────────────────────┐
  │ Invariant                   │ Count │ Seeds               │
  ├─────────────────────────────┼───────┼─────────────────────┤
  │ Mutual exclusion (lease)    │ 3     │ 42, 1337, 99999     │
  │ Fence monotonicity          │ 1     │ 42                  │
  └─────────────────────────────┴───────┴─────────────────────┘

Reproduction commands:
  SIM_SEED=42 cargo test --features test-support -- sim::level2::lease_expiry_during_checkpoint
  SIM_SEED=1337 cargo test --features test-support -- sim::level2::concurrent_acquisition

Failure trace (seed=42):
  t=0    Worker-1: acquire_shard(shard-A) → Ok(epoch=1)
  t=5    Worker-2: acquire_shard(shard-A) → Ok(epoch=2)
  t=7    FAULT: pause Worker-2 for 100 ticks
  t=10   Worker-1: checkpoint(shard-A, epoch=1) → Ok  ← VIOLATION: stale epoch accepted
  t=107  Worker-2: resume
  ...

Step 5: Handle Failures

When a simulation failure is found:

Minimize: Use proptest shrinking to find the minimal reproduction case
Classify: Is this a real bug or a simulation artifact?
Root cause: Trace the failure to specific code
Fix: Apply the fix
Verify: Re-run with the failing seed AND fresh seeds
Do NOT save the specific seed as a regression test — seeds break on code changes (Eaton, Antithesis). Instead, ensure the invariant check catches the class of bug under continuous exploration.

Step 6: Coverage Assessment

After the run, assess workload coverage:

Well-covered: Workload generated sufficient occurrences across seeds
Under-covered: Workload generated fewer than 5 occurrences across all seeds → May need to tune workload generation probabilities
Not exercised: Workload never triggered → May need explicit workload scenario or higher seed count

If critical workloads are under-covered, recommend:

Increasing seed count
Adding targeted workload scenarios
Adjusting fault injection probabilities

Fault Injection Reference

Level 2 (Stormy) Faults

Level 3 (Radioactive) Faults

All Level 2 faults plus:

Invariant Catalog

These are the core invariants that every simulation run must verify. They map to the phase 2 spec invariant catalog.

Seed Management Philosophy

Do NOT save specific seeds as regression tests.

Seeds break whenever code changes (Eaton survey, Antithesis blog). Instead:

Run many fresh seeds continuously (exploration, not regression)
When a bug is found, fix it and verify the fix with fresh seeds
Ensure the invariant assertion catches the class of bug, not just the specific sequence that triggered it
The invariant check IS the regression test — not the seed

This is counterintuitive but well-established: FoundationDB, TigerBeetle, and Antithesis all use continuous exploration over saved seeds.

Related Skills

/sim-review — Verify code is DST-compatible before running simulations
/sim-scaffold — Generate simulation harnesses
/jepsen-test — Complement DST with real-network cluster tests
/test-strategy — Choose between simulation and other test approaches

Adoption

ahrav/sim-run

$ install --global

Security Scan Results

SKILL.md

Run Deterministic Simulation Tests

Evidence Base

When to Use

When NOT to Use

Prerequisites

Procedure

Step 1: Select Level

Step 2: Select Workload Templates

Step 3: Run Simulation

Step 4: Analyze Results

Step 5: Handle Failures

Step 6: Coverage Assessment

Fault Injection Reference

Level 2 (Stormy) Faults

Level 3 (Radioactive) Faults

Invariant Catalog

Seed Management Philosophy

Related Skills

Related Skills

ahrav/first-principles

ahrav/task-forge

ahrav/.claude/skills/autoresearch

ahrav/test-pipeline

ahrav/sim-run

$ install --global

Security Scan Results

SKILL.md

Run Deterministic Simulation Tests

Evidence Base

When to Use

When NOT to Use

Prerequisites

Procedure

Step 1: Select Level

Step 2: Select Workload Templates

Step 3: Run Simulation

Step 4: Analyze Results

Step 5: Handle Failures

Step 6: Coverage Assessment

Fault Injection Reference

Level 2 (Stormy) Faults

Level 3 (Radioactive) Faults

Invariant Catalog

Seed Management Philosophy

Related Skills

Related Skills

ahrav/first-principles

ahrav/task-forge

ahrav/.claude/skills/autoresearch

ahrav/test-pipeline