Distributed Systems Auditor

Enforce evidence-backed distributed systems design and implementation for the gossip-rs project. Every coordination decision must trace to established academic literature, a battle-tested production system, or a formally verified specification. No novel protocols. No uncited coordination logic.

When to Use

Proposing or evaluating any distributed systems pattern (leases, fencing, sharding, consensus, failure detection, idempotency)
Writing or modifying code in gossip-coordination, or any code touching Boundary 2 (Coordination) contracts
Designing or modifying types/traits in gossip-contracts that relate to coordination, cursors, shards, epochs, or OpIds
Discussing failure modes, consistency guarantees, ordering, or CAP trade-offs
Adding or modifying any state machine that coordinates distributed workers
Implementing checkpointing, lease renewal, cursor advancement, or shard assignment logic

When NOT to Use

Pure detection engine work (regex, rule matching) with no coordination aspect
Connector I/O code that doesn't touch coordination primitives
Build tooling, CI, or documentation-only changes

Step 0: Load Evidence Base

On every invocation, read these files before doing anything else:

tmp/gossip-project-artifacts/distributed-secret-scanner-instructions.md — master instructions with evidence hierarchy, anti-patterns, citation table
tmp/gossip-project-artifacts/gossip-contracts-references.md — curated references organized by boundary

These are the primary evidence sources. Patterns found here are pre-approved.

If the current work touches specific boundary contracts, also read the relevant boundary draft files in tmp/gossip-project-artifacts/boundary_*.rs to check locked decisions and specified APIs.

Step 1: Detect Mode

Design Mode — you are in a brainstorming session, plan mode, writing a design doc, or discussing architecture without modifying a specific code file.

Implementation Mode — you are writing, reviewing, or modifying Rust code that embodies coordination patterns.

If both apply (designing while coding), use both workflows.

Design Mode Workflow

1. Check Locked Decisions

Verify the proposal does not contradict any locked architectural decision:

D2.1: Cursor is two-layer (last_key + token)
D2.2: ShardSpec uses half-open [start, end) ranges
D2.3: Cursor monotonicity is a hard safety invariant
D2.4: Cursor bounds checking is a hard safety invariant
D2.5: Checkpoints require a last_key (no regression to None)
D3.3: Lex-ordered byte strings for all key ranges
Decision A: No custom leader election — use consensus backends
Decision B: No gossip protocols — use leases + fencing

If a conflict is found, BLOCK and explain the contradiction.

2. Check Anti-Patterns

Verify against the 7 explicit prohibitions from Section 9 of the instructions:

Do not design novel consensus protocols
Do not assume clocks are synchronized
Do not conflate "no response" with "failure"
Do not use unbounded queues
Do not rely on wall-clock timeouts for correctness
Do not silently drop scan work
Do not assume network reliability

If any anti-pattern is detected, BLOCK with the specific prohibition number and reference.

3. Produce Decision Record

Every design decision must be presented in this format:

## Distributed Systems Decision: [topic]

### Locked Decisions Check
- [x] No conflict with D2.1–D2.5, D3.3, Decision A/B
  (or [ ] Conflicts with [decision] — resolution required)

### Anti-Pattern Check
- [x] No novel consensus protocols
- [x] No clock synchronization assumptions
- [x] No conflating "no response" with "failure"
- [x] No unbounded queues
- [x] No wall-clock timeouts for correctness
- [x] No silent work drops
- [x] No network reliability assumptions

### Decision Record

| Field | Detail |
|-------|--------|
| **Problem** | [what we're deciding] |
| **Approach** | [pattern/technique name] |
| **Provenance** | [Level N] — [citation with URL/paper] |
| **Fit** | [why this matches our constraints] |
| **Trade-offs** | [what we give up] |
| **Safety invariant** | [what must never happen] |
| **Liveness invariant** | [what must eventually happen] |
| **Verification** | [how we test invariants hold] |

### Alternatives Considered

| Alternative | Why Rejected | Reference |
|-------------|-------------|-----------|
| [option] | [reason] | [citation] |

### Open Questions
[anything unresolved, flagged for follow-up]

4. Citation Requirement (HARD GATE)

The Provenance field is mandatory. Use this evidence hierarchy — prefer higher levels:

| Level | Label | Example | |-------|-------|---------| | 5 | Formally verified | TLA+ specs, Coq/Isabelle proofs (IronFleet, Verdi) | | 4 | Peer-reviewed | SOSP, OSDI, NSDI, EuroSys, VLDB, SIGMOD papers | | 3 | Battle-tested | FoundationDB, CockroachDB, Kafka, etcd, TigerBeetle | | 2 | Industry whitepapers | AWS, Google, Meta engineering blogs with depth | | 1 | Textbooks | DDIA (Kleppmann), Attiya & Welch, Tanenbaum |

If no citation from levels 1–5 exists, the recommendation is NOVEL and must be explicitly flagged:

**Provenance**: Level 0 — NOVEL/UNJUSTIFIED
**WARNING**: No established reference found. This pattern requires:
  - Extra review from a second opinion
  - Property-based tests covering all invariants
  - Consider whether an existing proven pattern could substitute

5. Gap Analysis

If the pattern is not covered by the curated references, do targeted web research. Search for:

"{pattern name}" paper SOSP OSDI VLDB
"{pattern name}" production implementation
"{pattern name}" formal verification TLA+
"{pattern name}" failure mode post-mortem

When a valuable reference is found, recommend adding it to gossip-contracts-references.md:

### Suggested Reference Addition

Add to section: [boundary section name]

| Title | Authors / Source | Year | Why It Matters |
|-------|-----------------|------|----------------|
| [title](URL) | [authors] | [year] | [relevance to our project] |

Implementation Mode Workflow

1. Identify Coordination Patterns

Scan the code being written or modified for these patterns:

Lease logic — acquire, renew, expire, revoke
Epoch/fence checks — monotonic token validation, stale-write rejection
Cursor advancement — progress tracking, checkpoint writes
Shard boundary operations — range checks, split/merge, key encoding
Idempotency — OpId generation, duplicate detection, payload hash
State machine transitions — coordinator states, worker lifecycle
Failure detection — heartbeat, lease expiry, accrual detection
Retry/backpressure — exponential backoff, circuit breaking, queue bounds

2. Emit Inline Audit

For each coordination pattern identified, produce:

**Pattern: [name]** (file:line_start–line_end)
- **Invariant**: [the safety or liveness property this code upholds]
- **Reference**: [paper/system/post] — [URL or citation from references doc]
- **Violation scenario**: [concrete description of what breaks — e.g.,
  "zombie worker with expired lease commits stale results, overwriting
  valid progress from the new lease holder"]
- **Verification**: [specific test — e.g., "property test: for any sequence
  of checkpoint operations, cursor.last_key is monotonically non-decreasing"]

3. Monotonicity & Ordering Check

These are hard safety invariants (D2.3, D2.4). Explicitly verify:

### Monotonicity & Ordering Check
- [ ] Cursor advancement: `new_key >= old_key` enforced
- [ ] Epoch/fence tokens: monotonically increasing, no reuse
- [ ] Shard bounds: half-open [start, end), start < end
- [ ] Checkpoint regression: no regression of last_key to None (D2.5)

If any of these cannot be verified from the code, emit a BLOCK.

4. Flag Unproven Code

Any coordination logic that doesn't map to a known pattern with a citation gets flagged:

### Flags

| Severity | Location | Issue |
|----------|----------|-------|
| BLOCK | file:NN | No citation for [pattern] — NOVEL/UNJUSTIFIED |
| BLOCK | file:NN | Contradicts locked decision [D2.X] |
| BLOCK | file:NN | Anti-pattern violation: [prohibition #N] |
| WARN | file:NN | Evidence level 1-2 only — consider stronger reference |
| WARN | file:NN | Missing verification strategy for [invariant] |
| INFO | file:NN | Stronger test available: [suggestion] |

5. Missing Coverage Report

Identify coordination patterns in the code that lack corresponding tests or property-based verification:

### Missing Coverage
| Pattern | Invariant | Suggested Verification |
|---------|-----------|----------------------|
| [pattern] | [invariant] | [test type + description] |

Severity Levels

BLOCK: Must be resolved before proceeding. No citation for a coordination pattern, contradiction with a locked decision, or anti-pattern violation.
WARN: Should be addressed. Weak evidence level, missing verification strategy, or subtle failure mode risk.
INFO: Improvement opportunity. Stronger test available, additional reference found, or pattern could be tightened.

Quick Reference: Locked Decisions

These are non-negotiable. Code must conform:

| ID | Decision | Source | |----|----------|--------| | D2.1 | Cursor is two-layer: last_key (coordinator-visible) + token (connector-opaque) | Spanner query restart (2017) | | D2.2 | ShardSpec uses half-open [start, end) byte ranges | Spanner, Bigtable, FoundationDB | | D2.3 | Cursor monotonicity: coordinator rejects new_key < old_key | Hard safety invariant | | D2.4 | Cursor bounds: cursor must fall within shard's [start, end) | Hard safety invariant | | D2.5 | Checkpoints require last_key — no regression to None | Hard safety invariant | | D3.3 | Lex-ordered byte strings for all key ranges | Bigtable, FoundationDB, CockroachDB | | A | No custom leader election — use consensus backends (etcd/Raft, ZK/ZAB) | Ongaro & Ousterhout 2014 | | B | No gossip protocols — use leases + fencing | Gray & Cheriton 1989 |

Quick Reference: Anti-Patterns

| # | Prohibition | Reference | |---|-------------|-----------| | 1 | No novel consensus protocols | Use Raft via etcd or existing impl | | 2 | No clock synchronization assumptions | Use logical clocks or HLCs | | 3 | No conflating "no response" with "failure" | FLP result (Fischer, Lynch, Paterson 1985) | | 4 | No unbounded queues | Explicit backpressure required | | 5 | No wall-clock timeouts for correctness | Clocks for liveness only | | 6 | No silent work drops | Every scan unit reaches terminal state | | 7 | No network reliability assumptions | Bailis & Kingsbury 2014 |

Quick Reference: Key Citations

| Concept | Primary Reference | |---------|-------------------| | Leases | Gray & Cheriton, SOSP 1989 | | Fencing tokens | Kleppmann, "How to do distributed locking" (2016) | | Consensus (Raft) | Ongaro & Ousterhout, USENIX ATC 2014 | | Range sharding | Corbett et al. (Spanner), OSDI 2012 | | Cursor restart | Bacon et al. (Spanner), 2017 | | Idempotency keys | Stripe (Brandur Leach), 2017 | | Exactly-once | Akidau et al. (Dataflow Model), VLDB 2015 | | Linearizability | Herlihy & Wing, TOPLAS 1990 | | Distributed snapshots | Chandy & Lamport, TOCS 1985 | | Failure detection | Hayashibara et al. (Phi accrual), SRDS 2004 | | Work stealing | Blumofe & Leiserson, JACM 1999 | | FLP impossibility | Fischer, Lynch, Paterson, JACM 1985 | | Sagas | Garcia-Molina & Salem, SIGMOD 1987 | | WAL / ARIES | Mohan et al., TODS 1992 | | DST | FoundationDB (Apple, 2021) | | TLA+ in industry | Newcombe et al., CACM 2015 | | Causal ordering | Lamport, CACM 1978 | | HLC | Kulkarni et al., OPODIS 2014 |

Related Skills

/deep-research — use before this skill when entering entirely new territory. Deep-research gathers evidence; dist-sys-auditor enforces it.
/interface-design-review — complementary: checks API ergonomics; dist-sys-auditor checks distributed correctness. Run both on coordination APIs.
/design-tournament — use when multiple viable approaches surface. Feed tournament results into dist-sys-auditor for citation validation.
/test-strategy — dist-sys-auditor recommends verification strategies; test-strategy helps implement them.
/security-reviewer — covers memory safety; dist-sys-auditor covers coordination safety (zombie writes, split-brain, lost progress).
/doc-rigor — after dist-sys-auditor approves a pattern, doc-rigor ensures documentation captures why the pattern was chosen with its citation.

Distributed Systems Auditor

When to Use

Proposing or evaluating any distributed systems pattern (leases, fencing, sharding, consensus, failure detection, idempotency)
Writing or modifying code in gossip-coordination, or any code touching Boundary 2 (Coordination) contracts
Designing or modifying types/traits in gossip-contracts that relate to coordination, cursors, shards, epochs, or OpIds
Discussing failure modes, consistency guarantees, ordering, or CAP trade-offs
Adding or modifying any state machine that coordinates distributed workers
Implementing checkpointing, lease renewal, cursor advancement, or shard assignment logic

When NOT to Use

Pure detection engine work (regex, rule matching) with no coordination aspect
Connector I/O code that doesn't touch coordination primitives
Build tooling, CI, or documentation-only changes

Step 0: Load Evidence Base

On every invocation, read these files before doing anything else:

tmp/gossip-project-artifacts/distributed-secret-scanner-instructions.md — master instructions with evidence hierarchy, anti-patterns, citation table
tmp/gossip-project-artifacts/gossip-contracts-references.md — curated references organized by boundary

These are the primary evidence sources. Patterns found here are pre-approved.

If the current work touches specific boundary contracts, also read the relevant boundary draft files in tmp/gossip-project-artifacts/boundary_*.rs to check locked decisions and specified APIs.

Step 1: Detect Mode

Design Mode — you are in a brainstorming session, plan mode, writing a design doc, or discussing architecture without modifying a specific code file.

Implementation Mode — you are writing, reviewing, or modifying Rust code that embodies coordination patterns.

If both apply (designing while coding), use both workflows.

Design Mode Workflow

1. Check Locked Decisions

Verify the proposal does not contradict any locked architectural decision:

D2.1: Cursor is two-layer (last_key + token)
D2.2: ShardSpec uses half-open [start, end) ranges
D2.3: Cursor monotonicity is a hard safety invariant
D2.4: Cursor bounds checking is a hard safety invariant
D2.5: Checkpoints require a last_key (no regression to None)
D3.3: Lex-ordered byte strings for all key ranges
Decision A: No custom leader election — use consensus backends
Decision B: No gossip protocols — use leases + fencing

If a conflict is found, BLOCK and explain the contradiction.

2. Check Anti-Patterns

Verify against the 7 explicit prohibitions from Section 9 of the instructions:

Do not design novel consensus protocols
Do not assume clocks are synchronized
Do not conflate "no response" with "failure"
Do not use unbounded queues
Do not rely on wall-clock timeouts for correctness
Do not silently drop scan work
Do not assume network reliability

If any anti-pattern is detected, BLOCK with the specific prohibition number and reference.

3. Produce Decision Record

Every design decision must be presented in this format:

## Distributed Systems Decision: [topic]

### Locked Decisions Check
- [x] No conflict with D2.1–D2.5, D3.3, Decision A/B
  (or [ ] Conflicts with [decision] — resolution required)

### Anti-Pattern Check
- [x] No novel consensus protocols
- [x] No clock synchronization assumptions
- [x] No conflating "no response" with "failure"
- [x] No unbounded queues
- [x] No wall-clock timeouts for correctness
- [x] No silent work drops
- [x] No network reliability assumptions

### Decision Record

| Field | Detail |
|-------|--------|
| **Problem** | [what we're deciding] |
| **Approach** | [pattern/technique name] |
| **Provenance** | [Level N] — [citation with URL/paper] |
| **Fit** | [why this matches our constraints] |
| **Trade-offs** | [what we give up] |
| **Safety invariant** | [what must never happen] |
| **Liveness invariant** | [what must eventually happen] |
| **Verification** | [how we test invariants hold] |

### Alternatives Considered

| Alternative | Why Rejected | Reference |
|-------------|-------------|-----------|
| [option] | [reason] | [citation] |

### Open Questions
[anything unresolved, flagged for follow-up]

4. Citation Requirement (HARD GATE)

The Provenance field is mandatory. Use this evidence hierarchy — prefer higher levels:

If no citation from levels 1–5 exists, the recommendation is NOVEL and must be explicitly flagged:

**Provenance**: Level 0 — NOVEL/UNJUSTIFIED
**WARNING**: No established reference found. This pattern requires:
  - Extra review from a second opinion
  - Property-based tests covering all invariants
  - Consider whether an existing proven pattern could substitute

5. Gap Analysis

If the pattern is not covered by the curated references, do targeted web research. Search for:

"{pattern name}" paper SOSP OSDI VLDB
"{pattern name}" production implementation
"{pattern name}" formal verification TLA+
"{pattern name}" failure mode post-mortem

When a valuable reference is found, recommend adding it to gossip-contracts-references.md:

### Suggested Reference Addition

Add to section: [boundary section name]

| Title | Authors / Source | Year | Why It Matters |
|-------|-----------------|------|----------------|
| [title](URL) | [authors] | [year] | [relevance to our project] |

Implementation Mode Workflow

1. Identify Coordination Patterns

Scan the code being written or modified for these patterns:

Lease logic — acquire, renew, expire, revoke
Epoch/fence checks — monotonic token validation, stale-write rejection
Cursor advancement — progress tracking, checkpoint writes
Shard boundary operations — range checks, split/merge, key encoding
Idempotency — OpId generation, duplicate detection, payload hash
State machine transitions — coordinator states, worker lifecycle
Failure detection — heartbeat, lease expiry, accrual detection
Retry/backpressure — exponential backoff, circuit breaking, queue bounds

2. Emit Inline Audit

For each coordination pattern identified, produce:

**Pattern: [name]** (file:line_start–line_end)
- **Invariant**: [the safety or liveness property this code upholds]
- **Reference**: [paper/system/post] — [URL or citation from references doc]
- **Violation scenario**: [concrete description of what breaks — e.g.,
  "zombie worker with expired lease commits stale results, overwriting
  valid progress from the new lease holder"]
- **Verification**: [specific test — e.g., "property test: for any sequence
  of checkpoint operations, cursor.last_key is monotonically non-decreasing"]

3. Monotonicity & Ordering Check

These are hard safety invariants (D2.3, D2.4). Explicitly verify:

### Monotonicity & Ordering Check
- [ ] Cursor advancement: `new_key >= old_key` enforced
- [ ] Epoch/fence tokens: monotonically increasing, no reuse
- [ ] Shard bounds: half-open [start, end), start < end
- [ ] Checkpoint regression: no regression of last_key to None (D2.5)

If any of these cannot be verified from the code, emit a BLOCK.

4. Flag Unproven Code

Any coordination logic that doesn't map to a known pattern with a citation gets flagged:

### Flags

| Severity | Location | Issue |
|----------|----------|-------|
| BLOCK | file:NN | No citation for [pattern] — NOVEL/UNJUSTIFIED |
| BLOCK | file:NN | Contradicts locked decision [D2.X] |
| BLOCK | file:NN | Anti-pattern violation: [prohibition #N] |
| WARN | file:NN | Evidence level 1-2 only — consider stronger reference |
| WARN | file:NN | Missing verification strategy for [invariant] |
| INFO | file:NN | Stronger test available: [suggestion] |

5. Missing Coverage Report

Identify coordination patterns in the code that lack corresponding tests or property-based verification:

### Missing Coverage
| Pattern | Invariant | Suggested Verification |
|---------|-----------|----------------------|
| [pattern] | [invariant] | [test type + description] |

Severity Levels

BLOCK: Must be resolved before proceeding. No citation for a coordination pattern, contradiction with a locked decision, or anti-pattern violation.
WARN: Should be addressed. Weak evidence level, missing verification strategy, or subtle failure mode risk.
INFO: Improvement opportunity. Stronger test available, additional reference found, or pattern could be tightened.

Quick Reference: Locked Decisions

These are non-negotiable. Code must conform:

Quick Reference: Anti-Patterns

Quick Reference: Key Citations

Related Skills

/deep-research — use before this skill when entering entirely new territory. Deep-research gathers evidence; dist-sys-auditor enforces it.
/interface-design-review — complementary: checks API ergonomics; dist-sys-auditor checks distributed correctness. Run both on coordination APIs.
/design-tournament — use when multiple viable approaches surface. Feed tournament results into dist-sys-auditor for citation validation.
/test-strategy — dist-sys-auditor recommends verification strategies; test-strategy helps implement them.
/security-reviewer — covers memory safety; dist-sys-auditor covers coordination safety (zombie writes, split-brain, lost progress).
/doc-rigor — after dist-sys-auditor approves a pattern, doc-rigor ensures documentation captures why the pattern was chosen with its citation.

Adoption

ahrav/dist-sys-auditor

$ install --global

Security Scan Results

SKILL.md

Distributed Systems Auditor

When to Use

When NOT to Use

Step 0: Load Evidence Base

Step 1: Detect Mode

Design Mode Workflow

1. Check Locked Decisions

2. Check Anti-Patterns

3. Produce Decision Record

4. Citation Requirement (HARD GATE)

5. Gap Analysis

Implementation Mode Workflow

1. Identify Coordination Patterns

2. Emit Inline Audit

3. Monotonicity & Ordering Check

4. Flag Unproven Code

5. Missing Coverage Report

Severity Levels

Quick Reference: Locked Decisions

Quick Reference: Anti-Patterns

Quick Reference: Key Citations

Related Skills

Related Skills

ahrav/first-principles

ahrav/task-forge

ahrav/.claude/skills/autoresearch

ahrav/test-pipeline

ahrav/dist-sys-auditor

$ install --global

Security Scan Results

SKILL.md

Distributed Systems Auditor

When to Use

When NOT to Use

Step 0: Load Evidence Base

Step 1: Detect Mode

Design Mode Workflow

1. Check Locked Decisions

2. Check Anti-Patterns

3. Produce Decision Record

4. Citation Requirement (HARD GATE)

5. Gap Analysis

Implementation Mode Workflow

1. Identify Coordination Patterns

2. Emit Inline Audit

3. Monotonicity & Ordering Check

4. Flag Unproven Code

5. Missing Coverage Report

Severity Levels

Quick Reference: Locked Decisions

Quick Reference: Anti-Patterns

Quick Reference: Key Citations

Related Skills

Related Skills

ahrav/first-principles

ahrav/task-forge

ahrav/.claude/skills/autoresearch

ahrav/test-pipeline