codex/skills/deadlock-finder-and-fixer/SKILL.md
Find and fix concurrency bugs - deadlocks, races, livelocks, await-holding-lock, database locks, LD_PRELOAD init, swarm races. Use when processes hang, tests flake, or auditing concurrency.
Core Insight. Concurrency bugs do not come from one missing lock — they come from one lock acquired in the wrong place, at the wrong time, held across the wrong operation, by a thread that didn't know it was holding it. Find every instance of the hazard, not just the one that fired.
The Universal Rule. When you think you found the deadlock and fixed the three instances you could see, there is almost always a fourth. This is the single most common failure mode across every concurrency debugging session in this repo's history. Keep searching until you can prove exhaustively — by code audit — that no hazard remains. See THE FOURTH INSTANCE.
```bash
# 1. Is it CPU-alive or CPU-dead?
ps -Lp $PID -o tid,pcpu,pmem,comm --no-headers | head -20
# 2. Snapshot all thread states (pick ONE, in order of availability):
gdb --batch -ex "set pagination off" -ex "thread apply all bt full" -p $PID 2>&1 | tee /tmp/bt.txt
# OR (if ptrace blocked / LD_PRELOAD hazard):
strace -k -f -p $PID 2>&1 | head -200
# OR (sample /proc):
for i in 1 2 3; do cat /proc/$PID/task/*/stack 2>/dev/null | sort -u; sleep 1; done
# 3. Classify (pick the matching row from the Symptom Triage Table below).
# 4. Jump to the matching section in this skill or in gdb-for-debugging.
```
Diagnosis depth is in gdb-for-debugging — which already contains the Lock Graph Construction algorithm, mutex ownership inspector, async runtime analysis, and TSAN/rr workflow. This skill is the complement: it covers taxonomy, static-audit discovery, fix catalog, and prevention by design — the parts that don't need a running process.
| Observed Symptom | Likely Bug Class | Jump To |
|------------------|------------------|---------|
| Process 0% CPU, won't respond, threads in futex_wait / __lll_lock_wait | Classic deadlock (AB-BA or self) | Class 1 |
| Async tasks pending but all tokio workers in epoll_wait | Mutex held across .await or channel cycle | Class 2 |
| 100% CPU, futex spam, no progress | Livelock / retry storm / broken condvar | Class 3 |
| database is locked, SQLITE_BUSY, timeouts | SQLite WAL contention / long transaction / writer fight | Class 4 |
| Hang during library load, strlen or malloc call hangs | LD_PRELOAD / runtime-init reentrancy | Class 5 |
| Test flakes under load, passes under --test-threads=1 | Data race (TSAN) or TOCTOU | Class 6 |
| Agent swarm stalls; two agents editing same file | Advisory-lease race or missing reservation | Class 7 |
| tmux pane hung, mux unresponsive | External process holding a shared lock / fd | Class 7 |
| Task starvation: one worker CPU-pegged, others idle | Blocking call on async runtime thread | Class 2 |
| Poisoned std::sync::Mutex after panic | Cascading panic-in-critical-section | Class 8 |
| Lost updates, wrong counter values, weird retries | Lost wakeup / missed notification / incorrect memory ordering | Class 9 |
Class 1: Classic Deadlock

Definition: Two or more threads each hold a lock the other needs; circular wait in the lock-wait graph.
Canonical forms:
- A thread holds RwLock::read, then asks for RwLock::write in the same thread → guaranteed hang.
- … pthread_cond_wait on M; its waker needs to acquire M to signal but can't.

How to spot at rest (static audit): search for any function that acquires two distinct mutexes, then verify that every call path acquires them in the same order. Any deviation is a latent deadlock. See STATIC-AUDIT.md for ast-grep recipes.
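The same-order rule can be made structural instead of a convention to remember. A minimal sketch, assuming a hypothetical `Account` type whose `id` doubles as the global lock rank (`Account`, `transfer`, and `opposite_transfers` are illustrative names, not part of this skill):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Hypothetical resource: `id` doubles as the global lock rank.
pub struct Account {
    pub id: u32,
    pub balance: Mutex<i64>,
}

/// Moves `amount` between two accounts, always locking the lower `id` first,
/// so two opposite transfers can never form an AB-BA cycle.
pub fn transfer(from: &Account, to: &Account, amount: i64) {
    assert_ne!(from.id, to.id, "locking the same account twice would self-deadlock");
    let (first, second) = if from.id < to.id { (from, to) } else { (to, from) };
    let mut g1 = first.balance.lock().unwrap();
    let mut g2 = second.balance.lock().unwrap();
    // Map the rank-ordered guards back to their from/to roles.
    let (src, dst) = if from.id < to.id {
        (&mut *g1, &mut *g2)
    } else {
        (&mut *g2, &mut *g1)
    };
    *src -= amount;
    *dst += amount;
}

/// Runs opposite transfers concurrently and returns the final balances.
pub fn opposite_transfers() -> (i64, i64) {
    let a = Arc::new(Account { id: 1, balance: Mutex::new(1_000) });
    let b = Arc::new(Account { id: 2, balance: Mutex::new(1_000) });
    let (a1, b1) = (Arc::clone(&a), Arc::clone(&b));
    let (a2, b2) = (Arc::clone(&a), Arc::clone(&b));
    let t1 = thread::spawn(move || for _ in 0..1_000 { transfer(&a1, &b1, 1) });
    let t2 = thread::spawn(move || for _ in 0..1_000 { transfer(&b2, &a2, 1) });
    t1.join().unwrap();
    t2.join().unwrap();
    let x = *a.balance.lock().unwrap();
    let y = *b.balance.lock().unwrap();
    (x, y)
}
```

If `transfer` locked `from` before `to` unconditionally, the two threads in `opposite_transfers` could each take one lock and wait forever on the other. Ranking by `id` removes that interleaving.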
How to spot at runtime: see gdb-for-debugging §"Lock Graph Construction & Deadlock Proof". The algorithm: identify all threads in __lll_lock_wait, read the __owner field on each contested pthread_mutex_t to build the wait-for graph, find a cycle.
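The final step of that algorithm, finding the cycle, is small enough to sketch. This assumes the thread IDs and `__owner` values have already been copied out of gdb by hand into a map; the function name and the `u32` thread IDs are illustrative assumptions. Because a blocked thread waits on exactly one mutex, each thread has at most one outgoing edge, so following owners is enough:

```rust
use std::collections::{HashMap, HashSet};

/// `waits_for[t] = o` means thread `t` is blocked on a mutex whose `__owner` is `o`.
/// Returns the threads on one cycle of the wait-for graph, if any.
pub fn find_deadlock_cycle(waits_for: &HashMap<u32, u32>) -> Option<Vec<u32>> {
    let mut explored: HashSet<u32> = HashSet::new();
    let mut starts: Vec<u32> = waits_for.keys().copied().collect();
    starts.sort_unstable(); // deterministic output
    for start in starts {
        let mut path: Vec<u32> = Vec::new();
        let mut current = start;
        loop {
            if explored.contains(&current) {
                break; // joins a chain already known to end at a running thread
            }
            if let Some(pos) = path.iter().position(|&t| t == current) {
                return Some(path[pos..].to_vec()); // walked back onto our own path
            }
            path.push(current);
            match waits_for.get(&current) {
                Some(&owner) => current = owner,
                None => break, // owner is not blocked on anything: no cycle here
            }
        }
        explored.extend(path);
    }
    None
}
```

A returned cycle of length one is a self-deadlock: the thread is waiting on a mutex it already owns.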
Fix catalog:

- … parking_lot::deadlock detector can enforce this at runtime.

Class 2: Async / .await Deadlocks

Definition: The logical task graph has a cycle, or a task that holds a non-.await-aware lock yields to the runtime and is never re-polled because the next task needs the same lock.
Canonical forms:
- std::sync::Mutex held across .await. The guard crosses the yield point; the task is parked with the lock still held; another task needs the lock and blocks the worker thread.
- block_on inside an async runtime. The runtime thread enters a synchronous wait; the thing it's waiting for needs the runtime to make progress.
- spawn_blocking missing (or misused for sync I/O from async context).
- Task A .awaits B's handle; B .awaits A's.

Signature at rest: grep the codebase for let guard = lock.lock(); ... .await and for std::sync::Mutex inside async fn. Use the recipes in STATIC-AUDIT.md; this is the highest-ROI static check you can run on an async Rust codebase.
Signature at runtime: workers idle in epoll_wait, but requests pending. See gdb-for-debugging §"Diagnosing Async Deadlocks".
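The guard-across-`.await` hazard and its usual remedy can be sketched without any runtime dependency. Here `do_io`, `handler`, and the toy `drive` poller are hypothetical names added for illustration; a real project would run `handler` under its own runtime (tokio or similar), and `drive` exists only so the sketch is runnable with the standard library:

```rust
use std::future::Future;
use std::pin::pin;
use std::sync::{Arc, Mutex};
use std::task::{Context, Poll, Wake, Waker};

/// Hypothetical stand-in for real async I/O.
async fn do_io(data: Vec<u8>) -> usize {
    data.len()
}

/// Copies the data out inside a block so the guard is dropped before `.await`.
pub async fn handler(state: &Mutex<Vec<u8>>) -> usize {
    let data = {
        let guard = state.lock().unwrap();
        guard.clone()
    }; // guard dropped here, before the yield point
    do_io(data).await
}

struct NoopWake;

impl Wake for NoopWake {
    fn wake(self: Arc<Self>) {}
}

/// Toy single-future poller for this sketch only; call it from synchronous code.
pub fn drive<F: Future>(fut: F) -> F::Output {
    let waker = Waker::from(Arc::new(NoopWake));
    let mut cx = Context::from_waker(&waker);
    let mut fut = pin!(fut);
    loop {
        if let Poll::Ready(out) = fut.as_mut().poll(&mut cx) {
            return out;
        }
        std::thread::yield_now();
    }
}
```

The broken variant is the same function with `let guard = state.lock().unwrap();` at function scope: the guard then lives inside the future across the `.await`, which is exactly the pattern the static check looks for.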
Fix catalog:

- Drop the guard before .await. Explicitly: let data = { let g = lock.lock(); g.clone() }; do_io(data).await;
- Use tokio::sync::Mutex only when you must hold the lock across .await. It is slower; prefer dropping the guard.
- Use spawn_blocking for synchronous I/O from an async context (synchronous SQLite, std::fs::read, CPU-heavy work, C library calls).
- Actor pattern: requests via mpsc; replies via oneshot. No shared mutex, no lock-order bugs.
- … try_send + drop-oldest policy.
- Never call block_on inside an async context. If you must bridge, use Handle::current().spawn_blocking(...) or restructure to avoid the bridge.

Class 3: Livelock

Definition: Threads show visible activity (futex_wake + futex_wait, high CPU, log noise) but make no forward progress. Often mistaken for a deadlock.
Canonical forms:
- accept4 returns EAGAIN, immediately retried; no poll, no sleep.

Signature: 100% CPU, strace shows a tight loop of the same syscall, logs show retry messages stacked.
Fix catalog:
- parking_lot locks are unfair by default; switch to fair unlocking (MutexGuard::unlock_fair) or FairMutex if starvation is observed.

Class 4: Database Locks (SQLite)

The recurring pain points across our projects:

- SQLITE_BUSY / "database is locked". Multiple connections want the write lock simultaneously. The loser fails.
- rusqlite::Connection is synchronous; using it from an async handler without spawn_blocking blocks the runtime thread.
- No busy_timeout, no journal_mode=WAL, no synchronous=NORMAL. Every writer serializes with exclusive locks and no retry.
- BEGIN followed by a read followed by a write upgrades the lock; another writer that's already in a write transaction now deadlocks.

Fix catalog:
```sql
PRAGMA journal_mode = WAL;
PRAGMA synchronous = NORMAL;
PRAGMA busy_timeout = 5000; -- ms; SQLite will retry internally
PRAGMA foreign_keys = ON;
PRAGMA temp_store = MEMORY;
PRAGMA mmap_size = 268435456;
```
- Single writer: a Mutex<Connection> or a single actor task. Readers can use a pool.
- Use BEGIN IMMEDIATE for transactions that will write. It acquires the write lock up-front and prevents deferred-to-immediate upgrade deadlocks.
- Retry on SQLITE_BUSY with exponential backoff + jitter, on top of the internal busy_timeout.
- Run PRAGMA wal_checkpoint(TRUNCATE) on a schedule or after bulk writes so the WAL doesn't grow unbounded.
- Wrap synchronous database calls in spawn_blocking. Or use sqlx/tokio-rusqlite, which do it for you.

See DATABASE.md for the full WAL semantics reference, PRAGMA matrix, retry-with-backoff Rust template, and project-sourced incident reports.
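The retry-with-backoff idea is independent of the database driver, so it can be sketched with the standard library alone. This is not the template from DATABASE.md: `DbError`, `retry_busy`, the 5 ms base and the 500 ms cap are all assumptions chosen for illustration, and the clock-based jitter is a stand-in for a real random source:

```rust
use std::thread;
use std::time::{Duration, SystemTime, UNIX_EPOCH};

/// Hypothetical error type standing in for a driver's "busy" result.
#[derive(Debug, PartialEq)]
pub enum DbError {
    Busy,
    Other(String),
}

/// Cheap jitter source for the sketch: sub-second clock nanos, not a real RNG.
fn jitter_ms(max: u64) -> u64 {
    let nanos = SystemTime::now()
        .duration_since(UNIX_EPOCH)
        .map(|d| d.subsec_nanos() as u64)
        .unwrap_or(0);
    nanos % (max + 1)
}

/// Delay ceiling for a given attempt: base * 2^attempt, capped.
pub fn backoff_ceiling_ms(attempt: u32, base_ms: u64, cap_ms: u64) -> u64 {
    base_ms.saturating_mul(1u64 << attempt.min(20)).min(cap_ms)
}

/// Retries `op` on `Busy` with full jitter; any other result returns immediately.
pub fn retry_busy<T>(
    max_attempts: u32,
    mut op: impl FnMut() -> Result<T, DbError>,
) -> Result<T, DbError> {
    let mut attempt = 0;
    loop {
        match op() {
            Err(DbError::Busy) if attempt + 1 < max_attempts => {
                let ceiling = backoff_ceiling_ms(attempt, 5, 500);
                // Sleep a random amount in [0, ceiling] so colliding writers spread out.
                thread::sleep(Duration::from_millis(jitter_ms(ceiling)));
                attempt += 1;
            }
            other => return other,
        }
    }
}
```

The jitter matters as much as the backoff: without it, writers that collided once wake up together and collide again.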
Class 5: LD_PRELOAD / Runtime-Init Reentrancy

Definition: Code that runs during early process/library initialization acquires a lock, and something on the init path re-enters the same lock (or a lock held by the loader itself).
The canonical case from glibc_rust: libglibc_rs_abi.so exports strlen. When loaded via LD_PRELOAD, the dynamic loader calls strlen during symbol resolution. strlen calls into the membrane crate, which touches a OnceLock holding global policy. OnceLock::get_or_init takes a lock. The allocator inside get_or_init also goes through the same libc and re-enters the ABI. Reentrant lock on a non-recursive primitive → infinite hang.
The broader rule: Any function that may be called before main — or by a library interposition — cannot use OnceLock, std::sync::Mutex, lazy_static, RwLock, or the allocator. All of these can block.
Static-audit signature:
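```bash
ast-grep run -l Rust -p '$X::get_or_init($$$)'
# <preload_lib> is a placeholder; quoted so the shell does not treat it as a redirect
rg -n 'OnceLock|OnceCell|Lazy|lazy_static|thread_local!' 'crates/<preload_lib>/'
```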
Every hit is a potential hazard in an LD_PRELOAD context.
Fix catalog:
- Atomic state machine instead of OnceLock. Encode {UNINIT=0, INIT_IN_PROGRESS=1, INIT_DONE=2} in an AtomicU8; race losers spin-wait briefly (rare path) or fall back to a null-safe default.
- Compile-time initialization (const fn, static).
- … main).
- LD_PRELOAD the binary against a small program that calls every exported function; any hang means reentrant init.

See LD-PRELOAD.md for the full incident + fix narrative from glibc-rust/frankenlibc sessions.
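The three-state encoding can be sketched as follows. This is an illustration under stated assumptions, not the code from the incident: `POLICY`, `DEFAULT_POLICY`, `compute_policy`, and `policy` are hypothetical names, the policy is reduced to a single `usize` so no allocation is needed, and race losers take the fall-back-to-default branch rather than spin-waiting:

```rust
use std::sync::atomic::{AtomicU8, AtomicUsize, Ordering};

pub const UNINIT: u8 = 0;
pub const INIT_IN_PROGRESS: u8 = 1;
pub const INIT_DONE: u8 = 2;

pub static STATE: AtomicU8 = AtomicU8::new(UNINIT);
/// Hypothetical global policy value; a plain atomic, so no allocation and no lock.
static POLICY: AtomicUsize = AtomicUsize::new(0);

/// Null-safe default returned while init is still running (e.g. on reentry).
pub const DEFAULT_POLICY: usize = 0;

/// Hypothetical init routine; it must not allocate or take locks itself.
fn compute_policy() -> usize {
    42
}

/// Lock-free, allocation-free stand-in for `OnceLock::get_or_init`.
pub fn policy() -> usize {
    match STATE.compare_exchange(UNINIT, INIT_IN_PROGRESS, Ordering::Acquire, Ordering::Acquire) {
        Ok(_) => {
            // We won the race: initialize, then publish with Release.
            let value = compute_policy();
            POLICY.store(value, Ordering::Relaxed);
            STATE.store(INIT_DONE, Ordering::Release);
            value
        }
        // The Acquire failure load pairs with the Release store above.
        Err(INIT_DONE) => POLICY.load(Ordering::Relaxed),
        // Init in progress: another thread, or a reentrant call from our own
        // init path. Never block here; fall back to the default.
        Err(_) => DEFAULT_POLICY,
    }
}
```

A reentrant call that arrives while `compute_policy` is running sees `INIT_IN_PROGRESS` and returns the default instead of blocking, which is the property a lock-based primitive cannot offer on this path.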
Class 6: Data Races and TOCTOU

Definition: Unsynchronized concurrent access to the same memory where at least one of the accesses is a write. In Rust, C11+, and C++11+ this is undefined behavior; Java and Go give racy programs more defined but still surprising behavior.
TOCTOU (time-of-check-to-time-of-use): Check a condition, then act on it, assuming it's still true. It isn't.
Discovery: TSAN is ground truth. RUSTFLAGS="-Zsanitizer=thread" cargo +nightly build ... then run the test suite with high concurrency. For Go: go test -race. For C/C++: -fsanitize=thread.
Fix catalog:
- Wrap shared mutable state in Mutex / RwLock / Atomic. The compiler enforces this in Rust; listen to it.
- Use AtomicUsize with Ordering::Relaxed only if you've read the memory-ordering rules; otherwise SeqCst. Err on the side of stronger.
- For TOCTOU: compare_exchange, transactional updates, or hold the lock across check + action.

See gdb-for-debugging §"Race Condition Methodology" for the reproduce → detect → localize → fix → verify loop.
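The `compare_exchange` remedy for TOCTOU fuses the check and the action into one atomic step. A minimal sketch, with `try_claim` and `race_to_claim` as illustrative names:

```rust
use std::sync::atomic::{AtomicBool, AtomicUsize, Ordering};
use std::sync::Arc;
use std::thread;

// Racy shape, shown as a comment on purpose:
//     if !claimed.load(..) { claimed.store(true, ..); do_work(); }
// Two threads can both pass the check before either one stores.

/// Atomic check-and-act: exactly one caller observes the `false -> true` change.
pub fn try_claim(claimed: &AtomicBool) -> bool {
    claimed
        .compare_exchange(false, true, Ordering::AcqRel, Ordering::Acquire)
        .is_ok()
}

/// Spawns `n` threads racing to claim the same flag; returns how many won.
pub fn race_to_claim(n: usize) -> usize {
    let claimed = Arc::new(AtomicBool::new(false));
    let winners = Arc::new(AtomicUsize::new(0));
    let handles: Vec<_> = (0..n)
        .map(|_| {
            let (claimed, winners) = (Arc::clone(&claimed), Arc::clone(&winners));
            thread::spawn(move || {
                if try_claim(&claimed) {
                    winners.fetch_add(1, Ordering::Relaxed);
                }
            })
        })
        .collect();
    for h in handles {
        h.join().unwrap();
    }
    winners.load(Ordering::Relaxed)
}
```

However many threads race, at most one `compare_exchange` succeeds, so the "act" half runs once.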
Class 7: Multi-Process / Swarm Races

Definition: Multiple processes (or agents) contend for a shared resource (a file, a database, a tmux session, a git working tree) without in-process synchronization.
Our typical forms:
- … PRAGMA locking_mode=NORMAL (not EXCLUSIVE).

Fix catalog:

- Use file_reservation_paths with an appropriate TTL + a reason tying back to the bead/task. Release explicitly; don't rely on TTL.
- Use flock(2) for filesystem-only coordination. Advisory, cooperative. Every consumer must call it.
- wezterm-mux-server is sacred; protect it explicitly (see system-performance-remediation).

Class 8: Mutex Poisoning

Definition: A thread panics while holding a Mutex. Rust's std::sync::Mutex poisons the mutex; subsequent .lock() calls return Err(PoisonError). If the panic left shared state partially updated, every caller must now decide: trust or discard.
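The trust-or-discard decision looks like this in code. A minimal sketch using only the standard library; `poisoned_list`, `read_trusting`, and `reset` are hypothetical helpers, and the panic is deliberate (it prints a panic message to stderr when run):

```rust
use std::sync::{Arc, Mutex};
use std::thread;

/// Poisons a mutex by panicking while the guard is held, then returns it.
pub fn poisoned_list() -> Arc<Mutex<Vec<u32>>> {
    let shared = Arc::new(Mutex::new(vec![1, 2, 3]));
    let worker = Arc::clone(&shared);
    let result = thread::spawn(move || {
        let mut guard = worker.lock().unwrap();
        guard.push(4); // the partial update that survives the panic
        panic!("simulated failure inside the critical section");
    })
    .join();
    assert!(result.is_err());
    shared
}

/// "Trust" policy: take the data anyway, partial update included.
pub fn read_trusting(m: &Mutex<Vec<u32>>) -> Vec<u32> {
    m.lock().unwrap_or_else(|poison| poison.into_inner()).clone()
}

/// "Discard" policy: throw the suspect contents away and start from empty.
pub fn reset(m: &Mutex<Vec<u32>>) {
    let mut guard = m.lock().unwrap_or_else(|poison| poison.into_inner());
    guard.clear();
}
```

`PoisonError::into_inner` hands back the guard, so recovery is always possible; the open question is only whether the data behind it still satisfies its invariants.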
Fix catalog:
- parking_lot::Mutex does not poison. It's faster and simpler, but callers must handle partial state explicitly.

Class 9: Lost Wakeups and Memory Ordering

Definition: Correct locks, incorrect assumptions about visibility or ordering. The observed behavior seems to violate program order because it does, on the CPU's reordered view.
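Visibility-correct publication pairs a Release store with an Acquire load. A minimal sketch in safe Rust, assuming a hypothetical one-shot `Mailbox` whose payload fits in a `u64` (publishing a pointer follows the same pairing):

```rust
use std::sync::atomic::{AtomicBool, AtomicU64, Ordering};
use std::sync::Arc;
use std::thread;

/// Hypothetical one-shot mailbox: `payload` is written before `ready` is set.
pub struct Mailbox {
    payload: AtomicU64,
    ready: AtomicBool,
}

impl Mailbox {
    pub fn new() -> Self {
        Mailbox { payload: AtomicU64::new(0), ready: AtomicBool::new(false) }
    }

    /// Producer: write the data, then publish it with Release.
    pub fn publish(&self, value: u64) {
        self.payload.store(value, Ordering::Relaxed);
        self.ready.store(true, Ordering::Release);
    }

    /// Consumer: the Acquire load pairs with the Release store, so once
    /// `ready` reads true the payload store is guaranteed to be visible.
    pub fn try_take(&self) -> Option<u64> {
        if self.ready.load(Ordering::Acquire) {
            Some(self.payload.load(Ordering::Relaxed))
        } else {
            None
        }
    }
}

/// Publishes from one thread and polls from another.
pub fn publish_and_observe(value: u64) -> u64 {
    let mailbox = Arc::new(Mailbox::new());
    let producer = {
        let mailbox = Arc::clone(&mailbox);
        thread::spawn(move || mailbox.publish(value))
    };
    let seen = loop {
        if let Some(v) = mailbox.try_take() {
            break v;
        }
        thread::yield_now();
    };
    producer.join().unwrap();
    seen
}
```

With `Relaxed` on the `ready` flag in both places, nothing orders the payload store before the flag store from the consumer's point of view, and a reader on a weakly ordered CPU may see `ready == true` with a stale payload.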
Canonical forms:
- Notify::notify_waiters before notified().await: the notification is dropped. (Notify::notify_one stores a single permit instead, so repeated calls coalesce into one.)
- Ordering::Relaxed on a pointer publication: the reader sees a garbage object because the initializer store hasn't become visible.

Fix catalog:

- Wait on a condvar in a loop: while !ready { cv.wait(lock) }. Never if.
- Use Notify with notified() set up before the event can happen (see Tokio Notify docs; the notified() future must be polled at least once to subscribe).
- Use Ordering::Release for the producer store and Ordering::Acquire for the consumer load when publishing a pointer / building an atomic state machine. Never Relaxed for data publication.

When a bug has been reported:
- Snapshot first: thread apply all bt full → file. Once the process dies, the evidence is gone.
- … .lock() is a smoke alarm, not a fire extinguisher.
- … --test-threads=N and loom (Rust) or go test -race. Fuzz the scheduler with rr record --chaos if you have it.

When doing a preemptive audit (no bug reported yet):

- Enable parking_lot deadlock detection in debug builds; run the test suite. Any detection is a proof of deadlock.
- Run loom (if Rust) on the core concurrency primitives of the project.
- Audit every unsafe impl Send/Sync. Each one is a hand-written promise the compiler couldn't check.

See STATIC-AUDIT.md for the full catalog. Highlights:
```bash
# Rust: guard held across await (manual inspection required)
rg -n --type rust -U 'let\s+\w+\s*=\s*.*\.(lock|read|write)\(\).*\n[^}]*\.await' .
# Rust: std::sync::Mutex inside async fn (smell)
ast-grep run -l Rust -p 'async fn $F($$$) { $$$ std::sync::Mutex $$$ }'
# Rust: block_on anywhere (double-check: may be inside a sync bridge)
rg -n --type rust 'block_on' .
# Rust: OnceLock / Lazy in LD_PRELOAD libs (Class 5); <preload> is a placeholder
rg -n --type rust 'OnceLock|OnceCell|Lazy::new|lazy_static!|thread_local!' 'crates/<preload>/'
# Two different lock orderings in the same code (Class 1): list acquisitions, then compare order by hand
rg -n --type rust 'let\s+\w+\s*=\s*self\.\w+\.lock\(\)' . | sort -u
# SQLite: missing busy_timeout (Class 4)
rg -n 'Connection::open|open_in_memory' . | rg -v 'busy_timeout'
# Rust: unbounded channel (Class 2 back-pressure risk)
rg -n 'unbounded_channel|mpsc::unbounded' --type rust .
# Missing fairness on rwlock (Class 3)
rg -n 'RwLock::new' --type rust . # review each for writer-starvation risk
```
See FIX-CATALOG.md. Summary:
| Broken Pattern | Replace With | Why |
|----------------|--------------|-----|
| OnceLock on LD_PRELOAD path | AtomicU8 state machine | No allocator, no reentrancy |
| std::sync::Mutex held across .await | Scoped guard dropped before .await | Task yield with lock is a bug |
| Deep call holding two locks | Total lock order + assertion | Eliminate cycle possibility |
| Retry-on-BUSY tight loop | Exponential backoff + jitter | Break livelock |
| Connection-per-request SQLite | Single writer, read pool | Prevent lock escalation storms |
| Shared Mutex<Vec<Work>> | mpsc::channel + actor | No lock for producers |
| lazy_static in LD_PRELOAD | const / compile-time init | No lock needed |
| std::sync::Mutex + panic risk | parking_lot::Mutex + transaction-style updates | No poisoning, clearer semantics |
| flock only in-process | flock + app-level lease + TTL | Multi-process coordination |
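The `mpsc::channel` + actor row can be sketched with threads and the standard library's channel. This is an illustration, not the FIX-CATALOG.md entry: `Request`, `spawn_actor`, and `run_producers` are hypothetical names, and a one-message `std::sync::mpsc` channel stands in for the oneshot reply channel an async runtime would provide:

```rust
use std::sync::mpsc;
use std::thread;

/// Hypothetical request type; a query carries its own reply channel.
pub enum Request {
    Push(u64),
    Sum(mpsc::Sender<u64>),
}

/// Spawns the single owner of the work list. Producers only hold a `Sender`.
pub fn spawn_actor() -> (mpsc::Sender<Request>, thread::JoinHandle<()>) {
    let (tx, rx) = mpsc::channel::<Request>();
    let handle = thread::spawn(move || {
        let mut work: Vec<u64> = Vec::new(); // owned by this thread only, no Mutex
        for req in rx {
            match req {
                Request::Push(item) => work.push(item),
                Request::Sum(reply) => {
                    // Ignore a requester that went away instead of panicking.
                    let _ = reply.send(work.iter().sum());
                }
            }
        }
        // The loop ends once every Sender has been dropped.
    });
    (tx, handle)
}

/// Pushes 1..=n from each of `producers` threads, then asks the actor for the total.
pub fn run_producers(producers: u64, n: u64) -> u64 {
    let (tx, actor) = spawn_actor();
    let workers: Vec<_> = (0..producers)
        .map(|_| {
            let tx = tx.clone();
            thread::spawn(move || {
                for i in 1..=n {
                    tx.send(Request::Push(i)).unwrap();
                }
            })
        })
        .collect();
    for w in workers {
        w.join().unwrap();
    }
    let (reply_tx, reply_rx) = mpsc::channel();
    tx.send(Request::Sum(reply_tx)).unwrap();
    let total = reply_rx.recv().unwrap();
    drop(tx); // last Sender gone: the actor loop exits
    actor.join().unwrap();
    total
}
```

Producers never touch a lock, so there is no lock order to get wrong; the only shared object is the channel.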
- Prefer try_lock_for(Duration) over .lock(); timeout(Duration, fut).await over bare .await. Every hang becomes a log line, not a stall.

Before you declare a concurrency fix done:

- … #[test] with --test-threads=N, or loom::model, or a stress harness with N=100× the old workload.
- loom::model passes for the critical primitive if Rust.

| Topic | Reference |
|-------|-----------|
| The Fourth Instance (find ALL hazards, not just one) | THE-FOURTH-INSTANCE.md |
| Static-audit recipes (ast-grep + ripgrep, all languages) | STATIC-AUDIT.md |
| Fix catalog (14+ canonical replacements) | FIX-CATALOG.md |
| Diagnosis techniques (pointers to gdb-for-debugging) | DIAGNOSIS.md |
| Anti-patterns (what NOT to do, all classes) | ANTI-PATTERNS.md |
| Incident narratives (8+ real project stories) | INCIDENTS.md |
| Validation tooling (TSAN, loom, miri, parking_lot, rr) | VALIDATION.md |

| Language | Reference |
|----------|-----------|
| Rust (asupersync), PRIMARY: Cx, Scope, obligations, lab/DPOR, structured concurrency | ASUPERSYNC.md |
| Rust (tokio/std ecosystem): tokio, parking_lot, crossbeam, rayon, dashmap, sqlx | RUST.md |
| Go: goroutines, channels, sync, context, errgroup, pprof, race detector | GO.md |
| Python: GIL, asyncio, threading, multiprocessing, trio/anyio, py-spy | PYTHON.md |
| TypeScript / Node.js: event loop, promises, worker_threads, React, Next.js, Prisma | TYPESCRIPT.md |

| Topic | Reference |
|-------|-----------|
| Database concurrency (SQLite WAL, PRAGMAs, retries) | DATABASE.md |
| LD_PRELOAD / reentrant init (glibc-rust incident) | LD-PRELOAD.md |
| Async / await (cross-language async patterns) | ASYNC.md |
| Multi-process / swarm (agent-mail, flock, leases) | SWARM.md |
| Distributed concurrency (Redlock, pg_advisory, etcd, CRDTs, saga, outbox) | DISTRIBUTED.md |
| Creative patterns (actor, STM, CSP, structured concurrency, single-writer, "do nothing") | CREATIVE-PATTERNS.md |
| Lock-free (CAS, ABA, epoch reclamation, seqlocks, flat combiner, HTM) | LOCK-FREE.md |
| Formal methods (loom, DPOR, TLA+, miri, linearizability, evidence ledgers) | FORMAL-METHODS.md |
| Resilience patterns (circuit breaker, bulkhead, singleflight, backpressure, hedge, quorum) | RESILIENCE-PATTERNS.md |
| Concurrency operators (composable diagnostic moves with triggers + failure modes + prompts) | CONCURRENCY-OPERATORS.md |
| C/C++ systems (pthread, memory model, signal safety, fork hazards, io_uring, epoll) | C-CPP.md |
| Database advanced (Postgres advisory, SKIP LOCKED, SSI, MVCC, Prisma/Drizzle, Redis) | DATABASE-ADVANCED.md |
| Cookbook index (dispatch by language, topic, or bug class) | COOKBOOK-INDEX.md |
| Cross-language matrix (primitive equivalents, same-bug-different-language, detection tools) | CROSS-LANGUAGE.md |

| Skill | Use When |
|-------|----------|
| /cs/gdb-for-debugging/ | Lock-graph construction, async runtime debugging, TSAN, rr |
| /cs/asupersync-mega-skill/ | Full asupersync runtime, migration, all reference files |
| /cs/agent-mail/ | Advisory file reservations, multi-agent coordination |
| /cs/system-performance-remediation/ | Process triage, kill hierarchy, mux protection |