
Use when a markdown plan file exists and needs validation before implementation — catches design flaws, logic holes, footguns, unnecessary complexity, and performance concerns while changes are still cheap
Use when writing or reviewing TLA+ specifications for coordination protocols, when verifying safety/liveness properties of distributed algorithms, or when TLC model checking fails and you need diagnostic guidance. Evidence-backed TLA+ correctness methodology.
Use when the user wants a minimal source-only archive for upload or checkpoint. Creates a tar.gz with just source and test files from all workspace crates.
Use when measuring optimization impact against a baseline, when validating that a code change didn't regress performance, or when comparing two implementation approaches. Criterion benchmark baseline comparison workflow.
Use when designing safety-critical code, distributed protocols, or novel algorithms where getting the design wrong is expensive. Parallel research agents survey papers, production systems, and prior art, then synthesize into an evidence-backed codebase plan.
Use when you have code review findings, PR comments, or review reports that need to be systematically addressed — especially when there are multiple findings across different files and severities
Use when the learning guide may be stale after codebase changes, when types have been renamed or modules deleted, or when verifying guide code examples still compile. Synchronizes gossip-rs-learning-guide with current codebase state.
Use when validating coordination correctness under real network partitions, when DST passes but you suspect distributed bugs, or before releasing coordination protocol changes. Runs Jepsen-style cluster tests via Maelstrom or full Jepsen.
Use when you need to classify why code is slow (front-end vs back-end vs speculation), when hunting branch misprediction sites, after /bench-compare or /perf-regression finds a regression needing root cause, or when building an isolated hot-loop harness. Cross-arch TMA and branch tracing.
Use when a task needs an implementation plan that is iteratively created and stress-tested through review-and-revise cycles before implementation begins — catches blind spots, incorrect codebase assumptions, unnecessary complexity, and performance pitfalls while changes are still cheap
Use when PR review comments claim a bug or incorrect behavior, when multiple reviewer comments need systematic triage, or when a correctness claim needs proof before changing code. Verify-first PR comment response with evidence-based fixes.
Use when adding or modifying rules in default_rules.yaml, when benchmarking rule performance against test corpuses, or when validating regex anchors and keyword choices. Detection rule edit-bench-compare workflow.
Use when the user wants to launch a runbook, run an audit, review, or analysis task via Jetty, or says "run the X runbook". Also triggers on "launch runbook", "execute runbook", "run dedup-audit", "run security-reviewer", or any request to run a task from runbooks/.
Use when sanity-checking coordination changes before commit, before merging coordination PRs, when a coordination bug is suspected, or when verifying a simulation-found fix. Progressive DST execution with seed management and fault injection.
Use when a test module has many similar unit tests, when repetitive assertions could be replaced by property-based or parameterized tests, or when test maintenance cost is high. Consolidates verbose suites into rstest, proptest, or fuzz tests.
Run Criterion benchmarks with baseline comparison for performance optimization work
Use when creating any beads task — auto-researches the codebase, links related tasks, and produces a rich self-contained description from a structured template. Accepts minimal intent and outputs a complete task ready for agent implementation.
Comprehensive 6-phase research funnel — 8-10 parallel survey agents sweep wide, a synthesizer compiles evidence, deep-dive and adversarial agents run in parallel to elaborate and challenge findings, a final synthesizer reconciles everything, and an integrator maps verified findings to a concrete codebase plan with full traceability
Run a parallel diverge-then-converge design tournament — 3-5 independent agents explore a problem, then 2 ranking agents evaluate and stack-rank the results with confidence scores
Review Rust interfaces for ease of correct use and resistance to misuse, applying "make interfaces easy to use correctly and hard to use incorrectly"
Run Jepsen-style cluster tests using Maelstrom (lightweight) or full Jepsen (heavyweight) — validates correctness of the deployed gossip-rs system with real network behavior, complementing in-process DST
Analyze Rust code for performance issues, allocation hot spots, and optimization opportunities
Respond to PR review comments by building the smallest proof that confirms or refutes the claim before changing code or docs — never blindly trust a reviewer
Use when a beads task exists and needs validation before implementation — verifies codebase references, identifies edge cases and design flaws, assesses scope and feasibility, splits oversized tasks, dispatches domain-specific skills (test-strategy, unsafe-review, dist-sys-auditor, simd-optimize, asm-forge, performance-analyzer, security-reviewer, interface-design-review, sim-review, safe-over-unsafe) for specialized enrichment, and dispatches /deep-research or /deeper-research for ambiguous areas. The complement of /create-task — ensures tasks are buttoned up and ready for mechanical implementation.
Run cargo-fuzz targets with proper nightly toolchain and options
Use when designing safe public APIs that wrap unsafe Rust code, adding unsafe blocks to existing types, reviewing unsafe code for soundness, or creating new types backed by raw pointers, MaybeUninit, or FFI
Simulation-testability code review — enforces DST-compatible patterns in coordination, gossip, and pipeline code based on FoundationDB, TigerBeetle, sled, and Firezone evidence
Scaffold simulation-testable modules with sans-IO pattern, proptest state machine tests, and fault injection points — prevents retrofitting costs by making code DST-ready from the start
Consolidate verbose test suites by replacing repetitive unit tests with property-based tests, parameterized tests (rstest), or fuzz tests. Less code to maintain, same or better coverage.
Assess and recommend the appropriate testing strategy for Rust code: unit tests, parameterized tests (rstest), property-based tests, fuzz tests, Kani model checking, or simulation testing
Comprehensive review of unsafe code — audits safety invariants, demands benchmark+ASM proof of performance benefit, and verifies Miri/Kani/fuzz/property test coverage for every unsafe block
Use when the user wants to package all source code into a tar.gz archive for upload or checkpoint. Creates a comprehensive archive of all workspace crates, docs, and config excluding binaries.
ASM-guided deep performance optimization. Collects assembly, audits codegen quality, applies targeted transforms, validates with benchmarks. Uses cargo-show-asm + Criterion as ground truth.
Use when flamegraph/perf profiling identified hot functions but you are unsure which are on the critical path, when optimizing a hot function yields no measurable improvement, when concurrent code has hidden contention or pipeline imbalance, or when you need to prioritize optimization effort across multiple hot spots. Linux-only, synchronous code paths only (not async/Tokio).
Use when design documents in docs/ may be stale after code changes, when verifying boundary specs match current types and APIs, when checking for missing documentation coverage of new crates or features, or before merging branches that touch documented subsystems.
Distributed systems design and implementation auditor — enforces evidence-backed coordination decisions, citation requirements, invariant tracking, and correctness verification against academic literature, battle-tested systems, and the project's locked architectural decisions
Write-then-verify documentation pipeline. Use when a user asks to improve comments or docs, explain algorithms or design choices, write or upgrade docstrings, or raise documentation quality for a codebase (especially Rust crates). Writes docs, then automatically verifies every claim against code reality using a fresh agent to eliminate confirmation bias.
Verify documentation accuracy against code reality and external claims — runs as a fresh agent after /doc-rigor to prevent confirmation bias
Review tests to ensure they actually prove the claimed invariant, especially state-machine, simulation, oracle, and regression tests where extra setup, missing negative paths, or order-sensitive comparisons can hide the real signal
Deep Linux perf profiling — PMU counters, topdown analysis, flamegraphs, and annotated hotspot drill-down on ARM/Graviton
Performance regression testing workflow for hot path changes
Use when optimizing Rust binary performance via profile-guided compilation and post-link layout — squeezing 10-30% from I-cache, branch prediction, and function placement without source changes
Parallel specialist code review — 6 focused agents (correctness, design, performance, safety, docs, complexity) diverge independently, then a single ranker merges findings into an importance-ranked report with confidence scores
Workflow for modifying and benchmarking detection rules
Audit memory safety and security in unsafe code blocks, buffer handling, and security-sensitive operations
Run deterministic simulation tests with progressive difficulty levels (sunny/stormy/radioactive) inspired by TigerBeetle VOPR — orchestrates seed management, workload selection, and invariant verification
Review and tune SQLite schemas, queries, indexes, and pragmas. Connects to the actual database to gather concrete evidence (EXPLAIN QUERY PLAN, page counts, table stats) before recommending changes.
TLA+ specification correctness guide — evidence-backed methodology for writing correct temporal logic specs, covering canonical form, abstraction selection, safety/liveness decomposition, fairness, TLC soundness, and distributed systems patterns, with every rule grounded in literature
Use when /performance-analyzer identifies a hot function, when /bench-compare shows regression and you need instruction-level analysis, or when you suspect bounds checks or register spills in a tight loop. ASM-guided optimization with cargo-show-asm + Criterion.
autoresearch (version 1.9.11): Autonomous goal-directed iteration. Apply Karpathy's autoresearch principles to ANY task. Loops autonomously: modify, verify, keep/discard, repeat. Supports bounded iteration via `Iterations: N` inline config. Inspired by Karpathy's autoresearch (https://github.com/karpathy/autoresearch); applies constraint-driven autonomous iteration to ANY work, not just ML research.
Use when adding or modifying coordination protocols, implementing consensus or gossip mechanisms, or changing distributed state management. Audits designs against academic literature and battle-tested systems with citation requirements.
Use when /doc-rigor has written or updated documentation and you need independent accuracy verification, or when existing docs may contain stale API claims or wrong command examples. Fresh-agent verification against code reality.
Use when writing or updating documentation that makes API claims, includes command examples, or states platform-specific behavior. Write-then-verify pipeline where a fresh agent checks accuracy against code reality with zero confirmation bias.
Deep first-principles code explanation that builds real understanding through phased walkthroughs with diagrams. Covers algorithms, data structures, memory layout, concurrency patterns, and performance tricks — especially for systems code in Rust. Use whenever the user asks to explain, walk through, break down, deep dive into, or understand code. Trigger on "how does this work", "what's happening here", "teach me about this", "why is it done this way", or when the user references a file with @ and wants to understand it. Proactively use when examining code involving lock-free algorithms, atomics/CAS, memory ordering,
Use when profiling on Linux/ARM/Graviton targets, when you need PMU counter data beyond what flamegraphs show, or when /perf-topdown identifies a bottleneck class that needs source-level drill-down. Deep perf profiling with annotated hotspot analysis.
Use when /bench-compare or /perf-regression identifies a regression needing root cause, when multiple performance dimensions need simultaneous triage, or when optimization work should be dispatched automatically. Two-phase diagnose-then-optimize pipeline.
Use when modifying hot-path code in coordination or scanner engine, before merging performance-sensitive changes, or when CI benchmarks flag a regression. Performance regression testing with before/after comparison.
Use when designing or auditing PostgreSQL schemas, reviewing migrations for lock safety, investigating query performance, or optimizing indexes and partitioning. Connects to the database for concrete evidence via EXPLAIN ANALYZE and pg_stat_* views.
Use when you want review AND automated fixes in one pass, when /review-dispatch alone would leave findings unaddressed, or before merging a feature branch that needs thorough diagnosis and remediation. Two-phase diagnose-then-fix pipeline.
Use when testing gossip-contracts or gossip-stdx data structures for crashes, when verifying new Arbitrary impls, or when reproducing a fuzz crash artifact. Runs cargo-fuzz targets with nightly toolchain.
Use when modifying unsafe blocks, adding parsing or decoding logic, changing buffer pool or scratch internals, or before merging changes to data structure implementations with raw pointers. Memory safety and security audit.
Use when designing or auditing SQLite schemas, investigating slow queries, tuning indexes or pragmas, or reviewing WAL/journal configuration. Connects to the database for concrete evidence via EXPLAIN QUERY PLAN and page stats.
Use when creating implementation-ready beads tasks that need testing strategy, optimal implementation approach, and documentation requirements baked in — composes /create-task with parallel enrichment agents that analyze the codebase and produce concrete test specifications, algorithm/data-structure guidance, and doc quality standards so implementing agents don't need to re-research
Use when writing tests for new code and unsure which test type fits, when choosing between unit/rstest/proptest/fuzz/kani/sim, or when coordination or unsafe code changes need test coverage guidance. Recommends the optimal testing approach per code characteristic.
Use when adding or modifying unsafe blocks, when reviewing code that uses raw pointers or transmute, or before merging changes to types with unsafe internals. Audits safety invariants and demands benchmark+ASM proof of performance benefit.
Use when /asm-forge shows autovectorization missed opportunities, when hot loops process arrays of bytes or integers, or when porting x86 SIMD to ARM NEON/SVE. Generates platform-specific intrinsics with correctness and performance validation.
Write-then-verify documentation pipeline — a doc-rigor agent writes/improves docs, then a separate fresh doc-verify agent checks accuracy of API claims, command examples, units, and platform assumptions against code reality with zero confirmation bias
Use when creating a new module in gossip-coordination, adding a gossip protocol component, or building a pipeline stage that touches distributed state. Generates DST-ready boilerplate with sans-IO pattern and proptest harnesses.
Use when implementing a new feature and assessing coverage gaps, during periodic test hygiene, when test suites feel bloated, or before merging code that changes coordination or hot paths. Two-phase assess-then-improve testing pipeline.
Use when modifying gossip-coordination or coordination contracts, after adding trait methods to CoordinationBackend, after changing state machine transitions, or before marking coordination PRs ready. DST-compatibility code review.
Use when preparing to merge a feature branch, after completing a significant implementation, or when critical code paths need deeper review than a single pass. Six parallel specialist agents plus ranked synthesis.
Review test suites for duplicate, redundant, or low-value tests — especially unit tests already subsumed by property-based tests. Remove noise, keep signal.
Use when exploring the codebase conceptually — semantic search via claude-context MCP for queries like "how does X work", "find implementation of Y pattern", "where is the architecture for Z", understanding unfamiliar code, finding code by description rather than exact identifier
Use when test suites feel bloated, when unit tests duplicate coverage already provided by property-based or simulation tests, or during periodic test hygiene. Identifies and removes redundant tests while keeping signal.
Use when writing or reviewing state-machine tests, simulation tests, oracle tests, or regression tests to verify they actually prove the claimed invariant. Catches hidden weaknesses like missing negative paths and order-sensitive comparisons.
Use when facing a design decision with multiple viable approaches, when you want competing proposals evaluated objectively, or when brainstorming needs structured diverge-then-converge evaluation. 3-5 independent design agents plus ranked synthesis.
Explain a PR's purpose, motivation, and architectural context with ASCII diagrams. Use when the user wants to understand what a PR does, why it exists, how it fits into the system, or asks for a visual summary of changes. Triggers on "explain this PR", "what does this PR do", "summarize this branch", "show me what changed", or `/pr-explainer`.
Use when writing hot-path code in coordination or scanner engine, before committing changes to scanner-engine modules, when benchmarks show unexpected regressions, or during optimization of gossip-stdx data structures. Static performance analysis.
Use when designing new public APIs, adding trait methods, or refactoring type signatures to ensure they are easy to use correctly and hard to use incorrectly. Reviews Rust interfaces for misuse resistance.
Use when design docs in docs/ may be stale after code changes, when verifying diagrams match current types, or when checking that prose claims about invariants and APIs still hold. Audits documentation against code reality with incremental and full modes.
Systematic multi-pass code deduplication audit for Rust workspaces. Use when duplication has accumulated across crates, when error boilerplate is excessive, when repeated From/Display/Error impls appear across modules, when onboarding thiserror, or when establishing CI duplication gates. Triggers on "find duplicates", "reduce duplication", "dedup audit", "thiserror migration", "error boilerplate".
Use when the user wants to create a new runbook, write agent instructions for a repeatable task, or says "create a runbook for X", "write a runbook", "new runbook". Also triggers on "make this into a runbook" or converting an existing skill into a runbook.
SIMD vectorization for Rust — detects ISA features, identifies vectorizable patterns, generates platform-specific intrinsics (ARM NEON/SVE, x86 SSE/AVX/AVX-512), validates correctness and performance. Uses tiered research with baked-in references and /deep-research fallback.
Deep research before design — 3-5 parallel research agents survey papers, production systems, failure modes, and prior art, then a synthesizer compiles evidence, and an integrator maps findings to a concrete codebase plan with citations
Use when AllocGuard trips and you need the call site, when /performance-analyzer flags allocations but you need attribution, when verifying HOT-tier allocation silence, or when /bench-compare shows regression and you suspect allocation overhead. Heap allocation profiling with DHAT.
Use when /deep-research isn't thorough enough, when a topic needs adversarial challenge and deep-dive elaboration, or when producing a polished research report for a complex design decision. 6-phase funnel with 8-10 parallel survey agents plus adversarial review.