.claude/skills/asm-forge/SKILL.md
Use when /performance-analyzer identifies a hot function, when /bench-compare shows regression and you need instruction-level analysis, or when you suspect bounds checks or register spills in a tight loop. ASM-guided optimization with cargo-show-asm + Criterion.
Philosophy: ASM and benchmarks are truth. Everything else is hypothesis.
Every optimization starts by reading what the compiler actually emitted, identifying
where it fell short, making a targeted change, and measuring the result. No guessing,
no cargo-culting #[inline] annotations.
When to use:
- /performance-analyzer or /linux-perf-profile identified a hot function
- /bench-compare shows a regression and you need to understand why at the instruction level
# cargo-show-asm: primary ASM inspection tool
cargo install cargo-show-asm
# Criterion benchmarks must exist for the target code
# (this project has 24+ benchmarks in benches/)
# For LLVM-IR analysis when ASM isn't enough
rustup component add llvm-tools-preview
# For instruction-level profiling on Linux
# (optional, use /linux-perf-profile skill instead)
Always build with debug info for source mapping, but with full optimization:
RUSTFLAGS="-C target-cpu=native -C debuginfo=2" cargo build --release
The project's Cargo.toml already has opt-level=3, lto="thin", codegen-units=1.
/asm-forge @file.rs "focus description"
│
├─ Phase 0: Recon (3 parallel subagents)
│ ├─ Agent A: Collect ASM for target functions
│ ├─ Agent B: Run baseline benchmarks
│ └─ Agent C: Static hotspot analysis of target code
│
├─ Phase 1: ASM Audit
│ └─ Read assembly, classify codegen issues using references
│
├─ Phase 2: Optimization Plan
│ └─ Ranked list of source transforms with predicted ASM impact
│
├─ Phase 3: Forge Loop (iterative, ONE change at a time)
│ ├─ Apply source change
│ ├─ Re-collect ASM → diff before/after
│ ├─ Run benchmarks → compare against baseline
│ └─ Accept (ASM + benchmark both improved) or revert
│
└─ Phase 4: Summary Report
└─ Total improvement, ASM diffs, remaining opportunities
## Phase 0: Recon
Launch three parallel subagents using the Task tool:
Use the Task tool with subagent_type=Bash to collect assembly:
# List available functions in a module (find exact symbol names)
cargo asm --lib -p <crate> 2>&1 | grep '<module_path>'
# Collect ASM for a specific function (Intel syntax, interleaved with Rust source)
cargo asm --lib -p <crate> --rust '<full::path::to::function>' > /tmp/asm-before.s
# For multiple functions, collect each:
cargo asm --lib -p <crate> '<function_1>' > /tmp/asm-before-fn1.s
cargo asm --lib -p <crate> '<function_2>' > /tmp/asm-before-fn2.s
# If cargo-show-asm can't find the function, list candidates:
cargo asm --lib -p <crate> 2>&1 | grep -i '<partial_name>'
# For LLVM-IR view (useful for understanding optimization decisions):
cargo asm --lib -p <crate> --llvm '<function>' > /tmp/llvm-before.ll
# For MIR view (useful for understanding Rust-level optimizations):
cargo asm --lib -p <crate> --mir '<function>' > /tmp/mir-before.mir
ISA auto-detection: cargo-show-asm emits the native ISA by default (use the --target flag for cross-compilation).
Use the Task tool with subagent_type=Bash:
# Save baseline for the relevant benchmarks
# Identify which bench file covers the target functions
cargo bench --bench <relevant_bench> -- --save-baseline forge-before
# For benchmarks requiring the bench feature:
cargo bench --features bench --bench <relevant_bench> -- --save-baseline forge-before
Benchmark selection guide for this project (workspace crates):
- gossip-contracts types → cargo bench -p gossip-contracts --bench identity
- gossip-coordination logic → cargo bench -p gossip-coordination --bench coordination
- gossip-stdx data structures → corresponding bench (e.g., --bench inline_vec, --bench ring_buffer, --bench byte_slab)
Use the Task tool with subagent_type=general-purpose:
Prompt: Read the target file(s) and identify performance-relevant patterns:
- #[cold] or #[inline(always)]
Cross-reference with references/forge-techniques.md for known source→ASM mappings.
## Phase 1: ASM Audit
Once recon completes, read the collected ASM files and audit codegen quality.
Load these references based on the target ISA:
- references/aarch64-codegen.md
- references/x86-64-codegen.md
- references/asm-red-flags.md and references/ilp-and-microarch.md
Scan the ASM for these categories (detailed patterns in references):
1. Panic paths in hot code (highest priority)
; x86-64: look for
call core::panicking::panic_bounds_check
call core::panicking::panic
; AArch64: look for
bl core::panicking::panic_bounds_check
bl core::panicking::panic
These are bounds checks (or other panicking checks, such as failed assertions or
overflow checks) that the compiler couldn't elide. Each one adds a branch and
cold-path code to the hot loop. Fix with get_unchecked() (with a safety proof) or
restructure to help the compiler prove bounds.
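A minimal sketch of both fixes, using hypothetical functions that are not taken from this project. All three compute the same wrapping dot product. Whether a given check is actually elided depends on the compiler version and surrounding code, so confirm with cargo asm rather than trusting the source.

```rust
/// Indexed loop: `b[i]` keeps a bounds check because nothing tells the
/// compiler that `b` is at least as long as `a`.
pub fn dot_indexed(a: &[u32], b: &[u32]) -> u32 {
    let mut acc = 0u32;
    for i in 0..a.len() {
        acc = acc.wrapping_add(a[i].wrapping_mul(b[i]));
    }
    acc
}

/// Restructured: re-slice both inputs to a common length once, then
/// iterate with zip so there is no per-element index to check.
pub fn dot_zipped(a: &[u32], b: &[u32]) -> u32 {
    let n = a.len().min(b.len());
    let (a, b) = (&a[..n], &b[..n]);
    a.iter()
        .zip(b)
        .fold(0u32, |acc, (x, y)| acc.wrapping_add(x.wrapping_mul(*y)))
}

/// get_unchecked variant, with the safety proof written next to the call.
pub fn dot_unchecked(a: &[u32], b: &[u32]) -> u32 {
    let n = a.len().min(b.len());
    let mut acc = 0u32;
    for i in 0..n {
        // SAFETY: i < n, and n <= a.len() and n <= b.len() by construction.
        let (x, y) = unsafe { (*a.get_unchecked(i), *b.get_unchecked(i)) };
        acc = acc.wrapping_add(x.wrapping_mul(y));
    }
    acc
}
```

Prefer the zip form when it produces clean ASM; reach for get_unchecked only when the audit shows the check survives.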
2. Register spills
; x86-64: look for stack spills
mov [rsp+offset], reg ; spill to stack
mov reg, [rsp+offset] ; reload from stack
; AArch64: look for
str xN, [sp, #offset] ; spill
ldr xN, [sp, #offset] ; reload
Spills mean the register allocator ran out of registers: too many values are live at once. Fix by splitting the function, reducing live variables, or restructuring to reduce register pressure.
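One way to split a function is to move the rarely taken path out of line, sketched below with hypothetical names (report_invalid and checksum are illustrations, not project code). A loop this small is unlikely to spill on its own; the point is the shape of the transform. Check the spill count in the ASM before and after.

```rust
/// Cold error path kept out of line so its formatting machinery and
/// temporaries are not inlined into the hot loop.
#[cold]
#[inline(never)]
fn report_invalid(index: usize, value: u8) -> String {
    format!("invalid byte {value:#04x} at index {index}")
}

/// Hot loop: only `acc` and the iterator state stay live across iterations.
pub fn checksum(bytes: &[u8]) -> Result<u32, String> {
    let mut acc = 0u32;
    for (i, &b) in bytes.iter().enumerate() {
        if b == 0xFF {
            return Err(report_invalid(i, b));
        }
        acc = acc.rotate_left(5) ^ u32::from(b);
    }
    Ok(acc)
}
```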
3. Long dependency chains (ILP killers)
Each instruction that depends on the previous one's result creates a serial chain. The CPU can't execute them in parallel even with out-of-order execution.
Look for sequences where each instruction uses the result of the previous one:
; Bad: serial chain, latencies add up instead of overlapping (~8 cycles total)
mov rax, [mem] ; cycle 0: load (~4 cycles on an L1 hit)
add rax, rbx ; cycle 4: depends on load (1 cycle)
imul rax, rcx ; cycle 5: depends on add (3 cycles), result ready at ~cycle 8
Fix by restructuring to create independent operations the CPU can overlap.
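A common source-level form of this fix is to split one accumulator into several independent ones. The sketch below uses hypothetical function names. It uses f64 because rustc does not reassociate floating-point adds on its own, so the serial version really is a serial chain. Note that changing the summation order can change rounding for general inputs.

```rust
/// Single accumulator: every add waits on the previous add.
pub fn sum_serial(xs: &[f64]) -> f64 {
    let mut acc = 0.0;
    for &x in xs {
        acc += x;
    }
    acc
}

/// Four independent accumulators: the four adds in each iteration do not
/// depend on each other, so the CPU can overlap them.
pub fn sum_four_lanes(xs: &[f64]) -> f64 {
    let mut acc = [0.0f64; 4];
    let mut chunks = xs.chunks_exact(4);
    for c in &mut chunks {
        acc[0] += c[0];
        acc[1] += c[1];
        acc[2] += c[2];
        acc[3] += c[3];
    }
    // Elements left over when the length is not a multiple of 4.
    let tail: f64 = chunks.remainder().iter().sum();
    (acc[0] + acc[1]) + (acc[2] + acc[3]) + tail
}
```

In the ASM, look for four separate add chains on distinct registers instead of one.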
4. Missed vectorization
If you see scalar operations on arrays where SIMD instructions exist:
; Bad: scalar loop processing one element at a time
.loop:
movzx eax, byte [rdi]
; ... process one byte ...
inc rdi
cmp rdi, rsi
jne .loop
; Good: SIMD processing 16/32 bytes at a time
.loop:
vmovdqu ymm0, [rdi] ; load 32 bytes
; ... process 32 bytes ...
add rdi, 32
cmp rdi, rsi
jne .loop
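At the source level, fixed-size chunks with a branch-free body often give the auto-vectorizer an easier job. The sketch below uses hypothetical names and a chunk size of 32 chosen only for illustration. The compiler is free to vectorize both versions or neither, so treat the ASM as the only evidence.

```rust
/// Straightforward per-byte loop.
pub fn count_eq_scalar(haystack: &[u8], needle: u8) -> usize {
    let mut n = 0;
    for &b in haystack {
        if b == needle {
            n += 1;
        }
    }
    n
}

/// Fixed-size chunks with a branch-free inner loop and a narrow counter.
pub fn count_eq_chunked(haystack: &[u8], needle: u8) -> usize {
    let mut chunks = haystack.chunks_exact(32);
    let mut total = 0usize;
    for chunk in &mut chunks {
        // A u8 counter cannot overflow here: at most 32 matches per chunk.
        let mut lane = 0u8;
        for &b in chunk {
            lane += (b == needle) as u8;
        }
        total += lane as usize;
    }
    // Handle the bytes left over after the last full chunk.
    total + chunks.remainder().iter().filter(|&&b| b == needle).count()
}
```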
5. Unnecessary memory traffic
Values loaded from memory, used once, then reloaded instead of staying in registers. Often caused by aliasing concerns or complex control flow.
6. Suboptimal instruction selection
- >> 3 instead of / 8
- cmov on x86-64, csel on AArch64
For each issue found, record:
### Issue N: [Category]
**Location**: instruction offset or source line (from interleaved view)
**ASM evidence**:
```asm
; the problematic instruction sequence
```
**Root cause**: Why the compiler generated this
**Proposed fix**: Source-level change to improve codegen
**Expected ASM impact**: What the improved code should look like
**Confidence**: High/Medium/Low (based on how predictable the compiler response is)
## Phase 2: Optimization Plan
Rank all identified issues by:
1. **Impact**: How many cycles does this cost per invocation?
- Panic paths: 0 cycles normally (predicted correctly) but pollute icache
- Spills: 4-7 cycles per spill+reload pair
- Dependency chains: pipeline depth × chain length
- Missed SIMD: Nx speedup where N = vector width / scalar width
2. **Confidence**: How predictable is the compiler's response to our change?
- High: Removing bounds checks, adding `#[cold]`, `get_unchecked` → always works
- Medium: Restructuring for ILP, data packing → usually works
- Low: Hoping for auto-vectorization → compiler may not cooperate
3. **Risk**: Could this change introduce bugs?
- `unsafe` changes need safety proofs
- Layout changes need all access sites audited
- Algorithmic changes need correctness tests
Present as:
```markdown
## Forge Plan
### 1. [Issue] — Expected: X% improvement, Confidence: High
Source change: [specific diff]
ASM prediction: [what should change]
Benchmark: `cargo bench --bench <name> -- '<filter>'`
### 2. [Issue] — Expected: X% improvement, Confidence: Medium
...
```
## Phase 3: Forge Loop
Critical discipline: ONE change at a time.
For each planned optimization:
Make the source modification. Keep it minimal and isolated.
cargo asm --lib -p <crate> --rust '<function>' > /tmp/asm-after.s
# Use the bundled diff script
bash <skill_dir>/scripts/diff_asm.sh /tmp/asm-before.s /tmp/asm-after.s
Or manually:
diff --color=always /tmp/asm-before.s /tmp/asm-after.s | head -100
Verify the ASM actually changed as predicted:
If the ASM didn't improve as expected: The compiler may need a different hint, or the issue is elsewhere. Investigate before proceeding.
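For the panic-path category, the before/after check can be reduced to a count. The helper below is a hypothetical sketch, separate from the bundled diff_asm.sh script. It assumes demangled symbol names as shown in the Phase 1 examples and only recognizes call (x86-64) and bl (AArch64) as call instructions.

```rust
/// Counts call sites that target a `core::panicking::*` entry point in an
/// ASM listing. Assumes one instruction per line and demangled symbols.
pub fn count_panic_calls(asm: &str) -> usize {
    asm.lines()
        .filter(|line| {
            // The first whitespace-separated token is the mnemonic.
            matches!(line.split_whitespace().next(), Some("call" | "bl"))
                && line.contains("core::panicking::")
        })
        .count()
}
```

Run it over the contents of /tmp/asm-before.s and /tmp/asm-after.s; the count should drop by the number of checks the change was predicted to remove.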
cargo bench --bench <relevant_bench> -- --baseline forge-before '<filter>'
# For bench-feature benchmarks:
cargo bench --features bench --bench <relevant_bench> -- --baseline forge-before '<filter>'
| ASM improved? | Benchmark improved? | Action |
|:---:|:---:|---|
| Yes | Yes | Accept. Update baseline: `cargo bench --bench <name> -- --save-baseline forge-before` |
| Yes | No | Investigate. ASM looks better but real workload didn't benefit. Bottleneck is elsewhere. Consider reverting. |
| No | Yes | Suspicious. Measurement noise? Re-run with more iterations. |
| No | No | Revert immediately. `git checkout -- <file>` |
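If the loop is driven by a script or agent, the decision table maps directly onto a small function. This is a hypothetical sketch; the Action names are invented here to mirror the four rows.

```rust
/// Outcome of one forge iteration, one variant per table row.
#[derive(Debug, PartialEq, Eq, Clone, Copy)]
pub enum Action {
    Accept,      // keep the change and re-save the baseline
    Investigate, // ASM improved but the benchmark did not
    Remeasure,   // benchmark improved without an ASM change: suspect noise
    Revert,      // neither improved
}

pub fn decide(asm_improved: bool, bench_improved: bool) -> Action {
    match (asm_improved, bench_improved) {
        (true, true) => Action::Accept,
        (true, false) => Action::Investigate,
        (false, true) => Action::Remeasure,
        (false, false) => Action::Revert,
    }
}
```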
After accepting a change:
- /tmp/asm-before-*.s
## Phase 4: Summary Report
After completing the forge loop, produce:
## ASM Forge Report: [target]
### Environment
- ISA: [AArch64 / x86-64]
- CPU: [Apple M3 / Intel i9 / Graviton3 / etc.]
- Build: `opt-level=3, lto=thin, codegen-units=1, target-cpu=native`
- Rust: [rustc version]
### Changes Applied
| # | Optimization | ASM Impact | Benchmark Impact | Status |
|---|---|---|---|---|
| 1 | Elided bounds check in inner loop | -2 branches, -1 panic path | -8.3% latency | Accepted |
| 2 | Restructured for ILP (2 independent chains) | +4 parallel ops | -12.1% latency | Accepted |
| 3 | Packed struct from 48→32 bytes | -2 cache lines per iter | -3.2% latency | Accepted |
| 4 | Attempted SIMD for byte scan | No vectorization emitted | No change | Reverted |
### Aggregate Improvement
| Benchmark | Before | After | Delta |
|---|---|---|---|
| hot_function | 145 ns | 112 ns | -22.8% |
| end_to_end_scan | 3.2 ms | 2.8 ms | -12.5% |
### Remaining Opportunities
[List any issues identified in Phase 1 that weren't addressed, with rationale]
### Key ASM Diffs
[Include the most instructive before/after ASM snippets for documentation]
# List all functions in a module (grep for your target)
cargo asm --lib -p <crate> 2>&1 | grep 'inline_vec'
# Show ASM with interleaved Rust source (best for audit)
cargo asm --lib -p <crate> --rust 'gossip_stdx::inline_vec::InlineVec<T,N>::push'
# Show only ASM (cleaner for diffing)
cargo asm --lib -p <crate> 'gossip_stdx::inline_vec::InlineVec<T,N>::push'
# Show LLVM-IR (understand optimizer decisions)
cargo asm --lib -p <crate> --llvm 'function_name'
# Show MIR (understand Rust-level optimizations, monomorphization)
cargo asm --lib -p <crate> --mir 'function_name'
# When function name is ambiguous, cargo-show-asm shows candidates
# Pick the right monomorphization by examining the type parameters
Load references/forge-techniques.md for the complete catalog. Key ones:
- `[]` indexing → bounds check + panic path. Use iterators or get_unchecked().
- Option<u32> → 8 bytes (u32 has no spare bit pattern to use as a niche for None). Use sentinel values or NonZeroU32.
When optimizing for both Apple Silicon and x86-64 Linux:
- AArch64 conditional select (csel) is free and common. x86-64 cmov has limitations.
  Branchless patterns may look different across ISAs.
Stop forging when:
- /linux-perf-profile on Linux
Related skills:
- /performance-analyzer: Static hotspot analysis for this project's patterns
- /linux-perf-profile: Hardware counter analysis on Linux (PMU, cache, TLB)
- /bench-compare: Quick before/after benchmark comparison
- /perf-topdown: Cross-arch TMA classification; use when perf report --stdio --percent-limit=5.0 has identified hot symbols for ASM analysis
- /perf-regression: Full regression testing workflow before merging
- /pgo-bolt: After exhausting source-level ASM improvements, apply PGO+BOLT for binary-level layout optimization