.claude/skills/linux-perf-profile/SKILL.md
Use when profiling on Linux/ARM/Graviton targets, when you need PMU counter data beyond what flamegraphs show, or when /perf-topdown identifies a bottleneck class that needs source-level drill-down. Deep perf profiling with annotated hotspot analysis.
npx skillsauth add ahrav/gossip-rs linux-perf-profile
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Deep hardware-level performance analysis using Linux perf on this ARM (Neoverse V1 / Graviton3) system. Goes beyond Criterion benchmarks to explain why code is slow using PMU counters, topdown decomposition, cache/TLB analysis, and annotated disassembly.
- /bench-compare finds a regression and you need to explain it

Ensure a release build with debug info exists:
RUSTFLAGS="-C target-cpu=native -C debuginfo=2" cargo build --release
The -C debuginfo=2 flag is critical — it enables perf annotate to map samples back to Rust source lines without impacting optimization.
Use the mode that matches your investigation. Modes are ordered from broadest to most focused.
Get a high-level breakdown of where cycles are going: frontend stalls, backend stalls, or retiring useful work.
perf stat -e cpu_cycles,inst_retired,stall_frontend,stall_backend,stall_backend_mem,br_retired,br_mis_pred_retired \
./target/release/gossip-worker 2>&1
Derived metrics to compute:
| Metric | Formula | Healthy | Investigate |
|--------|---------|---------|-------------|
| IPC | inst_retired / cpu_cycles | > 2.0 | < 1.0 |
| Frontend bound % | stall_frontend / cpu_cycles × 100 | < 10% | > 20% |
| Backend bound % | stall_backend / cpu_cycles × 100 | < 30% | > 40% |
| Backend memory % | stall_backend_mem / cpu_cycles × 100 | < 15% | > 25% |
| Branch mispredict rate | br_mis_pred_retired / br_retired × 100 | < 1% | > 3% |
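The cycle-based ratios in the table can be computed mechanically. The following is a minimal sketch, not part of the original skill: `topdown_metrics` is a hypothetical helper name, and it assumes you pass the five raw counts copied from the `perf stat` output in the order shown in the comment.

```shell
# Hypothetical helper (not part of the skill): derive topdown ratios from raw counts.
# Usage: topdown_metrics CPU_CYCLES INST_RETIRED STALL_FRONTEND STALL_BACKEND STALL_BACKEND_MEM
topdown_metrics() {
  awk -v cyc="$1" -v inst="$2" -v fe="$3" -v be="$4" -v bemem="$5" 'BEGIN {
    # Every formula divides by cpu_cycles, so refuse a zero or missing value.
    if (cyc + 0 <= 0) { print "error: cpu_cycles must be > 0"; exit 1 }
    printf "IPC=%.2f\n", inst / cyc
    printf "frontend_bound_pct=%.1f\n", fe / cyc * 100
    printf "backend_bound_pct=%.1f\n", be / cyc * 100
    printf "backend_mem_pct=%.1f\n", bemem / cyc * 100
  }'
}
```

For example, `topdown_metrics 1000 1500 100 350 200` reports an IPC of 1.50 with 10.0% frontend bound, 35.0% backend bound, and 20.0% backend memory.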
Interpretation guide:
Record samples and identify which functions consume the most cycles.
perf record -g --call-graph dwarf,32768 -F 4999 \
./target/release/gossip-worker
Flags explained:
- `-g --call-graph dwarf,32768`: DWARF-based unwinding, which works with default Rust release builds (they omit frame pointers). The 32 KB stack dump covers deep async stacks.
- `-F 4999`: ~5000 samples/sec. Use a prime number to avoid aliasing with periodic code.

perf report --no-children --sort=dso,symbol --percent-limit=1.0
- `--no-children`: shows self time only (not cumulative call-tree cost).
- `--percent-limit=1.0`: hides noise below 1%.

perf report --no-children --sort=symbol --percent-limit=0.5 --stdio 2>&1 | head -80
Pipe the output here for analysis.
perf script | inferno-collapse-perf | inferno-flamegraph > flamegraph.svg
If inferno is not installed: cargo install inferno. Alternatively:
perf script > perf.script
# Then use https://www.speedscope.app/ — drag-and-drop perf.script
When topdown shows backend/memory stalls, measure the cache hierarchy.
perf stat -e l1d_cache,l1d_cache_refill,l1d_cache_lmiss_rd,l2d_cache,l2d_cache_refill,l2d_cache_lmiss_rd,l1i_cache,dtlb_walk,itlb_walk,mem_access \
./target/release/gossip-worker 2>&1
Derived metrics:
| Metric | Formula | Healthy | Investigate |
|--------|---------|---------|-------------|
| L1d miss rate | l1d_cache_refill / l1d_cache × 100 | < 5% | > 10% |
| L1d → L2 miss rate | l1d_cache_lmiss_rd / l1d_cache_refill × 100 | < 30% | > 50% |
| L2 miss rate | l2d_cache_refill / l2d_cache × 100 | < 10% | > 20% |
| L2 → LLC/DRAM rate | l2d_cache_lmiss_rd / l2d_cache_refill × 100 | < 20% | > 40% |
| dTLB miss rate | dtlb_walk / mem_access × 100 | < 0.5% | > 2% |
| iTLB miss rate | itlb_walk / l1i_cache × 100 | < 0.1% | > 1% |
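Some of these ratios can be derived directly from machine-readable counter output. The sketch below assumes the counters were collected with `perf stat -x,` (CSV output) and that each row starts with `value,unit,event`. `cache_miss_rates` is a hypothetical helper name, and only three of the table's formulas are shown.

```shell
# Hypothetical helper: read perf stat CSV (-x,) on stdin and print miss rates.
# Assumed row layout: value,unit,event,... (rows such as "<not counted>" are skipped).
cache_miss_rates() {
  awk -F, '
    $1 ~ /^[0-9]+$/ { count[$3] = $1 + 0 }   # keep only rows with an integer count
    END {
      if (count["l1d_cache"] > 0)
        printf "l1d_miss_pct=%.2f\n", count["l1d_cache_refill"] / count["l1d_cache"] * 100
      if (count["l2d_cache"] > 0)
        printf "l2_miss_pct=%.2f\n", count["l2d_cache_refill"] / count["l2d_cache"] * 100
      if (count["mem_access"] > 0)
        printf "dtlb_miss_pct=%.2f\n", count["dtlb_walk"] / count["mem_access"] * 100
    }'
}
```

One way to feed it is to write the counters to a file with `perf stat -x, -o /tmp/cache.csv -e <events> ./target/release/gossip-worker` and then run `cache_miss_rates < /tmp/cache.csv`; the file path is only an illustration.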
Common patterns in this codebase:
TimingWheel (crates/gossip-stdx/src/timing_wheel.rs) or SetAssociativeCache (crates/scanner-engine/src/lsm/set_associative_cache.rs).

When topdown shows frontend stalls or branch misprediction issues.
perf stat -e br_pred,br_mis_pred,br_retired,br_mis_pred_retired,inst_retired \
./target/release/gossip-worker 2>&1
Derived metrics:
| Metric | Formula | Healthy | Investigate |
|--------|---------|---------|-------------|
| Speculative mispredict % | br_mis_pred / br_pred × 100 | < 2% | > 5% |
| Retired mispredict % | br_mis_pred_retired / br_retired × 100 | < 1% | > 3% |
| Branch density | br_retired / inst_retired × 100 | < 20% | > 30% |
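The Healthy and Investigate columns leave a gap between them, so a computed percentage can fall into neither. A minimal sketch of applying the thresholds follows. `classify` is a hypothetical helper, and the label "borderline" for values inside the gap is an assumption, not a term from this skill.

```shell
# Hypothetical helper: classify VALUE against a table row's two thresholds.
# Usage: classify VALUE HEALTHY_BELOW INVESTIGATE_ABOVE
classify() {
  awk -v v="$1" -v ok="$2" -v bad="$3" 'BEGIN {
    # "+ 0" forces numeric comparison regardless of how the arguments were written.
    if (v + 0 < ok + 0)       print "healthy"
    else if (v + 0 > bad + 0) print "investigate"
    else                      print "borderline"   # assumed label for the gap
  }'
}
```

With the retired mispredict row (healthy below 1%, investigate above 3%), `classify 0.8 1 3` prints `healthy`, `classify 2 1 3` prints `borderline`, and `classify 4.2 1 3` prints `investigate`.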
To find which branches are mispredicting:
perf record -e br_mis_pred_retired -c 1000 -g --call-graph dwarf \
./target/release/gossip-worker
perf report --stdio --percent-limit=1.0 2>&1 | head -60
Once you've identified a hot function from Mode 2, drill into it at the source-line level.
perf annotate --symbol=<function_name> --stdio 2>&1
For a specific event:
perf record -e l1d_cache_refill -c 10000 --call-graph dwarf \
./target/release/gossip-worker
perf annotate --symbol=<function_name> --stdio 2>&1
This shows the percentage of samples on each source line / assembly instruction. Look for:
Compare two builds at the hardware counter level to explain a Criterion regression.
git stash push -m "changes"
RUSTFLAGS="-C target-cpu=native -C debuginfo=2" cargo build --release
perf stat -r 3 -e cpu_cycles,inst_retired,stall_frontend,stall_backend,stall_backend_mem,l1d_cache_refill,l2d_cache_refill,br_mis_pred_retired \
./target/release/gossip-worker 2>&1 | tee /tmp/perf-baseline.txt
git stash pop
RUSTFLAGS="-C target-cpu=native -C debuginfo=2" cargo build --release
perf stat -r 3 -e cpu_cycles,inst_retired,stall_frontend,stall_backend,stall_backend_mem,l1d_cache_refill,l2d_cache_refill,br_mis_pred_retired \
./target/release/gossip-worker 2>&1 | tee /tmp/perf-after.txt
Compare the two files and compute deltas. Focus on:
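Computing the deltas by hand is error-prone, so here is a hedged sketch of one way to automate it. `perf_delta` is a hypothetical helper. It assumes the default human-readable `perf stat` layout, where each counter row begins with a comma-grouped count followed by the event name, and it would misread any line of program output that happens to have the same shape.

```shell
# Hypothetical helper: per-event percentage change between two perf stat text reports.
# Usage: perf_delta BASELINE_FILE AFTER_FILE
perf_delta() {
  awk '
    # Counter rows look like "  1,234,567   cpu_cycles   ( +- 0.10% )".
    $1 ~ /^[0-9][0-9,]*$/ {
      gsub(",", "", $1)                       # strip thousands separators
      if (FILENAME == ARGV[1]) base[$2] = $1 + 0
      else                     after[$2] = $1 + 0
    }
    END {
      for (ev in base)
        if ((ev in after) && base[ev] > 0)
          printf "%s %+.1f%%\n", ev, (after[ev] - base[ev]) / base[ev] * 100
    }' "$1" "$2" | sort
}
```

Running `perf_delta /tmp/perf-baseline.txt /tmp/perf-after.txt` prints one line per event present in both files; a positive value means the count rose after the change.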
For diagnosing contention in async/concurrent code paths.
perf stat -e context-switches,cpu-migrations,page-faults \
-e sdt_libpthread:mutex_entry,sdt_libpthread:mutex_acquired \
./target/release/gossip-worker 2>&1
For scheduling latency:
perf sched record ./target/release/gossip-worker
perf sched latency --sort max 2>&1 | head -30
Pre-built event sets tuned for this system (ARM Neoverse V1, armv8_pmuv3).
Multiplexing note: This virtualized Graviton3 environment exposes ~3 simultaneous hardware counters. Do not use {} group pinning syntax — it will fail with <not supported>. Instead, pass events as a comma-separated list and let perf stat multiplex automatically. The kernel time-shares counters and scales results; the (XX.XX%) annotation next to each counter shows what fraction of runtime it was active. With workloads running ≥1 second, scaled values are reliable. For maximum accuracy on critical ratios, use -r 3 (repeat 3x) and keep groups small (3-4 events).
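To make the duty-cycle check less manual, the sketch below lists events whose counter was scheduled for only a small share of the run. It rests on several assumptions: `low_coverage` is a hypothetical helper, the input is `perf stat -x,` CSV collected without `-r`, the fifth field is the running percentage (verify this against your local perf-stat man page), and the 30% default cutoff is an arbitrary illustration rather than a value from this skill.

```shell
# Hypothetical helper: flag events with a low multiplexing duty cycle.
# Reads perf stat CSV (-x,, no -r) on stdin.
# Assumed row layout: value,unit,event,run-time,running-pct,...
# Usage: low_coverage [MIN_PCT]   (MIN_PCT defaults to 30, an illustrative cutoff)
low_coverage() {
  awk -F, -v min="${1:-30}" '
    $1 ~ /^[0-9]+$/ && $5 + 0 < min + 0 { printf "%s %.2f%%\n", $3, $5 }
  '
}
```

If an event shows up here, rerunning it in a smaller event list gives the counter more time on the hardware.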
cpu_cycles,inst_retired,stall_frontend,stall_backend,stall_backend_mem,br_mis_pred_retired
l1d_cache,l1d_cache_refill,l1d_cache_lmiss_rd,l1i_cache,l1i_cache_refill,l1i_cache_lmiss
l2d_cache,l2d_cache_refill,l2d_cache_lmiss_rd,dtlb_walk,itlb_walk,mem_access
br_pred,br_mis_pred,br_retired,br_mis_pred_retired,inst_retired,cpu_cycles
inst_retired,inst_spec,op_retired,op_spec,cpu_cycles,stall
stall_backend,stall_backend_mem,l1d_cache_lmiss_rd,l2d_cache_lmiss_rd,mem_access,cpu_cycles
When you need exact ratios without scaling, use ≤3 events:
cpu_cycles,inst_retired,stall_backend
Report findings using this structure:
## Perf Profile: [target / scenario]
### Environment
- CPU: ARM Neoverse V1 (Graviton3), 16 cores
- Build: `RUSTFLAGS="-C target-cpu=native -C debuginfo=2"` release
- Target: [repo or benchmark name]
### Topdown Summary
| Metric | Value | Assessment |
|--------|-------|------------|
| IPC | X.XX | [good/investigate] |
| Frontend bound | X.X% | [ok/high] |
| Backend bound | X.X% | [ok/high] |
| Backend memory | X.X% | [ok/high] |
| Branch mispredict | X.X% | [ok/high] |
### Hotspots (top 5 by self%)
| Rank | Symbol | Self % | Module | Likely Cause |
|------|--------|--------|--------|--------------|
| 1 | func_name | XX.X% | gossip_stdx | [explanation] |
### Cache Hierarchy (if relevant)
| Level | Accesses | Misses | Miss Rate | Assessment |
|-------|----------|--------|-----------|------------|
| L1d | X.XXB | X.XXM | X.X% | [ok/high] |
| L2 | X.XXM | X.XXM | X.X% | [ok/high] |
### Root Cause Analysis
[Narrative explaining what the numbers mean for this specific code. Connect
PMU data to source-level patterns. Reference specific lines/functions.]
### Recommendations
1. **[Issue]** at `file:line` — [specific fix with rationale tied to PMU data]
- Expected impact: [which metric should improve and by roughly how much]
- Validate: `perf stat -e <relevant_events> ...`
### Validation Plan
[Commands to re-run after applying fixes to confirm improvement]
- `perf stat` multiplexes automatically: the (XX.XX%) annotation shows sampling duty cycle. With ≥1s workloads, scaled values are reliable. Never use {} group pinning.
- `--call-graph dwarf` adds ~5-15% overhead to the profiled process. For tight latency measurements, use `perf stat` instead of `perf record`.
- [kernel.kallsyms]. These are kernel-side costs (syscalls, page faults). If they dominate, investigate I/O patterns or memory allocation.
- perf annotate on the caller to see inlined code.
- -t (per-thread) recording to separate them if needed: `perf record -t <tid>`.

- /bench-compare: Criterion before/after measurement (use first to detect regressions)
- /perf-regression: Full regression workflow with acceptance criteria
- /perf-topdown: Cross-arch TMA + branch trace entry point; escalates to this skill for ARM-specific deep dives
- /performance-analyzer: Static hotspot analysis for this project's patterns
- /pgo-bolt: Branch sampling data from perf can feed BOLT for post-link binary optimization