.claude/skills/perf-topdown/SKILL.md
Use when you need to classify why code is slow (front-end vs back-end vs speculation), when hunting branch misprediction sites, after /bench-compare or /perf-regression finds a regression needing root cause, or when building an isolated hot-loop harness. Cross-arch TMA and branch tracing.
Structured workflow for diagnosing why code is slow using hardware performance counters. Works on both x86-64 (Intel/AMD) and AArch64 (ARM Neoverse/Cortex).
The recipe: build harness → classify bottleneck → trace branches → apply fix.
- /bench-compare or /perf-regression finds a regression and you need root cause

- /linux-perf-profile (Modes 3-7)
- /asm-forge
- /bench-compare
- /rust-perf-triage

RUSTFLAGS='-C opt-level=3 -C target-cpu=native -C force-frame-pointers=yes -C debuginfo=2' \
cargo build --release --bin <target>
- -C force-frame-pointers=yes: enables lightweight call-graph capture alongside LBR
- -C debuginfo=2: maps samples back to Rust source lines without affecting optimization

Apply these before profiling. Each reduces measurement variance:
# 1. CPU governor to performance (biggest single impact)
sudo cpupower frequency-set --governor performance
# Or: echo performance | sudo tee /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
# 2. Disable turbo/boost
# Intel:
echo 1 | sudo tee /sys/devices/system/cpu/intel_pstate/no_turbo
# AMD/generic:
echo 0 | sudo tee /sys/devices/system/cpu/cpufreq/boost
# ARM Graviton: fixed frequency, no action needed
# 3. Pin to a specific core (avoid core 0 — handles IRQs)
taskset -c 2 ./target/release/binary
# 4. Disable ASLR (per-process, no global security impact)
setarch $(uname -m) -R ./target/release/binary
# 5. Disable NMI watchdog (frees a PMU counter)
echo 0 | sudo tee /proc/sys/kernel/nmi_watchdog
# 6. Set perf_event_paranoid for counter access
echo -1 | sudo tee /proc/sys/kernel/perf_event_paranoid
Graviton note: AWS Graviton processors run at fixed frequency with no turbo/boost control. Step 2 is unnecessary; steps 1, 3-6 still apply.
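The checklist above can be verified before each profiling run. The following is a minimal sketch, not part of the skill itself: `Tunables` and `stability_warnings` are hypothetical names, and the caller is assumed to have read the raw strings from the sysfs/procfs files listed above.

```rust
/// Raw contents of the tunables from the checklist above.
/// Hypothetical helper type: reading the files is left to the caller.
pub struct Tunables<'a> {
    /// /sys/devices/system/cpu/cpu*/cpufreq/scaling_governor
    pub governor: &'a str,
    /// /sys/devices/system/cpu/intel_pstate/no_turbo (None if the file is absent)
    pub no_turbo: Option<&'a str>,
    /// /proc/sys/kernel/nmi_watchdog
    pub nmi_watchdog: &'a str,
    /// /proc/sys/kernel/perf_event_paranoid
    pub perf_event_paranoid: &'a str,
}

/// Returns one warning per setting that differs from the checklist values.
pub fn stability_warnings(t: &Tunables<'_>) -> Vec<String> {
    let mut warnings = Vec::new();

    let governor = t.governor.trim();
    if governor != "performance" {
        warnings.push(format!("governor is '{governor}', expected 'performance'"));
    }
    if let Some(value) = t.no_turbo {
        if value.trim() != "1" {
            warnings.push("turbo is still enabled (no_turbo != 1)".to_string());
        }
    }
    if t.nmi_watchdog.trim() != "0" {
        warnings.push("NMI watchdog is on and occupies a PMU counter".to_string());
    }
    match t.perf_event_paranoid.trim().parse::<i32>() {
        Ok(level) if level <= -1 => {}
        Ok(level) => {
            warnings.push(format!("perf_event_paranoid is {level}, checklist sets -1"));
        }
        Err(_) => warnings.push("perf_event_paranoid is not an integer".to_string()),
    }
    warnings
}
```

An empty result means the machine matches the checklist; on Graviton, pass `no_turbo: None` so the turbo check is skipped.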
Use when no existing Criterion benchmark covers the suspected hot path.
If a benchmark already exists, use /bench-compare instead.
Create src/bin/tiny_hot.rs:
use std::hint::black_box;
use std::time::{Duration, Instant};
use std::thread;
fn hot_loop(iter: usize) -> u64 {
    let mut acc: u64 = 0;
    for i in 0..iter {
        // Replace with the suspected hotspot, keep inputs stable
        acc = acc.wrapping_add((i as u64).rotate_left(13) ^ 0x9E3779B97F4A7C15);
    }
    acc
}

fn main() {
    let start = Instant::now();
    let target = Duration::from_secs(60);
    let mut reps = 0u64;
    let mut sink = 0u64;
    while start.elapsed() < target {
        sink ^= hot_loop(black_box(2_000_000));
        reps += 1;
    }
    eprintln!("done reps={reps} sink={sink}");
    thread::sleep(Duration::from_millis(10)); // profiler tail time
}
Build and run:
RUSTFLAGS='-C opt-level=3 -C target-cpu=native -C force-frame-pointers=yes -C debuginfo=2' \
cargo build --release --bin tiny_hot
taskset -c 2 ./target/release/tiny_hot
Fast triage to determine if the workload is compute-bound, memory-bound, or branch-bound. Run 5 repetitions for statistical stability.
x86-64:
sudo perf stat -r 5 -e cycles,instructions,branches,branch-misses,cache-misses,LLC-loads,icache_misses \
taskset -c 2 ./target/release/tiny_hot
AArch64:
sudo perf stat -r 5 -e cpu_cycles,inst_retired,br_retired,br_mis_pred_retired,l1d_cache_refill,l2d_cache_refill,l1i_cache_refill \
taskset -c 2 ./target/release/tiny_hot
| Metric | Formula | Healthy | Investigate |
|--------|---------|---------|-------------|
| CPI | cycles / instructions | < 1.0 | > 1.0 |
| Branch miss rate | branch-misses / branches | < 2% | > 5% |
| L1d miss rate | l1d misses / l1d accesses | < 5% | > 10% |
| ICache miss rate | icache_misses / instructions | < 0.1% | > 1% |
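The formulas and thresholds in the table can be applied mechanically to the raw `perf stat` counts. The sketch below is illustrative only: `ratio`, `percent`, `classify`, and `Verdict` are hypothetical names, and the `Borderline` verdict for values between the two thresholds is an assumption, since the table leaves that range unspecified.

```rust
/// Outcome of comparing a metric against the table's thresholds.
/// `Borderline` is an assumed label for the gap the table leaves open.
#[derive(Debug, PartialEq)]
pub enum Verdict {
    Healthy,
    Borderline,
    Investigate,
}

/// Plain ratio such as CPI = cycles / instructions.
/// Returns None when the denominator is zero.
pub fn ratio(numerator: u64, denominator: u64) -> Option<f64> {
    if denominator == 0 {
        None
    } else {
        Some(numerator as f64 / denominator as f64)
    }
}

/// Percentage such as branch miss rate = branch-misses / branches * 100.
pub fn percent(numerator: u64, denominator: u64) -> Option<f64> {
    if denominator == 0 {
        None
    } else {
        Some(100.0 * numerator as f64 / denominator as f64)
    }
}

/// Applies one table row: below `healthy_below` is healthy,
/// above `investigate_above` needs investigation.
pub fn classify(value: f64, healthy_below: f64, investigate_above: f64) -> Verdict {
    if value > investigate_above {
        Verdict::Investigate
    } else if value < healthy_below {
        Verdict::Healthy
    } else {
        Verdict::Borderline
    }
}
```

For example, `classify(percent(branch_misses, branches)?, 2.0, 5.0)` applies the branch miss rate row, and `classify(ratio(cycles, instructions)?, 1.0, 1.0)` applies the CPI row.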
Decision tree:
- branch-misses → suspect predictor/layout, proceed to Mode 4 (Branch Trace)
- icache_misses → suspect code size, see references/tma-diagnosis-actions.md FE-Bound section
- cache-misses or LLC-loads → memory/locality issue, see /linux-perf-profile Mode 3 (ARM) or BE-Bound remediation

Cross-ref rust-perf-triage/references/profiling-tools.md for detailed counter interpretation.
Classify the bottleneck into four categories: Retiring, Bad Speculation, Frontend Bound, Backend Bound. The commands differ by architecture and vendor.
# Level 1 (kernel 4.8+, Sandy Bridge+)
sudo perf stat --topdown -a -- taskset -c 2 ./target/release/tiny_hot
# Level 2 (kernel ~5.13+, Sapphire Rapids+)
sudo perf stat --topdown --td-level 2 -- taskset -c 2 ./target/release/tiny_hot
The --topdown flag requires system-wide mode (-a) on pre-Ice Lake CPUs.
Ice Lake and later allow per-thread topdown collection.
Alternative (deeper drill-down via JSON metrics, kernel 6.1+):
# Level 1
perf stat -M TopdownL1 -- taskset -c 2 ./target/release/tiny_hot
# Drill into a specific L1 category
perf stat -M tma_backend_bound_group -- taskset -c 2 ./target/release/tiny_hot
# Drill further into L3
perf stat -M tma_core_bound_group -- taskset -c 2 ./target/release/tiny_hot
AMD does NOT support --topdown. Use -M PipelineL1 instead:
# Level 1 (5 categories — AMD adds smt_contention)
perf stat -M PipelineL1 -- taskset -c 2 ./target/release/tiny_hot
# Level 2
perf stat -M PipelineL2 -- taskset -c 2 ./target/release/tiny_hot
# Drill into specific category
perf stat -M frontend_bound_group -- taskset -c 2 ./target/release/tiny_hot
AMD pipeline widths: Zen 4 = 6-wide, Zen 5 = 8-wide. Raw slot counts are not comparable across Intel (4-wide through Skylake, wider from Ice Lake onward) and AMD.
ARM has no --topdown flag. Use manual event-based topdown:
Neoverse N2/V2+ (slot-based topdown, kernel metric support):
# If perf supports -M on this core:
perf stat -M frontend_bound,backend_bound,bad_speculation,retiring \
-- taskset -c 2 ./target/release/tiny_hot
# Manual event collection:
perf stat -r 5 -e cpu_cycles,inst_retired,stall_slot_frontend,stall_slot_backend,stall_slot,op_retired,op_spec,br_mis_pred \
-- taskset -c 2 ./target/release/tiny_hot
# Compute: FE% = stall_slot_frontend / (cpu_cycles * slots) * 100
# Compute: BE% = stall_slot_backend / (cpu_cycles * slots) * 100
# (slots = the core's issue slots per cycle)
Neoverse N1/V1 (cycle-based only, no slot events):
perf stat -r 5 -e cpu_cycles,inst_retired,stall_frontend,stall_backend,stall_backend_mem,br_mis_pred_retired \
-- taskset -c 2 ./target/release/tiny_hot
# Compute: FE% = stall_frontend / cpu_cycles * 100
# Compute: BE% = stall_backend / cpu_cycles * 100
See linux-perf-profile Mode 1 for the full ARM topdown derived metrics table.
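The two ARM "Compute" formulas above differ only in the denominator. A minimal sketch of both follows, assuming the counts come from the `perf stat` runs shown; the function names are hypothetical, and `slots_per_cycle` is a parameter you must take from the core's documentation rather than a value given here.

```rust
/// Cycle-based topdown (Neoverse N1/V1 style):
/// percentage = stall_cycles / cpu_cycles * 100.
pub fn cycle_based_pct(stall_cycles: u64, cpu_cycles: u64) -> Option<f64> {
    if cpu_cycles == 0 {
        return None;
    }
    Some(100.0 * stall_cycles as f64 / cpu_cycles as f64)
}

/// Slot-based topdown (Neoverse N2/V2+ style):
/// percentage = stall_slots / (cpu_cycles * slots_per_cycle) * 100.
/// `slots_per_cycle` is core-specific and must be supplied by the caller.
pub fn slot_based_pct(stall_slots: u64, cpu_cycles: u64, slots_per_cycle: u32) -> Option<f64> {
    let total_slots = cpu_cycles as f64 * f64::from(slots_per_cycle);
    if total_slots == 0.0 {
        return None;
    }
    Some(100.0 * stall_slots as f64 / total_slots)
}
```

Pass `stall_frontend` or `stall_backend` to the first function, and `stall_slot_frontend` or `stall_slot_backend` to the second.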
| Category | Threshold | Meaning | Next step |
|----------|-----------|---------|-----------|
| Retiring high | > 80% | Near peak efficiency | Mode 2 for µop reduction; see references/tma-diagnosis-actions.md |
| Bad Speculation high | > 15% | Branch mispredictions | Mode 4 for branch traces |
| Frontend Bound high | > 20% | ICache / decode stalls | references/tma-diagnosis-actions.md FE section |
| Backend Bound high | > 40% | Memory / execution ports | references/tma-diagnosis-actions.md BE section |
See references/tma-diagnosis-actions.md for the complete diagnosis-to-action mapping.
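The threshold table above maps directly onto a small lookup. This sketch is an illustration rather than part of the skill: `TopdownL1` and `next_steps` are hypothetical names, and the thresholds are copied from the table as-is.

```rust
/// TMA Level 1 breakdown in percent, as reported by perf.
pub struct TopdownL1 {
    pub retiring: f64,
    pub bad_speculation: f64,
    pub frontend_bound: f64,
    pub backend_bound: f64,
}

/// Returns the next step for every category above its table threshold,
/// in the same order as the table rows.
pub fn next_steps(t: &TopdownL1) -> Vec<&'static str> {
    let mut steps = Vec::new();
    if t.retiring > 80.0 {
        steps.push("Retiring: Mode 2 for uop reduction");
    }
    if t.bad_speculation > 15.0 {
        steps.push("Bad Speculation: Mode 4 for branch traces");
    }
    if t.frontend_bound > 20.0 {
        steps.push("Frontend Bound: tma-diagnosis-actions.md FE section");
    }
    if t.backend_bound > 40.0 {
        steps.push("Backend Bound: tma-diagnosis-actions.md BE section");
    }
    steps
}
```

An empty vector means no category crosses its threshold. More than one entry can come back, in which case the largest excess is usually the place to start.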
After Mode 3 identifies Bad Speculation or you see high branch-misses in Mode 2, record branch traces to find the exact misprediction sites.
# Record with LBR (user-space branches, ~100K sample period)
sudo perf record -o perf.data -c 100000 -b -e cycles:u \
-- taskset -c 2 ./target/release/tiny_hot
# Identify misprediction hotspots (function-level)
perf report --sort symbol_from,symbol_to,mispredict --stdio
# Dump raw branch stacks with Rust demangling
perf script -F ip,sym,brstack | rustfilt | head -200
# Map specific addresses to source
addr2line -e ./target/release/tiny_hot 0x<ADDRESS>
# View disassembly around a hot branch
objdump -dr --no-show-raw-insn ./target/release/tiny_hot | rustfilt | less
# Annotate with per-basic-block cycles/IPC (Skylake+ timed LBR)
perf annotate --symbol=<function_name> --stdio
Branch type filtering (narrow capture to specific branch types):
perf record -j cond,u ./binary # conditional branches only (mispredict candidates)
perf record -j any_call,any_ret,u ./binary # calls and returns only
perf record -j ind_call,u ./binary # indirect calls only
brstack output format: FROM/TO/M_or_P/INTX/ABORT/CYCLES/TYPE/SPEC
M = mispredicted, P = predicted

Same perf commands as Intel LBR. AMD Zen 4 supports hardware branch filtering and misprediction flags. Zen 3 BRS is limited (16 entries, no filtering, no prediction info); use Zen 4+ for serious LBR work.
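The brstack format described above (FROM/TO/M_or_P/...) is easy to post-process. The sketch below parses the whitespace-separated entries of one brstack field and computes a mispredict rate. The names `BranchEntry`, `parse_brstack`, and `mispredict_rate_pct` are hypothetical, only the first three slash-separated fields are used, and tokens that do not look like an entry are skipped silently.

```rust
/// One branch record taken from a perf script brstack field.
#[derive(Debug, PartialEq)]
pub struct BranchEntry {
    pub from: u64,
    pub to: u64,
    pub mispredicted: bool,
}

/// Parses a hex address with or without a leading "0x".
fn parse_hex(text: &str) -> Option<u64> {
    let digits = text.strip_prefix("0x").unwrap_or(text);
    u64::from_str_radix(digits, 16).ok()
}

/// Parses a single FROM/TO/M_or_P/... entry; later fields are ignored.
fn parse_entry(entry: &str) -> Option<BranchEntry> {
    let mut parts = entry.split('/');
    let from = parse_hex(parts.next()?)?;
    let to = parse_hex(parts.next()?)?;
    let flag = parts.next()?;
    Some(BranchEntry { from, to, mispredicted: flag == "M" })
}

/// Parses every entry in a brstack field, skipping malformed tokens.
pub fn parse_brstack(field: &str) -> Vec<BranchEntry> {
    field.split_whitespace().filter_map(parse_entry).collect()
}

/// Percentage of parsed entries flagged as mispredicted.
pub fn mispredict_rate_pct(entries: &[BranchEntry]) -> Option<f64> {
    if entries.is_empty() {
        return None;
    }
    let missed = entries.iter().filter(|e| e.mispredicted).count();
    Some(100.0 * missed as f64 / entries.len() as f64)
}
```

Grouping the parsed entries by `from` address before computing the rate gives a per-branch ranking that can then be fed to `addr2line`.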
ARM SPE provides statistical branch sampling. For branch misprediction profiling:
# Record branch mispredictions (event_filter bit 7 = 0x80)
sudo perf record -e arm_spe/branch_filter=1,event_filter=0x80/ \
-- taskset -c 2 ./target/release/tiny_hot
# Record all branches
sudo perf record -e arm_spe/branch_filter=1/ \
-- taskset -c 2 ./target/release/tiny_hot
# Analyze
perf report --stdio --percent-limit=1.0
# View decoded samples
perf script
SPE vs LBR: SPE is a statistical sampler (like Intel PEBS) — it samples individual operations with rich metadata (addresses, latency, cache level). It does NOT provide a continuous branch history like LBR. ARM BRBE (ARMv9.2, FEAT_BRBE) is the true LBR equivalent but is only available on the newest cores.
Fallback (no SPE):
perf record -e br_mis_pred_retired -c 1000 -g --call-graph dwarf \
-- taskset -c 2 ./target/release/tiny_hot
perf report --stdio --percent-limit=1.0
SPE-supported cores: Neoverse N1/N2/V1/V2/V3, Cortex-X1/X2/X3/X4/X925, Cortex-A715/A720/A725, Ampere1A. Covers AWS Graviton 2/3/4.
See references/branch-trace-cookbook.md for detailed decode procedures.
# TMA Level 1 raw events
topdown-retiring,topdown-bad-spec,topdown-fe-bound,topdown-be-bound
# LBR recording
cycles:u with -b flag, or br_inst_retired.any:u
# ICache pressure
icache_64b.iftag_miss,frontend_retired.l1i_miss,frontend_retired.l2_miss
Use -M PipelineL1 and -M PipelineL2 metric groups (AMD events have
vendor-specific names accessed through the JSON metric system).
# Topdown (N1/V1 cycle-based)
cpu_cycles,inst_retired,stall_frontend,stall_backend,stall_backend_mem
# Topdown (N2/V2+ slot-based)
cpu_cycles,inst_retired,stall_slot_frontend,stall_slot_backend,stall_slot,op_retired,op_spec
# SPE branch misprediction
arm_spe/branch_filter=1,event_filter=0x80/
# SPE all branches with timestamps
arm_spe/branch_filter=1,ts_enable=1/
Cross-ref linux-perf-profile for ARM cache/TLB/branch event sets.
Cross-ref rust-perf-triage/scripts/perf_counters.sh for generic presets.
Report findings using this structure:
## Top-Down Profile: [target / scenario]
### Environment
- CPU: [model], [arch] (x86-64 / AArch64)
- Build: `RUSTFLAGS="-C target-cpu=native -C debuginfo=2 -C force-frame-pointers=yes"` release
- Stability: governor=performance, turbo=off, pinned to core N
### TMA Level 1
| Category | Value | Assessment |
|----------|-------|------------|
| Retiring | X.X% | [ok/high — near peak] |
| Bad Speculation | X.X% | [ok/high — branch misses] |
| Frontend Bound | X.X% | [ok/high — icache/decode] |
| Backend Bound | X.X% | [ok/high — memory/ports] |
### Branch Trace Hotspots (if Mode 4 was used)
| Rank | From → To | Mispredict Rate | Type | Likely Cause |
|------|-----------|-----------------|------|--------------|
| 1 | func_a+0x42 → func_b | 23% | COND | unpredictable match arm |
### Diagnosis & Remediation
[TMA category] is the bottleneck.
**Root cause**: [specific code pattern tied to PMU data]
**Fix applied**: [source-level change with rationale]
### Validation Plan
[Commands to re-run after applying fixes to confirm improvement]
- --topdown: Requires kernel 4.8+. System-wide (-a) required pre-Ice Lake. Disable NMI watchdog.
- --topdown flag. Use -M PipelineL1 (kernel 6.2+ for Zen 4, ~6.9+ for Zen 5).
- (XX.XX%) duty cycle. Keep groups small.

- /linux-perf-profile: Deep ARM counter drill-down (cache/TLB/lock, Modes 3-7)
- /asm-forge: Assembly-level follow-up after identifying hot functions
- /bench-compare: Criterion before/after measurement
- /perf-regression: Full regression workflow with acceptance criteria
- /rust-perf-triage: Post-hoc analysis of collected perf data