amd-kernel-optimization/SKILL.md
Optimize inference latency and throughput of PyTorch models on AMD GPUs (MI250/MI300/MI350) with ROCm. Use when profiling and optimizing GEMM, attention, elementwise ops, torch.compile, CUDAGraphs, or Triton kernels on AMD hardware. Covers the full optimize cycle: benchmark → profile → analyze → implement → verify. Also covers benchmarking methodology and common pitfalls that waste time.
npx skillsauth add arist12/amd-skills amd-kernel-optimization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
torch.compile FIRST. torch.compile(mode="default") with correct inductor config gives 2-5x speedup. Get this working before any manual optimization. Any code change that breaks compile is a net regression.
Profile before optimizing. Never guess where time is spent. Run torch.profiler, classify GPU time into GEMM / attention / elementwise / launch overhead, then optimize the largest category.
Measure after every change. Benchmark with proper warmup and iterations (see below). Revert if performance regresses.
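
The classification step in the profiling principle can be sketched as a small helper. This is a minimal sketch, not the skill's own tooling: the substring patterns, `classify`, and `gpu_time_breakdown` are hypothetical, and real kernel names vary across ROCm and library versions. It assumes you have already extracted `(kernel_name, microseconds)` pairs from a profiler run (for example from `torch.profiler`). Launch overhead does not show up as kernel time, so it is not covered here.

```python
from collections import defaultdict

# Hypothetical substring heuristics; adjust them to the kernel names you actually see.
CATEGORY_PATTERNS = {
    "gemm": ("gemm", "cijk_", "matmul", "hipblaslt"),
    "attention": ("attn", "attention", "fmha", "mha_"),
    "elementwise": ("elementwise", "vectorized", "triton_poi", "triton_red"),
}


def classify(kernel_name):
    """Return the first category whose pattern occurs in the kernel name, else 'other'."""
    name = kernel_name.lower()
    for category, patterns in CATEGORY_PATTERNS.items():
        if any(p in name for p in patterns):
            return category
    return "other"


def gpu_time_breakdown(kernels):
    """kernels: iterable of (name, microseconds).

    Returns [(category, total_us, share_of_total)] sorted by total time, largest first.
    """
    totals = defaultdict(float)
    for name, us in kernels:
        totals[classify(name)] += us
    grand_total = sum(totals.values()) or 1.0  # avoid division by zero on empty input
    rows = [(cat, t, t / grand_total) for cat, t in totals.items()]
    return sorted(rows, key=lambda row: row[1], reverse=True)
```

The first row of the result is the category to optimize first. Anything that lands in `other` with a large share is a sign the patterns need extending.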
NEVER reduce warmup/iterations to "save time": you get garbage numbers.

    import torch

    start = torch.cuda.Event(enable_timing=True)
    end = torch.cuda.Event(enable_timing=True)
    start.record()
    result = model(input)
    end.record()
    torch.cuda.synchronize()  # wait for both events to complete before reading the timer
    ms = start.elapsed_time(end)
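
The event snippet above times a single call. A warmup-plus-iterations loop around it might look like the sketch below. The `benchmark` function and its default counts are assumptions for illustration (the source does not state specific numbers here), and it uses a host timer with an explicit `sync` hook rather than CUDA events so that it runs anywhere.

```python
import statistics
import time


def benchmark(fn, warmup=10, iterations=100, sync=lambda: None):
    """Time fn() after a warmup phase.

    sync must block until all queued device work has finished; on a GPU that
    would be torch.cuda.synchronize. The default is a no-op for CPU-only code.
    The warmup/iterations defaults are illustrative, not recommendations from the source.
    """
    for _ in range(warmup):
        fn()  # untimed: lets compilation, autotuning, and caches settle
    sync()

    samples_ms = []
    for _ in range(iterations):
        sync()  # make sure earlier work does not leak into this sample
        t0 = time.perf_counter()
        fn()
        sync()  # wait for the work launched by fn() before stopping the clock
        samples_ms.append((time.perf_counter() - t0) * 1e3)

    return {
        "median_ms": statistics.median(samples_ms),
        "min_ms": min(samples_ms),
        "n": len(samples_ms),
    }
```

On a GPU the call would look like `benchmark(lambda: model(input), sync=torch.cuda.synchronize)`. Reporting the median rather than the mean keeps a few slow outlier iterations from dominating the result.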
Each level builds on the previous. Do NOT skip to Level 3+ without Level 2.
- `GPU_MAX_HW_QUEUES=2`, `HIP_FORCE_DEV_KERNARG=1`, `HSA_NO_SCRATCH_RECLAIM=1`, `AMD_LOG_LEVEL=0`
- `sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'`
- `PYTORCH_TUNABLEOP_ENABLED=1`, `TORCH_BLAS_PREFER_HIPBLASLT=1`
- `torch.set_float32_matmul_precision('high')`
- `env | grep -iE 'TORCH|INDUCTOR|AUTOTUNE'`: unset `TORCHINDUCTOR_MAX_AUTOTUNE` if present
- `torch.compile(mode="default")`. Details: references/torch-compile-and-graphs.md
- `TORCH_LOGS="graph_breaks" python3 ...`
- `repeat_kv`, cache unchanged outputs
- `TORCH_LOGS="graph_breaks"`: verify no new breaks after each change
- `coordinate_descent_tuning`, `benchmark_kernel`, `freezing` (increase compile time, improve steady-state)
- `torch.ops.aiter.*` (compile-safe) not Python wrappers
- `@torch.compiler.disable` as last resort

GEMM:

| Option | Notes |
|---|---|
| rocBLAS (default) | Vendor BLAS; generally well-tuned |
| hipBLASLt | Fused epilogues; may beat rocBLAS for some shapes |
| aiter tuned GEMM | Auto-dispatches best kernel per (M,N,K) from tuned configs |
| FP8 GEMM (MI300+) | gemm_a8w8 via aiter; gfx942=e4m3fnuz, gfx950=e4m3fn |
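
The FP8 row pairs each architecture with a different e4m3 variant. A minimal sketch of that lookup follows. `FP8_DTYPE_BY_ARCH` and `fp8_dtype_name` are hypothetical names, and the input is assumed to be an architecture string of the form `gfx942:sramecc+:xnack-`, as ROCm builds of PyTorch typically report via `gcnArchName` on the device properties.

```python
# Mapping taken from the table above: gfx942 uses e4m3fnuz, gfx950 uses e4m3fn.
FP8_DTYPE_BY_ARCH = {
    "gfx942": "float8_e4m3fnuz",
    "gfx950": "float8_e4m3fn",
}


def fp8_dtype_name(gcn_arch_name):
    """Return the torch dtype attribute name for this architecture, or None.

    Feature suffixes after the first ':' (for example ':sramecc+:xnack-') are ignored.
    None means the table above lists no FP8 format for that architecture.
    """
    base_arch = gcn_arch_name.split(":", 1)[0]
    return FP8_DTYPE_BY_ARCH.get(base_arch)
```

With PyTorch available, `getattr(torch, fp8_dtype_name(arch))` would turn the name into the dtype object. Returning `None` for unknown architectures makes it easy to skip the FP8 path instead of quantizing to the wrong format.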

Attention:

| Option | Notes |
|---|---|
| aiter flash attention | torch.ops.aiter.mha_fwd.default(...) — compile-friendly, GQA native |
| SDPA | F.scaled_dot_product_attention(...) — good for KV-cache decode |
| Manual bmm+softmax+bmm | Slowest; replace with SDPA |

Compilation and graphs:

| Option | Notes |
|---|---|
| torch.compile(mode="default") | Start here. Stable on ROCm with correct inductor config |
| Manual CUDAGraph capture | Wrap full inference in one graph; needs Dynamo RNG patch |
| reduce-overhead / max-autotune | Avoid on ROCm unless you have verified stability |
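
Because max-autotune is discouraged on ROCm and the checklist asks you to unset `TORCHINDUCTOR_MAX_AUTOTUNE`, a pre-flight step in Python can apply the listed environment settings and report what remains. This is a sketch under assumptions: `RECOMMENDED_ENV` only collects the values listed in this skill, `preflight` is a hypothetical helper, and it is assumed to run before `torch` is imported so the settings are visible when the runtime initializes.

```python
import os
import re

# Values copied from the environment checklist in this skill.
RECOMMENDED_ENV = {
    "GPU_MAX_HW_QUEUES": "2",
    "HIP_FORCE_DEV_KERNARG": "1",
    "HSA_NO_SCRATCH_RECLAIM": "1",
    "AMD_LOG_LEVEL": "0",
    "PYTORCH_TUNABLEOP_ENABLED": "1",
    "TORCH_BLAS_PREFER_HIPBLASLT": "1",
}

# Same pattern as: env | grep -iE 'TORCH|INDUCTOR|AUTOTUNE'
REVIEW_PATTERN = re.compile(r"TORCH|INDUCTOR|AUTOTUNE", re.IGNORECASE)


def preflight(env):
    """Apply the recommended settings to a dict-like env and return entries to review.

    Removes TORCHINDUCTOR_MAX_AUTOTUNE if present, then returns every
    'KEY=VALUE' string that matches the review pattern, sorted by name.
    """
    env.pop("TORCHINDUCTOR_MAX_AUTOTUNE", None)
    for key, value in RECOMMENDED_ENV.items():
        env[key] = value
    return sorted(f"{k}={v}" for k, v in env.items() if REVIEW_PATTERN.search(f"{k}={v}"))


if __name__ == "__main__":
    for entry in preflight(os.environ):
        print(entry)
```

Passing a plain dict instead of `os.environ` lets you inspect the result without touching the real process environment.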

- `python -c "import transformers; print(transformers.__file__)"` and edit directly.
- `WARMUP=0 ITERATIONS=1` gives meaningless numbers. Optimize the code, not the test.
- `gemm_a16w16` silently falls back to plain `F.linear`: no error, no crash, no benefit. Diagnose with `AITER_LOG_TUNED_CONFIG=1`; if you see "using torch solution:0", generate configs via `AITER_TUNE_GEMM=1` + aiter's GemmTuner, or fall back to `PYTORCH_TUNABLEOP_ENABLED=1`. See references/gemm-and-linear.md for the full workflow.

Read as needed for implementation details: