auto-benchmark/SKILL.md
Run AI-driven benchmark searches on AMD ROCm with tiered server-flag sweeps for vLLM/SGLang, canonical dataset preparation, SLA or fixed-QPS benchmarking, CSV export, and resume. Adapted from SGLang auto-benchmark for MI355X (gfx950) / MI300X (gfx942) on ROCm 7.x. Use when the user wants an automated benchmark workflow on AMD GPUs rather than a one-off bench_serving command. Integrates with amdpilot executor task_spec_json for resource-aware launch.
This skill is for repeatable, AI-driven performance tuning of vLLM and SGLang on AMD Instinct GPUs.
Adapted from the upstream SGLang sglang-auto-benchmark skill. All CUDA-specific references have
been replaced with ROCm equivalents; attention backends, environment variables, resource classes,
and Docker image selection reflect the AMD MI355X (gfx950) and MI300X (gfx942) ecosystem.
- GPU architecture: run `rocminfo | grep gfx` to verify gfx950 (MI355X) or gfx942 (MI300X).
- ROCm version: run `cat /opt/rocm/.info/version`; this skill targets ROCm 7.x (7.0 or 7.2).
- Docker image: use a ROCm image (`rocm/sgl-dev` or `rocm/vllm-dev`). Use the env-probe skill first if unsure.
- Storage: local `/data/` is preferred over NFS `/mnt/dcgpuval/` for I/O speed.
- SLA targets: `max_ttft_ms` / `max_tpot_ms`.

If any precondition is not met, fix it before running a large search.
If the user wants the best command for a real production or real workload scenario, the benchmark
must use their real request distribution — real prompt lengths, output lengths, multi-turn patterns,
and sampling settings. sharegpt, random, and generated-shared-prefix are useful for sanity
checks, but they are not a substitute for real traffic.
On ROCm, the available attention backends differ from NVIDIA:
| Backend | Description | Notes |
|---------|-------------|-------|
| aiter | ROCm-native unified attention kernel library | Primary AMD backend, best for MI355X/MI300X |
| triton | Cross-platform Triton kernels | Works on ROCm, good fallback |
| torch_native | PyTorch native SDPA | Baseline, no AMD-specific optimization |
Do NOT use NVIDIA-specific backends: fa3, fa4, flashinfer, flashmla, trtllm_*, cutlass_*.
These will fail or silently produce wrong results on ROCm.
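A launcher-side guard can reject the NVIDIA-only names before a sweep starts. The sketch below is illustrative only: `check_attention_backend` is a hypothetical helper, not part of SGLang or vLLM, and its allow and deny lists simply restate the table and warning above.

```python
from fnmatch import fnmatchcase

# Backends listed in the table above as usable on ROCm.
ROCM_BACKENDS = {"aiter", "triton", "torch_native"}

# NVIDIA-specific names and wildcard patterns from the warning above.
NVIDIA_ONLY_PATTERNS = ["fa3", "fa4", "flashinfer", "flashmla", "trtllm_*", "cutlass_*"]


def check_attention_backend(name: str) -> str:
    """Return the backend name if it is listed for ROCm, else raise ValueError."""
    if any(fnmatchcase(name, pattern) for pattern in NVIDIA_ONLY_PATTERNS):
        raise ValueError(f"{name!r} is NVIDIA-specific; use aiter or triton on ROCm")
    if name not in ROCM_BACKENDS:
        raise ValueError(f"{name!r} is not in the ROCm backend table")
    return name
```

Running this over every value in an attention-backend search list before launching any server fails fast instead of wasting a benchmark slot.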
- aiter requires `heads_per_gpu % 16 == 0`. For models with 64 heads, TP must be ≤ 4 (giving 16 heads/GPU).
- FP8 prefill on gfx950: set `SGLANG_AITER_FP8_PREFILL_ATTN=1`.
- MLA persist design: set `SGLANG_AITER_MLA_PERSIST=1`.

Replace CUDA environment variables with ROCm equivalents:
| CUDA (do not use) | ROCm equivalent | Purpose |
|--------------------|-----------------|---------|
| CUDA_VISIBLE_DEVICES | HIP_VISIBLE_DEVICES | GPU device selection |
| CUDA_LAUNCH_BLOCKING | HIP_LAUNCH_BLOCKING | Synchronous kernel launch for debugging |
| — | SGLANG_USE_AITER=1 | Explicitly enable aiter backend |
| — | SGLANG_AITER_MLA_PERSIST=1 | Enable MLA persist design |
| — | SGLANG_AITER_FP8_PREFILL_ATTN=1 | FP8 prefill on gfx950 |
| — | HSA_FORCE_FINE_GRAIN_PCIE=1 | Fine-grain PCIe for host-device transfers |
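When porting a CUDA-oriented launch script, the renames in the table can be applied mechanically. This is a minimal sketch under stated assumptions: `to_rocm_env` is a hypothetical helper, and the rule that an existing HIP variable wins over its CUDA counterpart is a choice made here, not something the skill specifies.

```python
# CUDA variable -> ROCm variable, taken from the table above.
CUDA_TO_ROCM = {
    "CUDA_VISIBLE_DEVICES": "HIP_VISIBLE_DEVICES",
    "CUDA_LAUNCH_BLOCKING": "HIP_LAUNCH_BLOCKING",
}


def to_rocm_env(env: dict, use_aiter: bool = True) -> dict:
    """Return a copy of env with CUDA variables renamed to ROCm equivalents."""
    # Keep everything that is not a CUDA-only variable.
    out = {key: value for key, value in env.items() if key not in CUDA_TO_ROCM}
    # Carry CUDA values over, unless the ROCm variable is already set (assumption).
    for cuda_key, rocm_key in CUDA_TO_ROCM.items():
        if cuda_key in env:
            out.setdefault(rocm_key, env[cuda_key])
    if use_aiter:
        # Explicitly enable the aiter backend, as listed in the table.
        out.setdefault("SGLANG_USE_AITER", "1")
    return out
```

The result can be passed as the `env` argument of `subprocess.Popen` when starting the server.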
Our node has 8× AMD Instinct MI355X (gfx950), ROCm 7.2.0.
| Resource class | GPU count | Typical use case |
|----------------|-----------|------------------|
| single-gpu | 1 | Small models (≤8B), quick sanity checks |
| multi-gpu | 4 | Medium models (32B–70B), TP=4 |
| full-node | 8 | Large models (70B+), TP=8 or TP=4×DP=2 |
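Picking a TP size within one of these resource classes still has to respect the aiter constraint `heads_per_gpu % 16 == 0`. The following sketch enumerates the TP sizes that satisfy it. `valid_tp_sizes` is a hypothetical helper, and restricting candidates to powers of two is an assumption made for illustration.

```python
def valid_tp_sizes(num_heads: int, gpu_count: int) -> list:
    """TP sizes up to gpu_count that leave each GPU a multiple of 16 heads."""
    sizes = []
    tp = 1
    while tp <= gpu_count:  # assumption: only power-of-two TP sizes are tried
        if num_heads % tp == 0 and (num_heads // tp) % 16 == 0:
            sizes.append(tp)
        tp *= 2
    return sizes


# A 64-head model on a full node: TP=8 would give 8 heads/GPU, so it is excluded.
print(valid_tp_sizes(64, 8))  # -> [1, 2, 4]
```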
When server.parallel is used and dp_size is not set explicitly:
dp_size = visible_gpus / (tp_size * pp_size)
Visible GPU count is inferred from HIP_VISIBLE_DEVICES, or from server.parallel.gpu_count.
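The inference rule can be sketched as follows. The function names are hypothetical, and two details are assumptions rather than documented behavior: `HIP_VISIBLE_DEVICES` is checked before `gpu_count`, and a GPU count that does not divide evenly is treated as an error.

```python
import os


def visible_gpu_count(gpu_count=None, env=None) -> int:
    """Count visible GPUs from HIP_VISIBLE_DEVICES, falling back to gpu_count."""
    env = os.environ if env is None else env
    devices = env.get("HIP_VISIBLE_DEVICES", "").strip()
    if devices:
        return len([d for d in devices.split(",") if d.strip()])
    if gpu_count is not None:
        return gpu_count
    raise ValueError("set HIP_VISIBLE_DEVICES or server.parallel.gpu_count")


def infer_dp_size(visible_gpus: int, tp_size: int, pp_size: int = 1) -> int:
    """dp_size = visible_gpus / (tp_size * pp_size), rejecting uneven splits."""
    per_replica = tp_size * pp_size
    if visible_gpus % per_replica != 0:
        raise ValueError(f"{visible_gpus} GPUs do not split into replicas of {per_replica}")
    return visible_gpus // per_replica
```

For example, a full node with `tp_size=4` and `pp_size=1` yields `dp_size=2`, which matches the TP=4×DP=2 layout in the resource class table.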
Same as upstream:
- sharegpt: auto-download supported, converted to canonical JSONL.
- custom: old bench_serving format or canonical autobench JSONL.
- random: synthetic/random benchmark path.
- generated-shared-prefix: shared-prefix synthetic generator.

Identical to upstream. JSONL, one request per line:
{"prompt": "Write a summary.", "output_len": 256}
{"prompt": [{"role": "user", "content": "Summarize."}], "output_len": 256}
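Real traffic logs can be converted into this format with a few lines of Python. The sketch below covers only the two fields shown in the examples above (`prompt` and `output_len`); any other fields the canonical format may accept are not covered, and `to_canonical_line` is a hypothetical helper.

```python
import json


def to_canonical_line(prompt, output_len: int) -> str:
    """Serialize one request as a single canonical JSONL line."""
    if isinstance(prompt, list):
        # Chat form: every message needs a role and content.
        for message in prompt:
            if not {"role", "content"} <= set(message):
                raise ValueError("chat messages need 'role' and 'content'")
    elif not isinstance(prompt, str):
        raise TypeError("prompt must be a string or a list of chat messages")
    if not isinstance(output_len, int) or output_len <= 0:
        raise ValueError("output_len must be a positive integer")
    return json.dumps({"prompt": prompt, "output_len": output_len}, ensure_ascii=False)


def write_dataset(requests, path: str) -> None:
    """Write (prompt, output_len) pairs to path, one JSON object per line."""
    with open(path, "w", encoding="utf-8") as handle:
        for prompt, output_len in requests:
            handle.write(to_canonical_line(prompt, output_len) + "\n")
```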
YAML key order matters. Put the most important search keys first.
server.base_flags and server.search_space are passed to the server launcher. Any valid
vLLM/SGLang server CLI flag can be set or searched.
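To illustrate how key order shapes a sweep, the sketch below expands a search space into candidate flag sets and renders each as CLI arguments. It is a naive full grid, not the tiered search the tool actually performs; both function names are hypothetical, and the underscore-to-dash conversion is an assumption about how the launcher maps config keys to flags.

```python
from itertools import product


def expand_search_space(base_flags: dict, search_space: dict):
    """Yield one flag dict per combination; the first key varies slowest."""
    keys = list(search_space)
    for combo in product(*(search_space[key] for key in keys)):
        yield {**base_flags, **dict(zip(keys, combo))}


def to_cli_args(flags: dict) -> list:
    """Render flags as CLI arguments, e.g. attention_backend -> --attention-backend."""
    args = []
    for name, value in flags.items():
        option = "--" + name.replace("_", "-")
        if value is True:
            args.append(option)  # boolean switch, no value
        elif value is False or value is None:
            continue  # switch left off
        else:
            args.extend([option, str(value)])
    return args
```

With `attention_backend` listed before `mem_fraction_static`, all memory fractions are tried for one backend before the next backend is touched.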
- attention_backend: search [aiter, triton]
- prefill_attention_backend: if split prefill/decode is supported
- decode_attention_backend: if split prefill/decode is supported
- sampling_backend
- max_running_requests
- chunked_prefill_size: common values are [4096, 8192, 16384, 131072]
- prefill_max_requests
- max_prefill_tokens
- schedule_policy: [lpm, fcfs]
- schedule_conservativeness
- num_continuous_decode_steps
- mem_fraction_static: critical for ROCm; MI355X HBM3e is larger than H100, so ranges differ. Typical search: [0.80, 0.85, 0.88, 0.90]
- max_total_tokens
- page_size
- disable_radix_cache
- kv_cache_dtype: [auto, fp8_e4m3] for FP8 KV cache on MI355X
- tp_size: must respect aiter head constraints (heads_per_gpu % 16 == 0)
- pp_size
- dp_size
- load_balance_method
- enable_dp_attention
- enable_aiter_allreduce_fusion: AMD-specific distributed optimization
- cuda_graph_max_bs: flag name is unchanged in vLLM/SGLang even on ROCm
- disable_cuda_graph_padding

Do not put disable_cuda_graph into the default search space.

Speculative decoding support on ROCm may be limited. Verify availability before enabling:

- speculative_num_steps
- speculative_eagle_topk
- speculative_num_draft_tokens

Order: always tune the non-speculative base server first, then optionally add speculative search.
Never start by tuning EAGLE first. Use this order:
python3 -m sglang.auto_benchmark convert \
--kind sharegpt \
--tokenizer /data/meta-llama/Meta-Llama-3.1-70B-Instruct \
--num-prompts 1200 \
--output /tmp/sharegpt.autobench.jsonl
python3 -m sglang.auto_benchmark run --config /path/to/config.yaml
- results.jsonl
- results.csv

When running benchmarks through the amdpilot executor, the benchmark task spec maps to
task_spec_json in the queue DB:
{
"gpu_count": 8,
"resource_class": "full-node",
"base_image": "rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317",
"gpu_free_ids": [0, 1, 2, 3, 4, 5, 6, 7],
"disk_free_gb": 2100,
"timeout_minutes": 120
}
The dashboard GET /api/{job_name}/system_info endpoint reads from this column to display
experiment runtime info (GPU arch, ROCm version, base image, resource class).
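Before queueing, a task spec like the one above can be sanity-checked against the resource class table. This is a sketch only: `check_task_spec` is a hypothetical helper, and the rules it applies (GPU count must match the class, and enough free GPU IDs must be listed) are assumptions about what a reasonable check looks like, not the executor's actual validation.

```python
# GPU counts per resource class, from the resource class table in this skill.
RESOURCE_CLASS_GPUS = {"single-gpu": 1, "multi-gpu": 4, "full-node": 8}


def check_task_spec(spec: dict) -> list:
    """Return a list of human-readable problems; an empty list means no issue found."""
    problems = []
    resource_class = spec.get("resource_class")
    gpu_count = spec.get("gpu_count", 0)
    expected = RESOURCE_CLASS_GPUS.get(resource_class)
    if expected is None:
        problems.append(f"unknown resource_class: {resource_class!r}")
    elif gpu_count != expected:
        problems.append(f"{resource_class} expects {expected} GPUs, got {gpu_count}")
    if len(spec.get("gpu_free_ids", [])) < gpu_count:
        problems.append("fewer free GPU IDs than gpu_count")
    if spec.get("timeout_minutes", 0) <= 0:
        problems.append("timeout_minutes must be positive")
    return problems
```

Loading the JSON with `json.loads` and passing the resulting dict to this function returns an empty list for the example spec above.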
Benchmark results should be written as structured artifacts so the dashboard can render them and feed them into the data flywheel for downstream LoRA/SFT training signal.
Use the reference configs in references/:
| Config | Model | GPUs | Notes |
|--------|-------|------|-------|
| config-example-rocm.yaml | Generic | 4 | Starting template |
| llama3.1-70b-mi355x.yaml | Llama 3.1 70B Instruct | 8 | Full-node TP=8 |
| qwen3-32b-mi355x.yaml | Qwen3 32B | 4 | TP=4, aiter search |
| deepseek-r1-mi355x.yaml | DeepSeek R1 671B (FP8/MXFP4) | 8 | FP8 KV cache, AllReduce fusion |
After a run, summarize:
- results.jsonl, results.csv, server logs

| Aspect | CUDA / NVIDIA | ROCm / AMD |
|--------|---------------|------------|
| GPU visibility | CUDA_VISIBLE_DEVICES | HIP_VISIBLE_DEVICES |
| Attention backends | fa3, flashinfer | aiter, triton |
| Graph runtime | CUDA graph | HIP graph (flag names unchanged — cuda_graph_max_bs, disable_cuda_graph etc. still use "cuda" prefix in vLLM/SGLang CLI even on ROCm) |
| GPU query | nvidia-smi | rocm-smi --showproductname |
| Arch detection | N/A | rocminfo \| grep gfx |
| AllReduce fusion | N/A | --enable-aiter-allreduce-fusion |
| FP8 prefill | Built-in | SGLANG_AITER_FP8_PREFILL_ATTN=1 |
| MLA persist | Built-in | SGLANG_AITER_MLA_PERSIST=1 |
| Memory range | 0.85–0.92 typical | 0.80–0.90 typical (HBM3e larger) |
| Docker images | nvcr.io/nvidia/* | rocm/sgl-dev, rocm/vllm-dev |