
amdpilot-org/auto-benchmark-rocm

Name: auto-benchmark-rocm
Author: amdpilot-org

auto-benchmark/SKILL.md

npx skillsauth add amdpilot-org/amd-skills auto-benchmark-rocm


Auto Benchmark — AMD ROCm / MI355X

This skill is for repeatable, AI-driven performance tuning of vLLM and SGLang on AMD Instinct GPUs.

Adapted from the upstream SGLang sglang-auto-benchmark skill. All CUDA-specific references have been replaced with ROCm equivalents; attention backends, environment variables, resource classes, and Docker image selection reflect the AMD MI355X (gfx950) and MI300X (gfx942) ecosystem.

Preconditions

- GPU architecture confirmed: run rocminfo | grep gfx to verify gfx950 (MI355X) or gfx942 (MI300X).
- ROCm version checked: cat /opt/rocm/.info/version. This skill targets ROCm 7.x (7.0 or 7.2).
- Docker image identified: know which base image you are in (rocm/sgl-dev or rocm/vllm-dev). Use the env-probe skill first if unsure.
- Server can launch: vLLM or SGLang can already start and serve the target model in this container.
- Model path exists: a local path under /data/ is preferred over NFS /mnt/dcgpuval/ for I/O speed.
- Goal is clear:
  - benchmark a fixed QPS list, or
  - search for the maximum QPS that satisfies max_ttft_ms / max_tpot_ms.

If any precondition is not met, fix it before running a large search.
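The first two preconditions can be checked mechanically. The sketch below is only an illustration: the helper names are hypothetical, and it assumes you have already captured the text of rocminfo and of /opt/rocm/.info/version yourself.

```python
import re

# gfx targets named in this skill; anything else is treated as unsupported here.
SUPPORTED_ARCHES = {"gfx950": "MI355X", "gfx942": "MI300X"}


def detect_gfx_arches(rocminfo_output: str) -> set:
    """Return the set of gfx targets mentioned in captured `rocminfo` output."""
    return set(re.findall(r"\bgfx[0-9a-f]+\b", rocminfo_output))


def check_preconditions(rocminfo_output: str, rocm_version: str) -> list:
    """Return human-readable problems; an empty list means both checks passed."""
    problems = []
    arches = detect_gfx_arches(rocminfo_output)
    if not arches.intersection(SUPPORTED_ARCHES):
        problems.append(f"no supported GPU arch found (saw {sorted(arches)})")
    # Assumption: the version file starts with the dotted version, e.g. "7.2.0".
    if not rocm_version.strip().startswith("7."):
        problems.append(f"ROCm {rocm_version.strip()!r} is not 7.x")
    return problems
```

Calling check_preconditions with the two captured strings before a large search gives a quick go/no-go signal; the remaining preconditions (server launch, model path, goal) still need a manual check.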

Most Important Rule

If the user wants the best command for a real production or real workload scenario, the benchmark must use their real request distribution — real prompt lengths, output lengths, multi-turn patterns, and sampling settings. sharegpt, random, and generated-shared-prefix are useful for sanity checks, but they are not a substitute for real traffic.

AMD-Specific Attention Backends

On ROCm, the available attention backends differ from NVIDIA:

| Backend | Description | Notes |
|---------|-------------|-------|
| aiter | ROCm-native unified attention kernel library | Primary AMD backend, best for MI355X/MI300X |
| triton | Cross-platform Triton kernels | Works on ROCm, good fallback |
| torch_native | PyTorch native SDPA | Baseline, no AMD-specific optimization |

Do NOT use NVIDIA-specific backends: fa3, fa4, flashinfer, flashmla, trtllm_*, cutlass_*. These will fail or silently produce wrong results on ROCm.
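A search space can be screened for these names before any server is launched. This is a minimal sketch, not part of the benchmark tool: check_attention_backend is a hypothetical helper, and the two lists simply restate the names given above.

```python
from fnmatch import fnmatchcase

# Backends listed as available on ROCm in the table above.
ROCM_BACKENDS = ("aiter", "triton", "torch_native")
# NVIDIA-specific names and glob patterns listed above.
NVIDIA_ONLY_PATTERNS = ("fa3", "fa4", "flashinfer", "flashmla", "trtllm_*", "cutlass_*")


def check_attention_backend(name: str) -> None:
    """Raise ValueError if `name` is NVIDIA-specific or not a known ROCm backend."""
    if any(fnmatchcase(name, pattern) for pattern in NVIDIA_ONLY_PATTERNS):
        raise ValueError(f"{name!r} is NVIDIA-specific; use one of {ROCM_BACKENDS}")
    if name not in ROCM_BACKENDS:
        raise ValueError(f"{name!r} is not a known ROCm backend: {ROCM_BACKENDS}")
```

Rejecting a bad name up front is cheaper than discovering it from a failed launch or, worse, from silently wrong results.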

aiter-Specific Constraints

- MLA (Multi-Head Latent Attention): aiter ASM kernels require heads_per_gpu % 16 == 0. For models with 64 heads, TP must be ≤ 4: TP=4 gives 16 heads per GPU, while TP=8 would give 8 and violate the constraint.
- FP8 prefill attention on gfx950: enable with SGLANG_AITER_FP8_PREFILL_ATTN=1.
- MLA persist design for FP8 KV cache: enable with SGLANG_AITER_MLA_PERSIST=1.
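The head-count rule is plain arithmetic, so the admissible TP sizes can be enumerated before building a search space. The helper below is a hypothetical illustration; it assumes heads are split evenly across TP ranks.

```python
def valid_tp_sizes(num_heads: int, candidates=(1, 2, 4, 8)) -> list:
    """Return the TP sizes for which heads_per_gpu is a multiple of 16.

    Assumes attention heads are divided evenly across tensor-parallel ranks.
    """
    valid = []
    for tp_size in candidates:
        if num_heads % tp_size != 0:
            continue  # heads cannot be split evenly at this TP size
        heads_per_gpu = num_heads // tp_size
        if heads_per_gpu % 16 == 0:
            valid.append(tp_size)
    return valid


print(valid_tp_sizes(64))   # [1, 2, 4]
print(valid_tp_sizes(128))  # [1, 2, 4, 8]
```

For a 64-head model this reproduces the TP ≤ 4 limit stated above.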

Environment Variables

Replace CUDA environment variables with ROCm equivalents:

| CUDA (do not use) | ROCm equivalent | Purpose |
|--------------------|-----------------|---------|
| CUDA_VISIBLE_DEVICES | HIP_VISIBLE_DEVICES | GPU device selection |
| CUDA_LAUNCH_BLOCKING | HIP_LAUNCH_BLOCKING | Synchronous kernel launch for debugging |
| (none) | SGLANG_USE_AITER=1 | Explicitly enable aiter backend |
| (none) | SGLANG_AITER_MLA_PERSIST=1 | Enable MLA persist design |
| (none) | SGLANG_AITER_FP8_PREFILL_ATTN=1 | FP8 prefill on gfx950 |
| (none) | HSA_FORCE_FINE_GRAIN_PCIE=1 | Fine-grain PCIe for host-device transfers |
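When porting a launch script, the two renames in the table can be applied to an environment mapping programmatically. This is a sketch under one assumption that the source does not state: if both the CUDA name and its ROCm equivalent are set, the ROCm value wins. The function name is hypothetical.

```python
# The two CUDA -> ROCm renames from the table above.
CUDA_TO_ROCM = {
    "CUDA_VISIBLE_DEVICES": "HIP_VISIBLE_DEVICES",
    "CUDA_LAUNCH_BLOCKING": "HIP_LAUNCH_BLOCKING",
}


def to_rocm_env(env: dict) -> dict:
    """Return a copy of `env` with CUDA variable names replaced by ROCm names."""
    translated = {CUDA_TO_ROCM[k]: v for k, v in env.items() if k in CUDA_TO_ROCM}
    native = {k: v for k, v in env.items() if k not in CUDA_TO_ROCM}
    # Assumption: explicitly set ROCm variables take precedence over translated ones.
    return {**translated, **native}
```

The result can be passed as the env argument of subprocess.Popen when starting the server.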

GPU Topology and Resource Classes

Our node has 8× AMD Instinct MI355X (gfx950), ROCm 7.2.0.

| Resource class | GPU count | Typical use case |
|----------------|-----------|------------------|
| single-gpu | 1 | Small models (≤8B), quick sanity checks |
| multi-gpu | 4 | Medium models (32B–70B), TP=4 |
| full-node | 8 | Large models (70B+), TP=8 or TP=4×DP=2 |

When server.parallel is used and dp_size is not set explicitly:

dp_size = visible_gpus / (tp_size * pp_size)

Visible GPU count is inferred from HIP_VISIBLE_DEVICES, or from server.parallel.gpu_count.
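The inference rule above can be written out as follows. This is an illustrative sketch, not the tool's code: the function names are hypothetical, and it assumes HIP_VISIBLE_DEVICES is consulted first with server.parallel.gpu_count as the fallback, an ordering the text does not specify.

```python
from typing import Optional


def visible_gpu_count(hip_visible_devices: Optional[str],
                      gpu_count: Optional[int] = None) -> int:
    """Count visible GPUs from a HIP_VISIBLE_DEVICES string, else use gpu_count."""
    if hip_visible_devices:
        return len([d for d in hip_visible_devices.split(",") if d.strip()])
    if gpu_count is not None:
        return gpu_count
    raise ValueError("set HIP_VISIBLE_DEVICES or server.parallel.gpu_count")


def infer_dp_size(visible_gpus: int, tp_size: int, pp_size: int = 1) -> int:
    """dp_size = visible_gpus / (tp_size * pp_size), rejecting uneven splits."""
    per_replica = tp_size * pp_size
    if visible_gpus % per_replica != 0:
        raise ValueError(f"{visible_gpus} GPUs do not divide into replicas of {per_replica}")
    return visible_gpus // per_replica


print(infer_dp_size(visible_gpu_count("0,1,2,3,4,5,6,7"), tp_size=4))  # 2
```

On the 8-GPU node described above, TP=4 with PP=1 yields dp_size=2, which matches the TP=4×DP=2 layout in the resource-class table.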

Supported Dataset Kinds

Same as upstream:

- sharegpt: auto-download supported, converted to canonical JSONL.
- custom: old bench_serving format or canonical autobench JSONL.
- random: synthetic/random benchmark path.
- generated-shared-prefix: shared-prefix synthetic generator.

Canonical Dataset Format

Identical to upstream. JSONL, one request per line:

{"prompt": "Write a summary.", "output_len": 256}
{"prompt": [{"role": "user", "content": "Summarize."}], "output_len": 256}

Search Tiers

Tier 1: Fast sanity sweep. Baseline + small one-at-a-time scan.
Tier 2: Good default. Small cartesian on high-priority keys + expansion for rest.
Tier 3: Full cartesian product. Slowest, but thorough when space is bounded.

YAML key order matters. Put the most important search keys first.
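To see why the tiers differ so much in cost, compare how many candidates a one-at-a-time scan and a full cartesian product generate from the same search space. The sketch below illustrates the two enumeration strategies only; it is not the tool's actual search code, and the function names are hypothetical.

```python
from itertools import product


def one_at_a_time(base: dict, space: dict) -> list:
    """Baseline plus every config that changes exactly one key (Tier 1 style)."""
    candidates = [dict(base)]
    for key, values in space.items():
        for value in values:
            if base.get(key) != value:
                candidates.append({**base, key: value})
    return candidates


def full_cartesian(base: dict, space: dict) -> list:
    """Every combination of every searched value (Tier 3 style)."""
    keys = list(space)
    combos = product(*(space[key] for key in keys))
    return [{**base, **dict(zip(keys, combo))} for combo in combos]


base = {"attention_backend": "aiter", "mem_fraction_static": 0.85}
space = {
    "attention_backend": ["aiter", "triton"],
    "mem_fraction_static": [0.80, 0.85, 0.88, 0.90],
}
print(len(one_at_a_time(base, space)))   # 5
print(len(full_cartesian(base, space)))  # 8
```

The one-at-a-time count grows with the sum of the value lists, while the cartesian count grows with their product, so each added key multiplies the cost of a Tier 3 run.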

What Is Tunable (AMD/ROCm)

server.base_flags and server.search_space are passed to the server launcher. Any valid vLLM/SGLang server CLI flag can be set or searched.

Kernel / Backend (AMD-specific)

attention_backend — search [aiter, triton]
prefill_attention_backend — if split prefill/decode is supported
decode_attention_backend — if split prefill/decode is supported
sampling_backend

Batching / Scheduling

max_running_requests
chunked_prefill_size — common values: [4096, 8192, 16384, 131072]
prefill_max_requests
max_prefill_tokens
schedule_policy — [lpm, fcfs]
schedule_conservativeness
num_continuous_decode_steps

Memory / Cache

mem_fraction_static: critical on ROCm. The MI355X's HBM3E capacity is larger than the H100's HBM, so useful ranges differ. Typical search: [0.80, 0.85, 0.88, 0.90]
max_total_tokens
page_size
disable_radix_cache
kv_cache_dtype — [auto, fp8_e4m3] for FP8 KV cache on MI355X

Parallel / Distributed

tp_size — must respect aiter head constraints (heads_per_gpu % 16 == 0)
pp_size
dp_size
load_balance_method
enable_dp_attention
enable_aiter_allreduce_fusion — AMD-specific distributed optimization

Runtime / HIP Graph

- Keep HIP graph (the ROCm counterpart of CUDA graph) enabled by default for performance benchmarking.
- cuda_graph_max_bs: the flag name is unchanged in vLLM/SGLang, even on ROCm.
- disable_cuda_graph_padding
- Do not put disable_cuda_graph into the default search space.

Optional Speculative / EAGLE Stage

Speculative decoding support on ROCm may be limited. Verify availability before enabling:

speculative_num_steps
speculative_eagle_topk
speculative_num_draft_tokens

Order: always tune the non-speculative base server first, then optionally add speculative search.

Base Tuning Before EAGLE

Never start by tuning EAGLE first. Use this order:

1. Tune the non-speculative base server first.
2. Find the best normal config for the target dataset and SLA.
3. Only if the user explicitly asks and draft model assets exist, run speculative search.

Running The Workflow

Prepare dataset

python3 -m sglang.auto_benchmark convert \
  --kind sharegpt \
  --tokenizer /data/meta-llama/Meta-Llama-3.1-70B-Instruct \
  --num-prompts 1200 \
  --output /tmp/sharegpt.autobench.jsonl

Run from config

python3 -m sglang.auto_benchmark run --config /path/to/config.yaml

Outputs

Prepared canonical dataset JSONL
Per-run results.jsonl
Summary results.csv
Per-candidate server logs

Integration with amdpilot

When running benchmarks through the amdpilot executor, the benchmark task spec maps to task_spec_json in the queue DB:

{
  "gpu_count": 8,
  "resource_class": "full-node",
  "base_image": "rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317",
  "gpu_free_ids": [0, 1, 2, 3, 4, 5, 6, 7],
  "disk_free_gb": 2100,
  "timeout_minutes": 120
}

The dashboard GET /api/{job_name}/system_info endpoint reads from this column to display experiment runtime info (GPU arch, ROCm version, base image, resource class).
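A task spec can be cross-checked against the resource-class table before it is queued. This is a sketch, not amdpilot code: check_task_spec is a hypothetical helper, and both consistency rules (gpu_count must match the resource class, and gpu_free_ids must cover gpu_count) are assumptions.

```python
import json

# GPU counts per resource class, from the resource-class table in this skill.
RESOURCE_CLASS_GPUS = {"single-gpu": 1, "multi-gpu": 4, "full-node": 8}


def check_task_spec(spec: dict) -> list:
    """Return a list of consistency problems; an empty list means the spec looks sane."""
    problems = []
    resource_class = spec.get("resource_class")
    expected = RESOURCE_CLASS_GPUS.get(resource_class)
    if expected is None:
        problems.append(f"unknown resource_class {resource_class!r}")
    elif spec.get("gpu_count") != expected:
        problems.append(f"{resource_class} expects {expected} GPUs, got {spec.get('gpu_count')}")
    # Assumption: a job needs at least gpu_count free GPU ids.
    if len(spec.get("gpu_free_ids", [])) < spec.get("gpu_count", 0):
        problems.append("fewer free GPU ids than gpu_count")
    return problems


spec = json.loads("""
{
  "gpu_count": 8,
  "resource_class": "full-node",
  "base_image": "rocm/sgl-dev:v0.5.9-rocm720-mi35x-20260317",
  "gpu_free_ids": [0, 1, 2, 3, 4, 5, 6, 7],
  "disk_free_gb": 2100,
  "timeout_minutes": 120
}
""")
print(check_task_spec(spec))  # []
```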

Benchmark results should be written as structured artifacts so that the dashboard can render them and so that they can feed the data flywheel that supplies training signal for downstream LoRA/SFT.

Config Template

Use the reference configs in references/:

| Config | Model | GPUs | Notes |
|--------|-------|------|-------|
| config-example-rocm.yaml | Generic | 4 | Starting template |
| llama3.1-70b-mi355x.yaml | Llama 3.1 70B Instruct | 8 | Full-node TP=8 |
| qwen3-32b-mi355x.yaml | Qwen3 32B | 4 | TP=4, aiter search |
| deepseek-r1-mi355x.yaml | DeepSeek R1 671B (FP8/MXFP4) | 8 | FP8 KV cache, AllReduce fusion |

What To Report Back

After a run, summarize:

Hardware: GPU arch (gfx950/gfx942), ROCm version, Docker image tag
Search config: which tier, dataset kind (synthetic vs real)
Best config found: attention backend, TP/DP, key flags
Best QPS that satisfied SLA (or fixed QPS results)
Whether speculative tuning was skipped or run
Paths to artifacts: dataset JSONL, results.jsonl, results.csv, server logs
Anomalies: OOM events, HIP graph capture failures, aiter constraint violations
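The "best QPS that satisfied SLA" item reduces to a filter followed by a maximum. The sketch below works on in-memory records; the field names qps, ttft_ms, and tpot_ms are hypothetical and do not describe the actual layout of results.jsonl or results.csv, so map them to whatever columns your run produced.

```python
from typing import Iterable, Mapping, Optional


def best_qps_under_sla(results: Iterable[Mapping],
                       max_ttft_ms: float,
                       max_tpot_ms: float) -> Optional[Mapping]:
    """Return the passing record with the highest QPS, or None if none pass.

    Hypothetical record fields: "qps", "ttft_ms", "tpot_ms".
    """
    passing = [
        record for record in results
        if record["ttft_ms"] <= max_ttft_ms and record["tpot_ms"] <= max_tpot_ms
    ]
    if not passing:
        return None
    return max(passing, key=lambda record: record["qps"])
```

A None result is worth reporting explicitly, since it means no candidate met the SLA at any tested QPS.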

Differences From Upstream (CUDA) Skill

| Aspect | CUDA / NVIDIA | ROCm / AMD |
|--------|---------------|------------|
| GPU visibility | CUDA_VISIBLE_DEVICES | HIP_VISIBLE_DEVICES |
| Attention backends | fa3, flashinfer | aiter, triton |
| Graph runtime | CUDA graph | HIP graph (flag names unchanged: cuda_graph_max_bs, disable_cuda_graph, etc. still use the "cuda" prefix in the vLLM/SGLang CLI, even on ROCm) |
| GPU query | nvidia-smi | rocm-smi --showproductname |
| Arch detection | N/A | rocminfo \| grep gfx |
| AllReduce fusion | N/A | --enable-aiter-allreduce-fusion |
| FP8 prefill | Built-in | SGLANG_AITER_FP8_PREFILL_ATTN=1 |
| MLA persist | Built-in | SGLANG_AITER_MLA_PERSIST=1 |
| Memory range | 0.85–0.92 typical | 0.80–0.90 typical (HBM3e larger) |
| Docker images | nvcr.io/nvidia/* | rocm/sgl-dev, rocm/vllm-dev |


Run AI-driven benchmark searches on AMD ROCm with tiered server-flag sweeps for vLLM/SGLang, canonical dataset preparation, SLA or fixed-QPS benchmarking, CSV export, and resume. Adapted from SGLang auto-benchmark for MI355X (gfx950) / MI300X (gfx942) on ROCm 7.x. Use when the user wants an automated benchmark workflow on AMD GPUs rather than a one-off bench_serving command. Integrates with amdpilot executor task_spec_json for resource-aware launch.

1 star

data-ai

Updated Apr 25, 2026


npx skillsauth add amdpilot-org/amd-skills auto-benchmark-rocm

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean; the other six were skipped.

| Scanner | Purpose | Status |
|---------|---------|--------|
| Trivy | Container and dependency vulnerability scanner | Clean (95%) |
| Semgrep | Static code analysis for vulnerabilities | Clean (95%) |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean (95%) |
| Snyk (dep) | Open source security scanning | Skipped (50%) |
| Socket.dev | Supply chain security analysis | Skipped (50%) |
| VirusTotal | Multi-engine malware detection | Skipped (50%) |
| CrowdStrike | Advanced threat intelligence | Skipped (50%) |
| OSV-Scanner | Open Source Vulnerability database check | Skipped (50%) |
| OWASP Dep-Check | | Skipped (50%) |

Last scanned: Apr 25, 2026, 5:15 AM (60.2 s, 5 files scanned)



Install manually


# Clone the repo
git clone https://github.com/amdpilot-org/amd-skills.git

# Copy into Claude Code skills folder (global)
cp -r amd-skills/auto-benchmark ~/.claude/skills/


Repository: amdpilot-org/amd-skills (1 star)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT