
Add or update AMD/ROCm SGLang regression tests, choose the right MI325/MI35x CI suite, reproduce AMD CI locally with the upstream amd_ci container scripts, and bisect AMD CI regressions on main/nightly. Use when an agent-generated fix needs register_amd_ci coverage, when selecting MI325 vs MI35x runners, when debugging pr-test-amd or nightly AMD failures, or when documenting the register_amd_ci flow for ROCm-specific fixes.
Run AI-driven benchmark searches on AMD ROCm with tiered server-flag sweeps for vLLM/SGLang, canonical dataset preparation, SLA or fixed-QPS benchmarking, CSV export, and resume. Adapted from SGLang auto-benchmark for MI355X (gfx950) / MI300X (gfx942) on ROCm 7.x. Use when the user wants an automated benchmark workflow on AMD GPUs rather than a one-off bench_serving command. Integrates with amdpilot executor task_spec_json for resource-aware launch.
Inspect the AMD/ROCm Docker runtime environment before writing any code. Use BEFORE torch.compile, CUDAGraph capture, or any kernel optimization. Detects hidden framework defaults (inductor max_autotune, triton.cudagraphs), known Docker-specific bugs (hipBLASLt solver crash, FP8 flash attn), and missing packages. Outputs a CRITICAL/WARNING/INFO report with recommended fixes. Triggered by: starting work in an AMD Docker container, "check environment", "why is torch.compile hanging", "env probe", Phase 0 of any AMD optimization experiment.
FlyDSL is a Python DSL with an MLIR-native backend for authoring custom AMD GPU kernels with explicit layout algebra (pre-installed at /opt/FlyDSL on images tagged *-flydsl:*). Use this skill when profiling identifies a hot per-row reduction (RMSNorm / LayerNorm / softmax), a fused elementwise chain (norm + residual add, activation + multiplier), or an unusual-shape grouped GEMM that the standard AMD backends (Triton / aiter / CK / hipBLASLt / TransformerEngine) don't serve well. Essential for any workload where Python/config/Triton-tuning gains have plateaued and the profile shows a custom kernel opportunity. Covers the `/opt/FlyDSL` availability check, the integration playbook (dispatcher + direct site-packages edit + autograd-safe output handling), kernel authoring patterns (elementwise via layout API, block reductions via wave_reduce_add, fused dx+dw designs, MFMA GEMM preshuffle), torchrun gotchas, and the critical rule that custom kernels typically only win end-to-end when stacked with `torch.compile(mode="default")`.
Git commit message conventions using the Conventional Commits format.
Debug ROCm/HIP kernel crashes in SGLang and vLLM on AMD GPUs (MI300X/MI325X/MI355X). Adapts SGLang's @debug_kernel_api kernel boundary logging to ROCm: captures input tensors before crash, tracks shapes/dtypes/values, dumps crash artifacts for offline analysis. Integrates with amdpilot executor failure_reason field and dashboard trajectory viewer. Triggered by: CUDA/HIP errors, illegal memory access, device-side assert, OOM kills, signal 137/139, NaN/Inf in outputs, "debug crash", "why did the trial fail".
Profile AMD GPU kernels using rocprofv3 and analyze performance bottlenecks. Use when the user wants to profile HIP/ROCm kernels, identify GPU performance issues, analyze hardware counters, or understand why a kernel is slow on AMD GPUs (MI100, MI200, MI300 series). Provides wrapper scripts for rocprofv3 execution and automated parsing of profiler output into structured, agent-friendly JSON with bottleneck classification.
Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.
Optimize inference latency and throughput of PyTorch models on AMD GPUs (MI250/MI300/MI350) with ROCm. Use when profiling and optimizing GEMM, attention, elementwise ops, torch.compile, CUDAGraphs, or Triton kernels on AMD hardware. Covers the full optimization cycle: benchmark → profile → analyze → implement → verify. Also covers benchmarking methodology and common pitfalls that waste time.
Port NVIDIA CUDA codebases to AMD ROCm GPUs. Use when making PyTorch models run on AMD GPUs, replacing NVIDIA-specific libraries with AMD equivalents, fixing ROCm build/runtime failures, or porting C/C++ CUDA kernels to HIP. Also covers dependency debugging and environment setup on ROCm Docker images.
Behavioral guidelines to reduce common LLM coding mistakes. Use when writing, reviewing, or refactoring code to avoid overcomplication, make surgical changes, surface assumptions, and define verifiable success criteria. Derived from Andrej Karpathy's observations on LLM coding pitfalls. Applies universally regardless of language or domain.
Analyze SGLang and vLLM profiler traces on AMD ROCm systems, especially MI355X/gfx950 nodes. Adapted from the SGLang torch-profiler workflow: triage kernel breakdown, overlap headroom, and fusion opportunities, then write structured artifacts that can be attached to amdpilot experiments, trials, and dashboard views. Use when a run needs profiling, when an optimization trial should produce machine-readable profiling artifacts, or when the user asks why a ROCm workload is slow.