amd-rocm-porting/SKILL.md
Port NVIDIA CUDA codebases to AMD ROCm GPUs. Use when making PyTorch models run on AMD GPUs, replacing NVIDIA-specific libraries with AMD equivalents, fixing ROCm build/runtime failures, or porting C/C++ CUDA kernels to HIP. Also covers dependency debugging and environment setup on ROCm Docker images.
npx skillsauth add arist12/amd-skills amd-rocm-porting
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Port NVIDIA CUDA codebases to AMD ROCm GPUs for functional equivalence.
NVIDIA isolation: Every ROCm change MUST be gated behind is_rocm. The NVIDIA code path
must be byte-for-byte identical to the pre-porting state.
is_rocm = hasattr(torch.version, "hip") and torch.version.hip is not None
Compile mode: NEVER use mode="reduce-overhead" on ROCm — causes 65x slowdown.
Use mode="default" on ROCm, keep original mode for NVIDIA.
Inductor: Disable triton.cudagraphs, triton.cudagraph_trees, and memory_planning
on ROCm. Also override max_autotune = False (AMD Docker images set it True by default,
causing mode="default" to silently behave like max-autotune and hang).
Details: references/torch-compile-and-cudagraph.md
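The compile-mode and Inductor rules above can be sketched as a few small helpers. This is a minimal illustration, not the skill's own code: the helper names (`detect_rocm`, `pick_compile_mode`, `apply_rocm_inductor_overrides`) are hypothetical, and a `SimpleNamespace` stands in for `torch._inductor.config` so the sketch runs without PyTorch installed.

```python
from types import SimpleNamespace

# Settings the rules above say to turn off on ROCm (dotted path -> value).
ROCM_INDUCTOR_OVERRIDES = {
    "triton.cudagraphs": False,
    "triton.cudagraph_trees": False,
    "memory_planning": False,
    "max_autotune": False,
}


def detect_rocm(torch_version) -> bool:
    """Same check as is_rocm: a `hip` attribute that exists and is not None."""
    return getattr(torch_version, "hip", None) is not None


def pick_compile_mode(is_rocm: bool, original_mode: str) -> str:
    """Use "default" on ROCm; leave the NVIDIA mode exactly as it was."""
    return "default" if is_rocm else original_mode


def apply_rocm_inductor_overrides(config, is_rocm: bool) -> None:
    """Apply the overrides to a config object; do nothing on NVIDIA."""
    if not is_rocm:
        return  # NVIDIA code path stays untouched
    for dotted, value in ROCM_INDUCTOR_OVERRIDES.items():
        *parents, leaf = dotted.split(".")
        target = config
        for name in parents:
            target = getattr(target, name)
        setattr(target, leaf, value)


# Stand-in for torch._inductor.config (assumption: same attribute layout).
fake_config = SimpleNamespace(
    triton=SimpleNamespace(cudagraphs=True, cudagraph_trees=True),
    memory_planning=True,
    max_autotune=True,
)
apply_rocm_inductor_overrides(fake_config, is_rocm=True)
print(fake_config.triton.cudagraphs, fake_config.max_autotune)
```

In a real script the same helpers would presumably be called with `torch.version` and `torch._inductor.config`, followed by `torch.compile(model, mode=pick_compile_mode(is_rocm, original_mode))`.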
Warp width: AMD wavefronts are 64-wide (not 32). All ballot/mask operations need
uint64_t. (C/C++ repos only; pure Python repos skip this.)
Three-tier fallback: AMD-optimized lib → PyTorch SDPA → pure PyTorch eager. Details: references/library-and-model-adaptation.md
Does the repo have C/C++ CUDA kernels (.cu / .cuh files)?
├── NO → Skip Phases 2, 3, 4. Run Phases 1, 5, 6, 7, 8 only.
│ (Pure Python/PyTorch repos — most HuggingFace models, etc.)
└── YES → Run all 8 phases.
Does it use flash-attn, CUTLASS, or custom extensions?
├── flash-attn only → Phase 5 (replace with aiter)
├── CUTLASS → Phase 3 + manual CK rewrite
└── custom kernels → Full Phase 2 + 3 HIPIFY workflow
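The first branch of the decision tree can be checked mechanically. The following sketch assumes that the presence of any `.cu` or `.cuh` file anywhere under the repo root is a sufficient signal; `has_cuda_kernels` and `phases_to_run` are hypothetical helper names.

```python
import tempfile
from pathlib import Path

ALL_PHASES = [1, 2, 3, 4, 5, 6, 7, 8]
PYTHON_ONLY_PHASES = [1, 5, 6, 7, 8]  # Phases 2, 3, 4 are skipped


def has_cuda_kernels(repo_root) -> bool:
    """True if any .cu or .cuh file exists under repo_root."""
    root = Path(repo_root)
    return any(root.rglob("*.cu")) or any(root.rglob("*.cuh"))


def phases_to_run(repo_root) -> list:
    """Pick the phase list from the first branch of the decision tree."""
    return ALL_PHASES if has_cuda_kernels(repo_root) else PYTHON_ONLY_PHASES


# Demo on a throwaway directory: first empty, then with one kernel file.
with tempfile.TemporaryDirectory() as tmp:
    python_only = phases_to_run(tmp)
    (Path(tmp) / "kernel.cu").write_text("// placeholder kernel file\n")
    with_kernels = phases_to_run(tmp)

print(python_only)
print(with_kernels)
```

The second branch (flash-attn, CUTLASS, custom extensions) still needs a manual look at imports and build files; this check only answers the NO/YES question.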
Step 1 — Audit existing environment before installing anything. AMD Docker images often have PyTorch ROCm, aiter, flash-attn pre-installed. Check what exists:
env | grep -iE 'TORCH|INDUCTOR|AUTOTUNE|TRITON|HIP|ROCM|HSA|GPU|AMD|CUDA' | sort
pip show torch torchvision transformers 2>/dev/null | grep -E "^(Name|Version|Location)"
Add the repo src/ to sys.path in scripts to make the package importable without pip install:
import sys, pathlib
sys.path.insert(0, str(pathlib.Path(__file__).resolve().parents[1] / "src"))
Run the target script and note only the ModuleNotFoundErrors that actually occur.
Install those packages individually (pip install --no-deps <pkg>).
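To keep the "install only what actually fails" loop honest, the missing module names can be pulled out of the captured stderr instead of being read by eye. This is a minimal sketch; `missing_modules` is a hypothetical helper, and it returns import names, which are not always identical to the pip package names.

```python
import re

# Matches the standard CPython message for a failed import.
_MISSING = re.compile(r"ModuleNotFoundError: No module named '([^']+)'")


def missing_modules(stderr_text: str) -> list:
    """Return top-level module names reported missing, in first-seen order."""
    seen = []
    for dotted_name in _MISSING.findall(stderr_text):
        top_level = dotted_name.split(".")[0]
        if top_level not in seen:
            seen.append(top_level)
    return seen


sample_stderr = (
    "Traceback (most recent call last):\n"
    '  File "run.py", line 3, in <module>\n'
    "ModuleNotFoundError: No module named 'einops'\n"
)
found = missing_modules(sample_stderr)
print(found)
```

Because Python stops at the first failed import, each run typically surfaces one name; rerun the target script after each `pip install --no-deps <pkg>` until the list comes back empty.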
Step 2 — Never run pip install -e . on AMD without exclusions.
The pyproject.toml was written for NVIDIA and often contains jax[cuda12] and torch==X.Y.Z.
Running pip install -e . will overwrite your ROCm PyTorch with CUDA versions.
Use: pip install --no-deps --ignore-requires-python -e ., then install only missing packages.
CRITICAL: Protect PyTorch after any pip install. Always verify:
python3 -c "import torch; print(torch.__version__, torch.version.hip)"
If torch.version.hip is None, your ROCm PyTorch was overwritten. Reinstall it.
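The verification command above prints two fields, so its output can be checked in a script after every install step. A small sketch, assuming the output is exactly the single line produced by that `print` call; `rocm_torch_intact` is a hypothetical name and the sample strings use placeholder versions rather than real ones.

```python
def rocm_torch_intact(check_output: str) -> bool:
    """True if the second printed field (torch.version.hip) is not None."""
    fields = check_output.strip().split()
    # Expect "<torch version> <hip version>"; "None" means a non-ROCm build.
    return len(fields) == 2 and fields[1] != "None"


# Placeholder outputs, not real version strings.
still_rocm = rocm_torch_intact("X.Y.Z+rocm A.B.C\n")
overwritten = rocm_torch_intact("X.Y.Z+cu None\n")
print(still_rocm, overwritten)
```

A wrapper could feed this function the stdout of `python3 -c "import torch; print(torch.__version__, torch.version.hip)"` via `subprocess.run` and abort the setup as soon as it returns `False`.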
Step 3 — Python version constraint.
requires-python = ">=3.11" is often more conservative than the code actually requires. Pass --ignore-requires-python to pip install.
Step 4 — Repos with JAX + PyTorch: use PyTorch-only path. Skip all JAX-dependent code; do not attempt to install or fix JAX for ROCm.
Step 5 — Dependency debugging.
If you hit ImportError, version mismatch, or dtype errors, read references/dependency-debugging.md for the diagnostic protocol.
is_rocm = hasattr(torch.version, "hip") and torch.version.hip is not None
rocminfo | grep -o 'gfx[0-9a-f]*' | head -1 → e.g. gfx942 (MI300X) or gfx950 (MI350X)
PYTORCH_CUDA_ALLOC_CONF=expandable_segments:True (omit max_split_size_mb)
export GPU_MAX_HW_QUEUES=2 HIP_FORCE_DEV_KERNARG=1 HSA_NO_SCRATCH_RECLAIM=1 AMD_LOG_LEVEL=0
sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
rocm-smi, rocminfo | grep gfx, hipcc --version
arch = torch.cuda.get_device_properties(0).gcnArchName
fp8_dtype = torch.float8_e4m3fnuz if "gfx942" in arch else torch.float8_e4m3fn
hipify-perl --inplace for initial pass, then hipify-clang for complex templates
cuda_runtime.h→hip/hip_runtime.h, cublas_v2.h→hipblas/hipblas.h
Inline PTX (grep -rn "asm\s*(") cannot be auto-ported; flag CUTLASS, which needs a manual CK rewrite
uint64_t for AMD 64-wide wavefronts
Replace __popc with __popcll for 64-bit masks; prefer 64-element shared memory tiles
GPU_ARCH=$(rocminfo | grep -o 'gfx[0-9a-f]*' | head -1)
find_package(HIP), set CMAKE_HIP_ARCHITECTURES to the detected arch
is_rocm, use CUDAExtension (hipcc handles .cu on ROCm)
Guard import pynvml with try/except, use torch.cuda.is_available() as primary GPU check
PYTORCH_CUDA_ALLOC_CONF: remove max_split_size_mb on ROCm (rejected by HIP allocator)
mode="default" on ROCm, original mode on NVIDIA
env | grep -iE 'TORCH|INDUCTOR|AUTOTUNE', then unset TORCHINDUCTOR_MAX_AUTOTUNE if present (causes hangs even in default mode)
max_autotune=False; use ATEN GEMM backend
amd-kernel-optimization skill's torch-compile-and-graphs.md.
cuda_runtime.h, inline PTX, NVIDIA-specific types
GPU_ARCH=$(rocminfo | grep -o 'gfx[0-9a-f]*' | head -1); hipcc -c kernels.hip --offload-arch=$GPU_ARCH
torch.testing.assert_close(rocm_out, cuda_ref, rtol=5e-2, atol=5e-2)
Every JIT component has a slow first run. Do NOT conclude something is broken because first run is slow.
| Component | First Run | Subsequent | Cache |
|---|---|---|---|
| torch.compile (default) | 2-5 min | <1s | TORCHINDUCTOR_CACHE_DIR |
| torch.compile (max-autotune) | 5-15 min | <1s | TORCHINDUCTOR_CACHE_DIR |
| AITER JIT kernels | 1-3 min | <1s | aiter jit/build/ |
| Triton kernels | 1-2 min | <1s | ~/.triton/cache |
| TunableOp GEMM tuning | 1-5 min | <1s | PYTORCH_TUNABLEOP_FILENAME |
Set timeout ≥ 600s for first compilation. Do NOT kill processes under 15 minutes.
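The timeout guidance can be enforced by the launcher rather than left to whoever is watching the terminal. The sketch below uses 900 seconds, which satisfies both the 600 s minimum and the 15 minute rule; `run_first_compile` is a hypothetical helper, and the demo command is a trivial stand-in for a real warm-up script.

```python
import subprocess
import sys

# 15 minutes: above the 600 s floor, matches the "do not kill" threshold.
FIRST_RUN_TIMEOUT_S = 900


def run_first_compile(cmd, timeout_s: int = FIRST_RUN_TIMEOUT_S):
    """Run cmd; return its exit code, or None if it exceeded timeout_s."""
    try:
        proc = subprocess.run(
            cmd, capture_output=True, text=True, timeout=timeout_s
        )
    except subprocess.TimeoutExpired:
        # Only now is it reasonable to suspect a hang instead of slow JIT.
        return None
    return proc.returncode


# Stand-in for the real first run of a compiled model.
demo_exit_code = run_first_compile(
    [sys.executable, "-c", "print('warm-up done')"]
)
print(demo_exit_code)
```

Subsequent runs should hit the caches listed in the table, so a much shorter timeout is reasonable once the first run has completed.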
| Pitfall | Symptom | Fix |
|---|---|---|
| pip install -e . on AMD | Overwrites ROCm torch with CUDA version | Use --no-deps --ignore-requires-python; install missing pkgs individually |
| TORCHINDUCTOR_MAX_AUTOTUNE=1 in Docker env | mode="default" hangs (silently becomes max-autotune) | unset TORCHINDUCTOR_MAX_AUTOTUNE before any compile |
| reduce-overhead compile mode | 65x slowdown, hangs | mode="default" on ROCm |
| max_split_size_mb in PYTORCH_CUDA_ALLOC_CONF | RuntimeError at startup | Remove on ROCm |
| Top-level import pynvml | ImportError | Guard with try/except; use torch.cuda.is_available() first |
| Inductor cudagraphs enabled | Slowdown, capture errors | inductor_config.triton.cudagraphs = False |
| Inductor memory_planning | Deep recursion crash | inductor_config.memory_planning = False |
| torch.cuda.get_rng_state() during capture | RuntimeError | Apply Dynamo RNG patch |
| torch.backends.cuda.matmul.allow_tf32 | AttributeError on ROCm | Gate behind if not is_rocm |
| NUMA balancing on | 10-30% perf loss, intermittent errors | echo 0 > /proc/sys/kernel/numa_balancing |
| FP8 dtype mismatch | Crash or accuracy loss | gfx942=e4m3fnuz, gfx950=e4m3fn |
| 32-bit warp masks (C/C++) | Silent wrong results | Use uint64_t for ballot/active masks |
| Patching files into wrong site-packages path | Custom model code never loads | Verify with inspect.getfile(TheClass) after patching |
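Two of the fixes in the table are simple string decisions and can be sketched without a GPU. The helper names below are hypothetical; `fp8_dtype_name` returns the attribute name to look up on `torch` and follows the gfx942 versus everything-else split used elsewhere in this skill.

```python
def strip_max_split_size(alloc_conf: str) -> str:
    """Drop max_split_size_mb from a PYTORCH_CUDA_ALLOC_CONF value."""
    kept = []
    for item in alloc_conf.split(","):
        item = item.strip()
        if item and not item.startswith("max_split_size_mb"):
            kept.append(item)
    return ",".join(kept)


def fp8_dtype_name(arch: str) -> str:
    """Name of the torch FP8 dtype for a gcnArchName string."""
    # gcnArchName can carry suffixes, so use a substring test.
    return "float8_e4m3fnuz" if "gfx942" in arch else "float8_e4m3fn"


cleaned = strip_max_split_size("expandable_segments:True,max_split_size_mb:512")
mi300_dtype = fp8_dtype_name("gfx942:sramecc+:xnack-")
mi350_dtype = fp8_dtype_name("gfx950")
print(cleaned, mi300_dtype, mi350_dtype)
```

On ROCm the cleaned value would be written back to `os.environ["PYTORCH_CUDA_ALLOC_CONF"]` before `torch` is imported, and the dtype would be resolved with `getattr(torch, fp8_dtype_name(arch))`. Both steps belong behind the `is_rocm` gate so the NVIDIA path is unchanged.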
Load only when actively working on that phase: