rocm-crash-debug/SKILL.md
Debug ROCm/HIP kernel crashes in SGLang and vLLM on AMD GPUs (MI300X/MI325X/MI355X). Adapts SGLang's @debug_kernel_api kernel boundary logging to ROCm: captures input tensors before crash, tracks shapes/dtypes/values, dumps crash artifacts for offline analysis. Integrates with amdpilot executor failure_reason field and dashboard trajectory viewer. Triggered by: CUDA/HIP errors, illegal memory access, device-side assert, OOM kills, signal 137/139, NaN/Inf in outputs, "debug crash", "why did the trial fail".
npx skillsauth add amdpilot-org/amd-skills rocm-crash-debug
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use this when an amdpilot trial fails with a GPU error, OOM kill, or produces wrong results. AMD ROCm crashes often leave minimal diagnostic information. This skill captures structured crash evidence at kernel boundaries so failures become reproducible offline samples.
Problem: ROCm/HIP errors on MI300X/MI325X/MI355X crash the process before normal debug output is flushed. A trial that exits with code 137 (OOM kill) or reports a metric of 0.00 tells you nothing about what went wrong. The agent can't learn from a failure it can't observe.
Solution: Instrument kernel call boundaries with pre-execution logging. When a crash occurs, the last logged boundary tells you exactly which op failed and what inputs caused it. Combined with tensor dumps, a crash becomes an offline-analyzable sample.
This approach is adapted from SGLang's @debug_kernel_api decorator (PR #20910), which was
inspired by FlashInfer's API logging. The concepts are backend-agnostic; this skill specializes
them for ROCm/HIP and integrates with amdpilot's trial/experiment infrastructure.
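As a minimal sketch of the pre-execution logging idea (not SGLang's actual decorator), the wrapper below writes and flushes one line before the wrapped call runs, so the line survives even if the call takes the process down. The names `log_kernel_boundary` and `fused_add` are hypothetical; only the "Kernel API Call:" prefix is taken from the diagnostics script in this document.

```python
import functools
import sys


def log_kernel_boundary(func):
    """Hypothetical decorator: log the call name before executing it."""

    @functools.wraps(func)
    def wrapper(*args, **kwargs):
        # Write and flush BEFORE the call so the line is on disk/stderr
        # even if the call crashes the process.
        print(f"Kernel API Call: {func.__qualname__}", file=sys.stderr, flush=True)
        return func(*args, **kwargs)

    return wrapper


@log_kernel_boundary
def fused_add(a, b):
    # Stand-in for a real GPU kernel launch (hypothetical).
    return [x + y for x, y in zip(a, b)]
```

The key design choice is the explicit flush: buffered output is exactly what gets lost when a HIP error aborts the process.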
Set environment variables when launching the Docker container. These work with SGLang's
existing kernel_api_logging.py infrastructure on ROCm builds.
SGLANG_KERNEL_API_LOGLEVEL=1
SGLANG_KERNEL_API_LOGDEST=stderr
Use this as the default for all amdpilot trials. Overhead is negligible and it captures the last kernel boundary before any crash.
SGLANG_KERNEL_API_LOGLEVEL=3
SGLANG_KERNEL_API_LOGDEST=/workspace/.amdpilot/kernel_api.log
Use this when a trial has failed and you need to understand what data caused the crash. Shows tensor shapes, dtypes, device placement, and contiguity.
SGLANG_KERNEL_API_LOGLEVEL=5
SGLANG_KERNEL_API_LOGDEST=/workspace/.amdpilot/kernel_api.log
Use this for numerical correctness issues (NaN/Inf in outputs, wrong results). Adds statistical summaries of every input tensor at each kernel boundary.
SGLANG_KERNEL_API_LOGLEVEL=10
SGLANG_KERNEL_API_DUMP_DIR=/workspace/.amdpilot/crash_dumps
SGLANG_KERNEL_API_LOGDEST=/workspace/.amdpilot/kernel_api.log
Use this for hard-to-reproduce crashes. Saves complete input tensors and metadata before each kernel call. When the process crashes, the last dump directory contains a reproducible snapshot.
Note on HIP Graphs: When HIP_GRAPH_CAPTURE is active, level-10 tensor dumps are
automatically skipped (same as CUDA Graph behavior), but boundary logging at levels 1-5
continues. This prevents dump operations from corrupting graph capture.
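The four configurations above can be collected into one helper so a launcher picks a level by number. This is a sketch only: `kernel_logging_env` is a hypothetical name, and the values simply mirror the settings listed in this section.

```python
LOG_PATH = "/workspace/.amdpilot/kernel_api.log"
DUMP_DIR = "/workspace/.amdpilot/crash_dumps"


def kernel_logging_env(level):
    """Return the env vars for one of the levels described above (1, 3, 5, 10)."""
    if level not in (1, 3, 5, 10):
        raise ValueError(f"unsupported log level: {level}")
    env = {
        "SGLANG_KERNEL_API_LOGLEVEL": str(level),
        # Level 1 logs to stderr; higher levels log to a file.
        "SGLANG_KERNEL_API_LOGDEST": "stderr" if level == 1 else LOG_PATH,
    }
    if level == 10:
        # Only level 10 writes full tensor dumps.
        env["SGLANG_KERNEL_API_DUMP_DIR"] = DUMP_DIR
    return env
```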
In task.yaml, add the logging environment variables:
container:
env:
SGLANG_KERNEL_API_LOGLEVEL: "3"
SGLANG_KERNEL_API_LOGDEST: "/workspace/.amdpilot/kernel_api.log"
For the orchestrator's Docker container startup, these flow through ContainerConfig.env
to docker run -e flags. No code changes needed — just config.
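To illustrate how an env mapping turns into `docker run -e` flags, here is a small sketch. It does not reproduce amdpilot's `ContainerConfig` code; `docker_env_flags` and the image name `example/sglang-rocm:latest` are assumptions made for the example.

```python
def docker_env_flags(env):
    """Flatten an env mapping into ["-e", "KEY=VALUE", ...] argument pairs."""
    flags = []
    for key, value in sorted(env.items()):
        flags.extend(["-e", f"{key}={value}"])
    return flags


env = {
    "SGLANG_KERNEL_API_LOGLEVEL": "3",
    "SGLANG_KERNEL_API_LOGDEST": "/workspace/.amdpilot/kernel_api.log",
}
# Hypothetical image name, used only to show where the flags go.
cmd = ["docker", "run", *docker_env_flags(env), "example/sglang-rocm:latest"]
```

Passing the command as a list (rather than one shell string) avoids quoting problems if a value ever contains spaces.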
For amdpilot executor integration, the recommended approach:
When a trial fails, collect these AMD-specific diagnostics from inside or outside the container:
# GPU memory and utilization at time of failure
rocm-smi --showuse --showmeminfo vram
# Check for GPU hangs or resets
dmesg | grep -i "amdgpu\|drm\|gpu" | tail -20
# Check for Xnack / page fault issues (MI300X/MI355X specific)
dmesg | grep -i "retry fault\|xnack" | tail -10
# Get exit code and OOM details
docker inspect --format='{{.State.ExitCode}} {{.State.OOMKilled}}' <container>
# Get last N lines of container stderr
docker logs --tail 50 <container> 2>&1
| Error | Meaning | What to Check |
|-------|---------|---------------|
| hipErrorIllegalAddress | Out-of-bounds GPU memory access | Tensor shapes, index bounds, contiguity |
| hipErrorAssert | Device-side assert triggered | Input validation in kernel, index range |
| hipErrorOutOfMemory | GPU VRAM exhaustion | Batch size, model size, KV cache config |
| hipErrorLaunchFailure | Kernel launch failed | Shared memory size, block dimensions |
| hipErrorNoBinaryForGpu | No binary for target GPU arch | Check gfx target matches (gfx942 vs gfx950) |
| Exit code 137 | Process received SIGKILL (128 + 9), usually from the host OOM killer | Container memory limit, host swap pressure |
| Exit code 139 | Process received SIGSEGV (128 + 11), a segmentation fault | Usually host-side pointer corruption |
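The two numeric rows follow the shell convention that a process killed by a signal reports an exit status of 128 plus the signal number. A sketch of decoding that on Linux, where `describe_exit_code` is a hypothetical helper name:

```python
import signal


def describe_exit_code(code):
    """Translate a process exit status into a short human-readable string."""
    if code > 128:
        signum = code - 128
        try:
            name = signal.Signals(signum).name
        except ValueError:
            # Not a signal number known on this platform.
            name = f"signal {signum}"
        return f"killed by {name}"
    return "clean exit" if code == 0 else f"exited with status {code}"
```

An exit status of 137 only says the process was sent SIGKILL; the `OOMKilled` field from `docker inspect` is what confirms the container's memory limit was the cause.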
# Inside container — attach to running process
rocgdb -p <pid>
# Or launch with rocgdb
rocgdb --args python your_script.py
(gdb) catch throw
(gdb) run
MI355X nodes typically run 8 GPUs. Use per-process log files:
SGLANG_KERNEL_API_LOGDEST="/workspace/.amdpilot/kernel_api_rank%i.log"
SGLANG_KERNEL_API_DUMP_DIR="/workspace/.amdpilot/crash_dumps/rank%i"
The %i placeholder is replaced by the process rank. This prevents log interleaving
across TP/DP workers.
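The effect of the `%i` placeholder can be pictured with the sketch below. It illustrates the resulting paths only and is not the expansion code used by SGLang; `expand_rank` is a hypothetical name.

```python
def expand_rank(template, rank):
    """Replace the %i placeholder with the process rank."""
    return template.replace("%i", str(rank))


# One log file per rank on an 8-GPU node.
paths = [expand_rank("/workspace/.amdpilot/kernel_api_rank%i.log", r) for r in range(8)]
```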
For RCCL (ROCm collective communication) hangs:
# Enable RCCL debug logging
NCCL_DEBUG=INFO
NCCL_DEBUG_SUBSYS=INIT,COLL
# Set timeout for collective operations
NCCL_TIMEOUT=300
Level-10 dumps can consume significant disk space. Filter to specific ops:
# Only dump attention-related ops
SGLANG_KERNEL_API_DUMP_INCLUDE="*attention*,*flash*,*sdpa*"
# Exclude high-frequency trivial ops
SGLANG_KERNEL_API_DUMP_EXCLUDE="*elementwise*,*copy*,*fill*"
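One way to read the include/exclude globs is sketched below with the standard library's `fnmatch`. The precedence rules here (exclude wins, an empty include list means "everything") are assumptions for illustration, as are the function name `should_dump` and the op names in the usage; check the SGLang source for the actual semantics.

```python
from fnmatch import fnmatchcase


def should_dump(op_name, include="", exclude=""):
    """Decide whether an op matches comma-separated include/exclude globs."""
    include_patterns = [p for p in include.split(",") if p]
    exclude_patterns = [p for p in exclude.split(",") if p]
    # Assumption: an exclude match always wins.
    if any(fnmatchcase(op_name, p) for p in exclude_patterns):
        return False
    # Assumption: with no include patterns, every op is eligible.
    if include_patterns:
        return any(fnmatchcase(op_name, p) for p in include_patterns)
    return True
```

For example, `should_dump("flash_attention_fwd", include="*attention*,*flash*,*sdpa*")` is true under these rules, while an op named `rmsnorm` would be skipped by the same include list.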
Symptom: hipErrorNoBinaryForGpu or silent wrong results
Cause: Container built for gfx942 (MI300X) but running on gfx950 (MI355X)
Fix: Verify Docker image target matches hardware
# Check GPU architecture
rocminfo | grep -o -m 1 "gfx[0-9a-f]*"
# Should show: gfx950 for MI355X, gfx942 for MI300X/MI325X
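When the architecture check runs from Python, the same idea can be written as a pure function over `rocminfo` text. This is a sketch: `parse_gfx_arch` is a hypothetical helper, and `SAMPLE` is an abbreviated, made-up excerpt rather than real tool output. It skips `Name:` lines that belong to CPU agents or marketing names.

```python
def parse_gfx_arch(rocminfo_output):
    """Return the first gfx target named in rocminfo output, or None."""
    for line in rocminfo_output.splitlines():
        if "Name:" in line:
            parts = line.split()
            # CPU agents and marketing names also have Name: lines; skip them.
            if parts and parts[-1].startswith("gfx"):
                return parts[-1]
    return None


# Abbreviated, illustrative excerpt (not captured from a real node).
SAMPLE = """\
  Name:                    Example CPU
  Marketing Name:          Example CPU
  Name:                    gfx942
  Marketing Name:          Example GPU
"""
```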
Symptom: Exit code 137 or hipErrorOutOfMemory despite sufficient total VRAM
Cause: HBM3e memory fragmentation after multiple allocation/free cycles
Fix: Set PYTORCH_HIP_ALLOC_CONF=expandable_segments:True
Symptom: Crash in attention kernel with FP8 quantized models
Cause: FP8 flash attention path not supported on all ROCm versions
Fix: Check env-probe skill output for FP8 flash attn availability; fall back to
non-FP8 attention if flagged
Symptom: Hang during first forward pass, or ModuleNotFoundError in Triton
Cause: Triton cache corruption or version mismatch
Fix: Clear Triton cache: rm -rf ~/.triton/cache
Symptom: Process hangs on all_reduce or all_gather, eventually killed
Cause: GPU-to-GPU communication failure, often RDMA fabric issue
Fix: Check rocm-smi --showtopo for link health; set NCCL_TIMEOUT
Crash artifacts flow into the dashboard through existing fields:
| Artifact | DB Field | Dashboard View |
|----------|----------|----------------|
| Last kernel boundary | trials.failure_reason | Trial detail panel |
| Input shapes/dtypes | trials.failure_reason (structured) | Trial detail panel |
| Full crash dump path | events.detail_json | Trajectory viewer |
| GPU state at crash | events.detail_json | System info panel |
| Kernel API log file | Agent output (trial_N.txt) | Agent log viewer |
After a trial fails, run this inside the container to collect structured diagnostics:
#!/bin/bash
# collect_crash_diagnostics.sh
# Gathers the kernel API log tail, crash dump paths, and GPU arch into one JSON file.
OUT="/workspace/.amdpilot/crash_diagnostics.json"
mkdir -p "$(dirname "$OUT")"
python3 -c "
import glob, json, os, subprocess

diag = {
    'kernel_api_log': '',
    'last_kernel_boundary': '',
    'crash_dumps': [],
    'gpu_arch': '',
    'exit_code': ${EXIT_CODE:-0},  # substituted by bash before Python runs
}

# Read the tail of the kernel API log
log_path = '/workspace/.amdpilot/kernel_api.log'
if os.path.exists(log_path):
    with open(log_path, errors='replace') as f:
        lines = f.readlines()
    diag['kernel_api_log'] = ''.join(lines[-50:])
    # The last logged boundary is the op that was running at the crash
    for line in reversed(lines):
        if 'Kernel API Call:' in line:
            diag['last_kernel_boundary'] = line.strip()
            break

# List the five most recent crash dumps (glob order is not guaranteed)
dump_dir = '/workspace/.amdpilot/crash_dumps'
if os.path.isdir(dump_dir):
    found = glob.glob(f'{dump_dir}/**/metadata.json', recursive=True)
    diag['crash_dumps'] = sorted(found, key=os.path.getmtime)[-5:]

# GPU arch: first rocminfo Name: line that mentions a gfx target
try:
    out = subprocess.check_output(['rocminfo'], text=True, timeout=5)
    for line in out.splitlines():
        if 'Name:' in line and 'gfx' in line:
            diag['gpu_arch'] = line.strip().split()[-1]
            break
except Exception:
    pass  # rocminfo missing or hung; leave gpu_arch empty

print(json.dumps(diag, indent=2))
" > "$OUT"
The orchestrator can read this file post-trial and inject into failure_reason and
events.detail_json for dashboard display.
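A sketch of the read-and-summarize step on the orchestrator side follows. It assumes only the JSON keys written by the script above; the function names and the summary format are hypothetical and do not describe amdpilot's actual schema for failure_reason.

```python
import json


def load_diagnostics(path="/workspace/.amdpilot/crash_diagnostics.json"):
    """Load the JSON file produced by collect_crash_diagnostics.sh."""
    with open(path) as f:
        return json.load(f)


def build_failure_reason(diag):
    """Condense the diagnostics dict into a one-line summary string."""
    parts = [f"exit_code={diag.get('exit_code', 'unknown')}"]
    if diag.get("gpu_arch"):
        parts.append(f"gpu_arch={diag['gpu_arch']}")
    boundary = diag.get("last_kernel_boundary") or "no kernel boundary logged"
    parts.append(f"last_boundary={boundary}")
    dumps = diag.get("crash_dumps") or []
    if dumps:
        parts.append(f"crash_dumps={len(dumps)}")
    return "; ".join(parts)
```

Keeping the summary short leaves the full log tail and dump paths for events.detail_json, where the larger payload belongs.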
Recommended supervisor behavior when a trial fails:
First failure (any exit code != 0):
failure_reason
Second failure (same error pattern):
Third failure (still stuck):
This escalation flow produces progressively richer crash evidence without paying the full level-10 cost on every trial.
This skill has been validated against:
When you encounter a new ROCm crash pattern:
collect_crash_diagnostics.sh