env-probe/SKILL.md
Inspect AMD/ROCm Docker runtime environment before writing any code. Use BEFORE torch.compile, CUDAGraph capture, or any kernel optimization. Detects hidden framework defaults (inductor max_autotune, triton.cudagraphs), known Docker-specific bugs (hipBLASLt solver crash, FP8 flash attn), and missing packages. Outputs CRITICAL/WARNING/INFO report with recommended fixes. Triggered by: starting work in an AMD Docker, "check environment", "why is torch.compile hanging", "env probe", Phase 0 of any AMD optimization experiment.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add arist12/amd-skills env-probe
Run this before writing any optimization code. AMD Docker images silently set framework defaults that differ from stock PyTorch. These hidden defaults cause stalls, crashes, and wrong results that are impossible to diagnose by looking at code alone.
Problem: ROCm Docker images override PyTorch/Triton defaults at the system level. For example,
with max_autotune=True as a global default, even torch.compile(mode="default") benchmarks every
GEMM across the ATEN, TRITON, and CPP backends. With hundreds of matmuls in a compiled graph,
autotuning takes so long that the process appears to hang indefinitely, with no error message.
These defaults are invisible to pip list, rocm-smi, or any surface-level inspection. You have
to introspect the framework config objects at runtime to see them.
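As a minimal sketch of what runtime introspection can look like (this is not the probe script itself), the snippet below reads the live values of a few config attributes named in this document. The helper names `read_dotted` and `snapshot_config` are hypothetical, and every lookup is guarded because a given PyTorch build may not define all of these attributes.

```python
import importlib

# Dotted attribute paths taken from this document; any of them may be
# absent in a particular PyTorch build, so each lookup is guarded.
INDUCTOR_KEYS = [
    "max_autotune",
    "max_autotune_gemm_backends",
    "triton.cudagraphs",
    "triton.cudagraph_trees",
    "memory_planning",
]

_MISSING = "<not present>"


def read_dotted(obj, path):
    """Follow a dotted attribute path, returning _MISSING if any step fails."""
    for part in path.split("."):
        try:
            obj = getattr(obj, part)
        except AttributeError:
            return _MISSING
    return obj


def snapshot_config(module_name, keys):
    """Return {key: live value} for one module, or {} if it cannot be imported."""
    try:
        module = importlib.import_module(module_name)
    except ImportError:
        return {}
    return {key: read_dotted(module, key) for key in keys}


if __name__ == "__main__":
    targets = [
        ("torch._inductor.config", INDUCTOR_KEYS),
        ("torch._dynamo.config", ["cache_size_limit"]),
    ]
    for name, keys in targets:
        for key, value in snapshot_config(name, keys).items():
            print(f"{name}.{key} = {value!r}")
```

Run inside the container, this prints the values the framework will actually use, which is the information that pip list and rocm-smi cannot show.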
python /path/to/env_probe.py
Or if the skill is installed as a Claude Code command, copy the probe script from references/env_probe.py and run it inside your Docker container.
The probe script is self-contained — no dependencies beyond PyTorch (which your Docker already has).
The probe outputs a structured report with three severity levels:
| Level | Meaning | Action |
|-------|---------|--------|
| CRITICAL | Will cause hangs, crashes, or silent wrong results | Must fix before proceeding |
| WARNING | Suboptimal default, will hurt performance | Fix before benchmarking |
| INFO | Informational, no action needed | Document for reproducibility |
Each CRITICAL/WARNING item includes a recommended fix — either a Python config line or an
environment variable to set. Apply these fixes at the top of your script, before any
torch.compile() or torch.cuda.CUDAGraph() call.
Settings the probe inspects:

- torch._inductor.config.max_autotune: if True, causes indefinite stall with torch.compile
- torch._inductor.config.max_autotune_gemm_backends: which backends inductor will benchmark
- torch._inductor.config.triton.cudagraphs: unstable on ROCm
- torch._inductor.config.triton.cudagraph_trees: unstable on ROCm
- torch._inductor.config.memory_planning: causes deep recursion crash on ROCm
- torch._dynamo.config.cache_size_limit: too small causes recompilation loops
- torch.backends.cudnn.benchmark and allow_tf32 defaults
- HIP_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES
- HSA_ENABLE_SDMA, HIP_FORCE_DEV_KERNARG
- PYTORCH_TUNABLEOP_ENABLED, PYTORCH_TUNABLEOP_TUNING
- TORCH_COMPILE_DEBUG, TORCHINDUCTOR_* overrides

When the probe flags inductor defaults as CRITICAL, apply this configuration block before any
torch.compile() call:
import torch._inductor.config as inductor_config
import torch._dynamo.config as dynamo_config
# Prevent indefinite GEMM autotuning stall
inductor_config.max_autotune = False
inductor_config.max_autotune_gemm_backends = "ATEN"
# Disable unstable triton cudagraphs on ROCm
inductor_config.triton.cudagraphs = False
inductor_config.triton.cudagraph_trees = False
# Prevent deep recursion crash
inductor_config.memory_planning = False
# Prevent cache eviction / recompilation loops
dynamo_config.cache_size_limit = 128
See references/inductor-rocm-defaults.md for the full explanation of each setting and when you might want to override them.
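Config attribute names can vary between PyTorch releases, so a script shared across several Docker images may want to apply the same settings defensively. The sketch below sets an attribute only when it already exists and reports anything it skipped. The helper names `set_dotted` and `apply_recommended` are hypothetical. The values mirror the configuration block above.

```python
import importlib
from types import SimpleNamespace

# (module, dotted attribute, value) triples mirroring the block above.
RECOMMENDED = [
    ("torch._inductor.config", "max_autotune", False),
    ("torch._inductor.config", "max_autotune_gemm_backends", "ATEN"),
    ("torch._inductor.config", "triton.cudagraphs", False),
    ("torch._inductor.config", "triton.cudagraph_trees", False),
    ("torch._inductor.config", "memory_planning", False),
    ("torch._dynamo.config", "cache_size_limit", 128),
]


def set_dotted(root, path, value):
    """Set root.<path> = value only if the full path already exists."""
    *parents, leaf = path.split(".")
    obj = root
    for part in parents:
        obj = getattr(obj, part, None)
        if obj is None:
            return False
    if not hasattr(obj, leaf):
        return False
    setattr(obj, leaf, value)
    return True


def apply_recommended(settings=RECOMMENDED):
    """Apply each setting where possible; return the names that were skipped."""
    skipped = []
    for module_name, path, value in settings:
        try:
            module = importlib.import_module(module_name)
        except ImportError:
            skipped.append(f"{module_name}.{path}")
            continue
        if not set_dotted(module, path, value):
            skipped.append(f"{module_name}.{path}")
    return skipped


# Stand-in config object so the helper can be exercised without PyTorch.
fake = SimpleNamespace(max_autotune=True, triton=SimpleNamespace(cudagraphs=True))
set_dotted(fake, "triton.cudagraphs", False)
```

Calling `apply_recommended()` at the top of a script returns the settings that could not be applied, which is worth logging alongside the probe report.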
When you discover a new Docker-specific gotcha, add a check for it to references/env_probe.py and call it from run_all_checks().
This skill is meant to grow: every experiment that hits an environment issue should contribute a new check back to the probe.
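The internal layout of env_probe.py is not shown here, so the following is only a sketch of one way a new check might be written and registered. The `CHECKS` list, the tuple format, and this stand-in `run_all_checks` are assumptions, not the real script's API. The severity levels assigned are illustrative as well.

```python
import os

# Environment variables this document says the probe inspects.
WATCHED_ENV_VARS = [
    "HIP_VISIBLE_DEVICES",
    "ROCR_VISIBLE_DEVICES",
    "HSA_ENABLE_SDMA",
    "HIP_FORCE_DEV_KERNARG",
    "PYTORCH_TUNABLEOP_ENABLED",
    "PYTORCH_TUNABLEOP_TUNING",
    "TORCH_COMPILE_DEBUG",
]


def check_env_vars(environ=None):
    """Hypothetical check: return (level, name, value) tuples for the environment."""
    environ = os.environ if environ is None else environ
    findings = []
    for name in WATCHED_ENV_VARS:
        findings.append(("INFO", name, environ.get(name, "<unset>")))
    # Any TORCHINDUCTOR_* variable overrides an inductor default, so surface it.
    # WARNING is a severity chosen for illustration only.
    for name in sorted(environ):
        if name.startswith("TORCHINDUCTOR_"):
            findings.append(("WARNING", name, environ[name]))
    return findings


# Hypothetical registry: a new check is added by appending its function here.
CHECKS = [check_env_vars]


def run_all_checks(environ=None):
    """Stand-in for the probe's entry point: run every registered check."""
    results = []
    for check in CHECKS:
        results.extend(check(environ))
    return results


if __name__ == "__main__":
    for level, name, value in run_all_checks():
        print(f"[{level}] {name} = {value}")
```

Passing `environ` explicitly keeps each check testable outside the container, which makes it easier to contribute checks back without access to the affected Docker image.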