rocm-profiler-analysis/SKILL.md
Analyze SGLang and vLLM profiler traces on AMD ROCm systems, especially MI355X/gfx950 nodes. Adapted from the SGLang torch-profiler workflow: triage kernel breakdown, overlap headroom, and fuse opportunities, then write structured artifacts that can be attached to amdpilot experiments, trials, and dashboard views. Use when a run needs profiling, when an optimization trial should produce machine-readable profiling artifacts, or when the user asks why a ROCm workload is slow.
Install this skill globally with one command: `npx skillsauth add amdpilot-org/amd-skills rocm-profiler-analysis`. Works with Claude Code, Cursor, and Windsurf.
Use this skill when you need to turn a profiling run into structured optimization evidence instead of a raw trace file.
This skill is the AMD/ROCm/MI355X adaptation of SGLang's torch-profiler analysis workflow. It is designed for our current amdpilot stack:
Raw traces are not good enough for agents or dashboards. They tell you that time was spent somewhere, but they do not directly answer:
This skill standardizes that path.
Preserve the same four subcommands as the upstream SGLang profiler skill:
- `triage`
- `breakdown`
- `overlap`
- `perfetto-fix`

For normal agent use, default to `triage`.
`triage`

Use this when you want one compact answer with three main outputs:
`breakdown`

Use this when you need one-trace category share analysis without overlap reasoning.
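As a rough illustration of what a one-trace category share computation could look like, the sketch below sums kernel durations per category from Chrome-trace style events (the `ph`, `cat`, `name`, and `dur` keys that torch-profiler traces carry). The `CATEGORY_RULES` table, the `categorize` and `category_shares` helpers, and the sample kernel names are all hypothetical; the real category table is the one in references/rocm-kernel-categories.md, which is not reproduced here.

```python
from collections import defaultdict

# Hypothetical substring rules for illustration only. The real ROCm
# category table lives in references/rocm-kernel-categories.md.
CATEGORY_RULES = [
    ("gemm", "gemm"),
    ("attention", "attention"),
    ("allreduce", "communication"),
]


def categorize(kernel_name):
    """Map a kernel name to a category using the first matching substring rule."""
    lowered = kernel_name.lower()
    for needle, category in CATEGORY_RULES:
        if needle in lowered:
            return category
    return "other"


def category_shares(trace_events):
    """Return {category: fraction of total kernel time} for complete kernel events."""
    totals = defaultdict(float)
    for event in trace_events:
        # Only complete ("X") events in the "kernel" category count as GPU time here.
        if event.get("ph") != "X" or event.get("cat") != "kernel":
            continue
        totals[categorize(event.get("name", ""))] += float(event.get("dur", 0))
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    return {category: duration / grand_total for category, duration in totals.items()}


# Made-up events: three GPU kernels plus one CPU op that must be ignored.
sample_events = [
    {"ph": "X", "cat": "kernel", "name": "example_gemm_kernel", "dur": 60},
    {"ph": "X", "cat": "kernel", "name": "example_paged_attention", "dur": 30},
    {"ph": "X", "cat": "kernel", "name": "example_elementwise_add", "dur": 10},
    {"ph": "X", "cat": "cpu_op", "name": "aten::mm", "dur": 500},
]
shares = category_shares(sample_events)
```

Substring matching is only a placeholder for whatever classification the skill actually applies; the point is that shares are computed over GPU kernel time alone, so CPU-side operator events do not dilute the percentages.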
`overlap`

Use this when you have both:
and need to tie overlap headroom back to code paths.
`perfetto-fix`

Use this only when Perfetto renders overlapped lanes incorrectly and you need a repaired trace for human inspection.
This skill supports two input shapes:
1. Existing trace directory / trace file
   - `trace.json`
   - `trace.json.gz`
2. Live server / live experiment
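For the first input shape, a loader has to accept both plain and gzip-compressed traces. The following is a minimal sketch of that, assuming the file is a Chrome-trace JSON document that is either an object with a `traceEvents` list or a bare list of events. The helper name `load_trace_events` is hypothetical and is not part of the skill's actual CLI.

```python
import gzip
import json
from pathlib import Path


def load_trace_events(path):
    """Load trace events from a trace.json or trace.json.gz file."""
    path = Path(path)
    # Pick the opener from the file suffix; both are opened in text mode.
    opener = gzip.open if path.suffix == ".gz" else open
    with opener(path, "rt", encoding="utf-8") as handle:
        payload = json.load(handle)
    # Chrome-trace files are either {"traceEvents": [...]} or a bare list.
    if isinstance(payload, dict):
        return payload.get("traceEvents", [])
    return payload
```

Large traces can be slow to parse this way; a streaming parser may be a better fit for multi-gigabyte files, but that is outside the scope of this sketch.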
For amdpilot integration, prefer the second path for optimization-stage profiling and the first path for post-hoc investigation.
Do not reuse CUDA/H100/B200 assumptions. On our nodes, category tables should explicitly account for ROCm-specific paths:
See references/rocm-kernel-categories.md.
Every profiling result must declare whether it is truly relevant to our MI355X node:
- `observed_arch`: actual arch from the run
- `arch_match`: `exact` | `compatible` | `unknown`
- `hardware_relevance_reason`: short human-readable explanation

Do not hide gfx942 vs gfx950 differences.
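One way to derive these three fields programmatically is sketched below. It assumes the observed arch arrives as a string such as `gfx950` (optionally with feature suffixes after a colon), that the target is `gfx950`, and that the set of archs treated as `compatible` is passed in explicitly as a policy decision. The function name `classify_arch` and the wording of the reasons are hypothetical.

```python
def classify_arch(observed_arch, target_arch="gfx950", compatible_archs=()):
    """Build the hardware-relevance fields for a profiling result.

    compatible_archs is an explicit, caller-supplied policy (an assumption of
    this sketch); nothing is treated as compatible by default.
    """
    # Drop feature suffixes such as ":sramecc+:xnack-" if they are present.
    base_arch = (observed_arch or "").split(":")[0].strip()
    if not base_arch:
        match, reason = "unknown", "no GPU arch was recorded for this run"
    elif base_arch == target_arch:
        match, reason = "exact", f"run executed on {target_arch}"
    elif base_arch in compatible_archs:
        match = "compatible"
        reason = f"run executed on {base_arch}, listed as compatible with {target_arch}"
    else:
        match = "unknown"
        reason = f"run executed on {base_arch}; relevance to {target_arch} is not established"
    return {
        "observed_arch": base_arch or None,
        "arch_match": match,
        "hardware_relevance_reason": reason,
    }
```

Keeping the real arch in `observed_arch` even when the match is `compatible` means a gfx942 result is never silently presented as a gfx950 one.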
Do not stop at stdout tables. Write stable artifacts that can be attached to an experiment or trial. Minimum recommended outputs:
- `profile_summary.md`
- `profile_metadata.json`
- `kernel_table.json`
- `overlap_opportunities.json`
- `fuse_opportunities.json`
- `perfetto_fixed_trace.json` if used

See references/artifact-contract.md.
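A quick completeness check over an artifact directory could look like the sketch below. It treats the five always-expected files as required and leaves `perfetto_fixed_trace.json` out because it only exists when `perfetto-fix` was used. The helper name `missing_artifacts` and the flat directory layout are assumptions of this example; the authoritative contract is the one in references/artifact-contract.md.

```python
from pathlib import Path

# Files listed as minimum recommended outputs; the optional Perfetto
# trace is intentionally not required here.
REQUIRED_ARTIFACTS = (
    "profile_summary.md",
    "profile_metadata.json",
    "kernel_table.json",
    "overlap_opportunities.json",
    "fuse_opportunities.json",
)


def missing_artifacts(artifact_dir):
    """Return the required artifact file names that are absent from artifact_dir."""
    root = Path(artifact_dir)
    return [name for name in REQUIRED_ARTIFACTS if not (root / name).is_file()]
```

An empty return value means the directory has every minimum artifact and can be attached to an experiment or trial.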
profile_metadata.json should contain enough information to tie profiling results back to the
dashboard and DB.
Minimum fields:
- `experiment_id`
- `trial_id`
- `observed_arch`
- `arch_match`
- `hardware_relevance_reason`
- `rocm_version`
- `base_image`
- `resource_class`
- `gpu_device_ids`
- `gpu_clocks_mhz`
- `preflight_passed`
- `server_flags`
- `benchmark_config_hash`
- `model_name`
- `profile_stage`
- `source_trace_path`

This is the difference between "a useful local notebook" and "a reusable profiling artifact".
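The minimum-field requirement is easy to enforce before an artifact is written. The sketch below checks a metadata dictionary against the field names listed above and refuses to write an incomplete `profile_metadata.json`. The helper names `missing_metadata_fields` and `write_profile_metadata` are hypothetical, and the value types of each field are not specified by this document, so none are validated here.

```python
import json
from pathlib import Path

# Minimum field names for profile_metadata.json, as listed above.
REQUIRED_METADATA_FIELDS = (
    "experiment_id", "trial_id", "observed_arch", "arch_match",
    "hardware_relevance_reason", "rocm_version", "base_image",
    "resource_class", "gpu_device_ids", "gpu_clocks_mhz",
    "preflight_passed", "server_flags", "benchmark_config_hash",
    "model_name", "profile_stage", "source_trace_path",
)


def missing_metadata_fields(metadata):
    """Return the required field names that are absent from the metadata dict."""
    return [field for field in REQUIRED_METADATA_FIELDS if field not in metadata]


def write_profile_metadata(metadata, artifact_dir):
    """Write profile_metadata.json, raising ValueError if required fields are missing."""
    missing = missing_metadata_fields(metadata)
    if missing:
        raise ValueError(f"profile metadata is missing fields: {', '.join(missing)}")
    target = Path(artifact_dir) / "profile_metadata.json"
    # sort_keys keeps the file stable across runs, which makes diffs readable.
    target.write_text(json.dumps(metadata, indent=2, sort_keys=True), encoding="utf-8")
    return target
```

Failing loudly at write time is a design choice for this sketch: an artifact that cannot be tied back to an experiment or trial is rejected rather than stored with gaps.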
The intended downstream path is:
This skill should feed:
See references/dashboard-integration.md.
On our node, prefer profiling plans that stay grounded in actual machine facts:
When you compare profiles across runs, never compare them without also checking:
- `base_image`
- `resource_class`
- `gpu_device_ids`
- `server_flags`
- `benchmark_config_hash`

Otherwise the comparison is not trustworthy.
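The comparability check above can be automated against two `profile_metadata.json` payloads. This is a minimal sketch that assumes both payloads are already loaded as dictionaries and that plain equality is the right test for every field; the helper name `comparability_mismatches` is hypothetical.

```python
# Fields that must agree before two profiles are compared, as listed above.
COMPARABILITY_FIELDS = (
    "base_image",
    "resource_class",
    "gpu_device_ids",
    "server_flags",
    "benchmark_config_hash",
)


def comparability_mismatches(meta_a, meta_b):
    """Return {field: (value_a, value_b)} for every field where the two runs differ."""
    return {
        field: (meta_a.get(field), meta_b.get(field))
        for field in COMPARABILITY_FIELDS
        if meta_a.get(field) != meta_b.get(field)
    }
```

An empty result means the two runs agree on all five fields. Any non-empty result is worth surfacing next to the comparison itself, so a reader can see which difference in setup may explain a difference in the profile.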
Make the kernel classification and overlap heuristics ROCm-aware.
Write the profiling outputs into stable JSON + Markdown artifacts.
Trigger profiling from real optimization stages and surface the artifacts in the dashboard.
`rocprofv3-profiler`: Use that skill when you need low-level AMD hardware counters or kernel-level bottleneck data. Use this skill when you need SGLang/vLLM trace triage tied back to Python/operator semantics.
`env-probe`: Run that skill before profiling if you suspect hidden runtime defaults are skewing results.
`rocm-crash-debug`: Use that skill when the run is failing. Use this skill when the run is healthy enough to generate profiling evidence.
Before calling this skill "done", verify:
arch_match programmatically