rocprofv3-profiler/SKILL.md
Profile AMD GPU kernels using rocprofv3 and analyze performance bottlenecks. Use when the user wants to profile HIP/ROCm kernels, identify GPU performance issues, analyze hardware counters, or understand why a kernel is slow on AMD GPUs (MI100, MI200, MI300 series). Provides wrapper scripts for rocprofv3 execution and automated parsing of profiler output into structured, agent-friendly JSON with bottleneck classification.
Profile AMD GPU applications and identify performance bottlenecks using rocprofv3.
python3 scripts/rocprof_wrapper.py --mode counters -- ./your_app [args]
Modes:
- counters (default): Collect key performance counters for bottleneck analysis
- trace: Collect kernel execution traces (timing only)
- full: Collect both counters and traces

Options:
- --output-dir <dir>: Output directory (default: ./rocprof_output)
- --counters <file>: Custom counter input file (optional)
- --kernel <name>: Target specific kernel by name

python3 scripts/parse_profile.py <output_dir>
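Returns structured JSON with per-kernel metrics, a bottleneck classification for each kernel, and the path to the raw counter data.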
# Profile application
python3 scripts/rocprof_wrapper.py --mode counters -- ./matrix_multiply 1024
# Parse and analyze
python3 scripts/parse_profile.py ./rocprof_output
Sample output:
{
  "kernels": [{
    "name": "matmul_kernel",
    "metrics": {
      "duration_ns": 145230,
      "occupancy_pct": 45.2,
      "valu_busy_pct": 78.5,
      "lds_bank_conflict_rate": 0.12,
      "l2_hit_rate": 0.65
    },
    "bottleneck": {
      "type": "memory_bound",
      "confidence": "high",
      "detail": "Low L2 hit rate (65%) with high memory stall cycles"
    }
  }],
  "raw_data_path": "./rocprof_output/pmc_1/counter_collection.csv"
}
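The structured output is meant to be consumed programmatically. A minimal sketch of that, assuming the JSON has exactly the shape shown in the sample (the `summarize` helper is hypothetical and not part of the skill's scripts, and how `parse_profile.py` emits its JSON is also an assumption):

```python
import json

# Mirrors the sample output; in practice this text would come from
# parse_profile.py (how it is emitted is an assumption here).
sample = """
{
  "kernels": [{
    "name": "matmul_kernel",
    "metrics": {"duration_ns": 145230, "occupancy_pct": 45.2,
                "valu_busy_pct": 78.5, "lds_bank_conflict_rate": 0.12,
                "l2_hit_rate": 0.65},
    "bottleneck": {"type": "memory_bound", "confidence": "high",
                   "detail": "Low L2 hit rate (65%) with high memory stall cycles"}
  }],
  "raw_data_path": "./rocprof_output/pmc_1/counter_collection.csv"
}
"""


def summarize(report_text):
    """Return (name, duration_us, bottleneck_type) tuples, slowest kernel first."""
    report = json.loads(report_text)
    rows = [
        (k["name"], k["metrics"]["duration_ns"] / 1000.0, k["bottleneck"]["type"])
        for k in report["kernels"]
    ]
    # Sort by duration so the most expensive kernel is looked at first.
    return sorted(rows, key=lambda row: row[1], reverse=True)


for name, duration_us, kind in summarize(sample):
    print(f"{name}: {duration_us:.1f} us, {kind}")
```

Sorting by duration is a design choice for triage: the longest-running kernel is usually the first one worth inspecting.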
The parser classifies kernels into these categories:
| Bottleneck | Indicators |
|------------|------------|
| compute_bound | High VALU/MFMA busy, low memory stalls |
| memory_bound | High memory latency, low cache hit rates |
| lds_bound | High LDS bank conflicts or LDS instruction stalls |
| latency_bound | Low occupancy with high instruction latency |
| balanced | No single dominant bottleneck |
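To make the categories concrete, here is a toy rule-based classifier over the metric names from the sample output. It is only a sketch: the thresholds and the order of the checks are illustrative assumptions, not the parser's actual logic, and it ignores the stall counters the table mentions.

```python
def classify(metrics, valu_busy_hi=70.0, l2_hit_lo=0.70,
             lds_conflict_hi=0.30, occupancy_lo=30.0):
    """Toy bottleneck classifier; every threshold is an illustrative assumption."""
    if metrics["lds_bank_conflict_rate"] >= lds_conflict_hi:
        return "lds_bound"        # heavy LDS bank conflicts
    if metrics["l2_hit_rate"] < l2_hit_lo:
        return "memory_bound"     # poor cache hit rate
    if metrics["valu_busy_pct"] >= valu_busy_hi:
        return "compute_bound"    # vector ALUs mostly busy
    if metrics["occupancy_pct"] < occupancy_lo:
        return "latency_bound"    # too few waves to hide latency
    return "balanced"


sample_metrics = {
    "occupancy_pct": 45.2,
    "valu_busy_pct": 78.5,
    "lds_bank_conflict_rate": 0.12,
    "l2_hit_rate": 0.65,
}
print(classify(sample_metrics))
```

With these assumed thresholds, the sample metrics land in memory_bound because the L2 check runs before the VALU check; a real classifier would weigh the signals rather than return on the first match.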
For advanced use cases, invoke rocprofv3 directly:
# List available counters
rocprofv3 -L
# Trace kernel execution
rocprofv3 --kernel-trace --stats -- ./app
# Collect specific counters
rocprofv3 -i counters.txt -- ./app
Counter input file format (counters.txt):
pmc: SQ_WAVES SQ_INSTS_VALU SQ_INSTS_VMEM
pmc: TCC_HIT TCC_MISS
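A counter file in this format can also be generated from a flat list of counter names. The sketch below groups names into `pmc:` lines; `pmc_lines` is a hypothetical helper, and the `per_line` value of 3 is an assumption chosen to reproduce the example above, since the real per-pass limit depends on the GPU and the hardware block each counter belongs to.

```python
def pmc_lines(counters, per_line=3):
    """Group counter names into 'pmc:' lines with at most per_line names each."""
    if per_line < 1:
        raise ValueError("per_line must be at least 1")
    return [
        "pmc: " + " ".join(counters[i:i + per_line])
        for i in range(0, len(counters), per_line)
    ]


counters = ["SQ_WAVES", "SQ_INSTS_VALU", "SQ_INSTS_VMEM", "TCC_HIT", "TCC_MISS"]
text = "\n".join(pmc_lines(counters)) + "\n"
print(text, end="")  # write this text to counters.txt to use it with -i
```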
"rocprofv3 not found": Ensure ROCm is installed and /opt/rocm/bin is in PATH.
"No GPU detected": Check rocm-smi output and HSA_VISIBLE_DEVICES environment variable.
Multi-pass collection: If too many counters are requested, rocprofv3 replays the kernel. Use fewer counters per pmc line.
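The first two checks can be scripted before a profiling run. The following is a rough sketch using only the Python standard library; `preflight` is a hypothetical helper, and the default variable names (`ROCR_VISIBLE_DEVICES`, `HIP_VISIBLE_DEVICES`) are an assumption about which device-visibility variables matter on a given system, so pass your own tuple if they differ.

```python
import os
import shutil


def preflight(tool="rocprofv3", rocm_bin="/opt/rocm/bin", env=None,
              device_vars=("ROCR_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES")):
    """Collect basic facts that explain 'not found' and 'no GPU' failures."""
    env = os.environ if env is None else env
    path = env.get("PATH", "")
    return {
        # Full path of the profiler if it is reachable through PATH, else None.
        "tool_path": shutil.which(tool, path=path),
        # Whether the default ROCm bin directory is listed in PATH.
        "rocm_bin_in_path": rocm_bin in path.split(os.pathsep),
        # Current values of the device-visibility variables (None if unset).
        "device_vars": {name: env.get(name) for name in device_vars},
    }


for key, value in preflight().items():
    print(f"{key}: {value}")
```

A `tool_path` of `None` corresponds to the "rocprofv3 not found" case; a device variable set to an unexpected value is worth checking when no GPU is detected.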