Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

arist12/amd-kernel-optimization

Name: amd-kernel-optimization
Author: arist12

amd-kernel-optimization/SKILL.md

npx skillsauth add arist12/amd-skills amd-kernel-optimization

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

AMD Kernel Optimization (ROCm)

3 Rules (read first)

torch.compile FIRST. torch.compile(mode="default") with correct inductor config gives 2-5x speedup. Get this working before any manual optimization. Any code change that breaks compile is a net regression.
Profile before optimizing. Never guess where time is spent. Run torch.profiler, classify GPU time into GEMM / attention / elementwise / launch overhead, then optimize the largest category.
Measure after every change. Benchmark with proper warmup and iterations (see below). Revert if performance regresses.

Benchmarking (do this correctly or waste hours)

NEVER reduce warmup/iterations to "save time" — you get garbage numbers.

Minimum: 3 warmup runs, 10 measurement iterations. Use the benchmark script's defaults.

Use GPU timing, not wall-clock:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record(); result = model(input); end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end)

Report mean AND std. If std > 10% of mean, something is wrong (graph breaks, recompilation).
First-run penalty is NORMAL on AMD — torch.compile takes 2-15 min on first run. Set timeout ≥ 600s. Do NOT conclude "compile doesn't work" or kill jobs under 15 min.
Never report first-run latency as baseline. Always use mean of post-warmup iterations.

Optimization Ladder

Each level builds on the previous. Do NOT skip to Level 3+ without Level 2.

Level 1: Environment (no code changes)

Set env vars: GPU_MAX_HW_QUEUES=2, HIP_FORCE_DEV_KERNARG=1, HSA_NO_SCRATCH_RECLAIM=1, AMD_LOG_LEVEL=0
Disable NUMA balancing: sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
Enable GEMM tuning: PYTORCH_TUNABLEOP_ENABLED=1, TORCH_BLAS_PREFER_HIPBLASLT=1
Set torch.set_float32_matmul_precision('high')
Audit env vars: env | grep -iE 'TORCH|INDUCTOR|AUTOTUNE' — unset TORCHINDUCTOR_MAX_AUTOTUNE if present

Level 2: torch.compile (MANDATORY — do this first)

Apply inductor config, compile with mode="default". Details: references/torch-compile-and-graphs.md
Fix ALL graph breaks before Level 3: TORCH_LOGS="graph_breaks" python3 ...
If kernel launch overhead is high after compile, try manual CUDAGraph capture (see reference)
This alone typically gives 2-5x speedup

Level 3: Model surgery (must preserve compile compatibility)

Profile first → read references/benchmarking-and-profiling.md
Fuse QKV projections (3 GEMMs → 1), Gate+Up projections (2 GEMMs → 1) → references/gemm-and-linear.md
Replace manual attention with SDPA or aiter flash attention → references/gemm-and-linear.md
Look for compute-reduction: skip masked/padding inputs, avoid repeat_kv, cache unchanged outputs
Test: TORCH_LOGS="graph_breaks" — verify no new breaks after each change

Level 4: Kernel & inductor tuning

Write Triton kernels for elementwise fusions (RMSNorm, SiLU+Mul, Add+RMSNorm) → references/triton-on-rocm.md
Route fused GEMMs through aiter tuned GEMM with M-threshold gating → references/gemm-and-linear.md
Inductor tuning flags: coordinate_descent_tuning, benchmark_kernel, freezing (increase compile time, improve steady-state)

Level 5: Architecture-specific kernels

aiter kernels for attention/GEMM — use torch.ops.aiter.* (compile-safe) not Python wrappers
Weight preshuffling for asm paths (benchmark: may help or hurt per shape)
If custom kernel breaks compile, wrap with @torch.compiler.disable as last resort

AMD-Specific Alternatives Quick Reference

GEMM / Linear

| Option | Notes | |---|---| | rocBLAS (default) | Vendor BLAS; generally well-tuned | | hipBLASLt | Fused epilogues; may beat rocBLAS for some shapes | | aiter tuned GEMM | Auto-dispatches best kernel per (M,N,K) from tuned configs | | FP8 GEMM (MI300+) | gemm_a8w8 via aiter; gfx942=e4m3fnuz, gfx950=e4m3fn |

Attention

| Option | Notes | |---|---| | aiter flash attention | torch.ops.aiter.mha_fwd.default(...) — compile-friendly, GQA native | | SDPA | F.scaled_dot_product_attention(...) — good for KV-cache decode | | Manual bmm+softmax+bmm | Slowest; replace with SDPA |

Compilation & Graphs

| Option | Notes | |---|---| | torch.compile(mode="default") | Start here. Stable on ROCm with correct inductor config | | Manual CUDAGraph capture | Wrap full inference in one graph; needs Dynamo RNG patch | | reduce-overhead / max-autotune | Avoid on ROCm unless you have verified stability |

Common Pitfalls

Optimizing without profiling: Classify kernels by category (GEMM/attention/elementwise/other) and compute percentages. "Top 10 kernels" is not enough.
Skipping torch.compile: Manual fusion that saves 10% is worthless if you're missing the 3x from compile. Get compile working first.
Giving up on first failure: When a technique causes regression, diagnose and adjust (e.g., M-threshold gating for aiter GEMM), don't abandon it entirely.
Treating blockers as dead ends: "CUDAGraph doesn't support control flow" means "refactor the control flow." "Requires editing HuggingFace modeling file" means "go edit the modeling file" — that IS the work.
Not modifying inner model layers: The hottest code is in attention and MLP modules, often in third-party libraries. Locate them: python -c "import transformers; print(transformers.__file__)" and edit directly.
Testing only in isolation: Optimizations compose. A technique showing 0% alone may enable others. Build incrementally — each new technique applied ON TOP of all previous ones.
Reducing benchmark parameters: Setting WARMUP=0 ITERATIONS=1 gives meaningless numbers. Optimize the code, not the test.
aiter tuned GEMM with no tuned configs: The default config CSV ships empty. Without it, gemm_a16w16 silently falls back to plain F.linear — no error, no crash, no benefit. Diagnose with AITER_LOG_TUNED_CONFIG=1; if you see "using torch solution:0", generate configs via AITER_TUNE_GEMM=1 + aiter's GemmTuner, or fall back to PYTORCH_TUNABLEOP_ENABLED=1. See references/gemm-and-linear.md for the full workflow.

Reference Files

Read as needed for implementation details:

benchmarking-and-profiling.md — Proper measurement, GPU timing, profile interpretation, what to look for in traces
torch-compile-and-graphs.md — Inductor config, graph breaks and fixes, manual CUDAGraph capture, Dynamo RNG patch, env var audit
gemm-and-linear.md — GEMM backend APIs, aiter tuned GEMM, projection fusion, attention backends, weight preshuffling
triton-on-rocm.md — ROCm Triton gotchas, kernel templates for RMSNorm, SiLU+Mul, GELU+Mul, Add+RMSNorm

arist12/amd-kernel-optimization

amd-kernel-optimization/SKILL.md

Optimize inference latency and throughput of PyTorch models on AMD GPUs (MI250/MI300/MI350) with ROCm. Use when profiling and optimizing GEMM, attention, elementwise ops, torch.compile, CUDAGraphs, or Triton kernels on AMD hardware. Covers the full optimize cycle: benchmark → profile → analyze → implement → verify. Also covers benchmarking methodology and common pitfalls that waste time.

2 stars

development

Updated Apr 30, 2026

$ install --global

skillsauth

npx skillsauth add arist12/amd-skills amd-kernel-optimization

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 25, 2026, 5:14 AM52.6s5 files scanned

SKILL.md

name:: amd-kernel-optimization
description:: >
or Triton kernels on AMD hardware. Covers the full optimize cycle:: benchmark → profile → analyze →

AMD Kernel Optimization (ROCm)

3 Rules (read first)

torch.compile FIRST. torch.compile(mode="default") with correct inductor config gives 2-5x speedup. Get this working before any manual optimization. Any code change that breaks compile is a net regression.
Profile before optimizing. Never guess where time is spent. Run torch.profiler, classify GPU time into GEMM / attention / elementwise / launch overhead, then optimize the largest category.
Measure after every change. Benchmark with proper warmup and iterations (see below). Revert if performance regresses.

Benchmarking (do this correctly or waste hours)

NEVER reduce warmup/iterations to "save time" — you get garbage numbers.

Minimum: 3 warmup runs, 10 measurement iterations. Use the benchmark script's defaults.

Use GPU timing, not wall-clock:

start = torch.cuda.Event(enable_timing=True)
end = torch.cuda.Event(enable_timing=True)
start.record(); result = model(input); end.record()
torch.cuda.synchronize()
ms = start.elapsed_time(end)

Report mean AND std. If std > 10% of mean, something is wrong (graph breaks, recompilation).
First-run penalty is NORMAL on AMD — torch.compile takes 2-15 min on first run. Set timeout ≥ 600s. Do NOT conclude "compile doesn't work" or kill jobs under 15 min.
Never report first-run latency as baseline. Always use mean of post-warmup iterations.

Optimization Ladder

Each level builds on the previous. Do NOT skip to Level 3+ without Level 2.

Level 1: Environment (no code changes)

Set env vars: GPU_MAX_HW_QUEUES=2, HIP_FORCE_DEV_KERNARG=1, HSA_NO_SCRATCH_RECLAIM=1, AMD_LOG_LEVEL=0
Disable NUMA balancing: sudo sh -c 'echo 0 > /proc/sys/kernel/numa_balancing'
Enable GEMM tuning: PYTORCH_TUNABLEOP_ENABLED=1, TORCH_BLAS_PREFER_HIPBLASLT=1
Set torch.set_float32_matmul_precision('high')
Audit env vars: env | grep -iE 'TORCH|INDUCTOR|AUTOTUNE' — unset TORCHINDUCTOR_MAX_AUTOTUNE if present

Level 2: torch.compile (MANDATORY — do this first)

Apply inductor config, compile with mode="default". Details: references/torch-compile-and-graphs.md
Fix ALL graph breaks before Level 3: TORCH_LOGS="graph_breaks" python3 ...
If kernel launch overhead is high after compile, try manual CUDAGraph capture (see reference)
This alone typically gives 2-5x speedup

Level 3: Model surgery (must preserve compile compatibility)

Profile first → read references/benchmarking-and-profiling.md
Fuse QKV projections (3 GEMMs → 1), Gate+Up projections (2 GEMMs → 1) → references/gemm-and-linear.md
Replace manual attention with SDPA or aiter flash attention → references/gemm-and-linear.md
Look for compute-reduction: skip masked/padding inputs, avoid repeat_kv, cache unchanged outputs
Test: TORCH_LOGS="graph_breaks" — verify no new breaks after each change

Level 4: Kernel & inductor tuning

Write Triton kernels for elementwise fusions (RMSNorm, SiLU+Mul, Add+RMSNorm) → references/triton-on-rocm.md
Route fused GEMMs through aiter tuned GEMM with M-threshold gating → references/gemm-and-linear.md
Inductor tuning flags: coordinate_descent_tuning, benchmark_kernel, freezing (increase compile time, improve steady-state)

Level 5: Architecture-specific kernels

aiter kernels for attention/GEMM — use torch.ops.aiter.* (compile-safe) not Python wrappers
Weight preshuffling for asm paths (benchmark: may help or hurt per shape)
If custom kernel breaks compile, wrap with @torch.compiler.disable as last resort

AMD-Specific Alternatives Quick Reference

GEMM / Linear

Attention

Compilation & Graphs

Common Pitfalls

Optimizing without profiling: Classify kernels by category (GEMM/attention/elementwise/other) and compute percentages. "Top 10 kernels" is not enough.
Skipping torch.compile: Manual fusion that saves 10% is worthless if you're missing the 3x from compile. Get compile working first.
Giving up on first failure: When a technique causes regression, diagnose and adjust (e.g., M-threshold gating for aiter GEMM), don't abandon it entirely.
Treating blockers as dead ends: "CUDAGraph doesn't support control flow" means "refactor the control flow." "Requires editing HuggingFace modeling file" means "go edit the modeling file" — that IS the work.
Not modifying inner model layers: The hottest code is in attention and MLP modules, often in third-party libraries. Locate them: python -c "import transformers; print(transformers.__file__)" and edit directly.
Testing only in isolation: Optimizations compose. A technique showing 0% alone may enable others. Build incrementally — each new technique applied ON TOP of all previous ones.
Reducing benchmark parameters: Setting WARMUP=0 ITERATIONS=1 gives meaningless numbers. Optimize the code, not the test.
aiter tuned GEMM with no tuned configs: The default config CSV ships empty. Without it, gemm_a16w16 silently falls back to plain F.linear — no error, no crash, no benefit. Diagnose with AITER_LOG_TUNED_CONFIG=1; if you see "using torch solution:0", generate configs via AITER_TUNE_GEMM=1 + aiter's GemmTuner, or fall back to PYTORCH_TUNABLEOP_ENABLED=1. See references/gemm-and-linear.md for the full workflow.

Reference Files

Read as needed for implementation details:

benchmarking-and-profiling.md — Proper measurement, GPU timing, profile interpretation, what to look for in traces
torch-compile-and-graphs.md — Inductor config, graph breaks and fixes, manual CUDAGraph capture, Dynamo RNG patch, env var audit
gemm-and-linear.md — GEMM backend APIs, aiter tuned GEMM, projection fusion, attention backends, weight preshuffling
triton-on-rocm.md — ROCm Triton gotchas, kernel templates for RMSNorm, SiLU+Mul, GELU+Mul, Add+RMSNorm

Related Skills

arist12/skill-creator

tools

VerifiedTrustedCommunity

Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.

2SKILL.mdUpdated Apr 30, 2026

arist12/skill-creator

arist12/rocprofv3-profiler

data-ai

VerifiedTrustedCommunity

Profile AMD GPU kernels using rocprofv3 and analyze performance bottlenecks. Use when the user wants to profile HIP/ROCm kernels, identify GPU performance issues, analyze hardware counters, or understand why a kernel is slow on AMD GPUs (MI100, MI200, MI300 series). Provides wrapper scripts for rocprofv3 execution and automated parsing of profiler output into structured, agent-friendly JSON with bottleneck classification.

2SKILL.mdUpdated Apr 30, 2026

arist12/rocprofv3-profiler

arist12/rocm-profiler-analysis

testing

VerifiedTrustedCommunity

Analyze SGLang and vLLM profiler traces on AMD ROCm systems, especially MI355X/gfx950 nodes. Adapted from the SGLang torch-profiler workflow: triage kernel breakdown, overlap headroom, and fuse opportunities, then write structured artifacts that can be attached to amdpilot experiments, trials, and dashboard views. Use when a run needs profiling, when an optimization trial should produce machine-readable profiling artifacts, or when the user asks why a ROCm workload is slow.

2SKILL.mdUpdated Apr 30, 2026

arist12/rocm-profiler-analysis

arist12/rocm-crash-debug

development

VerifiedTrustedCommunity

Debug ROCm/HIP kernel crashes in SGLang and vLLM on AMD GPUs (MI300X/MI325X/MI355X). Adapts SGLang's @debug_kernel_api kernel boundary logging to ROCm: captures input tensors before crash, tracks shapes/dtypes/values, dumps crash artifacts for offline analysis. Integrates with amdpilot executor failure_reason field and dashboard trajectory viewer. Triggered by: CUDA/HIP errors, illegal memory access, device-side assert, OOM kills, signal 137/139, NaN/Inf in outputs, "debug crash", "why did the trial fail".

2SKILL.mdUpdated Apr 30, 2026

arist12/rocm-crash-debug

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/arist12/amd-skills.git

# Copy into Claude Code skills folder (global)
cp -r amd-skills/amd-kernel-optimization ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

arist12/amd-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT