amdpilot-org/rocm-profiler-analysis

Name: rocm-profiler-analysis
Author: amdpilot-org

rocm-profiler-analysis/SKILL.md

npx skillsauth add amdpilot-org/amd-skills rocm-profiler-analysis

ROCm Profiler Analysis

Use this skill when you need to turn a profiling run into structured optimization evidence instead of a raw trace file.

This skill is the AMD/ROCm/MI355X adaptation of SGLang's torch-profiler analysis workflow. It is designed for our current amdpilot stack:

MI355X / gfx950 nodes
ROCm 7.2
SGLang / vLLM issue-driven runs
dashboard artifacts, not just terminal output

Why This Exists

Raw traces are not good enough for agents or dashboards. They tell you that time was spent somewhere, but they do not directly answer:

Which kernel families dominate prefill or decode on ROCm?
Which kernels still have overlap headroom?
Which hotspots map back to Python or operator-level code paths?
Which results are actually relevant to gfx950 / MI355X, and which are only generic?
Which profiling outputs should be written into our canonical experiment/trial schema?

This skill standardizes that path.

Main Workflow

Preserve the same four subcommands as the upstream SGLang profiler skill:

triage
breakdown
overlap
perfetto-fix

For normal agent use, default to triage.

`triage`

Use this when you want one compact answer with three main outputs:

kernel table
overlap-opportunity table
fuse-opportunity table

`breakdown`

Use this when you need one-trace category share analysis without overlap reasoning.
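A category share analysis can be sketched as below. This is a minimal illustration, not the upstream implementation: the function name `category_shares`, the `(name, duration_us)` input shape, and the caller-supplied `classify` callable are all assumptions.

```python
from collections import defaultdict


def category_shares(kernels, classify):
    """Return {category: share of total kernel time}, largest share first.

    kernels:  iterable of (kernel_name, duration_us) pairs (hypothetical shape)
    classify: callable mapping a kernel name to a category string
    """
    totals = defaultdict(float)
    for name, duration_us in kernels:
        totals[classify(name)] += duration_us

    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}  # empty trace or zero-duration events: nothing to report

    ranked = sorted(totals.items(), key=lambda item: item[1], reverse=True)
    return {category: total / grand_total for category, total in ranked}
```

Shares are fractions of summed kernel time within one trace, so they say nothing about overlap between streams.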

`overlap`

Use this when you have both:

a graph-off mapping trace
a graph-on formal trace

and need to tie overlap headroom back to code paths.

`perfetto-fix`

Use this only when Perfetto renders overlapped lanes incorrectly and you need a repaired trace for human inspection.

Recommended Inputs

This skill supports two input shapes:

1. Existing trace directory / trace file
   - trace.json
   - trace.json.gz
   - profiler output directory
2. Live server / live experiment
   - trigger profiling against a running SGLang or vLLM server
   - then immediately analyze the result and attach artifacts back to the run

For amdpilot integration, prefer the second path for optimization-stage profiling and the first path for post-hoc investigation.
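For the existing-trace path, the three accepted shapes (trace.json, trace.json.gz, or a profiler output directory) can be resolved with a small loader. This is a sketch under assumptions: `find_trace_file` and `load_trace` are hypothetical helper names, and picking the first matching file in a directory is an illustrative policy, not something the skill specifies.

```python
import gzip
import json
from pathlib import Path


def find_trace_file(path):
    """Resolve a trace file from a file path or a profiler output directory."""
    candidate = Path(path)
    if candidate.is_file():
        return candidate
    if candidate.is_dir():
        # Assumption: prefer plain JSON, then gzipped JSON, in name order.
        matches = sorted(candidate.glob("*.json")) + sorted(candidate.glob("*.json.gz"))
        if matches:
            return matches[0]
    raise FileNotFoundError(f"no trace.json or trace.json.gz found at {path}")


def load_trace(path):
    """Load a trace as a Python object, transparently handling .gz files."""
    trace_file = find_trace_file(path)
    opener = gzip.open if trace_file.suffix == ".gz" else open
    with opener(trace_file, "rt", encoding="utf-8") as handle:
        return json.load(handle)
```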

AMD / ROCm Adaptation Rules

1. Use ROCm-native kernel categories

Do not reuse CUDA/H100/B200 assumptions. On our nodes, category tables should explicitly account for ROCm-specific paths:

RCCL / communication
Triton kernels
CK / composable kernel paths
AITER paths
hipBLASLt / rocBLAS GEMM
MIOpen / attention runtime kernels
quantization
normalization
memory / copy / scheduler overhead

See references/rocm-kernel-categories.md.
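A ROCm-aware classifier could look roughly like the sketch below. The category names follow the list above, but every regular expression here is an illustrative assumption; the authoritative mapping is whatever references/rocm-kernel-categories.md defines. The first matching rule wins, so rule order matters.

```python
import re

# Illustrative patterns only (assumptions, not the skill's real rule set).
ROCM_CATEGORY_PATTERNS = [
    ("communication", r"rccl|nccl|all_?reduce|all_?gather"),
    ("triton", r"triton"),
    ("ck", r"\bck(_tile)?::|composable_kernel"),
    ("aiter", r"aiter"),
    ("gemm", r"hipblaslt|rocblas|gemm"),
    ("attention", r"miopen|attention|attn|fmha"),
    ("quantization", r"quant"),
    ("normalization", r"norm"),
    ("memory", r"memcpy|memset|copy"),
]

_COMPILED = [
    (category, re.compile(pattern, re.IGNORECASE))
    for category, pattern in ROCM_CATEGORY_PATTERNS
]


def classify_kernel(name):
    """Return the first category whose pattern matches the kernel name."""
    for category, pattern in _COMPILED:
        if pattern.search(name):
            return category
    return "other"  # keep unmatched kernels visible instead of dropping them
```

Keeping an explicit "other" bucket makes gaps in the rule set show up in the category table rather than silently inflating another category.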

2. Keep hardware relevance explicit

Every profiling result must declare whether it is truly relevant to our MI355X node:

observed_arch: actual arch from the run
arch_match: exact | compatible | unknown
hardware_relevance_reason: short human-readable explanation

Do not hide gfx942 vs gfx950 differences.
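The three fields can be derived programmatically. The sketch below is hedged: `hardware_relevance` is a hypothetical helper, and treating gfx942 as "compatible" is an assumption made only for illustration, since this document does not define which arches qualify.

```python
TARGET_ARCH = "gfx950"

# Assumption for illustration: which arches count as "compatible" is a
# policy decision that this document does not spell out.
COMPATIBLE_ARCHES = frozenset({"gfx942"})


def hardware_relevance(observed_arch):
    """Build the observed_arch / arch_match / hardware_relevance_reason fields."""
    raw = (observed_arch or "").strip().lower()
    # Arch strings may carry feature suffixes, e.g. "gfx942:sramecc+:xnack-".
    base = raw.split(":", 1)[0]

    if base == TARGET_ARCH:
        match = "exact"
        reason = f"trace was captured on {base}, the target arch"
    elif base in COMPATIBLE_ARCHES:
        match = "compatible"
        reason = (
            f"trace was captured on {base}, not {TARGET_ARCH}; "
            "treat kernel timings as indicative only"
        )
    else:
        match = "unknown"
        reason = f"arch {base or 'not reported'} has no declared relationship to {TARGET_ARCH}"

    return {
        "observed_arch": base or "unknown",
        "arch_match": match,
        "hardware_relevance_reason": reason,
    }
```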

3. Treat profiling as structured artifacts

Do not stop at stdout tables. Write stable artifacts that can be attached to an experiment or trial. Minimum recommended outputs:

profile_summary.md
profile_metadata.json
kernel_table.json
overlap_opportunities.json
fuse_opportunities.json
perfetto_fixed_trace.json (only when perfetto-fix was run)

See references/artifact-contract.md.
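Writing the minimum artifact set might look like the sketch below. The file names come from the list above; the function name `write_artifacts`, its parameters, and the payload shapes are assumptions, and the actual schema is whatever references/artifact-contract.md specifies.

```python
import json
from pathlib import Path


def write_artifacts(out_dir, summary_md, metadata, kernel_table,
                    overlap_opportunities, fuse_opportunities):
    """Write the minimum artifact set and return the paths that were written."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)

    summary_path = out / "profile_summary.md"
    summary_path.write_text(summary_md, encoding="utf-8")
    written = [summary_path]

    json_artifacts = {
        "profile_metadata.json": metadata,
        "kernel_table.json": kernel_table,
        "overlap_opportunities.json": overlap_opportunities,
        "fuse_opportunities.json": fuse_opportunities,
    }
    for filename, payload in json_artifacts.items():
        path = out / filename
        # sort_keys keeps the output stable across runs, which makes diffs readable
        path.write_text(json.dumps(payload, indent=2, sort_keys=True) + "\n",
                        encoding="utf-8")
        written.append(path)
    return written
```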

Canonical Metadata Contract

profile_metadata.json should contain enough information to tie profiling results back to the dashboard and DB.

Minimum fields:

experiment_id
trial_id
observed_arch
arch_match
hardware_relevance_reason
rocm_version
base_image
resource_class
gpu_device_ids
gpu_clocks_mhz
preflight_passed
server_flags
benchmark_config_hash
model_name
profile_stage
source_trace_path

This is the difference between "a useful local notebook" and "a reusable profiling artifact".
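A contract like this is easy to enforce before an artifact is attached. The sketch below checks only field presence and the arch_match vocabulary; the function name `validate_profile_metadata` is hypothetical, and field types are deliberately left unchecked because this document does not define them.

```python
REQUIRED_METADATA_FIELDS = (
    "experiment_id", "trial_id", "observed_arch", "arch_match",
    "hardware_relevance_reason", "rocm_version", "base_image",
    "resource_class", "gpu_device_ids", "gpu_clocks_mhz",
    "preflight_passed", "server_flags", "benchmark_config_hash",
    "model_name", "profile_stage", "source_trace_path",
)

ALLOWED_ARCH_MATCH = frozenset({"exact", "compatible", "unknown"})


def validate_profile_metadata(metadata):
    """Return a list of human-readable problems; an empty list means valid."""
    errors = [
        f"missing field: {name}"
        for name in REQUIRED_METADATA_FIELDS
        if name not in metadata
    ]
    arch_match = metadata.get("arch_match")
    if "arch_match" in metadata and arch_match not in ALLOWED_ARCH_MATCH:
        errors.append(f"invalid arch_match: {arch_match!r}")
    return errors
```

Returning a list of problems rather than raising on the first one lets a caller report every gap in a single pass.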

Dashboard / DB Integration

The intended downstream path is:

run profile
emit structured artifacts
attach artifacts to experiment / trial
surface the summary and tables in the dashboard

This skill should feed:

experiment detail page profiling section
trial-level artifact list
trajectory context for optimization retries
future data-flywheel / SFT signals

See references/dashboard-integration.md.

MI355X / gfx950 Specific Guidance

On our node, prefer profiling plans that stay grounded in actual machine facts:

8x MI355X
gfx950
ROCm 7.2
explicit GPU ID allocation from the experiment
exact Docker image tag, not just "ROCm 7.2"

Never compare profiles across runs without also checking:

base_image
resource_class
gpu_device_ids
server_flags
benchmark_config_hash

Otherwise the comparison is not trustworthy.
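This check can be automated against two profile_metadata.json payloads. The sketch assumes that "checking" means each field must be present and equal in both runs; a looser policy (for example, allowing different gpu_device_ids within one resource class) would need a different rule. `comparison_mismatches` is a hypothetical name.

```python
COMPARISON_FIELDS = (
    "base_image",
    "resource_class",
    "gpu_device_ids",
    "server_flags",
    "benchmark_config_hash",
)


def comparison_mismatches(meta_a, meta_b):
    """Return {field: (value_a, value_b)} for every field that blocks comparison.

    A field blocks comparison when it is missing from either run or when the
    two values differ (assumed policy, see the note above).
    """
    mismatches = {}
    for field in COMPARISON_FIELDS:
        value_a = meta_a.get(field)
        value_b = meta_b.get(field)
        if value_a is None or value_b is None or value_a != value_b:
            mismatches[field] = (value_a, value_b)
    return mismatches
```

An empty result means the two profiles were produced under the same declared conditions; anything else should be surfaced next to the comparison rather than ignored.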

Suggested Rollout

Phase A: Analysis adaptation

Make the kernel classification and overlap heuristics ROCm-aware.

Phase B: Artifactization

Write the profiling outputs into stable JSON + Markdown artifacts.

Phase C: Live integration

Trigger profiling from real optimization stages and surface the artifacts in the dashboard.

Relationship to Other AMD Skills

rocprofv3-profiler: Use that skill when you need low-level AMD hardware counters or kernel-level bottleneck data. Use this skill when you need SGLang/vLLM trace triage tied back to Python/operator semantics.
env-probe: Run env-probe before profiling if you suspect hidden runtime defaults are skewing results.
rocm-crash-debug: Use rocm-crash-debug when the run is failing. Use this skill when the run is healthy enough to generate profiling evidence.

Reviewer Checklist

Before calling this skill "done", verify:

It is explicitly MI355X / gfx950 / ROCm-aware
It produces structured artifacts, not just console output
It carries experiment/trial linkage fields
It distinguishes arch_match programmatically
It can be attached to the dashboard and DB without ad-hoc parsing

Analyze SGLang and vLLM profiler traces on AMD ROCm systems, especially MI355X/gfx950 nodes. Adapted from the SGLang torch-profiler workflow: triage kernel breakdown, overlap headroom, and fuse opportunities, then write structured artifacts that can be attached to amdpilot experiments, trials, and dashboard views. Use when a run needs profiling, when an optimization trial should produce machine-readable profiling artifacts, or when the user asks why a ROCm workload is slow.

1 star

testing

Updated Apr 25, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add amdpilot-org/amd-skills rocm-profiler-analysis

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | What it checks | Status | Reported % |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 25, 2026, 5:18 AM (63.8s, 4 files scanned)

Related Skills

amdpilot-org/flydsl-kernel-authoring

development

FlyDSL is a Python DSL with MLIR-native backend for authoring custom AMD GPU kernels with explicit layout algebra (pre-installed at /opt/FlyDSL on images tagged *-flydsl:*). Use this skill when profiling identifies a hot per-row reduction (RMSNorm / LayerNorm / softmax), a fused elementwise chain (norm + residual add, activation + multiplier), or an unusual-shape grouped GEMM that the standard AMD backends (Triton / aiter / CK / hipBLASLt / TransformerEngine) don't serve well. Essential for any workload where Python/config/Triton-tuning gains have plateaued and the profile shows a custom kernel opportunity. Covers the `/opt/FlyDSL` availability check, the integration playbook (dispatcher + direct site-packages edit + autograd-safe output handling), kernel authoring patterns (elementwise via layout API, block reductions via wave_reduce_add, fused dx+dw designs, MFMA GEMM preshuffle), torchrun gotchas, and the critical rule that custom kernels typically only win end-to-end when stacked with `torch.compile(mode="default")`.

Updated Apr 25, 2026

amdpilot-org/skill-creator

tools

Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.

Updated Apr 25, 2026

amdpilot-org/rocprofv3-profiler

data-ai

Profile AMD GPU kernels using rocprofv3 and analyze performance bottlenecks. Use when the user wants to profile HIP/ROCm kernels, identify GPU performance issues, analyze hardware counters, or understand why a kernel is slow on AMD GPUs (MI100, MI200, MI300 series). Provides wrapper scripts for rocprofv3 execution and automated parsing of profiler output into structured, agent-friendly JSON with bottleneck classification.

Updated Apr 25, 2026

amdpilot-org/rocm-crash-debug

development

Debug ROCm/HIP kernel crashes in SGLang and vLLM on AMD GPUs (MI300X/MI325X/MI355X). Adapts SGLang's @debug_kernel_api kernel boundary logging to ROCm: captures input tensors before crash, tracks shapes/dtypes/values, dumps crash artifacts for offline analysis. Integrates with amdpilot executor failure_reason field and dashboard trajectory viewer. Triggered by: CUDA/HIP errors, illegal memory access, device-side assert, OOM kills, signal 137/139, NaN/Inf in outputs, "debug crash", "why did the trial fail".

Updated Apr 25, 2026

Install manually

# Clone the repo
git clone https://github.com/amdpilot-org/amd-skills.git

# Copy into Claude Code skills folder (global)
cp -r amd-skills/rocm-profiler-analysis ~/.claude/skills/

Repository: amdpilot-org/amd-skills (1 star)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT