amdpilot-org/rocprofv3-profiler

Name: rocprofv3-profiler
Author: amdpilot-org

rocprofv3-profiler/SKILL.md

npx skillsauth add amdpilot-org/amd-skills rocprofv3-profiler

rocprofv3-profiler

Profile AMD GPU applications and identify performance bottlenecks using rocprofv3.

Quick Start

1. Run Profiler

python3 scripts/rocprof_wrapper.py --mode counters -- ./your_app [args]

Modes:

counters (default): Collect key performance counters for bottleneck analysis
trace: Collect kernel execution traces (timing only)
full: Collect both counters and traces

Options:

--output-dir <dir>: Output directory (default: ./rocprof_output)
--counters <file>: Custom counter input file (optional)
--kernel <name>: Target specific kernel by name
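When the wrapper is driven from another script, it can help to assemble the command as an argument list rather than a shell string. The sketch below uses only the modes and options listed above; the helper name `build_wrapper_command` is hypothetical, and it assumes the wrapper accepts its options before the `--` separator, as in the Quick Start command.

```python
import shlex

VALID_MODES = ("counters", "trace", "full")


def build_wrapper_command(app_argv, mode="counters", output_dir=None,
                          counters=None, kernel=None):
    """Return an argv list for scripts/rocprof_wrapper.py (hypothetical helper)."""
    if mode not in VALID_MODES:
        raise ValueError(f"unknown mode: {mode!r}")
    if not app_argv:
        raise ValueError("app_argv must name the application to profile")
    cmd = ["python3", "scripts/rocprof_wrapper.py", "--mode", mode]
    if output_dir is not None:
        cmd += ["--output-dir", output_dir]
    if counters is not None:
        cmd += ["--counters", counters]
    if kernel is not None:
        cmd += ["--kernel", kernel]
    # Everything after "--" belongs to the profiled application.
    cmd += ["--", *app_argv]
    return cmd


cmd = build_wrapper_command(["./matrix_multiply", "1024"], kernel="matmul_kernel")
print(shlex.join(cmd))
```

The resulting list can be handed to `subprocess.run(cmd, check=True)` without going through a shell, which avoids quoting problems in kernel names or application arguments.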

2. Parse Results

python3 scripts/parse_profile.py <output_dir>

Returns structured JSON with:

Per-kernel metrics summary
Bottleneck classification (compute, memory, LDS, or latency bound, or balanced)
Optimization hints
Path to raw data

Example Workflow

# Profile application
python3 scripts/rocprof_wrapper.py --mode counters -- ./matrix_multiply 1024

# Parse and analyze
python3 scripts/parse_profile.py ./rocprof_output

Sample output:

{
  "kernels": [{
    "name": "matmul_kernel",
    "metrics": {
      "duration_ns": 145230,
      "occupancy_pct": 45.2,
      "valu_busy_pct": 78.5,
      "lds_bank_conflict_rate": 0.12,
      "l2_hit_rate": 0.65
    },
    "bottleneck": {
      "type": "memory_bound",
      "confidence": "high",
      "detail": "Low L2 hit rate (65%) with high memory stall cycles"
    }
  }],
  "raw_data_path": "./rocprof_output/pmc_1/counter_collection.csv"
}
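Because the parser emits JSON, its output can be consumed directly by another script. The following is a minimal sketch that assumes the report has the same field names as the sample output; `summarize` is a hypothetical helper, not part of the skill, and the embedded report is a trimmed copy of the sample.

```python
import json

# Trimmed copy of the sample output; field names are taken from it.
report_text = """
{
  "kernels": [{
    "name": "matmul_kernel",
    "metrics": {"duration_ns": 145230, "l2_hit_rate": 0.65},
    "bottleneck": {"type": "memory_bound", "confidence": "high"}
  }],
  "raw_data_path": "./rocprof_output/pmc_1/counter_collection.csv"
}
"""


def summarize(report):
    """Return one human-readable line per kernel in the report."""
    lines = []
    for kernel in report.get("kernels", []):
        bottleneck = kernel.get("bottleneck", {})
        duration_us = kernel["metrics"]["duration_ns"] / 1000
        lines.append(
            f'{kernel["name"]}: {bottleneck.get("type", "unknown")} '
            f'({bottleneck.get("confidence", "n/a")} confidence), '
            f"{duration_us:.1f} us"
        )
    return lines


for line in summarize(json.loads(report_text)):
    print(line)
```

In practice the text would come from the standard output of `parse_profile.py` rather than an embedded string.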

Bottleneck Classification

The parser classifies kernels into these categories:

| Bottleneck | Indicators |
|------------|------------|
| compute_bound | High VALU/MFMA busy, low memory stalls |
| memory_bound | High memory latency, low cache hit rates |
| lds_bound | High LDS bank conflicts or LDS instruction stalls |
| latency_bound | Low occupancy with high instruction latency |
| balanced | No single dominant bottleneck |
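To make the categories concrete, here is a rough sketch of how a rule-based classifier over the sample metrics could look. It is not the parser's actual logic: the thresholds, the rule order, and the function name `classify` are all assumptions chosen for illustration, and the real rules live in bottleneck_heuristics.md.

```python
def classify(metrics, *, lds_conflict_high=0.3, l2_hit_low=0.7,
             valu_busy_high=85.0, occupancy_low=30.0):
    """Map a metrics dict to a bottleneck label.

    All thresholds are hypothetical defaults, not values used by the skill.
    """
    if metrics["lds_bank_conflict_rate"] >= lds_conflict_high:
        return "lds_bound"
    if metrics["l2_hit_rate"] < l2_hit_low:
        return "memory_bound"
    if metrics["valu_busy_pct"] >= valu_busy_high:
        return "compute_bound"
    if metrics["occupancy_pct"] < occupancy_low:
        return "latency_bound"
    return "balanced"


sample = {
    "occupancy_pct": 45.2,
    "valu_busy_pct": 78.5,
    "lds_bank_conflict_rate": 0.12,
    "l2_hit_rate": 0.65,
}
print(classify(sample))
```

With these illustrative thresholds the sample kernel falls through the LDS rule and is caught by the L2 hit-rate rule, which agrees with the `memory_bound` label in the sample output.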

Reference Documentation

hardware_counters.md: Key AMD GPU counters and their meaning
bottleneck_heuristics.md: Detailed bottleneck classification rules

Direct rocprofv3 Usage

For advanced use cases, invoke rocprofv3 directly:

# List available counters
rocprofv3 -L

# Trace kernel execution
rocprofv3 --kernel-trace --stats -- ./app

# Collect specific counters
rocprofv3 -i counters.txt -- ./app

Counter input file format (counters.txt):

pmc: SQ_WAVES SQ_INSTS_VALU SQ_INSTS_VMEM
pmc: TCC_HIT TCC_MISS
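A counter file in this format can also be generated from a list of counter names. The sketch below splits the list into `pmc:` lines of bounded length; the helper name `format_pmc_lines` and the default of three counters per line are assumptions, since the number of counters that fit on one line depends on the GPU.

```python
def format_pmc_lines(counters, per_line=3):
    """Split counter names into 'pmc:' lines with at most per_line counters each."""
    if per_line < 1:
        raise ValueError("per_line must be at least 1")
    unique = list(dict.fromkeys(counters))  # drop repeats, keep first-seen order
    return [
        "pmc: " + " ".join(unique[i:i + per_line])
        for i in range(0, len(unique), per_line)
    ]


lines = format_pmc_lines(
    ["SQ_WAVES", "SQ_INSTS_VALU", "SQ_INSTS_VMEM", "TCC_HIT", "TCC_MISS"]
)
print("\n".join(lines))
```

For the five counters shown, this prints the same two lines as the example counters.txt; writing them to a file and passing it with `-i` or `--counters` is left to the caller.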

Troubleshooting

"rocprofv3 not found": Ensure ROCm is installed and /opt/rocm/bin is in PATH.

"No GPU detected": Check rocm-smi output and the ROCR_VISIBLE_DEVICES and HIP_VISIBLE_DEVICES environment variables, which can hide devices from the runtime.

Multi-pass collection: If too many counters are requested, rocprofv3 replays the kernel. Use fewer counters per pmc line.
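The first two checks can be automated before a profiling run. This is a minimal sketch using only the Python standard library; `preflight` is a hypothetical helper, and the list of visibility variables it inspects is an assumption that may need adjusting for a given ROCm version.

```python
import os
import shutil

# Assumed set of variables that can restrict which GPUs are visible.
VISIBILITY_VARS = ("ROCR_VISIBLE_DEVICES", "HIP_VISIBLE_DEVICES", "HSA_VISIBLE_DEVICES")


def preflight(env=None, which=shutil.which):
    """Return a list of notes about likely setup problems (empty if none found)."""
    env = os.environ if env is None else env
    notes = []
    if which("rocprofv3") is None:
        notes.append("rocprofv3 not found: check that ROCm is installed "
                     "and /opt/rocm/bin is in PATH")
    if which("rocm-smi") is None:
        notes.append("rocm-smi not found: cannot list GPUs")
    for name in VISIBILITY_VARS:
        if name in env:
            notes.append(f"{name}={env[name]!r} is set and may hide GPUs")
    return notes


for note in preflight():
    print(note)
```

The `env` and `which` parameters exist only so the check can be exercised without a ROCm installation; called with no arguments it inspects the real environment.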

Profile AMD GPU kernels using rocprofv3 and analyze performance bottlenecks. Use when the user wants to profile HIP/ROCm kernels, identify GPU performance issues, analyze hardware counters, or understand why a kernel is slow on AMD GPUs (MI100, MI200, MI300 series). Provides wrapper scripts for rocprofv3 execution and automated parsing of profiler output into structured, agent-friendly JSON with bottleneck classification.

1 star

data-ai

Updated Apr 25, 2026

npx skillsauth add amdpilot-org/amd-skills rocprofv3-profiler

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (Container and dependency vulnerability scanner): Clean, 95%
Semgrep (Static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (Open source security scanning): Skipped, 50%
Socket.dev (Supply chain security analysis): Skipped, 50%
VirusTotal (Multi-engine malware detection): Skipped, 50%
CrowdStrike (Advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 25, 2026, 5:19 AM · 56.6s · 5 files scanned

Related Skills

amdpilot-org/flydsl-kernel-authoring

development

Verified · Trusted · Community

FlyDSL is a Python DSL with MLIR-native backend for authoring custom AMD GPU kernels with explicit layout algebra (pre-installed at /opt/FlyDSL on images tagged *-flydsl:*). Use this skill when profiling identifies a hot per-row reduction (RMSNorm / LayerNorm / softmax), a fused elementwise chain (norm + residual add, activation + multiplier), or an unusual-shape grouped GEMM that the standard AMD backends (Triton / aiter / CK / hipBLASLt / TransformerEngine) don't serve well. Essential for any workload where Python/config/Triton-tuning gains have plateaued and the profile shows a custom kernel opportunity. Covers the `/opt/FlyDSL` availability check, the integration playbook (dispatcher + direct site-packages edit + autograd-safe output handling), kernel authoring patterns (elementwise via layout API, block reductions via wave_reduce_add, fused dx+dw designs, MFMA GEMM preshuffle), torchrun gotchas, and the critical rule that custom kernels typically only win end-to-end when stacked with `torch.compile(mode="default")`.

2 · SKILL.md · Updated Apr 25, 2026

amdpilot-org/skill-creator

tools

Verified · Trusted · Community

Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.

1 · SKILL.md · Updated Apr 25, 2026

amdpilot-org/rocm-profiler-analysis

testing

Verified · Trusted · Community

Analyze SGLang and vLLM profiler traces on AMD ROCm systems, especially MI355X/gfx950 nodes. Adapted from the SGLang torch-profiler workflow: triage kernel breakdown, overlap headroom, and fuse opportunities, then write structured artifacts that can be attached to amdpilot experiments, trials, and dashboard views. Use when a run needs profiling, when an optimization trial should produce machine-readable profiling artifacts, or when the user asks why a ROCm workload is slow.

1 · SKILL.md · Updated Apr 25, 2026

amdpilot-org/rocm-crash-debug

development

Verified · Trusted · Community

Debug ROCm/HIP kernel crashes in SGLang and vLLM on AMD GPUs (MI300X/MI325X/MI355X). Adapts SGLang's @debug_kernel_api kernel boundary logging to ROCm: captures input tensors before crash, tracks shapes/dtypes/values, dumps crash artifacts for offline analysis. Integrates with amdpilot executor failure_reason field and dashboard trajectory viewer. Triggered by: CUDA/HIP errors, illegal memory access, device-side assert, OOM kills, signal 137/139, NaN/Inf in outputs, "debug crash", "why did the trial fail".

1 · SKILL.md · Updated Apr 25, 2026

Install manually

# Clone the repo
git clone https://github.com/amdpilot-org/amd-skills.git

# Copy into Claude Code skills folder (global)
cp -r amd-skills/rocprofv3-profiler ~/.claude/skills/

Repository

amdpilot-org/amd-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT