arist12/env-probe

Name: env-probe
Author: arist12

env-probe/SKILL.md

npx skillsauth add arist12/amd-skills env-probe

AMD/ROCm Docker Environment Probe

Run this before writing any optimization code. AMD Docker images silently set framework defaults that differ from stock PyTorch. These hidden defaults cause stalls, crashes, and wrong results that are impossible to diagnose by looking at code alone.

Why This Exists

Problem: ROCm Docker images override PyTorch/Triton defaults at the system level. For example, with max_autotune=True as a global default, torch.compile(mode="default") benchmarks every GEMM across the ATEN, TRITON, and CPP backends. With hundreds of matmuls in a compiled graph, autotuning never finishes: the process hangs indefinitely with no error message.

These defaults are invisible to pip list, rocm-smi, or any surface-level inspection. You have to introspect the framework config objects at runtime to see them.
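A minimal sketch of that kind of runtime introspection follows. It is an illustration, not the code in references/env_probe.py: the helper name read_settings and the "<unavailable>" sentinel are assumptions made for this example. The module is imported by name, so the sketch also runs where PyTorch is not installed.

```python
import importlib

UNAVAILABLE = "<unavailable>"  # hypothetical sentinel for missing modules/attributes


def read_settings(module_name, dotted_attrs):
    """Return {attr_path: value} for each dotted attribute path under a module."""
    try:
        root = importlib.import_module(module_name)
    except ImportError:
        # Module (or one of its parents) is not installed in this environment.
        return {path: UNAVAILABLE for path in dotted_attrs}

    values = {}
    for path in dotted_attrs:
        obj = root
        try:
            # Walk "triton.cudagraphs" one attribute at a time.
            for part in path.split("."):
                obj = getattr(obj, part)
        except AttributeError:
            obj = UNAVAILABLE
        values[path] = obj
    return values


if __name__ == "__main__":
    inductor = read_settings(
        "torch._inductor.config",
        ["max_autotune", "max_autotune_gemm_backends", "triton.cudagraphs"],
    )
    for name, value in inductor.items():
        print(f"torch._inductor.config.{name} = {value!r}")
```

Reading the live config object shows the value the framework will actually use, which is what package listings cannot tell you.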

How to Use

Step 1: Run the probe script

python /path/to/env_probe.py

Alternatively, if the skill is installed as a Claude Code command, copy the probe script from references/env_probe.py into your Docker container and run it there.

The probe script is self-contained: it has no dependencies beyond PyTorch, which your Docker image already includes.

Step 2: Read the output

The probe outputs a structured report with three severity levels:

| Level | Meaning | Action |
|-------|---------|--------|
| CRITICAL | Will cause hangs, crashes, or silent wrong results | Must fix before proceeding |
| WARNING | Suboptimal default, will hurt performance | Fix before benchmarking |
| INFO | Informational, no action needed | Document for reproducibility |
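To make the three levels concrete, here is one way such a report could be modeled. The probe's real output format is not shown in this document, so the Severity, Finding, and format_report names and the line layout are all assumptions of this sketch.

```python
from dataclasses import dataclass
from enum import IntEnum


class Severity(IntEnum):
    CRITICAL = 0  # must fix before proceeding
    WARNING = 1   # fix before benchmarking
    INFO = 2      # document for reproducibility


@dataclass(frozen=True)
class Finding:
    severity: Severity
    name: str
    detail: str
    fix: str = ""  # recommended fix; left empty when no action is needed


def format_report(findings):
    """Render findings as text, most severe first."""
    lines = []
    for f in sorted(findings, key=lambda f: f.severity):
        line = f"[{f.severity.name}] {f.name}: {f.detail}"
        if f.fix:
            line += f" | fix: {f.fix}"
        lines.append(line)
    return "\n".join(lines)


# Example findings (made up for illustration, not real probe output).
report = format_report([
    Finding(Severity.INFO, "torch", "version recorded"),
    Finding(Severity.CRITICAL, "max_autotune", "enabled globally",
            "inductor_config.max_autotune = False"),
])
print(report)
```

Sorting by severity puts the items that block further work at the top of the report.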

Step 3: Apply fixes

Each CRITICAL/WARNING item includes a recommended fix — either a Python config line or an environment variable to set. Apply these fixes at the top of your script, before any torch.compile() or torch.cuda.CUDAGraph() call.
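Because the fixes have to land before the first compile, it can help to apply them through a single helper that also records what it changed. The sketch below is hypothetical (apply_overrides is not part of the skill) and uses a stand-in object so that it runs without PyTorch. The same getattr/setattr pattern should apply to an imported config module, but that is an assumption and is not exercised here.

```python
from types import SimpleNamespace


def apply_overrides(root, overrides):
    """Set each dotted attribute path on root; return the previous values."""
    previous = {}
    for path, value in overrides.items():
        *parents, leaf = path.split(".")
        target = root
        for part in parents:
            target = getattr(target, part)
        previous[path] = getattr(target, leaf)  # raises if the setting is unknown
        setattr(target, leaf, value)
    return previous


# Stand-in for a framework config object (illustrative values only).
fake_config = SimpleNamespace(
    max_autotune=True,
    triton=SimpleNamespace(cudagraphs=True),
)

old = apply_overrides(
    fake_config,
    {"max_autotune": False, "triton.cudagraphs": False},
)
print("changed from:", old)
```

Keeping the previous values makes it easy to note in an experiment log which defaults the image shipped with.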

What the Probe Checks

Category 1: Surface Facts (versions, hardware)

Python version, PyTorch version, Triton version
ROCm version, GPU architecture (gfx target)
AITER, Composable Kernel, flash-attn availability and versions
hipBLASLt availability
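A rough sketch of how these surface facts might be gathered is shown below. The distribution names passed to package_version are assumptions (a ROCm image may ship Triton or flash-attn under a different package name), and the function names are invented for this example rather than taken from env_probe.py.

```python
import importlib.metadata
import platform


def package_version(dist_name):
    """Return the installed version of a distribution, or None if absent."""
    try:
        return importlib.metadata.version(dist_name)
    except importlib.metadata.PackageNotFoundError:
        return None


def surface_facts(dist_names=("torch", "triton", "flash-attn", "aiter")):
    """Collect version and hardware facts; missing pieces are reported as None."""
    facts = {"python": platform.python_version()}
    for name in dist_names:
        facts[name] = package_version(name)

    facts["rocm"] = None
    facts["gfx_target"] = None
    try:
        import torch
    except ImportError:
        return facts

    # torch.version.hip is None on non-ROCm builds of PyTorch.
    facts["rocm"] = getattr(torch.version, "hip", None)
    if torch.cuda.is_available():
        props = torch.cuda.get_device_properties(0)
        # gcnArchName is only present on ROCm builds, hence the getattr default.
        facts["gfx_target"] = getattr(props, "gcnArchName", None)
    return facts


if __name__ == "__main__":
    for key, value in surface_facts().items():
        print(f"{key}: {value}")
```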

Category 2: Runtime Behavior Defaults (the hidden landmines)

torch._inductor.config.max_autotune: if True, torch.compile can stall indefinitely
torch._inductor.config.max_autotune_gemm_backends: which backends inductor will benchmark
torch._inductor.config.triton.cudagraphs: unstable on ROCm
torch._inductor.config.triton.cudagraph_trees: unstable on ROCm
torch._inductor.config.memory_planning: causes a deep recursion crash on ROCm
torch._dynamo.config.cache_size_limit: a limit that is too small causes recompilation loops
torch.backends.cudnn.benchmark and allow_tf32 defaults
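The comparison behind these checks can be sketched as a pure function over observed values. The recommended values below mirror the configuration block given later in this document. The function name and the choice to flag any cache_size_limit below 128 are assumptions of this sketch, and it deliberately does not assign severities, since the document does not state them per setting.

```python
# Values recommended in this document's inductor configuration block.
RECOMMENDED = {
    "max_autotune": False,
    "triton.cudagraphs": False,
    "triton.cudagraph_trees": False,
    "memory_planning": False,
}
MIN_CACHE_SIZE_LIMIT = 128  # assumption: anything lower is treated as too small


def flag_defaults(values):
    """Compare observed settings with the recommended ones.

    `values` maps setting names to observed values; missing keys are skipped.
    Returns a list of (setting, observed, recommended) tuples for mismatches.
    """
    mismatches = []
    for name, wanted in RECOMMENDED.items():
        if name in values and values[name] != wanted:
            mismatches.append((name, values[name], wanted))

    limit = values.get("cache_size_limit")
    if limit is not None and limit < MIN_CACHE_SIZE_LIMIT:
        mismatches.append(("cache_size_limit", limit, MIN_CACHE_SIZE_LIMIT))
    return mismatches


# Illustrative input, not measured from a real image.
observed = {"max_autotune": True, "triton.cudagraphs": False, "cache_size_limit": 8}
for setting, got, wanted in flag_defaults(observed):
    print(f"{setting}: observed {got!r}, recommended {wanted!r}")
```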

Category 3: Known Bug Markers

hipBLASLt solver discovery (HIPBLAS_STATUS_NOT_INITIALIZED)
FP8 flash attention availability
gfx950/gfx942 ASM GEMM kernel availability
AITER function signatures (argument combos that were broken in older versions)
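For the signature check, one possible approach is to inspect a callable before using it. The sketch below is generic: accepts_kwargs and fake_kernel are hypothetical names, and no real AITER function or argument is assumed. It only tells you whether the keywords are accepted, not whether a given combination works at runtime.

```python
import inspect


def accepts_kwargs(func, *names):
    """Return True if func can be called with all the given keyword names."""
    try:
        params = inspect.signature(func).parameters
    except (TypeError, ValueError):
        # Signature not introspectable (for example some C extensions).
        return False

    # A **kwargs parameter accepts any keyword.
    if any(p.kind is inspect.Parameter.VAR_KEYWORD for p in params.values()):
        return True

    usable = {
        inspect.Parameter.POSITIONAL_OR_KEYWORD,
        inspect.Parameter.KEYWORD_ONLY,
    }
    return all(n in params and params[n].kind in usable for n in names)


def fake_kernel(q, k, v, *, causal=False):
    """Stand-in for a library function whose signature changed between versions."""
    return None


print(accepts_kwargs(fake_kernel, "q", "causal"))
print(accepts_kwargs(fake_kernel, "softmax_scale"))
```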

Category 4: Environment Variables

HIP_VISIBLE_DEVICES, ROCR_VISIBLE_DEVICES
HSA_ENABLE_SDMA, HIP_FORCE_DEV_KERNARG
PYTORCH_TUNABLEOP_ENABLED, PYTORCH_TUNABLEOP_TUNING
TORCH_COMPILE_DEBUG, TORCHINDUCTOR_* overrides
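Collecting these variables needs only the standard library. In the sketch below the variable names come from the list above, while the function name collect_env and the returned dictionary shape are assumptions.

```python
import os

NAMED_VARS = (
    "HIP_VISIBLE_DEVICES",
    "ROCR_VISIBLE_DEVICES",
    "HSA_ENABLE_SDMA",
    "HIP_FORCE_DEV_KERNARG",
    "PYTORCH_TUNABLEOP_ENABLED",
    "PYTORCH_TUNABLEOP_TUNING",
    "TORCH_COMPILE_DEBUG",
)


def collect_env(environ=None):
    """Return the listed variables (None if unset) plus any TORCHINDUCTOR_* overrides."""
    environ = os.environ if environ is None else environ
    report = {name: environ.get(name) for name in NAMED_VARS}
    for name in sorted(environ):
        if name.startswith("TORCHINDUCTOR_"):
            report[name] = environ[name]
    return report


if __name__ == "__main__":
    for name, value in collect_env().items():
        print(f"{name}={value if value is not None else '<unset>'}")
```

Reporting unset variables explicitly, rather than omitting them, keeps two reports from different containers directly comparable.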

Recommended Inductor Configuration for ROCm

When the probe flags inductor defaults as CRITICAL, apply this configuration block before any torch.compile() call:

import torch._inductor.config as inductor_config
import torch._dynamo.config as dynamo_config

# Prevent indefinite GEMM autotuning stall
inductor_config.max_autotune = False
inductor_config.max_autotune_gemm_backends = "ATEN"

# Disable unstable triton cudagraphs on ROCm
inductor_config.triton.cudagraphs = False
inductor_config.triton.cudagraph_trees = False

# Prevent deep recursion crash
inductor_config.memory_planning = False

# Prevent cache eviction / recompilation loops
dynamo_config.cache_size_limit = 128

See references/inductor-rocm-defaults.md for the full explanation of each setting and when you might want to override them.

Integration with Other Skills

amd-rocm-porting: Run env-probe as Phase 0.5 (after Phase 0 environment setup, before Phase 1 porting)
amd-inference-optimization: Run env-probe before Phase 0 profiling baseline
rocprofv3-profiler: Probe checks that rocprofv3 is available and functional

Adding New Checks

When you discover a new Docker-specific gotcha, add it to references/env_probe.py:

Add the check function
Add it to the appropriate category in run_all_checks()
Include the severity level (CRITICAL/WARNING/INFO) and recommended fix
Document the failure mode (what happens if the agent doesn't know about this)
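The four steps above could look roughly like this. The internal layout of env_probe.py is not shown in this document, so the CHECKS registry, the register decorator, the result dictionary, and the run_all_checks signature are all hypothetical. The example check and its INFO severity are likewise made up for illustration.

```python
import os

# Hypothetical registry: one list of check functions per category.
CHECKS = {"surface": [], "runtime_defaults": [], "known_bugs": [], "env_vars": []}


def register(category):
    """Decorator that adds a check function to a category."""
    def wrap(func):
        CHECKS[category].append(func)
        return func
    return wrap


@register("env_vars")
def check_compile_debug(environ):
    """Failure mode: extra debug output is written during compilation
    without the agent realizing the flag was set by the image."""
    if environ.get("TORCH_COMPILE_DEBUG") == "1":
        return {
            "severity": "INFO",  # severity chosen for illustration only
            "name": "TORCH_COMPILE_DEBUG",
            "detail": "debug mode is enabled in the environment",
            "fix": "unset TORCH_COMPILE_DEBUG",
        }
    return None  # nothing to report


def run_all_checks(environ=None):
    """Run every registered check and collect the findings."""
    environ = dict(os.environ) if environ is None else environ
    findings = []
    for category, funcs in CHECKS.items():
        for func in funcs:
            result = func(environ)
            if result is not None:
                findings.append({"category": category, **result})
    return findings


print(run_all_checks({"TORCH_COMPILE_DEBUG": "1"}))
```

Putting the failure mode in the docstring keeps the documentation next to the check it describes.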

This skill is meant to grow — every experiment that hits an environment issue should contribute a new check back to the probe.

Inspect AMD/ROCm Docker runtime environment before writing any code. Use BEFORE torch.compile, CUDAGraph capture, or any kernel optimization. Detects hidden framework defaults (inductor max_autotune, triton.cudagraphs), known Docker-specific bugs (hipBLASLt solver crash, FP8 flash attn), and missing packages. Outputs CRITICAL/WARNING/INFO report with recommended fixes. Triggered by: starting work in an AMD Docker, "check environment", "why is torch.compile hanging", "env probe", Phase 0 of any AMD optimization experiment.

2 stars

development

Updated Apr 30, 2026

npx skillsauth add arist12/amd-skills env-probe

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status |
|---------|-------------|--------|
| Trivy | Container and dependency vulnerability scanner | Clean (95%) |
| Semgrep | Static code analysis for vulnerabilities | Clean (95%) |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean (95%) |
| Snyk (dep) | Open source security scanning | Skipped (50%) |
| Socket.dev | Supply chain security analysis | Skipped (50%) |
| VirusTotal | Multi-engine malware detection | Skipped (50%) |
| CrowdStrike | Advanced threat intelligence | Skipped (50%) |
| OSV-Scanner | Open Source Vulnerability database check | Skipped (50%) |
| OWASP Dep-Check | | Skipped (50%) |

Last scanned: Apr 25, 2026, 5:16 AM · 54.5s · 3 files scanned

Related Skills

arist12/skill-creator

tools

Verified · Trusted · Community

Guide for creating effective skills. This skill should be used when users want to create a new skill (or update an existing skill) that extends Claude's capabilities with specialized knowledge, workflows, or tool integrations.

2 · SKILL.md · Updated Apr 30, 2026

arist12/rocprofv3-profiler

data-ai

Verified · Trusted · Community

Profile AMD GPU kernels using rocprofv3 and analyze performance bottlenecks. Use when the user wants to profile HIP/ROCm kernels, identify GPU performance issues, analyze hardware counters, or understand why a kernel is slow on AMD GPUs (MI100, MI200, MI300 series). Provides wrapper scripts for rocprofv3 execution and automated parsing of profiler output into structured, agent-friendly JSON with bottleneck classification.

2 · SKILL.md · Updated Apr 30, 2026

arist12/rocm-profiler-analysis

testing

Verified · Trusted · Community

Analyze SGLang and vLLM profiler traces on AMD ROCm systems, especially MI355X/gfx950 nodes. Adapted from the SGLang torch-profiler workflow: triage kernel breakdown, overlap headroom, and fuse opportunities, then write structured artifacts that can be attached to amdpilot experiments, trials, and dashboard views. Use when a run needs profiling, when an optimization trial should produce machine-readable profiling artifacts, or when the user asks why a ROCm workload is slow.

2 · SKILL.md · Updated Apr 30, 2026

arist12/rocm-crash-debug

development

Verified · Trusted · Community

Debug ROCm/HIP kernel crashes in SGLang and vLLM on AMD GPUs (MI300X/MI325X/MI355X). Adapts SGLang's @debug_kernel_api kernel boundary logging to ROCm: captures input tensors before crash, tracks shapes/dtypes/values, dumps crash artifacts for offline analysis. Integrates with amdpilot executor failure_reason field and dashboard trajectory viewer. Triggered by: CUDA/HIP errors, illegal memory access, device-side assert, OOM kills, signal 137/139, NaN/Inf in outputs, "debug crash", "why did the trial fail".

2 · SKILL.md · Updated Apr 30, 2026

Install manually

# Clone the repo
git clone https://github.com/arist12/amd-skills.git

# Copy into Claude Code skills folder (global)
cp -r amd-skills/env-probe ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

arist12/amd-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT