AMD-AGI/docker-artifact-check

Name: docker-artifact-check
Author: AMD-AGI

skills/docker-artifact-check/SKILL.md

npx skillsauth add AMD-AGI/maxtext-slurm docker-artifact-check

AMD Training Docker Artifact Check

Inventory all key software in an AMD ROCm training container: versions, git hashes, branches, source code presence, and upstream repos.

Components to Check

| Component | Pip Package(s) | Typical Source Path | Upstream Repo |
|---|---|---|---|
| JAX | jax | /opt/jax/ | jax-ml/jax |
| jaxlib (contains XLA) | jaxlib | (built from /opt/xla/) | ROCm/xla |
| ROCm-JAX plugin | jax-rocm7-plugin, jax-rocm7-pjrt | /opt/rocm-jax/ | ROCm/rocm-jax |
| ROCm libraries | system debs | /workspace/rocm-libraries/ | ROCm/rocm-libraries |
| ROCm systems | system debs | /workspace/rocm-systems/ | ROCm/rocm-systems |
| MaxText | maxtext | /workspace/maxtext/ | ROCm/maxtext |
| RCCL | system deb + custom build | /workspace/rccl/ | ROCm/rccl |
| AMD-ANP | N/A | /workspace/amd-anp/ | ROCm/amd-anp |
| maxtext-slurm | N/A | /maxtext-slurm/ | AMD-AGI/maxtext-slurm |

Step-by-Step Workflow

Step 0: Check for build manifest

If the container has a build-time manifest, read it and skip to the Output Template — no probing needed.

if [[ -f /etc/build-manifest.json ]]; then
  jq . /etc/build-manifest.json
  # Done. Use the manifest to fill the Output Template directly.
  # Only continue with Steps 1-8 if the manifest is missing or incomplete.
fi
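
The decision in Step 0 can be sketched as a small helper. This is a minimal sketch, not part of the skill: the function name `manifest_usable` is hypothetical, and "usable" here only means the file exists, is non-empty, and (when `jq` is installed) parses as JSON. What counts as a complete manifest depends on the image's build process, which is not specified here.

```shell
# Hypothetical helper: succeed only if the manifest can be read as JSON.
# The default path is the one used in Step 0; pass another path to override.
manifest_usable() {
  manifest_path="${1:-/etc/build-manifest.json}"
  # -s: file exists and is not empty
  [ -s "$manifest_path" ] || return 1
  # Validate the JSON only when jq is available; otherwise accept the file as-is.
  if command -v jq >/dev/null 2>&1; then
    jq -e . "$manifest_path" >/dev/null 2>&1 || return 1
  fi
  return 0
}

if manifest_usable; then
  echo "manifest found: fill the Output Template from it"
else
  echo "no usable manifest: continue with Steps 1-8"
fi
```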

Step 0.5: Execution context

You must be running commands inside the target container. Common ways to get a shell:

# Running Slurm job — exec into the job's container on a compute node
srun --overlap --jobid=<JOBID> --pty bash

# Standalone container from an image
docker run --rm -it <IMAGE> bash

# Existing container
docker exec -it <CONTAINER_ID> bash

If you are an AI agent inside a container (e.g., via .host-cmd), the commands below run directly. If you are on the host, enter the container first.
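
To check which side of that boundary a shell is on, a heuristic like the following can help. It is a sketch under assumptions: Docker normally creates `/.dockerenv` and Podman creates `/run/.containerenv`, but other runtimes (for example those behind some Slurm container plugins) may create neither, so a negative result is not proof of being on the host. The function name `in_container` and its optional root-prefix argument are hypothetical.

```shell
# Hypothetical heuristic: return 0 if well-known container marker files exist.
# The optional argument is a root prefix, which makes the check easy to test.
in_container() {
  root_prefix="${1:-}"
  [ -e "$root_prefix/.dockerenv" ] || [ -e "$root_prefix/run/.containerenv" ]
}

if in_container; then
  echo "container markers found: run the commands below directly"
else
  echo "no container markers: enter the container first, or verify manually"
fi
```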

Step 1: Python packages — versions and git hashes

# JAX + jaxlib versions
python3 -c "import jax; print(jax.__version__, jax.__file__)"
python3 -c "import jaxlib; print(jaxlib.__version__, jaxlib.__file__)"

# All relevant pip packages
pip list 2>/dev/null | grep -iE "jax|rocm|xla|maxtext|flax|optax|transformer.engine|xprof"

# Detailed pip metadata
pip show jax jaxlib jax-rocm7-plugin jax-rocm7-pjrt maxtext transformer-engine xprof 2>/dev/null
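
To carry these versions into the Python Packages table of the Output Template, the `name==version` lines from `pip list --format=freeze` can be turned into table rows. A minimal sketch; the function name `freeze_to_rows` is hypothetical, and the git hash, source, and path columns are left blank to be filled from Steps 2 and 3.

```shell
# Hypothetical helper: read "name==version" lines on stdin and print rows
# matching the "Python Packages" table (last three columns left blank).
# Lines that do not have exactly one "==" separator are skipped.
freeze_to_rows() {
  awk -F'==' 'NF == 2 { printf "| %s | %s | | | |\n", $1, $2 }'
}

# Usage inside the container (same package filter as above):
#   pip list --format=freeze 2>/dev/null \
#     | grep -iE "jax|rocm|xla|maxtext|flax|optax|transformer.engine|xprof" \
#     | freeze_to_rows
```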

Step 2: Embedded git hashes from version files

JAX and jaxlib embed _git_hash in their version.py. Use Python to resolve paths dynamically (avoids hardcoding the Python version):

# JAX git hash
python3 -c "from jax.version import _git_hash; print('jax _git_hash:', _git_hash)"

# jaxlib git hash
python3 -c "from jaxlib.version import _git_hash; print('jaxlib _git_hash:', _git_hash)"

The ROCm plugin records build-time commits from three repos in an auto-generated file:

python3 -c "from jax_rocm7_plugin.commit_info import commit_info; import json; print(json.dumps(commit_info, indent=2))"
# Returns dict with keys: "ROCm/xla", "ROCm/rocm-jax", "jax"

# Fallback if the above fails (alternative location):
python3 -c "from jax_plugins.xla_rocm7.commit_info import commit_info; import json; print(json.dumps(commit_info, indent=2))"

Note: The "jax" hash in the plugin's commit_info.py may differ from the _git_hash of the installed JAX wheel. The plugin hash is the jax commit used at plugin build time; the wheel hash is the jax release commit.
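
Because the two hashes can legitimately differ, it helps to compare them explicitly rather than by eye. The sketch below treats an abbreviated hash as matching a full hash that it prefixes; the function name `same_commit` is hypothetical.

```shell
# Hypothetical helper: succeed if two git hashes refer to the same commit,
# allowing one to be an abbreviated prefix of the other.
same_commit() {
  hash_a="$1"
  hash_b="$2"
  # Empty values never match: an empty string is a prefix of everything.
  [ -n "$hash_a" ] && [ -n "$hash_b" ] || return 1
  case "$hash_a" in "$hash_b"*) return 0 ;; esac
  case "$hash_b" in "$hash_a"*) return 0 ;; esac
  return 1
}

# Usage inside the container:
#   wheel_hash=$(python3 -c "from jax.version import _git_hash; print(_git_hash)")
#   plugin_hash=$(python3 -c "from jax_rocm7_plugin.commit_info import commit_info; print(commit_info['jax'])")
#   same_commit "$wheel_hash" "$plugin_hash" || echo "plugin was built against a different jax commit"
```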

Step 3: Git repos present in container

Scan for .git directories at known source paths:

for d in /opt/jax /opt/xla /opt/rocm-jax /workspace/maxtext /workspace/rccl \
         /workspace/amd-anp /workspace/rocm-libraries /workspace/rocm-systems \
         /maxtext-slurm; do
  if [ -d "$d/.git" ]; then
    echo "=== $d ==="
    git -C "$d" log --oneline -1
    git -C "$d" rev-parse HEAD
    git -C "$d" symbolic-ref --short HEAD 2>/dev/null || echo "(detached)"
    git -C "$d" describe --tags --always 2>/dev/null
    git -C "$d" remote -v | head -2
  else
    echo "=== $d === NOT PRESENT"
  fi
done
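
The same git queries can emit rows for the Git Repos in Container table directly. A minimal sketch; `repo_row` is a hypothetical name, and it assumes `git` is installed in the container and that the remote of interest is called `origin`.

```shell
# Hypothetical helper: print one "Git Repos in Container" row for a checkout.
# Returns 1 and prints nothing if the path has no .git directory, so the
# caller can list it under "Source Paths NOT Present" instead.
repo_row() {
  repo_dir="$1"
  [ -d "$repo_dir/.git" ] || return 1
  repo_url=$(git -C "$repo_dir" remote get-url origin 2>/dev/null) || repo_url=""
  repo_hash=$(git -C "$repo_dir" rev-parse HEAD 2>/dev/null) || repo_hash=""
  # Branch name if on a branch, otherwise the nearest tag or short hash.
  repo_ref=$(git -C "$repo_dir" symbolic-ref --short HEAD 2>/dev/null) \
    || repo_ref=$(git -C "$repo_dir" describe --tags --always 2>/dev/null) \
    || repo_ref=""
  printf '| %s | %s | %s | %s |\n' "$repo_dir" "$repo_url" "$repo_hash" "$repo_ref"
}

for d in /opt/jax /opt/xla /opt/rocm-jax; do
  repo_row "$d" || echo "NOT PRESENT: $d"
done
```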

Step 4: ROCm system stack

# ROCm version
cat /opt/rocm*/.info/version 2>/dev/null

# HIP version
/opt/rocm/bin/hipcc --version 2>&1 | head -3

# rocm-smi
/opt/rocm/bin/rocm-smi --version 2>&1

# rocprofiler-sdk (v3)
/opt/rocm/bin/rocprofv3 --version 2>&1 | head -5

# rocprofiler v2 (legacy)
/opt/rocm/bin/rocprof --version 2>&1 | head -5

Step 5: ROCm library and system packages (debs)

# ROCm math/DNN libraries (from rocm-libraries monorepo)
dpkg -l 2>/dev/null | grep -iE "rocblas|rocfft|rocsolver|rocsparse|rocrand|rocprim|rocthrust|hipblas|hipfft|hipsolver|hipsparse|hipsparselt|miopen|comgr"

# ROCm system packages (from rocm-systems monorepo)
dpkg -l 2>/dev/null | grep -iE "rocprof|roctracer|rccl|hip-runtime|hsa-rocr|amd-smi|hipcc"
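
For the ROCm System Packages table, the columns of `dpkg -l` can be reduced to package name and version. A minimal sketch; `dpkg_to_rows` is a hypothetical name, and it keeps only lines whose status field is `ii` (installed), leaving the Notes column blank.

```shell
# Hypothetical helper: read `dpkg -l` output on stdin and print rows for the
# "ROCm System Packages" table. Only fully installed packages ("ii") are kept.
dpkg_to_rows() {
  awk '$1 == "ii" { printf "| %s | %s | |\n", $2, $3 }'
}

# Usage inside the container:
#   dpkg -l 2>/dev/null | grep -iE "rocblas|hipblaslt|miopen|rccl" | dpkg_to_rows
```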

Step 6: Additional infrastructure

# OpenMPI
mpirun --version 2>/dev/null

# UCX
ls /workspace/ucx-*/ 2>/dev/null

# Python
python3 --version

# Venv location
echo "VIRTUAL_ENV=$VIRTUAL_ENV"; ls /opt/venv/ 2>/dev/null

Step 7: Runtime-critical environment variables

These env vars materially change which library loads and how the GPU stack behaves. Two containers with identical packages but different env vars can perform very differently.

# Library resolution order (determines which librccl.so, libhipblaslt.so, etc. wins)
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"

# XLA compiler flags
echo "XLA_FLAGS=$XLA_FLAGS"

# RCCL / NCCL tuning
env | grep -iE "^NCCL_|^RCCL_" | sort

# ROCm / HIP / HSA flags
env | grep -iE "^ROCM_|^HIP_|^HSA_|^GPU_MAX_HW_QUEUES" | sort

# Transformer Engine flags
env | grep -iE "^NVTE_" | sort

# JAX memory and client config
env | grep -iE "^XLA_PYTHON_CLIENT|^JAX_" | sort
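
These variables can be captured straight into the Runtime-Critical Environment Variables table. A minimal sketch; `env_to_rows` is a hypothetical name, and it assumes each variable fits on one line (a value containing newlines would be split across rows).

```shell
# Hypothetical helper: read NAME=value lines on stdin and print table rows.
# Splits on the first "=" only, so values that contain "=" themselves
# (as XLA flag strings usually do) are kept intact.
env_to_rows() {
  while IFS= read -r env_line; do
    case "$env_line" in
      *=*) printf '| %s | %s |\n' "${env_line%%=*}" "${env_line#*=}" ;;
    esac
  done
}

# Usage inside the container:
#   env | grep -E "^(LD_LIBRARY_PATH|XLA_FLAGS|NCCL_|RCCL_|HIP_|HSA_|NVTE_|JAX_)" | sort | env_to_rows
```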

Step 8: Custom-built libraries

Check for libraries built from source alongside system installs:

# Custom RCCL build (vs system /opt/rocm/lib/librccl.so)
find /workspace/rccl -name "librccl*.so*" -type f 2>/dev/null

# Custom hipBLASLt (check if version differs from standard ROCm)
dpkg -l 2>/dev/null | grep -i hipblaslt

# AMD-ANP plugin
ls /opt/rocm/lib/librccl-anp.so 2>/dev/null
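
When both a custom and a system build of the same library are present, it is worth checking which one `LD_LIBRARY_PATH` would select. The sketch below walks the path list in order and prints the first match. It is an approximation under assumptions: the real dynamic loader also consults RPATH/RUNPATH entries and the `ld.so` cache, so treat the result as a hint. `first_lib_match` is a hypothetical name.

```shell
# Hypothetical helper: print the first file named $1 found in the
# colon-separated directory list $2 (default: $LD_LIBRARY_PATH).
# The body runs in a subshell so the IFS change does not leak to the caller.
first_lib_match() (
  lib_name="$1"
  IFS=:
  for lib_dir in ${2:-${LD_LIBRARY_PATH:-}}; do
    if [ -n "$lib_dir" ] && [ -e "$lib_dir/$lib_name" ]; then
      printf '%s\n' "$lib_dir/$lib_name"
      exit 0
    fi
  done
  exit 1
)

first_lib_match librccl.so || echo "librccl.so not found via LD_LIBRARY_PATH"
```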

Output Template

Present results in this format:

## Container Environment Summary

**Docker Image**: [image name from container_env.sh or user]
**Base OS**: [distro + version]
**Python**: [version] at [venv path]
**ROCm**: [version] at /opt/rocm-X.Y.Z/

### Python Packages

| Package | Version | Git Hash | Source in Container? | Path |
|---|---|---|---|---|

### ROCm System Packages

| Package | Version (dpkg) | Notes |
|---|---|---|

### Git Repos in Container

| Path | Repo | Git Hash | Branch/Tag |
|---|---|---|---|

### Runtime-Critical Environment Variables

| Variable | Value |
|---|---|

### Source Paths NOT Present (need cloning)

| Expected Path | Repo URL | Known Hash |
|---|---|---|

Monorepo Mapping

When the user needs source for a specific ROCm library, use this mapping to find the monorepo that contains it:

ROCm/rocm-libraries (/workspace/rocm-libraries/):
  projects/: rocblas, rocfft, rocsolver, rocsparse, rocrand, rocprim, rocthrust, hipblas, hipblaslt, hipcub, hipfft, hiprand, hipsolver, hipsparse, hipsparselt, hiptensor, miopen, composablekernel, rocwmma
  shared/: tensile, rocroller, origami, mxdatagenerator

ROCm/rocm-systems (/workspace/rocm-systems/):
  projects/: rocprofiler, rocprofiler-sdk, rocprofiler-register, rocprofiler-compute, rocprofiler-systems, roctracer, rccl, clr, hip, hipother, hip-tests, rocr-runtime (rocrruntime), rocminfo, rocm-core, rocmsmilib, amdsmi, aqlprofile, rdc, rocshmem
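
The mapping above can be encoded as a lookup so it does not have to be recalled from memory. A minimal sketch; `monorepo_for` is a hypothetical name, and the component lists are copied from the mapping as written, so they may lag behind the upstream repositories.

```shell
# Hypothetical helper: print the monorepo and top-level directory that hold a
# given ROCm component, using the mapping listed above.
monorepo_for() {
  case "$1" in
    rocblas|rocfft|rocsolver|rocsparse|rocrand|rocprim|rocthrust|hipblas|hipblaslt|hipcub|hipfft|hiprand|hipsolver|hipsparse|hipsparselt|hiptensor|miopen|composablekernel|rocwmma)
      echo "ROCm/rocm-libraries projects/" ;;
    tensile|rocroller|origami|mxdatagenerator)
      echo "ROCm/rocm-libraries shared/" ;;
    rocprofiler|rocprofiler-sdk|rocprofiler-register|rocprofiler-compute|rocprofiler-systems|roctracer|rccl|clr|hip|hipother|hip-tests|rocr-runtime|rocrruntime|rocminfo|rocm-core|rocmsmilib|amdsmi|aqlprofile|rdc|rocshmem)
      echo "ROCm/rocm-systems projects/" ;;
    *)
      echo "unknown component: $1" >&2
      return 1 ;;
  esac
}

monorepo_for hipblaslt   # prints: ROCm/rocm-libraries projects/
monorepo_for rccl        # prints: ROCm/rocm-systems projects/
```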

Notes

The jax-rocm7-* package names encode the ROCm major version (7). Future containers with ROCm 8 would use jax-rocm8-*.
A jaxlib version containing +selfbuilt means it was compiled from source but the source tree was not retained.
The venv path may vary; check $VIRTUAL_ENV or look under /opt/venv/.
The site-packages path depends on the Python version (e.g., python3.12). Adjust grep paths accordingly.
hipBLASLt is often custom-built (its version hash differs from the standard ROCm release).
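
The naming rule and the `+selfbuilt` marker from these notes can be applied mechanically. A minimal sketch, assuming the ROCm version string starts with the major version followed by a dot (as in `7.0.0`); `rocm_plugin_packages` and `is_selfbuilt` are hypothetical names, and the version strings in the example are illustrative.

```shell
# Hypothetical helper: derive the plugin package names from a ROCm version.
rocm_plugin_packages() {
  rocm_major="${1%%.*}"   # keep everything before the first dot
  echo "jax-rocm${rocm_major}-plugin jax-rocm${rocm_major}-pjrt"
}

# Hypothetical helper: succeed if a jaxlib version string marks a source build.
is_selfbuilt() {
  case "$1" in
    *+selfbuilt*) return 0 ;;
    *) return 1 ;;
  esac
}

rocm_plugin_packages 7.0.0   # prints: jax-rocm7-plugin jax-rocm7-pjrt
```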

Audit AMD ROCm training Docker containers for installed software versions, git hashes, branches, source code, and repo links. Use when the user asks to analyze a container environment, check software versions, find git hashes, or inventory installed AMD/ROCm/JAX/MaxText artifacts.

27 stars

development

Updated Apr 23, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add AMD-AGI/maxtext-slurm docker-artifact-check

Security Scan Results

3 of 9 scanners reported clean.

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status |
|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean (95%) |
| Semgrep | Static code analysis for vulnerabilities | Clean (95%) |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean (95%) |
| Snyk (dep) | Open source security scanning | Skipped (50%) |
| Socket.dev | Supply chain security analysis | Skipped (50%) |
| VirusTotal | Multi-engine malware detection | Skipped (50%) |
| CrowdStrike | Advanced threat intelligence | Skipped (50%) |
| OSV-Scanner | Open Source Vulnerability database check | Skipped (50%) |
| OWASP Dep-Check | | Skipped (50%) |

Last scanned: Apr 23, 2026, 6:59 PM · 37.2s · 1 file scanned

Related Skills

AMD-AGI/pre-commit-audit

tools

Comprehensive pre-commit verification checklist with five independent responsibilities. (1) Launcher path coverage - verify a change to any launcher-chain file preserves correct behavior across all 16 combinations of entry point × launch mode × stack (Steps 1-4 + 5.1). (2) Ancillary scripts smoke - syntax / help / read-only / caller checks for any `.sh` or `.py` outside the launcher chain (Step 5.2; covers analysis utilities, sourced libraries, debug helpers, sweep tooling). (3) Code quality and design review (Step 6) - propose-first surface of code smells (duplication, long functions, magic numbers, deep nesting, unclear naming, primitive obsession, etc.) and design-decay signals (5th case in a switch, N-th env-var read, hand-rolled retry loops); auto-fix mechanical findings, hold design-shaped ones for explicit go-ahead. (4) Docs / comments / format-consistency (Step 7) - check any commit for stale prose, trailing-comment alignment drift, broken anchors / missing files in links, drifted cross-references, and this skill itself drifting from the code it describes. (5) Sensitive-info leak scan (Step 8) - cluster hostnames, internal IPs, vendor mount paths, hard-coded credentials, internal job IDs; final security gate. Trigger keywords - "verify all launcher paths", "trace launcher chain", "audit entry × launch × stack", "path coverage", "(entry × launch × stack) matrix", "post-launch teardown verification", "pre-commit audit", "before commit", "ready to commit", "verify scripts / utils not broken", "smoke-test the changed scripts", "any utility script broken", "code quality", "design review", "code smells", "tighten and polish", "avoid quality decay", "revisit design choice", "scrub leaks", "check for sensitive info before commit", "any docs or skills need update", "any stale comments", "any inaccurate comments", "comment alignment", "link policy", "broken anchors". 
Use when modifying `_train.sh`, `_train_with_ray.sh`, `_ray_actor.py`, `_container.sh`, `_job.sbatch`, `_k8s_job.sh`, `in_container_run.sh`, `run_local.sh`, `submit.sh`, `k8s_submit.sh`, `utils/run_setup.sh`, `utils/ray_cluster.sh`, `utils/monkey_patch_maxtext.py`, `utils/coredump.sh`, `utils/stage_timeout.sh`, or anywhere else in the launcher chain. Also use proactively before opening any PR (Steps 5.2, 6, 7, 8 apply universally to all changes that touch code / docs / comments), when investigating a path-specific bug ("this only happens in K8s + 1-gpu-per-process"), after adding a new entry point / launch mode / stack option, after touching any analysis utility (`utils/analyze_job.py`, `utils/perf_server.py`, `utils/profile_drill.py`, `utils/slurm_job_monitor.sh`, etc.), or after editing any doc or skill in the repo (Step 7 catches cross-reference drift).

27 stars · SKILL.md · Updated May 10, 2026

AMD-AGI/xla-tuning

testing

Find the XLA flag / NCCL env-var combination that maximizes steady-state TGS for one (model × parallelism) cell. Produces an evidence-backed leaderboard, mechanistic explanation of the winning flag, and a deployment recipe. Use when the user asks to tune XLA flags, tune NCCL, find best collective-permute / all-gather threshold, optimize FSDP/PP/TP, close a parallelism-vs-parallelism throughput gap, or sweep cross-iteration prefetch / overlap-limit / async-stream-priority knobs for a specific model.

27 stars · SKILL.md · Updated May 3, 2026

AMD-AGI/tsdb-diagnosis

testing

Diagnose training job incidents and check cluster health using the per-job Prometheus TSDB. Use when the user asks to diagnose a failure root cause, check GPU/network health, query Prometheus metrics, investigate a hang, or when the triage skill recommends deeper TSDB analysis.

27 stars · SKILL.md · Updated Apr 23, 2026

AMD-AGI/telegram

testing

Use Telegram as the agent's I/O channel. Once triggered, the agent enters a REPL state — reading instructions from TG, executing them, printing results back to TG, and looping. Use when the user asks to be notified, messaged, or alerted via Telegram, or wants to interact with the agent through TG. This is a cross-cutting skill — other skills (batch-sweep, model-config, job-triage) can trigger it when the user explicitly requests it.

27 stars · SKILL.md · Updated Apr 23, 2026

Install manually

# Clone the repo
git clone https://github.com/AMD-AGI/maxtext-slurm.git

# Copy into Claude Code skills folder (global)
cp -r maxtext-slurm/skills/docker-artifact-check ~/.claude/skills/

Repository: AMD-AGI/maxtext-slurm (27 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT