skills/docker-artifact-check/SKILL.md
Audit AMD ROCm training Docker containers for installed software versions, git hashes, branches, source code, and repo links. Use when the user asks to analyze a container environment, check software versions, find git hashes, or inventory installed AMD/ROCm/JAX/MaxText artifacts.
Inventory all key software in an AMD ROCm training container: versions, git hashes, branches, source code presence, and upstream repos.
| Component | Pip Package(s) | Typical Source Path | Upstream Repo |
|---|---|---|---|
| JAX | jax | /opt/jax/ | jax-ml/jax |
| jaxlib (contains XLA) | jaxlib | (built from /opt/xla/) | ROCm/xla |
| ROCm-JAX plugin | jax-rocm7-plugin, jax-rocm7-pjrt | /opt/rocm-jax/ | ROCm/rocm-jax |
| ROCm libraries | system debs | /workspace/rocm-libraries/ | ROCm/rocm-libraries |
| ROCm systems | system debs | /workspace/rocm-systems/ | ROCm/rocm-systems |
| MaxText | maxtext | /workspace/maxtext/ | ROCm/maxtext |
| RCCL | system deb + custom build | /workspace/rccl/ | ROCm/rccl |
| AMD-ANP | N/A | /workspace/amd-anp/ | ROCm/amd-anp |
| maxtext-slurm | N/A | /maxtext-slurm/ | AMD-AGI/maxtext-slurm |
If the container has a build-time manifest, read it and skip to the Output Template — no probing needed.
if [[ -f /etc/build-manifest.json ]]; then
jq . /etc/build-manifest.json
# Done. Use the manifest to fill the Output Template directly.
# Only continue with Steps 1-8 if the manifest is missing or incomplete.
fi
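When the manifest exists but may be incomplete, a small check can decide whether Steps 1-8 are still needed. The sketch below assumes a hypothetical layout in which the manifest is a JSON object keyed by component name; the real schema of /etc/build-manifest.json is not specified here, so the `EXPECTED` keys are placeholders to adapt.

```python
import json
from pathlib import Path

# Hypothetical component keys; the real manifest schema may differ.
EXPECTED = ["jax", "jaxlib", "rocm-jax", "maxtext", "rccl"]


def missing_components(manifest_path, expected=EXPECTED):
    """Return expected keys absent from the manifest (all of them if unreadable)."""
    path = Path(manifest_path)
    if not path.is_file():
        return list(expected)
    try:
        data = json.loads(path.read_text())
    except (OSError, json.JSONDecodeError):
        return list(expected)
    if not isinstance(data, dict):
        return list(expected)
    return [name for name in expected if name not in data]


if __name__ == "__main__":
    gaps = missing_components("/etc/build-manifest.json")
    print("manifest complete" if not gaps else f"probe manually for: {gaps}")
```

An empty result means the manifest covers every expected component; anything else names the components that still need manual probing.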
You must be running commands inside the target container. Common ways to get a shell:
# Running Slurm job — exec into the job's container on a compute node
srun --overlap --jobid=<JOBID> --pty bash
# Standalone container from an image
docker run --rm -it <IMAGE> bash
# Existing container
docker exec -it <CONTAINER_ID> bash
If you are an AI agent inside a container (e.g., via .host-cmd), the commands below run directly. If you are on the host, enter the container first.
# JAX + jaxlib versions
python3 -c "import jax; print(jax.__version__, jax.__file__)"
python3 -c "import jaxlib; print(jaxlib.__version__, jaxlib.__file__)"
# All relevant pip packages
pip list 2>/dev/null | grep -iE "jax|rocm|xla|maxtext|flax|optax|transformer.engine|xprof"
# Detailed pip metadata
pip show jax jaxlib jax-rocm7-plugin jax-rocm7-pjrt maxtext transformer-engine xprof 2>/dev/null
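The same version inventory can be gathered from Python without parsing pip output. This is a minimal sketch using the standard-library `importlib.metadata`; the package list is copied from the `pip show` command above, and a package that is not installed is reported as `None` rather than raising.

```python
from importlib import metadata

# Distribution names taken from the pip commands above.
PACKAGES = ["jax", "jaxlib", "jax-rocm7-plugin", "jax-rocm7-pjrt",
            "maxtext", "transformer-engine", "xprof"]


def installed_versions(names=PACKAGES):
    """Map each distribution name to its version, or None if not installed."""
    versions = {}
    for name in names:
        try:
            versions[name] = metadata.version(name)
        except metadata.PackageNotFoundError:
            versions[name] = None
    return versions


if __name__ == "__main__":
    for name, version in installed_versions().items():
        print(f"{name:22} {version or 'NOT INSTALLED'}")
```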
JAX and jaxlib embed _git_hash in their version.py. Use Python to resolve paths dynamically (avoids hardcoding the Python version):
# JAX git hash
python3 -c "from jax.version import _git_hash; print('jax _git_hash:', _git_hash)"
# jaxlib git hash
python3 -c "from jaxlib.version import _git_hash; print('jaxlib _git_hash:', _git_hash)"
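The one-liners above raise an exception when a package is missing or was built without `_git_hash`. A tolerant variant, sketched here with a hypothetical helper name `read_git_hash`, returns `None` in both cases so an audit script can keep going.

```python
import importlib


def read_git_hash(package):
    """Return <package>.version._git_hash, or None if it cannot be read."""
    try:
        module = importlib.import_module(f"{package}.version")
    except ImportError:
        # Package (or its version module) is not installed.
        return None
    return getattr(module, "_git_hash", None)


if __name__ == "__main__":
    for package in ("jax", "jaxlib"):
        print(f"{package} _git_hash:", read_git_hash(package))
```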
The ROCm plugin records build-time commits from three repos in an auto-generated file:
python3 -c "from jax_rocm7_plugin.commit_info import commit_info; import json; print(json.dumps(commit_info, indent=2))"
# Returns dict with keys: "ROCm/xla", "ROCm/rocm-jax", "jax"
# Fallback if the above fails (alternative location):
python3 -c "from jax_plugins.xla_rocm7.commit_info import commit_info; import json; print(json.dumps(commit_info, indent=2))"
Note: The "jax" hash in the plugin's commit_info.py may differ from the installed JAX wheel's _git_hash. The plugin hash is the jax commit used at plugin build time; the wheel hash is the jax release commit. A mismatch is therefore expected in some containers and is not, by itself, an error.
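To record whether the two jax hashes agree, a small comparison is enough. The sketch below assumes the `commit_info` dict has the keys listed above and that either hash may be abbreviated, so it treats a prefix match as a match; that tolerance is an assumption, not documented plugin behavior.

```python
def compare_jax_hashes(commit_info, wheel_hash):
    """Classify the plugin's recorded jax commit against the wheel's _git_hash."""
    plugin_hash = commit_info.get("jax")
    if not plugin_hash or not wheel_hash:
        return "unknown"
    a = plugin_hash.strip().lower()
    b = wheel_hash.strip().lower()
    # Assumption: one hash may be an abbreviated form of the other.
    if a.startswith(b) or b.startswith(a):
        return "match"
    return "differ"
```

Report the result in the Notes column of the output rather than failing the audit on "differ".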
Scan for .git directories at known source paths:
for d in /opt/jax /opt/xla /opt/rocm-jax /workspace/maxtext /workspace/rccl \
/workspace/amd-anp /workspace/rocm-libraries /workspace/rocm-systems \
/maxtext-slurm; do
if [ -d "$d/.git" ]; then
echo "=== $d ==="
git -C "$d" log --oneline -1
git -C "$d" rev-parse HEAD
git -C "$d" symbolic-ref --short HEAD 2>/dev/null || echo "(detached)"
git -C "$d" describe --tags --always 2>/dev/null
git -C "$d" remote -v | head -2
else
echo "=== $d === NOT PRESENT"
fi
done
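For structured output, the loop above can be mirrored in Python. This sketch shells out to the same git subcommands and returns one dict per path; it assumes the remote of interest is named `origin`, which the shell loop does not require.

```python
import subprocess
from pathlib import Path


def git(repo, *args):
    """Run a git command in repo; return stripped stdout, or None on failure."""
    try:
        out = subprocess.run(["git", "-C", str(repo), *args],
                             capture_output=True, text=True, check=True)
    except (OSError, subprocess.CalledProcessError):
        return None
    return out.stdout.strip()


def probe_repo(path):
    """Collect hash, branch, describe output, and remote for one source path."""
    repo = Path(path)
    if not (repo / ".git").is_dir():
        return {"path": str(repo), "present": False}
    return {
        "path": str(repo),
        "present": True,
        "hash": git(repo, "rev-parse", "HEAD"),
        # symbolic-ref fails on a detached HEAD, matching the shell fallback.
        "branch": git(repo, "symbolic-ref", "--short", "HEAD") or "(detached)",
        "describe": git(repo, "describe", "--tags", "--always"),
        "remote": git(repo, "remote", "get-url", "origin"),  # assumes "origin"
    }
```

Paths reported with `present: False` belong in the "Source Paths NOT Present" table of the output.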
# ROCm version
cat /opt/rocm*/.info/version 2>/dev/null
# HIP version
/opt/rocm/bin/hipcc --version 2>&1 | head -3
# rocm-smi
/opt/rocm/bin/rocm-smi --version 2>&1
# rocprofiler-sdk (v3)
/opt/rocm/bin/rocprofv3 --version 2>&1 | head -5
# rocprofiler v2 (legacy)
/opt/rocm/bin/rocprof --version 2>&1 | head -5
# ROCm math/DNN libraries (from rocm-libraries monorepo)
dpkg -l 2>/dev/null | grep -iE "rocblas|rocfft|rocsolver|rocsparse|rocrand|rocprim|rocthrust|hipblas|hipfft|hipsolver|hipsparse|hipsparselt|miopen|comgr"
# ROCm system packages (from rocm-systems monorepo)
dpkg -l 2>/dev/null | grep -iE "rocprof|roctracer|rccl|hip-runtime|hsa-rocr|amd-smi|hipcc"
# OpenMPI
mpirun --version 2>/dev/null
# UCX
ls -d /workspace/ucx-*/ 2>/dev/null
# Python
python3 --version
# Venv location
echo $VIRTUAL_ENV; ls /opt/venv/ 2>/dev/null
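The `dpkg -l` listings above can be turned into (name, version) rows for the output tables. The parser below is a sketch that assumes the standard `dpkg -l` column layout (status, name, version, architecture, description) and keeps only fully installed packages (status `ii`). Unlike the grep commands, it matches the pattern against the package name only, not the description.

```python
import re


def parse_dpkg_list(text, pattern):
    """Return (name, version) pairs for installed packages whose name matches pattern."""
    wanted = re.compile(pattern, re.IGNORECASE)
    rows = []
    for line in text.splitlines():
        fields = line.split(None, 4)
        # Installed packages have status "ii"; skip headers and removed entries.
        if len(fields) < 3 or fields[0] != "ii":
            continue
        name, version = fields[1], fields[2]
        if wanted.search(name):
            rows.append((name, version))
    return rows
```

Feed it the captured output of `dpkg -l`, for example via `subprocess.run(["dpkg", "-l"], capture_output=True, text=True).stdout`.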
These env vars materially change which library loads and how the GPU stack behaves. Two containers with identical packages but different env vars can perform very differently.
# Library resolution order (determines which librccl.so, libhipblaslt.so, etc. wins)
echo "LD_LIBRARY_PATH=$LD_LIBRARY_PATH"
# XLA compiler flags
echo "XLA_FLAGS=$XLA_FLAGS"
# RCCL / NCCL tuning
env | grep -iE "^NCCL_|^RCCL_" | sort
# ROCm / HIP / HSA flags
env | grep -iE "^ROCM_|^HIP_|^HSA_|^GPU_MAX_HW_QUEUES" | sort
# Transformer Engine flags
env | grep -iE "^NVTE_" | sort
# JAX memory and client config
env | grep -iE "^XLA_PYTHON_CLIENT|^JAX_" | sort
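Because two containers can differ only in these variables, it helps to capture them as data and diff them. The sketch below collects the variables matched by the grep patterns above and compares two captures; the function names are illustrative.

```python
import os

# Prefixes and exact names taken from the commands above.
PREFIXES = ("NCCL_", "RCCL_", "ROCM_", "HIP_", "HSA_", "GPU_MAX_HW_QUEUES",
            "NVTE_", "XLA_PYTHON_CLIENT", "JAX_")
EXACT = ("LD_LIBRARY_PATH", "XLA_FLAGS")


def runtime_env(environ=None):
    """Return the runtime-critical variables, sorted by name."""
    environ = os.environ if environ is None else environ
    return {name: value for name, value in sorted(environ.items())
            if name in EXACT or name.upper().startswith(PREFIXES)}


def diff_env(a, b):
    """Return {name: (value_in_a, value_in_b)} for variables that differ."""
    return {name: (a.get(name), b.get(name))
            for name in sorted(set(a) | set(b))
            if a.get(name) != b.get(name)}
```

Dump `runtime_env()` to JSON inside each container, then run `diff_env` on the two dumps to see exactly which settings differ.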
Check for libraries built from source alongside system installs:
# Custom RCCL build (vs system /opt/rocm/lib/librccl.so)
find /workspace/rccl -name "librccl*.so*" -type f 2>/dev/null
# Custom hipBLASLt (check if version differs from standard ROCm)
dpkg -l 2>/dev/null | grep -i hipblaslt
# AMD-ANP plugin
ls /opt/rocm/lib/librccl-anp.so 2>/dev/null
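When both a custom build and a system copy exist, the directory order in LD_LIBRARY_PATH decides which one is searched first. The sketch below lists every copy of a library in that order. It is an approximation: the dynamic loader also consults RPATH/RUNPATH and the ld.so cache, and empty path entries are skipped here for simplicity.

```python
import os
from pathlib import Path


def candidates(lib_name, ld_library_path=None):
    """List existing copies of lib_name in LD_LIBRARY_PATH order (first wins)."""
    if ld_library_path is None:
        ld_library_path = os.environ.get("LD_LIBRARY_PATH", "")
    found = []
    for directory in ld_library_path.split(":"):
        if not directory:
            continue
        candidate = Path(directory) / lib_name
        if candidate.exists():
            # Resolve symlinks so versioned targets are visible.
            found.append(str(candidate.resolve()))
    return found


if __name__ == "__main__":
    for path in candidates("librccl.so"):
        print(path)
```

If the first entry printed is under /workspace/rccl rather than /opt/rocm/lib, the custom RCCL build is the one most likely to be loaded.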
Present results in this format:
## Container Environment Summary
**Docker Image**: [image name from container_env.sh or user]
**Base OS**: [distro + version]
**Python**: [version] at [venv path]
**ROCm**: [version] at /opt/rocm-X.Y.Z/
### Python Packages
| Package | Version | Git Hash | Source in Container? | Path |
|---|---|---|---|---|
### ROCm System Packages
| Package | Version (dpkg) | Notes |
|---|---|---|
### Git Repos in Container
| Path | Repo | Git Hash | Branch/Tag |
|---|---|---|---|
### Runtime-Critical Environment Variables
| Variable | Value |
|---|---|
### Source Paths NOT Present (need cloning)
| Expected Path | Repo URL | Known Hash |
|---|---|---|
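Once the data is collected, each table in the template can be rendered mechanically. A minimal sketch of such a renderer follows; `markdown_table` is an illustrative helper, not part of any tool referenced here.

```python
def markdown_table(headers, rows):
    """Render headers and rows as a markdown table in the template's style."""
    lines = ["| " + " | ".join(headers) + " |",
             "|" + "---|" * len(headers)]
    for row in rows:
        # Empty cell for missing values; escape pipes so cells stay intact.
        cells = ["" if cell is None else str(cell).replace("|", "\\|")
                 for cell in row]
        lines.append("| " + " | ".join(cells) + " |")
    return "\n".join(lines)


if __name__ == "__main__":
    print(markdown_table(["Variable", "Value"], [("XLA_FLAGS", None)]))
```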
When the user needs source for a specific ROCm library, know which monorepo contains it:
ROCm/rocm-libraries (/workspace/rocm-libraries/):
projects/: rocblas, rocfft, rocsolver, rocsparse, rocrand, rocprim, rocthrust, hipblas, hipblaslt, hipcub, hipfft, hiprand, hipsolver, hipsparse, hipsparselt, hiptensor, miopen, composablekernel, rocwmma
shared/: tensile, rocroller, origami, mxdatagenerator
ROCm/rocm-systems (/workspace/rocm-systems/):
projects/: rocprofiler, rocprofiler-sdk, rocprofiler-register, rocprofiler-compute, rocprofiler-systems, roctracer, rccl, clr, hip, hipother, hip-tests, rocr-runtime (rocrruntime), rocminfo, rocm-core, rocmsmilib, amdsmi, aqlprofile, rdc, rocshmem
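The mapping above can be encoded as a lookup so that a component name resolves directly to its monorepo and subdirectory. The sketch copies the lists as given; `locate` is an illustrative name, and the returned subdirectory simply joins the group and the component name as listed above.

```python
MONOREPOS = {
    "ROCm/rocm-libraries": {
        "projects": ["rocblas", "rocfft", "rocsolver", "rocsparse", "rocrand",
                     "rocprim", "rocthrust", "hipblas", "hipblaslt", "hipcub",
                     "hipfft", "hiprand", "hipsolver", "hipsparse",
                     "hipsparselt", "hiptensor", "miopen", "composablekernel",
                     "rocwmma"],
        "shared": ["tensile", "rocroller", "origami", "mxdatagenerator"],
    },
    "ROCm/rocm-systems": {
        "projects": ["rocprofiler", "rocprofiler-sdk", "rocprofiler-register",
                     "rocprofiler-compute", "rocprofiler-systems", "roctracer",
                     "rccl", "clr", "hip", "hipother", "hip-tests",
                     "rocr-runtime", "rocminfo", "rocm-core", "rocmsmilib",
                     "amdsmi", "aqlprofile", "rdc", "rocshmem"],
    },
}


def locate(component):
    """Return (monorepo, subdirectory) for a component, or None if unlisted."""
    name = component.strip().lower()
    for repo, groups in MONOREPOS.items():
        for group, names in groups.items():
            if name in names:
                return repo, f"{group}/{name}"
    return None
```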
- jax-rocm7-* package names encode the ROCm major version (7). Future containers with ROCm 8 would use jax-rocm8-*.
- A jaxlib version containing +selfbuilt means it was compiled from source but the source tree was not retained.
- To locate the virtual environment, check $VIRTUAL_ENV or look under /opt/venv/.
- The site-packages path depends on the Python version (e.g., python3.12). Adjust grep paths accordingly.
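The naming conventions in these notes are easy to check in code. The sketch below extracts the ROCm major version from a plugin package name and flags a self-built jaxlib. The regular expression assumes the `jax-rocmN-plugin` / `jax-rocmN-pjrt` pattern described above, and also accepts underscores, since pip may report normalized names.

```python
import re


def rocm_major(package_name):
    """Return the ROCm major version encoded in a jax-rocmN-* name, or None."""
    match = re.fullmatch(r"jax[-_]rocm(\d+)[-_](plugin|pjrt)", package_name)
    return int(match.group(1)) if match else None


def is_selfbuilt(version):
    """True if the version string carries the +selfbuilt local tag."""
    return "+selfbuilt" in version
```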