skills/performance-analysis/SKILL.md
Analyze MaxText training job performance using tgs_tagger, TraceLens, and IRLens. Use when the user asks to analyze a training run, profile traces, HLO IR, TGS metrics, GPU utilization, or mentions tag_tgs, TraceLens, IRLens, xplane, or performance analysis.
Post-training (or mid-training) analysis pipeline. Follow the workflow below from top to bottom.
Multi-job comparisons: If comparing two or more jobs (e.g., "why is job B slower than job A?"), start with skills/tsdb-diagnosis/SKILL.md (Multi-Job Comparison workflow) before running TraceLens. The TSDB reveals system-level root causes — CPU contention from RCCL resource leaks, network errors, I/O pressure, thermal throttling — that TraceLens cannot observe (it only sees GPU-side kernel timings). Only proceed to TraceLens here if the TSDB comparison is inconclusive.
Deep per-kernel analysis: When the user asks for per-kernel time breakdowns, step-time composition tables, cross-variant kernel comparisons, or whether a specific kernel is main-stream-blocking — switch to skills/profile-drill/SKILL.md. TraceLens's kernel_launchers_summary_by_category.csv has a known ~1.5×–2× inflation bug on 1-node/proc profiles (the time ms per gpu column divides by host count, not GPU count). profile-drill uses utils/profile_drill.py to read the raw xplane trace JSONs directly and avoids this bias.
python3 utils/analyze_job.py "$JOB_WORKSPACE/<job>.log"
python3 utils/analyze_job.py "$JOB_WORKSPACE/<job_dir>/"
python3 utils/analyze_job.py "$JOB_WORKSPACE/local_2026*"
For running jobs, pass -f to force re-analysis (bypasses staleness check):
python3 utils/analyze_job.py -f "$JOB_WORKSPACE/<job>.log"
The dispatcher auto-detects available artifacts and runs only the relevant tools:
- tgs_tagger.py
- *.xplane.pb → TraceLens_generate_perf_report_jax
- xla_dump/*.gpu_after_optimizations.txt → IRLens_analyze_hlo_ir.py

If the dispatcher output says "TraceLens not installed" and xplane traces exist:
Check if TraceLens is already installed and patched before doing anything:
python3 -c "
import TraceLens.util, inspect
src = inspect.getsource(TraceLens.util.DataLoader.load_data)
assert 'xprof' in src, 'not patched'
print('TraceLens: installed and patched')
"
python3 utils/analyze_job.py -f "$JOB_WORKSPACE/<job>.log"

Install (only if import failed):
pip install git+https://github.com/AMD-AGI/TraceLens.git
Patch (only if the xprof assertion failed). Apply all patches from tracelens-patches.md — 6 files, ~13 patches. Key fixes:
- Switch from tensorboard_plugin_profile to xprof
- xprof remaps device PIDs to 1001+, so code filtering pid < 100 misses all GPU events
- metadata_events not passed to build_tree()
- KeyError on gpu_kernel_op_cat and missing parent events for launch latency

Re-run the dispatcher with -f:
python3 utils/analyze_job.py -f "$JOB_WORKSPACE/<job>.log"
This is one-time per environment. Always check before patching to avoid redundant work.
Read the generated analysis.json — but do NOT try to read the raw file (it can be 40K+ lines due to per-step arrays). Extract key metrics programmatically:
python3 -c "
import json, sys
with open('<job_dir>/analysis.json') as f:
    d = json.load(f)
print(f'Job: {d[\"job_id\"]} | Model: {d[\"model\"]} | Nodes: {d[\"num_nodes\"]} | Status: {d[\"job_status\"][\"status\"]}')
tgs = d['tgs']
print(f'Steady TGS: {tgs[\"steady\"][\"mean\"]:.1f} (std={tgs[\"steady\"][\"std\"]:.1f}, steps {tgs[\"steady\"][\"range\"]})')
print(f'Tail TGS: {tgs[\"tail\"][\"mean\"]:.1f} (std={tgs[\"tail\"][\"std\"]:.1f}, steps {tgs[\"tail\"][\"range\"]})')
tl = d.get('tracelens_summary', {})
if tl:
    print(f'Compute: {tl[\"computation_time\"]:.1f}% | Exposed comm: {tl[\"exposed_comm_time\"]:.1f}% | Idle: {tl[\"idle_time\"]:.2f}% | Total comm: {tl[\"total_comm_time\"]:.1f}%')
"
For deeper TraceLens analysis, read the CSVs in <job_dir>/tracelens/<timestamp>/csvs/:
- gpu_events_averages.csv: per-GPU compute/comm/idle breakdown (averages)
- gpu_timeline.csv: per-GPU breakdown with pid
- kernel_launchers_summary_by_category.csv: time by kernel category (GEMM, NCCL, XLA fusions, etc.)
- kernel_launchers_summary.csv: time by individual kernel name

⚠️ TraceLens per-GPU CSV bias on 1-node/proc. The `time ms per gpu` column in the two `kernel_launchers_summary*.csv` files divides total kernel time by host count (typically 8), not GPU count (typically 64), so per-GPU numbers are ~1.5×–2× inflated on 1-node/proc profiles. Percentages and category rankings are fine; absolute per-GPU kernel times are not. For kernel-time numbers you can cite (e.g. in a report or step-time composition table), use skills/profile-drill/SKILL.md instead; it reads raw xplane trace JSONs and divides by the auto-detected GPU count.
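Since percentages and category rankings are unaffected by the divisor, one way to use the kernel-launcher CSVs safely is to convert the `time ms per gpu` column into shares of the total. The following is a minimal sketch, not part of TraceLens or this repo. It assumes the category column is named `category` (a hypothetical name; check the header of your CSV), and the sample numbers are made up.

```python
import csv
import io


def category_shares(csv_text, category_col="category", time_col="time ms per gpu"):
    """Return {category: percent of total time}, largest first.

    category_col is an assumed column name; time_col is the column named in
    the warning above. Any constant divisor (hosts vs. GPUs) cancels out.
    """
    totals = {}
    for row in csv.DictReader(io.StringIO(csv_text)):
        key = row[category_col]
        totals[key] = totals.get(key, 0.0) + float(row[time_col])
    grand_total = sum(totals.values())
    if grand_total == 0:
        return {}
    ranked = sorted(totals.items(), key=lambda kv: kv[1], reverse=True)
    return {name: 100.0 * ms / grand_total for name, ms in ranked}


# Made-up numbers, purely for illustration.
sample = "category,time ms per gpu\nGEMM,600\nNCCL,300\nfusion,100\n"
shares = category_shares(sample)
for name, pct in shares.items():
    print(f"{name}: {pct:.1f}%")
```

Dividing every row by a different constant leaves the shares unchanged, which is why the percentages remain usable even when the absolute per-GPU times are not.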
Present results using this structure:
| Metric | Source | What to look for |
|--------|--------|------------------|
| TGS (steady-state) | analysis.json → tgs.steady | Primary throughput metric |
| MFU | analysis.json → mfu_per_step | Model FLOPS utilization (if available) |
| GPU compute % | tracelens_summary.computation_time | Time on actual compute kernels |
| Exposed comm % | tracelens_summary.exposed_comm_time | Communication NOT overlapped with compute (lower is better) |
| Idle % | tracelens_summary.idle_time | GPU doing nothing (should be near 0) |
| Kernel breakdown | kernel_launchers_summary_by_category.csv | GEMM vs NCCL vs fusion time |
| Comm ops per step | dispatcher IRLens output | Count of all-reduce, all-gather, all-to-all, reduce-scatter |
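As an illustration of filling in part of this structure from a loaded analysis.json, here is a minimal sketch. It reads only keys that appear in the extraction snippet earlier in this skill; the function names are hypothetical, and the example values are invented rather than taken from a real job.

```python
def summary_rows(d):
    """Build (metric, value) rows from a loaded analysis.json dict.

    Missing sections are reported as "n/a" instead of raising, since
    tracelens_summary is only present when TraceLens ran.
    """
    steady = d.get("tgs", {}).get("steady", {})
    tl = d.get("tracelens_summary") or {}

    def fmt(value, suffix=""):
        if isinstance(value, (int, float)):
            return f"{value:.1f}{suffix}"
        return "n/a"

    return [
        ("TGS (steady-state)", fmt(steady.get("mean"))),
        ("GPU compute %", fmt(tl.get("computation_time"), "%")),
        ("Exposed comm %", fmt(tl.get("exposed_comm_time"), "%")),
        ("Idle %", fmt(tl.get("idle_time"), "%")),
    ]


def to_markdown(rows):
    """Render rows as a two-column markdown table."""
    lines = ["| Metric | Value |", "|--------|-------|"]
    lines += [f"| {name} | {value} |" for name, value in rows]
    return "\n".join(lines)


# Illustrative values only, not measurements.
example = {
    "tgs": {"steady": {"mean": 1234.56}},
    "tracelens_summary": {
        "computation_time": 80.0,
        "exposed_comm_time": 15.0,
        "idle_time": 5.0,
    },
}
print(to_markdown(summary_rows(example)))
```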
Interpretation guidelines:
Check the dispatcher output first — it prints a Dashboard: line at the end. If it shows a URL with (running), use that URL.
If the dashboard is not running, start it:
pip install fastapi uvicorn # one-time
utils/perf_server.py & # binds 127.0.0.1 by default
Always tell the user the dashboard URL: http://localhost:<PORT>. For remote access, instruct them to tunnel: ssh -L <PORT>:localhost:<PORT> user@host. Avoid --host 0.0.0.0 — perf_server.py has no auth.
The server auto-detects a free port starting from 8080 and auto-reloads analysis.json on each request.
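For readers curious how "first free port starting from 8080" can work, here is a minimal sketch of one common approach: try to bind each candidate port on the loopback interface and return the first that succeeds. This is not perf_server.py's actual code; the function name and the `max_tries` limit are assumptions.

```python
import socket


def find_free_port(start=8080, host="127.0.0.1", max_tries=100):
    """Return the first port >= start that can currently be bound on host."""
    for port in range(start, min(start + max_tries, 65536)):
        with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as probe:
            try:
                probe.bind((host, port))
            except OSError:
                continue  # in use or not permitted; try the next port
            return port
    raise RuntimeError(f"no free port found starting at {start}")


print(f"http://localhost:{find_free_port()}")
```

A probe like this is inherently racy: another process can claim the port between the probe and the real bind, so a server would normally still handle a bind failure at startup.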
<JOB_WORKSPACE>/<JOB_ID>-<JOB_NAME>[-TGS_<VALUE>]/
  log -> ../<log_file>                              # symlink to log file
  analysis.json                                     # structured metrics
  xla_dump/                                         # if _env_ENABLE_XLA_DUMP=1
    module_NNNN.jit_train_step.*_gpu_after_optimizations.txt
  <run_name>/tensorboard/plugins/profile/<ts>/      # if profiler=xplane
    <hostname>.xplane.pb                            # 1-node/proc: one per host
  <run_name>/tensorboard/plugins/profile/<ts_i>/    # 1-GPU/proc (LOCAL_WORLD_SIZE ts dirs,
    <hostname>.proc<N>.xplane.pb                    #   one file per host per ts;
                                                    #   successive serialized writes land
                                                    #   in different per-second ts dirs)
  tracelens/<ts>/csvs/*.csv                         # 1-node/proc: TraceLens output
  tracelens/<ts_i>/<hostname>.proc<N>/csvs/*.csv    # 1-GPU/proc: one dir per GPU
The .log file sits alongside the directory in <JOB_WORKSPACE>/.
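If you need the parts of a job directory name programmatically, the pattern `<JOB_ID>-<JOB_NAME>[-TGS_<VALUE>]` can be split with a regular expression. This is a sketch under two assumptions that the layout above does not state: that JOB_ID contains no hyphen, and that the TGS value is a plain decimal number. The directory names in the example are hypothetical.

```python
import re

# Assumptions: JOB_ID has no "-"; the TGS suffix value is a decimal number.
JOB_DIR_RE = re.compile(
    r"^(?P<job_id>[^-]+)-(?P<job_name>.+?)(?:-TGS_(?P<tgs>\d+(?:\.\d+)?))?$"
)


def parse_job_dir(name):
    """Split a job directory name into job_id, job_name, and optional tgs."""
    match = JOB_DIR_RE.match(name.rstrip("/"))
    if match is None:
        return None
    parts = match.groupdict()
    if parts["tgs"] is not None:
        parts["tgs"] = float(parts["tgs"])
    return parts


print(parse_job_dir("12345-llama-train-TGS_987.5/"))
print(parse_job_dir("12345-llama-train"))
```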
When enable_checkpointing=true, profiler traces may end up in a shared directory outside the job dir. analyze_job.py parses Config param tensorboard_dir from the log to locate these. The dispatcher and perf_server.py filter profiles by job execution time window and node-0 hostname to disambiguate. In 1-GPU-per-process mode the node-0 filter name.startswith("<host>.") still matches all <host>.proc<N>.xplane.pb files, so TraceLens runs once per GPU on node 0; the multiple timestamp dirs (one per serialized write) are treated like periodic-profiling windows by the existing code.
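The node-0 prefix match described above can be sketched as follows. This illustrates the matching behaviour and is not the code in analyze_job.py; the function name, the example hostnames, and the extra `.xplane.pb` suffix check are assumptions.

```python
def node0_profiles(filenames, node0_host):
    """Keep xplane files whose name starts with '<node0_host>.'.

    The trailing dot in the prefix matters: it stops 'node1' from
    matching files written by 'node10'.
    """
    prefix = node0_host + "."
    return [
        name
        for name in filenames
        if name.startswith(prefix) and name.endswith(".xplane.pb")
    ]


files = [
    "node1.xplane.pb",        # 1-node/proc layout
    "node1.proc0.xplane.pb",  # 1-GPU/proc layout
    "node1.proc7.xplane.pb",
    "node10.xplane.pb",       # different host with a shared name prefix
    "node2.proc0.xplane.pb",
]
print(node0_profiles(files, "node1"))
```

Both the single-file and the per-process naming schemes pass the same filter, which is why TraceLens ends up running once per GPU on node 0 in 1-GPU-per-process mode.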
These are rarely needed — analyze_job.py orchestrates them. Use only for targeted re-runs.
# TGS tagging
utils/tag_tgs.sh <log_file_or_glob>
utils/tag_tgs.sh -f <log_file> # force on running job
# IRLens
utils/IRLens_analyze_hlo_ir.py <hlo_file>
utils/IRLens_analyze_hlo_ir.py <hlo_file> --op communication
utils/IRLens_analyze_hlo_ir.py <hlo_file> --op computation
# TraceLens
TraceLens_generate_perf_report_jax \
--profile_path <xplane.pb> \
--output_csvs_dir <output_dir>/csvs
# profile_drill.py — direct per-kernel analysis from trace JSONs
# (use when TraceLens's per-GPU numbers are suspect or you need kernel-level
# ground truth; see skills/profile-drill/SKILL.md)
utils/profile_drill.py <job_dir>/.../tensorboard/plugins/profile/*/*.trace.json.gz
RAY=1 Slurm log truncation

For RAY=1 jobs, the Slurm log may contain fewer training steps than actually completed due to Ray output buffering (actor stdout is forwarded asynchronously to the driver, and unflushed output is lost when the job exits). If the analysis shows suspiciously few steps (e.g., 34 out of 100) with no error or JOB SUMMARY, check ray_logs/<head_node>/worker*.out in the job directory for the authoritative step count. The analysis.json TGS/MFU metrics will be based only on what appears in the Slurm log and may undercount the actual run.
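A quick way to compare the two sources is to count distinct step numbers in each. The sketch below assumes step lines contain the text `completed step: <N>`, which is how MaxText step lines are commonly formatted; verify against your own logs and adjust the pattern if needed. The function names and sample lines are hypothetical.

```python
import re

# Assumed step-line format; adjust if your logs differ.
STEP_RE = re.compile(r"completed step: (\d+)")


def completed_steps(lines):
    """Return the set of step numbers found in an iterable of log lines."""
    steps = set()
    for line in lines:
        match = STEP_RE.search(line)
        if match:
            steps.add(int(match.group(1)))
    return steps


def compare_step_counts(slurm_lines, worker_lines):
    """Report how many steps each log saw and which ones the Slurm log lacks."""
    slurm = completed_steps(slurm_lines)
    worker = completed_steps(worker_lines)
    return {
        "slurm": len(slurm),
        "worker": len(worker),
        "missing_from_slurm": sorted(worker - slurm),
    }


# Synthetic lines for illustration only.
slurm_log = ["completed step: 0, seconds: 1.0", "completed step: 1, seconds: 1.0"]
worker_log = slurm_log + ["completed step: 2, seconds: 1.0", "unrelated line"]
print(compare_step_counts(slurm_log, worker_log))
```

In practice the line iterables would come from opening the Slurm log and the ray_logs worker files mentioned above.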
- JOB SUMMARY log marker and file modification time (15 min threshold).
- analyze_job.py -f bypasses the staleness check but never renames files for running jobs. Renames happen automatically on the next analysis after the job finishes.
- *.xplane.pb doesn't exist yet.
- xla_dump/ is already populated.