skills/profile-drill/SKILL.md
Direct per-kernel time analysis from JAX / TensorFlow xplane traces via `utils/profile_drill.py`. Use when the user asks for a per-kernel breakdown, step-time composition, cross-variant kernel comparison, main-stream-blocking analysis, or any question that needs ground-truth kernel timings below what TraceLens reports. Triggers include "xplane", "trace.json.gz", "input_scatter_fusion", "RaggedAllToAllKernelImpl", "ncclDevKernel", "step − total kernel", "main-stream-busy", "profile drill-down", or suspicion that TraceLens numbers are off by ~1.5–2×.
Direct-read xplane analysis for kernel-level ground truth. Sibling to performance-analysis (which goes via TraceLens). Use this skill when:
moe.py / sharding change / XLA flag toggle) where all variants must be measured with the same yardstick.

Tool: utils/profile_drill.py. Its docstring covers CLI usage and the multi-subdir gotcha; this skill covers methodology and interpretation.

performance-analysis

| Situation | Skill |
|---|---|
| "Analyze this job / what's the TGS?" | performance-analysis |
| "Compute / exposed-comm / idle %?" | performance-analysis |
| "Install TraceLens, run dispatcher, populate dashboard" | performance-analysis |
| "How long does kernel <X> take per GPU per step?" | profile-drill |
| "Which input_scatter_fusion_*.kd / loop_*_fusion_*.kd dominates?" | profile-drill |
| "Compare kernel time breakdown across N experimental variants." | profile-drill |
| "The time ms per gpu number in the TraceLens CSV seems too high." | profile-drill (escape hatch) |
Both skills read the same xplane artifacts (*.trace.json.gz, *.xplane.pb). performance-analysis aggregates via TraceLens; this skill parses the trace JSON directly.
profile_drill.py has no runtime dependency on TraceLens — it reads .trace.json.gz files directly via Python's stdlib (gzip + json) and never loads TraceLens modules. The skill and performance-analysis are independent analysis paths over the same upstream profiler output.
The indirect dependency is that the .trace.json.gz file has to exist. In the standard MaxText setup, the JAX profiler writes .trace.json.gz natively alongside .xplane.pb (see utils/monkey_patch_maxtext.py which flocks jax.profiler.stop_trace). This relies on the xprof (aka tensorboard-plugin-profile) package being importable at trace-write time.
If the profile directory contains *.xplane.pb files but no matching *.trace.json.gz — typically a container issue with xprof / TensorFlow 2.19+ compatibility — install and patch TraceLens following the steps in skills/performance-analysis/SKILL.md (Step 2 and tracelens-patches.md). That install pulls in a compatible xprof and patches the known renames; then re-run the training job so the next profile window writes .trace.json.gz natively. You do not need to actually use TraceLens after that — profile_drill.py can operate on the native JAX output directly.
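To make the direct-read path concrete, here is a minimal sketch of reading one `.trace.json.gz` with only `gzip` and `json`. It is not the code in `utils/profile_drill.py`; the function name `sum_event_durations` is hypothetical, and the sketch assumes the file follows the Chrome trace-event layout (a top-level `traceEvents` list whose complete events have `"ph": "X"` and a `dur` in microseconds). It sums every complete event, without the GPU-kernel filtering the real tool applies.

```python
import gzip
import json
from collections import defaultdict


def sum_event_durations(trace_path):
    """Return {event name: summed duration in microseconds} for one trace file."""
    with gzip.open(trace_path, "rt", encoding="utf-8") as fh:
        trace = json.load(fh)
    totals = defaultdict(float)
    for event in trace.get("traceEvents", []):
        # Only complete events ("ph" == "X") carry a duration; metadata events do not.
        if event.get("ph") == "X" and "dur" in event:
            totals[event.get("name", "<unnamed>")] += event["dur"]
    return dict(totals)
```

The result still has to be divided by the GPU count and the number of profiled steps before it is comparable to a per-GPU, per-step figure.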
Add these passthrough args to the training job (MaxText):
profiler=xplane
skip_first_n_steps_for_profiler=5
profiler_steps=3
_env_ENABLE_XLA_DUMP=1
3 profiled steps is the sweet spot — enough to average out per-step jitter, cheap enough that profiler-writeback noise (visible as an inflated step time on the step immediately after the profile window) stays localised.
Output lives under <job_dir>/<run_name>/tensorboard/plugins/profile/<ts>/:
- 1-node/proc: <hostname>.trace.json.gz per host. Each file contains events from all local GPUs on that host (identified by distinct pid values inside the file).
- 1-GPU/proc: <hostname>.proc<N>.trace.json.gz per GPU, each with a single pid (usually 1).

The tool prints the inferred launcher mode in its header line (looks like 1-node/proc, 8.0 GPUs/file vs looks like 1-GPU/proc, 1.0 GPU/file). A cross-variant comparison where the launcher hint differs between runs is almost always a mistake: stop and re-check.

1-node/proc and 1-GPU/proc are not just different file layouts; they produce subtly different traces along five dimensions:
| Dimension | 1-node/proc | 1-GPU/proc |
|---|---|---|
| Trace files per host | 1 (multi-pid) | gpus_per_node (single-pid) |
| Multi-subdir likelihood | Rare (all hosts' profilers fire nearly together) | Common (many more processes → more skew); always glob profile/*/… |
| Which kernels XLA emits | Intra-process collectives available — e.g. stream_executor::gpu::RaggedAllToAllKernelImpl for intra-process EP axes | Cross-process only — same ragged op falls back to RCCL AllToAll. Zero time in an "in-process" family here does not mean the workload is trivial, just lowered differently. |
| Per-kernel attribution | Xplane profiler can fold interleaved intra-process GPU work into non-kernel overhead; per-kernel dur of routine kernels (GEMM, flash-attn) may read ~2× lower than the true per-GPU time | Each process observes its own kernels as fully-accounted events; more faithful per-GPU reading |
| TraceLens CSV bias | Broken — time ms per gpu is inflated ~1.5×–2× (see Pitfalls) | Correct (file count = GPU count, so TraceLens's divisor accidentally lands right) |
Practical consequences:
step − total kernel (healthy overlap) which is harder to decompose.

A single profiling window commonly splits across two or more adjacent timestamp directories because JAX fires the profiler on each task slightly skewed. The severity is much higher on 1-GPU/proc (many more processes). Always glob across subdirs:
ls <job>/*/tensorboard/plugins/profile/*/*.trace.json.gz > /tmp/traces.txt
After running the tool, verify the GPUs auto-detected=NN line matches num_nodes × gpus_per_node and the launcher hint matches the job you submitted. If not, you are missing a subdir, including stale traces from a different run, or looking at a different launcher than you thought.
Build a list-file with a glob that captures all .trace.json.gz for the profiling window:
ls <job_dir>/*/tensorboard/plugins/profile/*/*.trace.json.gz > /tmp/traces.txt
wc -l /tmp/traces.txt
Expected count: num_nodes (1-node/proc) or num_nodes × gpus_per_node (1-GPU/proc).
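The same collection step can be sketched in Python, which makes the expected-count check explicit. The helper names `collect_trace_files` and `expected_file_count` are illustrative and not part of the repo; the glob shape mirrors the `ls` command above.

```python
import glob
import os


def collect_trace_files(job_dir):
    """Return every *.trace.json.gz under any run and any profile timestamp subdir."""
    pattern = os.path.join(
        job_dir, "*", "tensorboard", "plugins", "profile", "*", "*.trace.json.gz"
    )
    return sorted(glob.glob(pattern))


def expected_file_count(num_nodes, gpus_per_node, one_gpu_per_proc):
    """1-node/proc writes one file per host; 1-GPU/proc writes one file per GPU."""
    return num_nodes * gpus_per_node if one_gpu_per_proc else num_nodes
```

Comparing `len(collect_trace_files(job_dir))` against `expected_file_count(...)` catches a missed timestamp subdir before any timing is computed.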
Run profile_drill.py on the collected traces:

python3 utils/profile_drill.py $(cat /tmp/traces.txt)
The tool prints three blocks:
- A header: trace_files, GPUs auto-detected, profile_steps, divisor.
- Per-family kernel rows (input_scatter_fusion, loop_select_fusion, loop_convert_fusion, loop_transpose_fusion, loop_reduce_fusion, input_reduce_select, input_broadcast_reduce_select): each row is one kernel name, its time per GPU per step, and launch count.
- The Other kernels catch-all, plus the Total kernel time summed across all streams.

Before trusting any number, confirm:
GPUs auto-detected == num_nodes × gpus_per_node
If lower → missing a timestamp subdir in the glob or traces never wrote. If higher → you included an unrelated profile window; scope to a specific timestamp.
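The header check above can be written as a small guard. This is a sketch only: `check_gpu_count` is a hypothetical helper, and the `detected` value is whatever the tool printed on its GPUs auto-detected line.

```python
def check_gpu_count(detected, num_nodes, gpus_per_node):
    """Compare the tool's auto-detected GPU count with the job's real GPU count."""
    expected = num_nodes * gpus_per_node
    if detected < expected:
        return f"too few ({detected} < {expected}): missing timestamp subdir or unwritten traces"
    if detected > expected:
        return f"too many ({detected} > {expected}): unrelated profile window in the glob"
    return "ok"
```

For the 8 nodes × 8 GPUs layout used in the examples, `check_gpu_count(64, 8, 8)` returns `"ok"`.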
step − total kernel

Look up the steady-state step time from the training log (use a no-profile companion run if available; the profile run's step time is skewed by writeback on the step immediately after the profile window). Then:
idle_gap_or_overlap = step_time − total_kernel_time
Interpretation — see The step − total kernel row is the key diagnostic.
The step − total kernel row is the key diagnostic

| Sign | Meaning |
|---|---|
| Positive (+) | Main-stream blocker(s) are stalling the GPU. Something on the compute stream occupies kernel time that does not permit overlap, and downstream kernels can't begin. The positive gap is real idle wall-clock that no family bucket captures. |
| Near zero | Kernels are tightly serialised, minimal idle, minimal overlap. Rare in practice. |
| Negative (−) | Streams are genuinely overlapping (compute stream + RCCL comm stream kernels run concurrently). total kernel sums across all streams, so a healthy overlapping step has total > wallclock. |
See Examples for concrete numbers measured on real profiles.
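The sign table can be applied mechanically. The sketch below is illustrative: `classify_gap` is a hypothetical helper, and the `near_zero_s` threshold of 0.5 s is an assumption chosen for the example, not a value taken from the tool.

```python
def classify_gap(step_time_s, total_kernel_s, near_zero_s=0.5):
    """Return (gap in seconds, interpretation) for step - total kernel."""
    gap = step_time_s - total_kernel_s
    if gap > near_zero_s:
        label = "positive: main-stream blocker(s) stalling the GPU"
    elif gap < -near_zero_s:
        label = "negative: streams overlapping (healthy)"
    else:
        label = "near zero: kernels tightly serialised"
    return gap, label
```

With the sparse-gmm rows from the Examples section, `classify_gap(82.5, 45.1)` gives a gap of about +37.4 s (positive) and `classify_gap(26.1, 33.5)` gives about −7.4 s (negative).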
Symptom: a single kernel (family) takes a large fraction of the step and step − total kernel is strongly positive. Because the idle-gap row surfaces stalls more directly on 1-node/proc (see Launcher-mode differences), this analysis is usually easiest on a 1-node/proc profile even if the production launcher is 1-GPU/proc.
Common suspects on AMD MI3xx:
- stream_executor::gpu::RaggedAllToAllKernelImpl<N> (1-node/proc only): XLA's in-process fallback for ragged-all-to-all when all EP ranks share a JAX process. Sequential across peers, no pipelining, one xGMI link at a time. The typical remedy is to pass the XLA flag --xla_gpu_unsupported_use_ragged_all_to_all_one_shot_kernel=false so the ragged thunk picks the RCCL path instead, the same path that 1-GPU/proc selects automatically. This kernel does not exist on 1-GPU/proc traces; zero time in this family on a 1-GPU profile is normal.
- input_scatter_fusion_*.kd: atomic scatter-adds emitted when JAX autodiff inverts a recv_x[indices] gather where indices has duplicates (e.g. MoE top-K fan-out). Each atomic scatters one peer-word at a time through HBM and cannot overlap with following GEMMs. Typical remedies: compose chained gathers into one (halves the atomics), or replace the duplicate-index backward with a jax.custom_vjp that uses a permutation gather + reduce-sum (eliminates atomics entirely).
- Any loop_*_fusion_<N>.kd that ends up on the main stream with per-call duration above ~50 ms (appears on both launchers). Inspect its HLO to see what it's doing.

The tool sums times using this classification (lambdas in utils/profile_drill.py):
| Family | What it means |
|---|---|
| RaggedAllToAllKernelImpl | XLA's in-process ragged-all-to-all (naive, serial across peers). Appears on 1-node/proc when the one-shot kernel is enabled. |
| primus_turbo::deep_ep | AMD Primus-Turbo's MoE dispatch/combine HIP kernels. Appears with use_deepep_dispatch=true. |
| input_scatter_fusion / loop_select_fusion / loop_gather_fusion / etc. | XLA-emitted fusion kernels around user-code indexing / masking / permutation. |
| RCCL ncclDevKernel | RCCL collective-ops on the comm stream. A higher RCCL number on a faster variant often means the scheduler packs more RCCL work in (because a main-stream blocker is now gone), not that RCCL itself got slower — look at step − total kernel to confirm. |
| CK+primus GEMM (non-DeepEP) | AMD Composable-Kernel and Primus-Turbo GEMM kernels (grouped + dense), excluding DeepEP HIP calls. |
| flash_attn (fmha) | AMD aiter::fmha_* flash-attention forward/backward. |
| Other kernels | Everything that matched is_gpu_kernel() but no family predicate (memcpy, barriers, minor XLA fusions). |
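A first-match classifier over kernel names could look like the sketch below. The predicates here are assumptions written for illustration (plain substring and prefix tests derived from the family names in the table); the actual lambdas live in `utils/profile_drill.py` and may differ. The CK+primus GEMM family is left out because its matching rule is not described here.

```python
# Hypothetical predicates: ordered, first match wins.
FAMILY_PREDICATES = [
    ("RaggedAllToAllKernelImpl", lambda n: "RaggedAllToAllKernelImpl" in n),
    ("primus_turbo::deep_ep", lambda n: "primus_turbo::deep_ep" in n),
    ("input_scatter_fusion", lambda n: n.startswith("input_scatter_fusion")),
    ("loop_select_fusion", lambda n: n.startswith("loop_select_fusion")),
    ("loop_gather_fusion", lambda n: n.startswith("loop_gather_fusion")),
    ("RCCL ncclDevKernel", lambda n: "ncclDevKernel" in n),
    ("flash_attn (fmha)", lambda n: "fmha" in n),
]


def classify_kernel(name):
    """Map a kernel name to its family, falling back to the catch-all bucket."""
    for family, predicate in FAMILY_PREDICATES:
        if predicate(name):
            return family
    return "Other kernels"
```

Keeping the predicates in one ordered list means every variant in a comparison is bucketed by the same definitions.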
When comparing experimental variants against a baseline:
- Same launcher mode: the header hint must read the same on every run, either looks like 1-node/proc or looks like 1-GPU/proc. If it differs across variants, the attribution is not comparable (see Launcher-mode differences).
- Same --profile-steps on every invocation (defaults to 3, matching profiler_steps=3). Mismatching profile_steps makes per-step numbers incomparable by a constant factor.
- Same glob shape (profile/*/*.trace.json.gz). Under-globbing one variant silently under-counts its GPUs.
- GPUs auto-detected is identical across the variants you're comparing. It must equal num_nodes × gpus_per_node for every profile.

To build a composition table for a single variant:

1. Run profile_drill.py on all trace files from one profile window.
2. Itemise the major families (RaggedAllToAllKernelImpl, primus_turbo::deep_ep, input_scatter_fusion, loop_select_fusion, loop_gather_fusion, RCCL, CK+primus GEMM, flash_attn).
3. Fold the remaining fusion families (loop_reduce_fusion, loop_convert_fusion, loop_transpose_fusion, input_reduce_select, input_broadcast_reduce_select) plus Other kernels into an "Other fusions" row. This must sum with the itemised rows to equal Total kernel time.
4. Compute step − total kernel and put it in its own row.

When comparing variants, put each variant in its own column, with all columns computed identically (same --profile-steps, same glob shape, same family definitions).
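The composition-table procedure can be sketched as a function that itemises, folds, and checks the sum. `composition_table` is a hypothetical helper, the tolerance is an assumption, and the numbers in the usage note are made up for illustration rather than measured.

```python
def composition_table(family_s, reported_total_s, step_time_s, itemised, tol_s=0.01):
    """Build (row name, seconds) rows and verify they add up to the reported total."""
    rows = [(name, family_s.get(name, 0.0)) for name in itemised]
    # Everything not itemised is folded into a single "Other fusions" row.
    folded = sum(t for name, t in family_s.items() if name not in itemised)
    rows.append(("Other fusions", folded))
    summed = sum(t for _, t in rows)
    if abs(summed - reported_total_s) > tol_s:
        raise ValueError(f"rows sum to {summed}, tool reported {reported_total_s}")
    rows.append(("Total kernel time", reported_total_s))
    rows.append(("step - total kernel", step_time_s - reported_total_s))
    return rows
```

For example, with made-up inputs `{"RCCL": 2.0, "flash_attn": 1.0, "loop_reduce_fusion": 0.25, "Other kernels": 0.75}`, a reported total of 4.0 s, and a 5.0 s step, itemising RCCL and flash_attn yields an "Other fusions" row of 1.0 s and a gap row of +1.0 s. Running the same function once per variant gives columns that are computed identically.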
performance-analysis Step 3 recommends reading kernel_launchers_summary_by_category.csv and kernel_launchers_summary.csv from TraceLens. These CSVs have a systematic ~1.5×–2× inflation on 1-node/proc profiles.
time ms per gpu = total_direct_kernel_time_ms / N, where N is the number of trace files (= number of hosts on 1-node/proc). TraceLens divides by the number of trace files, not the number of GPUs.

Diagnostic: compare a TraceLens CSV category (say GEMM or RCCL) between 1-node/proc and 1-GPU/proc profiles of the same pdbs on the same model. If the 1-GPU reading is ~half the 1-node reading, TraceLens is bitten: the actual per-GPU GEMM time should be identical across launchers modulo small stream-placement effects, and the profile_drill.py 1-GPU number is ground truth.
Remedy: use profile_drill.py whenever you need per-kernel numbers you can cite. It counts raw dur fields from the trace JSON and divides by auto-detected gpus × profile_steps.
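The normalisation described above (raw `dur` sum divided by GPUs × profiled steps) is a one-liner. The sketch assumes `dur` values are in microseconds, as in the Chrome trace-event format; `per_gpu_per_step_s` is a hypothetical name, not a function from the tool.

```python
def per_gpu_per_step_s(total_dur_us, num_gpus, profile_steps=3):
    """Convert a summed kernel duration (microseconds) to seconds per GPU per step."""
    divisor = num_gpus * profile_steps  # e.g. 64 GPUs x 3 steps = 192
    return total_dur_us / 1e6 / divisor
```

Dividing by the GPU count rather than the trace-file count is what keeps the result independent of the launcher mode.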
Symptom: GPUs auto-detected=NN where NN < num_nodes × gpus_per_node. A portion of the profiling window landed in a sibling profile/<ts+1>/ directory that your glob didn't include. Re-glob with profile/*/… (not a specific timestamp).
Symptom: GPUs auto-detected=NN larger than expected, or families show unexpected kernels (e.g. RaggedAllToAllKernelImpl with positive time on a variant that shouldn't go through that path). The glob picked up an older profiling window in the same job directory. Scope to a specific timestamp window.
The step that immediately follows the 3-step profile window (step 8 for skip_first=5, steps=3) is slowed by profiler writeback — measured overhead ranges from ~30 % to ~70 % above steady state, depending on how much profile data had to be serialised. If you use the profile run's mean TGS for the denominator in a step-time composition table, your step − total kernel will be artificially inflated. Source the step time from a companion no-profile run, or exclude the writeback step from the profile run's TGS.
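One way to exclude the writeback step is to average only the steps after it. The sketch below assumes you have already parsed per-step times from the training log into a dict keyed by step index; `steady_state_mean` is a hypothetical helper, and the values in the usage note are made up.

```python
def steady_state_mean(step_times_s, skip_first=5, profile_steps=3):
    """Mean step time over the steps that follow the profiler writeback step."""
    writeback_step = skip_first + profile_steps  # step 8 for the defaults
    tail = [t for step, t in sorted(step_times_s.items()) if step > writeback_step]
    if not tail:
        raise ValueError("no steps recorded after the writeback step")
    return sum(tail) / len(tail)
```

With made-up times of 10.0 s for every step except a 16.0 s writeback step 8, the function returns 10.0 because steps 0 through 8 are ignored.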
Two worked examples reconstructed from real DeepSeek-V3 671B runs (MI355, 8 nodes × 8 GPUs, pdbs=6, seq_len = 4096, so 64 GPUs × 3 profiled steps → divisor 192). Both illustrate the patterns described above. Treat them as illustrative — the kernel times will look different on your hardware/parallelism, but the ratios and the workflow of identifying the bottleneck class transfer.
Three profiles of the same pdbs=6 workload under different launcher/config combinations:
| Variant | Step | Total kernel | step − total | Dominant kernel pattern |
|---------------------------|------:|-------------:|-------------:|-------------------------|
| sparse-gmm 1-node/proc | 82.5 s | 45.1 s | +37.4 s | RaggedAllToAllKernelImpl = 28.4 s/GPU/step — blocks everything on main stream |
| sparse-gmm 1-GPU/proc | 26.1 s | 33.5 s | −7.4 s | RCCL ncclDevKernel = 15.3 s overlaps with ~14 s of compute — healthy |
| sparse-gmm-deepep 1-node/proc | 38.0 s | 26.9 s | +11.1 s | input_scatter_fusion_2.kd (≈ 4.4 s) is a smaller main-stream blocker |
Lessons:
- Moving sparse-gmm from 1-node/proc to 1-GPU/proc cuts the step from 82.5 s to 26.1 s through two effects: (a) RaggedAllToAllKernelImpl on the 1-node profile is purely sequential, so its disappearance under 1-GPU frees ~28 s of wallclock directly; (b) scheduler cascade recovery: once the blocker is gone, kernels that were stranded in gaps on the main stream (RCCL, flash-attn, etc.) can run concurrently with compute on their respective streams, collapsing most of the remaining idle. Step time drops wholesale, not just by the blocker's duration.
- The sparse-gmm-deepep 1-node/proc profile still shows a positive gap (+11.1 s) with a smaller main-stream blocker, input_scatter_fusion_2.kd. Compare with 1-GPU sparse-gmm (12916), which has a healthy negative gap of −7.4 s (stream overlap): even though both paths do similar work, the idle-gap sign flips because nothing on 1-GPU blocks the main stream.
- Takeaway: step − total kernel is the single best diagnostic for "is this workload bottlenecked by a blocking kernel?" and is more reliable than ranking kernel family totals. (Do not compare absolute kernel-time totals across launchers; that comparison is subject to the attribution artefact described in Launcher-mode differences.)

Second example (moe.py change)

Same workload, same image, same flags; only src/MaxText/layers/moe.py differs between three yihuang/moe-turbo-gmm-and-deepep{,-v2,-v3} patch branches:

| Variant | input_scatter_fusion_*.kd (s) | Total kernel | Step | step − total | Step-time saving vs prev |
|---|---:|---:|---:|---:|---:|
| v1 baseline (two gathers, two atomic scatter-adds) | 8.97 | 26.93 s | 38.0 s | +11.1 s | — |
| v2 (compose gathers: one atomic scatter-add) | 4.45 | 21.69 s | 30.5 s | +8.8 s | −7.5 s |
| v3 (custom_vjp replaces scatter-add with reduce-sum) | 0.04 | 19.13 s | 23.9 s | +4.8 s | −6.6 s |
Lessons:
- input_scatter_fusion_*.kd drops monotonically (8.97 → 4.45 → 0.04 s) as Python-side MoE changes remove duplicate-index atomic scatter-adds from the backward HLO.

Discipline applied in both examples: identical --profile-steps 3, identical glob (profile/*/*.trace.json.gz), GPUs auto-detected=64 on every run, headline TGS sourced to exclude the profiler writeback step, either from a companion no-profile run (v2, v3) or from the profile run's steady-state tail only (steps 9-14 when skip_first=5 steps=3).

utils/profile_drill.py: the tool itself. --help shows CLI options; the module docstring covers usage and the multi-subdir gotcha.