plugins/xprof-profiling-analysis/skills/xprof-profiling-analysis/SKILL.md
Use when analyzing TPU/GPU profiling performance — XProf MCP tools for operator breakdowns, memory profiles, A/B comparisons; offline methodology for trace parsing, MFU calculation, HBM memory analysis, XLA flag auditing, communication bottleneck identification
npx skillsauth add primatrix/skills xprof-profiling-analysis
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
TPU/GPU training performance analysis. Primary workflow uses XProf MCP tools (xprof_*) for live queries; appendix sections provide deep domain knowledge for offline analysis and result interpretation.
Before any analysis, verify that XProf MCP tools are available:
Try calling xprof_list_runs().
If it succeeds — skip to Step 1.
If the tool is not found — the plugin is not enabled. Tell the user:
XProf MCP tools are not available. Enable the plugin and restart Claude Code:
claude settings set enabledPlugins.xprof-profiling-analysis@primatrix-skills true
Stop here — wait for the user to restart before continuing.
If the tool exists but returns an error — ask the user to configure XPROF_URL to point to their XProf instance. They need to either:
kubectl port-forward svc/xprof-service <local-port>:8080 and set XPROF_URL=http://localhost:<local-port>
The URL is configured in the plugin's .mcp.json env or via the XPROF_URL environment variable.
xprof_list_runs()
Pick the runs to analyze. If comparing optimizations, identify:
main branch or previous best
xprof_overview(run="<run_name>")
Check key metrics:
xprof_framework_ops(run="<run_name>", top=30)
This gives JAX-level operation statistics. Filter by category to focus:
xprof_framework_ops(run="<run_name>", category="gla") # GLA kernels
xprof_framework_ops(run="<run_name>", category="moe") # MoE/GMM kernels
xprof_framework_ops(run="<run_name>", category="matmul") # Dense matmuls
xprof_framework_ops(run="<run_name>", category="collective") # Communication
Categories: gla, moe, attention, norm, embedding, collective, matmul, other
xprof_compare(run_a="<baseline>", run_b="<experiment>", top=20)
Focus on:
xprof_memory(run="<run_name>")
Check:
For deeper analysis, get browser URLs for human inspection:
xprof_trace_url(run="<run_name>")
Share these URLs with the user for:
Generate a comparison table:
| Metric | Baseline | Experiment | Delta |
|--------|----------|------------|-------|
| Step time | X.XXs | X.XXs | -X.X% |
| MXU utilization | XX% | XX% | +X.X% |
| Top hotspot | op_name (XX%) | op_name (XX%) | -X.X% |
| Peak HBM | XX.X GiB | XX.X GiB | -X.X GiB |
Include:
runs = xprof_list_runs()
xprof_overview(run=runs[-1])
xprof_framework_ops(run="<run>", category="gla")
xprof_compare(run_a="<baseline>", run_b="<experiment>")
xprof_memory(run="<run>")
xprof_trace_url(run="<run>") # memory viewer URL for human inspection
xprof_trace_url(run="<run>") # returns browser URLs for all XProf views
| Bound Type | Indicator | Optimization Direction |
|------------|-----------|----------------------|
| Compute bound | High FLOP rate, arithmetic intensity > 313 FLOPs/byte (TPUv7x) | Reduce FLOPs (algorithmic), increase MXU utilization |
| Memory bound | High bandwidth utilization, low arithmetic intensity | Reduce data movement, fuse ops, quantize |
| Latency bound | Low both FLOP rate and bandwidth | Pipeline better, reduce sync barriers, batch ops |
Ridge Point = Peak_FLOPS / HBM_BW: TPU v7x = 2307T / 7380 GB/s = 313 FLOPs/byte
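The ridge-point arithmetic can be sketched in a few lines of Python. This is only an illustration: the helper names `ridge_point` and `classify_bound` are hypothetical, and arithmetic intensity alone cannot detect the latency-bound case, which also needs the measured FLOP rate and bandwidth.

```python
def ridge_point(peak_flops: float, hbm_bw_bytes_per_s: float) -> float:
    """Arithmetic intensity (FLOPs/byte) where the compute and memory limits meet."""
    return peak_flops / hbm_bw_bytes_per_s


def classify_bound(flops: float, bytes_moved: float, ridge: float) -> str:
    """Compare one op's arithmetic intensity against the ridge point."""
    intensity = flops / bytes_moved
    return "compute" if intensity > ridge else "memory"


# TPU v7x figures quoted above: 2307 TFLOP/s peak, 7380 GB/s HBM bandwidth.
V7X_RIDGE = ridge_point(2307e12, 7380e9)
print(round(V7X_RIDGE))  # 313

# Hypothetical bf16 matmul [4096,4096] x [4096,4096]:
# 2*M*K*N FLOPs, 2 bytes per element across three matrices.
matmul_flops = 2 * 4096**3
matmul_bytes = 2 * 3 * 4096**2
print(classify_bound(matmul_flops, matmul_bytes, V7X_RIDGE))  # compute
```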
| Pattern | Symptom | Action |
|---------|---------|--------|
| Stalled communication | High device idle, large collective ops | Overlap compute + comms, check XLA flags (see Appendix D) |
| Kernel regression | Same op category significantly slower after change | Check pallas-kernel version, chunk_size, block_size |
| Memory pressure | Peak HBM near capacity, high fragmentation | Enable remat, reduce batch size, chunked CE (see Appendix E) |
| Load imbalance | Cross-device step time variance >5% | Check MoE expert routing balance, PP stage sizing |
| Host bottleneck | High host idle time | Check data pipeline, increase grain workers |
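The load-imbalance row can be checked with a short sketch. The source does not define how the ">5%" figure is computed, so this example assumes one possible reading, relative spread `(max - min) / mean` over per-device step times. The function names are hypothetical.

```python
from statistics import mean


def step_time_spread(step_times_s: list[float]) -> float:
    """Relative spread of per-device step times: (max - min) / mean."""
    return (max(step_times_s) - min(step_times_s)) / mean(step_times_s)


def is_imbalanced(step_times_s: list[float], threshold: float = 0.05) -> bool:
    """Flag a run whose spread exceeds the threshold (5% by default)."""
    return step_time_spread(step_times_s) > threshold


# Hypothetical per-device step times in seconds.
print(is_imbalanced([2.00, 2.01, 2.00, 2.02]))  # False
print(is_imbalanced([2.00, 2.30, 2.00, 2.01]))  # True
```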
Domain knowledge for interpreting MCP results and offline analysis when MCP data is insufficient.
XLA-compiled op names lose JAX semantics. Classification by hlo_category + name patterns:
| Category | hlo_category / Name Pattern | Meaning |
|----------|---------------------------|---------|
| matmul | convolution fusion | Matrix multiply (XLA uses convolution for matmul) |
| attention | splash_mha, flash_attention | Attention kernel |
| communication | all-reduce, all-gather, reduce-scatter, all-to-all (incl. async-start/async-done) | Collective comms |
| custom_kernel | custom-call (excl. GMM/TGMM/splash_mha) | Pallas kernel, offload |
| custom_fusion | custom fusion | XLA custom fusions |
| elementwise_fusion | loop fusion, non-fusion elementwise | Elementwise fusions |
| data_formatting | data formatting | Tensor layout conversion (FSDP sharding copies) |
| routing | sort | MoE TopK routing |
Key: async-done events for communication (e.g., all-reduce.*.call-done) have duration = stall time (compute engine waiting). This is the core metric for exposed communication cost.
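A minimal sketch of the classification above, assuming each trace event exposes an op name and an `hlo_category` string. Name patterns are checked first because they are more specific. Two things here are assumptions and not part of the table: the order of the rules, and the `moe` label for GMM/TGMM custom-calls, which is borrowed from the MCP category list.

```python
import re

# Name-based rules, checked before hlo_category (order is an assumption).
NAME_RULES = [
    ("attention", re.compile(r"splash_mha|flash_attention")),
    ("communication", re.compile(r"all-reduce|all-gather|reduce-scatter|all-to-all")),
    ("moe", re.compile(r"gmm")),  # assumed label; matches both gmm and tgmm
]

# hlo_category rules taken from the table above.
CATEGORY_RULES = {
    "convolution fusion": "matmul",
    "custom-call": "custom_kernel",
    "custom fusion": "custom_fusion",
    "loop fusion": "elementwise_fusion",
    "non-fusion elementwise": "elementwise_fusion",
    "data formatting": "data_formatting",
    "sort": "routing",
}


def classify_op(name: str, hlo_category: str) -> str:
    """Map one op to a coarse category; unknown ops fall through to 'other'."""
    lowered = name.lower()
    for label, pattern in NAME_RULES:
        if pattern.search(lowered):
            return label
    return CATEGORY_RULES.get(hlo_category.lower(), "other")


print(classify_op("convolution_bitcast_fusion.12", "convolution fusion"))  # matmul
print(classify_op("all-reduce.7.call-done", "all-reduce"))  # communication
```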
| Pattern | Meaning | Phase |
|---------|---------|-------|
| convolution_bitcast_fusion.* | MatMul (fwd or bwd dlhs) | fwd/bwd |
| tgmm.* | MoE weight gradient (Transposed GMM) | backward only |
| splash_mha_fwd_* | SplashAttention forward | forward |
| splash_mha_dkv_* / splash_mha_dq_* | SplashAttention gradients | backward |
| ragged-all-to-all | MoE expert routing (EP) | fwd/bwd |
Via tf_op field (most reliable):
transpose(jvp(...)) → backward
jvp(...) without transpose → forward
moe_layers_0)
Via op name (quick): tgmm.* → always backward; splash_mha_d* → backward; splash_mha_fwd_* → forward.
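The two detection routes can be combined into one helper, sketched below. It prefers `tf_op` and falls back to the op-name shortcuts. The function name `op_phase` and the example `tf_op` strings are hypothetical.

```python
def op_phase(op_name: str, tf_op: str = "") -> str:
    """Return 'forward', 'backward', or 'unknown' for one op."""
    # tf_op is the more reliable signal: transpose(jvp(...)) marks backward.
    if "transpose(jvp(" in tf_op:
        return "backward"
    if "jvp(" in tf_op:
        return "forward"
    # Quick fallback on op-name patterns.
    if op_name.startswith("tgmm") or op_name.startswith("splash_mha_d"):
        return "backward"
    if op_name.startswith("splash_mha_fwd_"):
        return "forward"
    return "unknown"


# Illustrative tf_op strings; real values depend on the model's scope names.
print(op_phase("convolution_bitcast_fusion.3", "jit(step)/transpose(jvp(Model))/mlp"))  # backward
print(op_phase("convolution_bitcast_fusion.1", "jit(step)/jvp(Model)/mlp"))  # forward
print(op_phase("tgmm.9"))  # backward
```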
MFU = F_useful / (Peak_cluster x T_step)
F_useful = GBS x seq_len x FLOPs_per_token(fwd+bwd)
Peak_cluster = num_devices x peak_flops_per_device
MoE caveat: Per-token FLOPs much lower than equivalent dense model (only top-k experts activated). MoE MFU is naturally low (~10%) — this is architectural, not poor hardware utilization.
Pallas kernel caveat: SplashAttention model_flops reports 0, so FLOPs accumulated from the trace underestimate actual compute. GMM/TGMM model_flops are correct in newer versions.
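The MFU formula translates directly into code. The sketch below mirrors the three definitions above. The sequence length, per-token FLOPs, and step time in the example are placeholders chosen for illustration, not measurements.

```python
def mfu(
    gbs: int,
    seq_len: int,
    flops_per_token: float,
    num_devices: int,
    peak_flops_per_device: float,
    step_time_s: float,
) -> float:
    """MFU = F_useful / (Peak_cluster x T_step)."""
    f_useful = gbs * seq_len * flops_per_token  # fwd+bwd FLOPs per step
    peak_cluster = num_devices * peak_flops_per_device
    return f_useful / (peak_cluster * step_time_s)


# Placeholder inputs: GBS 5120 on 128 chips, TPU v7x peak of 2307 TFLOP/s,
# assumed seq_len 4096, 6e9 FLOPs/token (fwd+bwd), and a 4 s step.
value = mfu(5120, 4096, 6e9, 128, 2307e12, 4.0)
print(f"MFU = {value:.1%}")
```

For MoE models, `flops_per_token` should count only the activated experts, which is why the result sits well below dense-model figures.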
| Primitive | Parallelism | Purpose |
|-----------|------------|---------|
| AllGather | FSDP | Collect sharded weights |
| ReduceScatter | FSDP | Aggregate and distribute gradients |
| AllReduce | DP (non-FSDP) | Gradient sync |
| AllToAll | EP | Expert routing data exchange |
A single TPU device executes all ops serially. Communication overlap therefore means overlap between the communication engine (ICI/DCN) and the compute engine (TensorCore):
async-done with duration = 0: Communication completed, no stall → effectively overlapped
async-done with duration > 0: Compute engine waiting = exposed communication cost
Wrong: Don't analyze interval overlaps on single-device timeline (events are serial). Measure async-done duration instead.
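Measuring exposed communication then reduces to summing async-done durations per primitive. The sketch below assumes events are already loaded as dicts with `name` and `duration_us` keys. That schema is hypothetical and would need adapting to the actual trace format.

```python
COLLECTIVES = ("all-reduce", "all-gather", "reduce-scatter", "all-to-all")


def exposed_comm_us(events: list[dict]) -> dict[str, float]:
    """Sum async-done durations (stall time) per collective primitive."""
    stalls: dict[str, float] = {}
    for ev in events:
        name = ev["name"]
        if not name.endswith("done"):
            continue  # only *-done events carry the stall time
        for prim in COLLECTIVES:
            if prim in name:
                stalls[prim] = stalls.get(prim, 0.0) + ev["duration_us"]
                break
    return stalls


# Hypothetical events from one step on a single device.
events = [
    {"name": "all-reduce.3.call-start", "duration_us": 5.0},
    {"name": "all-reduce.3.call-done", "duration_us": 120.0},
    {"name": "all-gather.1.call-done", "duration_us": 0.0},
    {"name": "convolution_bitcast_fusion.2", "duration_us": 900.0},
]
print(exposed_comm_us(events))  # {'all-reduce': 120.0, 'all-gather': 0.0}
```

A primitive that reports 0.0 here was fully hidden behind compute; anything above zero is time the TensorCore spent waiting.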
data formatting ops = tensor layout conversions. When >5% of step time: check for unnecessary FSDP sharding layout copies.
Check LIBTPU_INIT_ARGS when communication is not overlapped.
Continuation Fusion (CF):
| Flag | Purpose |
|------|---------|
| xla_tpu_enable_async_collective_fusion=true | Enable async collective fusion |
| xla_tpu_enable_async_collective_fusion_fuse_all_gather=true | AllGather CF |
| xla_tpu_enable_async_collective_fusion_fuse_all_reduce=true | AllReduce CF |
| xla_tpu_enable_async_collective_fusion_fuse_reduce_scatter=true | ReduceScatter CF |
| xla_tpu_overlap_compute_collective_tc=true | TensorCore overlap compute/comm |
SparseCore Offloading (v7x recommended):
| Flag | Purpose |
|------|---------|
| xla_tpu_enable_sparse_core_collective_offload_all_gather=true | AllGather via SparseCore |
| xla_tpu_enable_sparse_core_collective_offload_reduce_scatter=true | ReduceScatter via SparseCore |
| xla_tpu_enable_sparse_core_collective_offload_all_reduce=true | AllReduce via SparseCore |
CF and SparseCore are mutually exclusive for the same primitive.
| Symptom | Likely Missing Flags |
|---------|---------------------|
| AllReduce stall high | fuse_all_reduce=true or SparseCore offload |
| AllGather stall high | fuse_all_gather=true or SparseCore offload |
| All comms not overlapped | async_collective_fusion=true base flag |
| MoE comms not optimized | DATA_PARALLEL_OVERLAP flags |
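A flag audit along these lines can be sketched as follows. It assumes `LIBTPU_INIT_ARGS` is a whitespace-separated string of `--name=value` tokens, and it checks only the CF and SparseCore flags listed above. The function names are hypothetical.

```python
CF_PREFIX = "xla_tpu_enable_async_collective_fusion_fuse_"
SC_PREFIX = "xla_tpu_enable_sparse_core_collective_offload_"
PRIMITIVES = ("all_gather", "all_reduce", "reduce_scatter")


def parse_flags(libtpu_init_args: str) -> dict[str, str]:
    """Split '--flag=value' tokens into a {flag: value} dict."""
    flags = {}
    for token in libtpu_init_args.split():
        key, _, value = token.lstrip("-").partition("=")
        flags[key] = value.lower()
    return flags


def audit_collective_flags(libtpu_init_args: str) -> dict[str, list[str]]:
    """Report primitives with both CF and SparseCore enabled, or with neither."""
    flags = parse_flags(libtpu_init_args)
    report = {"conflicts": [], "uncovered": []}
    for prim in PRIMITIVES:
        cf = flags.get(CF_PREFIX + prim) == "true"
        sc = flags.get(SC_PREFIX + prim) == "true"
        if cf and sc:
            report["conflicts"].append(prim)  # mutually exclusive per primitive
        elif not cf and not sc:
            report["uncovered"].append(prim)
    return report


args = (
    "--xla_tpu_enable_async_collective_fusion=true "
    "--xla_tpu_enable_async_collective_fusion_fuse_all_gather=true "
    "--xla_tpu_enable_sparse_core_collective_offload_all_gather=true "
    "--xla_tpu_enable_async_collective_fusion_fuse_all_reduce=true"
)
print(audit_collective_flags(args))
```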
| Source | Best For | Notes |
|--------|----------|-------|
| xprof_memory() MCP | Quick peak/capacity/fragmentation | Use first |
| memory_viewer.json | Peak composition, optimization targets | Static upper bound |
| xplane.pb /host:CPU | Runtime actual HBM | actual = reserved - available |
| hlo_proto.pb | Buffer-level detail | Large files, slow |
Critical: bytes_allocated only tracks dynamic I/O buffers, NOT internal workspace (~70 GiB). Actual HBM = bytes_reserved - bytes_available.
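As a worked example of that formula, the sketch below computes actual HBM and the portion that `bytes_allocated` misses. The three counter values are made up for illustration.

```python
GIB = 1024**3


def actual_hbm_bytes(bytes_reserved: int, bytes_available: int) -> int:
    """Actual HBM in use = bytes_reserved - bytes_available."""
    return bytes_reserved - bytes_available


def untracked_bytes(bytes_reserved: int, bytes_available: int, bytes_allocated: int) -> int:
    """HBM in use that bytes_allocated does not account for (internal workspace)."""
    return actual_hbm_bytes(bytes_reserved, bytes_available) - bytes_allocated


# Made-up counters: 180 GiB reserved, 20 GiB available, 90 GiB allocated.
reserved, available, allocated = 180 * GIB, 20 * GIB, 90 * GIB
print(actual_hbm_bytes(reserved, available) / GIB)  # 160.0
print(untracked_bytes(reserved, available, allocated) / GIB)  # 70.0
```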
maxHeap)
| Category | Rule (tfOpName / groupName) | Typical Shape |
|----------|--------------------------------|---------------|
| Parameters | groupName=="Parameter" + state.params | Weight shapes |
| Optimizer | groupName=="Parameter" + opt_state | Same as weights, f32 |
| Logits/CE | logits_dense / cross_entropy / softmax | [B,T,vocab] |
| MoE experts | moe / expert / gmm / router | [num_experts,dim,dim] |
| Attention/MLA | attention / mla / wq_ / wkv_ | [B,T,H,D] |
| Optimization | Expected Savings | Impact |
|-------------|-----------------|--------|
| Chunked cross-entropy | ~18 GiB (3x vocab logits) | No accuracy impact |
| save_out_proj remat | Large activation savings | ~33% recompute increase |
| Increase FSDP | Linear per-chip param reduction | More communication |
| bf16 optimizer (mu_dtype) | ~2 GiB (mu f32 -> bf16) | May affect stability |
Vocab-sized buffer detection: Search maxHeap for any dimension >= 100K → chunked CE targets.
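The vocab-sized buffer search can be sketched as a shape scan. This assumes each maxHeap entry is a dict carrying an HLO-style shape string such as `bf16[8,4096,151936]{2,1,0}` under a `shape` key. The key name and the sample entries are assumptions, so check them against the actual memory_viewer.json.

```python
import re


def shape_dims(shape: str) -> list[int]:
    """Parse 'bf16[8,4096,151936]{2,1,0}' into [8, 4096, 151936]."""
    match = re.search(r"\[([\d,\s]*)\]", shape)
    if not match or not match.group(1).strip():
        return []  # scalar or unparseable shape
    return [int(d) for d in match.group(1).split(",")]


def vocab_sized(buffers: list[dict], threshold: int = 100_000) -> list[dict]:
    """Keep buffers with any dimension >= threshold (chunked CE targets)."""
    return [b for b in buffers if any(d >= threshold for d in shape_dims(b["shape"]))]


# Hypothetical maxHeap entries.
max_heap = [
    {"name": "logits_dense", "shape": "bf16[8,4096,151936]{2,1,0}"},
    {"name": "wq_proj", "shape": "bf16[4096,4096]{1,0}"},
    {"name": "step", "shape": "s32[]"},
]
print([b["name"] for b in vocab_sized(max_heap)])  # ['logits_dense']
```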
From memory_viewer.json heapSizes timeline:
Forward: Y = gmm(X, W)
Backward:
dlhs = gmm(dY, W^T) # activation gradient → convolution_bitcast_fusion
drhs = tgmm(X^T, dY) # weight gradient → tgmm.*
Layer count: MoE_layers = GMM_count / 4 = TGMM_count / 2
XLA schedules backward in two loops: (1) all dlhs (serial), then (2) all tgmm (parallelizable).
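The layer-count relation doubles as a sanity check on the trace: if the two estimates disagree, the event list is probably truncated or the name patterns missed some ops. A small sketch, with a hypothetical function name:

```python
def moe_layer_count(gmm_count: int, tgmm_count: int) -> int:
    """MoE_layers = GMM_count / 4 = TGMM_count / 2, with a consistency check."""
    if gmm_count % 4 or tgmm_count % 2:
        raise ValueError("op counts are not whole multiples of the per-layer count")
    layers_from_gmm = gmm_count // 4
    layers_from_tgmm = tgmm_count // 2
    if layers_from_gmm != layers_from_tgmm:
        raise ValueError(
            f"GMM implies {layers_from_gmm} layers, TGMM implies {layers_from_tgmm}"
        )
    return layers_from_gmm


print(moe_layer_count(gmm_count=48, tgmm_count=24))  # 12
```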
When MCP is insufficient, use tpu-profiling/scripts/xprof.py:
xprof memory peak --run-dir <run> # Peak HBM composition
xprof memory diagnose --run-dir <run> # Anomaly detection
xprof compute breakdown --run-dir <run> # Op category breakdown
xprof compute mfu --run-dir <run> --gbs 5120 --num-chips 128
xprof comm overlap --run-dir <run> # Async-done stall analysis
xprof compare --runs 115,183 # A/B comparison
xprof audit flags --flags-file launch.sh # XLA flags check
Data source priority: xplane.pb (complete) > trace.json.gz (1M event hard limit — check Complete (X) events).
| Pitfall | Correct |
|---------|---------|
| XLA convolution = CNN | XLA uses convolution for all matmul |
| hlo_category distinguishes fwd/bwd | Cannot — use op name patterns or tf_op |
| Low MFU = poor utilization | MoE MFU naturally low; understand per-token FLOPs |
| bytes_allocated = actual HBM | Only dynamic I/O buffers; actual = reserved - available |
| memory_viewer peak = runtime peak | Static upper bound; runtime 5-15% lower |
| trace.json.gz is complete | 1M event hard limit — use xplane.pb if truncated |
| Comms not overlapped = need algorithm change | Check XLA flags first (CF/SparseCore) |
| Single-device interval overlap = comm overlap | Single device is serial; measure async-done duration |
| Logits size is fixed | Chunked CE: [B,T,vocab] → [B,chunk,vocab], saving ~18 GiB |