plugins/tpu-perf-model/skills/tpu-perf-model/SKILL.md
Use when analyzing theoretical TPU v7x performance for a mathematical formula or comparing kernel performance against theoretical bounds. Trigger when the user asks about TPU performance modeling, roofline analysis, data flow optimization, or tiling strategy.
Install: `npx skillsauth add primatrix/skills tpu-perf-model`. This installs the skill globally with one command and works with Claude Code, Cursor, and Windsurf.
Theoretical performance modeling tool for TPU v7x centered on the Register ↔ VMEM ↔ HBM data flow hierarchy.
| Component | Spec |
|-----------|------|
| HBM | 192 GB, 3690 GB/s |
| VMEM | 64 MiB on-chip scratchpad |
| SPR | 4096 scalar registers (32-bit) |
| VPR | 32 vector registers (8×128×32-bit = 4 KiB each) |
| MXU | 2307 TFLOPS BF16 (dual MXU) |
| VPU | Vector processing unit (elementwise/reduce) |
| Ridge Point | ~625 FLOPs/byte |
| Alignment | Block dims must be divisible by 128 |
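The ridge point in the table follows from dividing peak MXU throughput by HBM bandwidth. The sketch below reproduces that figure and classifies one matmul as memory-bound or compute-bound. The function names are illustrative and not part of the skill's scripts, and the byte count assumes each tensor crosses HBM exactly once.

```python
# Roofline sketch built from the spec table (illustrative helper names).
PEAK_BF16_FLOPS = 2307e12  # MXU peak, FLOP/s
HBM_BW_BYTES = 3690e9      # HBM bandwidth, bytes/s


def ridge_point() -> float:
    """Arithmetic intensity (FLOPs/byte) at which compute time equals HBM time."""
    return PEAK_BF16_FLOPS / HBM_BW_BYTES


def attainable_flops(flops: float, hbm_bytes: float) -> float:
    """Roofline bound: min(peak compute, intensity * bandwidth)."""
    intensity = flops / hbm_bytes
    return min(PEAK_BF16_FLOPS, intensity * HBM_BW_BYTES)


def bottleneck(flops: float, hbm_bytes: float) -> str:
    """Label a step the way the report does: HBM_BW or COMPUTE."""
    return "COMPUTE" if flops / hbm_bytes >= ridge_point() else "HBM_BW"


# Example: bf16 matmul [4096,128] x [128,4096].
# Assumption: A, B, and C each move through HBM exactly once.
M, N, K = 4096, 4096, 128
flops = 2 * M * N * K
hbm_bytes = 2 * (M * K + K * N + M * N)  # 2 bytes per bf16 element

print(f"ridge point: {ridge_point():.1f} FLOPs/byte")
print(f"intensity:   {flops / hbm_bytes:.1f} FLOPs/byte")
print(f"bottleneck:  {bottleneck(flops, hbm_bytes)}")
```

With a contraction dimension of only 128, this unfused matmul sits well below the ridge point, which is why fusion and tiling matter for the rest of the analysis.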
Given the user's math formula, decompose it into a list of ComputeStep objects. This layer defines the mathematical pipeline, the FLOPs model, and which steps can be fused before any fragment-level scheduling begins.
| Op Type | FLOPs Formula | Compute Unit | Notes |
|---------|---------------|--------------|-------|
| matmul [M,K]×[K,N] | 2×M×N×K | MXU | Includes multiply + accumulate |
| elementwise (unary) | N (elements) | VPU | exp, log, scale, sqrt |
| elementwise (binary) | N (elements) | VPU | add, mul, sub, div |
| reduce (sum/max/min) | N (elements) | VPU | Along one dimension |
| softmax [M,N] | 5×M×N | VPU | max + sub + exp + sum + div |
| layer_norm [M,N] | 7×M×N | VPU | mean + var + sub + div + scale + shift |
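The FLOPs table can be expressed as a small lookup. This is a minimal sketch with hypothetical function names (`step_flops`, `compute_unit`); it is not the code in `scripts/`, only a direct transcription of the formulas above.

```python
from math import prod
from typing import Optional, Sequence

# Per-element FLOP multipliers for VPU ops, copied from the table above.
VPU_FLOPS_PER_ELEMENT = {
    "elementwise": 1,
    "reduce": 1,
    "softmax": 5,
    "layer_norm": 7,
}


def step_flops(op_type: str, shape: Sequence[int], k: Optional[int] = None) -> int:
    """Return the modeled FLOPs for one step.

    For matmul, `shape` is the output shape [M, N] and `k` is the contraction
    dimension. For VPU ops, `shape` is the shape of the tensor being processed.
    """
    if op_type == "matmul":
        if k is None:
            raise ValueError("matmul needs the contraction dimension k")
        m, n = shape
        return 2 * m * n * k
    return VPU_FLOPS_PER_ELEMENT[op_type] * prod(shape)


def compute_unit(op_type: str) -> str:
    """Matmuls run on the MXU; everything else in the table runs on the VPU."""
    return "MXU" if op_type == "matmul" else "VPU"


print(step_flops("matmul", (4096, 4096), k=128))
print(step_flops("softmax", (4096, 4096)))
```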
Determine fusable_with_prev for each step. Fusion keeps intermediate tensors in VMEM/VPR instead of writing back to HBM.
| Pattern | Fusable? | VPR Pressure |
|---------|----------|--------------|
| matmul → elementwise | YES | Low |
| matmul → reduce | MAYBE | Medium (accumulator VPRs) |
| elementwise → elementwise | YES | Low |
| elementwise → reduce | YES | Low |
| matmul → matmul | NO | Very high |
| reduce → elementwise | YES | Low |
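One way to apply the fusion table mechanically is a lookup keyed on the (previous op, current op) pair. The sketch below is illustrative only: the helper names are hypothetical, and treating any pair that is absent from the table as not fusable is a conservative assumption, not something the table states.

```python
# Fusion verdicts copied from the table above.
FUSION_RULES = {
    ("matmul", "elementwise"): "YES",
    ("matmul", "reduce"): "MAYBE",
    ("elementwise", "elementwise"): "YES",
    ("elementwise", "reduce"): "YES",
    ("matmul", "matmul"): "NO",
    ("reduce", "elementwise"): "YES",
}


def fusable_with_prev(prev_op: str, op: str, allow_maybe: bool = False) -> bool:
    """Look up the verdict; unknown pairs default to NO (assumption)."""
    verdict = FUSION_RULES.get((prev_op, op), "NO")
    return verdict == "YES" or (allow_maybe and verdict == "MAYBE")


def annotate(op_types: list, allow_maybe: bool = False) -> list:
    """Return one fusable_with_prev flag per step, in order."""
    if not op_types:
        return []
    flags = [False]  # the first step has no predecessor to fuse with
    for prev_op, op in zip(op_types, op_types[1:]):
        flags.append(fusable_with_prev(prev_op, op, allow_maybe))
    return flags


print(annotate(["matmul", "elementwise", "reduce", "matmul"]))
```

The `allow_maybe` switch keeps the MAYBE case explicit, since matmul → reduce depends on how many accumulator VPRs the tile needs.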
Write a JSON file containing an array of steps. In the template below, `M`, `K`, and `N` inside `shape` stand for concrete integers, and `|` separates the allowed values of a field:

```json
[
  {
    "name": "descriptive_name",
    "op_type": "matmul|elementwise|reduce|softmax",
    "inputs": [{"name": "A", "shape": [M, K], "dtype": "bf16"}],
    "outputs": [{"name": "C", "shape": [M, N], "dtype": "bf16"}],
    "flops_formula": "2*M*N*K",
    "flops_vars": {"M": 4096, "N": 4096, "K": 128},
    "compute_unit": "MXU|VPU",
    "fusable_with_prev": false
  }
]
```
Save to a temporary file, e.g., steps.json.
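Generating the file from Python avoids hand-written JSON mistakes such as unquoted placeholders. The sketch below fills the template for a single matmul step. The helper names are hypothetical, the second input `B` is added on the assumption that the schema accepts more than one input, and whether `scripts/cli.py` requires further fields has not been verified here.

```python
import json


def build_matmul_step(name: str, m: int, n: int, k: int, dtype: str = "bf16") -> dict:
    """Fill the step template for C[m, n] = A[m, k] @ B[k, n]."""
    return {
        "name": name,
        "op_type": "matmul",
        "inputs": [
            {"name": "A", "shape": [m, k], "dtype": dtype},
            {"name": "B", "shape": [k, n], "dtype": dtype},  # assumed second input
        ],
        "outputs": [{"name": "C", "shape": [m, n], "dtype": dtype}],
        "flops_formula": "2*M*N*K",
        "flops_vars": {"M": m, "N": n, "K": k},
        "compute_unit": "MXU",
        "fusable_with_prev": False,
    }


def write_steps(path: str, steps: list) -> None:
    """Serialize the step list as a JSON array."""
    with open(path, "w", encoding="utf-8") as f:
        json.dump(steps, f, indent=2)


steps = [build_matmul_step("qk_matmul", 4096, 4096, 128)]
print(json.dumps(steps, indent=2))
```

Calling `write_steps("steps.json", steps)` then produces the file that the CLI commands take through `--steps`.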
After the ComputeStep list is defined, refine the analysis into fragment-level dataflow. Your job is to explain the schedule as a resource-constrained mathematical model, not as a vague optimization intuition.
At this layer, you must explicitly reason about:
- where each fragment resides across `HBM`, `VMEM`, and `REG`
- which `DMA`, `MXU`, and `VPU` micro-ops consume each fragment

When describing the optimal schedule, use the VMEM and register constraints directly:

- `sum(vmem_live_bytes(t)) <= VMEM_CAPACITY`
- `sum(reg_groups_live(t)) <= REG_GROUP_CAPACITY`
- `start(B) >= end(A)` for each dependency edge `A -> B`
- `makespan = max(end(op_i))`

The point of this layer is to answer:

- At time `t`, which data is in registers?

```bash
# Basic analysis
python scripts/cli.py --steps steps.json

# JSON output
python scripts/cli.py --steps steps.json --format json

# Micro-op analysis
python scripts/cli.py --steps steps.json --analysis-level micro

# Micro-op JSON output
python scripts/cli.py --steps steps.json --format json --analysis-level micro

# Micro-op analysis with timeline details
python scripts/cli.py --steps steps.json --show-timeline --analysis-level micro

# Micro-op analysis with pipeline diagram
python scripts/cli.py --steps steps.json --analysis-level micro --mermaid

# Pipeline diagram showing first 5 tiles
python scripts/cli.py --steps steps.json --analysis-level micro --mermaid --max-tiles 5

# With detailed tiling analysis
python scripts/cli.py --steps steps.json --tiling

# Compare with measured profile data
python scripts/cli.py --steps steps.json --eval eval_result.json
```
The scripts/ directory is at: plugins/tpu-perf-model/skills/tpu-perf-model/scripts/
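The schedule constraints listed earlier (VMEM capacity, register group capacity, dependency ordering, and makespan) can be checked mechanically for a candidate schedule. The sketch below is not the CLI's implementation. The `Op` record and `check_schedule` are hypothetical, the register group capacity is passed in because this document does not state it, and each op is assumed to hold its resources only for its own `[start, end)` interval, which is a simplification.

```python
from dataclasses import dataclass, field


@dataclass
class Op:
    name: str
    start: int            # ns
    end: int              # ns
    vmem_bytes: int = 0   # VMEM held live during [start, end)
    reg_groups: int = 0   # register groups held live during [start, end)
    deps: list = field(default_factory=list)  # names of ops that must finish first


def check_schedule(ops: list, vmem_capacity: int, reg_group_capacity: int) -> dict:
    """Validate a schedule against the four constraints and report its makespan."""
    by_name = {op.name: op for op in ops}
    violations = []

    # start(B) >= end(A) for each dependency edge A -> B
    for op in ops:
        for dep in op.deps:
            if op.start < by_name[dep].end:
                violations.append(f"{op.name} starts before {dep} ends")

    # Occupancy only rises at op start times, so checking those instants suffices.
    for t in sorted({op.start for op in ops}):
        live = [op for op in ops if op.start <= t < op.end]
        if sum(op.vmem_bytes for op in live) > vmem_capacity:
            violations.append(f"VMEM over capacity at t={t}")
        if sum(op.reg_groups for op in live) > reg_group_capacity:
            violations.append(f"register groups over capacity at t={t}")

    return {"makespan": max(op.end for op in ops), "violations": violations}


MiB = 1 << 20
schedule = [
    Op("dma_a", 0, 100, vmem_bytes=1 * MiB),
    Op("dma_b", 0, 100, vmem_bytes=1 * MiB),
    Op("mxu", 100, 250, vmem_bytes=2 * MiB, reg_groups=8, deps=["dma_a", "dma_b"]),
    Op("store", 250, 300, vmem_bytes=1 * MiB, deps=["mxu"]),
]
result = check_schedule(schedule, vmem_capacity=64 * MiB, reg_group_capacity=32)
print(result)
```

A fuller model would keep a fragment live from its producer's start to its last consumer's end. The interval-per-op form is used here only to keep the constraint check readable.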
For each step, the report shows:
- Bottleneck: `HBM_BW` (memory-bound) or `COMPUTE` (compute-bound)

| Observation | Action |
|-------------|--------|
| Bottleneck = HBM_BW | Increase tile size, enable fusion, reduce data movement |
| Bottleneck = COMPUTE | Current tiling is good, focus on MXU utilization |
| Pipeline balance ratio ≫ 1 | DMA dominates: increase compute per tile |
| Pipeline balance ratio ≪ 1 | Compute dominates: tiles are large enough |
| Fusion savings > 0 | Verify fusion is implemented in actual kernel |
| Low efficiency vs peak | Multiple optimization opportunities exist |
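To make the pipeline balance rows concrete, the sketch below estimates the ratio for a single matmul tile. It assumes the ratio is DMA time divided by compute time per tile, which matches the table's reading that a ratio far above 1 means DMA dominates, but the CLI's exact definition is not confirmed here. It also assumes each tile loads both inputs and stores its output with no reuse across tiles. `pipeline_balance` is a hypothetical name.

```python
PEAK_BF16_FLOPS = 2307e12  # MXU peak, FLOP/s
HBM_BW_BYTES = 3690e9      # HBM bandwidth, bytes/s


def pipeline_balance(tm: int, tn: int, k: int, bytes_per_elem: int = 2) -> float:
    """DMA time / compute time for one [tm, k] x [k, tn] tile (assumed definition)."""
    for dim in (tm, tn, k):
        if dim % 128 != 0:
            raise ValueError(f"block dim {dim} is not divisible by 128")
    dma_bytes = bytes_per_elem * (tm * k + k * tn + tm * tn)  # load A, load B, store C
    dma_seconds = dma_bytes / HBM_BW_BYTES
    compute_seconds = 2 * tm * tn * k / PEAK_BF16_FLOPS
    return dma_seconds / compute_seconds


for tile in (128, 512, 1024):
    print(tile, round(pipeline_balance(tile, tile, 128), 2))
```

Under these assumptions the ratio falls as the tile grows, which is the mechanism behind the "increase compute per tile" action.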
| Observation | Action |
|-------------|--------|
| Peak VPR > 24/32 (75%) | Approaching register limit, consider smaller tiles |
| Peak VPR = 32/32 | At register limit, no room for fusion |
| Spill count > 0 | Register spills detected: reduce tile size or unfuse ops |
| VPR per tile > 16 | Large tiles: verify MXU utilization justifies the VPR cost |
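The VPR thresholds above can be turned into a simple check. In this sketch the helper names are hypothetical, and the tile-to-register estimate assumes values are held as densely packed 32-bit elements in 4 KiB registers, which may not match how the real compiler allocates registers.

```python
from math import ceil

VPR_COUNT = 32
VPR_BYTES = 8 * 128 * 4  # 8 x 128 lanes of 32-bit values = 4096 bytes


def vprs_for_tile(rows: int, cols: int, bytes_per_elem: int = 4) -> int:
    """Registers needed to hold one tile, assuming dense packing (assumption)."""
    return ceil(rows * cols * bytes_per_elem / VPR_BYTES)


def vpr_actions(peak_vpr: int, spill_count: int = 0, vpr_per_tile: int = 0) -> list:
    """Map report observations to the actions in the table above."""
    actions = []
    if spill_count > 0:
        actions.append("Register spills detected: reduce tile size or unfuse ops")
    if peak_vpr >= VPR_COUNT:
        actions.append("At register limit, no room for fusion")
    elif peak_vpr > 24:
        actions.append("Approaching register limit, consider smaller tiles")
    if vpr_per_tile > 16:
        actions.append("Large tiles: verify MXU utilization justifies the VPR cost")
    return actions


print(vprs_for_tile(128, 128))
print(vpr_actions(peak_vpr=28, spill_count=0, vpr_per_tile=vprs_for_tile(128, 128)))
```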
When using micro-op mode, interpret the report as a fragment-level execution plan:
- which `VMEM` slots, register groups, and units are active over time
- whether each stall is `WAIT_DATA`, `WAIT_UNIT`, `WAIT_VMEM`, or `WAIT_REG`

Use this mode when the user asks for finer-grained dataflow, explicit dependency reasoning, or a proof-like explanation of why one schedule is faster than another.
When you answer with the micro-op model, use these sections in order:
Do not collapse this into a generic summary. The point is to make the dataflow explicit enough that the user can see which fragments, units, and constraints control performance.
When writing analysis conclusions, use Chinese for all narrative text:
When using micro-op analysis, ALWAYS include the Mermaid pipeline diagrams by adding --mermaid to the CLI command:
```bash
python scripts/cli.py --steps steps.json --analysis-level micro --mermaid
```
Include the generated Mermaid blocks in your output. The --mermaid flag produces two complementary diagrams:
Shows VMEM slot and REG group occupancy over time. Each row is a storage resource, not an execution unit.
- VMEM slot rows: `slot_name [op_label]`
- REG group rows: `reg_name [op_label]`
- `crit` bars between intervals on the same resource, labeled with the wait reason (`WAIT_DATA`, `WAIT_UNIT`, `WAIT_VMEM`, `WAIT_REG`)

Shows the first 3 tiles by default. Use `--max-tiles N` to adjust.
One flowchart per tile showing data movement through the memory hierarchy:
- Nodes: `HBM: tensor[shape] dtype`, `VMEM slot: tensor[shape]`, `REG group: tensor[shape]`
- Solid arrows (`-->`): data transfers labeled with `op_kind latency`
- Dashed arrows (`-. .-->`): stall/wait relationships labeled with the reason

Use the flowchart to trace how data flows from HBM through VMEM into registers, through compute, and back.
After generating the Mermaid pipeline diagram, render it to a PNG and open it for visual inspection:
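1. Extract the Mermaid source (without the ```` ```mermaid ```` and ```` ``` ```` fences) and save it to a `.mmd` file
2. Render with `mmdc` and open:

```bash
# Save Mermaid source (without fences) to file
cat > pipeline.mmd << 'EOF'
<paste mermaid content here>
EOF

# Render to PNG and open (`open` is the macOS command; use `xdg-open` on Linux)
npx -y -p @mermaid-js/mermaid-cli mmdc -i pipeline.mmd -o pipeline.png && open pipeline.png
```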
The timeline x-axis displays values in nanoseconds (axisFormat %Q renders the raw numeric timestamps). The title suffix (ns) confirms the unit.
When comparing against eval_result.json from pallas-evolve profiling:
| Gap | Diagnosis |
|-----|-----------|
| HBM bytes: measured > theoretical | Missing fusion or redundant loads |
| MXU util: measured < theoretical | Tile too small, alignment issues |
| Vector spills > 0 | Register pressure: reduce fusion or tile size |
| Total time: measured ≫ theoretical | Significant optimization headroom |
Formula: Y = softmax(QK^T / sqrt(d)) @ V, shapes Q/K/V = [4096, 128]
See scripts/examples/flash_attention.json for the decomposed steps.
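As a rough cross-check of this example, the FLOPs table can be applied by hand. The four-step decomposition below (QKᵀ matmul, scale, softmax, PV matmul) is an assumption and may differ from what `flash_attention.json` contains. The HBM byte count assumes a fully fused kernel that reads Q, K, V once and writes Y once in bf16.

```python
PEAK_BF16_FLOPS = 2307e12  # MXU peak, FLOP/s
HBM_BW_BYTES = 3690e9      # HBM bandwidth, bytes/s

S, D = 4096, 128  # Q, K, V are each [S, D]

# (name, compute unit, FLOPs) using the formulas from the FLOPs table.
steps = [
    ("qk_matmul", "MXU", 2 * S * S * D),  # [S, D] x [D, S]
    ("scale", "VPU", S * S),              # divide scores by sqrt(d)
    ("softmax", "VPU", 5 * S * S),        # row-wise softmax over [S, S]
    ("pv_matmul", "MXU", 2 * S * D * S),  # [S, S] x [S, D]
]

total_flops = sum(flops for _, _, flops in steps)
mxu_flops = sum(flops for _, unit, flops in steps if unit == "MXU")

# Assumption: fully fused, so only Q, K, V (reads) and Y (write) touch HBM.
fused_hbm_bytes = 4 * S * D * 2  # four [S, D] bf16 tensors
intensity = total_flops / fused_hbm_bytes
ridge = PEAK_BF16_FLOPS / HBM_BW_BYTES

print(f"total FLOPs:       {total_flops}")
print(f"fused HBM bytes:   {fused_hbm_bytes}")
print(f"intensity:         {intensity:.0f} FLOPs/byte (ridge {ridge:.0f})")
print(f"MXU lower bound:   {mxu_flops / PEAK_BF16_FLOPS * 1e6:.2f} us")
print(f"HBM lower bound:   {fused_hbm_bytes / HBM_BW_BYTES * 1e6:.2f} us")
```

Under these assumptions the fused formula lands above the ridge point, so the theoretical bound is set by compute rather than HBM bandwidth. Writing the [S, S] score matrix back to HBM between steps would change that conclusion.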