TPU Performance Model

Theoretical performance modeling tool for TPU v7x centered on the Register ↔ VMEM ↔ HBM data flow hierarchy.

When to Use

Before writing a Pallas kernel: predict theoretical performance, identify bottleneck, guide tiling
After profiling a kernel: compare theoretical vs measured to find optimization opportunities

TPU v7x Hardware Quick Reference

| Component | Spec |
|-----------|------|
| HBM | 192 GB, 3690 GB/s |
| VMEM | 64 MiB on-chip scratchpad |
| SPR | 4096 scalar registers (32bit) |
| VPR | 32 vector registers (8×128×32bit = 4KB each) |
| MXU | 2307 TFLOPS BF16 (dual MXU) |
| VPU | Vector processing unit (elementwise/reduce) |
| Ridge Point | ~625 FLOPs/byte |
| Alignment | Block dims must be divisible by 128 |
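
The ridge point and VPR size in the table follow from the other entries. A minimal sketch of that arithmetic, using only the table's numbers (the constant and function names are illustrative, not part of the tool):

```python
# Peak numbers copied from the quick-reference table.
PEAK_BF16_FLOPS = 2307e12   # MXU peak, FLOP/s
HBM_BANDWIDTH = 3690e9      # HBM bandwidth, bytes/s


def ridge_point(peak_flops: float, bandwidth: float) -> float:
    """Arithmetic intensity (FLOPs/byte) at which compute time equals HBM time."""
    return peak_flops / bandwidth


RIDGE = ridge_point(PEAK_BF16_FLOPS, HBM_BANDWIDTH)

# One vector register holds 8 x 128 lanes of 32-bit (4-byte) values.
VPR_BYTES = 8 * 128 * 4

print(f"ridge point = {RIDGE:.1f} FLOPs/byte")   # ridge point = 625.2 FLOPs/byte
print(f"VPR size    = {VPR_BYTES} bytes")        # VPR size    = 4096 bytes
```

A step whose arithmetic intensity falls below this value is limited by HBM bandwidth; above it, by the MXU.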

Layer A: Formula -> ComputeStep

Given the user's math formula, decompose it into a list of ComputeStep objects. This layer defines the mathematical pipeline, the FLOPs model, and which steps can be fused before any fragment-level scheduling begins.

FLOPs Reference Table

| Op Type | FLOPs Formula | Compute Unit | Notes |
|---------|---------------|--------------|-------|
| matmul [M,K]×[K,N] | 2×M×N×K | MXU | Includes multiply + accumulate |
| elementwise (unary) | N (elements) | VPU | exp, log, scale, sqrt |
| elementwise (binary) | N (elements) | VPU | add, mul, sub, div |
| reduce (sum/max/min) | N (elements) | VPU | Along one dimension |
| softmax [M,N] | 5×M×N | VPU | max + sub + exp + sum + div |
| layer_norm [M,N] | 7×M×N | VPU | mean + var + sub + div + scale + shift |
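
The table can be read as a small lookup. The sketch below encodes it directly; `step_flops` and its argument names are hypothetical helpers for illustration and are not functions of the CLI:

```python
# FLOPs per element for the VPU ops in the reference table.
PER_ELEMENT_FLOPS = {
    "elementwise": 1,
    "reduce": 1,
    "softmax": 5,
    "layer_norm": 7,
}


def step_flops(op_type: str, m: int, n: int, k: int = 1) -> int:
    """FLOPs for one step.

    matmul: [m, k] x [k, n], counted as 2*m*n*k.
    other ops: the tensor is [m, n] and k is ignored.
    """
    if op_type == "matmul":
        return 2 * m * n * k
    return PER_ELEMENT_FLOPS[op_type] * m * n


print(step_flops("matmul", 4096, 4096, 128))   # 4294967296
print(step_flops("softmax", 4096, 4096))       # 83886080
```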

Fusion Rules

Determine fusable_with_prev for each step. Fusion keeps intermediate tensors in VMEM/VPR instead of writing back to HBM.

| Pattern | Fusable? | VPR Pressure |
|---------|----------|-------------|
| matmul → elementwise | YES | Low |
| matmul → reduce | MAYBE | Medium (accumulator VPRs) |
| elementwise → elementwise | YES | Low |
| elementwise → reduce | YES | Low |
| matmul → matmul | NO | Very high |
| reduce → elementwise | YES | Low |
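
One way to apply these rules mechanically is a lookup keyed on the (previous op, current op) pair. This is a sketch under two assumptions that the table does not state: MAYBE and unlisted patterns are treated conservatively as not fusable, and the first step has nothing to fuse with. The function name mirrors the JSON field but is otherwise hypothetical.

```python
from typing import Optional

# (previous op, current op) -> verdict, copied from the fusion table.
FUSION_RULES = {
    ("matmul", "elementwise"): "YES",
    ("matmul", "reduce"): "MAYBE",
    ("elementwise", "elementwise"): "YES",
    ("elementwise", "reduce"): "YES",
    ("matmul", "matmul"): "NO",
    ("reduce", "elementwise"): "YES",
}


def fusable_with_prev(prev_op: Optional[str], op: str) -> bool:
    """Return True only for patterns the table marks YES.

    Assumption: MAYBE and unlisted patterns count as not fusable.
    """
    if prev_op is None:
        return False
    return FUSION_RULES.get((prev_op, op), "NO") == "YES"


ops = ["matmul", "elementwise", "reduce", "matmul"]
flags = [fusable_with_prev(p, o) for p, o in zip([None] + ops[:-1], ops)]
print(flags)   # [False, True, True, False]
```

A MAYBE pattern deserves a manual look at accumulator VPR usage before setting the flag to true.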

ComputeStep JSON Format

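Write a JSON file containing an array of steps. In the template below, pipe-separated values such as "matmul|elementwise|reduce|softmax" list the alternatives (pick one), and M, K, N in the shapes are placeholders for concrete integers: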

[
  {
    "name": "descriptive_name",
    "op_type": "matmul|elementwise|reduce|softmax",
    "inputs": [{"name": "A", "shape": [M, K], "dtype": "bf16"}],
    "outputs": [{"name": "C", "shape": [M, N], "dtype": "bf16"}],
    "flops_formula": "2*M*N*K",
    "flops_vars": {"M": 4096, "N": 4096, "K": 128},
    "compute_unit": "MXU|VPU",
    "fusable_with_prev": false
  }
]

Save to a temporary file, e.g., steps.json.
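
A concrete file can be generated rather than hand-written. The sketch below fills the template for a single matmul step. The helper `matmul_step` and the step name `qk_matmul` are illustrative, and listing the second operand `B` under `inputs` is an assumption: the template only shows `A`.

```python
import json


def matmul_step(name: str, m: int, n: int, k: int, dtype: str = "bf16") -> dict:
    """Build one ComputeStep dict for a [m, k] x [k, n] matmul."""
    return {
        "name": name,
        "op_type": "matmul",
        "inputs": [
            {"name": "A", "shape": [m, k], "dtype": dtype},
            # Assumption: the second operand is listed alongside A.
            {"name": "B", "shape": [k, n], "dtype": dtype},
        ],
        "outputs": [{"name": "C", "shape": [m, n], "dtype": dtype}],
        "flops_formula": "2*M*N*K",
        "flops_vars": {"M": m, "N": n, "K": k},
        "compute_unit": "MXU",
        "fusable_with_prev": False,
    }


steps = [matmul_step("qk_matmul", m=4096, n=4096, k=128)]

# Write the array of steps where the CLI examples expect it.
with open("steps.json", "w") as f:
    json.dump(steps, f, indent=2)
```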

Layer B: ComputeStep -> TensorFragment -> MicroOp -> Schedule

After the ComputeStep list is defined, refine the analysis into fragment-level dataflow. Your job is to explain the schedule as a resource-constrained mathematical model, not as a vague optimization intuition.

At this layer, you must explicitly reason about:

Which tensor fragments or tiles exist at each stage
Which fragments live in HBM, VMEM, and REG
Which DMA, MXU, and VPU micro-ops consume each fragment
Which dependencies block later micro-ops from issuing
Which fragments are retained, evicted, or reloaded under VMEM pressure
Why the reported critical path determines total latency

When describing the optimal schedule, use the VMEM and register constraints directly:

sum(vmem_live_bytes(t)) <= VMEM_CAPACITY for every time t
sum(reg_groups_live(t)) <= REG_GROUP_CAPACITY for every time t
start(B) >= end(A) for each dependency edge A -> B
makespan = max(end(op_i)) over all micro-ops i
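
These constraints can be checked mechanically for a candidate schedule. The sketch below is not the tool's scheduler. It assumes a simplified model in which each micro-op holds its VMEM bytes and register groups exactly while it runs, and the `MicroOp` fields and the register capacity of 32 are illustrative choices.

```python
from dataclasses import dataclass
from typing import Sequence, Tuple


@dataclass(frozen=True)
class MicroOp:
    name: str
    start: float                 # ns
    end: float                   # ns
    vmem_bytes: int = 0          # VMEM held while the op runs (assumption)
    reg_groups: int = 0          # register groups held while the op runs
    deps: Tuple[str, ...] = ()   # names of ops that must finish first


def check_schedule(ops: Sequence[MicroOp], vmem_capacity: int, reg_capacity: int) -> float:
    """Validate dependency and capacity constraints, then return the makespan."""
    by_name = {op.name: op for op in ops}
    for op in ops:
        for dep in op.deps:
            if op.start < by_name[dep].end:
                raise ValueError(f"{op.name} starts before {dep} ends")
    # Live usage only increases at an op's start, so those instants suffice.
    for t in sorted({op.start for op in ops}):
        live = [op for op in ops if op.start <= t < op.end]
        if sum(op.vmem_bytes for op in live) > vmem_capacity:
            raise ValueError(f"VMEM over capacity at t={t}")
        if sum(op.reg_groups for op in live) > reg_capacity:
            raise ValueError(f"register groups over capacity at t={t}")
    return max(op.end for op in ops)


schedule = [
    MicroOp("dma_in", 0, 10, vmem_bytes=1 << 20),
    MicroOp("mxu", 10, 30, vmem_bytes=1 << 20, reg_groups=8, deps=("dma_in",)),
    MicroOp("dma_out", 30, 35, vmem_bytes=1 << 20, deps=("mxu",)),
]
makespan = check_schedule(schedule, vmem_capacity=64 * 1024 * 1024, reg_capacity=32)
print(makespan)   # 35
```

A schedule that passes this check is feasible; arguing that it is optimal still requires showing that no feasible schedule has a shorter critical path.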

The point of this layer is to answer:

At time t, which data is in registers?
Which buffer slots are occupied?
Which compute unit is active or stalled?
Why is this schedule optimal or near-optimal under the current VMEM limit?

Run Simulation

# Basic analysis
python scripts/cli.py --steps steps.json

# JSON output
python scripts/cli.py --steps steps.json --format json

# Micro-op analysis
python scripts/cli.py --steps steps.json --analysis-level micro

# Micro-op JSON output
python scripts/cli.py --steps steps.json --format json --analysis-level micro

# Micro-op analysis with timeline details
python scripts/cli.py --steps steps.json --show-timeline --analysis-level micro

# Micro-op analysis with pipeline diagram
python scripts/cli.py --steps steps.json --analysis-level micro --mermaid

# Pipeline diagram showing first 5 tiles
python scripts/cli.py --steps steps.json --analysis-level micro --mermaid --max-tiles 5

# With detailed tiling analysis
python scripts/cli.py --steps steps.json --tiling

# Compare with measured profile data
python scripts/cli.py --steps steps.json --eval eval_result.json

The scripts/ directory is at: plugins/tpu-perf-model/skills/tpu-perf-model/scripts/

Interpret Results

Per-Step Analysis

For each step, the report shows:

T(HBM): Time to transfer data between HBM and VMEM
T(compute): Time for MXU/VPU computation
T(step): Effective time with double-buffer pipeline overlap
Bottleneck: HBM_BW (memory-bound) or COMPUTE (compute-bound)
Arithmetic Intensity: FLOPs/byte — compare against ridge point (~625)
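
The per-step quantities can be approximated by hand with a roofline calculation. This sketch assumes ideal double-buffer overlap, so T(step) is taken as the larger of the two times. The tool's own formula is not shown in this document and may differ. `analyze_step` is a hypothetical helper, and the default peak applies to MXU steps only, since no VPU peak is given here.

```python
PEAK_MXU_FLOPS = 2307e12   # FLOP/s, from the hardware table
HBM_BW = 3690e9            # bytes/s, from the hardware table


def analyze_step(flops: float, hbm_bytes: float,
                 peak_flops: float = PEAK_MXU_FLOPS,
                 hbm_bw: float = HBM_BW) -> dict:
    """Roofline estimate for one step (assumes perfect DMA/compute overlap)."""
    t_hbm = hbm_bytes / hbm_bw
    t_compute = flops / peak_flops
    return {
        "t_hbm_s": t_hbm,
        "t_compute_s": t_compute,
        "t_step_s": max(t_hbm, t_compute),
        "bottleneck": "HBM_BW" if t_hbm > t_compute else "COMPUTE",
        "arithmetic_intensity": flops / hbm_bytes,
    }


# Unfused bf16 matmul [4096, 128] x [128, 4096]: read A and B, write C.
M, N, K, BYTES = 4096, 4096, 128, 2
result = analyze_step(flops=2 * M * N * K,
                      hbm_bytes=BYTES * (M * K + K * N + M * N))
print(result["bottleneck"])                       # HBM_BW
print(round(result["arithmetic_intensity"], 2))   # 120.47
```

At about 120 FLOPs/byte this step sits well below the ridge point, which is why writing the [4096, 4096] result back to HBM dominates.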

Key Decisions from Results

| Observation | Action |
|-------------|--------|
| Bottleneck = HBM_BW | Increase tile size, enable fusion, reduce data movement |
| Bottleneck = COMPUTE | Current tiling is good, focus on MXU utilization |
| Pipeline balance ratio ≫ 1 | DMA dominates: increase compute per tile |
| Pipeline balance ratio ≪ 1 | Compute dominates: tiles are large enough |
| Fusion savings > 0 | Verify fusion is implemented in actual kernel |
| Low efficiency vs peak | Multiple optimization opportunities exist |

VPR Pressure Analysis

| Observation | Action |
|-------------|--------|
| Peak VPR > 24/32 (75%) | Approaching register limit, consider smaller tiles |
| Peak VPR = 32/32 | At register limit, no room for fusion |
| Spill count > 0 | Register spills detected: reduce tile size or unfuse ops |
| VPR per tile > 16 | Large tiles: verify MXU utilization justifies the VPR cost |
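
The thresholds above translate into a small rule check. In this sketch both helper names are hypothetical, and `vprs_for_tile` assumes a tile of 32-bit values packed densely into 4096-byte registers, which ignores any layout or packing rules the compiler applies.

```python
import math

VPR_COUNT = 32
VPR_BYTES = 8 * 128 * 4   # 4096 bytes per vector register


def vprs_for_tile(rows: int, cols: int, bytes_per_element: int = 4) -> int:
    """Registers needed for a densely packed tile (simplifying assumption)."""
    return math.ceil(rows * cols * bytes_per_element / VPR_BYTES)


def vpr_actions(peak_vprs: int, spills: int = 0, vprs_per_tile: int = 0) -> list:
    """Map reported VPR figures to the actions in the table."""
    actions = []
    if spills > 0:
        actions.append("register spills: reduce tile size or unfuse ops")
    if peak_vprs >= VPR_COUNT:
        actions.append("at register limit: no room for fusion")
    elif peak_vprs > 24:
        actions.append("approaching register limit: consider smaller tiles")
    if vprs_per_tile > 16:
        actions.append("large tiles: verify MXU utilization justifies the VPR cost")
    return actions


print(vprs_for_tile(64, 128))   # 8
print(vpr_actions(peak_vprs=28))
```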

Micro-Op Analysis

When using micro-op mode, interpret the report as a fragment-level execution plan:

Timeline: the ordered micro-op schedule with start and end times
Residency and Occupancy: which VMEM slots, register groups, and units are active over time
Critical Path: the dependency chain that determines total makespan
Stall Breakdown: whether time is lost to WAIT_DATA, WAIT_UNIT, WAIT_VMEM, or WAIT_REG
Optimization Hints: which resource or dependency bottleneck should be attacked first
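
To decide which bottleneck to attack first, it helps to total the stall time per reason. The sketch below assumes stalls are available as (reason, duration in ns) pairs. That input shape is hypothetical and not the report's documented format.

```python
from collections import defaultdict

STALL_REASONS = ("WAIT_DATA", "WAIT_UNIT", "WAIT_VMEM", "WAIT_REG")


def stall_breakdown(stalls):
    """Sum stall durations per reason and rank them, largest first."""
    totals = defaultdict(float)
    for reason, duration_ns in stalls:
        if reason not in STALL_REASONS:
            raise ValueError(f"unknown stall reason: {reason}")
        totals[reason] += duration_ns
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)


stalls = [
    ("WAIT_DATA", 120.0),
    ("WAIT_VMEM", 40.0),
    ("WAIT_DATA", 60.0),
    ("WAIT_REG", 5.0),
]
ranked = stall_breakdown(stalls)
print(ranked[0])   # ('WAIT_DATA', 180.0)
```

Here WAIT_DATA dominates, which points at DMA latency or prefetch depth rather than register pressure.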

Use this mode when the user asks for finer-grained dataflow, explicit dependency reasoning, or a proof-like explanation of why one schedule is faster than another.

Required Output Sections

When you answer with the micro-op model, use these sections in order:

Fragment Inventory
Micro-Op Expansion
Residency Timeline
Dependency Graph
Critical Path
VPR Register Map — shows per-VPR allocation timeline across tiles
Optimality Argument Under VMEM Constraint

Do not collapse this into a generic summary. The point is to make the dataflow explicit enough that the user can see which fragments, units, and constraints control performance.

Output Language

When writing analysis conclusions, use Chinese for all narrative text:

Section headers, bottleneck diagnoses, optimization recommendations, and summary conclusions: write in Chinese
Technical terms (HBM, VMEM, MXU, VPU, DMA, FLOPS, roofline) keep English spelling
Numeric data, formulas, units (ns, us, ms, GB/s, TFLOPS), and code blocks remain unchanged

Pipeline Diagram

When using micro-op analysis, ALWAYS include the Mermaid pipeline diagrams by adding --mermaid to the CLI command:

python scripts/cli.py --steps steps.json --analysis-level micro --mermaid

Include the generated Mermaid blocks in your output. The --mermaid flag produces two complementary diagrams:

Resource Occupancy Gantt

Shows VMEM slot and REG group occupancy over time. Each row is a storage resource, not an execution unit.

section VMEM Slots: One bar per VMEM slot occupancy interval, labeled slot_name [op_label]
section REG Groups: One bar per REG group occupancy interval, labeled reg_name [op_label]
Stall bars: Red crit bars between intervals on the same resource, labeled with wait reason (WAIT_DATA, WAIT_UNIT, WAIT_VMEM, WAIT_REG)
Capacity comments: Peak VMEM slots and REG groups with percentage of hardware limit

Shows first 3 tiles by default. Use --max-tiles N to adjust.

Register Data Flow Flowchart

One flowchart per tile showing data movement through the memory hierarchy:

Nodes: Data fragments at each memory level — HBM: tensor[shape] dtype, VMEM slot: tensor[shape], REG group: tensor[shape]
Solid edges (-->): Data transfers labeled with op_kind latency
Dashed edges (-. .-->): Stall/wait relationships labeled with reason

Use the flowchart to trace how data flows from HBM through VMEM into registers, through compute, and back.

Render Mermaid to Image

After generating the Mermaid pipeline diagram, render it to a PNG and open it for visual inspection:

Extract the Mermaid source (content between ```mermaid and ``` fences) and save to a .mmd file
Render with mmdc and open:

# Save Mermaid source (without fences) to file
cat > pipeline.mmd << 'EOF'
<paste mermaid content here>
EOF

# Render to PNG and open (`open` is macOS; on Linux use xdg-open)
npx -y -p @mermaid-js/mermaid-cli mmdc -i pipeline.mmd -o pipeline.png && open pipeline.png
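
When the report contains several diagrams, the extraction step can be scripted instead of pasted by hand. This is a minimal sketch, assuming the report text is already in a string. The `pipeline_<i>.mmd` file names are illustrative, and the fence marker is built programmatically only so the example nests cleanly inside a markdown document.

```python
import re

FENCE = "`" * 3   # three backticks
MERMAID_BLOCK = re.compile(FENCE + r"mermaid[ \t]*\n(.*?)\n" + FENCE, re.DOTALL)


def extract_mermaid(report: str) -> list:
    """Return the body of every mermaid fence in the report, without fences."""
    return [body.strip() for body in MERMAID_BLOCK.findall(report)]


# Tiny stand-in for CLI output containing two diagrams.
report = "\n".join([
    "Timeline",
    FENCE + "mermaid", "gantt", "    title Pipeline (ns)", FENCE,
    "Flow",
    FENCE + "mermaid", "flowchart LR", "    A --> B", FENCE,
])

blocks = extract_mermaid(report)
for i, block in enumerate(blocks):
    # One .mmd file per diagram, ready for mmdc.
    with open(f"pipeline_{i}.mmd", "w") as f:
        f.write(block + "\n")

print(len(blocks))   # 2
```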

The timeline x-axis displays values in nanoseconds (axisFormat %Q renders the raw numeric timestamps). The title suffix (ns) confirms the unit.

Gap Analysis (Optional)

When comparing against eval_result.json from pallas-evolve profiling:

| Gap | Diagnosis |
|-----|-----------|
| HBM bytes: measured > theoretical | Missing fusion or redundant loads |
| MXU util: measured < theoretical | Tile too small, alignment issues |
| Vector spills > 0 | Register pressure: reduce fusion or tile size |
| Total time: measured ≫ theoretical | Significant optimization headroom |
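
The same table can be applied programmatically once both sets of numbers are loaded. The sketch below is illustrative only. The dictionary keys (`hbm_bytes`, `mxu_util`, `vector_spills`, `time_ns`) are hypothetical and do not reflect the actual schema of eval_result.json, and the factor of 2 used for "much greater than" is an arbitrary assumption.

```python
def diagnose_gaps(theoretical: dict, measured: dict,
                  slowdown_threshold: float = 2.0) -> list:
    """Map measured-vs-theoretical gaps to the diagnoses in the table."""
    findings = []
    if measured["hbm_bytes"] > theoretical["hbm_bytes"]:
        findings.append("missing fusion or redundant loads")
    if measured["mxu_util"] < theoretical["mxu_util"]:
        findings.append("tile too small or alignment issues")
    if measured.get("vector_spills", 0) > 0:
        findings.append("register pressure: reduce fusion or tile size")
    if measured["time_ns"] > slowdown_threshold * theoretical["time_ns"]:
        findings.append("significant optimization headroom")
    return findings


# Made-up numbers, purely to exercise the function.
theoretical = {"hbm_bytes": 4_194_304, "mxu_util": 0.90, "time_ns": 4_000}
measured = {"hbm_bytes": 9_000_000, "mxu_util": 0.55,
            "vector_spills": 0, "time_ns": 11_000}

for finding in diagnose_gaps(theoretical, measured):
    print(finding)
```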

Example: Flash Attention

Formula: Y = softmax(QK^T / sqrt(d)) @ V, shapes Q/K/V = [4096, 128]

See scripts/examples/flash_attention.json for the decomposed steps.
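
As a rough sanity check before running the tool, the FLOPs and best-case arithmetic intensity for these shapes can be worked out by hand. The four-step decomposition below (QK^T, scale, softmax, PV) is an assumption and may not match the steps in the example file. The HBM traffic assumes full fusion, so only Q, K, V, and Y touch HBM.

```python
S, D, BYTES = 4096, 128, 2   # sequence length, head dim, bytes per bf16 value

flops = {
    "qk_matmul": 2 * S * S * D,   # [S, D] x [D, S]
    "scale": S * S,               # elementwise divide by sqrt(d)
    "softmax": 5 * S * S,         # per the FLOPs reference table
    "pv_matmul": 2 * S * D * S,   # [S, S] x [S, D]
}
total_flops = sum(flops.values())

# Fully fused: read Q, K, V once and write Y once; scores never reach HBM.
hbm_bytes = 4 * S * D * BYTES
intensity = total_flops / hbm_bytes

RIDGE = 2307e12 / 3690e9   # about 625 FLOPs/byte

print(total_flops)           # 8690597888
print(intensity)             # 2072.0
print(intensity > RIDGE)     # True
```

Under these assumptions the fused kernel lands well above the ridge point. The comparison is approximate, because the ridge point is derived from the MXU peak while the softmax and scale FLOPs run on the VPU.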
