plugins/tpu-perf-model/skills/tpu-pipeline-scheduler/SKILL.md
Use when analyzing register-level pipeline scheduling for TPU v7x kernels. Trigger when the user asks about instruction-level pipeline analysis, VPR register pressure, data hazard detection (RAW/WAR/WAW), or optimal instruction ordering for TPU pipelines.
Analyze register-level pipeline scheduling for TPU v7x kernels. Given an explicit sequence of instructions with VPR assignments, this skill detects data hazards, schedules across hardware units, analyzes VPR pressure, and suggests optimal ordering.
The input is a JSON file describing a sequence of hardware instructions with explicit VPR assignments:
```json
{
  "name": "kernel_tile_name",
  "hw": "v7x",
  "ops": [
    {
      "op_id": "unique_name",
      "op_kind": "DMA_LOAD | DMA_STORE | MXU | VPU | VMEM_TO_REG | REG_TO_VMEM",
      "input_vprs": [0, 1, 2, 3],
      "output_vprs": [4, 5, 6, 7],
      "weight_vprs": [0, 1],
      "data_vprs": [2, 3],
      "input_vmem": ["slot_name"],
      "output_vmem": ["slot_name"],
      "latency_ns": 500,
      "unit": "DMA | MXU | VPU",
      "label": "Human-readable description",
      "pseudocode": "S = Q @ K.T"
    }
  ]
}
```
| Field | Description |
|-------|-------------|
| op_id | Unique instruction identifier |
| op_kind | Instruction type (DMA_LOAD, DMA_STORE, MXU, VPU, VMEM_TO_REG, REG_TO_VMEM) |
| input_vprs | VPR numbers read (0-31). For MXU ops, auto-computed from weight_vprs + data_vprs if omitted |
| output_vprs | VPR numbers written (0-31) |
| weight_vprs | (MXU only) VPRs loaded during MXU weight phase |
| data_vprs | (MXU only) VPRs used during MXU data/compute phase |
| input_vmem | VMEM slot names read |
| output_vmem | VMEM slot names written |
| latency_ns | Instruction latency in nanoseconds |
| unit | Execution unit (DMA, MXU, VPU) |
| label | Optional human-readable description |
| pseudocode | Optional short pseudocode (shown in animation panel) |
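As a minimal sketch of the format above, the following Python builds a two-op pipeline and sanity-checks it against the field table. The `validate_op` helper, the op names, and all VPR and latency values are hypothetical illustrations, not part of the skill; the MXU op omits `input_vprs` because the table says it is derived from `weight_vprs` + `data_vprs`.

```python
import json

VALID_KINDS = {"DMA_LOAD", "DMA_STORE", "MXU", "VPU", "VMEM_TO_REG", "REG_TO_VMEM"}
VALID_UNITS = {"DMA", "MXU", "VPU"}
VPR_KEYS = ("input_vprs", "output_vprs", "weight_vprs", "data_vprs")


def validate_op(op):
    """Hypothetical helper: return a list of problems for one op (empty means OK)."""
    problems = []
    op_id = op.get("op_id", "<missing op_id>")
    if op.get("op_kind") not in VALID_KINDS:
        problems.append(f"{op_id}: unknown op_kind {op.get('op_kind')!r}")
    if op.get("unit") not in VALID_UNITS:
        problems.append(f"{op_id}: unknown unit {op.get('unit')!r}")
    for key in VPR_KEYS:
        for vpr in op.get(key, []):
            if not 0 <= vpr <= 31:  # the field table gives the VPR range as 0-31
                problems.append(f"{op_id}: {key} contains out-of-range VPR {vpr}")
    return problems


# Illustrative values only; real latencies come from your own kernel analysis.
pipeline = {
    "name": "example_tile",
    "hw": "v7x",
    "ops": [
        {
            "op_id": "matmul_qk",
            "op_kind": "MXU",
            "weight_vprs": [0, 1],
            "data_vprs": [2, 3],
            "output_vprs": [4, 5],
            "latency_ns": 500,
            "unit": "MXU",
            "pseudocode": "S = Q @ K.T",
        },
        {
            "op_id": "scale_s",
            "op_kind": "VPU",
            "input_vprs": [4, 5],
            "output_vprs": [6, 7],
            "latency_ns": 200,
            "unit": "VPU",
        },
    ],
}

problems = [p for op in pipeline["ops"] for p in validate_op(op)]
print(problems or "pipeline looks well-formed")
print(json.dumps(pipeline, indent=2))
```

Saving the printed JSON to a file gives something that can be passed to `--pipeline`.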
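Each MXU op is split into two phases on separate sub-units: a weight phase, which loads the VPRs listed in weight_vprs, and a data/compute phase, which uses the VPRs listed in data_vprs.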
This enables pipelining: the next MXU op's weight phase can overlap with the current op's data phase, as long as the weight VPRs are ready. Specify weight_vprs and data_vprs separately to enable this optimization.
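To see why the overlap matters, here is a rough timing sketch. It is not the skill's scheduler: it assumes each MXU op comes with separate weight-phase and data-phase durations (the input format only carries a single `latency_ns`, so this split is hypothetical), that the next weight phase may begin once the current op's data phase has started, and that weight VPRs are always ready.

```python
def mxu_makespan(ops, overlap):
    """Return the finish time (ns) of the last data phase.

    ops: list of (weight_ns, data_ns) tuples in program order (hypothetical split).
    overlap: if True, the next weight phase may run during the current data phase.
    """
    prev_weight_end = 0
    prev_data_start = 0
    prev_data_end = 0
    for weight_ns, data_ns in ops:
        if overlap:
            # Assumed rule: weight sub-unit is reusable once the previous
            # op's data phase has started consuming its weights.
            w_start = max(prev_weight_end, prev_data_start)
        else:
            # Fully serial: wait for the previous op to finish completely.
            w_start = prev_data_end
        w_end = w_start + weight_ns
        # The data phase needs its own weights loaded and the data sub-unit free.
        d_start = max(w_end, prev_data_end)
        d_end = d_start + data_ns
        prev_weight_end, prev_data_start, prev_data_end = w_end, d_start, d_end
    return prev_data_end


ops = [(100, 400), (100, 400), (100, 400)]  # illustrative durations only
serial_ns = mxu_makespan(ops, overlap=False)
pipelined_ns = mxu_makespan(ops, overlap=True)
print(serial_ns, pipelined_ns)  # 1500 1300
```

Under these assumptions every op after the first hides its weight phase behind the previous data phase, which is where the 200 ns saving comes from.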
```bash
# All analyses (text)
python scripts/pipeline_ir_cli.py --pipeline kernel.json --format text --show all

# Dependency graph only (JSON)
python scripts/pipeline_ir_cli.py --pipeline kernel.json --format json --show deps

# Gantt + Mermaid diagrams
python scripts/pipeline_ir_cli.py --pipeline kernel.json --format text --show deps,gantt --mermaid

# VPR pressure only
python scripts/pipeline_ir_cli.py --pipeline kernel.json --format text --show vpr

# Reorder suggestion
python scripts/pipeline_ir_cli.py --pipeline kernel.json --format text --show suggest

# VPR timeline plot (PNG)
python scripts/pipeline_ir_cli.py --pipeline kernel.json --plot

# Custom output path
python scripts/pipeline_ir_cli.py --pipeline kernel.json --plot --plot-output my_chart.png

# Interactive HTML animation
python scripts/pipeline_ir_cli.py --pipeline kernel.json --animate

# Animation with custom output path
python scripts/pipeline_ir_cli.py --pipeline kernel.json --animate --animate-output my_animation.html
```
| Flag | Values | Description |
|------|--------|-------------|
| --pipeline | path | Pipeline IR JSON file (required) |
| --format | text, json | Output format (default: text) |
| --show | deps, gantt, vpr, suggest, all | Sections to show (comma-separated, default: all) |
| --mermaid | flag | Include Mermaid diagrams (text format only) |
| --plot | flag | Generate VPR timeline heatmap as PNG image |
| --plot-output | path | Output path for plot (default: <name>_vpr_timeline.png) |
| --animate | flag | Generate interactive HTML animation |
| --animate-output | path | Output path for animation (default: <name>_pipeline.html) |
Detects three types of data hazards:
| Hazard | Condition | Impact |
|--------|-----------|--------|
| RAW (Read-After-Write) | Op B reads VPR[n] that Op A writes | True dependency: B must wait for A |
| WAR (Write-After-Read) | Op B writes VPR[n] that Op A reads | Anti-dependency: B can't overwrite before A reads |
| WAW (Write-After-Write) | Op B writes VPR[n] that Op A writes | Output dependency: ordering must be preserved |
Same analysis applies to VMEM slots. Transitive reduction is applied to keep the DAG minimal.
Mermaid output uses: solid arrows for RAW, dashed for WAR, dotted for WAW.
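The three conditions in the hazard table can be checked with a simple pairwise scan over ops in program order. The sketch below is an illustration under assumptions, not the skill's implementation: `find_hazards` is a hypothetical name, it only looks at VPRs (not VMEM slots), it reports every matching pair conservatively, and it does not perform the transitive reduction mentioned above.

```python
def reads_of(op):
    """VPRs read by an op; MXU ops may give weight_vprs + data_vprs instead."""
    explicit = op.get("input_vprs")
    if explicit:
        return set(explicit)
    return set(op.get("weight_vprs", [])) | set(op.get("data_vprs", []))


def find_hazards(ops):
    """Return (earlier_id, later_id, kind, vpr) tuples for ops in program order."""
    hazards = []
    for i, a in enumerate(ops):
        a_reads, a_writes = reads_of(a), set(a.get("output_vprs", []))
        for b in ops[i + 1:]:
            b_reads, b_writes = reads_of(b), set(b.get("output_vprs", []))
            for vpr in sorted(a_writes & b_reads):
                hazards.append((a["op_id"], b["op_id"], "RAW", vpr))
            for vpr in sorted(a_reads & b_writes):
                hazards.append((a["op_id"], b["op_id"], "WAR", vpr))
            for vpr in sorted(a_writes & b_writes):
                hazards.append((a["op_id"], b["op_id"], "WAW", vpr))
    return hazards


# Illustrative ops: B consumes A's result, then C reuses VPR 4 as its output.
ops = [
    {"op_id": "A", "input_vprs": [0], "output_vprs": [4]},
    {"op_id": "B", "input_vprs": [4], "output_vprs": [5]},
    {"op_id": "C", "input_vprs": [0], "output_vprs": [4]},
]
hazards = find_hazards(ops)
for earlier, later, kind, vpr in hazards:
    print(f"{earlier} -> {later}: {kind} on VPR[{vpr}]")
```

In this example the A -> C WAW edge is implied by A -> B (RAW) followed by B -> C (WAR), which is the kind of redundant edge a transitive reduction would drop.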
Shows each hardware unit's timeline with instruction placement and stall markers. Each instruction reports:
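As a rough illustration of how such a per-unit timeline could be derived, the sketch below places ops in program order: an op starts once its unit is free and all of its dependencies have finished, and the gap between those two times is counted as a stall. The `schedule` function, the `deps` mapping, and the reported columns are assumptions for illustration; the skill's actual scheduling policy and report fields may differ.

```python
def schedule(ops, deps):
    """In-order placement per unit.

    ops: list of dicts with op_id, unit, latency_ns (program order).
    deps: dict mapping op_id -> list of op_ids it must wait for (hypothetical input).
    Returns rows of (op_id, unit, start_ns, end_ns, stall_ns).
    """
    unit_free = {}  # unit name -> time the unit becomes idle
    finish = {}     # op_id -> finish time
    rows = []
    for op in ops:
        ready = max((finish[d] for d in deps.get(op["op_id"], [])), default=0)
        free = unit_free.get(op["unit"], 0)
        start = max(ready, free)
        end = start + op["latency_ns"]
        finish[op["op_id"]] = end
        unit_free[op["unit"]] = end
        # Stall: time the unit sat idle waiting for this op's inputs.
        rows.append((op["op_id"], op["unit"], start, end, start - free))
    return rows


# Illustrative ops and latencies only.
ops = [
    {"op_id": "load_q", "unit": "DMA", "latency_ns": 300},
    {"op_id": "matmul", "unit": "MXU", "latency_ns": 500},
    {"op_id": "scale", "unit": "VPU", "latency_ns": 200},
    {"op_id": "load_k2", "unit": "DMA", "latency_ns": 300},
]
deps = {"matmul": ["load_q"], "scale": ["matmul"]}
rows = schedule(ops, deps)
for op_id, unit, start, end, stall in rows:
    print(f"{unit:<4} {op_id:<8} start={start:>5} end={end:>5} stall={stall}")
```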
ASCII grid showing which VPRs are live at each time step. Reports:
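A coarse version of this liveness view can be sketched as follows. It is an approximation under stated assumptions: time steps are instruction indices rather than nanoseconds, and a VPR is treated as live from the first op that touches it to the last op that touches it, which over-estimates pressure when a register is redefined. The function names are hypothetical.

```python
def live_ranges(ops):
    """Map each VPR to (first_step, last_step) over which it is touched."""
    ranges = {}
    for step, op in enumerate(ops):
        for vpr in op.get("input_vprs", []) + op.get("output_vprs", []):
            first, _ = ranges.get(vpr, (step, step))
            ranges[vpr] = (first, step)
    return ranges


def pressure(ranges, n_steps):
    """Number of live VPRs at each step."""
    return [sum(lo <= t <= hi for lo, hi in ranges.values()) for t in range(n_steps)]


def ascii_grid(ranges, n_steps):
    """One row per VPR: '#' where live, '.' where not."""
    lines = []
    for vpr in sorted(ranges):
        lo, hi = ranges[vpr]
        cells = "".join("#" if lo <= t <= hi else "." for t in range(n_steps))
        lines.append(f"VPR{vpr:>2} {cells}")
    return "\n".join(lines)


# Illustrative three-op chain.
ops = [
    {"op_id": "produce", "output_vprs": [0, 1]},
    {"op_id": "combine", "input_vprs": [0, 1], "output_vprs": [2]},
    {"op_id": "finish", "input_vprs": [2], "output_vprs": [3]},
]
ranges = live_ranges(ops)
per_step = pressure(ranges, len(ops))
print(ascii_grid(ranges, len(ops)))
print("pressure per step:", per_step, "peak:", max(per_step))
```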
Compares original instruction ordering against analysis:
Matplotlib-rendered 2D heatmap with:
Requires matplotlib (pip install matplotlib).
Self-contained HTML file with animated pipeline playback:
The animation's narrative text is in Chinese; technical terms (VPR, RAW, WAR, WAW, DMA, MXU, VPU, VMEM, HBM) stay in English.
See scripts/examples/flash_attention_tile.json for a complete Flash Attention tile decomposition with 11 instructions across DMA/MXU/VPU units using VPR[0:23].