primatrix/tpu-pipeline-scheduler

Name: tpu-pipeline-scheduler
Author: primatrix

plugins/tpu-perf-model/skills/tpu-pipeline-scheduler/SKILL.md

npx skillsauth add primatrix/skills tpu-pipeline-scheduler

TPU Pipeline Scheduler

Analyze register-level pipeline scheduling for TPU v7x kernels. Given an explicit sequence of instructions with VPR assignments, this skill detects data hazards, schedules across hardware units, analyzes VPR pressure, and suggests optimal ordering.

When to Use

- Designing optimal instruction interleaving for a Pallas kernel tile
- Analyzing VPR register pressure to determine if a tiling strategy is feasible
- Identifying data dependency bottlenecks (RAW/WAR/WAW hazards)
- Comparing alternative instruction orderings for pipeline efficiency

Input Format: Pipeline IR

The input is a JSON file describing a sequence of hardware instructions with explicit VPR assignments:

{
  "name": "kernel_tile_name",
  "hw": "v7x",
  "ops": [
    {
      "op_id": "unique_name",
      "op_kind": "DMA_LOAD | DMA_STORE | MXU | VPU | VMEM_TO_REG | REG_TO_VMEM",
      "input_vprs": [0, 1, 2, 3],
      "output_vprs": [4, 5, 6, 7],
      "weight_vprs": [0, 1],
      "data_vprs": [2, 3],
      "input_vmem": ["slot_name"],
      "output_vmem": ["slot_name"],
      "latency_ns": 500,
      "unit": "DMA | MXU | VPU",
      "label": "Human-readable description",
      "pseudocode": "S = Q @ K.T"
    }
  ]
}

Fields

| Field | Description |
|-------|-------------|
| op_id | Unique instruction identifier |
| op_kind | Instruction type (DMA_LOAD, DMA_STORE, MXU, VPU, VMEM_TO_REG, REG_TO_VMEM) |
| input_vprs | VPR numbers read (0-31). For MXU ops, auto-computed from weight_vprs + data_vprs if omitted |
| output_vprs | VPR numbers written (0-31) |
| weight_vprs | (MXU only) VPRs loaded during MXU weight phase |
| data_vprs | (MXU only) VPRs used during MXU data/compute phase |
| input_vmem | VMEM slot names read |
| output_vmem | VMEM slot names written |
| latency_ns | Instruction latency in nanoseconds |
| unit | Execution unit (DMA, MXU, VPU) |
| label | Optional human-readable description |
| pseudocode | Optional short pseudocode (shown in animation panel) |
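
As a rough illustration of the field rules above, the sketch below builds a toy two-op Pipeline IR in Python and runs a few sanity checks on it before it is handed to the CLI. The `check_pipeline` helper, the op names, and the latencies are hypothetical and not part of the skill; only the field names, the allowed `op_kind` and `unit` values, and the 0-31 VPR range come from the table.

```python
import json

VALID_KINDS = {"DMA_LOAD", "DMA_STORE", "MXU", "VPU", "VMEM_TO_REG", "REG_TO_VMEM"}
VALID_UNITS = {"DMA", "MXU", "VPU"}
VPR_KEYS = ("input_vprs", "output_vprs", "weight_vprs", "data_vprs")


def check_pipeline(ir):
    """Return a list of problems found in a Pipeline IR dict (empty if none)."""
    problems = []
    seen_ids = set()
    for op in ir.get("ops", []):
        op_id = op.get("op_id")
        if op_id in seen_ids:
            problems.append(f"duplicate op_id: {op_id}")
        seen_ids.add(op_id)
        if op.get("op_kind") not in VALID_KINDS:
            problems.append(f"{op_id}: unknown op_kind")
        if op.get("unit") not in VALID_UNITS:
            problems.append(f"{op_id}: unknown unit")
        for key in VPR_KEYS:
            for vpr in op.get(key, []):
                if not 0 <= vpr <= 31:
                    problems.append(f"{op_id}: {key} contains out-of-range VPR {vpr}")
    return problems


# Toy two-op tile; names and latencies are illustrative, not measured.
pipeline = {
    "name": "toy_tile",
    "hw": "v7x",
    "ops": [
        {
            "op_id": "qk_matmul",
            "op_kind": "MXU",
            "weight_vprs": [0, 1],   # input_vprs omitted: derived from weight + data
            "data_vprs": [2, 3],
            "output_vprs": [4, 5],
            "latency_ns": 500,
            "unit": "MXU",
            "pseudocode": "S = Q @ K.T",
        },
        {
            "op_id": "scale",
            "op_kind": "VPU",
            "input_vprs": [4, 5],
            "output_vprs": [6, 7],
            "latency_ns": 200,
            "unit": "VPU",
            "pseudocode": "S = S * scale",
        },
    ],
}

problems = check_pipeline(pipeline)
print(problems or json.dumps(pipeline, indent=2))
```

Writing the printed JSON to a file such as `kernel.json` gives an input in the shape the CLI expects; the real tool may enforce additional rules that this sketch does not model.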

TPU v7x Hardware Reference

- 32 VPRs (Vector Pipeline Registers), 4 KiB each
- 3 execution units (DMA, MXU, VPU), each running one instruction at a time
- Dual MXU at 2307 TFLOPS BF16
- 64 MiB VMEM, 192 GB HBM at 3690 GB/s

Dual MXU Pipeline Model

Each MXU op is split into two phases on separate sub-units:

- MXU_W (weight loading): loads weight VPRs into the systolic array. ~10% of total latency.
- MXU_D (data computation): streams data VPRs through the array and writes output. ~90% of total latency.

This enables pipelining: the next MXU op's weight phase can overlap with the current op's data phase, as long as the weight VPRs are ready. Specify weight_vprs and data_vprs separately to enable this optimization.
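
To make the overlap concrete, here is a small back-of-the-envelope sketch of the two-phase model for a chain of back-to-back MXU ops. It assumes the 10%/90% split applies exactly, that weight VPRs are always ready, and that a new weight load can begin only once the previous op's data phase has started; the last point is an assumption of this sketch, not something the skill documents. The function name `mxu_pipelined_latency` is hypothetical.

```python
def mxu_pipelined_latency(latencies_ns, weight_fraction=0.10):
    """Estimate end-to-end time for consecutive MXU ops with W/D overlap."""
    w_free = 0.0  # earliest time MXU_W can start the next weight load
    d_free = 0.0  # time at which MXU_D finishes its current data phase
    for total in latencies_ns:
        w_time = total * weight_fraction   # MXU_W phase (~10%)
        d_time = total - w_time            # MXU_D phase (~90%)
        w_end = w_free + w_time
        d_start = max(w_end, d_free)       # data phase needs weights and a free MXU_D
        d_free = d_start + d_time
        w_free = d_start                   # assumed: next load waits until weights are consumed
    return d_free


ops = [500, 500, 500]                      # three MXU ops, 500 ns each
serial = sum(ops)                          # no overlap: 1500 ns
pipelined = mxu_pipelined_latency(ops)     # each later op hides its 50 ns weight phase
print(serial, pipelined, serial - pipelined)
```

Under these assumptions every MXU op after the first hides its weight phase behind the previous data phase, so three 500 ns ops finish in 1400 ns instead of 1500 ns.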

CLI Usage

# All analyses (text)
python scripts/pipeline_ir_cli.py --pipeline kernel.json --format text --show all

# Dependency graph only (JSON)
python scripts/pipeline_ir_cli.py --pipeline kernel.json --format json --show deps

# Gantt + Mermaid diagrams
python scripts/pipeline_ir_cli.py --pipeline kernel.json --format text --show deps,gantt --mermaid

# VPR pressure only
python scripts/pipeline_ir_cli.py --pipeline kernel.json --format text --show vpr

# Reorder suggestion
python scripts/pipeline_ir_cli.py --pipeline kernel.json --format text --show suggest

# VPR timeline plot (PNG)
python scripts/pipeline_ir_cli.py --pipeline kernel.json --plot

# Custom output path
python scripts/pipeline_ir_cli.py --pipeline kernel.json --plot --plot-output my_chart.png

# Interactive HTML animation
python scripts/pipeline_ir_cli.py --pipeline kernel.json --animate

# Animation with custom output path
python scripts/pipeline_ir_cli.py --pipeline kernel.json --animate --animate-output my_animation.html

CLI Options

| Flag | Values | Description |
|------|--------|-------------|
| --pipeline | path | Pipeline IR JSON file (required) |
| --format | text, json | Output format (default: text) |
| --show | deps, gantt, vpr, suggest, all | Sections to show (comma-separated, default: all) |
| --mermaid | flag | Include Mermaid diagrams (text format only) |
| --plot | flag | Generate VPR timeline heatmap as PNG image |
| --plot-output | path | Output path for plot (default: <name>_vpr_timeline.png) |
| --animate | flag | Generate interactive HTML animation |
| --animate-output | path | Output path for animation (default: <name>_pipeline.html) |
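
When the CLI is driven from another script, it can help to assemble the argument list programmatically. The helper below is a hypothetical sketch that uses only the flags listed in the table and rejects `--mermaid` with JSON output, since Mermaid diagrams are text-only. It assumes the flags can be combined freely, which the skill does not state explicitly.

```python
def build_cli_args(pipeline_path, fmt="text", show=("all",), mermaid=False,
                   plot_output=None):
    """Build an argv list for scripts/pipeline_ir_cli.py (hypothetical helper)."""
    if fmt not in ("text", "json"):
        raise ValueError("--format must be 'text' or 'json'")
    if mermaid and fmt != "text":
        raise ValueError("--mermaid is only available with --format text")
    args = [
        "python", "scripts/pipeline_ir_cli.py",
        "--pipeline", str(pipeline_path),
        "--format", fmt,
        "--show", ",".join(show),      # sections are comma-separated
    ]
    if mermaid:
        args.append("--mermaid")
    if plot_output is not None:
        args += ["--plot", "--plot-output", str(plot_output)]
    return args


print(build_cli_args("kernel.json", show=("deps", "gantt"), mermaid=True))
```

The returned list can be passed to `subprocess.run(args, check=True)` from the directory that contains `scripts/`.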

Output Sections

1. Data Dependency Graph

Detects three types of data hazards:

| Hazard | Condition | Impact |
|--------|-----------|--------|
| RAW (Read-After-Write) | Op B reads VPR[n] that Op A writes | True dependency: B must wait for A |
| WAR (Write-After-Read) | Op B writes VPR[n] that Op A reads | Anti-dependency: B can't overwrite before A reads |
| WAW (Write-After-Write) | Op B writes VPR[n] that Op A writes | Output dependency: ordering must be preserved |

The same analysis applies to VMEM slots. Transitive reduction is applied to keep the DAG minimal.

Mermaid output uses: solid arrows for RAW, dashed for WAR, dotted for WAW.
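
The three hazard conditions reduce to set intersections over the VPRs each op reads and writes. The sketch below is a deliberately naive all-pairs version: it ignores VMEM slots, does not apply transitive reduction, and does not account for an intervening write that would supersede an earlier one. The function name `find_hazards` is hypothetical and does not reflect how the skill's scripts are organized.

```python
def find_hazards(ops):
    """List (kind, earlier_op, later_op, vpr) for every pair in program order."""
    hazards = []
    for i, a in enumerate(ops):
        a_reads = set(a.get("input_vprs", []))
        a_writes = set(a.get("output_vprs", []))
        for b in ops[i + 1:]:
            b_reads = set(b.get("input_vprs", []))
            b_writes = set(b.get("output_vprs", []))
            checks = (
                ("RAW", a_writes & b_reads),   # B reads what A writes
                ("WAR", a_reads & b_writes),   # B overwrites what A reads
                ("WAW", a_writes & b_writes),  # B overwrites what A writes
            )
            for kind, vprs in checks:
                for vpr in sorted(vprs):
                    hazards.append((kind, a["op_id"], b["op_id"], vpr))
    return hazards


ops = [
    {"op_id": "a", "output_vprs": [0, 1]},
    {"op_id": "b", "input_vprs": [0, 1], "output_vprs": [2]},
    {"op_id": "c", "input_vprs": [2], "output_vprs": [0]},
]
for hazard in find_hazards(ops):
    print(hazard)
```

In this toy sequence, `c` reusing VPR 0 produces both a WAW edge from `a` and a WAR edge from `b`, which is the kind of register reuse that constrains reordering.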

2. Pipeline Gantt

Shows each hardware unit's timeline with instruction placement and stall markers. Each instruction reports:

- Start/end time in ns
- Wait reason: NONE, WAIT_DATA (blocked on dependency), WAIT_UNIT (unit busy)
- Stall duration
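
One plausible way to derive these per-instruction fields is an in-order scheduler in which each op starts at the later of "its unit is free" and "its dependencies have finished". The sketch below follows that reading; the `schedule` function, the explicit `deps` mapping, and the exact definition of stall duration are assumptions for illustration and may differ from what the skill's scripts compute. It expects ops to be listed so that every dependency appears before its consumer.

```python
def schedule(ops, deps):
    """Place ops in program order; deps maps op_id -> list of predecessor op_ids."""
    unit_free = {}   # unit name -> time the unit becomes idle
    end_time = {}    # op_id -> finish time
    rows = []
    for op in ops:
        op_id, unit = op["op_id"], op["unit"]
        data_ready = max((end_time[p] for p in deps.get(op_id, [])), default=0)
        unit_ready = unit_free.get(unit, 0)
        start = max(data_ready, unit_ready)
        if data_ready > unit_ready:
            reason, stall = "WAIT_DATA", data_ready - unit_ready  # unit idles waiting for data
        elif unit_ready > data_ready:
            reason, stall = "WAIT_UNIT", unit_ready - data_ready  # data waits for a busy unit
        else:
            reason, stall = "NONE", 0
        end = start + op["latency_ns"]
        unit_free[unit] = end
        end_time[op_id] = end
        rows.append({"op_id": op_id, "start": start, "end": end,
                     "wait": reason, "stall_ns": stall})
    return rows


ops = [
    {"op_id": "load_q", "unit": "DMA", "latency_ns": 100},
    {"op_id": "matmul", "unit": "MXU", "latency_ns": 500},
    {"op_id": "load_v", "unit": "DMA", "latency_ns": 100},
    {"op_id": "softmax", "unit": "VPU", "latency_ns": 200},
]
deps = {"matmul": ["load_q"], "softmax": ["matmul"]}
for row in schedule(ops, deps):
    print(row)
```

Here `matmul` and `softmax` are reported as WAIT_DATA because their units sit idle until the producer finishes, while `load_v` is WAIT_UNIT because the DMA unit is still busy with `load_q`.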

3. VPR Occupancy Heatmap

ASCII grid showing which VPRs are live at each time step. Reports:

- Peak concurrent VPR count and when it occurs
- Utilization ratio (average live VPRs / 32)
- Pressure warnings when >75% VPRs are simultaneously live
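
The three reported numbers can be reproduced from a list of live intervals with a simple event sweep. The sketch below assumes each interval is a half-open `(vpr, start_ns, end_ns)` tuple and that a given VPR never has two overlapping intervals; how the skill actually derives liveness from the schedule is not specified here, so the `vpr_pressure` function and its input shape are hypothetical.

```python
def vpr_pressure(live_intervals, total_ns, num_vprs=32, warn_ratio=0.75):
    """Compute peak live VPRs, when the peak occurs, utilization, and a warning flag."""
    events = []
    for _vpr, start, end in live_intervals:
        events.append((start, 1))    # VPR becomes live
        events.append((end, -1))     # VPR is released
    events.sort()                    # at equal times, releases (-1) sort before starts (+1)
    live = peak = 0
    peak_time = 0
    for time_ns, delta in events:
        live += delta
        if live > peak:
            peak, peak_time = live, time_ns
    live_ns = sum(end - start for _vpr, start, end in live_intervals)
    return {
        "peak": peak,
        "peak_time_ns": peak_time,
        "utilization": live_ns / (total_ns * num_vprs),  # average live VPRs / 32
        "warning": peak > warn_ratio * num_vprs,         # more than 75% live at once
    }


intervals = [(0, 0, 600), (1, 0, 600), (2, 100, 800)]
print(vpr_pressure(intervals, total_ns=800))
```

With 32 VPRs the warning threshold works out to more than 24 simultaneously live registers.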

4. Reorder Suggestion

Compares the original instruction ordering against the analysis results and reports:

- Critical path identification and latency
- Parallelism efficiency (critical path / total latency)
- Stall breakdown
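
The critical path is the longest latency-weighted chain through the dependency DAG, and it can be found with one pass over the ops in topological order. The sketch below is a minimal version under two assumptions: ops are listed with every dependency before its consumer, and "total latency" in the efficiency ratio means the scheduled end-to-end time. The `critical_path` function and the example numbers are hypothetical.

```python
def critical_path(ops, deps):
    """Return (path, length_ns) for the longest dependency chain."""
    longest = {}  # op_id -> length of the longest chain ending at this op
    prev = {}     # op_id -> predecessor on that chain (None at the head)
    for op in ops:
        op_id = op["op_id"]
        best = max(deps.get(op_id, []), key=lambda p: longest[p], default=None)
        longest[op_id] = (longest[best] if best is not None else 0) + op["latency_ns"]
        prev[op_id] = best
    tail = max(longest, key=longest.get)
    length = longest[tail]
    path = []
    while tail is not None:          # walk back from the tail to the head
        path.append(tail)
        tail = prev[tail]
    return path[::-1], length


ops = [
    {"op_id": "load_q", "unit": "DMA", "latency_ns": 100},
    {"op_id": "load_k", "unit": "DMA", "latency_ns": 100},
    {"op_id": "matmul", "unit": "MXU", "latency_ns": 500},
]
deps = {"matmul": ["load_q", "load_k"]}
path, length = critical_path(ops, deps)
total_latency_ns = 700               # both loads share one DMA unit: 100 + 100 + 500
print(path, length, length / total_latency_ns)
```

In this example the critical path is 600 ns but the schedule takes 700 ns because the two loads serialize on the DMA unit, giving an efficiency of about 0.86.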

5. VPR Timeline Plot (PNG)

Matplotlib-rendered 2D heatmap with:

- X-axis: Time (ns), continuous scale
- Y-axis: VPR registers, one row per used VPR
- Cell color: 3-state × 3-unit color matrix
  - Write (deep): op is actively writing this VPR
  - Read (mid): op is actively reading this VPR
  - Live (light): VPR holds data but no op is accessing it
  - Colors: DMA=blue, MXU=red, VPU=green
- Top band: Gantt strips showing DMA/MXU/VPU unit utilization
- Dependency arrows: Arc arrows between VPR rows (RAW=solid, WAR=dashed, WAW=dotted)
- Title bar: Kernel name, total latency, peak VPR count, stall time

Requires matplotlib (pip install matplotlib).

6. HTML Interactive Animation

Self-contained HTML file with animated pipeline playback:

- Layout: Gantt chart (top), VPR heatmap (center), pseudocode panel (right side), playback controls (bottom)
- Playback controls: Play/pause button, time scrubber, speed selector (0.5x-4x)
- Color scheme: DMA=blue, MXU=red, VPU=green; intensity varies by access state (write/read/live)
- Fusion effects: Fused op groups flash with a glow effect during playback
- Pseudocode panel: Highlights the active op's pseudocode line in real time

Workflow

1. Decompose your kernel tile into Pipeline IR instructions
2. Assign VPRs explicitly, or use VPR auto-allocation (set VPR numbers to placeholder values and let the allocator assign optimal registers)
3. Run analysis to identify hazards, stalls, and pressure points
4. Iterate on VPR assignments and instruction ordering
5. Validate that peak VPR pressure stays within hardware limits (32 VPRs)

Output Language

Narrative text in Chinese, technical terms (VPR, RAW, WAR, WAW, DMA, MXU, VPU, VMEM, HBM) in English.

Example

See scripts/examples/flash_attention_tile.json for a complete Flash Attention tile decomposition with 11 instructions across DMA/MXU/VPU units using VPR[0:23].

Use when analyzing register-level pipeline scheduling for TPU v7x kernels. Trigger when the user asks about instruction-level pipeline analysis, VPR register pressure, data hazard detection (RAW/WAR/WAW), or optimal instruction ordering for TPU pipelines.

devops

Updated Apr 20, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add primatrix/skills tpu-pipeline-scheduler

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Purpose | Status | Percentage |
|---------|---------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 20, 2026, 5:18 AM (37.8s, 20 files scanned)

Install manually

# Clone the repo
git clone https://github.com/primatrix/skills.git

# Copy into Claude Code skills folder (global)
cp -r skills/plugins/tpu-perf-model/skills/tpu-pipeline-scheduler ~/.claude/skills/

Repository

primatrix/skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT