plugins/tpu-perf/skills/profile-anatomy/SKILL.md
Use when reading TPU pretraining profiles (xplane.pb, trace.json.gz) — describes the on-disk layout, the XSpace/XPlane/XLine/XEvent/XStat hierarchy, and provides reference scripts that future tpu-perf skills can read as schema documentation.
Response language requirement: when this skill is invoked, all user-facing answers must be written in Chinese.
Reference for what's inside a TPU pretraining profile directory and how to parse each artifact. This skill is schema documentation, not an analysis tool — it answers "what's in here and what does each field mean", not "is my training fast".
Future tpu-perf skills (MFU, comm overlap, HBM pressure, …) build on
the schema described here.
A typical capture (e.g. /Users/xl/tensorboard/tensorboard/plugins/profile/dp8_fsdp128/)
contains:
| File pattern | What it is | When to read it |
|---|---|---|
| *.xplane.pb | The authoritative protobuf trace. Contains all profiled hosts and devices, all events, all metadata, and the HloProto for every JIT-compiled module (see §2.1). | Whenever you need anything reliable. This is the source of truth. |
| *.trace.json.gz | Chrome-trace-format JSON gzipped. A flattened, browser-viewable export of the same data, capped at ~1M events. | Quick browser inspection, manual scripts that don't need every event. Do not use it for total-time accounting if the cap was hit. |
| *.hlo_proto.pb | Standalone xla.HloProto (defined in scripts/_proto/hlo.proto) — one file per JAX-compiled module, named <module>.hlo_proto.pb. Equivalent to the embedded copy in xplane (decodes to the same HloProto; only field ordering differs). May be absent in some captures. | Quick HLO inspection without parsing the (possibly multi-GB) xplane. |
| ALL_HOSTS.op_stats_v2.pb | Pre-aggregated op stats (defined in comm-analysis/scripts/_proto/op_stats.proto). | Quick pre-aggregated reads. |
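As a minimal sketch of how the table above maps to a directory listing, the snippet below globs a profile directory for each file pattern. The function name `find_artifacts` and the dictionary labels are hypothetical and are not part of the shipped scripts.

```python
from pathlib import Path

# File patterns taken from the table above; the short labels are illustrative only.
ARTIFACT_PATTERNS = {
    "xplane": "*.xplane.pb",
    "trace_json": "*.trace.json.gz",
    "hlo_proto": "*.hlo_proto.pb",
    "op_stats": "ALL_HOSTS.op_stats_v2.pb",
}


def find_artifacts(profile_dir):
    """Return {label: sorted list of matching paths} for one profile directory."""
    root = Path(profile_dir)
    return {
        label: sorted(root.glob(pattern))
        for label, pattern in ARTIFACT_PATTERNS.items()
    }
```

An empty list for `hlo_proto` is expected in some captures, since the standalone HLO files may be absent.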
Five-level proto tree, defined in scripts/_proto/xplane.proto. The field shapes, quoted:

- XSpace (top-level container)
  - repeated XPlane planes
  - repeated string errors
  - repeated string warnings
  - repeated string hostnames
- XPlane (one timeline source: a host, a device, or a metadata plane)
  - int64 id
  - string name (e.g. "/device:TPU:0", "/host:CPU", "/device:CUSTOM:Megascale Trace", "Task Environment")
  - repeated XLine lines
  - map<int64, XEventMetadata> event_metadata: every XEvent.metadata_id resolves through this map.
  - map<int64, XStatMetadata> stat_metadata: every XStat.metadata_id resolves through this map.
  - repeated XStat stats: plane-level stats (e.g. device capabilities).
- XLine (a single timeline within a plane, e.g. "Steps", "XLA Ops", "Async XLA Ops")
  - int64 id, int64 display_id, string name, string display_name
  - int64 timestamp_ns: start of this line, nanoseconds since epoch. XEvent.offset_ps is picoseconds relative to this.
  - int64 duration_ps
  - repeated XEvent events
  - reserved 5, 6, 7, 8
- XEvent (one event on a timeline)
  - int64 metadata_id → event_metadata[metadata_id].name for the human label (and op-level stats; see the note on stats below)
  - oneof data { int64 offset_ps | int64 num_occurrences }: offset_ps for normal events, num_occurrences for aggregated counts. Use WhichOneof("data") to discriminate.
  - int64 duration_ps
  - repeated XStat stats: per-execution counters only (e.g. device_offset_ps, device_duration_ps). Op-level stats like hlo_category / flops / shape_with_layout are NOT here; they are on XEventMetadata.stats (see below). XLA shares one XEventMetadata across every execution of the same HLO op.
- XStat (a named value attached to an event, plane, or event-metadata)
  - int64 metadata_id → stat_metadata[metadata_id].name
  - oneof value with six variants: double_value, uint64_value, int64_value, str_value, bytes_value, ref_value. Use WhichOneof("value") to discriminate. ref_value is a back-reference whose payload is stored in XStatMetadata.name.
- XEventMetadata (shared metadata per event-type-id within a plane)
  - int64 id, string name (the HLO op text for XLA Ops), string display_name, bytes metadata, repeated XStat stats, repeated int64 child_id.
  - stats here is the op-level payload for HLO events: hlo_category, tf_op, program_id, flops, model_flops, bytes_accessed, raw_bytes_accessed, shape_with_layout, etc. Resolve names via the same XPlane.stat_metadata map used by XEvent.stats.
- XStatMetadata (shared metadata per stat-type-id within a plane)
  - int64 id, string name, string description.
  - There is no value_type field; the value type is determined per XStat at the use site, via WhichOneof("value").

HloProto on /host:metadata

Modern XProf/JAX captures embed the full xla.HloProto of every JIT-compiled module inside the xplane itself, so you do not need the standalone *.hlo_proto.pb files (and they are not always shipped). Layout:
    XPlane(name='/host:metadata')
    └── event_metadata[i]                                     # one entry per compiled module
        ├── name = 'jit_train_step(8722433274278871538)'      # module label
        └── stats[j]
            ├── metadata_id → stat_metadata[*].name == 'Hlo Proto'
            └── bytes_value → serialized xla.HloProto
To decode: parse the bytes_value with
scripts/_proto/hlo_pb2.HloProto. The
embedded blob and the on-disk <module>.hlo_proto.pb decode to the
same HloProto (only proto field ordering may differ — re-serializing
either produces equally-sized buffers but bytes-level inequality is
expected).
What lives inside HloProto:

- hlo_module.name, hlo_module.id, hlo_module.entry_computation_id
- hlo_module.computations: list of HloComputationProto, each with instructions: list[HloInstructionProto]. Each instruction carries opcode, name, shape, operand_ids, metadata, frontend_attributes, channel_id (for collectives), etc.
- buffer_assignment, schedule, hlo_module.sharding, hlo_module.spmd_output_sharding, hlo_module.frontend_attributes (mesh_shape, num_partitions, Shardy mesh definitions, …).

Use scripts/extract_hlo_proto.py as the reference reader.
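The lookup path described in this section can be sketched as follows. This is an illustration, not the contents of extract_hlo_proto.py: the helper name `embedded_hlo_blobs` is hypothetical, and it only assumes that `space` exposes the XSpace fields listed earlier (planes, event_metadata, stat_metadata, stats, bytes_value).

```python
def embedded_hlo_blobs(space, plane_name="/host:metadata", stat_name="Hlo Proto"):
    """Map module label -> serialized xla.HloProto bytes found on the metadata plane.

    `space` is expected to look like a parsed XSpace message; any object with
    the same attribute layout works, which keeps this sketch easy to test.
    """
    blobs = {}
    for plane in space.planes:
        if plane.name != plane_name:
            continue  # the blob is not attached to device planes
        for event_md in plane.event_metadata.values():
            for stat in event_md.stats:
                meta = plane.stat_metadata.get(stat.metadata_id)
                if meta is not None and meta.name == stat_name:
                    blobs[event_md.name] = stat.bytes_value
    return blobs


# Decoding step (assumption: the vendored scripts/_proto/hlo_pb2.py is importable):
#   proto = hlo_pb2.HloProto()
#   proto.ParseFromString(blob)
```

Keeping the byte extraction separate from the decode step means the same helper can be used to compare the embedded blob against an on-disk *.hlo_proto.pb after both are parsed.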
Planes observed in the dp8_fsdp128 fixture: /host:metadata, /device:TPU:0, /device:TPU:1, /device:TPU:0 SparseCore 0, /device:TPU:0 SparseCore 1, /device:CUSTOM:Megascale Trace, /host:CPU, Task Environment.

Lines on /device:TPU:0: _counters_, Scalar Unit, Steps, XLA Modules, XLA Ops, Async XLA Ops, TC Overlay, XLA TraceMe, counters_0.

Stat names on /device:TPU:0 that show up in this fixture (81 total; non-exhaustive highlights):

- flops, model_flops, bytes_accessed, raw_bytes_accessed, peak_teraflops_per_second, peak_hbm_bw_gigabytes_per_second, peak_sram_rd_bw_gigabytes_per_second, peak_sram_wr_bw_gigabytes_per_second, peak_vmem_rd_bw_gigabytes_per_second, peak_vmem_wr_bw_gigabytes_per_second, peak_cmem_rd_bw_gigabytes_per_second, peak_cmem_wr_bw_gigabytes_per_second.
- hlo_category, hlo_op, tf_op, program_id, symbol_id, deduplicated_name, shape_with_layout, source, source_stack.
- flow (uint64 flow id used to pair *-start ↔ *-done events), device_offset_ps, device_duration_ps, all_reduce_id, all_reduce_unique_id, dcn_collective_info.
- device_id, core_type, core_details, global_chip_id, process_id, replica_id, run_id, queue_id.
- counter_value, % util, power, temperature, throttle %, various HBM FW *, VDD Core FW *, PCIe FW *.

Names you might see in older docs but that are not present in this capture: is_root, occupancy_pct. Don't write code that depends on them without first verifying via dump_xplane_metadata.py.
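A small sketch of the "discover, don't assume" approach to stat names and values. The helper names are hypothetical; the code only relies on the XStat and XStatMetadata fields described earlier, plus the standard protobuf `WhichOneof` method.

```python
def stat_name(plane, stat):
    """Resolve an XStat's name through the owning plane's stat_metadata map."""
    meta = plane.stat_metadata.get(stat.metadata_id)
    return meta.name if meta is not None else f"<unknown stat {stat.metadata_id}>"


def stat_value(plane, stat):
    """Return the populated oneof variant; ref_value is resolved to its string."""
    kind = stat.WhichOneof("value")
    if kind is None:
        return None  # no variant set
    value = getattr(stat, kind)
    if kind == "ref_value":
        # ref_value points at another stat_metadata entry whose name is the payload.
        ref = plane.stat_metadata.get(value)
        return ref.name if ref is not None else value
    return value


def missing_stat_names(plane, required):
    """List the required stat names that this plane does not define at all."""
    present = {meta.name for meta in plane.stat_metadata.values()}
    return sorted(set(required) - present)
```

Calling `missing_stat_names(plane, ["flow", "is_root"])` before an analysis run is one way to fail early when a capture lacks a stat that older documentation mentions.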
After gzip.open(...).read() → json.loads(...):
{
"displayTimeUnit": "ns",
"metadata": { "highres-ticks": true },
"traceEvents": [ ...up to ~1,000,000 events... ]
}
Each event has a ph (phase) field:
| ph | Meaning | Notable fields |
|---|---|---|
| M | Metadata. Names processes & threads. | name ∈ {process_name, process_sort_index, thread_name, thread_sort_index}, args.name |
| X | Complete event (one start + one duration). | name, cat, pid, tid, ts (µs), dur (µs), args |
| i | Instant event. | name, pid, tid, ts |
| B / E | Paired begin/end (rare in TPU profiles). | matched by name within a tid |
pid ↔ XPlane.name and tid ↔ XLine.name are established by the
M events; you must scan all ph='M' events first to build the
pid → process_name and (pid, tid) → thread_name maps before
reading any X/i event.
Truncation caveat: if len(traceEvents) is at the 1M cap, the
trace is incomplete (events at the tail of capture were dropped). Do
not compute totals from a truncated trace.
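The two rules above (scan the M events first, and distrust a trace that sits at the cap) can be sketched with the standard library only. The function names and the exact cap constant are assumptions; the text only gives the cap as roughly one million events, so the check is a heuristic.

```python
import gzip
import json

TRACE_EVENT_CAP = 1_000_000  # approximate cap from the text; heuristic only


def load_trace(path):
    """Read a *.trace.json.gz file and return the decoded top-level object."""
    with gzip.open(path, "rt", encoding="utf-8") as fh:
        return json.load(fh)


def build_name_maps(trace_events):
    """Build pid -> process_name and (pid, tid) -> thread_name from ph='M' events."""
    process_names, thread_names = {}, {}
    for ev in trace_events:
        if ev.get("ph") != "M":
            continue
        label = ev.get("args", {}).get("name")
        if ev.get("name") == "process_name":
            process_names[ev["pid"]] = label
        elif ev.get("name") == "thread_name":
            thread_names[(ev["pid"], ev["tid"])] = label
    return process_names, thread_names


def looks_truncated(trace_events, cap=TRACE_EVENT_CAP):
    """True when the event count has reached the cap, so totals are unreliable."""
    return len(trace_events) >= cap
```

With the maps built, an X event's plane and line are `process_names[ev["pid"]]` and `thread_names[(ev["pid"], ev["tid"])]`.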
All scripts under scripts/ accept a profile directory as
argv[1] and run standalone with stdlib + protobuf. They print
[absent] and exit 0 (no traceback) when the slice they cover is
missing.
| Script | What it shows |
|---|---|
| walk_xplane.py | Full XSpace → planes → lines → events → stats tree, indented. First-look overview. |
| dump_xplane_metadata.py | The event_metadata{} and stat_metadata{} reverse-lookup tables of every plane. |
| extract_step_events.py | Per-step events on the device plane's "Steps" line. |
| extract_hlo_events.py | HLO-level events on "XLA Ops". Op-level stats (hlo_category, tf_op, program_id, flops, model_flops, bytes_accessed, raw_bytes_accessed, shape_with_layout) are read from XEventMetadata.stats, not XEvent.stats. The HLO op text itself is XEventMetadata.name. |
| extract_hlo_proto.py | Decodes the embedded xla.HloProto for every compiled module (the 'Hlo Proto' bytes stat on /host:metadata's XEventMetadata.stats). Cross-checks against any sibling *.hlo_proto.pb files. Pass --dump <module-substring> to print the entry-computation instructions. |
| extract_framework_ops.py | /host:CPU framework events, with stat names discovered (not assumed). |
| extract_collective_events.py | "Async XLA Ops" paired by the flow stat (per-XEvent.stats) — measures exposed comm stall via device_duration_ps of *-done events. |
| read_trace_json.py | trace.json.gz top-level plus pid/tid name maps and sample X/i events. |
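To illustrate the flow-based pairing mentioned in the extract_collective_events.py row, here is a minimal sketch that works on pre-flattened records. It is not the script itself: the record layout (dicts with `op`, `flow`, `device_duration_ps`) and the substring test on `-start` / `-done` are assumptions made for the example.

```python
def pair_by_flow(records):
    """Pair *-start and *-done records that share the same flow id.

    Each record is a hypothetical dict: {"op": str, "flow": int,
    "device_duration_ps": int}. Returns (pairs, unpaired_flow_ids).
    """
    starts, dones = {}, {}
    for rec in records:
        if "-start" in rec["op"]:
            starts[rec["flow"]] = rec
        elif "-done" in rec["op"]:
            dones[rec["flow"]] = rec
    pairs = [(starts[f], dones[f]) for f in starts if f in dones]
    unpaired = sorted(set(starts) ^ set(dones))
    return pairs, unpaired


def total_done_duration_ps(pairs):
    """Sum device_duration_ps over the *-done half of each pair."""
    return sum(done["device_duration_ps"] for _, done in pairs)
```

Reporting the unpaired flow ids separately makes it visible when a capture window cut a collective in half, instead of silently dropping it.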
python3 plugins/tpu-perf/skills/profile-anatomy/scripts/walk_xplane.py \
/Users/xl/tensorboard/tensorboard/plugins/profile/dp8_fsdp128
# Embedded HloProto for every compiled module — works on the
# 2026_05_26_11_29_35 fixture (12 modules embedded) as well as any
# capture lacking standalone *.hlo_proto.pb files.
python3 plugins/tpu-perf/skills/profile-anatomy/scripts/extract_hlo_proto.py \
/Users/xl/tensorboard/tensorboard/plugins/profile/2026_05_26_11_29_35
# Dump the entry computation of the matching module.
python3 plugins/tpu-perf/skills/profile-anatomy/scripts/extract_hlo_proto.py \
/Users/xl/tensorboard/tensorboard/plugins/profile/2026_05_26_11_29_35 \
--dump jit_train_step
- *.xplane.pb is binary protobuf; you must use xplane_pb2. The vendored module is at scripts/_proto/xplane_pb2.py (regeneratable from the adjacent xplane.proto).
- XEvent.stats vs XEventMetadata.stats. This is the most common mistake. XEvent.stats carries only per-execution counters (e.g. device_offset_ps, device_duration_ps). Op-level facts about an HLO op (hlo_category, flops, shape_with_layout, etc.) live on XEventMetadata.stats, because XLA shares one XEventMetadata across every execution of the same op. Resolve names from the same XPlane.stat_metadata map either way. The Async XLA Ops line is the exception: its events carry per-event stats (flow, hlo_op, device_duration_ps) directly on XEvent.stats. Verify with dump_xplane_metadata.py if unsure.
- XStat.value is a 6-variant oneof. Always call WhichOneof("value") first; never assume int64_value.
- XLine.timestamp_ns is nanoseconds since epoch. XEvent.offset_ps is picoseconds relative to that nanosecond timestamp, and XEvent.duration_ps is a length in picoseconds. Convert carefully.
- trace.json.gz may be truncated at ~1M events. Do not compute totals from it; use *.xplane.pb for accurate counts.
- "TC Overlay" is a derived line, not raw hardware events. Don't double-count its events against "XLA Ops".
- Pair *-start and *-done events by flow, not is_root. Don't write code that looks for an is_root stat; it doesn't exist in current captures.
- The HloProto lives on /host:metadata, not /device:*. The per-module HloProto is attached to XEventMetadata.stats of the /host:metadata plane as a 'Hlo Proto' bytes_value (see §2.1), not on any device plane and not on any XEvent.stats. Don't rely on the on-disk *.hlo_proto.pb files being present; the xplane-embedded copy is the source of truth.
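The unit pitfall above can be made concrete with a short sketch. The helper names are hypothetical; the arithmetic follows directly from the stated units (nanoseconds for XLine.timestamp_ns, picoseconds for XEvent.offset_ps and XEvent.duration_ps). It assumes the event uses offset_ps, so check WhichOneof("data") first for aggregated events.

```python
PS_PER_NS = 1_000
PS_PER_US = 1_000_000


def event_start_ps(line, event):
    """Absolute event start in picoseconds since epoch.

    Integer arithmetic is used on purpose: epoch nanoseconds times 1000 does
    not fit in a float64 without losing sub-nanosecond precision.
    """
    return line.timestamp_ns * PS_PER_NS + event.offset_ps


def event_end_ps(line, event):
    """Absolute event end in picoseconds since epoch."""
    return event_start_ps(line, event) + event.duration_ps


def ps_to_us(ps):
    """Convert a picosecond duration to microseconds (the trace.json unit)."""
    return ps / PS_PER_US
```

Converting only durations to floating-point microseconds, and keeping absolute timestamps as integers, avoids rounding errors when ordering events across lines.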