plugins/exec-remote/skills/apply-resource/SKILL.md
Manages GKE TPU clusters using xpk. Creates, deletes, and lists TPU Nodepool resources on Google Kubernetes Engine. Multi-user safe - always queries GKE for real-time cluster state.
npx skillsauth add primatrix/skills apply-resourceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill manages TPU clusters on Google Kubernetes Engine (GKE) using xpk.
Multi-User Safe: This skill does NOT use local caching. All cluster information is queried directly from GKE in real-time, ensuring accuracy in multi-user scenarios where clusters may be created, modified, or deleted by other users.
Before using this skill, ensure the following tools are installed:
gcloud auth login to authenticategcloud auth listgke-gcloud-auth-pluginkubectl version --clientxpk --helpThe following defaults apply unless the user explicitly overrides them:
| Parameter | Default |
|----------------|----------------------------|
| PROJECT_ID | tpu-service-473302 |
| CLUSTER_NAME | sglang-jax-agent-tests |
| ZONE | asia-northeast1-b |
| NUM_SLICES | 1 |
Use these values directly — do NOT ask the user to confirm or re-enter them unless they specify otherwise.
When creating a cluster, the following parameters are required:
tpu-service-473302)sglang-jax-agent-tests)v6e-16, v6e-4, v4-8) — must be specified1)asia-northeast1-b)If these parameters are already known from an upstream caller (e.g., exec-remote or deploy-cluster), use them directly — do NOT re-ask the user. Only prompt interactively when this skill is invoked standalone and the user wants to override defaults.
Creates a new GKE cluster with TPU nodepool using Pathways.
Command:
xpk cluster create-pathways \
--cluster $CLUSTER_NAME \
--num-slices=$NUM_SLICES \
--tpu-type=$TPU_TYPE \
--zone=$ZONE \
--spot \
--project=$PROJECT_ID
Interactive Flow:
Deletes an existing GKE cluster.
Command:
xpk cluster delete \
--cluster $CLUSTER_NAME \
--zone=$ZONE \
--project=$PROJECT_ID
Interactive Flow:
Lists all managed clusters.
Command:
xpk cluster list
Real-time Query:
gcloud container clusters listShows detailed information about a specific cluster.
Command:
xpk cluster describe \
--cluster $CLUSTER_NAME \
--zone=$ZONE
When a specified ZONE doesn't support the requested TPU_TYPE:
This skill is designed for multi-user environments:
This skill can be invoked in two ways:
/apply-resource create): Prompt user for all required parameters interactively.deploy-cluster or exec-remote): Parameters (CLUSTER_NAME, TPU_TYPE, NUM_SLICES, ZONE) are already known — use them directly without prompting.gcloud auth listdevelopment
Use when analyzing TPU pretraining HBM occupancy from a profile directory — locates the static HBM peak (the same number TensorBoard's Memory Viewer shows), enumerates every buffer alive at the peak schedule moment with size / HLO instruction / opcode / op_name, and rolls the alive set up by opcode and op_name. Reads compile-time `*.hlo_proto.pb` (BufferAssignmentProto) as the primary source; runtime `*.xplane.pb` allocator events are a secondary, often-truncated signal.
testing
Use when analyzing TPU pretraining compute efficiency from xplane.pb — produces source-line-aggregated HLO duration tables, layer-scoped breakdowns, non-compute (padding/cast/copy) audits, and v7x roofline shortfall vs theoretical peak. Reads schema documented by profile-anatomy.
tools
--- name: comm-analysis description: Use when analyzing communication on a TPU pretraining profile — extracts every comm primitive (async + sync, TC + SparseCore), attributes axes via HLO replica_groups, computes per-row NCCL bus BW vs per-axis peak ICI BW (peak_link × k_torus_dims × directions_per_dim; TPUv7x: 200 GB/s bidir per link on a 3D torus; util% requires `--mesh-spec` with topology), and reports per-step compute/comm overlap. Builds on profile-anatomy. --- # Communication Analysis **
documentation
Use when reading TPU pretraining profiles (xplane.pb, trace.json.gz) — describes the on-disk layout, the XSpace/XPlane/XLine/XEvent/XStat hierarchy, and provides reference scripts that future tpu-perf skills can read as schema documentation.