plugins/exec-remote/skills/exec-remote/SKILL.md
Executes Python scripts, tests, or benchmarks on a provisioned remote cluster (GPU or TPU) using SkyPilot. Use this skill when the user asks to run code on GPU, TPU, or any "remote" cluster.
npx skillsauth add primatrix/skills exec-remote
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill handles running code on remote GPU or TPU clusters via SkyPilot.
The following defaults apply unless the user explicitly overrides them:
| Parameter | Default |
|----------------|----------------------------|
| PROJECT_ID | tpu-service-473302 |
| CLUSTER_NAME | sglang-jax-agent-tests |
| ZONE | asia-northeast1-b |
| NUM_SLICES | 1 |
Use these values directly — do NOT ask the user to confirm or re-enter them unless they specify otherwise.
Identify the target device from the user's request:
| Target | Cluster name file | Env prefix |
|--------|---------------------|------------------------------------|
| GPU | .cluster_name_gpu | export CUDA_VISIBLE_DEVICES=0; |
| TPU | .cluster_name_tpu | (none) |
If the user does not specify a device, ask them which one to use.
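The device table can be read as a small lookup from target to cluster name file and environment prefix. The following is a minimal Python sketch of that lookup, not part of the skill itself; the names `DEVICE_CONFIG` and `read_cluster_name` are hypothetical, and treating a missing or blank file as "no cluster yet" is an assumption about how the files are used.

```python
from pathlib import Path

# Mirrors the device table: cluster name file and env prefix per target.
DEVICE_CONFIG = {
    "gpu": {
        "cluster_file": ".cluster_name_gpu",
        "env_prefix": "export CUDA_VISIBLE_DEVICES=0; ",
    },
    "tpu": {
        "cluster_file": ".cluster_name_tpu",
        "env_prefix": "",
    },
}


def read_cluster_name(target, project_root="."):
    """Return the cluster name for `target`, or None if the file is missing or empty."""
    config = DEVICE_CONFIG[target.lower()]
    path = Path(project_root) / config["cluster_file"]
    if not path.is_file():
        return None
    name = path.read_text().strip()
    return name or None
```

A `None` result would mean a cluster still has to be provisioned before anything can be executed.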
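Before running anything, check that the relevant cluster name file (.cluster_name_gpu or .cluster_name_tpu) exists and is non-empty in the project root. If it does not, provision a cluster first.

GPU clusters are provisioned using the standalone launch_gpu.sh script. Locate it in the scripts/ directory alongside this skill definition.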
# Common accelerator types: H100:1, A100:1, L4:1
bash <absolute_path_to_launch_gpu.sh> <accelerator_type> <experiment_name>
The launch script automatically updates .cluster_name_gpu.
There are two provisioning paths for TPU:
Path A: GKE-based (via the deploy-cluster skill), recommended.

This path provisions TPU on GKE using the full pipeline: apply-resource -> deploy-cluster -> exec-remote.
Each TPU type gets its own SkyPilot cluster named <cluster>-<username>-<tpu_type>, allowing multiple topologies to run in parallel.
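Invoke the deploy-cluster skill, which will:
- use the GKE cluster provisioned earlier in the pipeline (apply-resource)
- update .cluster_name_tpu

The skill is invoked as /deploy-cluster.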
Supported TPU types: v6e-1, v6e-4, v6e-8, v6e-16, v6e-32, v6e-64, v6e-128, v6e-256
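The per-TPU-type naming scheme and the supported type list can be combined into one helper. This is an illustrative sketch only; `skypilot_cluster_name` and `SUPPORTED_TPU_TYPES` are hypothetical names, and rejecting unknown types with a `ValueError` is an assumption rather than documented behaviour of deploy-cluster.

```python
# TPU types listed as supported in this document.
SUPPORTED_TPU_TYPES = (
    "v6e-1", "v6e-4", "v6e-8", "v6e-16",
    "v6e-32", "v6e-64", "v6e-128", "v6e-256",
)


def skypilot_cluster_name(cluster, username, tpu_type):
    """Build the <cluster>-<username>-<tpu_type> SkyPilot cluster name."""
    if tpu_type not in SUPPORTED_TPU_TYPES:
        raise ValueError(f"unsupported TPU type: {tpu_type!r}")
    return f"{cluster}-{username}-{tpu_type}"
```

For example, `skypilot_cluster_name("sglang-jax-agent-tests", "hongmao", "v6e-1")` yields the name used in the examples later in this document.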
For quick, single-node TPU usage without GKE, use the standalone launch_tpu.sh script:
# Common accelerator types: tpu-v4-8, tpu-v4-16, tpu-v6e-1, tpu-v6e-4
bash <absolute_path_to_launch_tpu.sh> <accelerator_type> <experiment_name>
The launch script automatically updates .cluster_name_tpu.
# GPU
sky down $(cat .cluster_name_gpu) -y
# TPU (tear down all per-TPU-type clusters)
sky down <CLUSTER_NAME>-<USERNAME>-v6e-1 -y
sky down <CLUSTER_NAME>-<USERNAME>-v6e-4 -y
For GKE-based TPU, also remove the GKE cluster via /apply-resource delete if no longer needed.
sky exec $(cat .cluster_name_gpu) --workdir . "export CUDA_VISIBLE_DEVICES=0; uv run --extra gpu python <PATH_TO_SCRIPT> [ARGS]"
- export CUDA_VISIBLE_DEVICES=0; ensures deterministic single-GPU execution. Adjust for multi-GPU jobs.
- --extra gpu activates GPU optional dependencies (e.g. jax[cuda]).

sky exec <CLUSTER_NAME>-<USERNAME>-<TPU_TYPE> --workdir . "uv run --extra tpu python <PATH_TO_SCRIPT> [ARGS]"

- --extra tpu activates TPU optional dependencies (e.g. jax[tpu]).
- The cluster name follows the per-TPU-type pattern (e.g. sglang-jax-agent-tests-hongmao-v6e-1).
- --workdir . syncs the current local directory to the remote instance before running.
- For pytest, use python -m pytest <test_path> instead of calling pytest directly.

Run a benchmark on GPU:
sky exec $(cat .cluster_name_gpu) --workdir . "export CUDA_VISIBLE_DEVICES=0; uv run --extra gpu python src/lynx/perf/benchmark_train.py"
Run tests on TPU (single type):
sky exec sglang-jax-agent-tests-hongmao-v6e-4 --workdir . "uv run --extra tpu python -m pytest src/lynx/test/"
Run CI tests on multiple TPU types in parallel:
# Deploy both types (sequential — config.yaml is global)
python <deploy-cluster>/scripts/deploy.py sglang-jax-agent-tests v6e-1 asia-northeast1-b
python <deploy-cluster>/scripts/deploy.py sglang-jax-agent-tests v6e-4 asia-northeast1-b
# Execute in parallel
sky exec sglang-jax-agent-tests-hongmao-v6e-1 --workdir . "python test/srt/run_suite.py --suite unit-test-tpu-v6e-1" &
sky exec sglang-jax-agent-tests-hongmao-v6e-4 --workdir . "python test/srt/run_suite.py --suite e2e-test-tpu-v6e-4" &
wait
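The GPU and TPU command shapes shown above differ only in the environment prefix and the `--extra` value. As a rough sketch, they could be assembled like this; `build_sky_exec` is a hypothetical helper, and it only builds the argument list without running anything.

```python
import shlex


def build_sky_exec(cluster, script_args, target):
    """Return the argv for a `sky exec` call on a GPU or TPU cluster.

    `script_args` is everything that follows `python` on the remote side,
    e.g. ["-m", "pytest", "src/lynx/test/"].
    """
    if target not in ("gpu", "tpu"):
        raise ValueError(f"unknown target: {target!r}")
    # GPU runs pin a single device; TPU runs need no prefix.
    prefix = "export CUDA_VISIBLE_DEVICES=0; " if target == "gpu" else ""
    remote = f"{prefix}uv run --extra {target} python {shlex.join(script_args)}"
    return ["sky", "exec", cluster, "--workdir", ".", remote]
```

The returned list can be passed to `subprocess.run` as is, which avoids a second layer of local shell quoting around the remote command.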
- sky exec streams stdout and stderr directly to the terminal.
- Ctrl+C may not kill the remote process; check SkyPilot docs for cleanup if needed.

When the user requests to run code on TPU and no .cluster_name_tpu exists (or the user explicitly wants a new cluster), follow this procedure to orchestrate the full pipeline: apply-resource -> deploy-cluster -> exec-remote.
All parameters use defaults unless the user explicitly overrides them — do NOT ask for confirmation.
Ask the user only for required parameters they have not specified (TPU_TYPE has no default). Use defaults for everything else:
| Parameter | Default | Notes |
|----------------|-----------------------------------|---------------------------------|
| PROJECT_ID | tpu-service-473302 | GCP project ID |
| CLUSTER_NAME | sglang-jax-agent-tests | GKE cluster name |
| TPU_TYPE | (must specify) | e.g. v6e-4, v6e-1 |
| NUM_SLICES | 1 | Default to 1 |
| ZONE | asia-northeast1-b | Must support the chosen TPU type |
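The "defaults unless overridden" rule amounts to merging user overrides onto the table and reporting which required values are still missing. A minimal sketch, with `resolve_params` as a hypothetical name:

```python
# Defaults copied from the parameter table; TPU_TYPE has none.
DEFAULTS = {
    "PROJECT_ID": "tpu-service-473302",
    "CLUSTER_NAME": "sglang-jax-agent-tests",
    "NUM_SLICES": 1,
    "ZONE": "asia-northeast1-b",
}
REQUIRED = ("TPU_TYPE",)


def resolve_params(overrides):
    """Merge user overrides onto the defaults.

    Returns (params, missing), where `missing` lists required parameters
    the user still has to supply.
    """
    params = {**DEFAULTS, **overrides}
    missing = [key for key in REQUIRED if not params.get(key)]
    return params, missing
```

Only the entries in `missing` would need to be asked for; every other value is used without confirmation.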
Check prerequisites, then create the GKE cluster:
which xpk && which gcloud && which kubectl
xpk cluster create-pathways \
--cluster $CLUSTER_NAME \
--num-slices=$NUM_SLICES \
--tpu-type=$TPU_TYPE \
--zone=$ZONE \
--spot \
--project=$PROJECT_ID
Poll until the cluster status becomes RUNNING. Do NOT proceed to deploy SkyPilot while status is PROVISIONING or RECONCILING — it will fail with SSL errors.
gcloud container clusters list --project=$PROJECT_ID \
--filter="name=$CLUSTER_NAME" --format="table(name,location,status)"
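The polling step can be sketched as a loop that repeats the status query until it reports RUNNING. The code below is an illustration under assumptions: the 30-second interval and 30-minute timeout are arbitrary, `--format=value(status)` is used in place of the table format so the output is a bare status string, and the function names are hypothetical.

```python
import subprocess
import time


def gke_cluster_status(project_id, cluster_name):
    """Query the cluster status with gcloud; returns '' if no cluster matches."""
    result = subprocess.run(
        [
            "gcloud", "container", "clusters", "list",
            f"--project={project_id}",
            f"--filter=name={cluster_name}",
            "--format=value(status)",
        ],
        capture_output=True, text=True, check=True,
    )
    return result.stdout.strip()


def wait_until_running(get_status, interval_s=30.0, timeout_s=1800.0, sleep=time.sleep):
    """Poll `get_status` until it returns RUNNING; False if the timeout elapses."""
    waited = 0.0
    while True:
        if get_status() == "RUNNING":
            return True
        if waited >= timeout_s:
            return False
        # PROVISIONING or RECONCILING: keep waiting before deploying SkyPilot.
        sleep(interval_s)
        waited += interval_s
```

A call such as `wait_until_running(lambda: gke_cluster_status(project_id, cluster_name))` would gate the deploy step; passing the status function in keeps the loop testable without gcloud.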
Run the deploy script for each required TPU type. Each call creates a separate SkyPilot cluster.
# Deploy each TPU type (must be sequential — config.yaml is global)
# Only tpu_type is required; cluster_name and zone use defaults
python <path-to-deploy-cluster>/scripts/deploy.py v6e-1
python <path-to-deploy-cluster>/scripts/deploy.py v6e-4
This creates:
- $CLUSTER_NAME-$USERNAME-v6e-1: SkyPilot cluster for v6e-1 tests
- $CLUSTER_NAME-$USERNAME-v6e-4: SkyPilot cluster for v6e-4 tests

After completion, verify:
sky status # Both clusters should show as UP
Determine num_nodes from the TPU type (v6e-N where total_chips = N, num_nodes = N / 4, minimum 1):
| TPU type | num_nodes |
|----------|-----------|
| v6e-1    | 1         |
| v6e-4    | 1         |
| v6e-8    | 2         |
| v6e-16   | 4         |
| v6e-32   | 8         |
| v6e-64   | 16        |
| v6e-128  | 32        |
| v6e-256  | 64        |
For single-node types (v6e-1, v6e-4), omit --num-nodes. For multi-node types, add --num-nodes <N>.
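The node-count rule is integer division by 4 with a floor of 1. A small sketch that also produces the optional `--num-nodes` arguments; both function names are hypothetical.

```python
def num_nodes_for(tpu_type):
    """Compute num_nodes for a v6e-N type: N // 4, with a minimum of 1."""
    chips = int(tpu_type.rsplit("-", 1)[1])
    return max(1, chips // 4)


def num_nodes_args(tpu_type):
    """Return the --num-nodes arguments, or nothing for single-node types."""
    nodes = num_nodes_for(tpu_type)
    return ["--num-nodes", str(nodes)] if nodes > 1 else []
```

For v6e-8 this gives `["--num-nodes", "2"]`, matching the multi-node example, while v6e-1 and v6e-4 give an empty list.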
# Single-node (v6e-1, v6e-4) — use per-TPU-type cluster name
sky exec $CLUSTER_NAME-$USERNAME-v6e-1 --workdir . \
"uv run --extra tpu python <PATH_TO_SCRIPT> [ARGS]"
# Multi-node (v6e-8+)
sky exec $CLUSTER_NAME-$USERNAME-v6e-8 --num-nodes 2 --workdir . \
"uv run --extra tpu python <PATH_TO_SCRIPT> [ARGS]"
# Parallel execution across multiple TPU types
sky exec $CLUSTER_NAME-$USERNAME-v6e-1 --workdir . "..." &
sky exec $CLUSTER_NAME-$USERNAME-v6e-4 --workdir . "..." &
wait
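The shell pattern of backgrounding each `sky exec` with `&` and then calling `wait` has a direct Python analogue. This is a hedged sketch rather than a prescribed method; `run_parallel` is a hypothetical helper that accepts any list of argv lists.

```python
import subprocess


def run_parallel(commands):
    """Start every command at once, then wait for all of them.

    Returns the exit codes in the same order as `commands`.
    """
    procs = [subprocess.Popen(cmd) for cmd in commands]
    return [proc.wait() for proc in procs]
```

Unlike a bare `wait` with no arguments, which returns 0 regardless of how the background jobs ended, this keeps each job's exit code, so a failing suite on one TPU type is not masked by a passing one.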
When the user requests teardown, remove both layers:
# 1. Remove SkyPilot clusters (one per TPU type)
sky down $CLUSTER_NAME-$USERNAME-v6e-1 -y
sky down $CLUSTER_NAME-$USERNAME-v6e-4 -y
# 2. Remove GKE cluster (only for Path A / GKE-based)
xpk cluster delete \
--cluster $CLUSTER_NAME \
--zone=$ZONE \
--project=$PROJECT_ID
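The two teardown layers can be expressed as an ordered list of commands: one `sky down` per TPU type, followed by the GKE deletion only for the GKE-based path. A minimal sketch with the hypothetical name `teardown_commands`; it returns the commands instead of running them so they can be reviewed first.

```python
def teardown_commands(cluster_name, username, tpu_types, zone, project_id, gke_based=True):
    """Return teardown commands in order: SkyPilot clusters first, then GKE."""
    commands = [
        ["sky", "down", f"{cluster_name}-{username}-{tpu_type}", "-y"]
        for tpu_type in tpu_types
    ]
    if gke_based:
        # Only the GKE-based path has a GKE cluster to delete.
        commands.append([
            "xpk", "cluster", "delete",
            "--cluster", cluster_name,
            f"--zone={zone}",
            f"--project={project_id}",
        ])
    return commands
```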