skills/gke-ai-troubleshooting-tpu-connection-failure-vbar-oom/SKILL.md
Diagnose and prevent `vbar_control_agent` segfaults and OOMs caused by race conditions during TPU device resets and frequent metrics collection (e.g. every 3s). Use when TPU slice initialization fails or `vbar_control_agent` crashes on TPU v6e nodes.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add googlecloudplatform/gke-mcp gke-ai-troubleshooting-tpu-connection-failure-vbar-oom
Use this skill to systematically diagnose and prevent vbar_control_agent
segfaults and Out-Of-Memory (OOM) errors on TPU v6e nodes.
gcloud or equivalent tool.

To begin troubleshooting, acquire the following context from the user:
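- Project ID (e.g. customer-ai-project-123)
- Cluster name (e.g. tpu-cluster-prod)
- Node name (e.g. tpu-node-1)
- Job name (e.g. my-training-job-456)
- Incident timestamp (e.g. 2026-04-14T20:00:00Z)

Given the incident timestamp T, calculate the query window as [T - 30m] to [T + 30m].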
Start_Time = T - 30m
End_Time = T + 30m

vbar_control_agent OOMs

Look for specific out-of-memory messages from vbar_control_agent in serial console logs.
Tool: query_logs

Serial Console Logs (OOMs):
logName="projects/<project_id>/logs/serialconsole.googleapis.com%2fserial_port_1_output"
AND labels."compute.googleapis.com/resource_name"="<node_name>"
AND textPayload:"Memory cgroup out of memory: Killed process"
AND textPayload:"(vbar_control_ag)"
AND timestamp >= "<Start_Time>"
AND timestamp <= "<End_Time>"
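
The window arithmetic above can be sketched in Python. This is a minimal illustration, not part of the skill itself: the helper names `query_window` and `with_time_window` are hypothetical, and the sketch assumes that a timestamp supplied without a UTC offset is already in UTC.

```python
from datetime import datetime, timedelta, timezone
from typing import List, Tuple

WINDOW = timedelta(minutes=30)  # the skill's T - 30m / T + 30m window


def query_window(incident_ts: str) -> Tuple[str, str]:
    """Return (Start_Time, End_Time) as RFC 3339 UTC strings."""
    # Older Python versions reject a trailing "Z", so normalise it first.
    t = datetime.fromisoformat(incident_ts.replace("Z", "+00:00"))
    if t.tzinfo is None:
        # Assumption: a timestamp without an offset is already in UTC.
        t = t.replace(tzinfo=timezone.utc)
    t = t.astimezone(timezone.utc)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (t - WINDOW).strftime(fmt), (t + WINDOW).strftime(fmt)


def with_time_window(clauses: List[str], incident_ts: str) -> str:
    """Join filter clauses with AND and append the timestamp bounds."""
    start, end = query_window(incident_ts)
    bounded = list(clauses) + [
        f'timestamp >= "{start}"',
        f'timestamp <= "{end}"',
    ]
    return "\nAND ".join(bounded)


if __name__ == "__main__":
    # Example values taken from the context list in this skill.
    print(with_time_window(
        ['labels."compute.googleapis.com/resource_name"="tpu-node-1"'],
        "2026-04-14T20:00:00Z",
    ))
```

Normalising to UTC before formatting keeps Start_Time and End_Time comparable even when the user reports the incident time in a local offset.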
Look for Memory cgroup out of memory messages related to vbar_control_agent. Stack traces pointing to libtpu::tpunetd::VBARControlHelper::MetricsReadFromVBAR are a strong indicator.
See references/failure_signatures.md for example log patterns.

tpu-device-plugin Metrics Fetch Failures [Low Risk]

Check if tpu-device-plugin is reporting metric fetch failures.
Tool: query_logs

resource.type="k8s_container"
AND resource.labels.project_id="<project_id>"
AND resource.labels.cluster_name="<cluster_name>"
AND resource.labels.container_name="tpu-device-plugin"
AND severity=ERROR
AND textPayload:"metrics fetch failed for"
AND textPayload:"device path with error: checksum didn't match with the metrics data. Corrupt data found"
AND timestamp >= "<Start_Time>"
AND timestamp <= "<End_Time>"
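
Once log entries have been retrieved, the two signatures from this skill can also be checked client-side. The following is a hedged sketch: the `classify` function and the labels `vbar_oom` and `metrics_checksum` are hypothetical names chosen for illustration, and the sample lines in the usage section are invented to show the shape of a match, not copied from real logs. The process name appears as `vbar_control_ag` because the Linux kernel truncates task names to 15 characters.

```python
import re
from typing import Optional

# Kernel OOM-killer line for the (truncated) vbar_control_agent task name.
OOM_RE = re.compile(
    r"Memory cgroup out of memory: Killed process .* \(vbar_control_ag\)"
)

# tpu-device-plugin error reported when fetched metrics fail the checksum.
CHECKSUM_RE = re.compile(
    r"metrics fetch failed for .* deviceID and .* device path with error: "
    r"checksum didn't match with the metrics data\. Corrupt data found"
)


def classify(text_payload: str) -> Optional[str]:
    """Return a hypothetical signature label for a log line, or None."""
    if OOM_RE.search(text_payload):
        return "vbar_oom"
    if CHECKSUM_RE.search(text_payload):
        return "metrics_checksum"
    return None


if __name__ == "__main__":
    # Invented sample lines, for illustration only.
    samples = [
        "Memory cgroup out of memory: Killed process 4242 (vbar_control_ag)",
        "metrics fetch failed for 0 deviceID and /dev/vfio/0 device path "
        "with error: checksum didn't match with the metrics data. "
        "Corrupt data found",
        "unrelated log line",
    ]
    for line in samples:
        print(classify(line), "<-", line[:50])
```

Matching on the client side is useful as a second pass, since it applies the full wildcard pattern regardless of how the server-side filter was written.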
Inquire with the user about any custom TPU metrics collection mechanisms they have deployed.
Look for custom collectors (e.g. those using libtpu.sdk.tpumonitoring) that frequently query GetHostMetrics from the vBAR Control Agent.
If a custom metrics collection agent is identified, advise the user to temporarily disable it.
vbar_control_agent Resiliency Update [Low Risk]

Advise the user that a permanent fix will be available in a future GKE version.
- Calculated the [T - 30m, T + 30m] window.
- Checked for vbar_control_agent segfaults and OOMs using query_logs.
- Checked for tpu-device-plugin failures using query_logs.