skills/gke-ai-troubleshooting-jobset-interruption/SKILL.md
Systematically diagnose GKE JobSet interruptions, restarts, and preemptions for AI/ML training workloads. Identifies preemption events, maintenance interruptions, bad host VMs, unhealthy pods, and coordinator worker failures.
npx skillsauth add googlecloudplatform/gke-mcp gke-ai-troubleshooting-jobset-interruption
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Use this skill to systematically diagnose and resolve JobSet interruptions, restarts, and preemptions on GKE clusters hosting large-scale AI/ML workloads.
kube-state-metrics for your cluster (see KSM JobSet Metrics).
To begin troubleshooting, acquire the following context from the user:
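- Project ID (e.g. customer-ai-project-123)
- Cluster name (e.g. tpu-cluster-prod)
- JobSet (workload) name (e.g. llama3-70b-training)
- Namespace (e.g. default)
- Interruption timestamp (e.g. 2026-05-20T08:15:00Z)
Given the interruption timestamp T, calculate the query window as [T - 30m] to [T + 30m].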
Start_Time = T - 30m
End_Time = T + 30m
Verify if the JobSet is experiencing restart loops and determine the frequency of restarts.
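The query window from the context step (T minus and plus 30 minutes) can be computed before running any of the checks. The following is a minimal Python sketch, not part of the skill itself; the function name `query_window` and its `margin_minutes` parameter are hypothetical and introduced only for illustration.

```python
from datetime import datetime, timedelta, timezone


def query_window(interruption_ts: str, margin_minutes: int = 30) -> tuple[str, str]:
    """Return (Start_Time, End_Time) as RFC 3339 UTC strings around timestamp T."""
    # Python versions before 3.11 do not parse a trailing "Z",
    # so normalise it to an explicit UTC offset first.
    t = datetime.fromisoformat(interruption_ts.replace("Z", "+00:00"))
    t = t.astimezone(timezone.utc)
    margin = timedelta(minutes=margin_minutes)
    fmt = "%Y-%m-%dT%H:%M:%SZ"
    return (t - margin).strftime(fmt), (t + margin).strftime(fmt)


start_time, end_time = query_window("2026-05-20T08:15:00Z")
print(start_time, end_time)  # 2026-05-20T07:45:00Z 2026-05-20T08:45:00Z
```

The two returned strings can be substituted for `<Start_Time>` and `<End_Time>` in the log queries.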
monitoring_time_series_chart
fetch prometheus_target
| metric 'prometheus.googleapis.com/kube_jobset_restarts/gauge'
| filter resource.cluster == '<cluster_name>' && metric.jobset_name == '<workload_name>'
| align next_older(1m)
| every 1m
| group_by [metric.jobset_name], [val: max(value)]
BypassSandbox: true).
curl -sSf -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://monitoring.googleapis.com/v1/projects/<project_id>/location/global/prometheus/api/v1/query?query=kube_jobset_restarts%7Bjobset_name%3D%22<workload_name>%22%2Ccluster%3D%22<cluster_name>%22%7D"
Determine if the JobSet restarts were triggered by physical nodepool-level events (such as spot preemptions, maintenance, or host terminations).
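For the restart check in Step 1, two small pieces tend to be error-prone: percent-encoding the PromQL expression for the curl fallback, and turning the restart gauge samples into a restart frequency. The sketch below illustrates both under stated assumptions. The helper names `build_promql_url` and `restart_events` are hypothetical, and the sample values are made up; only the endpoint path and the `kube_jobset_restarts` selector come from the queries in this skill.

```python
from urllib.parse import quote

# Instant-query endpoint as used by the curl fallback in this skill.
PROM_API = (
    "https://monitoring.googleapis.com/v1/projects/{project_id}"
    "/location/global/prometheus/api/v1/query"
)


def build_promql_url(project_id: str, promql: str) -> str:
    """Percent-encode a PromQL expression into an instant-query URL."""
    # safe="" forces braces, quotes, commas and "=" to be encoded as well.
    return PROM_API.format(project_id=project_id) + "?query=" + quote(promql, safe="")


def restart_events(samples: list[tuple[str, float]]) -> list[tuple[str, int]]:
    """Return (timestamp, new_restarts) for each sample where the gauge increased."""
    events = []
    for (_, previous), (ts, current) in zip(samples, samples[1:]):
        if current > previous:
            events.append((ts, int(current - previous)))
    return events


url = build_promql_url(
    "customer-ai-project-123",
    'kube_jobset_restarts{jobset_name="llama3-70b-training",cluster="tpu-cluster-prod"}',
)
print(url)

# Hypothetical one-minute samples of the restart gauge.
samples = [("08:00", 0), ("08:01", 0), ("08:02", 1), ("08:03", 1), ("08:04", 3)]
print(restart_events(samples))  # [('08:02', 1), ('08:04', 2)]
```

Several increases inside the one-hour window point to a restart loop rather than a single interruption.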
monitoring_time_series_chart
fetch k8s_node_pool
| metric 'kubernetes.io/node_pool/interruption_count'
| filter cluster_name == '<cluster_name>'
| align next_older(10m)
| every 10m
| group_by [metric.interruption_type, metric.interruption_reason, metadata.system.node_pool_name], [val: sum(value)]
BypassSandbox: true).
curl -sSf -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://monitoring.googleapis.com/v1/projects/<project_id>/location/global/prometheus/api/v1/query?query=sum%20by%20%28interruption_type%2C%20interruption_reason%2C%20node_pool_name%2C%20cluster_name%29%20%28avg_over_time%28%7B__name__%3D%22kubernetes.io%2Fnode_pool%2Finterruption_count%22%2C%20monitored_resource%3D%22k8s_node_pool%22%2C%20cluster_name%3D%22<cluster_name>%22%7D%5B10m%5D%29%29"
query_logs
resource.type="gke_nodepool"
AND resource.labels.cluster_name="<cluster_name>"
AND timestamp >= "<Start_Time>"
AND timestamp <= "<End_Time>"
interruption_reason or logs for host issues.
Correlate node readiness failures with physical host VMs to see if a single faulty host repeatedly fails coordinator pods.
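For the nodepool interruption check in Step 2, the curl fallback returns a standard Prometheus instant-query JSON document (`data.result` holding one entry per label set, each with a `[timestamp, "value"]` pair). A minimal sketch of ranking that result by node pool, type, and reason follows. The function name `summarize_interruptions` is hypothetical, and the label values in the sample payload are invented for illustration, not actual values emitted by GKE.

```python
from collections import defaultdict


def summarize_interruptions(response: dict) -> list[tuple[str, str, str, float]]:
    """Rank (node_pool, interruption_type, interruption_reason, count) by count."""
    totals = defaultdict(float)
    for series in response.get("data", {}).get("result", []):
        labels = series["metric"]
        key = (
            labels.get("node_pool_name", "unknown"),
            labels.get("interruption_type", "unknown"),
            labels.get("interruption_reason", "unknown"),
        )
        # Prometheus encodes sample values as strings: [timestamp, "value"].
        totals[key] += float(series["value"][1])
    rows = ((*key, total) for key, total in totals.items())
    return sorted(rows, key=lambda row: row[3], reverse=True)


# Hypothetical response payload; label values are illustrative only.
sample = {
    "status": "success",
    "data": {
        "resultType": "vector",
        "result": [
            {"metric": {"node_pool_name": "tpu-pool-a", "interruption_type": "maintenance",
                        "interruption_reason": "scheduled"}, "value": [1779264900, "1"]},
            {"metric": {"node_pool_name": "tpu-pool-b", "interruption_type": "preemption",
                        "interruption_reason": "spot"}, "value": [1779264900, "4"]},
        ],
    },
}

for row in summarize_interruptions(sample):
    print(row)
```

A node pool at the top of this ranking whose interruptions line up with the restart timestamps is the first candidate for an infrastructure-level cause.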
monitoring_time_series_chart
fetch k8s_node
| metric 'kubernetes.io/node/status_condition'
| filter cluster_name == '<cluster_name>' && metric.condition == 'Ready' && metric.status == 'False'
| align next_older(1m)
| every 1m
| group_by [node_name, metadata.user.gke_nodepool], [val: max(value)]
BypassSandbox: true).
curl -sSf -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://monitoring.googleapis.com/v1/projects/<project_id>/location/global/prometheus/api/v1/query?query=sum%20by%20%28status%2C%20condition%2C%20node_pool_name%29%20%28%7B__name__%3D%22kubernetes.io%2Fnode%2Fstatus_condition%22%2C%20monitored_resource%3D%22k8s_node%22%2C%20cluster_name%3D%22<cluster_name>%22%2C%20condition%3D%22Ready%22%2C%20status%3D%22False%22%7D%29"
monitoring_time_series_chart
fetch k8s_node
| metric 'kubernetes.io/node/cpu/total_cores'
| filter cluster_name == '<cluster_name>'
| align next_older(1m)
| every 1m
| group_by [node_name, metadata.user.gce_topology_host, metadata.user.gke_nodepool], [val: max(value)]
query_logs
resource.type="k8s_node"
AND resource.labels.cluster_name="<cluster_name>"
AND (textPayload:"host error" OR textPayload:"kernel panic" OR textPayload:"hardware failure" OR textPayload:"NodeNotReady")
AND timestamp >= "<Start_Time>"
AND timestamp <= "<End_Time>"
Ready=False or Unknown) and correlate them to their GCE physical host ID via metadata.user.gce_topology_host. Check if the same host is repeatedly failing.
Analyze pod status phases and retrieve coordinator worker logs to identify application-level crashes or network deadlocks.
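The host correlation from Step 3 amounts to a join: not-ready observations from the node status query are mapped to physical hosts through the node-to-host mapping from the `gce_topology_host` query, and hosts that appear more than once are flagged. The sketch below assumes both query results have already been flattened into plain Python structures. The function name `repeat_offender_hosts`, the `min_failures` threshold, and all node and host names are hypothetical.

```python
from collections import defaultdict


def repeat_offender_hosts(
    not_ready_events: list[tuple[str, str]],
    node_to_host: dict[str, str],
    min_failures: int = 2,
) -> dict[str, list[tuple[str, str]]]:
    """Group not-ready (node_name, timestamp) events by physical host.

    Only hosts with at least `min_failures` distinct events are returned.
    """
    failures = defaultdict(set)
    for node_name, ts in not_ready_events:
        host = node_to_host.get(node_name)
        if host is None:
            continue  # No known host mapping for this node; skip rather than guess.
        failures[host].add((node_name, ts))
    return {
        host: sorted(events)
        for host, events in failures.items()
        if len(events) >= min_failures
    }


# Hypothetical data: a recreated node can land on the same physical host.
node_to_host = {"gke-tpu-a": "host-17", "gke-tpu-b": "host-17", "gke-tpu-c": "host-42"}
events = [("gke-tpu-a", "08:01"), ("gke-tpu-c", "08:05"), ("gke-tpu-b", "08:20")]

print(repeat_offender_hosts(events, node_to_host))
# {'host-17': [('gke-tpu-a', '08:01'), ('gke-tpu-b', '08:20')]}
```

Grouping by host rather than by node name matters because a replacement node usually gets a new name while still being scheduled onto the same faulty hardware.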
monitoring_time_series_chart
fetch k8s_pod
| metric 'kubernetes.io/pod/status/phase'
| filter cluster_name == '<cluster_name>' && pod_name =~ '<workload_name>.*'
| align next_older(10m)
| every 10m
| group_by [metric.phase], [val: count()]
BypassSandbox: true).
curl -sSf -H "Authorization: Bearer $(gcloud auth print-access-token)" \
"https://monitoring.googleapis.com/v1/projects/<project_id>/location/global/prometheus/api/v1/query?query=sum%20by%20%28phase%29%20%28avg_over_time%28%7B__name__%3D%22kube_pod_status_phase%22%2C%20cluster%3D%22<cluster_name>%22%2C%20pod%3D~%22<workload_name>.*%22%7D%5B10m%5D%29%29"
monitoring_time_series_chart
fetch k8s_pod
| metric 'kubernetes.io/pod/status/unschedulable'
| filter cluster_name == '<cluster_name>' && pod_name =~ '<workload_name>.*'
| align next_older(10m)
| every 10m
| group_by [pod_name], [val: max(value)]
query_logs
resource.type="k8s_container"
AND resource.labels.cluster_name="<cluster_name>"
AND labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="<workload_name>"
AND timestamp >= "<Start_Time>"
AND timestamp <= "<End_Time>"
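The three log queries in this skill share one shape: a resource type, the cluster name, optional extra clauses, and the time window. A minimal sketch of assembling such a filter is shown below so that the window is applied consistently. The helper `build_log_filter` and its parameters are hypothetical; the clause syntax is copied from the queries above.

```python
def build_log_filter(
    resource_type: str,
    cluster_name: str,
    start_time: str,
    end_time: str,
    extra_clauses: tuple[str, ...] = (),
) -> str:
    """Assemble a Cloud Logging filter in the same layout as the queries above."""
    clauses = [
        f'resource.type="{resource_type}"',
        f'resource.labels.cluster_name="{cluster_name}"',
        *extra_clauses,
        f'timestamp >= "{start_time}"',
        f'timestamp <= "{end_time}"',
    ]
    # One clause per line, joined with AND, matching the skill's formatting.
    return "\nAND ".join(clauses)


# Container logs for one JobSet, using the example values from the context step.
print(
    build_log_filter(
        "k8s_container",
        "tpu-cluster-prod",
        "2026-05-20T07:45:00Z",
        "2026-05-20T08:45:00Z",
        extra_clauses=(
            'labels."k8s-pod/jobset_sigs_k8s_io/jobset-name"="llama3-70b-training"',
        ),
    )
)
```

The same helper covers the `gke_nodepool` and `k8s_node` queries by changing the resource type and the extra clauses.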
If Step 2 showed high preemption counts on Spot VMs:
If Step 3 identified a specific host ID (gce-topology-host) that consistently fails or triggers restarts across multiple attempts:
[T - 30m, T + 30m] window.