skills/gke-reliability/SKILL.md
Workflows for ensuring high availability and reliability of GKE workloads.
This skill provides workflows for configuring your GKE cluster and workloads for high availability and reliability.
Check if the cluster is regional or has multi-zonal node pools.
Command:
gcloud container clusters describe <cluster-name> --region <region> --format="json(location, locations)"
If location is a region (e.g., us-central1), the control plane is regional.
If locations has multiple entries, nodes are spread across multiple zones.
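The first check can be sketched in shell. This is a minimal sketch that assumes the usual Google Cloud naming convention, where a zone name is a region name plus a letter suffix (us-central1-a). location_kind is a hypothetical helper, not part of gcloud.

```shell
# Hypothetical helper: classify a GKE "location" value by its name shape.
# Assumption: zones look like <region>-<letter>, regions end in a digit.
location_kind() {
  case "$1" in
    *[0-9]-[a-z]) echo "zonal" ;;     # e.g. us-central1-a
    *[0-9])       echo "regional" ;;  # e.g. us-central1
    *)            echo "unknown" ;;   # anything that fits neither pattern
  esac
}

# Usage: pass the location field from the describe output above, e.g.
#   location_kind us-central1
```

Anything reported as "unknown" should be checked by hand rather than trusted.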
PodDisruptionBudgets (PDBs) ensure that a minimum number of pods remain available during voluntary disruptions (like node upgrades).
Check existing PDBs:
kubectl get pdb -n <namespace>
Example Manifest:
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
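A PDB that currently allows zero disruptions blocks node drains during upgrades (for example, minAvailable: 2 with only two ready replicas). The sketch below flags such PDBs. It assumes the default table output of kubectl get pdb for a single namespace, where ALLOWED DISRUPTIONS is the fourth column. pdbs_blocking_drains is a hypothetical helper name.

```shell
# Hypothetical helper: read `kubectl get pdb` table output on stdin and
# print the names of PDBs whose ALLOWED DISRUPTIONS value is 0.
# Assumption: single-namespace output (no leading NAMESPACE column).
pdbs_blocking_drains() {
  # Skip the header row, then compare the 4th field of each data row.
  awk 'NR > 1 && $4 == "0" { print $1 }'
}

# Usage:
#   kubectl get pdb -n <namespace> | pdbs_blocking_drains
```

With --all-namespaces the columns shift by one, so the field index would need adjusting.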
Ensure all production containers have Liveness, Readiness, and optionally Startup probes.
Check workload probes:
kubectl get deployment <app-name> -n <namespace> -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe"
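The grep above only shows probes that exist. As a rough sketch, the inverse check can list the probe types that are absent. missing_probes is a hypothetical helper; it inspects the manifest as a whole, so in a multi-container pod it cannot tell which container lacks a probe.

```shell
# Hypothetical helper: read a workload manifest (YAML) on stdin and report
# which probe types never appear as a key anywhere in it.
missing_probes() {
  manifest=$(cat)
  for probe in livenessProbe readinessProbe startupProbe; do
    # Match the YAML key form "<probe>:" so JSON inside annotations is ignored.
    printf '%s\n' "$manifest" | grep -q "$probe:" || echo "missing: $probe"
  done
}

# Usage:
#   kubectl get deployment <app-name> -n <namespace> -o yaml | missing_probes
```

Empty output means all three probe types were found at least once.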
Ensure applications handle SIGTERM signals gracefully and have an appropriate terminationGracePeriodSeconds set (default is 30s).
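Kubernetes sends SIGTERM when a pod is terminated and follows with SIGKILL once terminationGracePeriodSeconds has elapsed. A shell script used as the container entrypoint does not forward SIGTERM to the process it started unless it is told to. The following is a minimal sketch of one way to do that, not the only one (replacing the shell with exec is another). run_gracefully is a hypothetical helper name.

```shell
# Hypothetical entrypoint helper: run_gracefully CMD [ARGS...]
# Starts CMD in the background, forwards SIGTERM to it, and waits for it to exit.
run_gracefully() {
  child=""
  # Install the trap first so an early SIGTERM does not kill the wrapper itself.
  trap 'echo "SIGTERM received, stopping child"; [ -n "$child" ] && kill -TERM "$child" 2>/dev/null' TERM
  "$@" &
  child=$!
  # wait returns early (status > 128) when the trapped signal arrives.
  wait "$child" || true
  trap - TERM
  # Wait again so the child has actually exited before the wrapper does.
  wait "$child" 2>/dev/null || true
  echo "child exited"
}

# Usage (as the last line of an entrypoint script):
#   run_gracefully /path/to/server --flag
```

The child still needs to finish its own cleanup within terminationGracePeriodSeconds; this sketch only makes sure the signal reaches it. It also discards the child's exit status, which a real entrypoint would usually propagate.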
Ensure pods are spread across zones or nodes to avoid correlated failures.
Example Manifest excerpt:
spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule # or ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
Configure a maintenance window so that GKE performs automated upgrades outside peak hours.
Command to set maintenance window:
gcloud container clusters update <cluster-name> \
  --region <region> \
  --maintenance-window-start <start-time> \
  --maintenance-window-end <end-time> \
  --maintenance-window-recurrence "FREQ=DAILY"
Use topologySpreadConstraints to ensure pods are distributed across zones, even in regional clusters.