skills/gke-cluster-lifecycle/SKILL.md
Guidance on managing the lifecycle and upgrades of Google Kubernetes Engine (GKE) clusters.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add googlecloudplatform/gke-mcp gke-cluster-lifecycle
This skill provides guidance on managing the lifecycle and upgrades of Google Kubernetes Engine (GKE) clusters.
Managing cluster upgrades is crucial for security and access to new features. GKE provides automated upgrades, but they must be configured to minimize disruption.
Release channels allow you to choose the balance between stability and feature availability.
Command to set release channel:
gcloud container clusters update <cluster-name> \
--release-channel=stable \
--region <region>
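To confirm which channel a cluster is enrolled in, you can read the field back with `gcloud container clusters describe`. The helper below is a minimal sketch: the function name and its positional arguments are hypothetical, and it assumes a regional cluster, as in the command above.

```shell
# Print the release channel a cluster is currently enrolled in.
# Arguments (hypothetical names): $1 = cluster name, $2 = region.
show_release_channel() {
  gcloud container clusters describe "$1" \
    --region "$2" \
    --format="value(releaseChannel.channel)"
}
```

Calling `show_release_channel my-cluster us-central1` (placeholder values) should print the channel name, for example `STABLE`, once the update has taken effect.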
Surge upgrades allow you to specify how many nodes can be created above the target size during an upgrade, minimizing disruption.
Example configuration:
gcloud container node-pools update <pool-name> \
--cluster=<cluster-name> \
--max-surge-upgrade=2 \
--max-unavailable-upgrade=0 \
--region <region>
Setting max-unavailable-upgrade=0 means GKE does not take an existing node offline until a surge node is ready to replace it, so the node pool does not drop below its target size during the upgrade.
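The two settings also determine how long an upgrade takes. As a rough sketch, assuming GKE upgrades up to max-surge-upgrade plus max-unavailable-upgrade nodes at a time, the number of upgrade batches for a pool can be estimated as below. The function name is hypothetical, and the estimate ignores zone-by-zone ordering in multi-zone pools and any service-side cap on parallelism.

```shell
# Rough estimate of how many batches a node pool upgrade needs.
# Assumption: up to (max-surge + max-unavailable) nodes are upgraded at once.
# Arguments: $1 = node count, $2 = max-surge, $3 = max-unavailable.
estimate_batches() {
  nodes="$1"
  parallel=$(( $2 + $3 ))
  if [ "$parallel" -le 0 ]; then
    echo "max-surge and max-unavailable cannot both be 0" >&2
    return 1
  fi
  # Ceiling division: batches needed to cover every node.
  echo $(( (nodes + parallel - 1) / parallel ))
}

estimate_batches 10 2 0   # prints 5
```

With the example configuration (surge 2, unavailable 0), a 10-node pool is upgraded in about five batches. Raising max-surge-upgrade shortens the upgrade at the cost of more temporary nodes.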
For high-risk upgrades, you can create a new node pool (Green) with the new version, test it, and then migrate workloads from the old node pool (Blue).
Steps:
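One way to carry out a manual blue-green migration is sketched below. It is an illustration under assumptions, not necessarily the procedure this skill prescribes. The function names, the `DRY_RUN` switch, and the positional arguments are hypothetical. The script relies on the `cloud.google.com/gke-nodepool` node label to select the Blue pool's nodes and on `kubectl` being configured for the cluster. Commands are only printed unless `DRY_RUN=0` is set.

```shell
DRY_RUN="${DRY_RUN:-1}"   # 1 = print commands only; 0 = execute them

# Print the command in dry-run mode, otherwise run it.
run() {
  if [ "$DRY_RUN" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
}

# Step 1: create the Green pool on the new node version.
# Arguments: $1 = cluster, $2 = region, $3 = Green pool name, $4 = node version.
create_green() {
  run gcloud container node-pools create "$3" \
    --cluster="$1" --node-version="$4" --region "$2"
}

# Step 2: after validating Green, stop scheduling onto Blue and evict its pods.
# Argument: $1 = Blue pool name.
drain_blue() {
  selector="cloud.google.com/gke-nodepool=$1"
  run kubectl cordon --selector="$selector"
  run kubectl drain --selector="$selector" \
    --ignore-daemonsets --delete-emptydir-data
}

# Step 3: remove the Blue pool once workloads are healthy on Green.
# Arguments: $1 = cluster, $2 = region, $3 = Blue pool name.
delete_blue() {
  run gcloud container node-pools delete "$3" --cluster="$1" --region "$2"
}
```

Running `create_green`, then `drain_blue`, then `delete_blue` in order, with time to test workloads on Green between the first two, follows the flow described above. Keeping the steps as separate functions leaves room to pause, or to uncordon Blue with `kubectl uncordon` if Green misbehaves.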
[...] Stable or Regular).
[...] max-surge-upgrade to ensure availability during upgrades.