skills/gke-backup-dr/SKILL.md
Workflows for configuring Backup for GKE and disaster recovery.
This skill provides workflows for protecting your stateful workloads on GKE using Backup for GKE.
Backup for GKE must be enabled at the cluster level before any backup plan can be created.
Command:
gcloud container clusters update <cluster-name> \
--update-addons=BackupRestore=ENABLED \
--location=<location>
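To confirm that the add-on is active after the update, one option is to read the cluster's add-on configuration. The sketch below assumes the agent status appears under `addonsConfig.gkeBackupAgentConfig.enabled` in the `describe` output and that `--location` accepts the cluster's region or zone. `backup_agent_enabled` is a hypothetical helper name, not part of gcloud.

```shell
# Hypothetical helper: exits 0 when the Backup for GKE agent is reported as enabled,
# 1 when it is not, and 2 when the describe call itself fails.
# Assumes the field path addonsConfig.gkeBackupAgentConfig.enabled.
backup_agent_enabled() {
  local cluster="$1" location="$2" enabled
  enabled="$(gcloud container clusters describe "$cluster" \
    --location "$location" \
    --format='value(addonsConfig.gkeBackupAgentConfig.enabled)')" || return 2
  case "$enabled" in
    True|true) return 0 ;;
    *) return 1 ;;
  esac
}
```

Usage: `backup_agent_enabled <cluster-name> <location> && echo "Backup for GKE is enabled"`.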
A Backup Plan defines what to back up, when, and for how long.
Command to create a backup plan:
gcloud container backup-restore backup-plans create <plan-name> \
--cluster=projects/<project-id>/locations/<cluster-location>/clusters/<cluster-name> \
--location=<location> \
--backup-retain-days=<days> \
--cron-schedule="<cron-expression>" \
--all-namespaces
[!NOTE] You can replace --all-namespaces with --included-namespaces=<namespace1>,<namespace2> to back up specific namespaces instead of all of them.
Encryption Note: You can specify a Customer-Managed Encryption Key (CMEK) to encrypt backups. Add --encryption-key=<key-resource-name> to the create command, where <key-resource-name> is the full Cloud KMS key path.
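To make the placeholders concrete, here is a minimal sketch that prints a backup plan command for review without running it. `print_backup_plan_cmd` and every example value (project, cluster, namespaces, schedule) are hypothetical. The flag names are assumptions about the current gcloud release, so check them against `gcloud container backup-restore backup-plans create --help`.

```shell
# Hypothetical helper: prints (does not run) a backup-plans create command.
# An empty or missing sixth argument selects all namespaces.
print_backup_plan_cmd() {
  local plan="$1" cluster_path="$2" location="$3" days="$4" cron="$5" namespaces="${6:-}"
  local scope="--all-namespaces"
  if [ -n "$namespaces" ]; then
    scope="--included-namespaces=$namespaces"
  fi
  printf 'gcloud container backup-restore backup-plans create %s --cluster=%s --location=%s --backup-retain-days=%s --cron-schedule="%s" %s\n' \
    "$plan" "$cluster_path" "$location" "$days" "$cron" "$scope"
}

# Example values (all hypothetical): daily at 03:00, 30-day retention, two namespaces.
print_backup_plan_cmd nightly-plan \
  projects/my-project/locations/us-central1/clusters/prod-cluster \
  us-central1 30 "0 3 * * *" "shop,payments"
```

Printing the command first makes it easy to review the schedule and retention before anything is created.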
Trigger a backup immediately outside the schedule.
Command:
gcloud container backup-restore backups create <backup-name> \
--backup-plan=<plan-name> \
--location=<location>
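A manual backup runs asynchronously, so a script that depends on it may want to wait until it finishes. The following is a sketch only: `wait_for_backup` is a hypothetical helper, and it assumes the backup resource reports a `state` field whose terminal values are `SUCCEEDED` and `FAILED`.

```shell
# Hypothetical helper: polls a backup until it reaches a terminal state.
# Returns 0 on SUCCEEDED, 1 on FAILED, 2 if gcloud fails, 3 on timeout.
# Arguments: backup name, backup plan, location, poll interval (s), max attempts.
wait_for_backup() {
  local backup="$1" plan="$2" location="$3" interval="${4:-30}" tries="${5:-60}" state
  while [ "$tries" -gt 0 ]; do
    state="$(gcloud container backup-restore backups describe "$backup" \
      --backup-plan="$plan" --location="$location" \
      --format='value(state)')" || return 2
    case "$state" in
      SUCCEEDED) return 0 ;;
      FAILED) echo "backup $backup failed" >&2; return 1 ;;
    esac
    tries=$((tries - 1))
    sleep "$interval"
  done
  echo "timed out waiting for backup $backup" >&2
  return 3
}
```

Usage: `wait_for_backup <backup-name> <plan-name> <location>` polls every 30 seconds for up to 60 attempts by default.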
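Restore a workload or cluster from a backup. A restore plan is a reusable configuration that ties a source backup plan to a target cluster and defines what to restore and how conflicts are handled.
Command to create a restore plan:
gcloud container backup-restore restore-plans create <restore-plan-name> \
--cluster=projects/<project-id>/locations/<cluster-location>/clusters/<target-cluster-name> \
--location=<location> \
--backup-plan=projects/<project-id>/locations/<location>/backupPlans/<source-backup-plan-name> \
--all-namespaces \
--cluster-resource-conflict-policy=USE_EXISTING_VERSION \
--namespaced-resource-restore-mode=FAIL_ON_CONFLICT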
Execute the restore:
gcloud container backup-restore restores create <restore-name> \
--restore-plan=<restore-plan-name> \
--backup=projects/<project-id>/locations/<location>/backupPlans/<source-backup-plan-name>/backups/<backup-name> \
--location=<location>
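When restoring, you often want the most recent backup that completed successfully. The sketch below assumes that `backups list` exposes `state`, `createTime`, and `name` fields and relies on the standard gcloud `--filter`, `--sort-by`, `--limit`, and `--format` list flags. `latest_successful_backup` is a hypothetical helper name.

```shell
# Hypothetical helper: prints the resource name of the newest SUCCEEDED backup
# in a backup plan, or nothing if no such backup exists.
latest_successful_backup() {
  local plan="$1" location="$2"
  gcloud container backup-restore backups list \
    --backup-plan="$plan" --location="$location" \
    --filter='state=SUCCEEDED' \
    --sort-by='~createTime' --limit=1 \
    --format='value(name)'
}
```

The printed value is expected to be a full resource path, so it can be passed as the `--backup` argument of the restore command.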