skills/gke-storage/SKILL.md
Guidance on managing storage in Google Kubernetes Engine (GKE) clusters.
This skill provides guidance on managing storage in Google Kubernetes Engine (GKE) clusters.
GKE supports various storage options, from Persistent Disks to Cloud Storage. Choosing the right storage type and configuring it correctly is essential for performance and reliability.
StorageClasses allow you to describe the "classes" of storage you offer. Different classes might map to quality-of-service levels, or to backup policies.
Example StorageClass Manifest:
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: premium-rwo
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-ssd
  replication-type: regional-pd
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true
Setting allowVolumeExpansion: true is highly recommended for production, because it lets you grow a volume later by editing its PVC instead of recreating the volume.
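To show how a workload would request storage from a class like the one above, here is a minimal sketch of a PersistentVolumeClaim. The claim name `app-data-pvc` and the 200Gi size are hypothetical values chosen for illustration; only `premium-rwo` comes from the manifest above.

```yaml
# Hypothetical claim; name and size are illustrative assumptions.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data-pvc
spec:
  accessModes:
    - ReadWriteOnce              # a Persistent Disk is mounted read-write by one node
  storageClassName: premium-rwo  # the StorageClass defined above
  resources:
    requests:
      storage: 200Gi
```

Because the class uses volumeBindingMode: WaitForFirstConsumer, a claim like this stays Pending until a Pod that references it is scheduled. The disk is then provisioned in a location that matches the Pod's node.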
GKE includes Container Storage Interface (CSI) drivers for dynamic provisioning of storage.
Example using Filestore CSI Driver:
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: filestore-pvc
spec:
  accessModes:
    - ReadWriteMany
  storageClassName: standard-rwx # Pre-installed StorageClass for Filestore
  resources:
    requests:
      storage: 1Ti
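ReadWriteMany means several Pods, possibly on different nodes, can mount the same volume at once. The following sketch shows one way that might look, assuming a hypothetical Deployment named `shared-files-reader` and a placeholder `nginx:stable` image; only the claim name `filestore-pvc` comes from the example above.

```yaml
# Hypothetical Deployment; names, image, and mount path are illustrative assumptions.
apiVersion: apps/v1
kind: Deployment
metadata:
  name: shared-files-reader
spec:
  replicas: 2                    # both replicas mount the same Filestore share
  selector:
    matchLabels:
      app: shared-files-reader
  template:
    metadata:
      labels:
        app: shared-files-reader
    spec:
      containers:
        - name: app
          image: nginx:stable    # placeholder image
          volumeMounts:
            - name: shared
              mountPath: /data
      volumes:
        - name: shared
          persistentVolumeClaim:
            claimName: filestore-pvc   # the PVC from the example above
```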
If allowVolumeExpansion is true in the StorageClass, you can resize a volume by updating the PVC manifest.
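Steps:
1. Edit the PVC manifest and increase spec.resources.requests.storage.
2. Apply the updated manifest.
3. Kubernetes will automatically resize the file system on the volume.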
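As a sketch of what a resize might look like, the manifest below raises the requested size on a hypothetical claim named `app-data-pvc`. It assumes the claim was originally created at 200Gi and is bound to the `premium-rwo` class, which has allowVolumeExpansion: true. Only the storage request changes.

```yaml
# Hypothetical claim, re-applied with a larger request.
apiVersion: v1
kind: PersistentVolumeClaim
metadata:
  name: app-data-pvc
spec:
  accessModes:
    - ReadWriteOnce
  storageClassName: premium-rwo
  resources:
    requests:
      storage: 300Gi   # assumed previous value: 200Gi; a request can grow but not shrink
```

Volume expansion is one-way: Kubernetes rejects a request that is smaller than the current size.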
Set allowVolumeExpansion: true in your StorageClasses to allow for growth.
Use pd-ssd or pd-extreme for I/O-intensive workloads, and pd-standard or pd-balanced for others.
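Both recommendations can be combined in a single StorageClass for general-purpose workloads. This is a minimal sketch; the class name `balanced-expandable` is a hypothetical choice, not a GKE built-in.

```yaml
# Hypothetical general-purpose class; the name is an illustrative assumption.
apiVersion: storage.k8s.io/v1
kind: StorageClass
metadata:
  name: balanced-expandable
provisioner: pd.csi.storage.gke.io
parameters:
  type: pd-balanced              # switch to pd-ssd or pd-extreme for I/O-intensive workloads
volumeBindingMode: WaitForFirstConsumer
allowVolumeExpansion: true       # lets PVCs that use this class be resized later
```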