infrastructure/cloud-gcp/gcp-gke/SKILL.md
Deploy and manage Google Kubernetes Engine clusters. Configure node pools, networking, and workload identity. Use when running Kubernetes on GCP.
npx skillsauth add bagelhole/devops-security-agent-skills gcp-gkeInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Deploy, operate, and scale managed Kubernetes clusters on Google Cloud Platform.
gcloud) and kubectl installedroles/container.admin for cluster managementgcloud services enable container.googleapis.com compute.googleapis.com
gcloud components install kubectl
| Feature | Standard | Autopilot | |---------|----------|-----------| | Node management | You manage node pools | Google manages nodes | | Pricing | Pay per node (VM) | Pay per pod resource request | | GPU/TPU | Full support | Supported (with limits) | | DaemonSets | Allowed | Restricted | | Best for | Full control, specialized HW | Hands-off, cost-optimized |
gcloud container clusters create prod-cluster \
--region=us-central1 --num-nodes=2 \
--machine-type=e2-standard-4 --disk-size=100 \
--enable-autoscaling --min-nodes=1 --max-nodes=5 \
--enable-autorepair --enable-autoupgrade \
--release-channel=regular \
--workload-pool=${PROJECT_ID}.svc.id.goog \
--enable-ip-alias --enable-network-policy \
--enable-shielded-nodes \
--logging=SYSTEM,WORKLOAD --monitoring=SYSTEM,WORKLOAD \
--labels=env=production,team=platform
gcloud container clusters get-credentials prod-cluster --region=us-central1
gcloud container clusters create-auto autopilot-prod \
--region=us-central1 --release-channel=regular \
--workload-pool=${PROJECT_ID}.svc.id.goog \
--network=my-vpc --subnetwork=gke-subnet
# High-memory pool with taint
gcloud container node-pools create highmem-pool \
--cluster=prod-cluster --region=us-central1 \
--machine-type=n2-highmem-8 --disk-size=200 --disk-type=pd-ssd \
--num-nodes=1 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
--node-labels=workload=memory-intensive \
--node-taints=dedicated=highmem:NoSchedule
# GPU pool
gcloud container node-pools create gpu-pool \
--cluster=prod-cluster --region=us-central1 \
--machine-type=n1-standard-8 \
--accelerator=type=nvidia-tesla-t4,count=1 \
--num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
--node-taints=nvidia.com/gpu=present:NoSchedule
# Spot pool for batch workloads
gcloud container node-pools create spot-pool \
--cluster=prod-cluster --region=us-central1 \
--machine-type=e2-standard-4 --spot \
--num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=20 \
--node-taints=cloud.google.com/gke-spot=true:NoSchedule
# Create GSA and grant permissions
gcloud iam service-accounts create app-gsa
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
--member="serviceAccount:app-gsa@${PROJECT_ID}.iam.gserviceaccount.com" \
--role="roles/storage.objectViewer"
# Create KSA and bind to GSA
kubectl create namespace myapp
kubectl create serviceaccount app-ksa --namespace=myapp
gcloud iam service-accounts add-iam-policy-binding \
app-gsa@${PROJECT_ID}.iam.gserviceaccount.com \
--role=roles/iam.workloadIdentityUser \
--member="serviceAccount:${PROJECT_ID}.svc.id.goog[myapp/app-ksa]"
kubectl annotate serviceaccount app-ksa --namespace=myapp \
iam.gke.io/gcp-service-account=app-gsa@${PROJECT_ID}.iam.gserviceaccount.com
# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: web-app
namespace: myapp
spec:
replicas: 3
selector:
matchLabels: { app: web-app }
template:
metadata:
labels: { app: web-app }
spec:
serviceAccountName: app-ksa
containers:
- name: web
image: us-central1-docker.pkg.dev/PROJECT_ID/repo/web-app:v1.2.0
ports: [{ containerPort: 8080 }]
resources:
requests: { cpu: 250m, memory: 512Mi }
limits: { cpu: 500m, memory: 1Gi }
readinessProbe:
httpGet: { path: /healthz, port: 8080 }
initialDelaySeconds: 5
livenessProbe:
httpGet: { path: /healthz, port: 8080 }
initialDelaySeconds: 15
topologySpreadConstraints:
- maxSkew: 1
topologyKey: topology.kubernetes.io/zone
whenUnsatisfiable: DoNotSchedule
labelSelector:
matchLabels: { app: web-app }
---
apiVersion: v1
kind: Service
metadata: { name: web-app, namespace: myapp }
spec:
selector: { app: web-app }
ports: [{ port: 80, targetPort: 8080 }]
type: ClusterIP
apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
name: web-ingress
namespace: myapp
annotations:
kubernetes.io/ingress.class: "gce"
networking.gke.io/managed-certificates: "web-cert"
kubernetes.io/ingress.global-static-ip-name: "web-static-ip"
spec:
rules:
- host: app.example.com
http:
paths:
- path: /
pathType: Prefix
backend:
service: { name: web-app, port: { number: 80 } }
---
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata: { name: web-cert, namespace: myapp }
spec:
domains: [app.example.com]
gcloud compute addresses create web-static-ip --global
resource "google_container_cluster" "primary" {
name = "prod-cluster"
location = "us-central1"
release_channel { channel = "REGULAR" }
workload_identity_config { workload_pool = "${var.project_id}.svc.id.goog" }
network = google_compute_network.vpc.name
subnetwork = google_compute_subnetwork.gke.name
ip_allocation_policy {
cluster_secondary_range_name = "pods"
services_secondary_range_name = "services"
}
private_cluster_config {
enable_private_nodes = true
master_ipv4_cidr_block = "172.16.0.0/28"
}
network_policy { enabled = true }
logging_config { enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"] }
monitoring_config {
enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
managed_prometheus { enabled = true }
}
remove_default_node_pool = true
initial_node_count = 1
}
resource "google_container_node_pool" "primary" {
name = "primary-pool"
cluster = google_container_cluster.primary.name
location = "us-central1"
initial_node_count = 2
autoscaling { min_node_count = 1; max_node_count = 5 }
management { auto_repair = true; auto_upgrade = true }
node_config {
machine_type = "e2-standard-4"
disk_size_gb = 100
disk_type = "pd-balanced"
oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
shielded_instance_config {
enable_secure_boot = true
enable_integrity_monitoring = true
}
metadata = { disable-legacy-endpoints = "true" }
}
}
resource "google_compute_subnetwork" "gke" {
name = "gke-subnet"
ip_cidr_range = "10.0.0.0/20"
region = "us-central1"
network = google_compute_network.vpc.id
secondary_ip_range { range_name = "pods"; ip_cidr_range = "10.4.0.0/14" }
secondary_ip_range { range_name = "services"; ip_cidr_range = "10.8.0.0/20" }
}
gcloud container clusters list
gcloud container clusters upgrade prod-cluster --region=us-central1 --master
kubectl top nodes && kubectl top pods --namespace=myapp
kubectl scale deployment web-app --replicas=5 --namespace=myapp
kubectl autoscale deployment web-app --namespace=myapp --min=3 --max=20 --cpu-percent=70
kubectl logs -f deployment/web-app --namespace=myapp --all-containers
| Symptom | Cause | Fix |
|---------|-------|-----|
| Pods stuck in Pending | No nodes with enough resources | Check autoscaler; add larger node pool; verify resource requests |
| ImagePullBackOff | Wrong image path or missing AR access | Verify image URL; grant roles/artifactregistry.reader to node SA |
| Workload Identity wrong account | KSA annotation missing | Re-annotate KSA; restart pods to pick up new token |
| Nodes NotReady | Disk/memory pressure or network issue | Run kubectl describe node; check taints and conditions |
| Ingress returns 502 | Backend pods failing health check | Verify readiness probe; check NEG health in Console |
| Cluster create quota error | Insufficient regional CPU/IP quota | Request quota increase in IAM & Admin > Quotas |
| Network policy not working | Not enabled on cluster | Recreate with --enable-network-policy or use Dataplane V2 |
development
Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
testing
Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.
devops
Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.
testing
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.