Google Kubernetes Engine (GKE)

Deploy, operate, and scale managed Kubernetes clusters on Google Cloud Platform.

When to Use

Running containerized microservices at scale with automatic scaling and healing
Workloads requiring fine-grained orchestration, service mesh, or custom scheduling
Teams already invested in Kubernetes tooling (Helm, Argo CD, Flux)
When Cloud Run's request-based model does not fit (long-running, stateful workloads)

Prerequisites

Google Cloud SDK (gcloud) and kubectl installed
APIs enabled: Kubernetes Engine, Compute Engine
IAM role roles/container.admin for cluster management

gcloud services enable container.googleapis.com compute.googleapis.com
gcloud components install kubectl

Standard vs Autopilot

| Feature | Standard | Autopilot | |---------|----------|-----------| | Node management | You manage node pools | Google manages nodes | | Pricing | Pay per node (VM) | Pay per pod resource request | | GPU/TPU | Full support | Supported (with limits) | | DaemonSets | Allowed | Restricted | | Best for | Full control, specialized HW | Hands-off, cost-optimized |

Create a Standard Cluster

gcloud container clusters create prod-cluster \
  --region=us-central1 --num-nodes=2 \
  --machine-type=e2-standard-4 --disk-size=100 \
  --enable-autoscaling --min-nodes=1 --max-nodes=5 \
  --enable-autorepair --enable-autoupgrade \
  --release-channel=regular \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --enable-ip-alias --enable-network-policy \
  --enable-shielded-nodes \
  --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM,WORKLOAD \
  --labels=env=production,team=platform

gcloud container clusters get-credentials prod-cluster --region=us-central1

Create an Autopilot Cluster

gcloud container clusters create-auto autopilot-prod \
  --region=us-central1 --release-channel=regular \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --network=my-vpc --subnetwork=gke-subnet

Node Pools

# High-memory pool with taint
gcloud container node-pools create highmem-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=n2-highmem-8 --disk-size=200 --disk-type=pd-ssd \
  --num-nodes=1 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
  --node-labels=workload=memory-intensive \
  --node-taints=dedicated=highmem:NoSchedule

# GPU pool
gcloud container node-pools create gpu-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
  --node-taints=nvidia.com/gpu=present:NoSchedule

# Spot pool for batch workloads
gcloud container node-pools create spot-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=e2-standard-4 --spot \
  --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=20 \
  --node-taints=cloud.google.com/gke-spot=true:NoSchedule

Workload Identity

# Create GSA and grant permissions
gcloud iam service-accounts create app-gsa
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:app-gsa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Create KSA and bind to GSA
kubectl create namespace myapp
kubectl create serviceaccount app-ksa --namespace=myapp
gcloud iam service-accounts add-iam-policy-binding \
  app-gsa@${PROJECT_ID}.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:${PROJECT_ID}.svc.id.goog[myapp/app-ksa]"
kubectl annotate serviceaccount app-ksa --namespace=myapp \
  iam.gke.io/gcp-service-account=app-gsa@${PROJECT_ID}.iam.gserviceaccount.com

Deploying Workloads

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: myapp
spec:
  replicas: 3
  selector:
    matchLabels: { app: web-app }
  template:
    metadata:
      labels: { app: web-app }
    spec:
      serviceAccountName: app-ksa
      containers:
      - name: web
        image: us-central1-docker.pkg.dev/PROJECT_ID/repo/web-app:v1.2.0
        ports: [{ containerPort: 8080 }]
        resources:
          requests: { cpu: 250m, memory: 512Mi }
          limits: { cpu: 500m, memory: 1Gi }
        readinessProbe:
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 5
        livenessProbe:
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 15
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: { app: web-app }
---
apiVersion: v1
kind: Service
metadata: { name: web-app, namespace: myapp }
spec:
  selector: { app: web-app }
  ports: [{ port: 80, targetPort: 8080 }]
  type: ClusterIP

Ingress with Managed SSL

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  namespace: myapp
  annotations:
    kubernetes.io/ingress.class: "gce"
    networking.gke.io/managed-certificates: "web-cert"
    kubernetes.io/ingress.global-static-ip-name: "web-static-ip"
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service: { name: web-app, port: { number: 80 } }
---
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata: { name: web-cert, namespace: myapp }
spec:
  domains: [app.example.com]

gcloud compute addresses create web-static-ip --global

Terraform Configuration

resource "google_container_cluster" "primary" {
  name     = "prod-cluster"
  location = "us-central1"

  release_channel { channel = "REGULAR" }
  workload_identity_config { workload_pool = "${var.project_id}.svc.id.goog" }

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.gke.name

  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  private_cluster_config {
    enable_private_nodes   = true
    master_ipv4_cidr_block = "172.16.0.0/28"
  }

  network_policy { enabled = true }
  logging_config { enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"] }
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
    managed_prometheus { enabled = true }
  }

  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "primary" {
  name     = "primary-pool"
  cluster  = google_container_cluster.primary.name
  location = "us-central1"

  initial_node_count = 2
  autoscaling { min_node_count = 1; max_node_count = 5 }
  management  { auto_repair = true; auto_upgrade = true }

  node_config {
    machine_type = "e2-standard-4"
    disk_size_gb = 100
    disk_type    = "pd-balanced"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
    metadata = { disable-legacy-endpoints = "true" }
  }
}

resource "google_compute_subnetwork" "gke" {
  name          = "gke-subnet"
  ip_cidr_range = "10.0.0.0/20"
  region        = "us-central1"
  network       = google_compute_network.vpc.id

  secondary_ip_range { range_name = "pods";     ip_cidr_range = "10.4.0.0/14" }
  secondary_ip_range { range_name = "services"; ip_cidr_range = "10.8.0.0/20" }
}

Common Operations

gcloud container clusters list
gcloud container clusters upgrade prod-cluster --region=us-central1 --master
kubectl top nodes && kubectl top pods --namespace=myapp
kubectl scale deployment web-app --replicas=5 --namespace=myapp
kubectl autoscale deployment web-app --namespace=myapp --min=3 --max=20 --cpu-percent=70
kubectl logs -f deployment/web-app --namespace=myapp --all-containers

Troubleshooting

| Symptom | Cause | Fix | |---------|-------|-----| | Pods stuck in Pending | No nodes with enough resources | Check autoscaler; add larger node pool; verify resource requests | | ImagePullBackOff | Wrong image path or missing AR access | Verify image URL; grant roles/artifactregistry.reader to node SA | | Workload Identity wrong account | KSA annotation missing | Re-annotate KSA; restart pods to pick up new token | | Nodes NotReady | Disk/memory pressure or network issue | Run kubectl describe node; check taints and conditions | | Ingress returns 502 | Backend pods failing health check | Verify readiness probe; check NEG health in Console | | Cluster create quota error | Insufficient regional CPU/IP quota | Request quota increase in IAM & Admin > Quotas | | Network policy not working | Not enabled on cluster | Recreate with --enable-network-policy or use Dataplane V2 |

Related Skills

gcp-networking - VPC, firewall rules, and load balancers for GKE clusters
terraform-gcp - Provision GKE clusters with Infrastructure as Code
gcp-compute - When workloads are better suited for VMs than containers
gcp-cloud-sql - Connecting GKE pods to Cloud SQL via sidecar proxy

Google Kubernetes Engine (GKE)

Deploy, operate, and scale managed Kubernetes clusters on Google Cloud Platform.

When to Use

Running containerized microservices at scale with automatic scaling and healing
Workloads requiring fine-grained orchestration, service mesh, or custom scheduling
Teams already invested in Kubernetes tooling (Helm, Argo CD, Flux)
When Cloud Run's request-based model does not fit (long-running, stateful workloads)

Prerequisites

Google Cloud SDK (gcloud) and kubectl installed
APIs enabled: Kubernetes Engine, Compute Engine
IAM role roles/container.admin for cluster management

gcloud services enable container.googleapis.com compute.googleapis.com
gcloud components install kubectl

Standard vs Autopilot

Create a Standard Cluster

gcloud container clusters create prod-cluster \
  --region=us-central1 --num-nodes=2 \
  --machine-type=e2-standard-4 --disk-size=100 \
  --enable-autoscaling --min-nodes=1 --max-nodes=5 \
  --enable-autorepair --enable-autoupgrade \
  --release-channel=regular \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --enable-ip-alias --enable-network-policy \
  --enable-shielded-nodes \
  --logging=SYSTEM,WORKLOAD --monitoring=SYSTEM,WORKLOAD \
  --labels=env=production,team=platform

gcloud container clusters get-credentials prod-cluster --region=us-central1

Create an Autopilot Cluster

gcloud container clusters create-auto autopilot-prod \
  --region=us-central1 --release-channel=regular \
  --workload-pool=${PROJECT_ID}.svc.id.goog \
  --network=my-vpc --subnetwork=gke-subnet

Node Pools

# High-memory pool with taint
gcloud container node-pools create highmem-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=n2-highmem-8 --disk-size=200 --disk-type=pd-ssd \
  --num-nodes=1 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
  --node-labels=workload=memory-intensive \
  --node-taints=dedicated=highmem:NoSchedule

# GPU pool
gcloud container node-pools create gpu-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=n1-standard-8 \
  --accelerator=type=nvidia-tesla-t4,count=1 \
  --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=4 \
  --node-taints=nvidia.com/gpu=present:NoSchedule

# Spot pool for batch workloads
gcloud container node-pools create spot-pool \
  --cluster=prod-cluster --region=us-central1 \
  --machine-type=e2-standard-4 --spot \
  --num-nodes=0 --enable-autoscaling --min-nodes=0 --max-nodes=20 \
  --node-taints=cloud.google.com/gke-spot=true:NoSchedule

Workload Identity

# Create GSA and grant permissions
gcloud iam service-accounts create app-gsa
gcloud projects add-iam-policy-binding ${PROJECT_ID} \
  --member="serviceAccount:app-gsa@${PROJECT_ID}.iam.gserviceaccount.com" \
  --role="roles/storage.objectViewer"

# Create KSA and bind to GSA
kubectl create namespace myapp
kubectl create serviceaccount app-ksa --namespace=myapp
gcloud iam service-accounts add-iam-policy-binding \
  app-gsa@${PROJECT_ID}.iam.gserviceaccount.com \
  --role=roles/iam.workloadIdentityUser \
  --member="serviceAccount:${PROJECT_ID}.svc.id.goog[myapp/app-ksa]"
kubectl annotate serviceaccount app-ksa --namespace=myapp \
  iam.gke.io/gcp-service-account=app-gsa@${PROJECT_ID}.iam.gserviceaccount.com

Deploying Workloads

# deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: web-app
  namespace: myapp
spec:
  replicas: 3
  selector:
    matchLabels: { app: web-app }
  template:
    metadata:
      labels: { app: web-app }
    spec:
      serviceAccountName: app-ksa
      containers:
      - name: web
        image: us-central1-docker.pkg.dev/PROJECT_ID/repo/web-app:v1.2.0
        ports: [{ containerPort: 8080 }]
        resources:
          requests: { cpu: 250m, memory: 512Mi }
          limits: { cpu: 500m, memory: 1Gi }
        readinessProbe:
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 5
        livenessProbe:
          httpGet: { path: /healthz, port: 8080 }
          initialDelaySeconds: 15
      topologySpreadConstraints:
      - maxSkew: 1
        topologyKey: topology.kubernetes.io/zone
        whenUnsatisfiable: DoNotSchedule
        labelSelector:
          matchLabels: { app: web-app }
---
apiVersion: v1
kind: Service
metadata: { name: web-app, namespace: myapp }
spec:
  selector: { app: web-app }
  ports: [{ port: 80, targetPort: 8080 }]
  type: ClusterIP

Ingress with Managed SSL

apiVersion: networking.k8s.io/v1
kind: Ingress
metadata:
  name: web-ingress
  namespace: myapp
  annotations:
    kubernetes.io/ingress.class: "gce"
    networking.gke.io/managed-certificates: "web-cert"
    kubernetes.io/ingress.global-static-ip-name: "web-static-ip"
spec:
  rules:
  - host: app.example.com
    http:
      paths:
      - path: /
        pathType: Prefix
        backend:
          service: { name: web-app, port: { number: 80 } }
---
apiVersion: networking.gke.io/v1
kind: ManagedCertificate
metadata: { name: web-cert, namespace: myapp }
spec:
  domains: [app.example.com]

gcloud compute addresses create web-static-ip --global

Terraform Configuration

resource "google_container_cluster" "primary" {
  name     = "prod-cluster"
  location = "us-central1"

  release_channel { channel = "REGULAR" }
  workload_identity_config { workload_pool = "${var.project_id}.svc.id.goog" }

  network    = google_compute_network.vpc.name
  subnetwork = google_compute_subnetwork.gke.name

  ip_allocation_policy {
    cluster_secondary_range_name  = "pods"
    services_secondary_range_name = "services"
  }

  private_cluster_config {
    enable_private_nodes   = true
    master_ipv4_cidr_block = "172.16.0.0/28"
  }

  network_policy { enabled = true }
  logging_config { enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"] }
  monitoring_config {
    enable_components = ["SYSTEM_COMPONENTS", "WORKLOADS"]
    managed_prometheus { enabled = true }
  }

  remove_default_node_pool = true
  initial_node_count       = 1
}

resource "google_container_node_pool" "primary" {
  name     = "primary-pool"
  cluster  = google_container_cluster.primary.name
  location = "us-central1"

  initial_node_count = 2
  autoscaling { min_node_count = 1; max_node_count = 5 }
  management  { auto_repair = true; auto_upgrade = true }

  node_config {
    machine_type = "e2-standard-4"
    disk_size_gb = 100
    disk_type    = "pd-balanced"
    oauth_scopes = ["https://www.googleapis.com/auth/cloud-platform"]
    shielded_instance_config {
      enable_secure_boot          = true
      enable_integrity_monitoring = true
    }
    metadata = { disable-legacy-endpoints = "true" }
  }
}

resource "google_compute_subnetwork" "gke" {
  name          = "gke-subnet"
  ip_cidr_range = "10.0.0.0/20"
  region        = "us-central1"
  network       = google_compute_network.vpc.id

  secondary_ip_range { range_name = "pods";     ip_cidr_range = "10.4.0.0/14" }
  secondary_ip_range { range_name = "services"; ip_cidr_range = "10.8.0.0/20" }
}

Common Operations

gcloud container clusters list
gcloud container clusters upgrade prod-cluster --region=us-central1 --master
kubectl top nodes && kubectl top pods --namespace=myapp
kubectl scale deployment web-app --replicas=5 --namespace=myapp
kubectl autoscale deployment web-app --namespace=myapp --min=3 --max=20 --cpu-percent=70
kubectl logs -f deployment/web-app --namespace=myapp --all-containers

Troubleshooting

Related Skills

gcp-networking - VPC, firewall rules, and load balancers for GKE clusters
terraform-gcp - Provision GKE clusters with Infrastructure as Code
gcp-compute - When workloads are better suited for VMs than containers
gcp-cloud-sql - Connecting GKE pods to Cloud SQL via sidecar proxy

Adoption

bagelhole/gcp-gke

$ install --global

Security Scan Results

SKILL.md

Google Kubernetes Engine (GKE)

When to Use

Prerequisites

Standard vs Autopilot

Create a Standard Cluster

Create an Autopilot Cluster

Node Pools

Workload Identity

Deploying Workloads

Ingress with Managed SSL

Terraform Configuration

Common Operations

Troubleshooting

Related Skills

Related Skills

bagelhole/sre-dashboards

bagelhole/openclaw-security-hardening

bagelhole/vector-database-ops

bagelhole/model-serving-kubernetes

bagelhole/gcp-gke

$ install --global

Security Scan Results

SKILL.md

Google Kubernetes Engine (GKE)

When to Use

Prerequisites

Standard vs Autopilot

Create a Standard Cluster

Create an Autopilot Cluster

Node Pools

Workload Identity

Deploying Workloads

Ingress with Managed SSL

Terraform Configuration

Common Operations

Troubleshooting

Related Skills

Related Skills

bagelhole/sre-dashboards

bagelhole/openclaw-security-hardening

bagelhole/vector-database-ops

bagelhole/model-serving-kubernetes