googlecloudplatform/gke-reliability

Name: gke-reliability
Author: googlecloudplatform

Path: skills/gke-reliability/SKILL.md

Install: npx skillsauth add googlecloudplatform/gke-mcp gke-reliability

GKE Reliability Skill

This skill provides workflows for configuring your GKE cluster and workloads for high availability and reliability.

Workflows

1. Verify Cluster High Availability

Check if the cluster is regional or has multi-zonal node pools.

Command:

gcloud container clusters describe <cluster-name> --region <region> --format="json(location, locations)"

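If location is a region (e.g., us-central1), the control plane is regional. If location is a zone (e.g., us-central1-a), the cluster is zonal and its control plane runs in that single zone; in that case describe it with --zone <zone> (or --location, which accepts either) instead of --region. If locations has multiple entries, nodes are spread across multiple zones.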
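The interpretation above can be automated. The following is a minimal sketch, assuming the JSON printed by the describe command is passed in as a string. The function name and the sample payload are hypothetical, and the region-versus-zone test relies on the naming convention that a zone is a region name plus a letter suffix.

```python
import json

def classify_cluster(describe_json: str) -> dict:
    """Classify a cluster from `gcloud container clusters describe` JSON output."""
    data = json.loads(describe_json)
    location = data["location"]
    zones = data.get("locations", [])
    # Assumption: region names contain one hyphen (us-central1) and
    # zone names contain two (us-central1-a).
    regional_control_plane = location.count("-") == 1
    return {
        "regional_control_plane": regional_control_plane,
        "multi_zonal_nodes": len(set(zones)) > 1,
    }

# Hypothetical output, for illustration only.
sample = (
    '{"location": "us-central1", '
    '"locations": ["us-central1-a", "us-central1-b", "us-central1-c"]}'
)
print(classify_cluster(sample))
# {'regional_control_plane': True, 'multi_zonal_nodes': True}
```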

2. Configure Pod Disruption Budgets (PDB)

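PDBs limit how many pods of an application can be evicted at once during voluntary disruptions (like node upgrades or drains), so that a minimum number of pods stays available. The selector must match the labels of the pods the budget protects.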

Check existing PDBs:

kubectl get pdb -n <namespace>

Example Manifest:

apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: my-app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: my-app
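When choosing minAvailable, it helps to work out how many pods can actually be evicted. The sketch below is a simplified model of that arithmetic, with a hypothetical helper name: the number of allowed disruptions is the count of healthy pods minus minAvailable, never below zero.

```python
def allowed_disruptions(healthy_pods: int, min_available: int) -> int:
    """Pods that can be voluntarily evicted right now without violating minAvailable."""
    return max(0, healthy_pods - min_available)

# With minAvailable: 2 from the example manifest:
for replicas in (2, 3, 5):
    print(f"{replicas} healthy replicas -> {allowed_disruptions(replicas, 2)} evictions allowed")
```

Note the first case: with only 2 replicas and minAvailable: 2, no eviction is allowed, so a node drain cannot evict those pods until the budget permits it. Running at least one replica more than minAvailable avoids this.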

3. Configure Health Probes

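Ensure all production containers have readiness and liveness probes, and a startup probe where startup is slow.

Readiness Probe: Determines when a container is ready to accept traffic; a pod that fails it is removed from Service endpoints but is not restarted.
Liveness Probe: Detects a container that is running but no longer making progress (for example, deadlocked) so the kubelet can restart it.
Startup Probe: Disables liveness and readiness checks until the app has started up, which protects slow-starting containers from being restarted too early.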

Check workload probes:

kubectl get deployment <app-name> -n <namespace> -o yaml | grep -E "livenessProbe|readinessProbe|startupProbe"
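The grep above shows that a probe appears somewhere in the Deployment, but not which container defines it. A minimal sketch of a per-container check follows, assuming the Deployment has been loaded as a dict from `kubectl get deployment <app-name> -n <namespace> -o json`. The function name and the sample Deployment are hypothetical.

```python
PROBES = ("livenessProbe", "readinessProbe", "startupProbe")

def missing_probes(deployment: dict) -> dict:
    """Map each container name to the probe fields it does not define."""
    containers = deployment["spec"]["template"]["spec"]["containers"]
    return {c["name"]: [p for p in PROBES if p not in c] for c in containers}

# Hypothetical Deployment, trimmed to the fields this check reads.
sample = {
    "spec": {"template": {"spec": {"containers": [
        {
            "name": "web",
            "readinessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
            "livenessProbe": {"httpGet": {"path": "/healthz", "port": 8080}},
        },
        {"name": "sidecar"},
    ]}}}
}

for name, missing in missing_probes(sample).items():
    print(name, "is missing:", ", ".join(missing) or "nothing")
```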

4. Graceful Shutdown

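Ensure applications handle SIGTERM gracefully (stop accepting new requests, finish in-flight work, then exit) and set terminationGracePeriodSeconds long enough for that to complete (default is 30s). When the grace period expires, any container still running is killed with SIGKILL.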
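As an illustration of SIGTERM handling, here is a minimal sketch for a Python worker process. It is one possible pattern, not the skill's prescribed one, and the names serve, do_work, and cleanup are hypothetical. The handler only sets a flag; the main loop notices the flag, stops taking new work, and runs cleanup before returning.

```python
import signal
import threading

shutdown_requested = threading.Event()

def handle_sigterm(signum, frame):
    # Keep the handler tiny: record the request and let the main loop react.
    shutdown_requested.set()

# Must run in the main thread of the process.
signal.signal(signal.SIGTERM, handle_sigterm)

def serve(poll_seconds=0.1, do_work=lambda: None, cleanup=lambda: None):
    """Process work until SIGTERM arrives, then clean up and return."""
    while not shutdown_requested.is_set():
        do_work()
        # wait() returns early once the flag is set, so shutdown is prompt.
        shutdown_requested.wait(poll_seconds)
    cleanup()
```

Calling serve() from the container's entrypoint keeps the process running until the kubelet sends SIGTERM. The cleanup step has to finish within terminationGracePeriodSeconds, otherwise it is cut short by SIGKILL.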

5. Topology Spread Constraints

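Ensure pods are spread across zones or nodes to avoid correlated failures. maxSkew is the largest allowed difference between the number of matching pods in any one topology domain and the number in the domain with the fewest. With whenUnsatisfiable: DoNotSchedule, a pod that would exceed it stays Pending; ScheduleAnyway treats the constraint as a preference.

Example Manifest excerpt (pod spec, e.g. spec.template.spec of a Deployment):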

spec:
  topologySpreadConstraints:
    - maxSkew: 1
      topologyKey: topology.kubernetes.io/zone
      whenUnsatisfiable: DoNotSchedule # or ScheduleAnyway
      labelSelector:
        matchLabels:
          app: my-app
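To see what maxSkew: 1 permits, the check can be modeled in a few lines. This is a simplified sketch with a hypothetical function name: it assumes every eligible zone is a key in the input dict (with 0 for empty zones) and ignores other scheduler inputs such as node affinity or minDomains.

```python
def can_schedule(pods_per_zone: dict, zone: str, max_skew: int = 1) -> bool:
    """Would one more matching pod in `zone` stay within max_skew?

    Skew is the pod count in the target zone (including the new pod)
    minus the smallest existing count across all eligible zones.
    """
    global_min = min(pods_per_zone.values())
    return pods_per_zone[zone] + 1 - global_min <= max_skew

counts = {"zone-a": 2, "zone-b": 1, "zone-c": 1}
print(can_schedule(counts, "zone-a"))  # False: 3 - 1 = 2 exceeds maxSkew 1
print(can_schedule(counts, "zone-b"))  # True: 2 - 1 = 1 is within maxSkew 1
```

With DoNotSchedule, a pod that can only land in a zone where the check fails stays Pending, which is why a zone outage combined with a strict constraint can block scale-up.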

6. Maintenance Windows and Exclusions

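Configure when GKE can perform automated upgrades to avoid peak hours. The start, end, and recurrence flags must be set together: start and end are RFC 3339 timestamps whose difference sets the window length, and recurrence is an RFC 5545 RRULE.

Command to set maintenance window:

gcloud container clusters update <cluster-name> \
    --region <region> \
    --maintenance-window-start <start-time> \
    --maintenance-window-end <end-time> \
    --maintenance-window-recurrence "FREQ=DAILY"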
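The timestamps for a recurring window can be generated rather than typed by hand. The following sketch builds the argument list for the gcloud command, which expects the start, end, and recurrence flags together. The helper name, the cluster name, and the 4-hour length are illustrative assumptions, not values from this skill.

```python
from datetime import datetime, timedelta, timezone

def maintenance_window_args(cluster, region, start, hours, recurrence="FREQ=DAILY"):
    """Build a gcloud argv for a recurring maintenance window.

    `start` must be a timezone-aware datetime; the window end is start + hours.
    """
    fmt = "%Y-%m-%dT%H:%M:%SZ"  # RFC 3339 timestamp in UTC
    start_utc = start.astimezone(timezone.utc)
    end_utc = start_utc + timedelta(hours=hours)
    return [
        "gcloud", "container", "clusters", "update", cluster,
        "--region", region,
        "--maintenance-window-start", start_utc.strftime(fmt),
        "--maintenance-window-end", end_utc.strftime(fmt),
        "--maintenance-window-recurrence", recurrence,
    ]

# Hypothetical values: a daily window starting at 02:00 UTC.
args = maintenance_window_args(
    "my-cluster", "us-central1",
    datetime(2026, 1, 1, 2, 0, tzinfo=timezone.utc), hours=4,
)
print(" ".join(args))
```

The list can be passed to subprocess.run as-is, which avoids shell quoting issues with the recurrence rule.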

Best Practices

Regional Clusters: Always use regional clusters for production workloads to survive zone failures.
Probes for All Containers: Every container in a production pod should have at least a readiness probe.
PDBs for Critical Apps: Use PDBs to keep a minimum number of replicas running while nodes are drained during automated node upgrades.
Zone Spreading: Always use topologySpreadConstraints to ensure pods are distributed across zones, even in regional clusters.
Schedule Maintenance: Set maintenance windows to ensure upgrades happen during low-traffic periods.

Workflows for ensuring high availability and reliability of GKE workloads.

141 stars

data-ai

Updated Apr 18, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add googlecloudplatform/gke-mcp gke-reliability

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean, 95%
Semgrep: Static code analysis for vulnerabilities. Clean, 95%
mcp-scan (Snyk): Model Context Protocol security validation. Clean, 95%
Snyk (dep): Open source security scanning. Skipped, 50%
Socket.dev: Supply chain security analysis. Skipped, 50%
VirusTotal: Multi-engine malware detection. Skipped, 50%
CrowdStrike: Advanced threat intelligence. Skipped, 50%
OSV-Scanner: Open Source Vulnerability database check. Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 18, 2026, 7:36 AM (16.7s, 1 file scanned)

Install manually

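For Claude Code (global install):

# Clone the repo
git clone https://github.com/googlecloudplatform/gke-mcp.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into the global Claude Code skills folder
cp -r gke-mcp/skills/gke-reliability ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.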

Repository: googlecloudplatform/gke-mcp (141 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT