skills/gke-observability/SKILL.md
Workflows for setting up and auditing observability (logging, monitoring, tracing) on GKE.
This skill provides workflows for ensuring your GKE cluster and workloads have adequate observability for production use.
Check if Cloud Logging and Cloud Monitoring are enabled on the cluster.
Command:
gcloud container clusters describe <cluster-name> --region <region> --project <project-id> --format="json(loggingConfig, monitoringConfig)"
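The command above returns the loggingConfig and monitoringConfig blocks; check that their componentConfig.enableComponents lists include the components you expect (for example SYSTEM_COMPONENTS and WORKLOADS for logging). If you query the legacy loggingService and monitoringService fields instead, look for them to be set to something other than none (usually logging.googleapis.com/kubernetes and monitoring.googleapis.com/kubernetes).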
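This check can be wrapped in a small script. The following is a minimal sketch, not an official tool: the function names `is_service_enabled` and `audit_cluster_telemetry` are hypothetical, and it assumes `gcloud` is installed and authenticated and that the legacy `loggingService` and `monitoringService` fields are populated for your cluster.

```shell
# Hypothetical helper: succeeds when a service field is set to anything
# other than "none" (an empty value is also treated as disabled).
is_service_enabled() {
  case "$1" in
    ""|none) return 1 ;;
    *) return 0 ;;
  esac
}

# Hypothetical helper: prints one status line per service field.
# Arguments: <cluster-name> <region> <project-id>
audit_cluster_telemetry() {
  for field in loggingService monitoringService; do
    value=$(gcloud container clusters describe "$1" \
      --region "$2" --project "$3" --format="value($field)")
    if is_service_enabled "$value"; then
      echo "$field: enabled ($value)"
    else
      echo "$field: disabled"
    fi
  done
}
```

Calling `audit_cluster_telemetry <cluster-name> <region> <project-id>` prints two lines, one for logging and one for monitoring.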
Google Cloud Managed Service for Prometheus is the recommended way to collect metrics from your applications.
Command to enable:
gcloud container clusters update <cluster-name> \
--enable-managed-prometheus \
--region <region>
Verify installation:
kubectl get pods -n gmp-system
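Reading the pod list by eye works for a one-off check; for an audit it may help to turn it into a pass/fail result. The sketch below is one way to do that, assuming the default `kubectl get pods --no-headers` column layout (NAME, READY, STATUS, ...). The function names `count_unhealthy_pods` and `gmp_pods_healthy` are hypothetical.

```shell
# Hypothetical helper: reads `kubectl get pods --no-headers` output on stdin
# and prints how many pods have a STATUS other than Running or Completed.
count_unhealthy_pods() {
  awk '$3 != "Running" && $3 != "Completed" { n++ } END { print n + 0 }'
}

# Hypothetical helper: succeeds only if gmp-system contains at least one pod
# and none of its pods are in an unhealthy status.
gmp_pods_healthy() {
  pods=$(kubectl get pods -n gmp-system --no-headers 2>/dev/null)
  # An empty namespace means the managed collection components are absent.
  [ -n "$pods" ] || return 1
  [ "$(printf '%s\n' "$pods" | count_unhealthy_pods)" -eq 0 ]
}
```

For example, `gmp_pods_healthy && echo "managed Prometheus looks healthy"` prints the message only when every pod is Running or Completed.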
Ensure your workloads log to standard output or standard error; Cloud Logging collects both streams automatically.
Check workload logs:
kubectl logs <pod-name> -n <namespace>
Where possible, emit logs in a structured format such as JSON so that individual fields can be queried.
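To illustrate what a structured log line looks like, here is a minimal sketch of a shell helper that writes one JSON object per line to standard output. The function name `log_json` is hypothetical. The sketch relies on Cloud Logging treating the `severity` and `message` keys of a JSON line specially, and it only escapes backslashes and double quotes, so messages containing newlines or control characters would need a proper JSON encoder. A real application would normally use its language's logging library instead.

```shell
# Hypothetical helper: emits a single-line JSON log entry on stdout.
# Arguments: <severity> <message words...>
log_json() {
  severity="$1"
  shift
  # Escape backslashes first, then double quotes, so the message stays valid JSON.
  message=$(printf '%s' "$*" | sed -e 's/\\/\\\\/g' -e 's/"/\\"/g')
  printf '{"severity":"%s","message":"%s"}\n' "$severity" "$message"
}
```

For example, `log_json ERROR "connection refused"` writes `{"severity":"ERROR","message":"connection refused"}`, which can then be matched with a `severity>=ERROR` filter.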
Create dashboards in Cloud Monitoring for key metrics.
Set up alerting policies for critical thresholds.
Enable distributed tracing to track requests across microservices.
Use continuous profiling to analyze application performance in production with low overhead.
Use Logging Query Language (LQL) in Cloud Logging to find specific logs.
Example LQL Queries:
resource.type="k8s_container"
resource.labels.container_name="my-app"
severity>=ERROR

resource.type="k8s_container"
textPayload:"connection refused"
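The same filters can be run from a terminal with `gcloud logging read`. The following is a rough sketch under a few assumptions: the function names are hypothetical, the one-hour window and 20-entry limit are arbitrary illustration values, and the filter joins its clauses with an explicit `AND` so that it fits on one line.

```shell
# Hypothetical helper: builds the container error filter for a given container name.
container_error_filter() {
  printf 'resource.type="k8s_container" AND resource.labels.container_name="%s" AND severity>=ERROR' "$1"
}

# Hypothetical helper: reads recent matching entries from Cloud Logging.
# Arguments: <container-name> <project-id>
read_container_errors() {
  gcloud logging read "$(container_error_filter "$1")" \
    --project "$2" --freshness 1h --limit 20
}
```

For example, `read_container_errors my-app <project-id>` lists up to 20 error entries from the last hour for the `my-app` container.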
For Standard clusters, you can enable collection of metrics from the Kubernetes API server, scheduler, and controller manager.
Command:
gcloud container clusters update <cluster-name> \
--monitoring=SYSTEM,API_SERVER,SCHEDULER,CONTROLLER_MANAGER \
--region <region>
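After the update, you may want to confirm that the control plane components are actually listed in the cluster's monitoring configuration. The sketch below checks a component list for the expected entries. It assumes that `gcloud container clusters describe ... --format="value(monitoringConfig.componentConfig.enableComponents)"` prints a semicolon-separated list and that the API reports the names as `APISERVER`, `SCHEDULER`, and `CONTROLLER_MANAGER`; confirm both against your own output. The function names are hypothetical.

```shell
# Hypothetical helper: succeeds if the semicolon-separated list in $1
# contains the exact component name in $2.
has_component() {
  case ";$1;" in
    *";$2;"*) return 0 ;;
    *) return 1 ;;
  esac
}

# Hypothetical helper: prints each expected control plane component
# that is missing from the list.
# Argument: the semicolon-separated component list.
missing_control_plane_components() {
  for component in APISERVER SCHEDULER CONTROLLER_MANAGER; do
    has_component "$1" "$component" || echo "$component"
  done
}
```

An empty result from `missing_control_plane_components "<component-list>"` means all three control plane components are enabled.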
If using GKE Dataplane V2, you can enable advanced L4 observability.
Command:
gcloud container clusters update <cluster-name> \
--enable-dataplane-v2-metrics \
--enable-dataplane-v2-flow-observability \
--region <region>
This allows you to observe traffic flows and network metrics.