skills/gke-ai-troubleshooting-skill-creation-guide/SKILL.md
Expert instructions for building high-quality GKE troubleshooting skills. Codifies Step 0 context rules, zero-hallucination signatures, and explicit LQL/PromQL query requirements.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
`npx skillsauth add googlecloudplatform/gke-mcp gke-ai-troubleshooting-skill-creation-guide`
Use this guide to build high-quality troubleshooting skills that enable AI agents to diagnose complex failures in GKE workloads.
Each skill contains the following files:

- `SKILL.md`: The core diagnostic and resolution workflow.
- `README.md`: Public-facing overview and "When to use" guide.
- `references/failure_signatures.md`: Authentic log/metric signatures.
- `scripts/validate_queries.sh`: Automatic syntax validator for all queries.
- `TEST.md`: Manual verification plan for humans.
- `EVAL.yaml`: Evaluation suite for performance tracking.
- `BUILD`: Build definition.

Use kebab-case names (e.g., `gke-ai-troubleshooting-tpu-vbar-oom`).

Every skill MUST begin with a "Step 0" to acquire necessary context.
- `<project_id>`, `<location>`, `<cluster_name>`, `<timestamp>`.
- `<node_name>`, `<workload_name>`, `<workload_namespace>`, `<nodepool_name>`.
- Time window: `[T - 30m]` to `[T + 30m]`.
- Use angle brackets (e.g., `<project_id>`) instead of curly braces for placeholders to avoid template resolution errors.
- Include a validation script (`scripts/validate_queries.sh`) that uses `query_logs` or `gcloud logging read ... --limit=1` to verify the skill's LQL queries.
- Reference `references/failure_signatures.md` in relevant diagnostic steps.
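The Step 0 rules above can be sketched in a few lines. This is only an illustration under stated assumptions, not the guide's own validator (which is the shell script `scripts/validate_queries.sh`): the helper names, the sample query template, and the sample context values are all hypothetical, and `T` is assumed to be the `<timestamp>` collected in Step 0. The sketch fills angle-bracket placeholders, computes the `[T - 30m]` to `[T + 30m]` window, and builds (without running) a `gcloud logging read ... --limit=1` command.

```python
import re
import shlex
from datetime import datetime, timedelta, timezone

# Matches angle-bracket placeholders such as <project_id>.
PLACEHOLDER = re.compile(r"<([a-z_]+)>")


def fill_placeholders(template, context):
    """Replace every <name> token with context[name]; raise if a value is missing."""
    def substitute(match):
        name = match.group(1)
        if name not in context:
            raise KeyError(f"no value supplied for placeholder <{name}>")
        return str(context[name])
    return PLACEHOLDER.sub(substitute, template)


def search_window(t, minutes=30):
    """Return (T - minutes, T + minutes) around the incident time T."""
    delta = timedelta(minutes=minutes)
    return t - delta, t + delta


def rfc3339(t):
    """Format a timezone-aware datetime as a UTC RFC 3339 timestamp."""
    return t.astimezone(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")


def windowed_filter(lql, t):
    """Restrict an LQL filter to the window around T."""
    start, end = search_window(t)
    return (
        f'({lql}) AND timestamp >= "{rfc3339(start)}" '
        f'AND timestamp <= "{rfc3339(end)}"'
    )


def gcloud_check_command(lql, project_id):
    """Build (but do not run) a one-result gcloud query for a quick check."""
    return ["gcloud", "logging", "read", lql,
            f"--project={project_id}", "--limit=1"]


if __name__ == "__main__":
    # Hypothetical Step 0 values and query template, for illustration only.
    context = {
        "project_id": "my-project",
        "cluster_name": "my-cluster",
        "node_name": "my-node",
    }
    template = (
        'resource.type="k8s_node" '
        'AND resource.labels.cluster_name="<cluster_name>" '
        'AND resource.labels.node_name="<node_name>"'
    )
    incident = datetime(2024, 5, 1, 12, 0, tzinfo=timezone.utc)
    lql = windowed_filter(fill_placeholders(template, context), incident)
    print(shlex.join(gcloud_check_command(lql, context["project_id"])))
```

Failing on a missing placeholder value, rather than leaving `<name>` in the query, keeps an incomplete Step 0 from silently producing a filter that matches nothing.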