skills/gke-inference-quickstart/SKILL.md
Deploy optimized AI/ML inference workloads on GKE using Google's Inference Quickstart (GIQ). Covers model discovery, manifest generation, and deployment using native MCP tools and CLI.
npx skillsauth add googlecloudplatform/gke-mcp gke-inference-quickstart
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill guides the deployment of AI/ML inference workloads on GKE using GIQ. It uses the gcloud container ai profiles manifests create command to generate Kubernetes manifests optimized according to Google's best practices and benchmarks.
Prerequisite: gcloud CLI installed and authenticated (for discovery commands).
Before generating a manifest, you often need to pick a valid combination of Model, Model Server, and Accelerator.
List all supported models:
gcloud container ai profiles models list
Find valid accelerators and servers for a specific model:
# Replace <MODEL_NAME> with a model from the list above (e.g., 'gemma-2-9b-it')
gcloud container ai profiles list --model=<MODEL_NAME>
View benchmarks/profiles (optional): To see costs and latency targets:
gcloud container ai profiles list --model=<MODEL_NAME>
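The discovery steps above can be chained in a small wrapper. This is a minimal sketch, assuming a POSIX shell. The names require_cmd, run_or_print, and giq_discover and the DRY_RUN variable are hypothetical helpers and not part of gcloud. Only the two gcloud commands shown above are used.

```shell
# Hypothetical helper: fail early if a required CLI is missing from PATH.
require_cmd() {
  command -v "$1" >/dev/null 2>&1 || {
    echo "missing required command: $1" >&2
    return 1
  }
}

# Hypothetical helper: run a command, or only print it when DRY_RUN=1.
run_or_print() {
  if [ "${DRY_RUN:-0}" = "1" ]; then
    echo "$*"
  else
    "$@"
  fi
}

# Hypothetical wrapper around the discovery commands from this section.
# With no argument it lists supported models; with a model name it lists
# the profiles available for that model.
giq_discover() {
  if [ "${DRY_RUN:-0}" != "1" ]; then
    require_cmd gcloud || return 1
  fi
  if [ -z "${1:-}" ]; then
    run_or_print gcloud container ai profiles models list
  else
    run_or_print gcloud container ai profiles list "--model=$1"
  fi
}
```

Calling DRY_RUN=1 giq_discover gemma-2-9b-it prints the command instead of running it, which is useful for checking the arguments before anything is sent to gcloud.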
Use the gcloud container ai profiles manifests create command. This ensures you are using the latest supported flags and options directly from the CLI.
Parameters:
--model: The model ID (e.g., gemma-2-9b-it).
--model-server: The inference server (e.g., vllm, tgi, triton, tensorrt-llm).
--accelerator-type: The accelerator type (e.g., nvidia-l4, nvidia-tesla-a100).
--target-ntpot-milliseconds: (Optional) Target Normalized Time Per Output Token in ms.
Example Command:
gcloud container ai profiles manifests create \
--model=gemma-2-9b-it \
--model-server=vllm \
--accelerator-type=nvidia-l4 \
--target-ntpot-milliseconds=50 > inference-workload.yaml
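To reuse the example command with other models or accelerators, the flags can be assembled from arguments. The following is a sketch only, assuming a POSIX shell. The giq_manifest function and the DRY_RUN variable are hypothetical and not part of gcloud. The flags are exactly the four listed under Parameters.

```shell
# Hypothetical wrapper: build the manifest command from arguments.
# Usage: giq_manifest MODEL SERVER ACCELERATOR [NTPOT_MS] [OUTPUT_FILE]
giq_manifest() {
  if [ "$#" -lt 3 ]; then
    echo "usage: giq_manifest MODEL SERVER ACCELERATOR [NTPOT_MS] [OUTPUT_FILE]" >&2
    return 2
  fi
  # POSIX sh has no 'local', so these names are prefixed to avoid clashes.
  giq_ntpot="${4:-}"
  giq_out="${5:-inference-workload.yaml}"

  # Replace the positional parameters with the command being built.
  set -- gcloud container ai profiles manifests create \
    "--model=$1" "--model-server=$2" "--accelerator-type=$3"

  # The latency target is optional, so add the flag only when it is given.
  if [ -n "$giq_ntpot" ]; then
    set -- "$@" "--target-ntpot-milliseconds=$giq_ntpot"
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Preview only: nothing is executed and no file is written.
    echo "$* > $giq_out"
  else
    "$@" > "$giq_out"
  fi
}
```

Running DRY_RUN=1 giq_manifest gemma-2-9b-it vllm nvidia-l4 50 prints the same command as the example above without calling gcloud.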
The previous step saves the manifest to inference-workload.yaml. Ensure you have this file.
Apply the manifest:
kubectl apply -f inference-workload.yaml
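Because the manifest is produced by a shell redirect, a failed gcloud call can leave an empty file behind. The sketch below guards against that before applying. It assumes a POSIX shell and kubectl configured for the target cluster. The apply_manifest function is a hypothetical helper, while kubectl apply and its --dry-run=client flag are standard kubectl.

```shell
# Hypothetical helper: refuse to apply a missing or empty manifest.
apply_manifest() {
  manifest="${1:-inference-workload.yaml}"

  # -s is true only if the file exists and has a size greater than zero.
  if [ ! -s "$manifest" ]; then
    echo "manifest not found or empty: $manifest" >&2
    return 1
  fi

  # A client-side dry run reports what would be applied without
  # persisting any change to the cluster.
  kubectl apply --dry-run=client -f "$manifest" || return 1

  # Apply for real only after the dry run succeeded.
  kubectl apply -f "$manifest"
}
```

Calling apply_manifest with no argument uses inference-workload.yaml, the file name from the example command.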
NVIDIA_L4_GPUS).