
googlecloudplatform/gke-inference-quickstart

Name: gke-inference-quickstart
Author: googlecloudplatform

skills/gke-inference-quickstart/SKILL.md

npx skillsauth add googlecloudplatform/gke-mcp gke-inference-quickstart


GKE Inference Quickstart (GIQ)

Purpose

This skill guides the deployment of AI/ML inference workloads on GKE using GIQ. It uses the gcloud container ai profiles manifests create command to generate optimized Kubernetes manifests based on Google's best practices and benchmarks.

When to Use

Goal: Deploy an AI model (e.g., Llama, Gemma, Mistral) to GKE.
Goal: Generate a Kubernetes manifest for inference.
Context: User asks about "GIQ", "Inference Quickstart", or "AI benchmarks" on GKE.

Prerequisites

A GKE cluster (preferably with GPU/TPU node pools, though GIQ can help identify requirements).
gcloud CLI installed and authenticated (for discovery commands).

Workflow

1. Discovery: Find Models and Hardware

Before generating a manifest, you often need to pick a valid combination of Model, Model Server, and Accelerator.

List all supported models:

gcloud container ai profiles models list

Find valid accelerators and servers for a specific model:

# Replace <MODEL_NAME> with a model from the list above (e.g., 'gemma-2-9b-it')
gcloud container ai profiles list --model=<MODEL_NAME>

View benchmarks/profiles (optional): The profile listing is also where you see costs and latency targets, so the command is the same as above:

gcloud container ai profiles list --model=<MODEL_NAME>
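The two discovery commands in this step can be chained in a small helper. This is only a sketch: the function names run and giq_discover and the DRY_RUN switch are hypothetical conveniences, not part of GIQ or gcloud. Only the two gcloud commands themselves come from the text above.

```shell
# Print a command, then execute it unless DRY_RUN=1 (hypothetical switch).
run() {
  echo "+ $*"
  if [ "${DRY_RUN:-0}" != "1" ]; then
    "$@"
  fi
}

# Run both discovery commands from Step 1 for one model.
# The body is a subshell, so the variable below does not leak to the caller.
giq_discover() (
  model="${1:?usage: giq_discover MODEL_NAME}"
  # 1) List all supported models.
  run gcloud container ai profiles models list
  # 2) List valid accelerator/server profiles for the chosen model.
  run gcloud container ai profiles list "--model=${model}"
)
```

For example, DRY_RUN=1 giq_discover gemma-2-9b-it prints both commands without contacting Google Cloud, which is a cheap way to confirm the model name is spelled as intended before running them for real.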

2. Generate Manifest

Use the gcloud container ai profiles manifests create command. This ensures you are using the latest supported flags and options directly from the CLI.

Parameters:

--model: The model ID (e.g., gemma-2-9b-it).
--model-server: The inference server (e.g., vllm, tgi, triton, tensorrt-llm).
--accelerator-type: The accelerator type (e.g., nvidia-l4, nvidia-tesla-a100).
--target-ntpot-milliseconds: (Optional) Target Normalized Time Per Output Token in ms.

Example Command:

gcloud container ai profiles manifests create \
  --model=gemma-2-9b-it \
  --model-server=vllm \
  --accelerator-type=nvidia-l4 \
  --target-ntpot-milliseconds=50 > inference-workload.yaml
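When the model, server, and accelerator come from the discovery step rather than being typed by hand, the command can be assembled from arguments. The sketch below assumes nothing beyond the flags listed under Parameters; the function name giq_manifest, its argument order, and the DRY_RUN switch are hypothetical.

```shell
# Build the Step 2 command from arguments and write the manifest to a file.
# Args: MODEL SERVER ACCELERATOR [NTPOT_MS] [OUTPUT_FILE]
# The body is a subshell, so variables and "exit" stay local to the call.
giq_manifest() (
  if [ "$#" -lt 3 ]; then
    echo "usage: giq_manifest MODEL SERVER ACCELERATOR [NTPOT_MS] [OUTPUT_FILE]" >&2
    exit 2
  fi
  model="$1"; server="$2"; accel="$3"
  ntpot="${4:-}"
  out="${5:-inference-workload.yaml}"

  # Required flags.
  set -- gcloud container ai profiles manifests create \
    "--model=${model}" \
    "--model-server=${server}" \
    "--accelerator-type=${accel}"

  # The latency target is optional, so add it only when given.
  if [ -n "${ntpot}" ]; then
    set -- "$@" "--target-ntpot-milliseconds=${ntpot}"
  fi

  if [ "${DRY_RUN:-0}" = "1" ]; then
    # Show what would run, including the redirect, without calling gcloud.
    echo "$* > ${out}"
  else
    "$@" > "${out}"
  fi
)
```

Calling giq_manifest gemma-2-9b-it vllm nvidia-l4 50 reproduces the example command above; omitting the fourth argument drops the latency target.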

3. Review and Deploy

Save: The example command above redirects its output to inference-workload.yaml. Confirm that the file exists and is not empty before continuing.
Review: Check for any placeholders or specific requirements (like PVCs or secrets).
- Note: Some models require a Hugging Face token. Follow any instructions provided for creating the required secret.

Deploy:

kubectl apply -f inference-workload.yaml
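The Save and Review checks can be partly automated before applying the manifest. This is a minimal sketch: the function name check_manifest is hypothetical, and the placeholder pattern (upper-case tokens in angle brackets, such as <MODEL_NAME>) is an assumption about how unfilled values might appear, so adjust it to what the generated file actually contains.

```shell
# Sanity-check a generated manifest before deploying it.
# Arg: manifest path (defaults to the file name used in Step 2).
# The body is a subshell, so "exit" only ends this check, not your shell.
check_manifest() (
  file="${1:-inference-workload.yaml}"

  # Save check: the file must exist and be non-empty.
  if [ ! -s "${file}" ]; then
    echo "missing or empty: ${file}" >&2
    exit 1
  fi

  # Review check: report lines that still contain <UPPER_CASE> tokens
  # (assumed placeholder style; not something GIQ is known to guarantee).
  if grep -n -E '<[A-Z][A-Z0-9_]*>' "${file}" >&2; then
    echo "unfilled placeholders found in ${file}" >&2
    exit 1
  fi

  echo "ok: ${file}"
)
```

A possible sequence is check_manifest inference-workload.yaml, then kubectl apply --dry-run=client -f inference-workload.yaml to validate the manifest locally, and only then the real kubectl apply. The check does not replace reading the manifest for PVC or secret requirements.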

Troubleshooting

Invalid Combination: If the manifest creation fails with an invalid combination error, re-run the discovery commands in Step 1 to verify the tuple (model, server, accelerator).
Quota Issues: Ensure the target region has sufficient quota for the requested accelerator (e.g., NVIDIA_L4_GPUS).

Reference

Docs: GKE Inference Quickstart Documentation


Deploy optimized AI/ML inference workloads on GKE using Google's Inference Quickstart (GIQ). Covers model discovery, manifest generation, and deployment using native MCP tools and CLI.

141 stars

tools

Updated Apr 18, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add googlecloudplatform/gke-mcp gke-inference-quickstart

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Status: Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Status: Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Status: Clean (95%)
Snyk (dep): Open source security scanning. Status: Skipped (50%)
Socket.dev: Supply chain security analysis. Status: Skipped (50%)
VirusTotal: Multi-engine malware detection. Status: Skipped (50%)
CrowdStrike: Advanced threat intelligence. Status: Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Status: Skipped (50%)
OWASP Dep-Check. Status: Skipped (50%)

Last scanned: Apr 18, 2026, 7:36 AM; duration 26.5s; 1 file scanned

SKILL.md

name: gke-inference-quickstart
description: Deploy optimized AI/ML inference workloads on GKE using Google's Inference Quickstart (GIQ). Covers model discovery, manifest generation, and deployment using native MCP tools and CLI.


Related Skills

googlecloudplatform/gke-ai-troubleshooting-jobset-interruption
data-ai
Verified | Trusted | Community
Systematically diagnose GKE JobSet interruptions, restarts, and preemptions for AI/ML training workloads. Identifies preemption events, maintenance interruptions, bad host VMs, unhealthy pods, and coordinator worker failures.
161 | SKILL.md | Updated Jun 4, 2026

googlecloudplatform/gke-ai-troubleshooting-tpu-connection-failure-vbar-oom
development
Verified | Trusted | Community
Diagnose and prevent `vbar_control_agent` segfaults and OOMs caused by race conditions during TPU device resets and frequent metrics collection (e.g. every 3s). Use when TPU slice initialization fails or `vbar_control_agent` crashes on TPU v6e nodes.
161 | SKILL.md | Updated May 5, 2026

googlecloudplatform/gke-ai-troubleshooting-skill-creation-guide
development
Verified | Trusted | Community
Expert instructions for building high-quality GKE troubleshooting skills. Codifies Step 0 context rules, zero-hallucination signatures, and explicit LQL/PromQL query requirements.
161 | SKILL.md | Updated May 2, 2026

googlecloudplatform/gke-productionize
tools
Verified | Trusted | Community
Assists in preparing applications and clusters on GKE for production.
148 | SKILL.md | Updated Apr 18, 2026


Install manually

# Clone the repo
git clone https://github.com/googlecloudplatform/gke-mcp.git

# Make sure the global Claude Code skills folder exists before copying into it
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r gke-mcp/skills/gke-inference-quickstart ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: googlecloudplatform/gke-mcp (141 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT