skills/evaluating-kubernetes-performance-genai/SKILL.md
Design and optimize Kubernetes-native GenAI inference platforms using Kueue job queuing, Dynamic Accelerator Slicer (DAS) GPU partitioning, and Gateway API Inference Extension (GAIE) with llm-d for multi-stage AI pipelines. Use when: 'set up Kubernetes for AI inference', 'configure Kueue for batch GPU jobs', 'partition GPUs with MIG slicing on Kubernetes', 'optimize LLM inference routing on Kubernetes', 'build a Whisper-to-LLM pipeline on K8s', 'reduce TTFT latency for LLM serving'.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ndpvt-web/arxiv-claude-skills evaluating-kubernetes-performance-genai
This skill enables Claude to architect, configure, and optimize multi-stage GenAI inference pipelines on Kubernetes using three complementary components: Kueue for batch job scheduling with priority and preemption, Dynamic Accelerator Slicer (DAS) for just-in-time GPU partitioning via NVIDIA MIG, and the Gateway API Inference Extension (GAIE) with llm-d for prefix-cache-aware LLM request routing. The technique is drawn from an industry paper demonstrating that these components form a cohesive platform delivering up to 15% makespan reduction (Kueue), 36% faster mean job completion (DAS), and 90% improvement in tail TTFT latency (GAIE) on real ASR-to-summarization workloads.
The paper's core insight is that three Kubernetes-native projects address three distinct bottlenecks in GenAI inference and, when combined, eliminate compounding inefficiencies:
Kueue adds job-level queueing and admission control for batch workloads on top of the default Kubernetes scheduler, using a hierarchical queue system (ResourceFlavors -> ClusterQueues -> LocalQueues). It admits jobs only when quota is available rather than letting pods pile up in Pending state. With BestEffortFIFO ordering, priority classes, and preemption policies, Kueue ensures high-value jobs (e.g., large Whisper models) preempt lower-priority work, reducing total makespan by up to 15%. Critically, Kueue's admission latency stays under 25ms at P99 even with complex multi-queue configurations.
DAS (Dynamic Accelerator Slicer) solves GPU underutilization by creating NVIDIA MIG partitions just-in-time based on pod resource requests. Instead of allocating a full A100 (40GB) to a job that only needs 5GB, DAS provisions a 1g.5gb slice, enabling 25 concurrent jobs where only 8 could run before. Individual jobs run ~20% slower on slices, but the 3x parallelism gain produces a net 36% reduction in mean completion time. DAS coordinates with the NVIDIA GPU Operator and binds scheduling to physical slice allocation.
GAIE + llm-d tackles online inference latency through inference-aware request routing. The Endpoint Picker Plugin (EPP) in GAIE inspects each request's prompt tokens and routes it to the vLLM replica whose KV-cache already contains the longest matching prefix. This "prefix-cache-aware" scheduling achieves a 6.25x improvement in TTFT (500ms -> 80ms) compared to random routing, with P99 tail latency improving by up to 90%. The overhead of the routing layer itself is only 3-11ms.
Assess the workload type: Classify each stage of the pipeline as batch inference (finite jobs processing a dataset) or online inference (request-response serving). Batch stages use Kueue + DAS; online stages use GAIE + llm-d.
Define ResourceFlavors for your GPU inventory: Create ResourceFlavor objects that describe each distinct GPU type or MIG profile available in the cluster (e.g., nvidia-a100-40gb, nvidia-a100-1g5gb). Map these to node labels and taints.
Configure Kueue's queue hierarchy: Create a ClusterQueue with resource quotas matching your GPU capacity. For multi-tenant or multi-priority workloads, create multiple ClusterQueues with borrowing limits. Attach LocalQueues in each namespace where jobs will be submitted.
Assign priority classes to job types: Create PriorityClass resources for each workload tier (e.g., large-model-high, medium-model-default). Configure preemption policies on the ClusterQueue so higher-priority jobs can reclaim resources from lower-priority ones.
Configure DAS profiles for GPU slicing: Define DASProfile objects specifying which MIG profile to use for each workload size. For example: medium Whisper (769M params) -> 1g.5gb; large Whisper (1.55B params) -> 3g.20gb. Ensure the NVIDIA GPU Operator is installed with MIG support enabled.
Create batch job templates: Write Kubernetes Job manifests that request specific GPU slice resources (e.g., nvidia.com/mig-1g.5gb: 1). Set the Kueue queue label (kueue.x-k8s.io/queue-name) so jobs are admitted through the queue system rather than directly scheduled.
Deploy the LLM serving layer with llm-d: Deploy vLLM replicas serving your target model (e.g., Qwen3-8B) with prefix caching enabled. Create InferencePool and InferenceModel objects to register the model endpoint with GAIE.
Configure GAIE routing with the Endpoint Picker Plugin: Set up the InferenceGateway with an Envoy-based service mesh. Configure the EPP to use prefix-cache-aware scheduling ("precise" mode) rather than random or round-robin routing.
Wire the pipeline stages together: Implement a controller or workflow engine (e.g., Argo Workflows, Tekton) that submits batch transcription jobs, collects outputs, then sends them as requests to the LLM serving endpoint for summarization.
Benchmark and tune: Use kube-burner for batch workload orchestration and GuideLLM for online inference benchmarking. Monitor Kueue admission latency, DAS slice utilization, and GAIE TTFT/E2E latency. Adjust queue borrowing limits, MIG profiles, and vLLM replica count based on results.
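The multi-tenant queue hierarchy described in the steps above can be sketched as two ClusterQueues that share a cohort and lend each other idle quota. This is a minimal sketch, not the paper's configuration: the queue names, the cohort name, and the 4/4 quota split are illustrative assumptions, and it presumes a ResourceFlavor named a100-40gb already exists.

```yaml
# Hypothetical two-tenant layout; names and quota split are assumptions.
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-a-gpu            # hypothetical tenant queue
spec:
  cohort: genai-gpus          # queues in the same cohort can lend unused quota
  namespaceSelector: {}       # {} admits workloads from all namespaces
  queueingStrategy: BestEffortFIFO
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100-40gb
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 4       # guaranteed share
        borrowingLimit: 2     # may borrow at most 2 idle GPUs from the cohort
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: team-b-gpu            # hypothetical tenant queue
spec:
  cohort: genai-gpus
  namespaceSelector: {}
  queueingStrategy: BestEffortFIFO
  preemption:
    reclaimWithinCohort: Any  # take back lent quota when team-b needs it
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100-40gb
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 4
        borrowingLimit: 2
```

The nominal quotas sum to the 8 physical GPUs, so borrowing only redistributes idle capacity and never over-commits the node.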
Example 1: Configuring Kueue for GPU batch transcription
User: "I have 8 A100 GPUs and need to run 32 Whisper transcription jobs. Some are large model, some medium. How do I set up Kueue to prioritize the large jobs?"
Approach:
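Create PriorityClasses whisper-large (value: 1000) and whisper-medium (value: 500)
Define a ResourceFlavor for nvidia.com/gpu on the GPU node
Configure the ClusterQueue preemption policy with ReclaimWithinCohort
Output: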
# ClusterQueue: cluster-wide GPU quota and preemption policy
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-queue
spec:
  # An unset namespaceSelector matches no namespaces; {} admits workloads from all of them
  namespaceSelector: {}
  queueingStrategy: BestEffortFIFO
  preemption:
    reclaimWithinCohort: Any
    withinClusterQueue: LowerPriority
  resourceGroups:
  - coveredResources: ["nvidia.com/gpu"]
    flavors:
    - name: a100-40gb          # must match an existing ResourceFlavor
      resources:
      - name: nvidia.com/gpu
        nominalQuota: 8        # one unit per physical A100
---
# LocalQueue: namespaced entry point that forwards jobs to the ClusterQueue
apiVersion: kueue.x-k8s.io/v1beta1
kind: LocalQueue
metadata:
  name: transcription-queue
  namespace: asr-pipeline
spec:
  clusterQueue: gpu-queue
---
# Job: the queue-name label hands admission over to Kueue
apiVersion: batch/v1
kind: Job
metadata:
  name: whisper-large-001
  namespace: asr-pipeline
  labels:
    kueue.x-k8s.io/queue-name: transcription-queue
spec:
  template:
    spec:
      priorityClassName: whisper-large
      containers:
      - name: whisper
        image: openai/whisper:large-v3
        resources:
          limits:
            nvidia.com/gpu: 1
      restartPolicy: Never
Expected behavior: Kueue admits jobs up to the 8-GPU quota. When a whisper-large job arrives and all GPUs are occupied by whisper-medium jobs, Kueue preempts a medium job to free a GPU. Admission latency stays under 25ms.
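Example 1 references a ResourceFlavor (a100-40gb) and two PriorityClasses (whisper-large, whisper-medium) without showing them. A minimal sketch of those prerequisites follows. The priority values come from the approach above. The node label value is an assumption about how GPU Feature Discovery labels the A100 nodes, so check it against the labels actually present on your nodes.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: a100-40gb
spec:
  nodeLabels:
    # Assumed label value; verify with: kubectl get nodes --show-labels
    nvidia.com/gpu.product: NVIDIA-A100-SXM4-40GB
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: whisper-large
value: 1000                   # higher value wins during preemption
globalDefault: false
description: "Large Whisper transcription jobs"
---
apiVersion: scheduling.k8s.io/v1
kind: PriorityClass
metadata:
  name: whisper-medium
value: 500
globalDefault: false
description: "Medium Whisper transcription jobs"
```

Apply these before the ClusterQueue so that the flavor reference resolves and the jobs' priorityClassName is accepted.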
Example 2: Enabling DAS GPU slicing for higher parallelism
User: "My Whisper medium model only needs 5GB of GPU memory but each job gets a full 40GB A100. Can I run more jobs in parallel?"
Approach:
Map the medium Whisper workload to a 1g.5gb MIG slice
Request nvidia.com/mig-1g.5gb instead of nvidia.com/gpu in the Job
Output:
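# NOTE: this apiVersion is reproduced from the source and is the GAIE API group.
# Confirm which group/version serves DASProfile in your cluster (kubectl api-resources).
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: DASProfile
metadata:
  name: whisper-medium-profile
spec:
  migProfile: 1g.5gb
  accelerator: nvidia-a100-40gb
---
apiVersion: batch/v1
kind: Job
metadata:
  name: whisper-medium-017
  namespace: asr-pipeline      # same namespace as the transcription-queue LocalQueue
  labels:
    kueue.x-k8s.io/queue-name: transcription-queue
spec:
  template:
    spec:
      containers:
      - name: whisper
        image: openai/whisper:medium
        resources:
          limits:
            nvidia.com/mig-1g.5gb: 1   # one MIG slice instead of a full GPU
      restartPolicy: Never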
Expected behavior: Each A100 is split into up to 7 MIG slices. Cluster capacity rises from 8 concurrent jobs (full GPUs) to ~25 (sliced). Individual jobs take ~17 min instead of ~14 min, but mean completion across all 32 jobs drops from 28 min to 18 min (36% reduction).
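The ClusterQueue in Example 1 covers only nvidia.com/gpu, so Kueue would not admit a Job that requests nvidia.com/mig-1g.5gb. One way to close that gap, sketched under assumptions, is a variant of the same ClusterQueue that covers the slice resource. The flavor name nvidia-a100-1g5gb follows the naming example given in the workflow steps, and the quota of 25 mirrors the concurrency reported above rather than the theoretical ceiling of 8 GPUs x 7 slices = 56.

```yaml
apiVersion: kueue.x-k8s.io/v1beta1
kind: ResourceFlavor
metadata:
  name: nvidia-a100-1g5gb     # assumed flavor name for 1g.5gb slices
---
apiVersion: kueue.x-k8s.io/v1beta1
kind: ClusterQueue
metadata:
  name: gpu-queue
spec:
  namespaceSelector: {}       # {} admits workloads from all namespaces
  queueingStrategy: BestEffortFIFO
  preemption:
    withinClusterQueue: LowerPriority
  resourceGroups:
  - coveredResources: ["nvidia.com/mig-1g.5gb"]
    flavors:
    - name: nvidia-a100-1g5gb
      resources:
      - name: nvidia.com/mig-1g.5gb
        nominalQuota: 25      # assumption: matches the ~25 concurrent jobs reported
```

Setting the quota near the slice count the cluster can actually provide keeps admitted jobs from sitting in Pending while they wait for a physical slice.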
Example 3: Prefix-cache-aware LLM routing with GAIE and llm-d
User: "I have 8 vLLM replicas serving Qwen3-8B for summarization. Under load, TTFT spikes to 500ms+. How do I reduce it?"
Approach:
Output:
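# InferencePool: the set of vLLM pods the Endpoint Picker can route to
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferencePool
metadata:
  name: summarization-pool
spec:
  targetPortNumber: 8000
  # v1alpha2 takes a flat label map here; the matchLabels form belongs to the later v1 API
  selector:
    app: vllm-qwen3-8b
  extensionRef:
    name: endpoint-picker
---
# InferenceModel: binds a served model name to the pool with a criticality level
apiVersion: inference.networking.x-k8s.io/v1alpha2
kind: InferenceModel
metadata:
  name: qwen3-8b-summarizer
spec:
  modelName: Qwen/Qwen3-8B
  poolRef:
    name: summarization-pool
  criticality: Critical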
Expected behavior: The EPP inspects each incoming request's prompt tokens and routes to the replica with the longest matching prefix in its KV-cache. TTFT drops from ~500ms (random routing) to ~80ms (prefix-aware). P99 tail latency improves by up to 90%. The routing layer adds only 3-11ms overhead, which is far outweighed by cache hit gains.
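For the pool in Example 3 to receive traffic, the vLLM pods need the label the pool selects on, and a route has to point at the pool. The following is a minimal sketch of both pieces, not the paper's manifests: the Deployment name, the image tag, the route name, and the Gateway name inference-gateway are assumptions.

```yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-qwen3-8b          # assumed name
spec:
  replicas: 8
  selector:
    matchLabels:
      app: vllm-qwen3-8b
  template:
    metadata:
      labels:
        app: vllm-qwen3-8b     # must match the InferencePool selector
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest   # assumed tag; pin a version in practice
        args:
        - --model
        - Qwen/Qwen3-8B
        - --port
        - "8000"
        - --enable-prefix-caching        # prefix-aware routing relies on this cache
        ports:
        - containerPort: 8000
        resources:
          limits:
            nvidia.com/gpu: 1
---
apiVersion: gateway.networking.k8s.io/v1
kind: HTTPRoute
metadata:
  name: summarization-route    # assumed name
spec:
  parentRefs:
  - name: inference-gateway    # hypothetical Gateway; use the one in your cluster
  rules:
  - backendRefs:
    - group: inference.networking.x-k8s.io
      kind: InferencePool      # route to the pool, not to a Service
      name: summarization-pool
```

Pointing the route at the InferencePool instead of a Service is what lets the Endpoint Picker choose the replica for each request.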
Prefer the smallest MIG profile that fits the model, such as 1g.5gb; don't allocate 3g.20gb unless the model requires it.
Enable vLLM prefix caching (--enable-prefix-caching) before configuring GAIE's prefix-aware routing; the routing is useless without cached prefixes.
Use withinClusterQueue: LowerPriority preemption for production queues.
List the MIG profiles the GPU supports with nvidia-smi mig -lgip on the target node.
The reported results come from a single p4d.24xlarge (8x A100). Multi-node GPU scheduling with DAS and Kueue introduces additional complexity around topology-aware placement.
Evaluating Kubernetes Performance for GenAI Inference: From Automatic Speech Recognition to LLM Summarization. Malleni et al., 2026. Focus on Sections 3-5 for Kueue queue hierarchy design, DAS MIG profile configuration, and GAIE prefix-cache-aware routing benchmarks. The paper's GitHub repository contains reproducible manifests for all experiments.