Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

bagelhole/llm-inference-scaling

Name: llm-inference-scaling
Author: bagelhole

infrastructure/local-ai/llm-inference-scaling/SKILL.md

npx skillsauth add bagelhole/devops-security-agent-skills llm-inference-scaling


LLM Inference Scaling

Scale LLM inference horizontally on Kubernetes with GPU-aware autoscaling, request queuing, and cost-efficient spot instance strategies.

When to Use This Skill

Use this skill when:

LLM API traffic is unpredictable and you need to scale up/down automatically
Managing a fleet of vLLM or TGI inference pods on Kubernetes
Reducing inference costs with spot/preemptible GPU instances
Implementing queue-based autoscaling for batch inference jobs
Building a multi-model serving platform that shares GPU resources

Prerequisites

Kubernetes cluster with GPU nodes (NVIDIA operator installed)
KEDA (Kubernetes Event-Driven Autoscaler) installed
Prometheus with GPU metrics (dcgm-exporter or gpu-operator)
Helm 3+ for chart deployments

GPU Node Setup

# Install NVIDIA GPU Operator (handles drivers, container toolkit, DCGM)
helm repo add nvidia https://helm.ngc.nvidia.com/nvidia
helm repo update

helm install gpu-operator nvidia/gpu-operator \
  --namespace gpu-operator \
  --create-namespace \
  --set driver.enabled=true \
  --set dcgm.enabled=true \
  --set dcgmExporter.enabled=true \
  --set devicePlugin.enabled=true

# Verify GPU nodes are recognized
kubectl get nodes -l nvidia.com/gpu.present=true
kubectl describe node <gpu-node> | grep nvidia

vLLM Deployment with GPU Resources

apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama-8b
  labels:
    app: vllm
    model: llama-3.1-8b
spec:
  replicas: 1
  selector:
    matchLabels:
      app: vllm
      model: llama-3.1-8b
  template:
    metadata:
      labels:
        app: vllm
        model: llama-3.1-8b
    spec:
      nodeSelector:
        nvidia.com/gpu.present: "true"
      tolerations:
      - key: nvidia.com/gpu
        operator: Exists
        effect: NoSchedule
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - "--model"
        - "meta-llama/Llama-3.1-8B-Instruct"
        - "--tensor-parallel-size"
        - "1"
        - "--gpu-memory-utilization"
        - "0.90"
        - "--max-num-seqs"
        - "128"
        resources:
          requests:
            nvidia.com/gpu: "1"
            memory: "20Gi"
            cpu: "4"
          limits:
            nvidia.com/gpu: "1"
            memory: "24Gi"
            cpu: "8"
        ports:
        - containerPort: 8000
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 60
          periodSeconds: 10
        env:
        - name: HUGGING_FACE_HUB_TOKEN
          valueFrom:
            secretKeyRef:
              name: hf-token
              key: token

KEDA Autoscaling on Prometheus Metrics

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: vllm-scaledobject
spec:
  scaleTargetRef:
    name: vllm-llama-8b
  minReplicaCount: 1
  maxReplicaCount: 8
  cooldownPeriod: 300          # 5 min before scale-down
  pollingInterval: 15
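  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      metricName: vllm_num_requests_waiting
      threshold: "10"           # target of 10 waiting requests per replica (default AverageValue)
      # Assumes the scrape config attaches a `deployment` label; vLLM's own
      # metric labels (such as `model_name`) do not include it.
      query: |
        sum(vllm:num_requests_waiting{deployment="vllm-llama-8b"})
  - type: prometheus
    metricType: Value           # compare the average directly; AverageValue would divide it by replica count
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      metricName: vllm_gpu_cache_usage
      threshold: "0.8"          # scale up if KV cache >80% full
      query: |
        avg(vllm:gpu_cache_usage_perc{deployment="vllm-llama-8b"})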
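
The thresholds in the triggers above feed the Horizontal Pod Autoscaler formula that KEDA delegates to. The following is a minimal Python sketch of that arithmetic, assuming the two standard HPA target types (`AverageValue`, which spreads the metric across replicas, and `Value`, which compares it directly) and ignoring HPA tolerance and stabilization windows. The function name `desired_replicas` and its parameters are hypothetical, chosen only for illustration.

```python
import math


def desired_replicas(metric, target, current, min_r=1, max_r=8,
                     metric_type="AverageValue"):
    """Approximate the replica count the HPA would request.

    metric      -- value returned by the Prometheus query
    target      -- the trigger's `threshold`
    current     -- replicas currently running
    min_r/max_r -- minReplicaCount / maxReplicaCount
    """
    if metric_type == "AverageValue":
        # The metric is treated as a total shared by all replicas.
        raw = math.ceil(metric / target)
    elif metric_type == "Value":
        # The metric is compared to the target as-is and scales the
        # current replica count proportionally.
        raw = math.ceil(current * metric / target)
    else:
        raise ValueError(f"unsupported metric type: {metric_type}")
    return max(min_r, min(max_r, raw))


if __name__ == "__main__":
    # 45 queued requests against a target of 10 per replica.
    print(desired_replicas(45, 10, current=2))
    # Average KV cache usage of 0.9 on 4 replicas, compared directly.
    print(desired_replicas(0.9, 0.8, current=4, metric_type="Value"))
```

Under these assumptions, a summed queue depth suits `AverageValue` (the threshold reads as "per replica"), while an already-averaged ratio such as KV cache usage suits `Value`; treating the ratio as a total would cap the result at two replicas no matter how many are running.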

Queue-Based Scaling (Redis + KEDA)

# ScaledJob for async batch inference
apiVersion: keda.sh/v1alpha1
kind: ScaledJob
metadata:
  name: llm-batch-inference
spec:
  jobTargetRef:
    template:
      spec:
        containers:
        - name: inference-worker
          image: myapp/inference-worker:latest
          env:
          - name: REDIS_URL
            value: redis://redis:6379
          - name: QUEUE_NAME
            value: inference-jobs
        restartPolicy: OnFailure
  minReplicaCount: 0
  maxReplicaCount: 20
  pollingInterval: 5
  successfulJobsHistoryLimit: 3
  triggers:
  - type: redis
    metadata:
      address: redis:6379
      listName: inference-jobs
      listLength: "5"           # 1 worker per 5 queued jobs
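
As a rough guide to how `listLength` and `maxReplicaCount` interact, here is a small Python sketch of the job-count arithmetic. It assumes KEDA's default ScaledJob scaling strategy (queue length divided by the target, rounded up, capped at the maximum, minus jobs already running); other strategies behave differently. The function name `jobs_to_start` is hypothetical.

```python
import math


def jobs_to_start(queue_length, list_length=5, running_jobs=0, max_replicas=20):
    """Estimate how many new Jobs would be created on one polling cycle."""
    # One worker per `list_length` queued items, never more than the cap.
    wanted = min(max_replicas, math.ceil(queue_length / list_length))
    # Jobs that are still running count against the wanted total.
    return max(0, wanted - running_jobs)


if __name__ == "__main__":
    print(jobs_to_start(12))                  # 12 queued items, nothing running
    print(jobs_to_start(12, running_jobs=2))  # same queue, 2 Jobs in flight
```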

Spot Instance Strategy

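# Mixed node pool: on-demand + spot GPUs
# Read by cluster autoscaler only when it runs with --expander=priority.
# The ConfigMap must use this exact name and live in the autoscaler's namespace.
apiVersion: v1
kind: ConfigMap
metadata:
  name: cluster-autoscaler-priority-expander
  namespace: kube-system
data:
  priorities: |
    50:  # higher value = higher priority, so spot groups are tried first
    - .*spot.*
    10:
    - .*on-demand.*
---
# Node affinity for spot with on-demand fallback (fragment of a pod template spec)
# The lifecycle label below depends on how the nodes were provisioned;
# adjust the key and values to whatever labels your node groups carry.
spec:
  affinity:
    nodeAffinity:
      preferredDuringSchedulingIgnoredDuringExecution:
      - weight: 80
        preference:
          matchExpressions:
          - key: node.kubernetes.io/lifecycle
            operator: In
            values: [spot]
      - weight: 20
        preference:
          matchExpressions:
          - key: node.kubernetes.io/lifecycle
            operator: In
            values: [on-demand]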

Cluster Autoscaler for GPU Nodes

# AWS EKS: enable cluster autoscaler for GPU node group
helm repo add autoscaler https://kubernetes.github.io/autoscaler
helm repo update

# Set extraArgs.expander=priority instead if you rely on the priority expander ConfigMap
helm install cluster-autoscaler autoscaler/cluster-autoscaler \
  --namespace kube-system \
  --set autoDiscovery.clusterName=my-cluster \
  --set awsRegion=us-east-1 \
  --set rbac.serviceAccount.annotations."eks\.amazonaws\.com/role-arn"=arn:aws:iam::ACCOUNT:role/ClusterAutoscalerRole \
  --set extraArgs.skip-nodes-with-local-storage=false \
  --set extraArgs.expander=least-waste

# Keep the autoscaler from scaling down a specific GPU node
# (safe-to-evict is a pod annotation; nodes use scale-down-disabled)
kubectl annotate node <node> \
  cluster-autoscaler.kubernetes.io/scale-down-disabled=true

Scaling Metrics to Monitor

# Prometheus queries for scaling decisions
# Requests waiting in vLLM queue (vLLM labels its metrics with model_name)
sum by (model_name) (vllm:num_requests_waiting)

# GPU KV cache utilization (>80% = bottleneck)
avg by (pod) (vllm:gpu_cache_usage_perc)

# Tokens per second throughput
sum by (model_name) (rate(vllm:generation_tokens_total[5m]))

# P99 time-to-first-token, aggregated across pods
histogram_quantile(0.99, sum by (le) (rate(vllm:time_to_first_token_seconds_bucket[5m])))
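
To make the P99 query less opaque, the sketch below shows in Python how a quantile can be estimated from cumulative histogram buckets by linear interpolation. It is a simplified illustration of the idea behind `histogram_quantile` for classic `le` buckets, not Prometheus's actual implementation; it assumes a quantile in (0, 1], and the bucket values in the example are made up.

```python
import math


def estimate_quantile(q, buckets):
    """Estimate quantile q from cumulative buckets.

    buckets -- list of (upper_bound, cumulative_count) sorted by bound,
               ending with the (math.inf, total) bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Quantile falls in the open-ended bucket: report the
                # highest finite bound instead of infinity.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound


if __name__ == "__main__":
    # Hypothetical time-to-first-token buckets, in seconds.
    ttft = [(0.1, 50), (0.5, 90), (1.0, 98), (2.5, 100), (math.inf, 100)]
    print(estimate_quantile(0.99, ttft))
```

Because the estimate interpolates inside a bucket, its accuracy depends on how finely the bucket boundaries cover the latency range you care about.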

Common Issues

| Issue | Cause | Fix |
|-------|-------|-----|
| Pods stuck in Pending | No GPU nodes available | Check cluster autoscaler logs; verify node group limits |
| Scale-up too slow | Cluster autoscaler delay + model load time | Pre-warm replicas; increase minReplicaCount |
| GPU fragmentation | Multiple small models on large GPUs | Use MIG partitioning or consolidate model sizes |
| Spot eviction causes errors | Spot instance reclamation | Add PodDisruptionBudget; use graceful shutdown |
| KEDA not scaling | Prometheus query returns no data | Test query in Prometheus UI first |

Best Practices

Set minReplicaCount: 1 to avoid cold starts; scale to 0 only for batch jobs.
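Use PodDisruptionBudget with minAvailable: 1 so that node drains ahead of spot reclamation never evict every replica at once; this needs at least two replicas, otherwise the drain blocks.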
Pre-pull model weights into a shared PVC to speed up pod startup by 5–10×.
Separate model families across node pools (A10G for 7B, A100 for 70B).
Use Kubernetes VPA for CPU/memory right-sizing alongside KEDA for replica count.

Related Skills

vllm-server - vLLM configuration and tuning
gpu-server-management - GPU node setup
model-serving-kubernetes - KServe
kubernetes-ops - Core Kubernetes
llm-cost-optimization - Cost strategies


Auto-scale LLM inference clusters on Kubernetes using KEDA, custom GPU metrics, and horizontal pod autoscaling. Handle traffic spikes, implement queue-based scaling, and optimize cost with spot instances for AI workloads.

18 stars

devops

Updated Apr 3, 2026


npx skillsauth add bagelhole/devops-security-agent-skills llm-inference-scaling

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 3, 2026, 5:05 PM (9.3s, 1 file scanned)

SKILL.md

name: llm-inference-scaling
description: Auto-scale LLM inference clusters on Kubernetes using KEDA, custom GPU metrics, and horizontal pod autoscaling. Handle traffic spikes, implement queue-based scaling, and optimize cost with spot instances for AI workloads.
license: MIT
author: devops-skills
version: 1.0




Install manually


# Clone the repo
git clone https://github.com/bagelhole/devops-security-agent-skills.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r devops-security-agent-skills/infrastructure/local-ai/llm-inference-scaling ~/.claude/skills/


Repository

bagelhole/devops-security-agent-skills

18 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT