devops/orchestration/model-serving-kubernetes/SKILL.md
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.
Production ML model serving with KServe and Triton — canary deployments, autoscaling, and GPU-aware scheduling.
Use this skill when:
kubectl and helm configured
# Install KServe with Helm
helm repo add kserve https://kserve.github.io/helm-charts
helm repo update
helm install kserve kserve/kserve \
--namespace kserve \
--create-namespace \
--set kserve.controller.gateway.ingressGateway.className=nginx
# Verify
kubectl get pods -n kserve
kubectl get crd | grep kserve
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: models
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi
kubectl apply -f inference-service.yaml
# Get inference service URL
kubectl get inferenceservice sklearn-iris -n models
# NAME URL READY ...
# sklearn-iris http://sklearn-iris.models.example.com True
# Test prediction
curl -X POST http://sklearn-iris.models.example.com/v1/models/sklearn-iris:predict \
-H "Content-Type: application/json" \
-d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
  namespace: models
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    containers:
      - name: vllm-container
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct"
          - "--port"  # vLLM listens on 8000 by default; match containerPort and the probe
          - "8080"
          - "--tensor-parallel-size"
          - "1"
          - "--gpu-memory-utilization"
          - "0.90"
        ports:
          - containerPort: 8080
            protocol: TCP
        resources:
          requests:
            nvidia.com/gpu: "1"
            memory: "20Gi"
            cpu: "4"
          limits:
            nvidia.com/gpu: "1"
            memory: "24Gi"
            cpu: "8"
        readinessProbe:
          httpGet:
            path: /health
            port: 8080
          initialDelaySeconds: 60
          periodSeconds: 10
        env:
          - name: HUGGING_FACE_HUB_TOKEN
            valueFrom:
              secretKeyRef:
                name: hf-token
                key: token
    nodeSelector:
      nvidia.com/gpu.present: "true"
  transformer:
    containers:
      - name: kserve-container
        image: kserve/kserve-transformer:latest
apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
  namespace: models
spec:
  predictor:
    canaryTrafficPercent: 20  # 20% to new version, 80% to stable
    containers:
      - name: vllm-container
        image: vllm/vllm-openai:latest
        args:
          - "--model"
          - "meta-llama/Llama-3.1-8B-Instruct-v2"  # new model version
        resources:
          limits:
            nvidia.com/gpu: "1"
# Gradually increase canary traffic
kubectl patch inferenceservice llama-3-8b -n models \
--type='json' \
-p='[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":50}]'
# Promote canary to stable
kubectl patch inferenceservice llama-3-8b -n models \
--type='json' \
-p='[{"op":"remove","path":"/spec/predictor/canaryTrafficPercent"}]'
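The two patch commands above differ only in the percentage, so a ramp can be scripted. The following is a minimal sketch, not part of KServe: `canary_patch` is a hypothetical helper name, and the 20, 50, 100 schedule is an assumed example. It only prints the JSON patch, so it can be run without a cluster.

```bash
#!/usr/bin/env bash
# Hypothetical helper: build the JSON patch for a given canary percentage.
canary_patch() {
  local percent="$1"
  # Accept only integers without leading zeros, in the range 0..100.
  if ! [[ "$percent" =~ ^(0|[1-9][0-9]*)$ ]] || (( percent > 100 )); then
    echo "percent must be an integer in 0..100" >&2
    return 1
  fi
  printf '[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":%s}]\n' "$percent"
}

# Print the patch for each step of an assumed 20 -> 50 -> 100 ramp.
for step in 20 50 100; do
  canary_patch "$step"
done
```

To apply a step, pass the output to the same command used above, for example `kubectl patch inferenceservice llama-3-8b -n models --type='json' -p="$(canary_patch 50)"`, and check error rates and latency before moving to the next step.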
apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-scaler
  namespace: models
spec:
  scaleTargetRef:
    # KEDA needs a target that exposes the /scale subresource, so point at the
    # predictor Deployment (this name assumes KServe RawDeployment mode).
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-8b-predictor
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
    - type: prometheus
      metadata:
        serverAddress: http://prometheus-server.monitoring:9090
        metricName: kserve_request_count
        threshold: "10"
        query: |
          sum(rate(kserve_request_count_total{namespace="models",
          service="llama-3-8b"}[1m]))
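To see how the `threshold` of 10 and the replica bounds in the ScaledObject interact, here is a rough sketch of the arithmetic. It assumes the trigger uses the `AverageValue` metric type, so the target is roughly the query result divided by the threshold, rounded up and clamped to the min and max. `desired_replicas` is a hypothetical helper. It takes integer inputs only and ignores the HPA tolerance and stabilization window, so real scaling decisions can differ.

```bash
#!/usr/bin/env bash
# Hypothetical helper: estimate the replica count for a summed request rate.
# Arguments: total rate, per-replica threshold, min replicas, max replicas.
desired_replicas() {
  local total="$1" threshold="$2" min="$3" max="$4"
  # Integer ceiling of total / threshold.
  local want=$(( (total + threshold - 1) / threshold ))
  # Clamp to the configured replica bounds.
  (( want < min )) && want="$min"
  (( want > max )) && want="$max"
  echo "$want"
}

desired_replicas 23 10 1 5    # 23 req/s at 10 per replica: prints 3
desired_replicas 120 10 1 5   # capped by maxReplicaCount: prints 5
```

If the estimate sits at `maxReplicaCount` for normal traffic, either the threshold is too low for what a single GPU replica can handle or the cap needs raising.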
apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
        - name: triton
          image: nvcr.io/nvidia/tritonserver:24.05-py3
          args:
            - "tritonserver"
            - "--model-store=s3://my-model-store/models"
            - "--model-control-mode=poll"  # auto-load new model versions
            - "--repository-poll-secs=30"
            - "--metrics-port=8002"
          ports:
            - containerPort: 8000  # HTTP
            - containerPort: 8001  # gRPC
            - containerPort: 8002  # Metrics
          resources:
            limits:
              nvidia.com/gpu: "1"
          readinessProbe:
            httpGet:
              path: /v2/health/ready
              port: 8000
            initialDelaySeconds: 30
s3://my-model-store/models/
├── text-classifier/
│ ├── config.pbtxt
│ ├── 1/
│ │ └── model.onnx
│ └── 2/
│ └── model.onnx # new version; auto-loaded
├── embedding-model/
│ ├── config.pbtxt
│ └── 1/
│ └── model.onnx
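Before uploading a model to the bucket, it can help to confirm that a local copy follows the layout shown above. This is a minimal sketch under that assumption: `check_model_dir` is a hypothetical helper, and it only checks for a `config.pbtxt` and at least one non-empty, numerically named version directory. It does not parse the config, and Triton can work without a `config.pbtxt` for some backends, so treat a failure as a prompt to look closer rather than a hard error.

```bash
#!/usr/bin/env bash
# Hypothetical helper: check that a local model directory matches the layout above.
check_model_dir() {
  local dir="$1" found=0 v
  if [[ ! -f "$dir/config.pbtxt" ]]; then
    echo "missing config.pbtxt in $dir" >&2
    return 1
  fi
  for v in "$dir"/*/; do
    v="${v%/}"
    # Version directories are named with a positive integer, e.g. 1/ or 2/.
    if [[ "${v##*/}" =~ ^[1-9][0-9]*$ ]] && [[ -n "$(ls -A "$v")" ]]; then
      found=1
    fi
  done
  if (( ! found )); then
    echo "no non-empty numeric version directory in $dir" >&2
    return 1
  fi
}

# Usage (the path is an example):
#   check_model_dir ./models/text-classifier && echo "layout ok"
```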
# config.pbtxt for ONNX model
name: "text-classifier"
backend: "onnxruntime"
max_batch_size: 64
dynamic_batching {
preferred_batch_size: [16, 32]
max_queue_delay_microseconds: 1000
}
input [
  { name: "input_ids" data_type: TYPE_INT64 dims: [-1] },
  { name: "attention_mask" data_type: TYPE_INT64 dims: [-1] }
]
output [
{ name: "logits" data_type: TYPE_FP32 dims: [-1] }
]
instance_group [
{ kind: KIND_GPU count: 2 } # 2 model instances on GPU
]
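# List models in the repository and their state (Triton)
curl -X POST http://triton:8000/v2/repository/index
# Load a model (requires --model-control-mode=explicit; in poll mode Triton picks up changes itself)
curl -X POST http://triton:8000/v2/repository/models/text-classifier/load
# Unload a model (explicit mode only)
curl -X POST http://triton:8000/v2/repository/models/text-classifier/unload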
# KServe — watch rollout status
kubectl rollout status deployment/llama-3-8b-predictor -n models
kubectl get inferenceservice llama-3-8b -n models -w
| Issue | Cause | Fix |
|-------|-------|-----|
| InferenceService not ready | Model loading or OOM | Check predictor pod logs; increase memory limits |
| Canary stuck at 0% | Knative routing issue | Check `kubectl get ksvc -n models` |
| Triton missing model | S3 permissions or path | Verify IAM role; check `--model-store` path |
| Low GPU utilization | Dynamic batching off | Enable `dynamic_batching` in Triton config |
| Autoscaler not triggering | Prometheus query wrong | Test query in Prometheus UI |
s3://bucket/model/v1/, v2/).