bagelhole/model-serving-kubernetes

Name: model-serving-kubernetes
Author: bagelhole

devops/orchestration/model-serving-kubernetes/SKILL.md

npx skillsauth add bagelhole/devops-security-agent-skills model-serving-kubernetes

Model Serving on Kubernetes

Production ML model serving with KServe and Triton — canary deployments, autoscaling, and GPU-aware scheduling.

When to Use This Skill

Use this skill when:

Serving scikit-learn, PyTorch, TensorFlow, or ONNX models at scale
Implementing canary deployments and A/B testing for ML models
Autoscaling inference pods based on request rate or GPU metrics
Deploying LLMs with Triton or KServe on Kubernetes
Managing multiple model versions with traffic splitting

Prerequisites

Kubernetes 1.28+ with GPU nodes
KServe installed (or Triton standalone)
kubectl and helm configured
NVIDIA GPU Operator installed on cluster

KServe Installation

# Install KServe with Helm (cert-manager must already be running; KServe's webhooks depend on it)
helm repo add kserve https://kserve.github.io/helm-charts
helm repo update

helm install kserve kserve/kserve \
  --namespace kserve \
  --create-namespace \
  --set kserve.controller.gateway.ingressGateway.className=nginx

# Verify
kubectl get pods -n kserve
kubectl get crd | grep kserve

Basic InferenceService (KServe)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: sklearn-iris
  namespace: models
spec:
  predictor:
    sklearn:
      storageUri: gs://kfserving-examples/models/sklearn/1.0/model
      resources:
        requests:
          cpu: "1"
          memory: 2Gi
        limits:
          cpu: "2"
          memory: 4Gi

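# Create the target namespace first (skip this if it already exists), then apply the manifest
kubectl create namespace models
kubectl apply -f inference-service.yaml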

# Get inference service URL
kubectl get inferenceservice sklearn-iris -n models
# NAME           URL                                          READY   ...
# sklearn-iris   http://sklearn-iris.models.example.com       True

# Test prediction
curl -X POST http://sklearn-iris.models.example.com/v1/models/sklearn-iris:predict \
  -H "Content-Type: application/json" \
  -d '{"instances": [[6.8, 2.8, 4.8, 1.4]]}'
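The same prediction call can be sketched in Python using only the standard library. This is a minimal sketch, not part of the skill: the function names `build_predict_request` and `send` are hypothetical, and the host is the example URL shown above, which only resolves inside a cluster that exposes it.

```python
import json
import urllib.request


def build_predict_request(base_url, model_name, instances):
    """Build (but do not send) a POST to the KServe v1 ':predict' endpoint."""
    url = f"{base_url.rstrip('/')}/v1/models/{model_name}:predict"
    body = json.dumps({"instances": instances}).encode("utf-8")
    return urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send(request, timeout=10):
    """Send the request and decode the JSON response (needs a reachable service)."""
    with urllib.request.urlopen(request, timeout=timeout) as response:
        return json.loads(response.read())


# Same payload as the curl example: one iris sample with four features.
request = build_predict_request(
    "http://sklearn-iris.models.example.com",
    "sklearn-iris",
    [[6.8, 2.8, 4.8, 1.4]],
)
print(request.full_url)
```

Calling `send(request)` performs the actual HTTP call; it is kept separate so the request can be inspected or logged first.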

GPU-Enabled LLM InferenceService

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
  namespace: models
  annotations:
    serving.kserve.io/enable-prometheus-scraping: "true"
spec:
  predictor:
    containers:
    - name: vllm-container
      image: vllm/vllm-openai:latest
      args:
      - "--model"
      - "meta-llama/Llama-3.1-8B-Instruct"
      - "--port"               # vLLM's server listens on 8000 by default; match containerPort and the probe
      - "8080"
      - "--tensor-parallel-size"
      - "1"
      - "--gpu-memory-utilization"
      - "0.90"
      ports:
      - containerPort: 8080
        protocol: TCP
      resources:
        requests:
          nvidia.com/gpu: "1"
          memory: "20Gi"
          cpu: "4"
        limits:
          nvidia.com/gpu: "1"
          memory: "24Gi"
          cpu: "8"
      readinessProbe:
        httpGet:
          path: /health
          port: 8080
        initialDelaySeconds: 60
        periodSeconds: 10
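      env:
      - name: HUGGING_FACE_HUB_TOKEN
        valueFrom:
          secretKeyRef:
            name: hf-token     # this Secret (with a "token" key) must exist in the models namespace
            key: token
    nodeSelector:
      nvidia.com/gpu.present: "true"
  transformer:                 # optional pre/post-processing step; omit the block if it is not needed
    containers:
    - name: kserve-container
      image: kserve/kserve-transformer:latest   # placeholder; substitute your own transformer image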

Canary Deployment (Traffic Splitting)

apiVersion: serving.kserve.io/v1beta1
kind: InferenceService
metadata:
  name: llama-3-8b
  namespace: models
spec:
  predictor:
    canaryTrafficPercent: 20    # 20% to new version, 80% to stable (canary rollout needs Serverless/Knative mode)
    containers:
    - name: vllm-container
      image: vllm/vllm-openai:latest
      args:
      - "--model"
      - "meta-llama/Llama-3.1-8B-Instruct-v2"  # new model version
      resources:
        limits:
          nvidia.com/gpu: "1"

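# Gradually increase canary traffic
kubectl patch inferenceservice llama-3-8b -n models \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":50}]'

# Roll back: send all traffic to the previous revision
kubectl patch inferenceservice llama-3-8b -n models \
  --type='json' \
  -p='[{"op":"replace","path":"/spec/predictor/canaryTrafficPercent","value":0}]'

# Promote canary to stable
kubectl patch inferenceservice llama-3-8b -n models \
  --type='json' \
  -p='[{"op":"remove","path":"/spec/predictor/canaryTrafficPercent"}]'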
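When the rollout is scripted, the JSON Patch bodies used in these commands can be generated rather than typed by hand. The sketch below is an illustration only: `canary_patch` and `kubectl_patch_args` are hypothetical helper names, and the resource name and namespace are the ones from the example above.

```python
import json

CANARY_PATH = "/spec/predictor/canaryTrafficPercent"


def canary_patch(percent=None):
    """Return a JSON Patch string for the canary field.

    percent=None removes the field (promotion); 0-100 replaces its value.
    """
    if percent is None:
        ops = [{"op": "remove", "path": CANARY_PATH}]
    else:
        if not 0 <= percent <= 100:
            raise ValueError("percent must be between 0 and 100")
        ops = [{"op": "replace", "path": CANARY_PATH, "value": percent}]
    # Compact separators keep the string identical to the hand-written patches.
    return json.dumps(ops, separators=(",", ":"))


def kubectl_patch_args(name, namespace, percent=None):
    """Argument list suitable for subprocess.run(); nothing is executed here."""
    return [
        "kubectl", "patch", "inferenceservice", name,
        "-n", namespace, "--type=json", "-p", canary_patch(percent),
    ]


print(canary_patch(50))
print(kubectl_patch_args("llama-3-8b", "models"))
```

Building the argument list separately from running it makes it easy to log or dry-run each step of a ramp before touching the cluster.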

Autoscaling with KEDA

apiVersion: keda.sh/v1alpha1
kind: ScaledObject
metadata:
  name: llama-scaler
  namespace: models
spec:
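  scaleTargetRef:
    # KEDA scales resources that expose the /scale subresource, so this targets the
    # predictor Deployment (the name KServe uses in RawDeployment mode) rather than the
    # InferenceService itself. Set serving.kserve.io/autoscalerClass: "external" on the
    # InferenceService so KServe does not create a competing HPA.
    apiVersion: apps/v1
    kind: Deployment
    name: llama-3-8b-predictor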
  minReplicaCount: 1
  maxReplicaCount: 5
  triggers:
  - type: prometheus
    metadata:
      serverAddress: http://prometheus-server.monitoring:9090
      metricName: kserve_request_count
      threshold: "10"
      query: |
        sum(rate(kserve_request_count_total{namespace="models",
            service="llama-3-8b"}[1m]))
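To reason about what this trigger will do, it helps to estimate the replica count for a given request rate. The sketch below assumes the trigger uses KEDA's default AverageValue metric type, where the target is roughly the query result divided by the threshold, rounded up. It ignores the HPA tolerance band, stabilization windows, and cooldowns, so treat it as a planning aid rather than a model of the real controller. The function name `desired_replicas` is hypothetical.

```python
import math


def desired_replicas(metric_value, threshold, min_replicas, max_replicas):
    """Approximate replica target: ceil(metric / threshold), clamped to the bounds."""
    if threshold <= 0:
        raise ValueError("threshold must be positive")
    raw = math.ceil(metric_value / threshold)
    return max(min_replicas, min(max_replicas, raw))


# Values from the ScaledObject above: threshold 10, 1 to 5 replicas.
for rate in (0, 8, 35, 120):
    print(rate, "req/s ->", desired_replicas(rate, 10, 1, 5), "replicas")
```

With a threshold of 10 and a ceiling of 5 replicas, anything above roughly 50 requests per second saturates the scaler, which is a hint to raise `maxReplicaCount` or the threshold.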

NVIDIA Triton Inference Server

apiVersion: apps/v1
kind: Deployment
metadata:
  name: triton-server
  namespace: models
spec:
  replicas: 2
  selector:
    matchLabels:
      app: triton
  template:
    metadata:
      labels:
        app: triton
    spec:
      containers:
      - name: triton
        image: nvcr.io/nvidia/tritonserver:24.05-py3
        args:
        - "tritonserver"
        - "--model-store=s3://my-model-store/models"
        - "--model-control-mode=poll"        # auto-load new model versions
        - "--repository-poll-secs=30"
        - "--metrics-port=8002"
        ports:
        - containerPort: 8000   # HTTP
        - containerPort: 8001   # gRPC
        - containerPort: 8002   # Metrics
        resources:
          limits:
            nvidia.com/gpu: "1"
        readinessProbe:
          httpGet:
            path: /v2/health/ready
            port: 8000
          initialDelaySeconds: 30

Triton Model Repository Structure

s3://my-model-store/models/
├── text-classifier/
│   ├── config.pbtxt
│   ├── 1/
│   │   └── model.onnx
│   └── 2/
│       └── model.onnx          # new version; auto-loaded
├── embedding-model/
│   ├── config.pbtxt
│   └── 1/
│       └── model.onnx

# config.pbtxt for ONNX model
name: "text-classifier"
backend: "onnxruntime"
max_batch_size: 64
dynamic_batching {
  preferred_batch_size: [16, 32]
  max_queue_delay_microseconds: 1000
}
input [
  { name: "input_ids" data_type: TYPE_INT64 dims: [-1] }
  { name: "attention_mask" data_type: TYPE_INT64 dims: [-1] }
]
output [
  { name: "logits" data_type: TYPE_FP32 dims: [-1] }
]
instance_group [
  { kind: KIND_GPU count: 2 }   # 2 model instances on GPU
]
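A client calling this model sends both input tensors in the v2 inference protocol's JSON form, typically as a POST to `/v2/models/text-classifier/infer`. The sketch below builds such a request body for the config above. Because `max_batch_size` is set, each tensor's shape carries a leading batch dimension in front of the `dims` declared in `config.pbtxt`. The helper name `build_infer_payload` and the token ids are illustrative assumptions.

```python
import json


def build_infer_payload(input_ids, attention_mask):
    """Build a v2 inference request body for the text-classifier config above."""
    if not input_ids or not input_ids[0]:
        raise ValueError("input_ids must contain at least one non-empty row")
    batch, seq_len = len(input_ids), len(input_ids[0])
    for rows in (input_ids, attention_mask):
        if len(rows) != batch or any(len(row) != seq_len for row in rows):
            raise ValueError("all rows must share one length; pad before batching")

    def tensor(name, rows):
        return {
            "name": name,
            "shape": [batch, seq_len],   # batch dimension + the variable dim from config.pbtxt
            "datatype": "INT64",         # matches TYPE_INT64 in the config
            "data": [value for row in rows for value in row],  # row-major, flattened
        }

    return {
        "inputs": [tensor("input_ids", input_ids), tensor("attention_mask", attention_mask)],
        "outputs": [{"name": "logits"}],
    }


# Two padded sequences of length 3 (token ids are made up for illustration).
payload = build_infer_payload([[101, 2023, 102], [101, 102, 0]], [[1, 1, 1], [1, 1, 0]])
print(json.dumps(payload, indent=2))
```

Padding every sequence in a request to one length keeps the tensor rectangular, which the flattened `data` layout requires.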

Model Management Commands

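# These commands assume a Service named "triton" exposes port 8000 of the Deployment above

# List models in the repository and their load state (Triton repository index API)
curl -X POST http://triton:8000/v2/repository/index

# Load a model (load/unload need --model-control-mode=explicit; in poll mode Triton rejects them)
curl -X POST http://triton:8000/v2/repository/models/text-classifier/load

# Unload a model
curl -X POST http://triton:8000/v2/repository/models/text-classifier/unload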

# KServe — watch rollout status
kubectl rollout status deployment/llama-3-8b-predictor -n models
kubectl get inferenceservice llama-3-8b -n models -w

Common Issues

| Issue | Cause | Fix |
|-------|-------|-----|
| InferenceService not ready | Model loading or OOM | Check predictor pod logs; increase memory limits |
| Canary stuck at 0% | Knative routing issue | Check kubectl get ksvc -n models |
| Triton missing model | S3 permissions or path | Verify IAM role; check --model-store path |
| Low GPU utilization | Dynamic batching off | Enable dynamic_batching in Triton config |
| Autoscaler not triggering | Prometheus query wrong | Test query in Prometheus UI |

Best Practices

Use canary deployments for all model updates — roll back in seconds if metrics degrade.
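Enable Triton dynamic batching; on small models it can raise GPU throughput several-fold (5–10× is possible, but the gain depends on the model and request pattern, so benchmark it).
Store models in S3/GCS with versioned paths (s3://bucket/model/v1/, v2/); inside a Triton model repository the version directories must be numeric (1/, 2/).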
Pin GPU node selectors to prevent model pods landing on CPU-only nodes.
Monitor p99 latency and error rates per model version during canary rollouts.
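The monitoring practice above can be turned into an explicit promote-or-rollback gate. The sketch below is one possible shape for such a check, not the skill's own method: the function names and both thresholds (`max_latency_ratio`, `max_error_increase`) are hypothetical defaults, and the latency samples are assumed to have been collected per model version already.

```python
def p99(latencies_ms):
    """99th percentile by the nearest-rank method (integer arithmetic only)."""
    if not latencies_ms:
        raise ValueError("need at least one latency sample")
    ordered = sorted(latencies_ms)
    rank = (99 * len(ordered) + 99) // 100  # ceil(0.99 * n)
    return ordered[rank - 1]


def canary_verdict(stable_ms, canary_ms, stable_error_rate, canary_error_rate,
                   max_latency_ratio=1.2, max_error_increase=0.01):
    """Return "promote" if the canary stays within both budgets, else "rollback"."""
    latency_ok = p99(canary_ms) <= max_latency_ratio * p99(stable_ms)
    errors_ok = canary_error_rate <= stable_error_rate + max_error_increase
    return "promote" if latency_ok and errors_ok else "rollback"


# Canary is 10% slower with a slightly higher error rate: inside both budgets.
print(canary_verdict([100] * 100, [110] * 100, 0.010, 0.015))
# Canary p99 is 50% slower: outside the latency budget.
print(canary_verdict([100] * 100, [150] * 100, 0.010, 0.010))
```

Comparing the canary against the stable version measured over the same window, instead of against a fixed number, keeps the gate meaningful when overall load shifts during the rollout.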

Related Skills

vllm-server - vLLM for LLM serving
llm-inference-scaling - KEDA autoscaling
kubernetes-ops - Core Kubernetes operations
gpu-server-management - GPU nodes

Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.

28 stars

testing

Updated May 23, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add bagelhole/devops-security-agent-skills model-serving-kubernetes

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 23, 2026, 3:02 AM · 95.3s · 1 file scanned

SKILL.md

name: model-serving-kubernetes
description: Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.
license: MIT
author: devops-skills
version: 1.0


Install manually

# Clone the repo
git clone https://github.com/bagelhole/devops-security-agent-skills.git

# Copy into the Claude Code skills folder (global), creating it first if it does not exist
mkdir -p ~/.claude/skills
cp -r devops-security-agent-skills/devops/orchestration/model-serving-kubernetes ~/.claude/skills/

Repository

bagelhole/devops-security-agent-skills

28 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT