infrastructure/networking/ai-inference-service-mesh/SKILL.md
Use service mesh patterns for AI inference traffic management, mTLS, canary releases, policy enforcement, and cross-cluster resilience.
Apply Istio/Linkerd mesh controls to secure and optimize east-west AI traffic across inference microservices.
# Install Istio with production profile
istioctl install --set profile=default \
--set meshConfig.accessLogFile=/dev/stdout \
--set meshConfig.defaultConfig.holdApplicationUntilProxyStarts=true
# Label inference namespace for sidecar injection
kubectl create namespace ai-inference
kubectl label namespace ai-inference istio-injection=enabled
# Verify installation
istioctl verify-install
istioctl analyze -n ai-inference
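Labeling the namespace only affects pods created afterwards, so it can help to confirm that running pods actually carry the proxy. The following is a minimal sketch, not part of the original skill: the function name `pods_missing_sidecar` is hypothetical, it assumes the sidecar container is named `istio-proxy`, and it runs here against sample input rather than a live cluster.

```shell
# Print the names of pods whose container list lacks istio-proxy.
# Input: one pod per line, "<pod-name> <container> <container> ...".
pods_missing_sidecar() {
  while read -r pod containers; do
    # Skip blank lines.
    [ -n "$pod" ] || continue
    case " $containers " in
      *" istio-proxy "*) ;;   # sidecar present, nothing to report
      *) echo "$pod" ;;       # no sidecar found in the container list
    esac
  done
}

# Sample input standing in for real cluster output (hypothetical pod names).
pods_missing_sidecar <<'EOF'
model-server-abc model-server istio-proxy
vector-retriever-0 vector-retriever
EOF
# prints: vector-retriever-0
```

Against a cluster, the input could come from `kubectl get pods -n ai-inference -o jsonpath='{range .items[*]}{.metadata.name}{" "}{.spec.containers[*].name}{" "}{.spec.initContainers[*].name}{"\n"}{end}'`. Init containers are included because some Istio configurations run the proxy as a native sidecar.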
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: default
  namespace: istio-system
spec:
  mtls:
    mode: STRICT
---
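# Override for gradual rollout. portLevelMtls only takes effect when a
# workload selector is set, so this policy targets the model-server pods
# (label taken from the AuthorizationPolicy in this skill; adjust as needed).
apiVersion: security.istio.io/v1beta1
kind: PeerAuthentication
metadata:
  name: ai-inference-mtls
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: model-server
  mtls:
    mode: STRICT
  portLevelMtls:
    # gRPC inference port
    8081:
      mode: STRICT
    # Prometheus metrics port - allow plaintext scraping
    9090:
      mode: PERMISSIVE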
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: model-server-access
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: model-server
  action: ALLOW
  rules:
    - from:
        - source:
            principals:
              - "cluster.local/ns/ai-inference/sa/api-gateway"
              - "cluster.local/ns/ai-inference/sa/orchestrator"
      to:
        - operation:
            methods: ["POST"]
            # A bare "*" is a wildcard only at the start or end of a path.
            # The {*} template (Istio 1.22+) matches a single path segment.
            paths: ["/v1/predict", "/v1/embeddings", "/v2/models/{*}/infer"]
---
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: deny-external-to-retriever
  namespace: ai-inference
spec:
  selector:
    matchLabels:
      app: vector-retriever
  action: DENY
  rules:
    - from:
        - source:
            notNamespaces: ["ai-inference"]
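apiVersion: networking.istio.io/v1alpha3
kind: ServiceEntry
metadata:
  name: openai-api
  namespace: ai-inference
spec:
  hosts:
    - api.openai.com
  ports:
    - number: 443
      name: https
      protocol: TLS
  resolution: DNS
  location: MESH_EXTERNAL
---
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: openai-api-tls
  namespace: ai-inference
spec:
  host: api.openai.com
  trafficPolicy:
    tls:
      # SIMPLE makes the sidecar originate TLS. Keep it only if the
      # application sends plaintext to this host; drop the tls block if the
      # application already speaks HTTPS.
      mode: SIMPLE
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
      tcp:
        maxConnections: 50
---
# AuthorizationPolicy is enforced by the proxy that receives a request.
# Without a selector, this policy applies to inbound traffic for every
# workload in the namespace, not to outbound calls. To restrict egress,
# attach it to an egress gateway workload that outbound traffic is routed
# through.
apiVersion: security.istio.io/v1beta1
kind: AuthorizationPolicy
metadata:
  name: restrict-egress
  namespace: ai-inference
spec:
  action: ALLOW
  rules:
    - to:
        - operation:
            hosts:
              - "api.openai.com"
              - "models.anthropic.com"
              - "*.blob.core.windows.net"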
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: model-server
  namespace: ai-inference
spec:
  hosts:
    - model-server
  http:
    # Route by header for explicit model version selection
    - match:
        - headers:
            x-model-version:
              exact: "v2-experimental"
      route:
        - destination:
            host: model-server
            subset: v2-experimental
      timeout: 120s
    # Route by header for A/B test cohort
    - match:
        - headers:
            x-ab-cohort:
              exact: "treatment"
      route:
        - destination:
            host: model-server
            subset: v2-experimental
          weight: 100
      timeout: 120s
    # Default traffic split: 90/10 canary
    - route:
        - destination:
            host: model-server
            subset: v1-stable
          weight: 90
        - destination:
            host: model-server
            subset: v2-experimental
          weight: 10
      timeout: 60s
      retries:
        attempts: 2
        perTryTimeout: 30s
        retryOn: unavailable,resource-exhausted
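Promoting the canary means changing the two `weight` values on the default route. The sketch below is one possible way to generate that change as a JSON Patch; it is not part of the original skill. The function name `canary_patch` is hypothetical, and it assumes the weighted route stays the third entry in `spec.http` (index 2) with `v1-stable` listed before `v2-experimental`, as in the VirtualService above.

```shell
# Build a JSON Patch that sets the canary weight on the default route.
# Assumption: the weighted route is spec.http[2], with v1-stable at
# route[0] and v2-experimental at route[1].
canary_patch() {
  canary="$1"
  # Accept only plain integers of up to three digits with no leading zero.
  case "$canary" in
    ''|*[!0-9]*|0?*|????*)
      echo "canary weight must be an integer from 0 to 100" >&2
      return 1
      ;;
  esac
  if [ "$canary" -gt 100 ]; then
    echo "canary weight must be an integer from 0 to 100" >&2
    return 1
  fi
  # The stable subset receives whatever the canary does not.
  stable=$((100 - $canary))
  printf '[{"op":"replace","path":"/spec/http/2/route/0/weight","value":%d},{"op":"replace","path":"/spec/http/2/route/1/weight","value":%d}]\n' "$stable" "$canary"
}

# Example: print the patch for a 25% canary.
canary_patch 25
```

The output could then be applied with `kubectl patch virtualservice model-server -n ai-inference --type=json -p "$(canary_patch 25)"`. If routes are reordered, the index in the patch path has to change with them, which is why keeping the split in version-controlled YAML is often the safer option.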
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: model-server
  namespace: ai-inference
spec:
  host: model-server
  trafficPolicy:
    connectionPool:
      http:
        h2UpgradePolicy: UPGRADE
        maxRequestsPerConnection: 100
      tcp:
        maxConnections: 200
        connectTimeout: 5s
    loadBalancer:
      simple: LEAST_REQUEST
  subsets:
    - name: v1-stable
      labels:
        version: v1
      trafficPolicy:
        connectionPool:
          http:
            maxRequestsPerConnection: 50
    - name: v2-experimental
      labels:
        version: v2
      trafficPolicy:
        connectionPool:
          http:
            maxRequestsPerConnection: 20
# Istio uses only the first top-level trafficPolicy it finds for a host.
# In a real deployment, fold these settings into the single model-server
# DestinationRule instead of applying them as a separate resource.
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: model-server-circuit-breaker
  namespace: ai-inference
spec:
  host: model-server
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 100
        connectTimeout: 10s
      http:
        http1MaxPendingRequests: 50
        http2MaxRequests: 200
        maxRequestsPerConnection: 10
        maxRetries: 3
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 15s
      baseEjectionTime: 30s
      maxEjectionPercent: 50
      minHealthPercent: 30
      splitExternalLocalOriginErrors: true
---
# Separate circuit breaker for the vector retriever
apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: vector-retriever-circuit-breaker
  namespace: ai-inference
spec:
  host: vector-retriever
  trafficPolicy:
    connectionPool:
      tcp:
        maxConnections: 300
      http:
        http1MaxPendingRequests: 200
        http2MaxRequests: 500
    outlierDetection:
      consecutive5xxErrors: 5
      interval: 10s
      baseEjectionTime: 15s
      maxEjectionPercent: 30
# VirtualServices that share a host are merged only when bound to a gateway.
# For sidecar (mesh) traffic, add these routes to the model-server
# VirtualService, ahead of its catch-all route.
apiVersion: networking.istio.io/v1alpha3
kind: VirtualService
metadata:
  name: streaming-inference
  namespace: ai-inference
spec:
  hosts:
    - model-server
  http:
    # Streaming endpoint: no retries, long timeout
    - match:
        - uri:
            prefix: /v1/stream
      route:
        - destination:
            host: model-server
            subset: v1-stable
      timeout: 300s
      retries:
        attempts: 0
    # Embeddings endpoint: safe to retry, short timeout
    - match:
        - uri:
            prefix: /v1/embeddings
      route:
        - destination:
            host: model-server
            subset: v1-stable
      timeout: 15s
      retries:
        attempts: 3
        perTryTimeout: 5s
        retryOn: 5xx,reset,connect-failure,retriable-status-codes
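The route `timeout` covers the original request plus all retries, and `attempts` counts retries rather than total tries. A quick arithmetic check shows whether the configured retries can all complete inside the route timeout. This is a rough sketch that ignores the backoff between retries; the function names are hypothetical, and values are passed as whole seconds.

```shell
# Worst-case seconds spent across all tries, ignoring retry backoff.
# Istio's `attempts` is the number of retries, so tries = attempts + 1.
retry_budget_seconds() {
  attempts="$1"
  per_try="$2"
  echo $(( ($attempts + 1) * $per_try ))
}

# Compare the worst-case budget with the route timeout and report.
check_retry_budget() {
  name="$1"
  timeout="$2"
  attempts="$3"
  per_try="$4"
  budget=$(retry_budget_seconds "$attempts" "$per_try")
  if [ "$budget" -gt "$timeout" ]; then
    echo "$name: tries need up to ${budget}s but route timeout is ${timeout}s"
  else
    echo "$name: ${budget}s fits within ${timeout}s"
  fi
}

# Embeddings route: timeout 15s, attempts 3, perTryTimeout 5s.
check_retry_budget embeddings 15 3 5
# prints: embeddings: tries need up to 20s but route timeout is 15s
```

With these values, the last retry can be cut short by the 15s route timeout. Whether that matters depends on the workload: lowering `attempts` to 2 or raising `timeout` to 20s would let every try run to its own per-try limit.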
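apiVersion: networking.istio.io/v1alpha3
kind: DestinationRule
metadata:
  name: model-server-locality
  namespace: ai-inference
spec:
  host: model-server
  trafficPolicy:
    loadBalancer:
      localityLbSetting:
        enabled: true
        # distribute and failover are mutually exclusive; set only one.
        # Region failover is active here. To use weighted zone distribution
        # instead, remove failover and uncomment:
        # distribute:
        #   - from: "us-east-1/us-east-1a/*"
        #     to:
        #       "us-east-1/us-east-1a/*": 80
        #       "us-east-1/us-east-1b/*": 20
        failover:
          - from: us-east-1
            to: us-west-2
    # Locality failover relies on outlier detection to spot unhealthy endpoints
    outlierDetection:
      consecutive5xxErrors: 3
      interval: 10s
      baseEjectionTime: 30s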
# Telemetry resource for custom metrics on inference services
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
  name: inference-telemetry
  namespace: ai-inference
spec:
  metrics:
    - providers:
        - name: prometheus
      overrides:
        - match:
            metric: REQUEST_DURATION
            mode: CLIENT_AND_SERVER
          tagOverrides:
            model_name:
              operation: UPSERT
              value: "request.headers['x-model-name']"
            tenant_id:
              operation: UPSERT
              value: "request.headers['x-tenant-id']"
  tracing:
    - providers:
        - name: zipkin
      randomSamplingPercentage: 10.0
# Port-forward Kiali
kubectl port-forward svc/kiali -n istio-system 20001:20001 &
# Verify mesh health via API
curl -s http://localhost:20001/kiali/api/namespaces/ai-inference/health | jq .
# Check proxy sync status
istioctl proxy-status -n ai-inference
# Debug a specific pod sidecar config
istioctl proxy-config routes deploy/model-server -n ai-inference -o json
istioctl proxy-config cluster deploy/model-server -n ai-inference
holdApplicationUntilProxyStarts causing race conditions on startup