devops/release/blue-green-deploy/SKILL.md
Configure zero-downtime deployment strategies including blue-green, canary, and rolling deployments. Implement traffic shifting, health checks, and rollback procedures. Use when implementing production deployment strategies or zero-downtime releases.
Implement zero-downtime deployment patterns for production systems.
Use this skill when implementing production deployment strategies or zero-downtime releases.
┌────────────────────────────────────────────────────────────┐
│                   DEPLOYMENT STRATEGIES                    │
├─────────────┬─────────────┬─────────────┬──────────────────┤
│ Blue-Green  │ Canary      │ Rolling     │ Recreate         │
├─────────────┼─────────────┼─────────────┼──────────────────┤
│ Full env    │ Gradual %   │ Pod by pod  │ All at once      │
│ swap        │ rollout     │ replacement │                  │
├─────────────┼─────────────┼─────────────┼──────────────────┤
│ Instant     │ Slow, safe  │ Moderate    │ Fast, risky      │
│ rollback    │ rollback    │ rollback    │                  │
├─────────────┼─────────────┼─────────────┼──────────────────┤
│ 2x resources│ +10-25%     │ Same        │ Same             │
│ needed      │ resources   │ resources   │                  │
└─────────────┴─────────────┴─────────────┴──────────────────┘
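The resource row of the table can be read as rough capacity arithmetic. The sketch below is illustrative only: `peak_pods` and its parameters are hypothetical names, the canary default assumes the upper end of the +10-25% range, and the rolling case adds whatever surge the rollout allows (zero by default, matching "Same resources").

```python
import math


def peak_pods(strategy: str, replicas: int,
              canary_fraction: float = 0.25, max_surge: int = 0) -> int:
    """Estimate the most pods that run at the same time during a release."""
    if strategy == "blue-green":
        # Both environments run at full size until the old one is retired.
        return 2 * replicas
    if strategy == "canary":
        # The stable set stays in place while a fraction of canary pods is added.
        return replicas + math.ceil(replicas * canary_fraction)
    if strategy == "rolling":
        # Pods are replaced in place; only the allowed surge is extra.
        return replicas + max_surge
    if strategy == "recreate":
        # Old pods are stopped before new ones start.
        return replicas
    raise ValueError(f"unknown strategy: {strategy!r}")
```

For a 3-replica service, `peak_pods("blue-green", 3)` gives the cluster headroom needed before the green environment can be started.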
Before:
┌─────────┐     ┌───────────────┐
│  Users  │────▶│   Blue (v1)   │ ◀── Active
└─────────┘     └───────────────┘
                ┌───────────────┐
                │  Green (v2)   │ ◀── Staging
                └───────────────┘

After Switch:
┌─────────┐     ┌───────────────┐
│  Users  │     │   Blue (v1)   │ ◀── Standby
└─────────┘     └───────────────┘
     │          ┌───────────────┐
     └─────────▶│  Green (v2)   │ ◀── Active
                └───────────────┘
# blue-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-blue
  labels:
    app: myapp
    version: blue
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: blue
  template:
    metadata:
      labels:
        app: myapp
        version: blue
    spec:
      containers:
        - name: myapp
          image: myapp:v1.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
---
# green-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp-green
  labels:
    app: myapp
    version: green
spec:
  replicas: 3
  selector:
    matchLabels:
      app: myapp
      version: green
  template:
    metadata:
      labels:
        app: myapp
        version: green
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
---
# service.yaml - Switch by changing selector
apiVersion: v1
kind: Service
metadata:
  name: myapp
spec:
  selector:
    app: myapp
    version: blue  # Change to 'green' to switch
  ports:
    - port: 80
      targetPort: 8080
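#!/bin/bash
# blue-green-switch.sh - point the myapp Service at the given color
# Usage: ./blue-green-switch.sh <blue|green>
set -euo pipefail

NEW_VERSION=${1:?"usage: $0 <blue|green>"}
CURRENT=$(kubectl get svc myapp -o jsonpath='{.spec.selector.version}')

echo "Current version: $CURRENT"
echo "Switching to: $NEW_VERSION"

# Verify new deployment is ready (set -e aborts the script if the rollout fails)
kubectl rollout status "deployment/myapp-$NEW_VERSION"

# Check health from inside a pod. No -it here: a TTY breaks command substitution.
# Assumes curl exists in the image and /health returns the literal body "ok".
HEALTH=$(kubectl exec "deployment/myapp-$NEW_VERSION" -- curl -s localhost:8080/health || true)
if [ "$HEALTH" != "ok" ]; then
  echo "Health check failed"
  exit 1
fi

# Switch traffic
kubectl patch svc myapp -p "{\"spec\":{\"selector\":{\"version\":\"$NEW_VERSION\"}}}"
echo "Switched to $NEW_VERSION"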
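The switch script builds its patch body by string interpolation, which accepts any argument, including a typo. As a minimal sketch of a stricter approach, the helpers below validate the color and derive the rollback target. The names `COLORS`, `other_color`, and `selector_patch` are hypothetical and not part of the skill.

```python
import json

# Only these two version labels exist in the manifests above.
COLORS = ("blue", "green")


def other_color(current: str) -> str:
    """Return the idle environment for the given active one."""
    if current not in COLORS:
        raise ValueError(f"unknown version label: {current!r}")
    return "green" if current == "blue" else "blue"


def selector_patch(target: str) -> str:
    """Build the patch body that points the Service at `target`."""
    if target not in COLORS:
        raise ValueError(f"unknown version label: {target!r}")
    return json.dumps({"spec": {"selector": {"version": target}}})
```

The string returned by `selector_patch("green")` can be passed to `kubectl patch svc myapp -p`, and `other_color` gives the value to patch back to if the new environment misbehaves.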
# AWS CodeDeploy appspec.yml
version: 0.0
Resources:
  - TargetService:
      Type: AWS::ECS::Service
      Properties:
        TaskDefinition: "arn:aws:ecs:region:account:task-definition/myapp:2"
        LoadBalancerInfo:
          ContainerName: "myapp"
          ContainerPort: 8080
Hooks:
  - BeforeInstall: "LambdaFunctionToValidateBeforeTrafficShift"
  - AfterInstall: "LambdaFunctionToValidateAfterTrafficShift"
  - AfterAllowTestTraffic: "LambdaFunctionToValidateTestTraffic"
  - BeforeAllowTraffic: "LambdaFunctionToValidateBeforeAllowTraffic"
  - AfterAllowTraffic: "LambdaFunctionToValidateAfterAllowTraffic"
# VirtualService for traffic splitting
apiVersion: networking.istio.io/v1beta1
kind: VirtualService
metadata:
  name: myapp
spec:
  hosts:
    - myapp
  http:
    # Requests carrying "x-canary: true" always reach the canary subset
    - match:
        - headers:
            x-canary:
              exact: "true"
      route:
        - destination:
            host: myapp
            subset: canary
    # All other requests are split 90/10 between stable and canary
    - route:
        - destination:
            host: myapp
            subset: stable
          weight: 90
        - destination:
            host: myapp
            subset: canary
          weight: 10
---
apiVersion: networking.istio.io/v1beta1
kind: DestinationRule
metadata:
  name: myapp
spec:
  host: myapp
  subsets:
    - name: stable
      labels:
        version: stable
    - name: canary
      labels:
        version: canary
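Shifting traffic means editing the two `weight` values in the default route so that they still add up to 100. A small sketch of generating that route list for a given canary percentage follows; `weighted_routes` is a hypothetical helper, and the `stable` and `canary` subset names are taken from the DestinationRule above.

```python
def weighted_routes(canary_percent: int, host: str = "myapp") -> list:
    """Return the default-route entries for the given canary share of traffic."""
    if not 0 <= canary_percent <= 100:
        raise ValueError("canary_percent must be between 0 and 100")
    return [
        # Stable receives whatever the canary does not.
        {"destination": {"host": host, "subset": "stable"},
         "weight": 100 - canary_percent},
        {"destination": {"host": host, "subset": "canary"},
         "weight": canary_percent},
    ]
```

Serializing the result of `weighted_routes(10)` to YAML reproduces the 90/10 split shown in the VirtualService.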
apiVersion: argoproj.io/v1alpha1
kind: Rollout
metadata:
  name: myapp
spec:
  replicas: 5
  strategy:
    canary:
      steps:
        - setWeight: 10
        - pause: {duration: 5m}
        - setWeight: 25
        - pause: {duration: 5m}
        - setWeight: 50
        - pause: {duration: 5m}
        - setWeight: 75
        - pause: {duration: 5m}
      analysis:
        templates:
          - templateName: success-rate
        startingStep: 2
        args:
          - name: service-name
            value: myapp
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0
          ports:
            - containerPort: 8080
---
apiVersion: argoproj.io/v1alpha1
kind: AnalysisTemplate
metadata:
  name: success-rate
spec:
  args:
    - name: service-name
  metrics:
    - name: success-rate
      interval: 1m
      successCondition: result[0] >= 0.95
      failureLimit: 3
      provider:
        prometheus:
          address: http://prometheus:9090
          query: |
            sum(rate(http_requests_total{service="{{args.service-name}}",status=~"2.*"}[5m]))
            /
            sum(rate(http_requests_total{service="{{args.service-name}}"}[5m]))
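To make the interaction between `successCondition` and `failureLimit` concrete, here is a simplified model of how a series of measurements is judged. It assumes a measurement fails whenever the success rate is below the threshold, and that the metric as a whole fails once the number of failed measurements exceeds `failureLimit`. Inconclusive results and provider errors are ignored, and `metric_failed` is a hypothetical name, not an Argo Rollouts API.

```python
def metric_failed(measurements, threshold: float = 0.95,
                  failure_limit: int = 3) -> bool:
    """Return True once more than `failure_limit` measurements miss the threshold."""
    failures = 0
    for value in measurements:
        if value < threshold:
            failures += 1
            # The limit is the number of failures tolerated, so the
            # metric only fails on the failure after that.
            if failures > failure_limit:
                return True
    return False
```

With the template's values, three measurements below 0.95 are tolerated and a fourth would abort the rollout.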
apiVersion: apps/v1
kind: Deployment
metadata:
  name: myapp
spec:
  replicas: 5
  strategy:
    type: RollingUpdate
    rollingUpdate:
      maxSurge: 1        # Max pods above desired
      maxUnavailable: 0  # Max pods unavailable
  selector:
    matchLabels:
      app: myapp
  template:
    metadata:
      labels:
        app: myapp
    spec:
      containers:
        - name: myapp
          image: myapp:v2.0.0
          ports:
            - containerPort: 8080
          readinessProbe:
            httpGet:
              path: /ready
              port: 8080
            initialDelaySeconds: 5
            periodSeconds: 5
          livenessProbe:
            httpGet:
              path: /health
              port: 8080
            initialDelaySeconds: 10
            periodSeconds: 10
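The `maxSurge` and `maxUnavailable` settings bound how many pods exist and how many stay available while the rollout runs. The sketch below works out those bounds, including the percentage form, where Kubernetes rounds `maxSurge` up and `maxUnavailable` down. The function names are hypothetical.

```python
import math


def resolve(value, replicas: int, round_up: bool) -> int:
    """Turn an absolute count or an 'N%' string into a pod count."""
    if isinstance(value, str) and value.endswith("%"):
        raw = replicas * int(value[:-1]) / 100
        return math.ceil(raw) if round_up else math.floor(raw)
    return int(value)


def rollout_bounds(replicas: int, max_surge, max_unavailable):
    """Return (maximum total pods, minimum available pods) during a rollout."""
    surge = resolve(max_surge, replicas, round_up=True)
    unavailable = resolve(max_unavailable, replicas, round_up=False)
    return replicas + surge, replicas - unavailable
```

For the Deployment above, `rollout_bounds(5, 1, 0)` shows that capacity never drops below the desired 5 pods, at the cost of one extra pod while the rollout is in progress.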
# Update image
kubectl set image deployment/myapp myapp=myapp:v2.0.0
# Watch rollout
kubectl rollout status deployment/myapp
# Pause rollout
kubectl rollout pause deployment/myapp
# Resume rollout
kubectl rollout resume deployment/myapp
# Rollback
kubectl rollout undo deployment/myapp
# Rollback to specific revision
kubectl rollout undo deployment/myapp --to-revision=2
# View history
kubectl rollout history deployment/myapp
# Flask health endpoints
import os

from flask import Flask, jsonify
import psycopg2
import redis

app = Flask(__name__)

# Connection strings are read from the environment
# (the variable names are an assumption; adjust to your configuration)
DATABASE_URL = os.environ["DATABASE_URL"]
REDIS_URL = os.environ["REDIS_URL"]


@app.route('/health')
def health():
    """Liveness probe - is the app running?"""
    return jsonify({'status': 'healthy'}), 200


@app.route('/ready')
def ready():
    """Readiness probe - can the app serve traffic?"""
    checks = {}

    # Database check (short timeout so the probe cannot hang)
    try:
        conn = psycopg2.connect(DATABASE_URL, connect_timeout=3)
        conn.close()
        checks['database'] = 'ok'
    except Exception as e:
        checks['database'] = str(e)
        return jsonify({'status': 'unhealthy', 'checks': checks}), 503

    # Redis check
    try:
        r = redis.from_url(REDIS_URL, socket_connect_timeout=3)
        r.ping()
        checks['redis'] = 'ok'
    except Exception as e:
        checks['redis'] = str(e)
        return jsonify({'status': 'unhealthy', 'checks': checks}), 503

    return jsonify({'status': 'healthy', 'checks': checks}), 200
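The readiness endpoint stops at the first failing dependency, so its response never reports on the checks after it. One possible variation, sketched here with only the standard library, runs every check and reports them all. `readiness` and the check callables are hypothetical; each callable is assumed to raise an exception on failure.

```python
from typing import Callable, Dict, Tuple


def readiness(checks: Dict[str, Callable[[], None]]) -> Tuple[int, dict]:
    """Run every check and return an HTTP status code with a JSON-ready body."""
    results = {}
    healthy = True
    for name, check in checks.items():
        try:
            check()
            results[name] = "ok"
        except Exception as exc:
            # Record the failure but keep going so all checks are reported.
            results[name] = str(exc)
            healthy = False
    status = "healthy" if healthy else "unhealthy"
    return (200 if healthy else 503), {"status": status, "checks": results}
```

Inside a Flask view, the pair can be unpacked as `code, body = readiness(...)` and returned as `jsonify(body), code`.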
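#!/bin/bash
# auto-rollback.sh - roll back a deployment when the success rate drops
DEPLOYMENT=${1:?"usage: $0 <deployment>"}
THRESHOLD=0.95
INTERVAL=60
QUERY='sum(rate(http_requests_total{status=~"2.*"}[5m]))/sum(rate(http_requests_total[5m]))'

echo "Monitoring deployment $DEPLOYMENT"
while true; do
  # Get success rate from Prometheus. -G with --data-urlencode encodes the
  # braces, brackets, and quotes that curl's URL globbing would otherwise misparse.
  SUCCESS_RATE=$(curl -sG "http://prometheus:9090/api/v1/query" \
    --data-urlencode "query=$QUERY" | jq -r '.data.result[0].value[1] // empty')

  if [ -z "$SUCCESS_RATE" ] || [ "$SUCCESS_RATE" = "NaN" ]; then
    # No traffic or no data yet: do not treat this as a failure
    echo "No success-rate data available; retrying"
  else
    echo "Current success rate: $SUCCESS_RATE"
    if (( $(echo "$SUCCESS_RATE < $THRESHOLD" | bc -l) )); then
      echo "Success rate below threshold! Rolling back..."
      kubectl rollout undo "deployment/$DEPLOYMENT"
      exit 1
    fi
  fi
  sleep "$INTERVAL"
done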
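The rollback decision depends on reading one number out of the Prometheus query response, and on what to do when that number is missing. The sketch below isolates that logic so it can be tested without a cluster. `success_rate` and `should_roll_back` are hypothetical names, and treating missing data as "do not roll back" is a design assumption rather than something the script prescribes.

```python
import json
import math
from typing import Optional


def success_rate(body: str) -> Optional[float]:
    """Extract the value from a Prometheus instant-query response, or None."""
    payload = json.loads(body)
    results = payload.get("data", {}).get("result", [])
    if payload.get("status") != "success" or not results:
        return None
    # Each sample is [timestamp, "value-as-string"].
    value = float(results[0]["value"][1])
    # A 0/0 division in the query yields NaN when there is no traffic.
    return None if math.isnan(value) else value


def should_roll_back(rate: Optional[float], threshold: float = 0.95) -> bool:
    """Roll back only when a real measurement is below the threshold."""
    return rate is not None and rate < threshold
```

Keeping the parsing separate from the `kubectl rollout undo` call makes it straightforward to replay recorded responses against a new threshold before changing it in production.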
## Rollback Checklist
### Before Rollback
- [ ] Confirm issue is deployment-related
- [ ] Document current error rates
- [ ] Notify team in #deployments channel
### During Rollback
- [ ] Execute rollback command
- [ ] Monitor rollback progress
- [ ] Verify old version is serving traffic
### After Rollback
- [ ] Confirm error rates normalized
- [ ] Update incident ticket
- [ ] Schedule post-mortem
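- **Problem:** Rollout takes too long. **Solution:** Increase `maxSurge` and decrease `minReadySeconds`.
- **Problem:** Pods not becoming ready. **Solution:** Check that the probe paths and ports match the endpoints the app serves, and increase the probe `initialDelaySeconds` or `timeoutSeconds`.
- **Problem:** Errors during switch. **Solution:** Enable connection draining and implement graceful shutdown so in-flight requests finish before old pods stop.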