compliance/continuity/runbook-creation/SKILL.md
Create operational runbooks and standard operating procedures. Document troubleshooting guides and recovery procedures. Use when documenting operational knowledge.
npx skillsauth add bagelhole/devops-security-agent-skills runbook-creationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Create effective operational runbooks, standard operating procedures, and troubleshooting guides that any on-call engineer can follow under pressure.
# Runbook: [Service / Process Name]
**Owner:** [Team or individual]
**Last Reviewed:** YYYY-MM-DD
**Version:** X.Y
**Severity if unavailable:** SEV[1-4]
---
## Overview
Brief description of the service, why this runbook exists, and when to
use it.
## Prerequisites
- [ ] Required access / IAM role: [details]
- [ ] Tools installed: [kubectl, aws-cli, psql, etc.]
- [ ] VPN connected to [environment]
- [ ] Communication channel open: [Slack #channel]
## Procedure
### Step 1 — [Action Name]
[Explanation of what this step does and why.]
```bash
# command here
```
**Expected output:** [describe what success looks like]
### Step 2 — [Action Name]
```bash
# command here
```
**Expected output:** [description]
*(Continue with numbered steps...)*
## Verification
How to confirm the procedure succeeded:
- [ ] [Check 1 — e.g., health endpoint returns 200]
- [ ] [Check 2 — e.g., no errors in logs for 5 minutes]
- [ ] [Check 3 — e.g., metrics return to baseline]
## Rollback
If the procedure fails or causes unexpected issues:
### Rollback Step 1
```bash
# rollback command
```
### Rollback Step 2
```bash
# rollback command
```
## Troubleshooting
| Symptom | Likely Cause | Resolution |
|---------|-------------|------------|
| [symptom 1] | [cause] | [fix] |
| [symptom 2] | [cause] | [fix] |
## Escalation
If unresolved after [X] minutes:
- **Primary:** @[team-lead] — [phone/Slack]
- **Secondary:** @[manager] — [phone/Slack]
## Related Runbooks
- [Link to related runbook 1]
- [Link to related runbook 2]
## Change Log
| Date | Author | Change |
|------|--------|--------|
| YYYY-MM-DD | [Name] | Initial version |
# Runbook: PostgreSQL Database Failover
**Owner:** Platform / DBA team
**Last Reviewed:** 2025-06-15
**Version:** 2.1
**Severity if unavailable:** SEV1
---
## Overview
Failover the primary PostgreSQL instance to the synchronous replica when
the primary is unreachable or degraded. This runbook covers both planned
(maintenance) and unplanned (emergency) failover.
## Prerequisites
- [ ] DBA or SRE-level access to primary and replica hosts
- [ ] `psql` client installed (v14+)
- [ ] VPN connected to production network
- [ ] Slack channel #db-ops open
- [ ] Confirm replica is in sync: replication lag < 1 MB
## Procedure
### Step 1 — Verify Replica Health
```bash
psql -h replica.db.internal -U dba -d postgres -c \
"SELECT pg_is_in_recovery(), pg_last_wal_replay_lsn();"
```
**Expected output:** `pg_is_in_recovery = t`, LSN advancing.
### Step 2 — Stop Application Writes
```bash
kubectl scale deployment api-server --replicas=0 -n production
kubectl scale deployment worker --replicas=0 -n production
```
**Expected output:** Deployments scaled to 0 pods.
### Step 3 — Confirm Write Quiesce
```bash
psql -h primary.db.internal -U dba -d postgres -c \
"SELECT count(*) FROM pg_stat_activity WHERE state = 'active' AND query !~ 'pg_stat';"
```
**Expected output:** Count = 0 (no active queries).
### Step 4 — Promote Replica
```bash
psql -h replica.db.internal -U dba -d postgres -c "SELECT pg_promote();"
```
Wait up to 30 seconds, then confirm:
```bash
psql -h replica.db.internal -U dba -d postgres -c "SELECT pg_is_in_recovery();"
```
**Expected output:** `pg_is_in_recovery = f` (no longer a replica).
### Step 5 — Update DNS
```bash
aws route53 change-resource-record-sets \
--hosted-zone-id Z1234567890 \
--change-batch '{
"Changes": [{
"Action": "UPSERT",
"ResourceRecordSet": {
"Name": "db.internal.example.com",
"Type": "CNAME",
"TTL": 60,
"ResourceRecords": [{"Value": "replica.db.internal"}]
}
}]
}'
```
### Step 6 — Restart Application
```bash
kubectl scale deployment api-server --replicas=6 -n production
kubectl scale deployment worker --replicas=4 -n production
```
## Verification
- [ ] `psql -h db.internal.example.com -c "SELECT 1;"` returns successfully
- [ ] Application logs show successful DB connections (no errors for 5 min)
- [ ] Transaction throughput returns to baseline on Grafana dashboard
- [ ] No replication-lag alerts firing
## Rollback
If the promoted replica has issues, restore from the most recent backup:
```bash
# Restore latest automated snapshot (RDS example)
aws rds restore-db-instance-from-db-snapshot \
--db-instance-identifier prod-db-restored \
--db-snapshot-identifier prod-db-latest-snapshot
```
## Escalation
If unresolved after 15 minutes:
- **Primary:** @dba-lead — +1-555-0101
- **Secondary:** @platform-oncall — +1-555-0102
#!/usr/bin/env bash
# health-check.sh — Check health of critical services
set -euo pipefail
SERVICES=(
"https://api.example.com/healthz"
"https://app.example.com/healthz"
"https://admin.example.com/healthz"
)
EXIT_CODE=0
for url in "${SERVICES[@]}"; do
HTTP_CODE=$(curl -so /dev/null -w '%{http_code}' --max-time 5 "$url" 2>/dev/null || echo "000")
if [ "$HTTP_CODE" -eq 200 ]; then
printf " OK %s\n" "$url"
else
printf " FAIL %s (HTTP %s)\n" "$url" "$HTTP_CODE"
EXIT_CODE=1
fi
done
exit $EXIT_CODE
#!/usr/bin/env bash
# collect-logs.sh — Gather logs from multiple sources for incident review
set -euo pipefail
INCIDENT_ID="${1:?Usage: collect-logs.sh <incident-id>}"
OUTDIR="/tmp/incident-${INCIDENT_ID}"
mkdir -p "$OUTDIR"
echo "Collecting logs for incident $INCIDENT_ID..."
# Kubernetes pod logs (last 30 min)
kubectl logs -l app=api-server -n production --since=30m \
> "${OUTDIR}/api-server-pods.log" 2>&1
# CloudWatch Logs (last 30 min)
aws logs filter-log-events \
--log-group-name /ecs/production/api \
--start-time "$(date -d '30 minutes ago' +%s)000" \
--output text > "${OUTDIR}/cloudwatch-api.log" 2>&1
# Database slow query log
psql -h db.internal -U dba -d postgres -c \
"SELECT * FROM pg_stat_activity WHERE state != 'idle' ORDER BY query_start;" \
> "${OUTDIR}/db-active-queries.log" 2>&1
# System resource snapshot
kubectl top pods -n production > "${OUTDIR}/pod-resources.log" 2>&1
echo "Logs saved to $OUTDIR"
tar czf "${OUTDIR}.tar.gz" -C /tmp "incident-${INCIDENT_ID}"
echo "Archive: ${OUTDIR}.tar.gz"
#!/usr/bin/env bash
# cert-check.sh — Warn if TLS certificates expire within 30 days
set -euo pipefail
DOMAINS=(
"api.example.com"
"app.example.com"
"admin.example.com"
)
WARN_DAYS=30
TODAY=$(date +%s)
EXIT_CODE=0
for domain in "${DOMAINS[@]}"; do
EXPIRY=$(echo | openssl s_client -servername "$domain" -connect "${domain}:443" 2>/dev/null \
| openssl x509 -noout -enddate 2>/dev/null | cut -d= -f2)
EXPIRY_EPOCH=$(date -d "$EXPIRY" +%s 2>/dev/null || echo 0)
DAYS_LEFT=$(( (EXPIRY_EPOCH - TODAY) / 86400 ))
if [ "$DAYS_LEFT" -lt "$WARN_DAYS" ]; then
printf " WARN %s expires in %d days (%s)\n" "$domain" "$DAYS_LEFT" "$EXPIRY"
EXIT_CODE=1
else
printf " OK %s — %d days remaining\n" "$domain" "$DAYS_LEFT"
fi
done
exit $EXIT_CODE
#!/usr/bin/env bash
# disk-cleanup.sh — Free disk space on a host
set -euo pipefail
echo "=== Disk Usage Before ==="
df -h /
# Remove old journal logs (> 7 days)
journalctl --vacuum-time=7d 2>/dev/null || true
# Clean Docker artifacts
docker system prune -f --volumes 2>/dev/null || true
# Remove old log files
find /var/log -name "*.gz" -mtime +7 -delete 2>/dev/null || true
find /tmp -type f -mtime +3 -delete 2>/dev/null || true
echo "=== Disk Usage After ==="
df -h /
Use this checklist every time a runbook is created or updated.
content_review:
- [ ] Title clearly identifies the service and operation
- [ ] Overview explains WHEN and WHY to use this runbook
- [ ] Prerequisites list all required access, tools, and setup
- [ ] Every step has a concrete command (no vague instructions)
- [ ] Expected output is documented for each step
- [ ] Verification section confirms success with specific checks
- [ ] Rollback section exists and has been tested
- [ ] Escalation contacts are current (names, phones, Slack handles)
- [ ] Troubleshooting table covers the top 3-5 known failure modes
usability_review:
- [ ] A new team member can follow the runbook without tribal knowledge
- [ ] Steps are numbered and sequential (no branching without clear labels)
- [ ] Commands can be copy-pasted (no placeholder values without explanation)
- [ ] Time estimates included for long-running steps
- [ ] No jargon or acronyms used without definition
maintenance_review:
- [ ] Owner and last-reviewed date are set
- [ ] Version number incremented
- [ ] Change log entry added
- [ ] Related runbooks section is up to date
- [ ] Links to dashboards and docs are valid (not broken)
testing_strategy:
dry_run:
frequency: "Every time a runbook is created or substantially edited"
method: "Walk through each step in a staging environment"
goal: "Verify commands work and output matches documentation"
peer_review:
frequency: "Every edit"
method: "Another engineer follows the runbook in staging without help"
goal: "Confirm the runbook is self-contained and unambiguous"
scheduled_validation:
frequency: "Quarterly"
method: "SRE team picks 5 runbooks at random, executes in staging"
goal: "Catch runbooks that have drifted from production reality"
incident_triggered:
trigger: "Any time a runbook is used in a real incident"
method: "Post-mortem includes runbook accuracy assessment"
goal: "Capture improvements while the experience is fresh"
automation_testing:
method: "CI pipeline validates bash scripts with shellcheck and dry-run"
example: |
# .github/workflows/runbook-lint.yml
name: Lint Runbook Scripts
on: [pull_request]
jobs:
shellcheck:
runs-on: ubuntu-latest
steps:
- uses: actions/checkout@v4
- name: ShellCheck
run: |
find runbooks/ -name "*.sh" -exec shellcheck {} +
versioning:
storage: "Git repository — one directory per service, one file per runbook"
naming: "runbooks/<service>/<operation>.md"
branching: "PRs required for all changes; reviewed by service owner"
version_scheme:
format: "MAJOR.MINOR"
major_bump: "Procedure changes that alter the steps or their order"
minor_bump: "Clarifications, typo fixes, updated contact info"
directory_layout: |
runbooks/
api-server/
deploy.md
rollback.md
scale-up.md
database/
failover.md
backup-restore.md
vacuum-maintenance.md
infrastructure/
dns-update.md
certificate-renewal.md
disk-cleanup.md
review_requirements:
- PR must be approved by the service owner
- CI must pass (shellcheck for scripts, markdown lint)
- Reviewer confirms they can follow the steps independently
retention: "Git history serves as full audit trail — never delete old versions"
Keep a top-level index so engineers can find the right runbook quickly.
# Runbook Index
| Service | Runbook | Severity | Owner | Last Tested |
|---------|---------|----------|-------|-------------|
| API Server | [Deploy](api-server/deploy.md) | — | @platform | 2025-05-01 |
| API Server | [Rollback](api-server/rollback.md) | SEV1 | @platform | 2025-05-01 |
| Database | [Failover](database/failover.md) | SEV1 | @dba | 2025-04-15 |
| Database | [Backup Restore](database/backup-restore.md) | SEV2 | @dba | 2025-04-15 |
| Infra | [DNS Update](infrastructure/dns-update.md) | SEV2 | @sre | 2025-06-01 |
| Infra | [Cert Renewal](infrastructure/certificate-renewal.md) | SEV3 | @sre | 2025-06-01 |
development
Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
testing
Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.
devops
Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.
testing
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.