.archive/ops-team/skills/ops-disaster-recovery/SKILL.md
Structured workflow for disaster recovery planning, implementation, and testing including RTO/RPO definition, DR strategy selection, and failover procedures.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf): `npx skillsauth add lerianstudio/ring ops-disaster-recovery`
This skill defines the structured process for disaster recovery planning and testing. Use it for comprehensive DR strategy development and validation.
| Phase | Focus | Output |
|-------|-------|--------|
| 1. Business Impact | Define criticality and requirements | BIA document |
| 2. Strategy Selection | Choose appropriate DR strategy | DR strategy |
| 3. Architecture Design | Design DR infrastructure | DR architecture |
| 4. Runbook Development | Document failover procedures | DR runbooks |
| 5. Testing | Validate DR capabilities | Test report |
| 6. Maintenance | Keep DR current | Update schedule |

## Phase 1: Business Impact
Classify services by business criticality:
| Tier | Definition | RTO | RPO | Example Services |
|------|------------|-----|-----|------------------|
| Tier 1 | Critical - business cannot operate | <15 min | <1 min | Payment processing |
| Tier 2 | Important - significant impact | <1 hour | <15 min | Customer portal |
| Tier 3 | Standard - moderate impact | <4 hours | <1 hour | Internal tools |
| Tier 4 | Low - minimal impact | <24 hours | <24 hours | Dev environments |
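
The tier targets above can be checked mechanically against measured recovery figures. The following is a minimal sketch, not part of the original skill: the function names `tier_limits` and `meets_tier` are hypothetical, all values are in seconds, and the limits are treated as strict upper bounds because the table uses `<`.

```bash
# tier_limits TIER -> prints "<rto_limit_seconds> <rpo_limit_seconds>"
tier_limits() {
  case "$1" in
    1) echo "900 60" ;;        # <15 min, <1 min
    2) echo "3600 900" ;;      # <1 hour, <15 min
    3) echo "14400 3600" ;;    # <4 hours, <1 hour
    4) echo "86400 86400" ;;   # <24 hours, <24 hours
    *) return 1 ;;
  esac
}

# meets_tier TIER MEASURED_RTO_SECONDS MEASURED_RPO_SECONDS -> prints PASS or FAIL
meets_tier() {
  limits=$(tier_limits "$1") || { echo "unknown tier: $1" >&2; return 2; }
  rto_limit=${limits% *}   # text before the space
  rpo_limit=${limits#* }   # text after the space
  if [ "$2" -lt "$rto_limit" ] && [ "$3" -lt "$rpo_limit" ]; then
    echo PASS
  else
    echo FAIL
  fi
}

# A 12-minute RTO and 45-second RPO measured against Tier 1
meets_tier 1 720 45
```

Both figures must be under their limits for a PASS, so a service that recovers quickly but loses too much data still fails its tier.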
## Business Impact Analysis
**Assessment Date:** YYYY-MM-DD
**Assessed By:** [name]
### Service Classification
| Service | Business Function | Revenue Impact | Tier | RTO | RPO |
|---------|------------------|----------------|------|-----|-----|
| payment-api | Process transactions | $X,XXX/hour | 1 | 15 min | 1 min |
| customer-portal | Customer access | $XXX/hour | 2 | 1 hour | 15 min |
| admin-tools | Internal operations | $0/hour | 3 | 4 hours | 1 hour |
### Data Classification
| Data Type | Classification | Backup Frequency | Retention |
|-----------|---------------|------------------|-----------|
| Transaction data | Critical | Continuous | 7 years |
| Customer data | Important | Hourly | 3 years |
| Application logs | Standard | Daily | 90 days |
### Dependencies
| Service | Dependencies | DR Impact |
|---------|--------------|-----------|
| payment-api | Database, payment-gateway | All must fail over together |
| customer-portal | Database, auth-service | Sequential failover possible |

## Phase 2: Strategy Selection
| Strategy | RTO | RPO | Cost | Complexity | Best For |
|----------|-----|-----|------|------------|----------|
| Backup & Restore | Hours | Hours | $ | Low | Tier 4 services |
| Pilot Light | 30-60 min | Minutes | $$ | Medium | Tier 3 services |
| Warm Standby | 10-30 min | Seconds-Minutes | $$$ | Medium-High | Tier 2 services |
| Hot Standby | <10 min | Seconds | $$$$ | High | Tier 1 services |
| Multi-Active | Near-zero | Near-zero | $$$$$ | Very High | Ultra-critical |
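
As a rough illustration of how the strategy table can drive a first-pass choice, the sketch below picks the lowest-cost strategy whose worst-case RTO still fits a target. It is an assumption-laden simplification: `select_strategy` is a hypothetical helper, it looks only at RTO (not RPO, budget, or compliance), and it treats the "Hours" entry for Backup & Restore as a 4-hour (240 min) worst case.

```bash
# select_strategy TARGET_RTO_MINUTES -> prints a strategy name from the table.
# Thresholds are the upper ends of the RTO ranges in the table;
# 240 for Backup & Restore is an assumption, since the table only says "Hours".
select_strategy() {
  rto=$1
  if   [ "$rto" -ge 240 ]; then echo "Backup & Restore"
  elif [ "$rto" -ge 60 ];  then echo "Pilot Light"
  elif [ "$rto" -ge 30 ];  then echo "Warm Standby"
  elif [ "$rto" -ge 10 ];  then echo "Hot Standby"
  else                          echo "Multi-Active"
  fi
}

# A 45-minute target rules out Pilot Light (up to 60 min) but fits Warm Standby
select_strategy 45
```

The result is only a starting point for the requirements summary and trade-off discussion that follow in the selection template.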
## DR Strategy Selection
### Requirements Summary
| Requirement | Value |
|-------------|-------|
| Target RTO | [X minutes/hours] |
| Target RPO | [X minutes/hours] |
| Budget | $[X,XXX]/month for DR |
| Compliance | [frameworks] |
### Strategy Decision
**Selected Strategy:** [Pilot Light / Warm Standby / Hot Standby]
**Rationale:**
1. RTO requirement of [X] achieved by [strategy]
2. RPO requirement of [X] achieved with [replication method]
3. Budget of $[X]/month supports [strategy] (~XX% of production cost)
4. Compliance requirement for [X] met with [features]
### Trade-offs Accepted
| Trade-off | Impact | Mitigation |
|-----------|--------|------------|
| Higher DR cost | +$X/month | Justified by RTO requirement |
| Manual failover steps | 5-10 min added | Automation planned Q2 |

## Phase 3: Architecture Design
| Component | Primary | DR | Replication |
|-----------|---------|----|-------------|
| DNS | Route53 | Route53 | Global service |
| Load Balancer | ALB (us-east-1) | ALB (us-west-2) | Configuration sync |
| Compute | EKS (us-east-1) | EKS (us-west-2) | GitOps deployment |
| Database | Aurora (us-east-1) | Aurora Global (us-west-2) | Async replication |
| Storage | S3 (us-east-1) | S3 (us-west-2) | Cross-region replication |
| Secrets | Secrets Manager | Secrets Manager | Manual sync |

    Primary Region (us-east-1)             DR Region (us-west-2)
    ┌─────────────────────────┐         ┌─────────────────────────┐
    │                         │         │                         │
    │  ┌─────────────────┐    │         │  ┌─────────────────┐    │
    │  │      ALB        │    │         │  │      ALB        │    │
    │  └────────┬────────┘    │         │  └────────┬────────┘    │
    │           │             │         │           │ (standby)   │
    │  ┌────────┴────────┐    │         │  ┌────────┴────────┐    │
    │  │   EKS Cluster   │    │         │  │   EKS Cluster   │    │
    │  │    (Active)     │    │         │  │    (Standby)    │    │
    │  └────────┬────────┘    │         │  └────────┬────────┘    │
    │           │             │         │           │             │
    │  ┌────────┴────────┐    │  async  │  ┌────────┴────────┐    │
    │  │     Aurora      │────┼────────►│  │     Aurora      │    │
    │  │    (Primary)    │    │         │  │    (Replica)    │    │
    │  └─────────────────┘    │         │  └─────────────────┘    │
    │                         │         │                         │
    └─────────────────────────┘         └─────────────────────────┘
                 │                                   │
                 └───────────┬───────────────────────┘
                             │
                      ┌──────┴──────┐
                      │   Route53   │
                      │  (Global)   │
                      └─────────────┘


## Phase 4: Runbook Development
## Failover Runbook: [Service Name]
**Version:** 1.0
**Last Updated:** YYYY-MM-DD
**Owner:** [team]
### Pre-Conditions
- [ ] DR region healthy (check dashboard)
- [ ] Replication lag <[X seconds/minutes]
- [ ] On-call personnel available
- [ ] Communication channels ready
### Failover Decision Criteria
| Criteria | Automatic | Manual |
|----------|-----------|--------|
| Primary region unavailable >5 min | Yes | - |
| Replication lag >15 min | - | Yes |
| Data corruption detected | - | Yes |
| Planned maintenance | - | Yes |
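
The decision criteria above translate into a small check that monitoring or an on-call script could run. This is a sketch under stated assumptions: `failover_decision` and its argument order are hypothetical, and the automatic criterion is checked first, which the table does not specify.

```bash
# failover_decision OUTAGE_MIN LAG_MIN CORRUPTION PLANNED
#   OUTAGE_MIN  minutes the primary region has been unavailable
#   LAG_MIN     current replication lag in minutes
#   CORRUPTION  1 if data corruption was detected, else 0
#   PLANNED     1 for planned maintenance, else 0
# Prints "automatic", "manual" (a human decides), or "none".
failover_decision() {
  if [ "$1" -gt 5 ]; then
    echo automatic   # primary region unavailable >5 min
  elif [ "$2" -gt 15 ] || [ "$3" -eq 1 ] || [ "$4" -eq 1 ]; then
    echo manual      # lag >15 min, corruption, or planned maintenance
  else
    echo none
  fi
}

failover_decision 7 0 0 0    # primary down for 7 minutes
failover_decision 0 20 0 0   # healthy primary, 20 minutes of lag
```

Teams that prefer a human in the loop whenever corruption is suspected may want to reorder the branches so the manual criteria win.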
### Failover Steps
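1. **Verify DR Readiness** (2 min)
   ```bash
   # Check DR database status
   aws rds describe-db-clusters --region us-west-2
   # Check EKS cluster status
   kubectl --context=dr get nodes
   ```
2. **Stop Writes to Primary** (1 min)
   ```bash
   # Scale down primary services
   kubectl --context=primary scale deployment/api --replicas=0
   ```
3. **Promote DR Database** (5 min)
   ```bash
   # Promote the Aurora secondary cluster in the DR region.
   # --target-db-cluster-identifier expects the secondary cluster's ARN.
   # Without --allow-data-loss this runs as a switchover; add that flag for an
   # unplanned failover when the primary region is unreachable.
   aws rds failover-global-cluster \
     --global-cluster-identifier my-global-cluster \
     --target-db-cluster-identifier arn:aws:rds:us-west-2:<account-id>:cluster:dr-cluster
   ```
4. **Activate DR Services** (2 min)
   ```bash
   # Scale up DR services
   kubectl --context=dr scale deployment/api --replicas=10
   ```
5. **Update DNS** (1-5 min propagation)
   ```bash
   # Route53 failover records switch once the primary health check reports unhealthy.
   # --disabled makes Route53 treat the check as always healthy, so it does not move
   # traffic; --inverted flips the reported status and forces a check that still
   # passes to report unhealthy.
   aws route53 update-health-check \
     --health-check-id xxx \
     --inverted
   ```
6. **Verify Service** (5 min)
   ```bash
   # Health check (-f returns a non-zero exit code on HTTP errors)
   curl -fsS https://api.example.com/health
   # Synthetic transaction
   ./scripts/synthetic-test.sh
   ```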
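
Adding up the per-step estimates gives a quick check of whether the runbook can fit its RTO target at all. The sketch below is illustrative only: `runbook_budget` is a hypothetical helper, and the 30-minute target in the example is an assumed value, not one taken from the runbook.

```bash
# runbook_budget TARGET_RTO_MIN STEP_MIN... -> sums step estimates and compares to the target
runbook_budget() {
  target=$1
  shift
  total=0
  for minutes in "$@"; do
    total=$((total + minutes))
  done
  if [ "$total" -le "$target" ]; then
    echo "total ${total} min: within ${target} min target"
  else
    echo "total ${total} min: exceeds ${target} min target by $((total - target)) min"
  fi
}

# Step estimates from the runbook, with DNS propagation at its 5-minute worst case
runbook_budget 30 2 1 5 2 5 5
```

With a 15-minute target the same estimates come out 5 minutes over, which is the kind of gap a DR test should surface before a real incident does.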
### Rollback
[If failover causes issues, steps to return to primary]
### Communication
**Internal:**
DR failover initiated for [service] at [time UTC]. Estimated completion: [X minutes]. IC: [name]
**External (if customer-facing):**
We are currently experiencing issues with [service]. Our team is working to restore service. Status page: [url]
---
## Phase 5: Testing
### DR Test Types
| Test Type | Frequency | Scope | Impact |
|-----------|-----------|-------|--------|
| **Tabletop** | Quarterly | Full scenario walkthrough | None |
| **Component** | Monthly | Individual component failover | Minimal |
| **Partial** | Quarterly | Non-production failover | Low |
| **Full** | Annually | Production failover | Moderate |
### DR Test Template
```markdown
## DR Test Report
**Test Date:** YYYY-MM-DD
**Test Type:** [Tabletop/Component/Partial/Full]
**Scope:** [services tested]
### Test Objectives
1. Validate RTO of <[X minutes]
2. Validate RPO of <[X minutes]
3. Verify runbook accuracy
4. Identify gaps in DR readiness
### Test Results
| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| RTO | 15 min | 12 min | PASS |
| RPO | 1 min | 45 sec | PASS |
| Data integrity | 100% | 100% | PASS |
| Runbook accuracy | 100% | 85% | PARTIAL |
### Timeline
| Time | Action | Status |
|------|--------|--------|
| 10:00 | Test initiated | OK |
| 10:02 | Primary shutdown simulated | OK |
| 10:08 | DR database promoted | OK |
| 10:12 | DR services activated | OK |
| 10:15 | Service verified | OK |
### Issues Found
| Issue | Severity | Action Required |
|-------|----------|-----------------|
| Step 4 command incorrect | Medium | Update runbook |
| DNS propagation slower | Low | Reduce TTL |
### Lessons Learned
1. [Lesson 1]
2. [Lesson 2]
### Action Items
| Item | Owner | Due Date |
|------|-------|----------|
| Update runbook step 4 | @ops | YYYY-MM-DD |
| Reduce DNS TTL | @platform | YYYY-MM-DD |
```
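
The "Actual" RTO in a test report can be derived from the timeline rather than estimated by hand. This is a minimal sketch: `elapsed_minutes` is a hypothetical helper, it assumes both timestamps are `HH:MM` on the same day, and it assumes RTO is measured from test initiation to DR services activated, which matches the 12 min shown in the sample report.

```bash
# elapsed_minutes START END -> minutes between two HH:MM timestamps on the same day
elapsed_minutes() {
  start_h=${1%%:*}; start_m=${1##*:}
  end_h=${2%%:*};   end_m=${2##*:}
  # Strip one leading zero so values like 08 and 09 are not parsed as octal
  start_h=${start_h#0}; start_m=${start_m#0}
  end_h=${end_h#0};     end_m=${end_m#0}
  echo $(( (end_h * 60 + end_m) - (start_h * 60 + start_m) ))
}

# Test initiated at 10:00, DR services activated at 10:12
elapsed_minutes 10:00 10:12
```

Whichever start and end events a team chooses, stating them in the report keeps RTO figures comparable between tests.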
## Phase 6: Maintenance

| Activity | Frequency | Owner |
|----------|-----------|-------|
| Runbook review | Quarterly | Platform team |
| DR test | Per test schedule | SRE team |
| Replication monitoring | Daily (automated) | Monitoring |
| Cost review | Monthly | FinOps |
| Architecture review | Annually | Architecture team |

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "DR can be added later" | DR added later is rarely tested | DR is day-1 requirement |
| "Backups are good enough" | Backups != DR. RTO is hours vs minutes. | Design proper DR strategy |
| "Too expensive for DR" | DR cost << outage cost | Calculate business impact |
| "We'll figure it out during incident" | Panic != good decisions | Document runbooks NOW |
| "Tested last year, still good" | Systems change constantly | Test regularly |

For DR planning tasks, dispatch:

    Task tool:
      subagent_type: "ring:infrastructure-architect"
      prompt: |
        DR PLANNING REQUEST
        Services: [services requiring DR]
        RTO Requirement: [target]
        RPO Requirement: [target]
        Current State: [existing DR if any]
        REQUEST: [design/review/test planning]