# Disaster Recovery Workflow

This skill defines the structured process for disaster recovery planning and testing. Use it for comprehensive DR strategy development and validation.

## DR Planning Phases

| Phase | Focus | Output |
|-------|-------|--------|
| 1. Business Impact | Define criticality and requirements | BIA document |
| 2. Strategy Selection | Choose appropriate DR strategy | DR strategy |
| 3. Architecture Design | Design DR infrastructure | DR architecture |
| 4. Runbook Development | Document failover procedures | DR runbooks |
| 5. Testing | Validate DR capabilities | Test report |
| 6. Maintenance | Keep DR current | Update schedule |

## Phase 1: Business Impact Analysis

### Service Classification

Classify services by business criticality:

| Tier | Definition | RTO | RPO | Example Services |
|------|------------|-----|-----|------------------|
| Tier 1 | Critical - business cannot operate | <15 min | <1 min | Payment processing |
| Tier 2 | Important - significant impact | <1 hour | <15 min | Customer portal |
| Tier 3 | Standard - moderate impact | <4 hours | <1 hour | Internal tools |
| Tier 4 | Low - minimal impact | <24 hours | <24 hours | Dev environments |
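The tier table can be read as a simple lookup. The sketch below is one possible reading and not part of the skill itself: it assumes every bound is inclusive, converts all bounds to minutes, and picks the least strict tier whose RTO and RPO both fit within what the business can tolerate. The names `Tier`, `TIERS`, and `classify_service` are hypothetical.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class Tier:
    name: str
    rto_minutes: float  # maximum time to restore service
    rpo_minutes: float  # maximum window of data loss


# Bounds copied from the tier table, converted to minutes.
# Assumption: "<15 min" is treated as "at most 15 minutes".
TIERS = [
    Tier("Tier 1", 15, 1),
    Tier("Tier 2", 60, 15),
    Tier("Tier 3", 240, 60),
    Tier("Tier 4", 1440, 1440),
]


def classify_service(tolerable_rto_min: float,
                     tolerable_rpo_min: float) -> Optional[Tier]:
    """Return the least strict tier that satisfies both tolerances."""
    for tier in reversed(TIERS):  # try the least demanding tier first
        if (tier.rto_minutes <= tolerable_rto_min
                and tier.rpo_minutes <= tolerable_rpo_min):
            return tier
    return None  # tighter than Tier 1: needs a case-by-case decision


if __name__ == "__main__":
    tier = classify_service(tolerable_rto_min=120, tolerable_rpo_min=30)
    print(tier.name if tier else "no standard tier fits")
```

A service that tolerates two hours of downtime and 30 minutes of data loss lands in Tier 2 here, because Tier 3's 4-hour RTO would be too slow for it.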

### BIA Template

## Business Impact Analysis

**Assessment Date:** YYYY-MM-DD
**Assessed By:** [name]

### Service Classification

| Service | Business Function | Revenue Impact | Tier | RTO | RPO |
|---------|------------------|----------------|------|-----|-----|
| payment-api | Process transactions | $X,XXX/hour | 1 | 15 min | 1 min |
| customer-portal | Customer access | $XXX/hour | 2 | 1 hour | 15 min |
| admin-tools | Internal operations | $0/hour | 3 | 4 hours | 1 hour |

### Data Classification

| Data Type | Classification | Backup Frequency | Retention |
|-----------|---------------|------------------|-----------|
| Transaction data | Critical | Continuous | 7 years |
| Customer data | Important | Hourly | 3 years |
| Application logs | Standard | Daily | 90 days |

### Dependencies

| Service | Dependencies | DR Impact |
|---------|--------------|-----------|
| payment-api | Database, payment-gateway | All must fail over together |
| customer-portal | Database, auth-service | Sequential failover possible |
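The dependency table implies an ordering: a service should not be brought up in the DR region before the things it depends on. As a rough illustration, the sketch below derives such an order with Python's standard `graphlib` module (Python 3.9+). The dependency map mirrors the example rows above; the function name `failover_order` is hypothetical, and real dependency data would come from your own service inventory.

```python
from graphlib import CycleError, TopologicalSorter

# Service -> set of dependencies, taken from the example rows above.
DEPENDENCIES = {
    "payment-api": {"database", "payment-gateway"},
    "customer-portal": {"database", "auth-service"},
}


def failover_order(deps: dict) -> list:
    """Return an order in which every dependency precedes its dependents."""
    try:
        return list(TopologicalSorter(deps).static_order())
    except CycleError as exc:
        # args[1] holds the list of nodes that form the cycle.
        raise ValueError(f"circular dependency: {exc.args[1]}") from exc


if __name__ == "__main__":
    for step, name in enumerate(failover_order(DEPENDENCIES), start=1):
        print(step, name)
```

The relative order of unrelated components (for example `payment-gateway` and `auth-service`) is not meaningful; only the "dependency before dependent" guarantee is.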

## Phase 2: Strategy Selection

### DR Strategy Comparison

| Strategy | RTO | RPO | Cost | Complexity | Best For |
|----------|-----|-----|------|------------|----------|
| Backup & Restore | Hours | Hours | $ | Low | Tier 4 services |
| Pilot Light | 30-60 min | Minutes | $$ | Medium | Tier 3 services |
| Warm Standby | 10-30 min | Seconds-Minutes | $$$ | Medium-High | Tier 2 services |
| Hot Standby | <10 min | Seconds | $$$$ | High | Tier 1 services |
| Multi-Active | Near-zero | Near-zero | $$$$$ | Very High | Ultra-critical |
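The "Best For" column gives a default strategy per tier, and three of the rows give a numeric worst-case RTO. The sketch below combines the two as a sanity check: it looks up the default strategy for a tier and reports whether that strategy's worst-case RTO from the table fits a given target. It is only an illustration; the names `STRATEGY_BY_TIER`, `WORST_CASE_RTO_MIN`, and `recommend_strategy` are hypothetical, and strategies with no numeric RTO in the table are reported as unknown rather than guessed.

```python
from typing import Optional, Tuple

# Default strategy per tier, from the "Best For" column.
STRATEGY_BY_TIER = {
    1: "Hot Standby",
    2: "Warm Standby",
    3: "Pilot Light",
    4: "Backup & Restore",
}

# Upper end of the RTO range in minutes, only where the table is numeric.
WORST_CASE_RTO_MIN = {
    "Pilot Light": 60,
    "Warm Standby": 30,
    "Hot Standby": 10,
}


def recommend_strategy(tier: int,
                       target_rto_min: float) -> Tuple[str, Optional[bool]]:
    """Return (strategy, fits), where fits is None if the table has no number."""
    strategy = STRATEGY_BY_TIER[tier]
    worst_case = WORST_CASE_RTO_MIN.get(strategy)
    fits = None if worst_case is None else worst_case <= target_rto_min
    return strategy, fits


if __name__ == "__main__":
    print(recommend_strategy(tier=2, target_rto_min=20))
```

A Tier 2 service with a 20-minute target gets `("Warm Standby", False)`: the default strategy's 30-minute worst case is too slow, which is a prompt to consider the next strategy up.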

### Strategy Selection Matrix

## DR Strategy Selection

### Requirements Summary

| Requirement | Value |
|-------------|-------|
| Target RTO | [X minutes/hours] |
| Target RPO | [X minutes/hours] |
| Budget | $[X,XXX]/month for DR |
| Compliance | [frameworks] |

### Strategy Decision

**Selected Strategy:** [Pilot Light / Warm Standby / Hot Standby]

**Rationale:**
1. RTO requirement of [X] achieved by [strategy]
2. RPO requirement of [X] achieved with [replication method]
3. Budget of $[X]/month supports [strategy] (~XX% of production cost)
4. Compliance requirement for [X] met with [features]

### Trade-offs Accepted

| Trade-off | Impact | Mitigation |
|-----------|--------|------------|
| Higher DR cost | +$X/month | Justified by RTO requirement |
| Manual failover steps | 5-10 min added | Automation planned Q2 |

## Phase 3: Architecture Design

### DR Architecture Components

| Component | Primary | DR | Replication |
|-----------|---------|----|-------------|
| DNS | Route53 | Route53 | Global service |
| Load Balancer | ALB (us-east-1) | ALB (us-west-2) | Configuration sync |
| Compute | EKS (us-east-1) | EKS (us-west-2) | GitOps deployment |
| Database | Aurora (us-east-1) | Aurora Global (us-west-2) | Async replication |
| Storage | S3 (us-east-1) | S3 (us-west-2) | Cross-region replication |
| Secrets | Secrets Manager | Secrets Manager | Manual sync |
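One use of a component inventory like this is to spot anything that depends on a human step to stay in sync. The sketch below encodes the Replication column and lists the components marked as manual. It is a minimal illustration with hypothetical names (`COMPONENTS`, `manual_replication`), assuming the inventory is kept as plain data alongside the architecture document.

```python
# (component, replication method) pairs from the table above.
COMPONENTS = [
    ("DNS", "Global service"),
    ("Load Balancer", "Configuration sync"),
    ("Compute", "GitOps deployment"),
    ("Database", "Async replication"),
    ("Storage", "Cross-region replication"),
    ("Secrets", "Manual sync"),
]


def manual_replication(components):
    """Return the names of components whose replication is manual."""
    return [name for name, method in components if "manual" in method.lower()]


if __name__ == "__main__":
    for name in manual_replication(COMPONENTS):
        print(f"{name}: replication is manual, confirm it before each DR test")
```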

### Architecture Diagram Template

Primary Region (us-east-1)          DR Region (us-west-2)
┌─────────────────────────┐         ┌─────────────────────────┐
│                         │         │                         │
│  ┌─────────────────┐    │         │  ┌─────────────────┐    │
│  │     ALB         │    │         │  │     ALB         │    │
│  └────────┬────────┘    │         │  └────────┬────────┘    │
│           │             │         │           │ (standby)   │
│  ┌────────┴────────┐    │         │  ┌────────┴────────┐    │
│  │  EKS Cluster    │    │         │  │  EKS Cluster    │    │
│  │  (Active)       │    │         │  │  (Standby)      │    │
│  └────────┬────────┘    │         │  └────────┬────────┘    │
│           │             │         │           │             │
│  ┌────────┴────────┐    │  async  │  ┌────────┴────────┐    │
│  │  Aurora         │────┼────────►│  │  Aurora         │    │
│  │  (Primary)      │    │         │  │  (Replica)      │    │
│  └─────────────────┘    │         │  └─────────────────┘    │
│                         │         │                         │
└─────────────────────────┘         └─────────────────────────┘
              │                               │
              └───────────┬───────────────────┘
                          │
                   ┌──────┴──────┐
                   │   Route53   │
                   │   (Global)  │
                   └─────────────┘

## Phase 4: Runbook Development

### Failover Runbook Structure

## Failover Runbook: [Service Name]

**Version:** 1.0
**Last Updated:** YYYY-MM-DD
**Owner:** [team]

### Pre-Conditions

- [ ] DR region healthy (check dashboard)
- [ ] Replication lag <[X seconds/minutes]
- [ ] On-call personnel available
- [ ] Communication channels ready

### Failover Decision Criteria

| Criteria | Automatic | Manual |
|----------|-----------|--------|
| Primary region unavailable >5 min | Yes | - |
| Replication lag >15 min | - | Yes |
| Data corruption detected | - | Yes |
| Planned maintenance | - | Yes |
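The criteria table can be expressed as a small decision function, which is useful when wiring alerts to a failover trigger. The sketch below is an assumption-laden illustration, not the skill's own logic: the names `Decision`, `Signals`, and `failover_decision` are hypothetical, and it assumes that any manual criterion takes precedence over the automatic one, since the table does not say what happens when both apply.

```python
from dataclasses import dataclass
from enum import Enum


class Decision(Enum):
    NONE = "no failover"
    MANUAL = "manual decision required"
    AUTOMATIC = "automatic failover"


@dataclass
class Signals:
    primary_unavailable_min: float = 0.0
    replication_lag_min: float = 0.0
    data_corruption: bool = False
    planned_maintenance: bool = False


def failover_decision(s: Signals) -> Decision:
    """Apply the thresholds from the decision criteria table."""
    # Assumption: manual criteria win, so a human reviews risky situations.
    if s.data_corruption or s.planned_maintenance or s.replication_lag_min > 15:
        return Decision.MANUAL
    if s.primary_unavailable_min > 5:
        return Decision.AUTOMATIC
    return Decision.NONE


if __name__ == "__main__":
    print(failover_decision(Signals(primary_unavailable_min=7)).value)
```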

### Failover Steps

1. **Verify DR Readiness** (2 min)
   ```bash
   # Check DR database status
   aws rds describe-db-clusters --region us-west-2

   # Check EKS cluster status
   kubectl --context=dr get nodes
   ```

2. **Stop Writes to Primary** (1 min)
   ```bash
   # Scale down primary services
   kubectl --context=primary scale deployment/api --replicas=0
   ```

3. **Promote DR Database** (5 min)
   ```bash
   # Promote Aurora replica
   aws rds failover-global-cluster \
     --global-cluster-identifier my-global-cluster \
     --target-db-cluster-identifier dr-cluster
   ```

4. **Activate DR Services** (2 min)
   ```bash
   # Scale up DR services
   kubectl --context=dr scale deployment/api --replicas=10
   ```

5. **Update DNS** (1-5 min propagation)
   ```bash
   # Invert the primary health check so Route53 treats a still-passing
   # primary as unhealthy and routes to DR. Do not use --disabled here:
   # a disabled health check is always considered healthy. If the primary
   # is already failing its check, Route53 fails over without this step.
   aws route53 update-health-check \
     --health-check-id xxx \
     --inverted
   ```

6. **Verify Service** (5 min)
   ```bash
   # Health check
   curl https://api.example.com/health

   # Synthetic transaction
   ./scripts/synthetic-test.sh
   ```
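The per-step time estimates in the failover steps can be summed to check whether the runbook as written fits the RTO it is meant to meet. The sketch below does that arithmetic, assuming the steps run strictly one after another; the names `RUNBOOK_STEPS`, `runbook_duration`, and `fits_rto` are hypothetical.

```python
from typing import List, Tuple

# (step name, best-case minutes, worst-case minutes), from the runbook steps.
Step = Tuple[str, int, int]

RUNBOOK_STEPS: List[Step] = [
    ("Verify DR Readiness", 2, 2),
    ("Stop Writes to Primary", 1, 1),
    ("Promote DR Database", 5, 5),
    ("Activate DR Services", 2, 2),
    ("Update DNS", 1, 5),
    ("Verify Service", 5, 5),
]


def runbook_duration(steps: List[Step]) -> Tuple[int, int]:
    """Return (best case, worst case) total minutes for sequential execution."""
    best = sum(low for _, low, _ in steps)
    worst = sum(high for _, _, high in steps)
    return best, worst


def fits_rto(steps: List[Step], rto_minutes: int) -> bool:
    """True if even the worst-case duration stays within the RTO."""
    _, worst = runbook_duration(steps)
    return worst <= rto_minutes


if __name__ == "__main__":
    print(runbook_duration(RUNBOOK_STEPS), fits_rto(RUNBOOK_STEPS, 15))
```

With these estimates the sequence takes 16 to 20 minutes end to end, so it would not fit a 15-minute RTO unless some steps are automated or run in parallel.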

### Rollback Steps

[If failover causes issues, steps to return to primary]

### Communication Template

**Internal:**

DR failover initiated for [service] at [time UTC]. Estimated completion: [X minutes]. IC: [name]

**External (if customer-facing):**

We are currently experiencing issues with [service]. Our team is working to restore service. Status page: [url]


---

## Phase 5: Testing

### DR Test Types

| Test Type | Frequency | Scope | Impact |
|-----------|-----------|-------|--------|
| **Tabletop** | Quarterly | Full scenario walkthrough | None |
| **Component** | Monthly | Individual component failover | Minimal |
| **Partial** | Quarterly | Non-production failover | Low |
| **Full** | Annually | Production failover | Moderate |

### DR Test Template

```markdown
## DR Test Report

**Test Date:** YYYY-MM-DD
**Test Type:** [Tabletop/Component/Partial/Full]
**Scope:** [services tested]

### Test Objectives

1. Validate RTO of <[X minutes]
2. Validate RPO of <[X minutes]
3. Verify runbook accuracy
4. Identify gaps in DR readiness

### Test Results

| Metric | Target | Actual | Status |
|--------|--------|--------|--------|
| RTO | 15 min | 12 min | PASS |
| RPO | 1 min | 45 sec | PASS |
| Data integrity | 100% | 100% | PASS |
| Runbook accuracy | 100% | 85% | PARTIAL |

### Timeline

| Time | Action | Status |
|------|--------|--------|
| 10:00 | Test initiated | OK |
| 10:02 | Primary shutdown simulated | OK |
| 10:08 | DR database promoted | OK |
| 10:12 | DR services activated | OK |
| 10:15 | Service verified | OK |

### Issues Found

| Issue | Severity | Action Required |
|-------|----------|-----------------|
| Step 4 command incorrect | Medium | Update runbook |
| DNS propagation slower | Low | Reduce TTL |

### Lessons Learned

1. [Lesson 1]
2. [Lesson 2]

### Action Items

| Item | Owner | Due Date |
|------|-------|----------|
| Update runbook step 4 | @ops | YYYY-MM-DD |
| Reduce DNS TTL | @platform | YYYY-MM-DD |

```

## Phase 6: Maintenance

### DR Maintenance Schedule

| Activity | Frequency | Owner |
|----------|-----------|-------|
| Runbook review | Quarterly | Platform team |
| DR test | Per test schedule | SRE team |
| Replication monitoring | Daily (automated) | Monitoring |
| Cost review | Monthly | FinOps |
| Architecture review | Annually | Architecture team |

## Anti-Rationalization Table

| Rationalization | Why It's WRONG | Required Action |
|-----------------|----------------|-----------------|
| "DR can be added later" | DR added later is rarely tested | DR is day-1 requirement |
| "Backups are good enough" | Backups != DR. RTO is hours vs minutes. | Design proper DR strategy |
| "Too expensive for DR" | DR cost << outage cost | Calculate business impact |
| "We'll figure it out during incident" | Panic != good decisions | Document runbooks NOW |
| "Tested last year, still good" | Systems change constantly | Test regularly |
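The "Calculate business impact" action can be made concrete with a back-of-the-envelope comparison between what DR costs per year and the revenue loss it avoids. The sketch below is a deliberately simple model with hypothetical names (`avoided_loss_per_year`, `dr_is_justified`) and made-up figures; it counts only lost revenue per hour of downtime and ignores reputational, contractual, and regulatory costs.

```python
def avoided_loss_per_year(revenue_per_hour: float,
                          hours_down_without_dr: float,
                          hours_down_with_dr: float,
                          outages_per_year: float = 1.0) -> float:
    """Revenue loss avoided per year by recovering faster."""
    hours_saved = hours_down_without_dr - hours_down_with_dr
    return revenue_per_hour * hours_saved * outages_per_year


def dr_is_justified(dr_cost_per_month: float, avoided_loss: float) -> bool:
    """True if the avoided loss exceeds the annual DR spend."""
    return avoided_loss > dr_cost_per_month * 12


if __name__ == "__main__":
    # Illustrative figures only: $5,000/hour revenue, an 8-hour restore
    # from backups versus a 15-minute failover, one major outage per year.
    saved = avoided_loss_per_year(5000, 8, 0.25)
    print(saved, dr_is_justified(2000, saved))
```

Under these made-up numbers, a $2,000/month DR footprint is justified and a $4,000/month one is not, which is the kind of threshold the BIA revenue-impact column is meant to expose.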

## Dispatch Specialist

For DR planning tasks, dispatch:

Task tool:
  subagent_type: "ring:infrastructure-architect"
  prompt: |
    DR PLANNING REQUEST
    Services: [services requiring DR]
    RTO Requirement: [target]
    RPO Requirement: [target]
    Current State: [existing DR if any]
    REQUEST: [design/review/test planning]

