Knowledge Pack: Disaster Recovery

Purpose

Provides comprehensive disaster recovery patterns for {{LANGUAGE}} {{FRAMEWORK}} services deployed in containerized environments. Covers DR strategy selection, RPO/RTO definitions, failover automation, testing cadence, multi-region data patterns, communication plans, and per-component recovery procedures.

Quick Reference (always in context)

See references/dr-strategy-decision-tree.md for DR strategy selection based on RPO/RTO requirements. See references/rpo-rto-calculator.md for RPO/RTO calculation based on SLA alignment.

Detailed References

Read these files for comprehensive disaster recovery guidance:

| Reference | Content | |-----------|---------| | references/dr-strategy-decision-tree.md | DR strategy selection flowchart based on RPO/RTO requirements, cost constraints, and complexity tolerance | | references/rpo-rto-calculator.md | RPO/RTO calculation based on SLA alignment, business impact analysis, and cost justification |

DR Strategies

| Strategy | RPO | RTO | Cost | Complexity | Description | |----------|-----|-----|------|------------|-------------| | Active-Active | ~0 | ~0 | Very High | Very High | Multi-region active, all regions serve traffic simultaneously | | Active-Passive | Minutes | Minutes-Hours | High | High | Standby region activated on primary failure | | Warm Standby | Minutes | Minutes | Medium | Medium | Scaled-down replica in standby, fast activation | | Pilot Light | Hours | Hours | Low | Medium | Minimal resources in standby, activated and scaled on failure |

Active-Active

All regions serve production traffic simultaneously
Data replicated synchronously across regions
Lowest RPO/RTO but highest cost and complexity
Requires conflict resolution for concurrent writes
Best for: mission-critical, zero-downtime requirements

Active-Passive

Primary region serves all traffic; standby on hot standby
Asynchronous data replication to standby
Moderate RPO (minutes of data loss) and RTO (minutes to hours)
Failover requires DNS switch and health check validation
Best for: important services with moderate downtime tolerance

Warm Standby

Scaled-down replica running in secondary region
Receives replicated data continuously
Fast activation: scale up resources and switch traffic
Best for: services needing fast recovery at moderate cost

Pilot Light

Minimal infrastructure in secondary region (database replicas only)
Application infrastructure provisioned on demand during failover
Longest RTO but lowest ongoing cost
Best for: non-critical services with hours of acceptable downtime

RPO/RTO Definitions

Recovery Point Objective (RPO)

Maximum acceptable data loss measured in time. Defines how frequently backups or replication must occur.

RPO = 0: Zero data loss (synchronous replication required)
RPO = 5 min: Up to 5 minutes of transactions may be lost
RPO = 1 hour: Hourly backup/replication acceptable
RPO = 24 hours: Daily backup sufficient

Recovery Time Objective (RTO)

Maximum acceptable downtime from failure detection to service restoration.

RTO < 1 min: Automated failover required (active-active)
RTO < 15 min: Automated failover with health checks (active-passive)
RTO < 1 hour: Semi-automated with manual verification
RTO < 4 hours: Manual failover procedures acceptable

SLA Alignment

Map RPO/RTO to SLA commitments (99.9% = ~8.7h/year downtime)
Document gap between current capability and SLA requirement
Cost-justify DR investment based on business impact analysis

Failover Automation

DNS Failover

Health check-based DNS routing (Route53, CloudFlare, Azure Traffic Manager)
TTL configuration: low TTL (30-60s) for fast failover
Weighted routing for gradual traffic shift during planned failovers

Load Balancer Failover

Health check-based backend selection
Cross-region load balancing with geographic affinity
Connection draining during failover transitions

Database Failover

Read replica promotion to primary
Multi-master replication with automatic leader election
Point-in-time recovery for data corruption scenarios
Connection string management (DNS-based endpoint or service discovery)

Automated Runbook Execution

Event-driven failover triggers (monitoring alerts)
Step-by-step automated procedures with human approval gates
Rollback automation for failed failover attempts
Post-failover validation checks

DR Testing Cadence

| Test Type | Frequency | Scope | Confidence | |-----------|-----------|-------|------------| | Tabletop Exercise | Quarterly | Communication and procedures | Low cost, validates communication | | Simulation Test | Bi-annually | Partial failover of components | Medium cost, validates automation | | Full Failover Test | Annually | Complete region switchover | High cost, highest confidence |

Tabletop Exercise

Walk through DR procedures without executing them. Validate that documentation is current, roles are clear, and communication channels work.

Simulation Test

Execute partial failover: fail over a single service or database. Validate that automation scripts work and recovery meets RTO/RPO targets.

Full Failover Test

Complete region failover with production traffic. Maximum confidence but highest risk. Requires careful planning, monitoring, and rollback readiness.

Multi-Region Patterns

Data Replication

Synchronous: Zero data loss, higher latency, limited by network distance
Asynchronous: Low latency, potential data loss during failover
Semi-synchronous: Acknowledge after at least one replica confirms

Eventual Consistency

Convergence guarantees: all replicas reach same state eventually
Read-your-writes consistency: user sees their own updates immediately
Causal consistency: operations that depend on each other are ordered

Conflict Resolution

Last-Writer-Wins (LWW): Simple, may lose data, suitable for non-critical fields
Merge: Application-level merge logic for concurrent updates
CRDT: Conflict-free Replicated Data Types for automatic resolution

Communication Plan

Internal Communication

Engineering: Real-time updates in incident channel, status page
Management: Executive summary every 30 minutes during incident
Support: Customer-facing FAQ and scripted responses

External Communication

Customers: Status page updates, email notifications for affected users
Partners: Direct notification for API consumers with estimated recovery
Regulatory: Incident reports per compliance requirements (timing, content)

Templates per Severity

SEV1 (Critical): Immediate notification to all stakeholders, war room established
SEV2 (Major): Engineering and management notified, regular updates
SEV3 (Minor): Engineering notified, status page updated

Recovery Procedures per Component

Application

Container restart with health check validation
Horizontal scaling to handle backlog after recovery
Redeployment from known-good image if corruption suspected
Cache warmup before accepting full traffic

Database

Automated failover to read replica (promote to primary)
Restore from backup with point-in-time recovery
Data integrity validation post-recovery
Connection pool reset and application restart

Cache

Cache warmup from database (lazy or eager loading)
Cache rebuild from persistent storage if available
Fallback to database during warmup period
Monitor cache hit rate during recovery

Message Broker

Replay unprocessed messages from dead letter queue
Partition recovery and rebalancing
Consumer group offset management
Message deduplication during replay

Related Knowledge Packs

skills/sre-practices/ — incident management, error budgets, and change management
skills/infrastructure/ — Kubernetes manifests, health probes, and graceful shutdown
skills/data-management/ — backup/restore strategies and data replication patterns

Knowledge Pack: Disaster Recovery

Purpose

Quick Reference (always in context)

See references/dr-strategy-decision-tree.md for DR strategy selection based on RPO/RTO requirements. See references/rpo-rto-calculator.md for RPO/RTO calculation based on SLA alignment.

Detailed References

Read these files for comprehensive disaster recovery guidance:

DR Strategies

Active-Active

All regions serve production traffic simultaneously
Data replicated synchronously across regions
Lowest RPO/RTO but highest cost and complexity
Requires conflict resolution for concurrent writes
Best for: mission-critical, zero-downtime requirements

Active-Passive

Primary region serves all traffic; standby on hot standby
Asynchronous data replication to standby
Moderate RPO (minutes of data loss) and RTO (minutes to hours)
Failover requires DNS switch and health check validation
Best for: important services with moderate downtime tolerance

Warm Standby

Scaled-down replica running in secondary region
Receives replicated data continuously
Fast activation: scale up resources and switch traffic
Best for: services needing fast recovery at moderate cost

Pilot Light

Minimal infrastructure in secondary region (database replicas only)
Application infrastructure provisioned on demand during failover
Longest RTO but lowest ongoing cost
Best for: non-critical services with hours of acceptable downtime

RPO/RTO Definitions

Recovery Point Objective (RPO)

Maximum acceptable data loss measured in time. Defines how frequently backups or replication must occur.

RPO = 0: Zero data loss (synchronous replication required)
RPO = 5 min: Up to 5 minutes of transactions may be lost
RPO = 1 hour: Hourly backup/replication acceptable
RPO = 24 hours: Daily backup sufficient

Recovery Time Objective (RTO)

Maximum acceptable downtime from failure detection to service restoration.

RTO < 1 min: Automated failover required (active-active)
RTO < 15 min: Automated failover with health checks (active-passive)
RTO < 1 hour: Semi-automated with manual verification
RTO < 4 hours: Manual failover procedures acceptable

SLA Alignment

Map RPO/RTO to SLA commitments (99.9% = ~8.7h/year downtime)
Document gap between current capability and SLA requirement
Cost-justify DR investment based on business impact analysis

Failover Automation

DNS Failover

Health check-based DNS routing (Route53, CloudFlare, Azure Traffic Manager)
TTL configuration: low TTL (30-60s) for fast failover
Weighted routing for gradual traffic shift during planned failovers

Load Balancer Failover

Health check-based backend selection
Cross-region load balancing with geographic affinity
Connection draining during failover transitions

Database Failover

Read replica promotion to primary
Multi-master replication with automatic leader election
Point-in-time recovery for data corruption scenarios
Connection string management (DNS-based endpoint or service discovery)

Automated Runbook Execution

Event-driven failover triggers (monitoring alerts)
Step-by-step automated procedures with human approval gates
Rollback automation for failed failover attempts
Post-failover validation checks

DR Testing Cadence

Tabletop Exercise

Walk through DR procedures without executing them. Validate that documentation is current, roles are clear, and communication channels work.

Simulation Test

Execute partial failover: fail over a single service or database. Validate that automation scripts work and recovery meets RTO/RPO targets.

Full Failover Test

Complete region failover with production traffic. Maximum confidence but highest risk. Requires careful planning, monitoring, and rollback readiness.

Multi-Region Patterns

Data Replication

Synchronous: Zero data loss, higher latency, limited by network distance
Asynchronous: Low latency, potential data loss during failover
Semi-synchronous: Acknowledge after at least one replica confirms

Eventual Consistency

Convergence guarantees: all replicas reach same state eventually
Read-your-writes consistency: user sees their own updates immediately
Causal consistency: operations that depend on each other are ordered

Conflict Resolution

Last-Writer-Wins (LWW): Simple, may lose data, suitable for non-critical fields
Merge: Application-level merge logic for concurrent updates
CRDT: Conflict-free Replicated Data Types for automatic resolution

Communication Plan

Internal Communication

Engineering: Real-time updates in incident channel, status page
Management: Executive summary every 30 minutes during incident
Support: Customer-facing FAQ and scripted responses

External Communication

Customers: Status page updates, email notifications for affected users
Partners: Direct notification for API consumers with estimated recovery
Regulatory: Incident reports per compliance requirements (timing, content)

Templates per Severity

SEV1 (Critical): Immediate notification to all stakeholders, war room established
SEV2 (Major): Engineering and management notified, regular updates
SEV3 (Minor): Engineering notified, status page updated

Recovery Procedures per Component

Application

Container restart with health check validation
Horizontal scaling to handle backlog after recovery
Redeployment from known-good image if corruption suspected
Cache warmup before accepting full traffic

Database

Automated failover to read replica (promote to primary)
Restore from backup with point-in-time recovery
Data integrity validation post-recovery
Connection pool reset and application restart

Cache

Cache warmup from database (lazy or eager loading)
Cache rebuild from persistent storage if available
Fallback to database during warmup period
Monitor cache hit rate during recovery

Message Broker

Replay unprocessed messages from dead letter queue
Partition recovery and rebalancing
Consumer group offset management
Message deduplication during replay

Related Knowledge Packs

skills/sre-practices/ — incident management, error budgets, and change management
skills/infrastructure/ — Kubernetes manifests, health probes, and graceful shutdown
skills/data-management/ — backup/restore strategies and data replication patterns

Adoption

edercnj/disaster-recovery

$ install --global

Security Scan Results

SKILL.md

Knowledge Pack: Disaster Recovery

Purpose

Quick Reference (always in context)

Detailed References

DR Strategies

Active-Active

Active-Passive

Warm Standby

Pilot Light

RPO/RTO Definitions

Recovery Point Objective (RPO)

Recovery Time Objective (RTO)

SLA Alignment

Failover Automation

DNS Failover

Load Balancer Failover

Database Failover

Automated Runbook Execution

DR Testing Cadence

Tabletop Exercise

Simulation Test

Full Failover Test

Multi-Region Patterns

Data Replication

Eventual Consistency

Conflict Resolution

Communication Plan

Internal Communication

External Communication

Templates per Severity

Recovery Procedures per Component

Application

Database

Cache

Message Broker

Related Knowledge Packs

Related Skills

edercnj/x-validate-docs

edercnj/x-validate-dependency-policy

edercnj/x-update-architecture

edercnj/x-scan-secrets

edercnj/disaster-recovery

$ install --global

Security Scan Results

SKILL.md

Knowledge Pack: Disaster Recovery

Purpose

Quick Reference (always in context)

Detailed References

DR Strategies

Active-Active

Active-Passive

Warm Standby

Pilot Light

RPO/RTO Definitions

Recovery Point Objective (RPO)

Recovery Time Objective (RTO)

SLA Alignment

Failover Automation

DNS Failover

Load Balancer Failover

Database Failover

Automated Runbook Execution

DR Testing Cadence

Tabletop Exercise

Simulation Test

Full Failover Test

Multi-Region Patterns

Data Replication

Eventual Consistency

Conflict Resolution

Communication Plan

Internal Communication

External Communication

Templates per Severity