skills/backup/SKILL.md
Backup and disaster recovery: backup strategy, disaster recovery, RPO/RTO, data integrity, durability, runbook.
Install this skill globally with one command: `npx skillsauth add arbazkhan971/godmode backup`. Works with Claude Code, Cursor, and Windsurf.
`/godmode:backup`
`/godmode:infra` provisions stateful services (databases, storage, queues)

Identify all data that needs protection:
DATA ASSET INVENTORY:
| Asset | Type | Size | Growth | Critical |
Set recovery objectives for each data tier:
RECOVERY OBJECTIVES:
| Data Tier | RPO | RTO | Justification |
Design backup approach for each data tier:
TIER 1 BACKUP STRATEGY:
Primary database:
TIER 2 BACKUP STRATEGY:
File uploads (S3/GCS):
TIER 3 BACKUP STRATEGY:
Application logs:
IF backup age >24h: trigger immediate backup. IF restore test fails: alert and re-run.
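The two rules above can be sketched as small shell functions. This is a minimal sketch, not part of the skill itself: the function names `backup_action` and `restore_test_with_alert` are hypothetical, timestamps are assumed to be Unix epoch seconds, and "alert" is assumed to mean a message on stderr.

```bash
# Decide whether a backup is stale. Hypothetical helper.
# args: last_backup_epoch now_epoch (both in seconds since the Unix epoch)
backup_action() {
  local max_age=$((24 * 60 * 60))   # 24h threshold from the rule above
  local age=$(( $2 - $1 ))
  if [ "$age" -gt "$max_age" ]; then
    echo "trigger-backup"
  else
    echo "ok"
  fi
}

# Run a restore test command; on failure, alert and re-run once. Hypothetical helper.
# args: the restore test command and its arguments
restore_test_with_alert() {
  if "$@"; then
    echo "restore-ok"
    return 0
  fi
  echo "ALERT: restore test failed, re-running" >&2
  if "$@"; then
    echo "restore-ok-after-retry"
    return 0
  fi
  echo "ALERT: restore test failed twice" >&2
  return 1
}
```

A backup that is exactly 24 hours old is still treated as fresh here, since the rule says "older than 24h".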
Automated checks that backups are actually working:
BACKUP VERIFICATION SCHEDULE:
| Check | Frequency | Method | Alert |
### Step 5: Data Integrity Verification
Verify backed-up data is consistent and usable:
DATA INTEGRITY CHECKS:

| Check | Method | Frequency |
|--|--|--|
| Row count consistency | Compare source vs restored backup | After each restore test |
| Checksum verification | SHA-256 of backup file | Every backup |
| Foreign key integrity | Run FK constraint check on restored DB | Weekly restore |
| Application smoke test | Run app against restored DB | Monthly restore |
| Point-in-time accuracy | Restore to specific time, verify records | Quarterly |
| Cross-region consistency | Compare checksums across regions | Weekly |
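The SHA-256 checksum check could look roughly like the sketch below. It assumes GNU coreutils `sha256sum` is available (on macOS, `shasum -a 256` is the usual substitute). The function names and the `.sha256` sidecar-file convention are hypothetical.

```bash
# Record the SHA-256 of a backup file next to it. Hypothetical helper.
# args: path to the backup file
write_checksum() {
  sha256sum "$1" | awk '{print $1}' > "$1.sha256"
}

# Recompute the SHA-256 and compare it with the recorded value.
# Returns 0 when they match, non-zero otherwise.
verify_checksum() {
  [ "$(sha256sum "$1" | awk '{print $1}')" = "$(cat "$1.sha256")" ]
}
```

Calling `write_checksum backup.dump` right after the backup and `verify_checksum backup.dump` before any restore would catch a file altered or truncated in between. It does not prove the dump is restorable.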
Integrity verification queries:
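-- Row count comparison (run against source and restored)
-- n_live_tup is an estimate; use SELECT count(*) per table when exact counts are needed
SELECT relname AS table_name, n_live_tup
FROM pg_stat_user_tables
ORDER BY relname;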
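Once the query has been run against both databases, the two result sets still need to be compared. The sketch below assumes each result was exported to a file with one `table<TAB>count` line per table. That file format and the name `compare_row_counts` are assumptions for illustration.

```bash
# Compare per-table row counts from source and restored databases. Hypothetical helper.
# args: source_counts_file restored_counts_file (lines of "table<TAB>count")
# Prints one MISMATCH line per differing table; returns 1 if any differ.
compare_row_counts() {
  awk -F'\t' '
    NR == FNR { src[$1] = $2; next }          # first file: remember source counts
    {
      seen[$1] = 1
      if (!($1 in src)) {
        print "MISMATCH " $1 ": source=missing restored=" $2; bad = 1
      } else if (src[$1] != $2) {
        print "MISMATCH " $1 ": source=" src[$1] " restored=" $2; bad = 1
      }
    }
    END {
      for (t in src) if (!(t in seen)) {
        print "MISMATCH " t ": source=" src[t] " restored=missing"; bad = 1
      }
      exit (bad ? 1 : 0)
    }
  ' "$1" "$2"
}
```

The sketch expects a non-empty source file. With estimated counts such as `n_live_tup`, small differences may be noise rather than data loss, so exact counts are the safer input.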
### Step 6: Recovery Procedures
Document step-by-step recovery for each failure scenario:
#### Scenario 1: Primary Database Failure
RECOVERY: Primary Database Failure
Severity: CRITICAL
RPO: < 1 minute (via streaming replication)
RTO: < 15 minutes (via automated failover)
AUTOMATED RESPONSE:
MANUAL RESPONSE (if automated failover fails):
POST-RECOVERY:
#### Scenario 2: Data Corruption
RECOVERY: Data Corruption — Severity: HIGH, RTO: 15min-2h
#### Scenario 3: Complete Region Failure
RECOVERY: Region Failure — Severity: CRITICAL, RPO: <1min, RTO: <30min
### Step 7: Disaster Recovery Runbook
Generate a comprehensive runbook document:
DISASTER RECOVERY RUNBOOK
Last tested: <date>
Next test scheduled: <date>
Owner: <team/person>
On-call escalation: <contact chain>
Recovery objectives:

| Tier | RPO | RTO |
|--|--|--|
| Tier 1 | < 1 min | < 15 min |
| Tier 2 | < 1 hour | < 1 hour |
| Tier 3 | < 24 hours | < 4 hours |

Scenarios covered:
### Step 8: Backup & DR Report
BACKUP & DISASTER RECOVERY REPORT
Data assets inventoried: <N>
Backup strategies defined: <N>
Recovery procedures documented: <N> scenarios
Coverage:
  Tier 1 (critical): <PROTECTED | GAPS | UNPROTECTED>
  Tier 2 (important): <PROTECTED | GAPS | UNPROTECTED>
  Tier 3 (operational): <PROTECTED | GAPS | UNPROTECTED>
Verification:
  Automated checks: <CONFIGURED | PARTIAL | MISSING>
  Last restore test: <date | NEVER>
  Last DR test: <date | NEVER>
Gaps identified: <N>
Verdict: <PROTECTED | PARTIAL | AT RISK>
### Step 9: Commit and Transition
Save runbook as `docs/dr/<date>-disaster-recovery-runbook.md`
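```bash
# Test backup and restore procedures
pg_dump -Fc mydb > backup_test.dump
createdb mydb_test   # pg_restore -d needs an existing target database
pg_restore -d mydb_test backup_test.dump
psql mydb_test -c "SELECT count(*) FROM users;"

# Verify backup and test restore
pg_dump --format=custom -f backup.dump "$DATABASE_URL"
pg_restore --list backup.dump   # lists the archive contents; fails on an unreadable archive
curl -fsS http://localhost:8080/health   # -f makes curl exit non-zero on an HTTP error status
```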
1. Data stores: grep for postgres, mysql, mongodb, redis connection strings
2. Object storage: grep for S3, GCS, Azure Blob configs
3. Existing backups: check crontab, CI jobs, WAL archiving configs
4. No backups → CRITICAL gap. No verification → HIGH gap.
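The first discovery step could be approximated with a recursive grep for connection-string URL schemes. This is a rough sketch only. The function name `find_data_stores` and the exact pattern are assumptions, and it will miss data stores configured through separate host/port settings rather than URLs.

```bash
# List files that contain a database connection string. Hypothetical helper.
# args: directory to scan (defaults to the current directory)
find_data_stores() {
  grep -rEl --exclude-dir=.git \
    '(postgres(ql)?|mysql|mongodb(\+srv)?|rediss?)://' "${1:-.}" | sort
}
```

Each file listed is a candidate data asset to check against existing backup jobs. Object storage and existing backup configs (steps 2 and 3) need their own searches.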
BACKUP VERIFICATION LOOP:
current_iteration = 0
max_iterations = 10
gaps_remaining = total_gaps_found
WHILE gaps_remaining > 0 AND current_iteration < max_iterations:
    current_iteration += 1
    1. SELECT highest-severity backup gap
    2. IMPLEMENT fix:
       - Missing backup → create backup job/config
       - No verification → add automated integrity check
       - No cross-region → configure replication
       - No restore test → create and run restore test
    3. git commit: "backup: fix <gap> (iter {current_iteration})"
    4. VERIFY the fix:
       - Backup job runs successfully
       - Backup file is valid (checksum, header check)
       - Restore test passes (if applicable)
    5. IF verification fails:
       - Debug configuration
       - Retry with adjusted parameters
    6. UPDATE gaps_remaining
    IF current_iteration % 3 == 0:
        PRINT STATUS:
        "Iteration {current_iteration}/{max_iterations}"
        "Gaps fixed: {total_gaps_found - gaps_remaining}/{total_gaps_found}"
        "Tier 1 coverage: {tier1_status}"
        "Tier 2 coverage: {tier2_status}"
        "Last restore test: {last_restore_result}"
<!-- tier-3 -->
Never ask to continue. Loop autonomously until all backup gaps are resolved or budget exhausted.
MECHANICAL CONSTRAINTS — NON-NEGOTIABLE:
1. NEVER treat a backup as valid until a restore test has succeeded.
2. NEVER store backups in the same failure domain as production (same region, same account).
3. ENCRYPT EVERY backup at rest — no exceptions for any data tier.
4. EVERY backup job MUST alert on failure — silent backup failures are the worst kind.
5. EVERY backup MUST have a TTL/retention policy — no infinite storage growth.
6. DEFINE RPO and RTO BEFORE designing backup strategy — business drives engineering.
7. git commit backup configurations BEFORE testing — baseline for debugging.
8. Automatic revert on regression: if backup config change causes production issues, revert immediately.
9. NEVER skip quarterly DR tests — schedule them and treat them as P1 obligations.
10. Log all backup operations in TSV:
TIMESTAMP\tASSET\tOPERATION\tSIZE\tDURATION\tSTATUS\tCHECKSUM
Print on completion: Backup: {asset_count} assets covered. RPO: {rpo}. RTO: {rto}. Last restore test: {last_test_date}. Encryption: {encryption_status}. Cross-region: {cross_region}. Verdict: {verdict}.
timestamp asset operation size duration_s status checksum
2024-01-15T03:00:00Z postgres-prod backup 12GB 180 success sha256:abc123
2024-01-15T03:05:00Z redis-prod backup 2GB 30 success sha256:def456
2024-01-15T04:00:00Z postgres-prod restore-test 12GB 300 success verified
Columns: timestamp, asset, operation(backup/restore-test/dr-drill), size, duration_s, status(success/failed/partial), checksum.
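Appending to this log could be done with a small helper like the one below. It is a sketch under assumptions: the function name `log_backup_op` is hypothetical, and the header row is written only when the log file is empty or missing.

```bash
# Append one backup operation to a TSV log. Hypothetical helper.
# args: log_file asset operation size duration_s status checksum
log_backup_op() {
  local log_file=$1
  shift
  # Write the header row once, when the log is empty or does not exist yet.
  [ -s "$log_file" ] || \
    printf 'timestamp\tasset\toperation\tsize\tduration_s\tstatus\tchecksum\n' > "$log_file"
  # UTC timestamp in the same format as the sample rows, then the six remaining columns.
  printf '%s\t%s\t%s\t%s\t%s\t%s\t%s\n' \
    "$(date -u +%Y-%m-%dT%H:%M:%SZ)" "$@" >> "$log_file"
}
```

For example, `log_backup_op backups.tsv postgres-prod backup 12GB 180 success sha256:abc123` appends a row shaped like the first sample row above.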
## Keep/Discard
KEEP if: improvement verified. DISCARD if: regression or no change. Revert discards immediately.
## Stop Conditions
Stop when: target reached, budget exhausted, or >5 consecutive discards.
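The stop conditions can be expressed as one check at the top of each loop iteration. This sketch makes two assumptions the text does not state: "target reached" is taken to mean zero remaining gaps, and "budget" is taken to mean the iteration cap. The function name `should_stop` is hypothetical.

```bash
# Decide whether the loop should stop. Hypothetical helper.
# args: gaps_remaining iteration max_iterations consecutive_discards
# Prints the reason and returns 0 when the loop should stop; returns 1 to continue.
should_stop() {
  local gaps=$1 iter=$2 max=$3 discards=$4
  if [ "$gaps" -eq 0 ]; then
    echo "target reached"
    return 0
  fi
  if [ "$iter" -ge "$max" ]; then
    echo "budget exhausted"
    return 0
  fi
  if [ "$discards" -gt 5 ]; then   # strictly more than 5, per the rule above
    echo "too many consecutive discards"
    return 0
  fi
  return 1
}
```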