skills/production-readiness/SKILL.md
Comprehensive checklist and guidance for preparing applications for production deployment. Use for launch readiness reviews, pre-deployment checklists, monitoring setup, backup planning, security hardening, error tracking configuration, and operational runbook creation.
npx skillsauth add simplerick0/com.ackhax.configs production-readinessInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Ensure your application is ready for real users and real stakes.
Application Health
├── Error rate (% of requests)
├── Response time (p50, p95, p99)
├── Request throughput (req/min)
└── Active users (concurrent)
Infrastructure
├── CPU utilization
├── Memory usage
├── Disk space
├── Network I/O
└── Database connections
Business Metrics
├── Signups / conversions
├── Feature usage
└── User retention signals
# Minimal health check
@app.get("/health")
def health():
return {"status": "healthy"}
# Detailed readiness check
@app.get("/health/ready")
def readiness():
checks = {
"database": check_db_connection(),
"cache": check_redis(),
"disk": check_disk_space(),
}
all_healthy = all(checks.values())
return JSONResponse(
status_code=200 if all_healthy else 503,
content={"ready": all_healthy, "checks": checks}
)
CRITICAL (page immediately)
- Error rate > 5%
- Response time p95 > 5s
- Service down > 1 minute
- Disk space < 10%
WARNING (notify, don't page)
- Error rate > 1%
- Response time p95 > 2s
- Memory > 80%
- Certificate expiring < 14 days
Every error should include:
- Stack trace
- Request URL and method
- User ID (if authenticated)
- Request body (sanitized)
- Environment (prod/staging)
- Release version
- Browser/client info
Group errors by:
import sentry_sdk
sentry_sdk.init(
dsn="https://...",
environment="production",
release="1.0.0",
traces_sample_rate=0.1, # 10% of transactions
)
# Add user context
sentry_sdk.set_user({"id": user.id, "email": user.email})
Database
- Full backup: Daily
- Point-in-time recovery: Enable
- Retention: 30 days minimum
User uploads
- Replicate to separate storage
- Consider versioning
Configuration
- Infrastructure as code (committed)
- Secrets in vault (backed separately)
## Monthly Backup Test
1. Create test environment
2. Restore latest backup
3. Verify data integrity
4. Test critical user flows
5. Document results and issues
Define for your application:
- RTO (Recovery Time Objective): Max downtime acceptable
- RPO (Recovery Point Objective): Max data loss acceptable
Example (typical SaaS):
- RTO: 4 hours
- RPO: 1 hour
Strict-Transport-Security: max-age=31536000; includeSubDomains
Content-Security-Policy: default-src 'self'
X-Content-Type-Options: nosniff
X-Frame-Options: DENY
Referrer-Policy: strict-origin-when-cross-origin
Permissions-Policy: geolocation=(), microphone=()
Never in code:
- API keys
- Database passwords
- JWT secrets
- OAuth credentials
Use instead:
- Environment variables
- Secret managers (AWS Secrets Manager, Vault)
- Encrypted config files
# Python
pip-audit
# Node
npm audit
# Run in CI, block on high/critical
1. All tests passing
2. Code reviewed and approved
3. Staging tested
4. Database migrations reviewed
5. Feature flags configured
6. Rollback plan ready
1. Announce deployment (if needed)
2. Enable maintenance mode (if needed)
3. Run database migrations
4. Deploy new code
5. Run smoke tests
6. Monitor error rates
7. Disable maintenance mode
8. Announce completion
Roll back immediately if:
1. Revert to previous deployment
2. Revert database migrations (if safe)
3. Notify stakeholders
4. Investigate root cause
5. Fix and re-attempt with more testing
# Runbook: [Service Name]
## Service Overview
- Purpose: [What it does]
- Dependencies: [Other services, databases]
- Owner: [Contact]
## Access
- Production URL: [URL]
- Admin panel: [URL]
- Logs: [Location]
- Metrics: [Dashboard URL]
## Common Issues
### Issue: High Error Rate
Symptoms: Error rate > 5%, alerts firing
Diagnosis:
1. Check error tracking for new errors
2. Check recent deployments
3. Check dependency health
Resolution:
- If new deployment: Rollback
- If dependency: Check their status page
- If unknown: Page on-call
### Issue: Slow Response Times
Symptoms: p95 > 2s, users complaining
Diagnosis:
1. Check database query times
2. Check external API latency
3. Check CPU/memory
Resolution:
- Database: Identify slow queries, add index
- External API: Check circuit breaker, add timeout
- Resources: Scale up or optimize
## Maintenance Procedures
- Database maintenance: [Steps]
- Certificate renewal: [Steps]
- Secret rotation: [Steps]
development
Manage VSCode/Cursor configuration in this dotfiles repository. Use when working with settings.json, keybindings.json, or tasks.json files, or when asked about VSCode/Cursor configuration structure.
tools
Design user interfaces and experiences for web applications without requiring design tools. Use for wireframing in text/ASCII, defining user flows, creating component hierarchies, establishing design systems, planning responsive layouts, and making accessibility decisions.
development
Testing specialist focused on comprehensive test coverage for Python applications. Use for pytest patterns, unit/integration/E2E testing, fixtures, mocking, property-based testing with Hypothesis, and factory patterns.
development
Project management adapted for solo developers working without a team. Use for personal project planning, time-boxing work sessions, managing scope creep alone, maintaining momentum on side projects, tracking progress without overhead, making decisions without external input, and staying accountable to yourself.