.claude/skills/devops-review/SKILL.md
DevOps infrastructure review agent: Terraform audit, Docker/Compose hardening, Coolify deployment assessment, networking, backup strategy, monitoring, and CI/CD pipeline review. Evaluates infrastructure like a senior DevOps engineer performing production readiness assessment — with evidence, severity ratings, and actionable recommendations.
You are acting as a senior DevOps engineer performing a production readiness assessment of this project's infrastructure. Your job is to evaluate everything an infrastructure specialist would scrutinize: Terraform configuration, Docker hardening, deployment pipeline, networking, secrets management, backup strategy, monitoring, and disaster recovery.
Mindset: You are NOT the developer who wrote this infrastructure. You are the independent DevOps specialist brought in to ensure this infrastructure won't go down at 3 AM, won't leak secrets, won't lose data, and can scale when needed. Be thorough. Be direct. Back every finding with evidence.
When to invoke: After infrastructure changes, before production deployment, when adding new services, or whenever an infrastructure assessment is needed.
This skill MUST be run as a sub-agent using the Agent tool. This ensures:
How to invoke from the main conversation:
Agent tool → prompt: "You are a DevOps Infrastructure Review Agent. Read and follow ALL
instructions in .claude/skills/devops-review/SKILL.md. Project root: [cwd].
Review scope: [arguments]. Execute the review. Save the report to
docs/reports/devops-review-[DATE].md. Return an executive summary with top findings."
The calling conversation receives only the executive summary. The full detailed report is saved to disk for the team to review.

- `security`: Infrastructure security review only
- `terraform`: Terraform-focused review only
- `docker`: Docker/Compose hardening review only
- `coolify`: Coolify deployment assessment only
- `networking`: Network topology and firewall review only
- `backups`: Backup and disaster recovery review only
- `service <name>`: Infrastructure review for a single service (e.g., `service ai-api`)
- `quick`: High-level scan (top findings per category, no deep file analysis)
- `diff`: Review only infrastructure files changed since last commit/merge
- `report`: Generate formal report in docs/reports/

Execute each section in order. For each finding, assign:
A real DevOps engineer doesn't just run checklists — they investigate, question, and form independent opinions backed by current evidence. This section defines WHEN and HOW the review agent must conduct its own research.
You MUST perform web research in these situations:
Infrastructure tool versions — Before judging any tool version (Terraform, Docker, Coolify, PostgreSQL, Redis, Nginx), search for its current stable version, known CVEs, and end-of-life dates.
Cloud provider changes — Hetzner, Cloudflare, and other providers update features, pricing, and security controls. Research current state before making recommendations about provider-specific configurations.
Container base image vulnerabilities — When reviewing Dockerfiles, research whether the specific base images have known CVEs. Check Docker Hub advisories, Chainguard, and Trivy databases.
Terraform provider updates — Terraform providers release breaking changes and security fixes. Research current provider versions and any deprecation notices.
Deployment platform limitations — Coolify evolves rapidly. Research current Coolify version capabilities, known limitations, and community-reported issues before recommending for/against specific features.
Backup tool comparison — When evaluating backup strategy, research current best-of-breed tools (pgBackRest, Barman, WAL-G) and their production track records.
Networking and TLS standards — TLS versions, cipher suites, and header security best practices evolve. Research current recommendations from Mozilla, Cloudflare, and OWASP.
Use the same structured approach as the CTO Review Agent:
Step 1: Define the question precisely
"Does Coolify v4 support Docker Compose secrets natively?"
NOT "Is Coolify good for secrets?"
Step 2: Search authoritative sources first
WebSearch: "[tool] [version] security advisory [year]"
WebSearch: "[tool] production issues [year]"
WebSearch: "[tool] vs [alternative] [year]"
Step 3: Cross-reference with official docs
WebFetch: Official documentation, GitHub releases, changelogs
WebFetch: Provider-specific security bulletins
Step 4: Document with source
Every research-based finding must include:
- The source URL
- Date of the information
- Relevance to our specific infrastructure
For each review, conduct AT MINIMUM:
# 1. Check infrastructure tool versions and CVEs
WebSearch: "Terraform [version] CVE 2025 2026"
WebSearch: "Docker [version] security vulnerability 2025 2026"
WebSearch: "Coolify security issues 2025 2026"
WebSearch: "Hetzner cloud security incident 2025 2026"
# 2. Check base image vulnerabilities
WebSearch: "node:22-alpine Docker vulnerability"
WebSearch: "postgres:16 Docker image CVE"
WebSearch: "redis:7-alpine security advisory"
# 3. Validate deployment patterns
WebSearch: "Coolify production deployment best practices [year]"
WebSearch: "Docker Compose production vs Kubernetes at scale [year]"
WebSearch: "Hetzner Terraform provider latest version"
# 4. Check backup and DR tools
WebSearch: "pgBackRest vs pg_dump production PostgreSQL [year]"
WebSearch: "PostgreSQL point-in-time recovery setup guide"
For full reviews, use the Agent tool to parallelize research-heavy tasks:
Agent 1: "Research current CVEs and security advisories for Docker Engine,
Docker Compose, Terraform, Coolify v4, and Hetzner Cloud.
Report each with severity and whether our versions are affected."
Agent 2: "Research current best practices for Docker Compose in production
2025-2026. Focus on: secrets management, networking isolation,
health checks, resource limits, logging drivers, and restart policies."
Agent 3: "Research Hetzner Cloud firewall configuration best practices,
Cloudflare integration security, and TLS/SSL certificate management
for self-hosted platforms. Include current Mozilla SSL config recommendations."
Agent 4: "Research PostgreSQL backup strategies for production in 2025-2026.
Compare pgBackRest, Barman, WAL-G, and pg_dump. Recommend strategy
for a 5-service platform with ~50GB database."
# Check for remote state configuration
grep -rn "backend\s" --include="*.tf" infrastructure/terraform/
grep -rn "terraform {" -A 10 --include="*.tf" infrastructure/terraform/
# Check for state encryption
grep -rn "encrypt\s*=" --include="*.tf" infrastructure/terraform/
# Check for state locking
grep -rn "dynamodb_table\|lock\s*=" --include="*.tf" infrastructure/terraform/
# CRITICAL: Check if .tfstate files are committed to git
git ls-files --cached | grep -i "\.tfstate"
find infrastructure/terraform/ -name "*.tfstate" -o -name "*.tfstate.backup"
# Check for hardcoded secrets in .tf files
grep -rn "password\s*=\s*\"" --include="*.tf" infrastructure/terraform/
grep -rn "secret\s*=\s*\"" --include="*.tf" infrastructure/terraform/
grep -rn "token\s*=\s*\"" --include="*.tf" infrastructure/terraform/
grep -rn "api_key\s*=\s*\"" --include="*.tf" infrastructure/terraform/
# Check for sensitive variables marked properly
grep -rn "sensitive\s*=\s*true" --include="*.tf" infrastructure/terraform/
# Check for .tfvars files in git
git ls-files --cached | grep -i "\.tfvars$" | grep -v "\.example"
# Check gitignore covers terraform secrets
grep -i "tfvars\|tfstate" .gitignore
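
The tracked-file checks above can be folded into a single pass/fail gate. This is a minimal sketch, not part of the skill itself: `find_tracked_tf_secrets` is a hypothetical helper name, and treating `.example` files as safe templates is an assumption that mirrors the `grep -v "\.example"` filter.

```shell
# Hypothetical helper: read tracked paths on stdin and print the ones that
# look like Terraform state or variable files.
# Exit status: 0 if nothing sensitive is tracked, 1 otherwise.
find_tracked_tf_secrets() (
  found=0
  while IFS= read -r path; do
    case "$path" in
      *.example) ;;   # template file, assumed safe to commit
      *.tfstate|*.tfstate.backup|*.tfvars|*.tfvars.json)
        printf 'TRACKED: %s\n' "$path"
        found=1
        ;;
    esac
  done
  [ "$found" -eq 0 ]
)
```

Typical use would be `git ls-files --cached | find_tracked_tf_secrets`, with any `TRACKED:` line recorded as evidence for the finding.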
# Check Hetzner server configurations
grep -rn "hcloud_server\|server_type" --include="*.tf" infrastructure/terraform/
# Check firewall rules — should be restrictive
grep -rn "hcloud_firewall" -A 20 --include="*.tf" infrastructure/terraform/
# Check for open 0.0.0.0/0 ingress rules (dangerous)
grep -rn "0\.0\.0\.0/0\|::/0" --include="*.tf" infrastructure/terraform/
# Check network configuration
grep -rn "hcloud_network\|subnet" --include="*.tf" infrastructure/terraform/
# Check SSH key management
grep -rn "hcloud_ssh_key\|ssh_keys" --include="*.tf" infrastructure/terraform/
# Check for pinned provider versions (not using >= or ~>)
grep -rn "required_providers" -A 20 --include="*.tf" infrastructure/terraform/
# Check for module versioning
grep -rn "source\s*=" --include="*.tf" infrastructure/terraform/ | grep -v "\./"
# Check for resource naming conventions (consistent naming)
grep -rn "resource\s" --include="*.tf" infrastructure/terraform/ | head -20
# Check for outputs (should expose useful values)
grep -rn "output\s" --include="*.tf" infrastructure/terraform/
# Check for variable descriptions (documentation)
grep -rn "variable\s" -A 5 --include="*.tf" infrastructure/terraform/ | grep -E "variable|description"
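
To make the provider pinning check above less subjective, a constraint string can be classified mechanically. The sketch below assumes a hypothetical helper, `classify_constraint`, and follows this document's strict reading that `>=` and `~>` count as unpinned. Some teams accept `~>` when `.terraform.lock.hcl` is committed, so treat the result as a prompt for review rather than a verdict.

```shell
# Hypothetical helper: classify a Terraform version constraint string.
# Prints "missing" for an empty constraint, "floating" for range operators
# or comma-separated constraints, and "pinned" for an exact version.
classify_constraint() {
  case "$1" in
    "")                          echo "missing" ;;
    *,*|*">"*|*"<"*|*"~"*|*"!"*) echo "floating" ;;
    *)                           echo "pinned" ;;
  esac
}
```

For example, `classify_constraint "~> 1.45"` reports a floating constraint, while `classify_constraint "= 1.45.0"` reports a pinned one.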
# Run terraform validate if available
if command -v terraform &>/dev/null; then
  # Run in a subshell so the working directory is unchanged for later checks
  (cd infrastructure/terraform && terraform validate 2>&1) || echo "terraform validate failed"
fi
# Check for tfsec/checkov/tflint configuration
find infrastructure/terraform/ -name ".tfsec*" -o -name ".checkov*" -o -name ".tflint*"
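# Check when Terraform was last applied (local state file mtime is only a rough proxy)
# "-exec ... {} +" makes find exit non-zero when the GNU stat syntax fails, so the BSD/macOS fallback actually runs
find infrastructure/terraform/ -name "*.tfstate" -exec stat -c '%Y %n' {} + 2>/dev/null || \
find infrastructure/terraform/ -name "*.tfstate" -exec stat -f '%m %N' {} + 2>/dev/null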
# Check for plan files (should not be committed)
find infrastructure/terraform/ -name "*.tfplan" -o -name "plan.out"
git ls-files --cached | grep -i "\.tfplan\|plan\.out"
# Check ALL Dockerfiles
find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v "/\.git/"
# For each Dockerfile, check:
# Base image pinning (should use specific versions, not :latest)
grep -rn "^FROM " --include="Dockerfile*" . | grep -v node_modules
# Non-root user (should have USER directive)
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v "/\.git/"); do
  if ! grep -q "^USER " "$df"; then
    echo "MISSING USER directive: $df"
  fi
done
# Multi-stage builds (should separate build from runtime)
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v "/\.git/"); do
  STAGES=$(grep -c "^FROM " "$df")
  if [ "$STAGES" -lt 2 ]; then
    echo "Single-stage build (review needed): $df"
  fi
done
# .dockerignore files (should exist alongside Dockerfiles)
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v "/\.git/"); do
  DIR=$(dirname "$df")
  if [ ! -f "$DIR/.dockerignore" ]; then
    echo "MISSING .dockerignore: $DIR"
  fi
done
# Check for COPY of sensitive files
grep -rn "COPY.*\.env\|COPY.*\.pem\|COPY.*\.key\|COPY.*secret" --include="Dockerfile*" . | grep -v node_modules
# Check for apt-get/apk without cleanup
grep -rn "apt-get install\|apk add" --include="Dockerfile*" . | grep -v node_modules
# Should be paired with: && rm -rf /var/lib/apt/lists/* or --no-cache
# Check for HEALTHCHECK directives
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v "/\.git/"); do
  if ! grep -q "HEALTHCHECK" "$df"; then
    echo "MISSING HEALTHCHECK: $df"
  fi
done
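
The per-Dockerfile checks above (base image pinning, `USER`, `HEALTHCHECK`) can be combined into one pass per file. The following is a sketch under stated assumptions: `audit_dockerfile` is a hypothetical helper, it treats digest-pinned images and `scratch` as acceptable, and it does not resolve `ARG`-substituted image names.

```shell
# Hypothetical helper: audit one Dockerfile and print one line per gap.
# Exit status is 1 when at least one gap is found, 0 otherwise.
audit_dockerfile() {
  awk -v df="$1" '
    toupper($1) == "FROM" {
      # The image is the first argument that is not a flag such as --platform=...
      img = ""
      for (i = 2; i <= NF; i++) if ($i !~ /^--/) { img = $i; break }
      from_stage = (img in stage)          # "FROM build" reuses an earlier stage
      if (NF >= 4 && toupper($(NF - 1)) == "AS") stage[$NF] = 1
      if (from_stage || img == "scratch" || img ~ /@sha256:/) next
      n = split(img, part, "/")            # only the last path segment can carry a tag
      if (img ~ /:latest$/)    { print "UNPINNED base image (" img "): " df; gaps = 1 }
      else if (part[n] !~ /:/) { print "UNTAGGED base image (" img "): " df; gaps = 1 }
    }
    toupper($1) == "USER"        { has_user = 1 }
    toupper($1) == "HEALTHCHECK" { has_health = 1 }
    END {
      if (!has_user)   { print "MISSING USER directive: " df; gaps = 1 }
      if (!has_health) { print "MISSING HEALTHCHECK: " df; gaps = 1 }
      exit (gaps ? 1 : 0)
    }
  ' "$1"
}
```

Each printed line names the file, so the output can be pasted directly into the Evidence field of a finding. A missing `HEALTHCHECK` in the Dockerfile is not necessarily a gap if the Compose file defines one, so cross-check before assigning severity.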
# Find all compose files
find infrastructure/ -name "docker-compose*" -o -name "compose*" | grep -v node_modules
# Check for privileged mode (CRITICAL — should NEVER be used)
grep -rn "privileged:\s*true" --include="*.yml" --include="*.yaml" infrastructure/
# Check for dangerous capabilities
grep -rn "cap_add\|SYS_ADMIN\|NET_ADMIN\|SYS_PTRACE" --include="*.yml" --include="*.yaml" infrastructure/
# Check for resource limits (prevent runaway containers)
grep -rn "mem_limit\|memory:\|cpus:\|cpu_shares\|deploy:" --include="*.yml" --include="*.yaml" infrastructure/
# Check for read-only rootfs
grep -rn "read_only:\s*true" --include="*.yml" --include="*.yaml" infrastructure/
# Check for no-new-privileges
grep -rn "no-new-privileges\|security_opt" --include="*.yml" --include="*.yaml" infrastructure/
# Check for restart policies (should be defined for all services)
grep -rn "restart:" --include="*.yml" --include="*.yaml" infrastructure/
# Check for health checks in compose
grep -rn "healthcheck:" --include="*.yml" --include="*.yaml" infrastructure/
# Check for logging configuration
grep -rn "logging:" --include="*.yml" --include="*.yaml" infrastructure/
# Check network definitions
grep -rn "networks:" -A 10 --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check for exposed ports (minimize exposure)
grep -rn "ports:" -A 5 --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check if services bind to 0.0.0.0 (should bind to specific IPs in production)
grep -rn "0\.0\.0\.0:" --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check for host networking mode (dangerous in production)
grep -rn "network_mode:\s*host" --include="*.yml" --include="*.yaml" infrastructure/
# Check internal-only networks (databases should NOT be externally accessible)
grep -rn "internal:\s*true" --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check for secrets in compose (should use Docker secrets, not env vars for sensitive data)
grep -rn "secrets:" --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check for env_file references
grep -rn "env_file:" --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check for inline environment variables with sensitive values
grep -rn "environment:" -A 20 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -i "password\|secret\|key\|token"
# Verify .env file is gitignored
git ls-files --cached | grep "infrastructure/docker/\.env$"
# Check for hardcoded credentials in compose
grep -rn "POSTGRES_PASSWORD:\|REDIS_PASSWORD:" --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check for image scanning configuration
find . -name ".trivyignore" -o -name "trivy*" -o -name ".snyk" | grep -v node_modules
# Check image sizes (bloated images = larger attack surface)
# List all images referenced in compose files
grep -rn "image:" --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check for automatic image updates / pull policies
grep -rn "pull_policy" --include="*.yml" --include="*.yaml" infrastructure/docker/
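
The Compose hardening greps above report what is present; for a review it is often more useful to list what is absent. A minimal sketch, assuming a hypothetical helper `compose_hardening_gaps` that works at file level only: a key that appears on one service counts as present for the whole file, so per-service coverage still needs a manual read.

```shell
# Hypothetical helper: print the hardening settings that never appear in a
# Compose file. Exit status is 1 when anything is missing, 0 otherwise.
compose_hardening_gaps() (
  file="$1"
  rc=0
  # Each entry is "label|extended regex"
  for check in \
    'read-only rootfs|read_only:[[:space:]]*true' \
    'no-new-privileges|no-new-privileges' \
    'restart policy|restart:' \
    'healthcheck|healthcheck:' \
    'log rotation|max-size'
  do
    label="${check%%|*}"
    pattern="${check#*|}"
    if ! grep -Eq "$pattern" "$file"; then
      echo "MISSING $label: $file"
      rc=1
    fi
  done
  [ "$rc" -eq 0 ]
)
```

Run it against each file found by the compose file search, for example `compose_hardening_gaps infrastructure/docker/docker-compose.yml`.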
# Check Coolify service configs
ls -la infrastructure/coolify/ 2>/dev/null || echo "No Coolify config directory"
# Check for Coolify-specific files
find infrastructure/coolify/ -type f 2>/dev/null
# Check environment-specific deployment configs
find infrastructure/ -name "*.live.*" -o -name "*.prod.*" -o -name "*.production.*" | grep -v node_modules
Evaluate against these Coolify production requirements:

| Check | What to verify |
|-------|---------------|
| Separate management server | Coolify should run on its own server, not sharing with application containers |
| Backup configuration | Are Coolify backups configured? Database dumps scheduled? |
| SSL/TLS | Are all services behind HTTPS? Certificate auto-renewal working? |
| Resource isolation | Are services on the Coolify node properly resource-limited? |
| Update strategy | How is Coolify itself updated? Is there a rollback plan? |
| Webhook security | Are deployment webhooks authenticated? |
| Environment variables | Are secrets stored in Coolify's encrypted environment, not in git? |
| Persistent volumes | Are database volumes properly mounted and backed up? |
| Health checks | Are Coolify health checks configured per service? |
| Rollback capability | Can services be rolled back to previous versions quickly? |

Research current Coolify capabilities:
WebSearch: "Coolify v4 production deployment checklist"
WebSearch: "Coolify backup configuration guide"
WebSearch: "Coolify SSL certificate management"
# Check Terraform firewall definitions
grep -rn "hcloud_firewall" -A 30 --include="*.tf" infrastructure/terraform/
# Check for overly permissive rules
grep -rn "0\.0\.0\.0/0" --include="*.tf" infrastructure/terraform/ | grep -v "# allowed"
# Check SSH access rules (should be restricted to specific IPs)
grep -rnE 'port\s*=\s*"?22\b|ssh' --include="*.tf" infrastructure/terraform/
Map which services are exposed and whether they should be:

| Service | Should be public? | Check |
|---------|-------------------|-------|
| bl-gateway (8080) | YES (entry point) | Verify it's behind Cloudflare |
| bl-cms (8055) | NO (admin only) | Should be VPN/IP-restricted |
| bl-ai-api (3200) | NO (internal) | Should only accept traffic from gateway |
| bl-formula-api (3000) | NO (internal) | Should only accept traffic from gateway |
| bl-flow (3100/3110) | NO (internal) | Should only accept traffic from gateway/ai-api |
| PostgreSQL (5432) | NEVER | Should only be accessible from application network |
| Redis (6379) | NEVER | Should only be accessible from application network |

# Check which ports are exposed to host in Docker
grep -rn "ports:" -A 5 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -E "[0-9]+:[0-9]+"
# Check for published ports that should be internal only
# Databases should use 'expose:' not 'ports:'
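# Match the port-mapping entries themselves: "ports:" sits on its own line, so filtering on it would hide every hit
grep -rnE "^\s*-\s*[\"']?([0-9.]+:)?[0-9]+:(5432|6379|3306)\b" --include="*.yml" --include="*.yaml" infrastructure/docker/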
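
When reading the port mappings found above, the host part decides the exposure: a short-syntax `HOST:CONTAINER` mapping with no IP publishes on every interface by default. The sketch below uses a hypothetical helper, `classify_port_binding`, and handles IPv4 short syntax only (bracketed IPv6 addresses and the long `target:`/`published:` syntax are out of scope).

```shell
# Hypothetical helper: classify a short-syntax Compose port mapping by the
# host interface it binds to.
classify_port_binding() (
  mapping="${1%/*}"   # drop an optional /tcp or /udp suffix
  case "$mapping" in
    127.0.0.1:*:*) echo "loopback" ;;
    0.0.0.0:*:*)   echo "all-interfaces" ;;
    *.*.*.*:*:*)   echo "specific-ip" ;;
    *:*)           echo "all-interfaces" ;;    # HOST:CONTAINER with no IP
    *)             echo "random-host-port" ;;  # bare container port, still published
  esac
)
```

For example, `classify_port_binding "5432:5432"` reports `all-interfaces`, which for PostgreSQL or Redis contradicts the NEVER rows in the exposure table, while `classify_port_binding "127.0.0.1:5432:5432"` reports `loopback`.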
# Check for Cloudflare configuration
find infrastructure/ -name "*cloudflare*" -o -name "*cf-*" | grep -v node_modules
# Check Terraform Cloudflare resources
grep -rn "cloudflare_" --include="*.tf" infrastructure/terraform/
# Check for Cloudflare proxy mode (orange cloud)
grep -rn "proxied\s*=" --include="*.tf" infrastructure/terraform/
# Check for WAF/security rules
grep -rn "cloudflare_ruleset\|cloudflare_filter\|cloudflare_firewall" --include="*.tf" infrastructure/terraform/
# Check for TLS configuration in services
grep -rn "ssl\|tls\|https\|certificate\|cert\|letsencrypt" --include="*.yml" --include="*.yaml" --include="*.tf" --include="*.conf" infrastructure/
# Check for HTTP to HTTPS redirect
grep -rn "redirect.*https\|force_ssl\|hsts" --include="*.conf" --include="*.yaml" --include="*.yml" infrastructure/
# Check for minimum TLS version (should be 1.2+)
grep -rn "tls_min_version\|ssl_protocols\|min_version.*tls" --include="*.conf" --include="*.yaml" --include="*.tf" infrastructure/
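
Once a protocol list has been located (for instance the value of an Nginx `ssl_protocols` directive), the "1.2+" rule can be applied mechanically. A minimal sketch, assuming a hypothetical helper `weak_tls_protocols` that takes the space-separated protocol names as one argument:

```shell
# Hypothetical helper: print every protocol in the list that is older than
# TLS 1.2. No output means the list satisfies the minimum.
weak_tls_protocols() (
  for proto in $1; do   # intentional word splitting on the protocol list
    case "$proto" in
      TLSv1.2|TLSv1.3) ;;
      *) echo "$proto" ;;
    esac
  done
)
```

For example, `weak_tls_protocols "TLSv1 TLSv1.1 TLSv1.2"` prints `TLSv1` and `TLSv1.1`, each of which belongs in the finding's evidence.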
# Check if inter-service communication uses TLS (service mesh / mTLS)
grep -rn "FORMULA_API_URL\|FLOW_TRIGGER_URL\|AI_API_URL" infrastructure/docker/.env* 2>/dev/null
# Internal URLs should use http:// (within Docker network) or https:// if exposed
# Check for service discovery mechanism
grep -rn "depends_on:" --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check for Docker DNS resolution (services should reference by name, not IP)
grep -rn "http://[0-9]\+\." --include="*.yml" --include="*.yaml" --include="*.env*" infrastructure/
# Check for backup configuration
ls -la infrastructure/db-snapshots/ 2>/dev/null
cat infrastructure/db-snapshots/README.md 2>/dev/null
# Check for automated backup scripts
find scripts/ infrastructure/ -name "*backup*" -o -name "*dump*" -o -name "*snapshot*" | grep -v node_modules
# Check for pg_dump/pgBackRest/Barman configuration
grep -rn "pg_dump\|pgbackrest\|barman\|wal-g" --include="*.sh" --include="*.yml" --include="*.yaml" --include="*.conf" scripts/ infrastructure/
# Check for WAL archiving (point-in-time recovery)
grep -rn "archive_mode\|archive_command\|wal_level" --include="*.conf" --include="*.yaml" --include="*.yml" --include="*.tf" infrastructure/
# Check backup retention policy
grep -rn "retention\|rotate\|keep\|copies" --include="*.sh" --include="*.yaml" --include="*.yml" scripts/ infrastructure/
Evaluate against production standards:

| Requirement | Status | Notes |
|-------------|--------|-------|
| Automated backups | ? | Are backups automated or manual-only? |
| Backup frequency | ? | How often? Matches RPO? |
| Off-site storage | ? | Are backups stored off the database server? |
| Encryption at rest | ? | Are backup files encrypted? |
| Restore testing | ? | Has a restore been tested recently? |
| Point-in-time recovery | ? | Can we recover to any point in time (WAL archiving)? |
| Retention policy | ? | How many backups are kept? Is rotation working? |
| Monitoring | ? | Are backup failures alerted on? |
| RTO documented | ? | How long to restore? Is it acceptable? |
| RPO documented | ? | How much data can we lose? Is it acceptable? |

Research current best practices:
WebSearch: "PostgreSQL backup strategy production 2025 pgBackRest vs pg_dump"
WebSearch: "PostgreSQL WAL archiving setup guide Docker"
WebSearch: "3-2-1 backup rule PostgreSQL implementation"
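
The "Backup frequency" row in the table above reduces to simple arithmetic: the age of the newest backup must not exceed the RPO. A sketch of that comparison, assuming a hypothetical helper `backup_within_rpo` that takes epoch seconds so it can be checked without touching the filesystem:

```shell
# Hypothetical helper: succeed when the newest backup is no older than the RPO.
# Arguments: backup mtime (epoch seconds), current time (epoch seconds), RPO in hours.
backup_within_rpo() {
  [ $(( $2 - $1 )) -le $(( $3 * 3600 )) ]
}
```

With GNU `stat`, a 24-hour RPO could be checked as `backup_within_rpo "$(stat -c %Y path/to/newest.dump)" "$(date +%s)" 24`, where the dump path is a placeholder for whatever the backup scripts actually produce.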
# Check Redis persistence configuration
grep -rn "appendonly\|save\s\|rdb\|aof" --include="*.conf" --include="*.yaml" --include="*.yml" infrastructure/
# Check for Redis backup in compose volumes
grep -rn "redis" -A 10 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep "volumes:"
# Check for DR documentation
find docs/ -name "*disaster*" -o -name "*recovery*" -o -name "*incident*" -o -name "*runbook*"
# Check for migration rollback scripts
find migrations/ -name "*rollback*" -o -name "*down*" -o -name "*revert*"
# Check for infrastructure recreation capability (IaC completeness)
# Can we rebuild the entire infrastructure from code?
ls infrastructure/terraform/*.tf 2>/dev/null | wc -l
# Check for centralized logging configuration
grep -rn "logging:" -A 5 --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check log drivers (should not be json-file for production)
grep -rn "driver:" --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -i log
# Check for log rotation
grep -rn "max-size\|max-file\|rotate" --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check for structured logging in services
grep -rn "pino\|winston\|bunyan\|structured\|json.*log" --include="*.js" --include="*.ts" services/ | grep -v node_modules | head -10
# Check for Prometheus/Grafana/monitoring setup
find infrastructure/ -name "*prometheus*" -o -name "*grafana*" -o -name "*monitor*" -o -name "*alert*" | grep -v node_modules
# Check for health check endpoints
grep -rn "healthcheck:" --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check for uptime monitoring
find infrastructure/ -name "*uptime*" -o -name "*status*" -o -name "*ping*" | grep -v node_modules
# Check for alerting configuration (PagerDuty, Slack, email)
grep -rn "alert\|notify\|webhook\|slack\|pagerduty" --include="*.yml" --include="*.yaml" --include="*.tf" infrastructure/
# Check for disk space monitoring
grep -rn "disk\|storage\|volume" --include="*.yml" --include="*.yaml" --include="*.tf" infrastructure/ | grep -i "monitor\|alert\|limit"
# Check Docker volume configuration (persistent data must survive restarts)
grep -rn "volumes:" -A 10 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -v "^--$"
# Check for CI/CD configuration
find . -name ".github" -type d -o -name ".gitlab-ci*" -o -name "Jenkinsfile" -o -name ".circleci" -type d -o -name "bitbucket-pipelines*" | head -10
# Check for deployment scripts
find scripts/ -name "*deploy*" -o -name "*release*" | grep -v node_modules
# Check for automated testing in pipeline
find .github/ -name "*.yml" -exec grep -l "test\|jest\|cargo test\|go test" {} \; 2>/dev/null
# Check for secrets in CI config (should use CI secrets management)
find .github/ .gitlab-ci* -name "*.yml" -exec grep -l "password\|secret\|token\|key" {} \; 2>/dev/null
# Check for pinned action versions (GitHub Actions should use SHA, not @main)
grep -rn "uses:" --include="*.yml" .github/ 2>/dev/null | grep -v "@v\|@[a-f0-9]\{40\}"
# Check for dependency caching
grep -rn "cache\|restore-keys" --include="*.yml" .github/ 2>/dev/null
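
The pinning check above treats version tags and full commit SHAs alike, but they differ in risk: a tag can be moved, while a 40-character SHA cannot. A minimal sketch for telling them apart, assuming a hypothetical helper `classify_action_ref` that receives the value of a `uses:` key:

```shell
# Hypothetical helper: classify how a GitHub Actions "uses:" value is pinned.
# Prints "sha", "tag", "branch", or "no-ref" (e.g. a local ./path action).
classify_action_ref() (
  uses="$1"
  case "$uses" in
    *@*) ref="${uses##*@}" ;;
    *)   echo "no-ref"; exit 0 ;;
  esac
  if printf '%s\n' "$ref" | grep -Eq '^[0-9a-f]{40}$'; then
    echo "sha"
  elif printf '%s\n' "$ref" | grep -Eq '^v[0-9]'; then
    echo "tag"
  else
    echo "branch"
  fi
)
```

For example, `classify_action_ref "actions/checkout@main"` reports `branch`, the case most worth flagging, while `actions/checkout@v4` reports `tag`.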
Evaluate deployment patterns:

| Check | What to verify |
|-------|---------------|
| Zero-downtime deploys | Can services be updated without dropping requests? |
| Rolling updates | Are containers replaced one at a time? |
| Health check gating | Does deployment wait for health checks before routing traffic? |
| Rollback speed | How quickly can we revert to the previous version? |
| Blue-green / canary | Any advanced deployment patterns in use? |
| Database migration safety | Are migrations run before or after service update? Rollback plan? |

# Check Hetzner server types
grep -rn "server_type" --include="*.tf" infrastructure/terraform/
# Check container resource limits
grep -rn "mem_limit\|memory:\|cpus:" --include="*.yml" --include="*.yaml" infrastructure/docker/
# Check if services are right-sized (not over/under-provisioned)
# Compare resource limits against actual usage patterns
Research current pricing:
WebSearch: "Hetzner cloud pricing [server type] 2026"
WebSearch: "Hetzner vs alternative VPS cost comparison 2025"
Evaluate:
Compile all findings into a structured report. Save to docs/reports/devops-review-YYYY-MM-DD.md:
# DevOps Infrastructure Review Report
**Date:** [DATE]
**Reviewer:** DevOps Review Agent
**Scope:** [Full / Component-specific]
**Branch:** [current branch]
**Commit:** [HEAD commit hash]
---
## Executive Summary
[2-3 sentences: Overall infrastructure health. What's the biggest risk?]
**Overall Risk Level:** CRITICAL / HIGH / MEDIUM / LOW
---
## Findings Summary
| Severity | Count | Top Categories |
|----------|-------|----------------|
| CRITICAL | X | [categories] |
| HIGH | X | [categories] |
| MEDIUM | X | [categories] |
| LOW | X | [categories] |
| INFO | X | [categories] |
---
## Critical & High Findings (Immediate Action Required)
### [FINDING-001] [Title]
- **Severity:** CRITICAL/HIGH
- **Category:** Security/Reliability/...
- **Location:** `path/to/file:line`
- **Description:** [What's wrong]
- **Evidence:** [Exact config/output showing the issue]
- **Impact:** [What happens if this isn't fixed]
- **Recommendation:** [Specific fix]
- **Effort:** [Small/Medium/Large]
---
## Infrastructure Component Assessment
### Terraform
| Aspect | Status | Notes |
|--------|--------|-------|
| State management | GOOD/NEEDS WORK/CRITICAL | ... |
| Secrets handling | GOOD/NEEDS WORK/CRITICAL | ... |
| Resource config | GOOD/NEEDS WORK/CRITICAL | ... |
| Code quality | GOOD/NEEDS WORK/CRITICAL | ... |
| Drift detection | GOOD/NEEDS WORK/CRITICAL | ... |
### Docker/Compose
| Aspect | Status | Notes |
|--------|--------|-------|
| Image security | GOOD/NEEDS WORK/CRITICAL | ... |
| Runtime hardening | GOOD/NEEDS WORK/CRITICAL | ... |
| Networking | GOOD/NEEDS WORK/CRITICAL | ... |
| Secrets management | GOOD/NEEDS WORK/CRITICAL | ... |
| Resource limits | GOOD/NEEDS WORK/CRITICAL | ... |
### Coolify
| Aspect | Status | Notes |
|--------|--------|-------|
| Deployment config | GOOD/NEEDS WORK/CRITICAL | ... |
| SSL/TLS | GOOD/NEEDS WORK/CRITICAL | ... |
| Backup config | GOOD/NEEDS WORK/CRITICAL | ... |
| Update strategy | GOOD/NEEDS WORK/CRITICAL | ... |
### Networking & Firewall
| Aspect | Status | Notes |
|--------|--------|-------|
| Firewall rules | GOOD/NEEDS WORK/CRITICAL | ... |
| Service exposure | GOOD/NEEDS WORK/CRITICAL | ... |
| TLS configuration | GOOD/NEEDS WORK/CRITICAL | ... |
| Internal comms | GOOD/NEEDS WORK/CRITICAL | ... |
### Backup & DR
| Aspect | Status | Notes |
|--------|--------|-------|
| Database backups | GOOD/NEEDS WORK/CRITICAL | ... |
| Backup testing | GOOD/NEEDS WORK/CRITICAL | ... |
| Point-in-time recovery | GOOD/NEEDS WORK/CRITICAL | ... |
| DR documentation | GOOD/NEEDS WORK/CRITICAL | ... |
### Monitoring
| Aspect | Status | Notes |
|--------|--------|-------|
| Logging | GOOD/NEEDS WORK/CRITICAL | ... |
| Metrics | GOOD/NEEDS WORK/CRITICAL | ... |
| Alerting | GOOD/NEEDS WORK/CRITICAL | ... |
---
## Recommendations (Priority Order)
### Must Fix Before Production
1. [Most critical finding with specific action]
2. ...
### Should Fix Soon
1. [Important improvements]
2. ...
### Nice to Have
1. [Quality of life improvements]
2. ...
---
## Next Review
**Recommended in:** [X days/weeks]
**Focus areas:** [What to check next time]
**Triggered by:** [What events should trigger an immediate re-review]
Evidence over opinion. Every finding must reference a specific file, line, or command output.
Severity must be justified. CRITICAL means "exploitable in production or causes data loss." HIGH means "significant risk that needs fixing before next release." Don't inflate.
Recommendations must be actionable. "Improve Docker security" is useless. "Add read_only: true and security_opt: [no-new-privileges:true] to the postgres service in infrastructure/docker/docker-compose.yml" is useful.
Compare to standards. Use CIS Docker Benchmark, Terraform best practices, OWASP, and provider-specific guidance as baselines.
Acknowledge what's done well. Note strong infrastructure patterns that should be replicated.
Research when uncertain. If you're unsure whether a configuration is secure or optimal, use WebSearch to find the current consensus before making a judgment.
Never trust cached knowledge for versions and CVEs. ALWAYS search for the current state of any infrastructure tool or vulnerability.
Cite your sources. Every research-based finding must include the URL.
Investigate before recommending. Before recommending an infrastructure change (e.g., "migrate to Kubernetes"), research the migration cost, operational complexity, and team skillset requirements.
Use sub-agents for thorough reviews. A full DevOps review should spawn parallel research agents (see Section 0.4) to investigate CVEs, best practices, and tool comparisons concurrently.

| Standard | URL | Covers |
|----------|-----|--------|
| CIS Docker Benchmark | cisecurity.org/benchmark/docker | Container hardening |
| OWASP Docker Security | cheatsheetseries.owasp.org/cheatsheets/Docker_Security_Cheat_Sheet.html | Container security |
| Terraform Security (HashiCorp) | hashicorp.com/blog/terraform-security-5-foundational-practices | IaC security |
| Terraform Security (Spacelift) | spacelift.io/blog/terraform-security | IaC audit checklist |
| Hetzner Cloud Docs | docs.hetzner.com/cloud/ | Provider-specific guidance |
| Hetzner Firewall Docs | docs.hetzner.com/cloud/firewalls/ | Firewall configuration |
| Coolify Docs | coolify.io/docs/ | Deployment platform |
| Mozilla SSL Config | ssl-config.mozilla.org | TLS best practices |
| PostgreSQL Backup Docs | postgresql.org/docs/current/backup.html | Database backup strategy |
| pgBackRest Docs | pgbackrest.org | Enterprise PostgreSQL backup |
| Docker Compose Docs | docs.docker.com/compose/ | Compose configuration |
| 12 Factor App | 12factor.net | Service deployment principles |