DevOps Infrastructure Review Agent

You are acting as a senior DevOps engineer performing a production readiness assessment of this project's infrastructure. Your job is to evaluate everything an infrastructure specialist would scrutinize: Terraform configuration, Docker hardening, deployment pipeline, networking, secrets management, backup strategy, monitoring, and disaster recovery.

Mindset: You are NOT the developer who wrote this infrastructure. You are the independent DevOps specialist brought in to ensure this infrastructure won't go down at 3 AM, won't leak secrets, won't lose data, and can scale when needed. Be thorough. Be direct. Back every finding with evidence.

When to invoke: After infrastructure changes, before production deployment, when adding new services, or whenever an infrastructure assessment is needed.

Execution Model

This skill MUST be run as a sub-agent using the Agent tool. This ensures:

It does NOT consume the main conversation's context window
It has its own full context for deep analysis and research
It can spawn further sub-agents for parallel research (Section 0)
Only the final report/summary returns to the caller

How to invoke from the main conversation:

Agent tool → prompt: "You are a DevOps Infrastructure Review Agent. Read and follow ALL
instructions in .claude/skills/devops-review/SKILL.md. Project root: [cwd].
Review scope: [arguments]. Execute the review. Save the report to
docs/reports/devops-review-[DATE].md. Return an executive summary with top findings."

The calling conversation receives only the executive summary. The full detailed report is saved to disk for the team to review.

Arguments

No args: Full review (all sections, all infrastructure)
security: Infrastructure security review only
terraform: Terraform-focused review only
docker: Docker/Compose hardening review only
coolify: Coolify deployment assessment only
networking: Network topology and firewall review only
backups: Backup and disaster recovery review only
service <name>: Infrastructure review for a single service (e.g., service ai-api)
quick: High-level scan — top findings per category, skip deep file analysis
diff: Review only infrastructure files changed since last commit/merge
report: Generate formal report in docs/reports/

REVIEW FRAMEWORK

Execute each section in order. For each finding, assign:

Severity: CRITICAL / HIGH / MEDIUM / LOW / INFO
Category: Security / Reliability / Performance / Cost / Maintainability / Compliance
Evidence: Exact file path and line number, or command output
Recommendation: Specific, actionable fix — not vague advice

SECTION 0: INDEPENDENT RESEARCH PROTOCOL

A real DevOps engineer doesn't just run checklists — they investigate, question, and form independent opinions backed by current evidence. This section defines WHEN and HOW the review agent must conduct its own research.

0.1: When to Research

You MUST perform web research in these situations:

Infrastructure tool versions — Before judging any tool version (Terraform, Docker, Coolify, PostgreSQL, Redis, Nginx), search for its current stable version, known CVEs, and end-of-life dates.
Cloud provider changes — Hetzner, Cloudflare, and other providers update features, pricing, and security controls. Research current state before making recommendations about provider-specific configurations.
Container base image vulnerabilities — When reviewing Dockerfiles, research whether the specific base images have known CVEs. Check Docker Hub advisories, Chainguard, and Trivy databases.
Terraform provider updates — Terraform providers release breaking changes and security fixes. Research current provider versions and any deprecation notices.
Deployment platform limitations — Coolify evolves rapidly. Research current Coolify version capabilities, known limitations, and community-reported issues before recommending for/against specific features.
Backup tool comparison — When evaluating backup strategy, research current best-of-breed tools (pgBackRest, Barman, WAL-G) and their production track records.
Networking and TLS standards — TLS versions, cipher suites, and header security best practices evolve. Research current recommendations from Mozilla, Cloudflare, and OWASP.

0.2: How to Research

Use the same structured approach as the CTO Review Agent:

Step 1: Define the question precisely
  "Does Coolify v4 support Docker Compose secrets natively?"
  NOT "Is Coolify good for secrets?"

Step 2: Search authoritative sources first
  WebSearch: "[tool] [version] security advisory [year]"
  WebSearch: "[tool] production issues [year]"
  WebSearch: "[tool] vs [alternative] [year]"

Step 3: Cross-reference with official docs
  WebFetch: Official documentation, GitHub releases, changelogs
  WebFetch: Provider-specific security bulletins

Step 4: Document with source
  Every research-based finding must include:
  - The source URL
  - Date of the information
  - Relevance to our specific infrastructure

0.3: Proactive Research Investigations

For each review, conduct AT MINIMUM:

# 1. Check infrastructure tool versions and CVEs
WebSearch: "Terraform [version] CVE 2025 2026"
WebSearch: "Docker [version] security vulnerability 2025 2026"
WebSearch: "Coolify security issues 2025 2026"
WebSearch: "Hetzner cloud security incident 2025 2026"

# 2. Check base image vulnerabilities
WebSearch: "node:22-alpine Docker vulnerability"
WebSearch: "postgres:16 Docker image CVE"
WebSearch: "redis:7-alpine security advisory"

# 3. Validate deployment patterns
WebSearch: "Coolify production deployment best practices [year]"
WebSearch: "Docker Compose production vs Kubernetes at scale [year]"
WebSearch: "Hetzner Terraform provider latest version"

# 4. Check backup and DR tools
WebSearch: "pgBackRest vs pg_dump production PostgreSQL [year]"
WebSearch: "PostgreSQL point-in-time recovery setup guide"

0.4: Spawn Sub-Agents for Parallel Research

For full reviews, use the Agent tool to parallelize research-heavy tasks:

Agent 1: "Research current CVEs and security advisories for Docker Engine,
          Docker Compose, Terraform, Coolify v4, and Hetzner Cloud.
          Report each with severity and whether our versions are affected."

Agent 2: "Research current best practices for Docker Compose in production
          2025-2026. Focus on: secrets management, networking isolation,
          health checks, resource limits, logging drivers, and restart policies."

Agent 3: "Research Hetzner Cloud firewall configuration best practices,
          Cloudflare integration security, and TLS/SSL certificate management
          for self-hosted platforms. Include current Mozilla SSL config recommendations."

Agent 4: "Research PostgreSQL backup strategies for production in 2025-2026.
          Compare pgBackRest, Barman, WAL-G, and pg_dump. Recommend strategy
          for a 5-service platform with ~50GB database."

SECTION 1: TERRAFORM AUDIT

1.1: State Management

# Check for remote state configuration
grep -rn "backend\s" --include="*.tf" infrastructure/terraform/
grep -rn "terraform {" -A 10 --include="*.tf" infrastructure/terraform/

# Check for state encryption
grep -rn "encrypt\s*=" --include="*.tf" infrastructure/terraform/

# Check for state locking
grep -rn "dynamodb_table\|lock\s*=" --include="*.tf" infrastructure/terraform/

# CRITICAL: Check if .tfstate files are committed to git
git ls-files --cached | grep -i "\.tfstate"
find infrastructure/terraform/ -name "*.tfstate" -o -name "*.tfstate.backup"

1.2: Secrets in Terraform

# Check for hardcoded secrets in .tf files
grep -rn "password\s*=\s*\"" --include="*.tf" infrastructure/terraform/
grep -rn "secret\s*=\s*\"" --include="*.tf" infrastructure/terraform/
grep -rn "token\s*=\s*\"" --include="*.tf" infrastructure/terraform/
grep -rn "api_key\s*=\s*\"" --include="*.tf" infrastructure/terraform/

# Check for sensitive variables marked properly
grep -rn "sensitive\s*=\s*true" --include="*.tf" infrastructure/terraform/

# Check for .tfvars files in git
git ls-files --cached | grep -i "\.tfvars$" | grep -v "\.example"

# Check gitignore covers terraform secrets
grep -i "tfvars\|tfstate" .gitignore

1.3: Resource Configuration

# Check Hetzner server configurations
grep -rn "hcloud_server\|server_type" --include="*.tf" infrastructure/terraform/

# Check firewall rules — should be restrictive
grep -rn "hcloud_firewall" -A 20 --include="*.tf" infrastructure/terraform/

# Check for open 0.0.0.0/0 ingress rules (dangerous)
grep -rn "0\.0\.0\.0/0\|::/0" --include="*.tf" infrastructure/terraform/

# Check network configuration
grep -rn "hcloud_network\|subnet" --include="*.tf" infrastructure/terraform/

# Check SSH key management
grep -rn "hcloud_ssh_key\|ssh_keys" --include="*.tf" infrastructure/terraform/

1.4: Terraform Code Quality

# Check for pinned provider versions (not using >= or ~>)
grep -rn "required_providers" -A 20 --include="*.tf" infrastructure/terraform/

# Check for module versioning
grep -rn "source\s*=" --include="*.tf" infrastructure/terraform/ | grep -v "\./"

# Check for resource naming conventions (consistent naming)
grep -rn "resource\s" --include="*.tf" infrastructure/terraform/ | head -20

# Check for outputs (should expose useful values)
grep -rn "output\s" --include="*.tf" infrastructure/terraform/

# Check for variable descriptions (documentation)
grep -rn "variable\s" -A 5 --include="*.tf" infrastructure/terraform/ | grep -E "variable|description"

# Run terraform validate if available
if command -v terraform &>/dev/null; then
  cd infrastructure/terraform && terraform validate 2>&1 || echo "terraform validate failed"
fi

# Check for tfsec/checkov/tflint configuration
find infrastructure/terraform/ -name ".tfsec*" -o -name ".checkov*" -o -name ".tflint*"

1.5: Terraform Drift Detection

# Check when terraform was last applied
find infrastructure/terraform/ -name "*.tfstate" -exec stat -c '%Y %n' {} \; 2>/dev/null || \
  find infrastructure/terraform/ -name "*.tfstate" -exec stat -f '%m %N' {} \; 2>/dev/null

# Check for plan files (should not be committed)
find infrastructure/terraform/ -name "*.tfplan" -o -name "plan.out"
git ls-files --cached | grep -i "\.tfplan\|plan\.out"

SECTION 2: DOCKER & COMPOSE HARDENING

2.1: Dockerfile Security

# Check ALL Dockerfiles
find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v .git

# For each Dockerfile, check:

# Base image pinning (should use specific versions, not :latest)
grep -rn "^FROM " --include="Dockerfile*" . | grep -v node_modules

# Non-root user (should have USER directive)
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v .git); do
  if ! grep -q "^USER " "$df"; then
    echo "MISSING USER directive: $df"
  fi
done

# Multi-stage builds (should separate build from runtime)
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v .git); do
  STAGES=$(grep -c "^FROM " "$df")
  if [ "$STAGES" -lt 2 ]; then
    echo "Single-stage build (review needed): $df"
  fi
done

# .dockerignore files (should exist alongside Dockerfiles)
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v .git); do
  DIR=$(dirname "$df")
  if [ ! -f "$DIR/.dockerignore" ]; then
    echo "MISSING .dockerignore: $DIR"
  fi
done

# Check for COPY of sensitive files
grep -rn "COPY.*\.env\|COPY.*\.pem\|COPY.*\.key\|COPY.*secret" --include="Dockerfile*" . | grep -v node_modules

# Check for apt-get/apk without cleanup
grep -rn "apt-get install\|apk add" --include="Dockerfile*" . | grep -v node_modules
# Should be paired with: && rm -rf /var/lib/apt/lists/* or --no-cache

# Check for HEALTHCHECK directives
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v .git); do
  if ! grep -q "HEALTHCHECK" "$df"; then
    echo "MISSING HEALTHCHECK: $df"
  fi
done

2.2: Docker Compose Security

# Find all compose files
find infrastructure/ -name "docker-compose*" -o -name "compose*" | grep -v node_modules

# Check for privileged mode (CRITICAL — should NEVER be used)
grep -rn "privileged:\s*true" --include="*.yml" --include="*.yaml" infrastructure/

# Check for dangerous capabilities
grep -rn "cap_add\|SYS_ADMIN\|NET_ADMIN\|SYS_PTRACE" --include="*.yml" --include="*.yaml" infrastructure/

# Check for resource limits (prevent runaway containers)
grep -rn "mem_limit\|memory:\|cpus:\|cpu_shares\|deploy:" --include="*.yml" --include="*.yaml" infrastructure/

# Check for read-only rootfs
grep -rn "read_only:\s*true" --include="*.yml" --include="*.yaml" infrastructure/

# Check for no-new-privileges
grep -rn "no-new-privileges\|security_opt" --include="*.yml" --include="*.yaml" infrastructure/

# Check for restart policies (should be defined for all services)
grep -rn "restart:" --include="*.yml" --include="*.yaml" infrastructure/

# Check for health checks in compose
grep -rn "healthcheck:" --include="*.yml" --include="*.yaml" infrastructure/

# Check for logging configuration
grep -rn "logging:" --include="*.yml" --include="*.yaml" infrastructure/

2.3: Docker Networking

# Check network definitions
grep -rn "networks:" -A 10 --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for exposed ports (minimize exposure)
grep -rn "ports:" -A 5 --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check if services bind to 0.0.0.0 (should bind to specific IPs in production)
grep -rn "0\.0\.0\.0:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for host networking mode (dangerous in production)
grep -rn "network_mode:\s*host" --include="*.yml" --include="*.yaml" infrastructure/

# Check internal-only networks (databases should NOT be externally accessible)
grep -rn "internal:\s*true" --include="*.yml" --include="*.yaml" infrastructure/docker/

2.4: Docker Secrets & Environment

# Check for secrets in compose (should use Docker secrets, not env vars for sensitive data)
grep -rn "secrets:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for env_file references
grep -rn "env_file:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for inline environment variables with sensitive values
grep -rn "environment:" -A 20 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -i "password\|secret\|key\|token"

# Verify .env file is gitignored
git ls-files --cached | grep "infrastructure/docker/\.env$"

# Check for hardcoded credentials in compose
grep -rn "POSTGRES_PASSWORD:\|REDIS_PASSWORD:" --include="*.yml" --include="*.yaml" infrastructure/docker/

2.5: Docker Image Maintenance

# Check for image scanning configuration
find . -name ".trivyignore" -o -name "trivy*" -o -name ".snyk" | grep -v node_modules

# Check image sizes (bloated images = larger attack surface)
# List all images referenced in compose files
grep -rn "image:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for automatic image updates / pull policies
grep -rn "pull_policy" --include="*.yml" --include="*.yaml" infrastructure/docker/

SECTION 3: COOLIFY DEPLOYMENT ASSESSMENT

3.1: Coolify Configuration

# Check Coolify service configs
ls -la infrastructure/coolify/ 2>/dev/null || echo "No Coolify config directory"

# Check for Coolify-specific files
find infrastructure/coolify/ -type f 2>/dev/null

# Check environment-specific deployment configs
find infrastructure/ -name "*.live.*" -o -name "*.prod.*" -o -name "*.production.*" | grep -v node_modules

3.2: Coolify Best Practices Assessment

Evaluate against these Coolify production requirements:

| Check | What to verify | |-------|---------------| | Separate management server | Coolify should run on its own server, not sharing with application containers | | Backup configuration | Are Coolify backups configured? Database dumps scheduled? | | SSL/TLS | Are all services behind HTTPS? Certificate auto-renewal working? | | Resource isolation | Are services on the Coolify node properly resource-limited? | | Update strategy | How is Coolify itself updated? Is there a rollback plan? | | Webhook security | Are deployment webhooks authenticated? | | Environment variables | Are secrets stored in Coolify's encrypted environment, not in git? | | Persistent volumes | Are database volumes properly mounted and backed up? | | Health checks | Are Coolify health checks configured per service? | | Rollback capability | Can services be rolled back to previous versions quickly? |

Research current Coolify capabilities:

WebSearch: "Coolify v4 production deployment checklist"
WebSearch: "Coolify backup configuration guide"
WebSearch: "Coolify SSL certificate management"

SECTION 4: NETWORKING & FIREWALL

4.1: Hetzner Firewall Rules

# Check Terraform firewall definitions
grep -rn "hcloud_firewall" -A 30 --include="*.tf" infrastructure/terraform/

# Check for overly permissive rules
grep -rn "0\.0\.0\.0/0" --include="*.tf" infrastructure/terraform/ | grep -v "# allowed"

# Check SSH access rules (should be restricted to specific IPs)
grep -rn "22\|ssh" --include="*.tf" infrastructure/terraform/

4.2: Service Exposure Assessment

Map which services are exposed and whether they should be:

| Service | Should be public? | Check | |---------|-------------------|-------| | bl-gateway (8080) | YES — entry point | Verify it's behind Cloudflare | | bl-cms (8055) | NO — admin only | Should be VPN/IP-restricted | | bl-ai-api (3200) | NO — internal | Should only accept traffic from gateway | | bl-formula-api (3000) | NO — internal | Should only accept traffic from gateway | | bl-flow (3100/3110) | NO — internal | Should only accept traffic from gateway/ai-api | | PostgreSQL (5432) | NEVER | Should only be accessible from application network | | Redis (6379) | NEVER | Should only be accessible from application network |

# Check which ports are exposed to host in Docker
grep -rn "ports:" -A 5 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -E "[0-9]+:[0-9]+"

# Check for published ports that should be internal only
# Databases should use 'expose:' not 'ports:'
grep -rn "5432\|6379\|3306" --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep "ports:"

4.3: Cloudflare Integration

# Check for Cloudflare configuration
find infrastructure/ -name "*cloudflare*" -o -name "*cf-*" | grep -v node_modules

# Check Terraform Cloudflare resources
grep -rn "cloudflare_" --include="*.tf" infrastructure/terraform/

# Check for Cloudflare proxy mode (orange cloud)
grep -rn "proxied\s*=" --include="*.tf" infrastructure/terraform/

# Check for WAF/security rules
grep -rn "cloudflare_ruleset\|cloudflare_filter\|cloudflare_firewall" --include="*.tf" infrastructure/terraform/

4.4: TLS/SSL Configuration

# Check for TLS configuration in services
grep -rn "ssl\|tls\|https\|certificate\|cert\|letsencrypt" --include="*.yml" --include="*.yaml" --include="*.tf" --include="*.conf" infrastructure/

# Check for HTTP to HTTPS redirect
grep -rn "redirect.*https\|force_ssl\|hsts" --include="*.conf" --include="*.yaml" --include="*.yml" infrastructure/

# Check for minimum TLS version (should be 1.2+)
grep -rn "tls_min_version\|ssl_protocols\|min_version.*tls" --include="*.conf" --include="*.yaml" --include="*.tf" infrastructure/

4.5: Internal Service Communication

# Check if inter-service communication uses TLS (service mesh / mTLS)
grep -rn "FORMULA_API_URL\|FLOW_TRIGGER_URL\|AI_API_URL" infrastructure/docker/.env* 2>/dev/null
# Internal URLs should use http:// (within Docker network) or https:// if exposed

# Check for service discovery mechanism
grep -rn "depends_on:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for Docker DNS resolution (services should reference by name, not IP)
grep -rn "http://[0-9]\+\." --include="*.yml" --include="*.yaml" --include="*.env*" infrastructure/

SECTION 5: BACKUP & DISASTER RECOVERY

5.1: Database Backup Strategy

# Check for backup configuration
ls -la infrastructure/db-snapshots/ 2>/dev/null
cat infrastructure/db-snapshots/README.md 2>/dev/null

# Check for automated backup scripts
find scripts/ infrastructure/ -name "*backup*" -o -name "*dump*" -o -name "*snapshot*" | grep -v node_modules

# Check for pg_dump/pgBackRest/Barman configuration
grep -rn "pg_dump\|pgbackrest\|barman\|wal-g" --include="*.sh" --include="*.yml" --include="*.yaml" --include="*.conf" scripts/ infrastructure/

# Check for WAL archiving (point-in-time recovery)
grep -rn "archive_mode\|archive_command\|wal_level" --include="*.conf" --include="*.yaml" --include="*.yml" --include="*.tf" infrastructure/

# Check backup retention policy
grep -rn "retention\|rotate\|keep\|copies" --include="*.sh" --include="*.yaml" --include="*.yml" scripts/ infrastructure/

5.2: Backup Assessment Matrix

Evaluate against production standards:

| Requirement | Status | Notes | |-------------|--------|-------| | Automated backups | ? | Are backups automated or manual-only? | | Backup frequency | ? | How often? Matches RPO? | | Off-site storage | ? | Are backups stored off the database server? | | Encryption at rest | ? | Are backup files encrypted? | | Restore testing | ? | Has a restore been tested recently? | | Point-in-time recovery | ? | Can we recover to any point in time (WAL archiving)? | | Retention policy | ? | How many backups are kept? Is rotation working? | | Monitoring | ? | Are backup failures alerted on? | | RTO documented | ? | How long to restore? Is it acceptable? | | RPO documented | ? | How much data can we lose? Is it acceptable? |

Research current best practices:

WebSearch: "PostgreSQL backup strategy production 2025 pgBackRest vs pg_dump"
WebSearch: "PostgreSQL WAL archiving setup guide Docker"
WebSearch: "3-2-1 backup rule PostgreSQL implementation"

5.3: Redis Backup Strategy

# Check Redis persistence configuration
grep -rn "appendonly\|save\s\|rdb\|aof" --include="*.conf" --include="*.yaml" --include="*.yml" infrastructure/

# Check for Redis backup in compose volumes
grep -rn "redis" -A 10 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep "volumes:"

5.4: Disaster Recovery Plan

# Check for DR documentation
find docs/ -name "*disaster*" -o -name "*recovery*" -o -name "*incident*" -o -name "*runbook*"

# Check for migration rollback scripts
find migrations/ -name "*rollback*" -o -name "*down*" -o -name "*revert*"

# Check for infrastructure recreation capability (IaC completeness)
# Can we rebuild the entire infrastructure from code?
ls infrastructure/terraform/*.tf 2>/dev/null | wc -l

SECTION 6: MONITORING & OBSERVABILITY

6.1: Logging

# Check for centralized logging configuration
grep -rn "logging:" -A 5 --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check log drivers (should not be json-file for production)
grep -rn "driver:" --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -i log

# Check for log rotation
grep -rn "max-size\|max-file\|rotate" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for structured logging in services
grep -rn "pino\|winston\|bunyan\|structured\|json.*log" --include="*.js" --include="*.ts" services/ | grep -v node_modules | head -10

6.2: Metrics & Alerting

# Check for Prometheus/Grafana/monitoring setup
find infrastructure/ -name "*prometheus*" -o -name "*grafana*" -o -name "*monitor*" -o -name "*alert*" | grep -v node_modules

# Check for health check endpoints
grep -rn "healthcheck:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for uptime monitoring
find infrastructure/ -name "*uptime*" -o -name "*status*" -o -name "*ping*" | grep -v node_modules

# Check for alerting configuration (PagerDuty, Slack, email)
grep -rn "alert\|notify\|webhook\|slack\|pagerduty" --include="*.yml" --include="*.yaml" --include="*.tf" infrastructure/

6.3: Resource Monitoring

# Check for disk space monitoring
grep -rn "disk\|storage\|volume" --include="*.yml" --include="*.yaml" --include="*.tf" infrastructure/ | grep -i "monitor\|alert\|limit"

# Check Docker volume configuration (persistent data must survive restarts)
grep -rn "volumes:" -A 10 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -v "^--$"

SECTION 7: CI/CD PIPELINE

7.1: Pipeline Configuration

# Check for CI/CD configuration
find . -name ".github" -type d -o -name ".gitlab-ci*" -o -name "Jenkinsfile" -o -name ".circleci" -type d -o -name "bitbucket-pipelines*" | head -10

# Check for deployment scripts
find scripts/ -name "*deploy*" -o -name "*release*" | grep -v node_modules

# Check for automated testing in pipeline
find .github/ -name "*.yml" -exec grep -l "test\|jest\|cargo test\|go test" {} \; 2>/dev/null

7.2: Pipeline Security

# Check for secrets in CI config (should use CI secrets management)
find .github/ .gitlab-ci* -name "*.yml" -exec grep -l "password\|secret\|token\|key" {} \; 2>/dev/null

# Check for pinned action versions (GitHub Actions should use SHA, not @main)
grep -rn "uses:" --include="*.yml" .github/ 2>/dev/null | grep -v "@v\|@[a-f0-9]\{40\}"

# Check for dependency caching
grep -rn "cache\|restore-keys" --include="*.yml" .github/ 2>/dev/null

7.3: Deployment Strategy

Evaluate deployment patterns:

| Check | What to verify | |-------|---------------| | Zero-downtime deploys | Can services be updated without dropping requests? | | Rolling updates | Are containers replaced one at a time? | | Health check gating | Does deployment wait for health checks before routing traffic? | | Rollback speed | How quickly can we revert to the previous version? | | Blue-green / canary | Any advanced deployment patterns in use? | | Database migration safety | Are migrations run before or after service update? Rollback plan? |

SECTION 8: COST & RESOURCE OPTIMIZATION

8.1: Resource Sizing

# Check Hetzner server types
grep -rn "server_type" --include="*.tf" infrastructure/terraform/

# Check container resource limits
grep -rn "mem_limit\|memory:\|cpus:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check if services are right-sized (not over/under-provisioned)
# Compare resource limits against actual usage patterns

8.2: Cost Assessment

Research current pricing:

WebSearch: "Hetzner cloud pricing [server type] 2026"
WebSearch: "Hetzner vs alternative VPS cost comparison 2025"

Evaluate:

Are we using the right server types for our workload?
Could we consolidate services to fewer servers?
Are there unused resources (servers, volumes, IPs) still being billed?
Is the backup storage cost-effective?

SECTION 9: GENERATE REVIEW REPORT

Compile all findings into a structured report. Save to docs/reports/devops-review-YYYY-MM-DD.md:

# DevOps Infrastructure Review Report

**Date:** [DATE]
**Reviewer:** DevOps Review Agent
**Scope:** [Full / Component-specific]
**Branch:** [current branch]
**Commit:** [HEAD commit hash]

---

## Executive Summary

[2-3 sentences: Overall infrastructure health. What's the biggest risk?]

**Overall Risk Level:** CRITICAL / HIGH / MEDIUM / LOW

---

## Findings Summary

| Severity | Count | Top Categories |
|----------|-------|----------------|
| CRITICAL | X | [categories] |
| HIGH     | X | [categories] |
| MEDIUM   | X | [categories] |
| LOW      | X | [categories] |
| INFO     | X | [categories] |

---

## Critical & High Findings (Immediate Action Required)

### [FINDING-001] [Title]
- **Severity:** CRITICAL/HIGH
- **Category:** Security/Reliability/...
- **Location:** `path/to/file:line`
- **Description:** [What's wrong]
- **Evidence:** [Exact config/output showing the issue]
- **Impact:** [What happens if this isn't fixed]
- **Recommendation:** [Specific fix]
- **Effort:** [Small/Medium/Large]

---

## Infrastructure Component Assessment

### Terraform
| Aspect | Status | Notes |
|--------|--------|-------|
| State management | GOOD/NEEDS WORK/CRITICAL | ... |
| Secrets handling | GOOD/NEEDS WORK/CRITICAL | ... |
| Resource config | GOOD/NEEDS WORK/CRITICAL | ... |
| Code quality | GOOD/NEEDS WORK/CRITICAL | ... |
| Drift detection | GOOD/NEEDS WORK/CRITICAL | ... |

### Docker/Compose
| Aspect | Status | Notes |
|--------|--------|-------|
| Image security | GOOD/NEEDS WORK/CRITICAL | ... |
| Runtime hardening | GOOD/NEEDS WORK/CRITICAL | ... |
| Networking | GOOD/NEEDS WORK/CRITICAL | ... |
| Secrets management | GOOD/NEEDS WORK/CRITICAL | ... |
| Resource limits | GOOD/NEEDS WORK/CRITICAL | ... |

### Coolify
| Aspect | Status | Notes |
|--------|--------|-------|
| Deployment config | GOOD/NEEDS WORK/CRITICAL | ... |
| SSL/TLS | GOOD/NEEDS WORK/CRITICAL | ... |
| Backup config | GOOD/NEEDS WORK/CRITICAL | ... |
| Update strategy | GOOD/NEEDS WORK/CRITICAL | ... |

### Networking & Firewall
| Aspect | Status | Notes |
|--------|--------|-------|
| Firewall rules | GOOD/NEEDS WORK/CRITICAL | ... |
| Service exposure | GOOD/NEEDS WORK/CRITICAL | ... |
| TLS configuration | GOOD/NEEDS WORK/CRITICAL | ... |
| Internal comms | GOOD/NEEDS WORK/CRITICAL | ... |

### Backup & DR
| Aspect | Status | Notes |
|--------|--------|-------|
| Database backups | GOOD/NEEDS WORK/CRITICAL | ... |
| Backup testing | GOOD/NEEDS WORK/CRITICAL | ... |
| Point-in-time recovery | GOOD/NEEDS WORK/CRITICAL | ... |
| DR documentation | GOOD/NEEDS WORK/CRITICAL | ... |

### Monitoring
| Aspect | Status | Notes |
|--------|--------|-------|
| Logging | GOOD/NEEDS WORK/CRITICAL | ... |
| Metrics | GOOD/NEEDS WORK/CRITICAL | ... |
| Alerting | GOOD/NEEDS WORK/CRITICAL | ... |

---

## Recommendations (Priority Order)

### Must Fix Before Production
1. [Most critical finding with specific action]
2. ...

### Should Fix Soon
1. [Important improvements]
2. ...

### Nice to Have
1. [Quality of life improvements]
2. ...

---

## Next Review

**Recommended in:** [X days/weeks]
**Focus areas:** [What to check next time]
**Triggered by:** [What events should trigger an immediate re-review]

REVIEW PRINCIPLES

Evidence over opinion. Every finding must reference a specific file, line, or command output.
Severity must be justified. CRITICAL means "exploitable in production or causes data loss." HIGH means "significant risk that needs fixing before next release." Don't inflate.
Recommendations must be actionable. "Improve Docker security" is useless. "Add read_only: true and security_opt: [no-new-privileges:true] to the postgres service in infrastructure/docker/docker-compose.yml" is useful.
Compare to standards. Use CIS Docker Benchmark, Terraform best practices, OWASP, and provider-specific guidance as baselines.
Acknowledge what's done well. Note strong infrastructure patterns that should be replicated.
Research when uncertain. If you're unsure whether a configuration is secure or optimal, use WebSearch to find the current consensus before making a judgment.
Never trust cached knowledge for versions and CVEs. ALWAYS search for the current state of any infrastructure tool or vulnerability.
Cite your sources. Every research-based finding must include the URL.
Investigate before recommending. Before recommending an infrastructure change (e.g., "migrate to Kubernetes"), research the migration cost, operational complexity, and team skillset requirements.
Use sub-agents for thorough reviews. A full DevOps review should spawn parallel research agents (see Section 0.4) to investigate CVEs, best practices, and tool comparisons concurrently.

REFERENCE STANDARDS

| Standard | URL | Covers | |----------|-----|--------| | CIS Docker Benchmark | cisecurity.org/benchmark/docker | Container hardening | | OWASP Docker Security | cheatsheetseries.owasp.org/cheatsheets/Docker_Security_Cheat_Sheet.html | Container security | | Terraform Security — HashiCorp | hashicorp.com/blog/terraform-security-5-foundational-practices | IaC security | | Terraform Security — Spacelift | spacelift.io/blog/terraform-security | IaC audit checklist | | Hetzner Cloud Docs | docs.hetzner.com/cloud/ | Provider-specific guidance | | Hetzner Firewall Docs | docs.hetzner.com/cloud/firewalls/ | Firewall configuration | | Coolify Docs | coolify.io/docs/ | Deployment platform | | Mozilla SSL Config | ssl-config.mozilla.org | TLS best practices | | PostgreSQL Backup Docs | postgresql.org/docs/current/backup.html | Database backup strategy | | pgBackRest Docs | pgbackrest.org | Enterprise PostgreSQL backup | | Docker Compose Docs | docs.docker.com/compose/ | Compose configuration | | 12 Factor App | 12factor.net | Service deployment principles |

DevOps Infrastructure Review Agent

When to invoke: After infrastructure changes, before production deployment, when adding new services, or whenever an infrastructure assessment is needed.

Execution Model

This skill MUST be run as a sub-agent using the Agent tool. This ensures:

It does NOT consume the main conversation's context window
It has its own full context for deep analysis and research
It can spawn further sub-agents for parallel research (Section 0)
Only the final report/summary returns to the caller

How to invoke from the main conversation:

Agent tool → prompt: "You are a DevOps Infrastructure Review Agent. Read and follow ALL
instructions in .claude/skills/devops-review/SKILL.md. Project root: [cwd].
Review scope: [arguments]. Execute the review. Save the report to
docs/reports/devops-review-[DATE].md. Return an executive summary with top findings."

The calling conversation receives only the executive summary. The full detailed report is saved to disk for the team to review.

Arguments

No args: Full review (all sections, all infrastructure)
security: Infrastructure security review only
terraform: Terraform-focused review only
docker: Docker/Compose hardening review only
coolify: Coolify deployment assessment only
networking: Network topology and firewall review only
backups: Backup and disaster recovery review only
service <name>: Infrastructure review for a single service (e.g., service ai-api)
quick: High-level scan — top findings per category, skip deep file analysis
diff: Review only infrastructure files changed since last commit/merge
report: Generate formal report in docs/reports/

REVIEW FRAMEWORK

Execute each section in order. For each finding, assign:

Severity: CRITICAL / HIGH / MEDIUM / LOW / INFO
Category: Security / Reliability / Performance / Cost / Maintainability / Compliance
Evidence: Exact file path and line number, or command output
Recommendation: Specific, actionable fix — not vague advice

SECTION 0: INDEPENDENT RESEARCH PROTOCOL

0.1: When to Research

You MUST perform web research in these situations:

Infrastructure tool versions — Before judging any tool version (Terraform, Docker, Coolify, PostgreSQL, Redis, Nginx), search for its current stable version, known CVEs, and end-of-life dates.
Cloud provider changes — Hetzner, Cloudflare, and other providers update features, pricing, and security controls. Research current state before making recommendations about provider-specific configurations.
Container base image vulnerabilities — When reviewing Dockerfiles, research whether the specific base images have known CVEs. Check Docker Hub advisories, Chainguard, and Trivy databases.
Terraform provider updates — Terraform providers release breaking changes and security fixes. Research current provider versions and any deprecation notices.
Deployment platform limitations — Coolify evolves rapidly. Research current Coolify version capabilities, known limitations, and community-reported issues before recommending for/against specific features.
Backup tool comparison — When evaluating backup strategy, research current best-of-breed tools (pgBackRest, Barman, WAL-G) and their production track records.
Networking and TLS standards — TLS versions, cipher suites, and header security best practices evolve. Research current recommendations from Mozilla, Cloudflare, and OWASP.

0.2: How to Research

Use the same structured approach as the CTO Review Agent:

Step 1: Define the question precisely
  "Does Coolify v4 support Docker Compose secrets natively?"
  NOT "Is Coolify good for secrets?"

Step 2: Search authoritative sources first
  WebSearch: "[tool] [version] security advisory [year]"
  WebSearch: "[tool] production issues [year]"
  WebSearch: "[tool] vs [alternative] [year]"

Step 3: Cross-reference with official docs
  WebFetch: Official documentation, GitHub releases, changelogs
  WebFetch: Provider-specific security bulletins

Step 4: Document with source
  Every research-based finding must include:
  - The source URL
  - Date of the information
  - Relevance to our specific infrastructure

0.3: Proactive Research Investigations

For each review, conduct AT MINIMUM:

# 1. Check infrastructure tool versions and CVEs
WebSearch: "Terraform [version] CVE 2025 2026"
WebSearch: "Docker [version] security vulnerability 2025 2026"
WebSearch: "Coolify security issues 2025 2026"
WebSearch: "Hetzner cloud security incident 2025 2026"

# 2. Check base image vulnerabilities
WebSearch: "node:22-alpine Docker vulnerability"
WebSearch: "postgres:16 Docker image CVE"
WebSearch: "redis:7-alpine security advisory"

# 3. Validate deployment patterns
WebSearch: "Coolify production deployment best practices [year]"
WebSearch: "Docker Compose production vs Kubernetes at scale [year]"
WebSearch: "Hetzner Terraform provider latest version"

# 4. Check backup and DR tools
WebSearch: "pgBackRest vs pg_dump production PostgreSQL [year]"
WebSearch: "PostgreSQL point-in-time recovery setup guide"

0.4: Spawn Sub-Agents for Parallel Research

For full reviews, use the Agent tool to parallelize research-heavy tasks:

Agent 1: "Research current CVEs and security advisories for Docker Engine,
          Docker Compose, Terraform, Coolify v4, and Hetzner Cloud.
          Report each with severity and whether our versions are affected."

Agent 2: "Research current best practices for Docker Compose in production
          2025-2026. Focus on: secrets management, networking isolation,
          health checks, resource limits, logging drivers, and restart policies."

Agent 3: "Research Hetzner Cloud firewall configuration best practices,
          Cloudflare integration security, and TLS/SSL certificate management
          for self-hosted platforms. Include current Mozilla SSL config recommendations."

Agent 4: "Research PostgreSQL backup strategies for production in 2025-2026.
          Compare pgBackRest, Barman, WAL-G, and pg_dump. Recommend strategy
          for a 5-service platform with ~50GB database."

SECTION 1: TERRAFORM AUDIT

1.1: State Management

# Check for remote state configuration
grep -rn "backend\s" --include="*.tf" infrastructure/terraform/
grep -rn "terraform {" -A 10 --include="*.tf" infrastructure/terraform/

# Check for state encryption
grep -rn "encrypt\s*=" --include="*.tf" infrastructure/terraform/

# Check for state locking
grep -rn "dynamodb_table\|lock\s*=" --include="*.tf" infrastructure/terraform/

# CRITICAL: Check if .tfstate files are committed to git
git ls-files --cached | grep -i "\.tfstate"
find infrastructure/terraform/ -name "*.tfstate" -o -name "*.tfstate.backup"

1.2: Secrets in Terraform

# Check for hardcoded secrets in .tf files
grep -rn "password\s*=\s*\"" --include="*.tf" infrastructure/terraform/
grep -rn "secret\s*=\s*\"" --include="*.tf" infrastructure/terraform/
grep -rn "token\s*=\s*\"" --include="*.tf" infrastructure/terraform/
grep -rn "api_key\s*=\s*\"" --include="*.tf" infrastructure/terraform/

# Check for sensitive variables marked properly
grep -rn "sensitive\s*=\s*true" --include="*.tf" infrastructure/terraform/

# Check for .tfvars files in git
git ls-files --cached | grep -i "\.tfvars$" | grep -v "\.example"

# Check gitignore covers terraform secrets
grep -i "tfvars\|tfstate" .gitignore

1.3: Resource Configuration

# Check Hetzner server configurations
grep -rn "hcloud_server\|server_type" --include="*.tf" infrastructure/terraform/

# Check firewall rules — should be restrictive
grep -rn "hcloud_firewall" -A 20 --include="*.tf" infrastructure/terraform/

# Check for open 0.0.0.0/0 ingress rules (dangerous)
grep -rn "0\.0\.0\.0/0\|::/0" --include="*.tf" infrastructure/terraform/

# Check network configuration
grep -rn "hcloud_network\|subnet" --include="*.tf" infrastructure/terraform/

# Check SSH key management
grep -rn "hcloud_ssh_key\|ssh_keys" --include="*.tf" infrastructure/terraform/

1.4: Terraform Code Quality

# Check for pinned provider versions (not using >= or ~>)
grep -rn "required_providers" -A 20 --include="*.tf" infrastructure/terraform/

# Check for module versioning
grep -rn "source\s*=" --include="*.tf" infrastructure/terraform/ | grep -v "\./"

# Check for resource naming conventions (consistent naming)
grep -rn "resource\s" --include="*.tf" infrastructure/terraform/ | head -20

# Check for outputs (should expose useful values)
grep -rn "output\s" --include="*.tf" infrastructure/terraform/

# Check for variable descriptions (documentation)
grep -rn "variable\s" -A 5 --include="*.tf" infrastructure/terraform/ | grep -E "variable|description"

# Run terraform validate if available
if command -v terraform &>/dev/null; then
  cd infrastructure/terraform && terraform validate 2>&1 || echo "terraform validate failed"
fi

# Check for tfsec/checkov/tflint configuration
find infrastructure/terraform/ -name ".tfsec*" -o -name ".checkov*" -o -name ".tflint*"

1.5: Terraform Drift Detection

# Check when terraform was last applied
find infrastructure/terraform/ -name "*.tfstate" -exec stat -c '%Y %n' {} \; 2>/dev/null || \
  find infrastructure/terraform/ -name "*.tfstate" -exec stat -f '%m %N' {} \; 2>/dev/null

# Check for plan files (should not be committed)
find infrastructure/terraform/ -name "*.tfplan" -o -name "plan.out"
git ls-files --cached | grep -i "\.tfplan\|plan\.out"

SECTION 2: DOCKER & COMPOSE HARDENING

2.1: Dockerfile Security

# Check ALL Dockerfiles
find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v .git

# For each Dockerfile, check:

# Base image pinning (should use specific versions, not :latest)
grep -rn "^FROM " --include="Dockerfile*" . | grep -v node_modules

# Non-root user (should have USER directive)
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v .git); do
  if ! grep -q "^USER " "$df"; then
    echo "MISSING USER directive: $df"
  fi
done

# Multi-stage builds (should separate build from runtime)
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v .git); do
  STAGES=$(grep -c "^FROM " "$df")
  if [ "$STAGES" -lt 2 ]; then
    echo "Single-stage build (review needed): $df"
  fi
done

# .dockerignore files (should exist alongside Dockerfiles)
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v .git); do
  DIR=$(dirname "$df")
  if [ ! -f "$DIR/.dockerignore" ]; then
    echo "MISSING .dockerignore: $DIR"
  fi
done

# Check for COPY of sensitive files
grep -rn "COPY.*\.env\|COPY.*\.pem\|COPY.*\.key\|COPY.*secret" --include="Dockerfile*" . | grep -v node_modules

# Check for apt-get/apk without cleanup
grep -rn "apt-get install\|apk add" --include="Dockerfile*" . | grep -v node_modules
# Should be paired with: && rm -rf /var/lib/apt/lists/* or --no-cache

# Check for HEALTHCHECK directives
for df in $(find . -name "Dockerfile*" | grep -v node_modules | grep -v target | grep -v .git); do
  if ! grep -q "HEALTHCHECK" "$df"; then
    echo "MISSING HEALTHCHECK: $df"
  fi
done

2.2: Docker Compose Security

# Find all compose files
find infrastructure/ -name "docker-compose*" -o -name "compose*" | grep -v node_modules

# Check for privileged mode (CRITICAL — should NEVER be used)
grep -rn "privileged:\s*true" --include="*.yml" --include="*.yaml" infrastructure/

# Check for dangerous capabilities
grep -rn "cap_add\|SYS_ADMIN\|NET_ADMIN\|SYS_PTRACE" --include="*.yml" --include="*.yaml" infrastructure/

# Check for resource limits (prevent runaway containers)
grep -rn "mem_limit\|memory:\|cpus:\|cpu_shares\|deploy:" --include="*.yml" --include="*.yaml" infrastructure/

# Check for read-only rootfs
grep -rn "read_only:\s*true" --include="*.yml" --include="*.yaml" infrastructure/

# Check for no-new-privileges
grep -rn "no-new-privileges\|security_opt" --include="*.yml" --include="*.yaml" infrastructure/

# Check for restart policies (should be defined for all services)
grep -rn "restart:" --include="*.yml" --include="*.yaml" infrastructure/

# Check for health checks in compose
grep -rn "healthcheck:" --include="*.yml" --include="*.yaml" infrastructure/

# Check for logging configuration
grep -rn "logging:" --include="*.yml" --include="*.yaml" infrastructure/

2.3: Docker Networking

# Check network definitions
grep -rn "networks:" -A 10 --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for exposed ports (minimize exposure)
grep -rn "ports:" -A 5 --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check if services bind to 0.0.0.0 (should bind to specific IPs in production)
grep -rn "0\.0\.0\.0:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for host networking mode (dangerous in production)
grep -rn "network_mode:\s*host" --include="*.yml" --include="*.yaml" infrastructure/

# Check internal-only networks (databases should NOT be externally accessible)
grep -rn "internal:\s*true" --include="*.yml" --include="*.yaml" infrastructure/docker/

2.4: Docker Secrets & Environment

# Check for secrets in compose (should use Docker secrets, not env vars for sensitive data)
grep -rn "secrets:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for env_file references
grep -rn "env_file:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for inline environment variables with sensitive values
grep -rn "environment:" -A 20 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -i "password\|secret\|key\|token"

# Verify .env file is gitignored
git ls-files --cached | grep "infrastructure/docker/\.env$"

# Check for hardcoded credentials in compose
grep -rn "POSTGRES_PASSWORD:\|REDIS_PASSWORD:" --include="*.yml" --include="*.yaml" infrastructure/docker/

2.5: Docker Image Maintenance

# Check for image scanning configuration
find . -name ".trivyignore" -o -name "trivy*" -o -name ".snyk" | grep -v node_modules

# Check image sizes (bloated images = larger attack surface)
# List all images referenced in compose files
grep -rn "image:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for automatic image updates / pull policies
grep -rn "pull_policy" --include="*.yml" --include="*.yaml" infrastructure/docker/

SECTION 3: COOLIFY DEPLOYMENT ASSESSMENT

3.1: Coolify Configuration

# Check Coolify service configs
ls -la infrastructure/coolify/ 2>/dev/null || echo "No Coolify config directory"

# Check for Coolify-specific files
find infrastructure/coolify/ -type f 2>/dev/null

# Check environment-specific deployment configs
find infrastructure/ -name "*.live.*" -o -name "*.prod.*" -o -name "*.production.*" | grep -v node_modules

3.2: Coolify Best Practices Assessment

Evaluate against these Coolify production requirements:

Research current Coolify capabilities:

WebSearch: "Coolify v4 production deployment checklist"
WebSearch: "Coolify backup configuration guide"
WebSearch: "Coolify SSL certificate management"

SECTION 4: NETWORKING & FIREWALL

4.1: Hetzner Firewall Rules

# Check Terraform firewall definitions
grep -rn "hcloud_firewall" -A 30 --include="*.tf" infrastructure/terraform/

# Check for overly permissive rules
grep -rn "0\.0\.0\.0/0" --include="*.tf" infrastructure/terraform/ | grep -v "# allowed"

# Check SSH access rules (should be restricted to specific IPs)
grep -rn "22\|ssh" --include="*.tf" infrastructure/terraform/

4.2: Service Exposure Assessment

Map which services are exposed and whether they should be:

# Check which ports are exposed to host in Docker
grep -rn "ports:" -A 5 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -E "[0-9]+:[0-9]+"

# Check for published ports that should be internal only
# Databases should use 'expose:' not 'ports:'
grep -rn "5432\|6379\|3306" --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep "ports:"

4.3: Cloudflare Integration

# Check for Cloudflare configuration
find infrastructure/ -name "*cloudflare*" -o -name "*cf-*" | grep -v node_modules

# Check Terraform Cloudflare resources
grep -rn "cloudflare_" --include="*.tf" infrastructure/terraform/

# Check for Cloudflare proxy mode (orange cloud)
grep -rn "proxied\s*=" --include="*.tf" infrastructure/terraform/

# Check for WAF/security rules
grep -rn "cloudflare_ruleset\|cloudflare_filter\|cloudflare_firewall" --include="*.tf" infrastructure/terraform/

4.4: TLS/SSL Configuration

# Check for TLS configuration in services
grep -rn "ssl\|tls\|https\|certificate\|cert\|letsencrypt" --include="*.yml" --include="*.yaml" --include="*.tf" --include="*.conf" infrastructure/

# Check for HTTP to HTTPS redirect
grep -rn "redirect.*https\|force_ssl\|hsts" --include="*.conf" --include="*.yaml" --include="*.yml" infrastructure/

# Check for minimum TLS version (should be 1.2+)
grep -rn "tls_min_version\|ssl_protocols\|min_version.*tls" --include="*.conf" --include="*.yaml" --include="*.tf" infrastructure/

4.5: Internal Service Communication

# Check if inter-service communication uses TLS (service mesh / mTLS)
grep -rn "FORMULA_API_URL\|FLOW_TRIGGER_URL\|AI_API_URL" infrastructure/docker/.env* 2>/dev/null
# Internal URLs should use http:// (within Docker network) or https:// if exposed

# Check for service discovery mechanism
grep -rn "depends_on:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for Docker DNS resolution (services should reference by name, not IP)
grep -rn "http://[0-9]\+\." --include="*.yml" --include="*.yaml" --include="*.env*" infrastructure/

SECTION 5: BACKUP & DISASTER RECOVERY

5.1: Database Backup Strategy

# Check for backup configuration
ls -la infrastructure/db-snapshots/ 2>/dev/null
cat infrastructure/db-snapshots/README.md 2>/dev/null

# Check for automated backup scripts
find scripts/ infrastructure/ -name "*backup*" -o -name "*dump*" -o -name "*snapshot*" | grep -v node_modules

# Check for pg_dump/pgBackRest/Barman configuration
grep -rn "pg_dump\|pgbackrest\|barman\|wal-g" --include="*.sh" --include="*.yml" --include="*.yaml" --include="*.conf" scripts/ infrastructure/

# Check for WAL archiving (point-in-time recovery)
grep -rn "archive_mode\|archive_command\|wal_level" --include="*.conf" --include="*.yaml" --include="*.yml" --include="*.tf" infrastructure/

# Check backup retention policy
grep -rn "retention\|rotate\|keep\|copies" --include="*.sh" --include="*.yaml" --include="*.yml" scripts/ infrastructure/

5.2: Backup Assessment Matrix

Evaluate against production standards:

Research current best practices:

WebSearch: "PostgreSQL backup strategy production 2025 pgBackRest vs pg_dump"
WebSearch: "PostgreSQL WAL archiving setup guide Docker"
WebSearch: "3-2-1 backup rule PostgreSQL implementation"

5.3: Redis Backup Strategy

# Check Redis persistence configuration
grep -rn "appendonly\|save\s\|rdb\|aof" --include="*.conf" --include="*.yaml" --include="*.yml" infrastructure/

# Check for Redis backup in compose volumes
grep -rn "redis" -A 10 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep "volumes:"

5.4: Disaster Recovery Plan

# Check for DR documentation
find docs/ -name "*disaster*" -o -name "*recovery*" -o -name "*incident*" -o -name "*runbook*"

# Check for migration rollback scripts
find migrations/ -name "*rollback*" -o -name "*down*" -o -name "*revert*"

# Check for infrastructure recreation capability (IaC completeness)
# Can we rebuild the entire infrastructure from code?
ls infrastructure/terraform/*.tf 2>/dev/null | wc -l

SECTION 6: MONITORING & OBSERVABILITY

6.1: Logging

# Check for centralized logging configuration
grep -rn "logging:" -A 5 --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check log drivers (should not be json-file for production)
grep -rn "driver:" --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -i log

# Check for log rotation
grep -rn "max-size\|max-file\|rotate" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for structured logging in services
grep -rn "pino\|winston\|bunyan\|structured\|json.*log" --include="*.js" --include="*.ts" services/ | grep -v node_modules | head -10

6.2: Metrics & Alerting

# Check for Prometheus/Grafana/monitoring setup
find infrastructure/ -name "*prometheus*" -o -name "*grafana*" -o -name "*monitor*" -o -name "*alert*" | grep -v node_modules

# Check for health check endpoints
grep -rn "healthcheck:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check for uptime monitoring
find infrastructure/ -name "*uptime*" -o -name "*status*" -o -name "*ping*" | grep -v node_modules

# Check for alerting configuration (PagerDuty, Slack, email)
grep -rn "alert\|notify\|webhook\|slack\|pagerduty" --include="*.yml" --include="*.yaml" --include="*.tf" infrastructure/

6.3: Resource Monitoring

# Check for disk space monitoring
grep -rn "disk\|storage\|volume" --include="*.yml" --include="*.yaml" --include="*.tf" infrastructure/ | grep -i "monitor\|alert\|limit"

# Check Docker volume configuration (persistent data must survive restarts)
grep -rn "volumes:" -A 10 --include="*.yml" --include="*.yaml" infrastructure/docker/ | grep -v "^--$"

SECTION 7: CI/CD PIPELINE

7.1: Pipeline Configuration

# Check for CI/CD configuration
find . -name ".github" -type d -o -name ".gitlab-ci*" -o -name "Jenkinsfile" -o -name ".circleci" -type d -o -name "bitbucket-pipelines*" | head -10

# Check for deployment scripts
find scripts/ -name "*deploy*" -o -name "*release*" | grep -v node_modules

# Check for automated testing in pipeline
find .github/ -name "*.yml" -exec grep -l "test\|jest\|cargo test\|go test" {} \; 2>/dev/null

7.2: Pipeline Security

# Check for secrets in CI config (should use CI secrets management)
find .github/ .gitlab-ci* -name "*.yml" -exec grep -l "password\|secret\|token\|key" {} \; 2>/dev/null

# Check for pinned action versions (GitHub Actions should use SHA, not @main)
grep -rn "uses:" --include="*.yml" .github/ 2>/dev/null | grep -v "@v\|@[a-f0-9]\{40\}"

# Check for dependency caching
grep -rn "cache\|restore-keys" --include="*.yml" .github/ 2>/dev/null

7.3: Deployment Strategy

Evaluate deployment patterns:

SECTION 8: COST & RESOURCE OPTIMIZATION

8.1: Resource Sizing

# Check Hetzner server types
grep -rn "server_type" --include="*.tf" infrastructure/terraform/

# Check container resource limits
grep -rn "mem_limit\|memory:\|cpus:" --include="*.yml" --include="*.yaml" infrastructure/docker/

# Check if services are right-sized (not over/under-provisioned)
# Compare resource limits against actual usage patterns

8.2: Cost Assessment

Research current pricing:

WebSearch: "Hetzner cloud pricing [server type] 2026"
WebSearch: "Hetzner vs alternative VPS cost comparison 2025"

Evaluate:

Are we using the right server types for our workload?
Could we consolidate services to fewer servers?
Are there unused resources (servers, volumes, IPs) still being billed?
Is the backup storage cost-effective?

SECTION 9: GENERATE REVIEW REPORT

Compile all findings into a structured report. Save to docs/reports/devops-review-YYYY-MM-DD.md:

# DevOps Infrastructure Review Report

**Date:** [DATE]
**Reviewer:** DevOps Review Agent
**Scope:** [Full / Component-specific]
**Branch:** [current branch]
**Commit:** [HEAD commit hash]

---

## Executive Summary

[2-3 sentences: Overall infrastructure health. What's the biggest risk?]

**Overall Risk Level:** CRITICAL / HIGH / MEDIUM / LOW

---

## Findings Summary

| Severity | Count | Top Categories |
|----------|-------|----------------|
| CRITICAL | X | [categories] |
| HIGH     | X | [categories] |
| MEDIUM   | X | [categories] |
| LOW      | X | [categories] |
| INFO     | X | [categories] |

---

## Critical & High Findings (Immediate Action Required)

### [FINDING-001] [Title]
- **Severity:** CRITICAL/HIGH
- **Category:** Security/Reliability/...
- **Location:** `path/to/file:line`
- **Description:** [What's wrong]
- **Evidence:** [Exact config/output showing the issue]
- **Impact:** [What happens if this isn't fixed]
- **Recommendation:** [Specific fix]
- **Effort:** [Small/Medium/Large]

---

## Infrastructure Component Assessment

### Terraform
| Aspect | Status | Notes |
|--------|--------|-------|
| State management | GOOD/NEEDS WORK/CRITICAL | ... |
| Secrets handling | GOOD/NEEDS WORK/CRITICAL | ... |
| Resource config | GOOD/NEEDS WORK/CRITICAL | ... |
| Code quality | GOOD/NEEDS WORK/CRITICAL | ... |
| Drift detection | GOOD/NEEDS WORK/CRITICAL | ... |

### Docker/Compose
| Aspect | Status | Notes |
|--------|--------|-------|
| Image security | GOOD/NEEDS WORK/CRITICAL | ... |
| Runtime hardening | GOOD/NEEDS WORK/CRITICAL | ... |
| Networking | GOOD/NEEDS WORK/CRITICAL | ... |
| Secrets management | GOOD/NEEDS WORK/CRITICAL | ... |
| Resource limits | GOOD/NEEDS WORK/CRITICAL | ... |

### Coolify
| Aspect | Status | Notes |
|--------|--------|-------|
| Deployment config | GOOD/NEEDS WORK/CRITICAL | ... |
| SSL/TLS | GOOD/NEEDS WORK/CRITICAL | ... |
| Backup config | GOOD/NEEDS WORK/CRITICAL | ... |
| Update strategy | GOOD/NEEDS WORK/CRITICAL | ... |

### Networking & Firewall
| Aspect | Status | Notes |
|--------|--------|-------|
| Firewall rules | GOOD/NEEDS WORK/CRITICAL | ... |
| Service exposure | GOOD/NEEDS WORK/CRITICAL | ... |
| TLS configuration | GOOD/NEEDS WORK/CRITICAL | ... |
| Internal comms | GOOD/NEEDS WORK/CRITICAL | ... |

### Backup & DR
| Aspect | Status | Notes |
|--------|--------|-------|
| Database backups | GOOD/NEEDS WORK/CRITICAL | ... |
| Backup testing | GOOD/NEEDS WORK/CRITICAL | ... |
| Point-in-time recovery | GOOD/NEEDS WORK/CRITICAL | ... |
| DR documentation | GOOD/NEEDS WORK/CRITICAL | ... |

### Monitoring
| Aspect | Status | Notes |
|--------|--------|-------|
| Logging | GOOD/NEEDS WORK/CRITICAL | ... |
| Metrics | GOOD/NEEDS WORK/CRITICAL | ... |
| Alerting | GOOD/NEEDS WORK/CRITICAL | ... |

---

## Recommendations (Priority Order)

### Must Fix Before Production
1. [Most critical finding with specific action]
2. ...

### Should Fix Soon
1. [Important improvements]
2. ...

### Nice to Have
1. [Quality of life improvements]
2. ...

---

## Next Review

**Recommended in:** [X days/weeks]
**Focus areas:** [What to check next time]
**Triggered by:** [What events should trigger an immediate re-review]

REVIEW PRINCIPLES

Evidence over opinion. Every finding must reference a specific file, line, or command output.
Severity must be justified. CRITICAL means "exploitable in production or causes data loss." HIGH means "significant risk that needs fixing before next release." Don't inflate.
Recommendations must be actionable. "Improve Docker security" is useless. "Add read_only: true and security_opt: [no-new-privileges:true] to the postgres service in infrastructure/docker/docker-compose.yml" is useful.
Compare to standards. Use CIS Docker Benchmark, Terraform best practices, OWASP, and provider-specific guidance as baselines.
Acknowledge what's done well. Note strong infrastructure patterns that should be replicated.
Research when uncertain. If you're unsure whether a configuration is secure or optimal, use WebSearch to find the current consensus before making a judgment.
Never trust cached knowledge for versions and CVEs. ALWAYS search for the current state of any infrastructure tool or vulnerability.
Cite your sources. Every research-based finding must include the URL.
Investigate before recommending. Before recommending an infrastructure change (e.g., "migrate to Kubernetes"), research the migration cost, operational complexity, and team skillset requirements.
Use sub-agents for thorough reviews. A full DevOps review should spawn parallel research agents (see Section 0.4) to investigate CVEs, best practices, and tool comparisons concurrently.

Adoption

Coignite-ApS/devops-review

$ install --global

Security Scan Results

SKILL.md

DevOps Infrastructure Review Agent

Execution Model

Arguments

REVIEW FRAMEWORK

SECTION 0: INDEPENDENT RESEARCH PROTOCOL

0.1: When to Research

0.2: How to Research

0.3: Proactive Research Investigations

0.4: Spawn Sub-Agents for Parallel Research

SECTION 1: TERRAFORM AUDIT

1.1: State Management

1.2: Secrets in Terraform

1.3: Resource Configuration

1.4: Terraform Code Quality

1.5: Terraform Drift Detection

SECTION 2: DOCKER & COMPOSE HARDENING

2.1: Dockerfile Security

2.2: Docker Compose Security

2.3: Docker Networking

2.4: Docker Secrets & Environment

2.5: Docker Image Maintenance

SECTION 3: COOLIFY DEPLOYMENT ASSESSMENT

3.1: Coolify Configuration

3.2: Coolify Best Practices Assessment

SECTION 4: NETWORKING & FIREWALL

4.1: Hetzner Firewall Rules

4.2: Service Exposure Assessment

4.3: Cloudflare Integration

4.4: TLS/SSL Configuration

4.5: Internal Service Communication

SECTION 5: BACKUP & DISASTER RECOVERY

5.1: Database Backup Strategy

5.2: Backup Assessment Matrix

5.3: Redis Backup Strategy

5.4: Disaster Recovery Plan

SECTION 6: MONITORING & OBSERVABILITY

6.1: Logging

6.2: Metrics & Alerting

6.3: Resource Monitoring

SECTION 7: CI/CD PIPELINE

7.1: Pipeline Configuration

7.2: Pipeline Security

7.3: Deployment Strategy

SECTION 8: COST & RESOURCE OPTIMIZATION

8.1: Resource Sizing

8.2: Cost Assessment

SECTION 9: GENERATE REVIEW REPORT

REVIEW PRINCIPLES

REFERENCE STANDARDS

Related Skills

Coignite-ApS/tasks

Coignite-ApS/project-review

Coignite-ApS/frontend-designer

Coignite-ApS/cto-review

Coignite-ApS/devops-review

$ install --global

Security Scan Results

SKILL.md

DevOps Infrastructure Review Agent

Execution Model

Arguments

REVIEW FRAMEWORK

SECTION 0: INDEPENDENT RESEARCH PROTOCOL

0.1: When to Research

0.2: How to Research

0.3: Proactive Research Investigations

0.4: Spawn Sub-Agents for Parallel Research

SECTION 1: TERRAFORM AUDIT

1.1: State Management

1.2: Secrets in Terraform

1.3: Resource Configuration

1.4: Terraform Code Quality

1.5: Terraform Drift Detection

SECTION 2: DOCKER & COMPOSE HARDENING

2.1: Dockerfile Security