Every Self-Hosted Service Is a Person You Didn't Hire

Self-hosting a database, a cache, a workflow engine, or a Kubernetes cluster is not free. It costs patching, backup verification, incident response at 3 AM, capacity planning, version upgrades, security hardening, and monitoring of the monitor. Each self-hosted service is an invisible full-time job. For a team of five engineers shipping a SaaS product, running your own PostgreSQL is the equivalent of hiring a sixth engineer whose entire job is keeping PostgreSQL alive -- except you do not hire that person, so the work falls on everyone, and nobody does it well.

Managed services trade money for engineering time. For startups and small teams (under 50 engineers), this trade is almost always correct. The cloud bill goes up by hundreds of dollars per month; the engineering team gets back thousands of dollars in reclaimed time. Self-host only when the managed service genuinely cannot meet your requirements -- and document the justification in an ADR.

The Operations Tax

Every self-hosted service carries a recurring operations cost that is invisible until something breaks.

| Operations Task | Managed Service | Self-Hosted |
|-----------------|-----------------|-------------|
| OS/kernel patching | Provider handles it | You schedule downtime, test, apply |
| Version upgrades | One-click or automatic | You test, migrate, rollback-plan, execute |
| Backup & restore | Automated, point-in-time | You configure, verify, test restores quarterly |
| Scaling | Auto-scaling or single API call | You monitor, forecast, provision, rebalance |
| High availability | Built-in multi-AZ/region | You design, implement, test failover |
| Security hardening | Provider hardens, you configure | You harden OS, network, application, and runtime |
| Monitoring | Built-in metrics and logs | You deploy exporters, configure dashboards, set alerts |
| Incident response | Provider's SRE team + your config | Your team, 24/7, for infrastructure AND application |
| Compliance | Provider certifications (SOC2, HIPAA) | You certify the infrastructure yourself |

The effect compounds. One self-hosted service is manageable. Three self-hosted services (database + cache + monitoring stack) can consume 30-50% of a small team's operational capacity. At five self-hosted services, you are an infrastructure company that happens to also build a product.

Container Orchestration: Managed Over Kubernetes

Kubernetes is the most frequently self-hosted service that teams do not need. For teams under 50 engineers running fewer than 20 services, managed container platforms provide the same deployment model (containers, health checks, scaling, load balancing) without the operational overhead of cluster management, node pool sizing, ingress controller configuration, CNI plugin selection, and etcd maintenance.

Decision Framework

| Criterion | Use Managed Containers | Use Kubernetes |
|-----------|----------------------|----------------|
| Team size | Under 50 engineers | 50+ engineers with dedicated platform team |
| Service count | Under 20 services | 20+ services with complex networking |
| GPU workloads | No, or minimal | Heavy GPU scheduling requirements |
| Custom scheduling | Not needed | Custom schedulers, operators, CRDs required |
| Multi-cloud | Not required | Required for portability |
| Service mesh | Not needed | Istio/Linkerd required |
| Compliance | Standard | Requires specific K8s-level audit controls |

Good Pattern vs Bad Pattern

# Good: managed container service for a team of 8 engineers

resource "aws_ecs_service" "myapp" {
  name            = "myapp"
  cluster         = data.terraform_remote_state.compute.outputs.cluster_arn
  task_definition = aws_ecs_task_definition.myapp.arn
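  desired_count   = 2

  # Fargate tasks use awsvpc networking, so the service needs subnets and a
  # security group. The security group reference here is illustrative.
  network_configuration {
    subnets         = var.private_subnet_ids
    security_groups = [aws_security_group.myapp.id]
  }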

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 50  # Increase to 100 for production-critical services
  }
  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 50  # Spot can be interrupted; suitable for dev, use cautiously in prod
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  # Zero-downtime rolling update
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100
}

# Result: no nodes to patch, no cluster upgrades, no CNI plugins,
# no ingress controllers, no etcd backups. Deploy and forget.

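# Bad: Kubernetes (EKS) for the same team of 8. The control plane is managed,
# but node groups, add-ons, and version upgrades are still yours.

resource "aws_eks_cluster" "main" {
  name     = "myapp-cluster"
  role_arn = aws_iam_role.eks.arn
  version  = "1.28"  # Upstream ships a new minor version roughly every 4 months; each has a limited support window

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}

resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "workers"
  node_role_arn   = aws_iam_role.eks_nodes.arn  # Required; role name is illustrative
  subnet_ids      = var.private_subnet_ids      # Required
  instance_types  = ["m5.large"]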
  scaling_config {
    desired_size = 3
    max_size     = 6
    min_size     = 2
  }
  # Now you also need: ingress-nginx, cert-manager, external-dns,
  # metrics-server, cluster-autoscaler, aws-load-balancer-controller,
  # and someone to upgrade all of them every quarter.
}

Workflow Orchestration: Managed Over Self-Hosted

Self-hosted workflow engines (Airflow on EC2/K8s, Temporal self-hosted, Prefect server) require database backends, worker scaling, scheduler high availability, log aggregation, and web UI hosting. Managed workflow services handle all of this.

| Approach | What You Manage | What the Provider Manages |
|----------|-----------------|---------------------------|
| Managed Airflow | DAG code, connections, variables | Scheduler HA, worker scaling, web UI, database, upgrades |
| Self-hosted Airflow | DAG code, connections, variables, scheduler HA, worker scaling, web UI, metadata DB, Redis/Celery, upgrades, monitoring | Nothing |
| Managed step functions | Workflow definitions | Execution, scaling, retry, logging, state persistence |
| Self-hosted Temporal | Workflow code, namespace management, history DB, visibility DB, upgrades, monitoring | Nothing |

The breaking point: self-hosted Airflow is three services (scheduler, webserver, workers), a metadata database, a message broker, and a log storage backend. That is six components to keep alive for a workflow engine that is supposed to keep your other workflows alive.

Do your research first: managed workflow services vary significantly in quality. Sometimes your cloud provider's offering (e.g., MWAA) is the right choice; sometimes a specialized third-party provider (e.g., Astronomer for Airflow) offers a materially better experience. Evaluate both before committing.
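As a rough illustration of the "managed step functions" row above, the sketch below defines a one-step workflow with retries on AWS Step Functions. The state machine name, the IAM role `aws_iam_role.sfn_exec`, and the Lambda function `aws_lambda_function.export` are hypothetical placeholders assumed to be defined elsewhere; treat it as a starting point, not a complete deployment.

```hcl
# Sketch only: the role and Lambda function are assumed to exist elsewhere.
resource "aws_sfn_state_machine" "nightly_export" {
  name     = "myapp-nightly-export"
  role_arn = aws_iam_role.sfn_exec.arn

  # The workflow definition is the only part you maintain. Execution,
  # retries, and state persistence are handled by the service.
  definition = jsonencode({
    Comment = "Run an export task, retrying transient failures"
    StartAt = "Export"
    States = {
      Export = {
        Type     = "Task"
        Resource = aws_lambda_function.export.arn
        Retry = [{
          ErrorEquals     = ["States.TaskFailed"]
          IntervalSeconds = 10
          MaxAttempts     = 3
          BackoffRate     = 2.0
        }]
        End = true
      }
    }
  })
}
```

There is no scheduler, worker pool, metadata database, or message broker in this configuration, which is the point of the comparison with self-hosted Airflow.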

Databases and Caches: Always Managed

There is almost no scenario where a startup or small team should run a self-hosted database or cache in production. The managed service gives you automated backups, point-in-time recovery, failover, patching, and monitoring for a modest premium over the raw compute cost.

# Good: managed database with automated operations

resource "aws_db_instance" "myapp" {
  identifier        = "${module.labels.prefix}myapp-db"
  engine            = "postgres"
  engine_version    = "15.4"
  instance_class    = "db.t4g.medium"
  allocated_storage = 50 # GiB; illustrative size

  # Master credentials are required; let RDS generate the password and
  # store it in Secrets Manager rather than writing it into Terraform.
  username                    = "myapp"
  manage_master_user_password = true

  multi_az                     = true # Automatic failover
  backup_retention_period      = 14   # 14-day point-in-time recovery
  auto_minor_version_upgrade   = true # Security patches applied automatically
  storage_encrypted            = true
  performance_insights_enabled = true # Built-in query monitoring
  deletion_protection          = true
}

# Bad: self-hosted PostgreSQL on an EC2 instance

resource "aws_instance" "postgres" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "m5.large"

  # Now you must:
  # - Install and configure PostgreSQL
  # - Set up streaming replication for HA
  # - Configure automated backups to object storage
  # - Test backup restores quarterly
  # - Apply OS security patches monthly
  # - Apply PostgreSQL patches on your schedule
  # - Monitor replication lag, connections, disk, memory
  # - Handle failover manually or build automation
  # - Manage SSL certificates for connections
  # - None of this is in the Terraform above
}

The same logic applies to caches. A managed Redis/Valkey instance with automatic failover, patching, and backup costs marginally more than the equivalent EC2 instance and saves dozens of hours per quarter in operational toil.
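A minimal sketch of the managed cache described above, using an ElastiCache replication group. The node size, retention period, subnet group, and security group references are assumptions chosen for illustration and would need to match your own network setup.

```hcl
# Sketch only: the subnet group and security group are assumed to exist elsewhere.
resource "aws_elasticache_replication_group" "myapp" {
  replication_group_id = "myapp-cache"
  description          = "Managed Redis for myapp"
  engine               = "redis"
  node_type            = "cache.t4g.medium" # illustrative size

  num_cache_clusters         = 2    # one primary plus one replica
  automatic_failover_enabled = true # promote the replica if the primary fails
  multi_az_enabled           = true # place the nodes in different AZs
  snapshot_retention_limit   = 7    # keep daily snapshots for 7 days
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true

  subnet_group_name  = aws_elasticache_subnet_group.private.name
  security_group_ids = [aws_security_group.cache.id]
}
```

Failover, snapshots, and encryption are each a single attribute here. On a self-hosted Redis instance, each of them is a project.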

Monitoring: Managed Over Self-Hosted

Self-hosted monitoring stacks (Prometheus + Grafana + Alertmanager + Loki) are four services that each need their own storage, scaling, and high availability. When your monitoring is down, you are blind to everything else being down. Managed monitoring services eliminate this circular dependency.

| Component | Self-Hosted | Managed Alternative |
|-----------|-------------|---------------------|
| Metrics collection | Prometheus (+ storage, HA, federation) | Managed Prometheus / cloud metrics |
| Visualization | Grafana (+ database, auth, HA) | Managed Grafana / cloud dashboards |
| Alerting | Alertmanager (+ dedup, routing, HA) | Cloud alerting / managed alert rules |
| Log aggregation | Loki or ELK (+ storage, retention, indexing) | Cloud logging service |

The irony of self-hosted monitoring: the one service that must be available when everything else is failing is the one you built yourself on the same infrastructure that is failing. Managed monitoring runs on the provider's infrastructure, independent of your workloads.
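To make the managed alternative concrete, here is a hedged sketch of a managed Prometheus workspace paired with a managed Grafana workspace on AWS. The workspace names and the IAM role `aws_iam_role.grafana` are hypothetical, and the `AWS_SSO` authentication provider assumes IAM Identity Center is already enabled in the account.

```hcl
# Sketch only: the Grafana IAM role is assumed to exist elsewhere.
resource "aws_prometheus_workspace" "myapp" {
  alias = "myapp-metrics"
}

resource "aws_grafana_workspace" "myapp" {
  name                     = "myapp-dashboards"
  account_access_type      = "CURRENT_ACCOUNT"
  authentication_providers = ["AWS_SSO"]
  permission_type          = "SERVICE_MANAGED"
  role_arn                 = aws_iam_role.grafana.arn
  data_sources             = ["PROMETHEUS"]
}
```

Neither resource declares storage, replicas, or a backing database. Those are the parts of a self-hosted stack that tend to fail along with everything else.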

The Only Valid Exceptions

Self-hosting is justified when -- and only when -- the managed service genuinely cannot meet a hard requirement. Document every exception in an ADR with this structure:

What managed service was evaluated?
What specific requirement does it fail to meet? (Not "it's expensive" -- quantify the cost difference.)
What is the operations plan? (Who patches? Who handles incidents? What is the backup/restore process?)
What are the exit criteria? (When the managed service adds this capability, we migrate back.)
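For the second question, one way to quantify the cost difference is to write the arithmetic down next to the infrastructure code. Every number in the sketch below is a hypothetical placeholder, not a measurement; substitute your own bills, hours, and loaded engineering rate.

```hcl
locals {
  # All inputs are hypothetical; replace them with your own figures.
  managed_cost_usd_per_month     = 450 # managed service bill
  self_hosted_cost_usd_per_month = 150 # raw compute and storage
  ops_hours_per_month            = 12  # patching, restores, on-call, upgrades
  loaded_rate_usd_per_hour       = 100 # fully loaded engineering cost

  managed_premium_usd  = local.managed_cost_usd_per_month - local.self_hosted_cost_usd_per_month
  ops_labor_usd        = local.ops_hours_per_month * local.loaded_rate_usd_per_hour
  self_hosting_net_usd = local.ops_labor_usd - local.managed_premium_usd
}

output "self_hosting_net_cost_usd_per_month" {
  description = "Positive means self-hosting costs more once engineering time is counted"
  value       = local.self_hosting_net_usd
}
```

With these placeholder inputs the managed premium is 300, the labor cost is 1200, and the output is 900. An ADR that argues for self-hosting on cost should show this calculation with real numbers and a negative result.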

Legitimate exceptions (rare)

GPU workloads requiring specific scheduling -- managed containers may not support fractional GPU allocation or custom device plugins. Self-managed nodes with a managed control plane is the compromise.
Regulatory data residency -- the managed service is not available in the required region. Document which region, and re-check availability quarterly.
Extreme performance requirements -- the managed service adds latency that violates SLAs. Prove it with benchmarks, not assumptions.
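For the GPU exception, the compromise of a managed control plane with a node pool you size yourself might look like the sketch below. It assumes the `aws_eks_cluster.main` resource from the earlier example, a hypothetical node role `aws_iam_role.eks_nodes`, and an illustrative instance type. The AMI type shown is one of the GPU-enabled options, so pick whichever your cluster version supports.

```hcl
# Sketch only: the node role is assumed to exist elsewhere.
resource "aws_eks_node_group" "gpu" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "gpu"
  node_role_arn   = aws_iam_role.eks_nodes.arn
  subnet_ids      = var.private_subnet_ids

  ami_type       = "AL2023_x86_64_NVIDIA" # GPU-enabled AMI family
  instance_types = ["g5.xlarge"]          # illustrative GPU instance type

  scaling_config {
    desired_size = 1
    max_size     = 4
    min_size     = 0
  }

  # Keep non-GPU pods off these nodes unless they tolerate the taint.
  taint {
    key    = "nvidia.com/gpu"
    value  = "true"
    effect = "NO_SCHEDULE"
  }
}
```

The control plane stays the provider's responsibility. Only the GPU node pool, and whatever device plugins it needs, becomes part of your operations plan.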

Not legitimate exceptions

"It's cheaper to self-host" -- it is not, once you account for engineering time.
"We need more control" -- control over what, specifically? If you cannot name the exact configuration, you do not need it.
"We already know how to run it" -- knowing how to run PostgreSQL does not mean your team should spend time running it instead of building product features.

The Decision Rule

If your platform team would not accept the operational burden of maintaining it, do not self-host it. Use the managed service -- that is the paved road. Self-hosted Kubernetes needs a dedicated platform engineer. Self-hosted monitoring needs an observability engineer. If those roles do not exist on your team, the managed equivalent is the correct choice.

Cloud Provider Translation

| Concept | AWS | GCP | Azure |
|---------|-----|-----|-------|
| Managed containers (standard) | ECS Fargate | Cloud Run / GKE Autopilot | Container Apps |
| Managed containers (GPU) | ECS with EC2 capacity providers | GKE with GPU node pools | AKS with GPU node pools |
| Managed Kubernetes | EKS (if you must) | GKE Autopilot | AKS |
| Managed PostgreSQL | RDS PostgreSQL / Aurora | Cloud SQL / AlloyDB | Azure Database for PostgreSQL |
| Managed Redis/cache | ElastiCache / MemoryDB | Memorystore | Azure Cache for Redis |
| Managed workflow engine | MWAA (Airflow) / Step Functions | Cloud Composer / Workflows | (no direct Airflow equivalent) / Logic Apps |
| Managed Prometheus | Amazon Managed Prometheus | Cloud Monitoring (built-in) | Azure Monitor (Prometheus) |
| Managed Grafana | Amazon Managed Grafana | Cloud Monitoring dashboards | Azure Managed Grafana |
| Managed log aggregation | CloudWatch Logs | Cloud Logging | Azure Monitor Logs |

Examples

Working implementations in examples/:

examples/managed-container-service.md -- Complete managed container deployment with spot/preemptible capacity, circuit breaker rollback, auto-scaling, and zero-downtime rolling updates -- no cluster management required
examples/managed-data-stack.md -- Production-grade managed database and cache with automated backups, failover, encryption, and monitoring -- contrasted against the self-hosted equivalent to illustrate the operations tax

Review Checklist

When designing or reviewing service hosting decisions:

[ ] Every self-hosted service has a written ADR justifying why the managed alternative was rejected
[ ] Container orchestration uses a managed service (ECS, Cloud Run, Container Apps), not self-managed Kubernetes, unless team size exceeds 50 engineers or GPU scheduling requires it
[ ] Databases are managed services with automated backups, failover, and patching
[ ] Caches are managed services, not self-hosted Redis/Memcached on EC2/GCE
[ ] Workflow orchestration uses a managed service, not self-hosted Airflow/Temporal
[ ] Monitoring uses managed services -- self-hosted monitoring that fails alongside your infrastructure is worse than no monitoring
[ ] No self-hosted service exists solely because "it's cheaper" without accounting for engineering time
[ ] Self-hosted exceptions include an operations plan (patching, backup, incident response) and exit criteria
[ ] GPU workloads use managed control planes with self-managed node pools, not fully self-managed clusters
[ ] The platform team accepts the operational burden for each self-hosted service -- if not, it should be managed
