cloud-foundation-principles/skills/managed-services-over-self-hosted/SKILL.md
This skill should be used when the user is choosing between managed and self-hosted services, deciding whether to run Kubernetes or use managed containers, evaluating self-hosted databases vs managed databases, considering self-hosted monitoring or caches, designing for a small team (under 50 engineers), or justifying a self-hosted exception. Covers the operations tax of self-hosting, managed container orchestration over Kubernetes for small teams, managed workflow engines, managed caches and databases, managed monitoring, and the decision framework for when self-hosting is genuinely justified.
Self-hosting a database, a cache, a workflow engine, or a Kubernetes cluster is not free. It costs patching, backup verification, incident response at 3 AM, capacity planning, version upgrades, security hardening, and monitoring of the monitor. Each self-hosted service is an invisible full-time job. For a team of five engineers shipping a SaaS product, running your own PostgreSQL is the equivalent of hiring a sixth engineer whose entire job is keeping PostgreSQL alive -- except you do not hire that person, so the work falls on everyone, and nobody does it well.
Managed services trade money for engineering time. For startups and small teams (under 50 engineers), this trade is almost always correct. The cloud bill goes up by hundreds of dollars per month; the engineering team gets back thousands of dollars in reclaimed time. Self-host only when the managed service genuinely cannot meet your requirements -- and document the justification in an ADR.
Every self-hosted service carries a recurring operations cost that is invisible until something breaks.
| Operations Task | Managed Service | Self-Hosted |
|-----------------|-----------------|-------------|
| OS/kernel patching | Provider handles it | You schedule downtime, test, apply |
| Version upgrades | One-click or automatic | You test, migrate, rollback-plan, execute |
| Backup & restore | Automated, point-in-time | You configure, verify, test restores quarterly |
| Scaling | Auto-scaling or single API call | You monitor, forecast, provision, rebalance |
| High availability | Built-in multi-AZ/region | You design, implement, test failover |
| Security hardening | Provider hardens, you configure | You harden OS, network, application, and runtime |
| Monitoring | Built-in metrics and logs | You deploy exporters, configure dashboards, set alerts |
| Incident response | Provider's SRE team + your config | Your team, 24/7, for infrastructure AND application |
| Compliance | Provider certifications (SOC2, HIPAA) | You certify the infrastructure yourself |
The compound effect: one self-hosted service is manageable. Three self-hosted services (database + cache + monitoring stack) consume 30-50% of a small team's operational capacity. Five self-hosted services and you are an infrastructure company that happens to also build a product.
Kubernetes is the most frequently self-hosted service that teams do not need. For teams under 50 engineers running fewer than 20 services, managed container platforms provide the same deployment model (containers, health checks, scaling, load balancing) without the operational overhead of cluster management, node pool sizing, ingress controller configuration, CNI plugin selection, and etcd maintenance.
| Criterion | Use Managed Containers | Use Kubernetes |
|-----------|------------------------|----------------|
| Team size | Under 50 engineers | 50+ engineers with dedicated platform team |
| Service count | Under 20 services | 20+ services with complex networking |
| GPU workloads | No, or minimal | Heavy GPU scheduling requirements |
| Custom scheduling | Not needed | Custom schedulers, operators, CRDs required |
| Multi-cloud | Not required | Required for portability |
| Service mesh | Not needed | Istio/Linkerd required |
| Compliance | Standard | Requires specific K8s-level audit controls |
```hcl
# Good: managed container service for a team of 8 engineers
resource "aws_ecs_service" "myapp" {
  name            = "myapp"
  cluster         = data.terraform_remote_state.compute.outputs.cluster_arn
  task_definition = aws_ecs_task_definition.myapp.arn
  desired_count   = 2

  capacity_provider_strategy {
    capacity_provider = "FARGATE"
    weight            = 50 # Increase to 100 for production-critical services
  }

  capacity_provider_strategy {
    capacity_provider = "FARGATE_SPOT"
    weight            = 50 # Spot can be interrupted; suitable for dev, use cautiously in prod
  }

  deployment_circuit_breaker {
    enable   = true
    rollback = true
  }

  # Zero-downtime rolling update
  deployment_maximum_percent         = 200
  deployment_minimum_healthy_percent = 100
}

# Result: no nodes to patch, no cluster upgrades, no CNI plugins,
# no ingress controllers, no etcd backups. Deploy and forget.
```
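The service above pins `desired_count = 2`. As a minimal sketch of the "auto-scaling or single API call" row in the operations table, target-tracking scaling could be attached as follows. The variable `ecs_cluster_name`, the capacity bounds, and the 60% CPU target are illustrative assumptions, not values from this skill.

```hcl
# Assumption: the ECS cluster *name* (not its ARN) is available as a variable.
variable "ecs_cluster_name" {
  type        = string
  description = "Name of the ECS cluster that runs myapp (hypothetical input)"
}

# Register the service's desired count as a scalable target.
resource "aws_appautoscaling_target" "myapp" {
  service_namespace  = "ecs"
  scalable_dimension = "ecs:service:DesiredCount"
  resource_id        = "service/${var.ecs_cluster_name}/${aws_ecs_service.myapp.name}"
  min_capacity       = 2 # illustrative bounds
  max_capacity       = 6
}

# Let the provider add or remove tasks to hold average CPU near the target.
resource "aws_appautoscaling_policy" "myapp_cpu" {
  name               = "myapp-cpu-target-tracking"
  policy_type        = "TargetTrackingScaling"
  service_namespace  = aws_appautoscaling_target.myapp.service_namespace
  scalable_dimension = aws_appautoscaling_target.myapp.scalable_dimension
  resource_id        = aws_appautoscaling_target.myapp.resource_id

  target_tracking_scaling_policy_configuration {
    target_value = 60 # percent CPU; illustrative

    predefined_metric_specification {
      predefined_metric_type = "ECSServiceAverageCPUUtilization"
    }
  }
}
```

If autoscaling owns the task count, the service would typically also need `lifecycle { ignore_changes = [desired_count] }` so Terraform does not reset it on every apply.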
```hcl
# Bad: Kubernetes for the same team of 8
# (EKS runs the control plane; nodes, add-ons, and upgrades are still yours)
resource "aws_eks_cluster" "main" {
  name     = "myapp-cluster"
  role_arn = aws_iam_role.eks.arn
  version  = "1.28" # Upstream ships a new minor version roughly every 4 months; each upgrade is on you

  vpc_config {
    subnet_ids = var.private_subnet_ids
  }
}

resource "aws_eks_node_group" "workers" {
  cluster_name    = aws_eks_cluster.main.name
  node_group_name = "workers"
  node_role_arn   = aws_iam_role.eks_nodes.arn # Required argument; role name is illustrative
  subnet_ids      = var.private_subnet_ids     # Required argument
  instance_types  = ["m5.large"]

  scaling_config {
    desired_size = 3
    max_size     = 6
    min_size     = 2
  }

  # Now you also need: ingress-nginx, cert-manager, external-dns,
  # metrics-server, cluster-autoscaler, aws-load-balancer-controller,
  # and someone to upgrade all of them every quarter.
}
```
Self-hosted workflow engines (Airflow on EC2/K8s, Temporal self-hosted, Prefect server) require database backends, worker scaling, scheduler high availability, log aggregation, and web UI hosting. Managed workflow services handle all of this.
| Approach | What You Manage | What the Provider Manages |
|----------|-----------------|---------------------------|
| Managed Airflow | DAG code, connections, variables | Scheduler HA, worker scaling, web UI, database, upgrades |
| Self-hosted Airflow | DAG code, connections, variables, scheduler HA, worker scaling, web UI, metadata DB, Redis/Celery, upgrades, monitoring | Nothing |
| Managed step functions | Workflow definitions | Execution, scaling, retry, logging, state persistence |
| Self-hosted Temporal | Workflow code, namespace management, history DB, visibility DB, upgrades, monitoring | Nothing |
The breaking point: self-hosted Airflow is three services (scheduler, webserver, workers), a metadata database, a message broker, and a log storage backend. That is six components to keep alive for a workflow engine that is supposed to keep your other workflows alive.
Do your research first: managed workflow services vary significantly in quality. Sometimes your cloud provider's offering (e.g., MWAA) is the right choice; sometimes a specialized third-party provider (e.g., Astronomer for Airflow) offers a materially better experience. Evaluate both before committing.
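To make the "managed step functions" row concrete, here is a minimal sketch of a workflow where the only thing you own is the definition. The variables `sfn_role_arn` and `import_lambda_arn`, the state machine name, and the retry numbers are hypothetical. Execution, retries, and state persistence are handled by the service.

```hcl
# Assumptions: both ARNs are supplied from elsewhere; names are illustrative.
variable "sfn_role_arn" {
  type        = string
  description = "IAM role that Step Functions assumes (assumed to exist)"
}

variable "import_lambda_arn" {
  type        = string
  description = "Lambda function invoked by the workflow (hypothetical)"
}

resource "aws_sfn_state_machine" "nightly_import" {
  name     = "nightly-import"
  role_arn = var.sfn_role_arn

  # The workflow definition (Amazon States Language) is all you manage.
  definition = jsonencode({
    Comment = "Single task with provider-managed retries"
    StartAt = "RunImport"
    States = {
      RunImport = {
        Type     = "Task"
        Resource = var.import_lambda_arn
        Retry = [{
          ErrorEquals     = ["States.TaskFailed"]
          IntervalSeconds = 30
          MaxAttempts     = 3
          BackoffRate     = 2.0
        }]
        End = true
      }
    }
  })
}
```

There is no scheduler, metadata database, or message broker in this configuration, which is the contrast the table draws against the six-component self-hosted Airflow stack.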
There is almost no scenario where a startup or small team should run a self-hosted database or cache in production. The managed service gives you automated backups, point-in-time recovery, failover, patching, and monitoring for a modest premium over the raw compute cost.
```hcl
# Good: managed database with automated operations
resource "aws_db_instance" "myapp" {
  identifier                   = "${module.labels.prefix}myapp-db"
  engine                       = "postgres"
  engine_version               = "15.4"
  instance_class               = "db.t4g.medium"
  multi_az                     = true # Automatic failover
  backup_retention_period      = 14   # 14-day point-in-time recovery
  auto_minor_version_upgrade   = true # Security patches applied automatically
  storage_encrypted            = true
  performance_insights_enabled = true # Built-in query monitoring
  deletion_protection          = true
}
```
```hcl
# Bad: self-hosted PostgreSQL on an EC2 instance
resource "aws_instance" "postgres" {
  ami           = "ami-0abcdef1234567890"
  instance_type = "m5.large"

  # Now you must:
  # - Install and configure PostgreSQL
  # - Set up streaming replication for HA
  # - Configure automated backups to object storage
  # - Test backup restores quarterly
  # - Apply OS security patches monthly
  # - Apply PostgreSQL patches on your schedule
  # - Monitor replication lag, connections, disk, memory
  # - Handle failover manually or build automation
  # - Manage SSL certificates for connections
  # - None of this is in the Terraform above
}
```
The same logic applies to caches. A managed Redis/Valkey instance with automatic failover, patching, and backup costs marginally more than the equivalent EC2 instance and saves dozens of hours per quarter in operational toil.
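A managed cache with the properties named above (automatic failover, patching, backup) might look like the following sketch. It assumes a recent AWS provider (the `description` argument) and an existing subnet group and security groups passed in as hypothetical variables. The node type, replica count, and retention period are illustrative.

```hcl
# Assumptions: networking already exists and is passed in; names are illustrative.
variable "cache_subnet_group_name" {
  type = string
}

variable "cache_security_group_ids" {
  type = list(string)
}

resource "aws_elasticache_replication_group" "myapp" {
  replication_group_id = "myapp-cache"
  description          = "Managed Redis for myapp"
  engine               = "redis"
  node_type            = "cache.t4g.small"

  num_cache_clusters         = 2    # Primary plus one replica
  automatic_failover_enabled = true # Replica is promoted if the primary fails
  multi_az_enabled           = true # Nodes placed in different availability zones
  auto_minor_version_upgrade = true # Patching handled by the provider
  snapshot_retention_limit   = 7    # Daily snapshots kept for 7 days
  at_rest_encryption_enabled = true
  transit_encryption_enabled = true

  subnet_group_name  = var.cache_subnet_group_name
  security_group_ids = var.cache_security_group_ids
}
```

The self-hosted equivalent would need a replication setup, a failover mechanism, a backup job, and OS patching, none of which appear in Terraform.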
Self-hosted monitoring stacks (Prometheus + Grafana + Alertmanager + Loki) are four services that each need their own storage, scaling, and high availability. When your monitoring is down, you are blind to everything else being down. Managed monitoring services eliminate this circular dependency.
| Component | Self-Hosted | Managed Alternative |
|-----------|-------------|---------------------|
| Metrics collection | Prometheus (+ storage, HA, federation) | Managed Prometheus / cloud metrics |
| Visualization | Grafana (+ database, auth, HA) | Managed Grafana / cloud dashboards |
| Alerting | Alertmanager (+ dedup, routing, HA) | Cloud alerting / managed alert rules |
| Log aggregation | Loki or ELK (+ storage, retention, indexing) | Cloud logging service |
The irony of self-hosted monitoring: the one service that must be available when everything else is failing is the one you built yourself on the same infrastructure that is failing. Managed monitoring runs on the provider's infrastructure, independent of your workloads.
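As a rough sketch of the managed alternative on AWS, the metrics store and dashboards can each be a single resource. The workspace names and the `grafana_role_arn` variable are hypothetical, and the `AWS_SSO` authentication setting assumes IAM Identity Center is already enabled in the account.

```hcl
# Assumption: an IAM role for the Grafana workspace already exists.
variable "grafana_role_arn" {
  type        = string
  description = "Service role for Amazon Managed Grafana (hypothetical input)"
}

# Managed Prometheus-compatible metrics store: no storage, HA, or federation to run.
resource "aws_prometheus_workspace" "main" {
  alias = "myapp-metrics"
}

# Managed Grafana: no database, auth backend, or HA setup to maintain.
resource "aws_grafana_workspace" "main" {
  name                     = "myapp-dashboards"
  account_access_type      = "CURRENT_ACCOUNT"
  authentication_providers = ["AWS_SSO"]
  permission_type          = "SERVICE_MANAGED"
  role_arn                 = var.grafana_role_arn
  data_sources             = ["PROMETHEUS"]
}

# Endpoint that collectors in your workloads would send metrics to.
output "prometheus_endpoint" {
  value = aws_prometheus_workspace.main.prometheus_endpoint
}
```

Both workspaces run on the provider's infrastructure, so an outage in your own workloads does not take the dashboards down with it.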
Self-hosting is justified when -- and only when -- the managed service genuinely cannot meet a hard requirement. Document every exception in an ADR with this structure:
If your platform team would not accept the operational burden of maintaining it, do not self-host it. Use the managed service -- that is the paved road. Self-hosted Kubernetes needs a dedicated platform engineer. Self-hosted monitoring needs an observability engineer. If those roles do not exist on your team, the managed equivalent is the correct choice.
| Concept | AWS | GCP | Azure |
|---------|-----|-----|-------|
| Managed containers (standard) | ECS Fargate | Cloud Run / GKE Autopilot | Container Apps |
| Managed containers (GPU) | ECS with EC2 capacity providers | GKE with GPU node pools | AKS with GPU node pools |
| Managed Kubernetes | EKS (if you must) | GKE Autopilot | AKS |
| Managed PostgreSQL | RDS PostgreSQL / Aurora | Cloud SQL / AlloyDB | Azure Database for PostgreSQL |
| Managed Redis/cache | ElastiCache / MemoryDB | Memorystore | Azure Cache for Redis |
| Managed workflow engine | MWAA (Airflow) / Step Functions | Cloud Composer / Workflows | (no direct Airflow equivalent) / Logic Apps |
| Managed Prometheus | Amazon Managed Prometheus | Cloud Monitoring (built-in) | Azure Monitor (Prometheus) |
| Managed Grafana | Amazon Managed Grafana | Cloud Monitoring dashboards | Azure Managed Grafana |
| Managed log aggregation | CloudWatch Logs | Cloud Logging | Azure Monitor Logs |
Working implementations in examples/:
- examples/managed-container-service.md -- Complete managed container deployment with spot/preemptible capacity, circuit breaker rollback, auto-scaling, and zero-downtime rolling updates -- no cluster management required
- examples/managed-data-stack.md -- Production-grade managed database and cache with automated backups, failover, encryption, and monitoring -- contrasted against the self-hosted equivalent to illustrate the operations tax

When designing or reviewing service hosting decisions: