cloud-foundation-principles/skills/repository-and-state-strategy/SKILL.md
This skill should be used when the user is structuring Terraform repositories, deciding between mono-repo and multi-repo strategies, organizing infrastructure into layers, designing state management architecture, setting up cross-layer dependencies, or evaluating blast radius of infrastructure changes. Covers multi-repository strategy, numbered layer architecture, state-per-layer-per-environment isolation, cross-layer remote state references, and deployment ordering.
A single Terraform state file that contains your entire infrastructure is a liability disguised as simplicity. One bad terraform apply can destroy your network, databases, and compute in a single operation. One state file corruption locks out every team. One slow plan blocks every deployment. The blast radius is everything, and the recovery plan is "restore from backup and pray."
Production infrastructure demands intentional separation -- separate repositories for separate concerns, separate state files for separate layers, and numbered directories that encode dependency order at a glance. This is not premature optimization. It is the difference between an outage that takes down one monitoring dashboard and an outage that takes down your entire platform.
Infrastructure repositories should be split by change cadence and ownership. Organization-level IAM changes happen monthly. Network changes happen quarterly. Service deployments happen daily. Forcing all three through the same repository, the same review process, and the same CI pipeline creates friction where none should exist.
REPOSITORIES
|
+-- tf-root <-- Organization & IAM management
| Scope: SSO, permissions, accounts, security delegation
| Changes: Monthly (new users, permission updates)
|
+-- tf-global-infrastructure <-- Shared infrastructure per environment
| Scope: VPCs, security groups, databases, compute clusters, monitoring
| Structure: Numbered layers (00-90) with env subdirectories
| Changes: Weekly (new resources, configuration updates)
|
+-- tf-module-labels <-- Foundational naming/tagging module
+-- tf-module-alerts <-- Monitoring/alerting module
+-- tf-module-container-service <-- Container orchestration module
|
+-- [per-service repos] <-- App-specific infrastructure
Each service manages its own Terraform alongside application code
Changes: Daily (deployments, scaling, feature flags)
Why separate repos, not directories in a mono-repo?
Within the global infrastructure repository, directories are numbered to encode dependency order. Lower numbers are prerequisites for higher numbers. Numbering in steps of ten (00, 10, 20 ... not 1, 2, 3) reserves space to insert or split layers without renumbering -- if databases grow complex, split 30_databases into 30_relational and 35_caches without touching anything else.
tf-global-infrastructure/
+-- 00_network/ <-- VPCs, subnets, Route53, VPN, VPC endpoints
| +-- dev/
| +-- prod/
+-- 10_security/ <-- Security groups, IAM roles, KMS, WAF, certificates
| +-- dev/
| +-- prod/
+-- 20_storage/ <-- Object storage, file systems
| +-- dev/
| +-- prod/
+-- 30_databases/ <-- Relational databases, caches, warehouses
| +-- dev/
| +-- prod/
+-- 40_compute/ <-- Container clusters, auto-scaling, GPU instances
| +-- dev/
| +-- prod/
+-- 50_edge/ <-- CDN, load balancers, API gateways
| +-- prod/
+-- 60_messaging/ <-- Message brokers, event buses, queues
| +-- dev/
| +-- prod/
+-- 70_monitoring/ <-- Metrics, dashboards, log aggregation
| +-- dev/
| +-- prod/
+-- 80_ci_cd/ <-- Build runners, pipeline infrastructure
+-- 90_shared_services/ <-- Bastion hosts, service discovery
+-- dev/
+-- prod/
Not every layer needs per-environment subdirectories. Layers like 80_ci_cd (e.g., self-hosted GitHub runners) are shared infrastructure — there is no reason to duplicate build runners per environment in a startup. Similarly, 50_edge may only exist in production if there is no dev CDN or load balancer. Only create environment subdirectories where the resources are actually environment-specific.
| Property | Benefit |
|----------|---------|
| Dependency encoding | Layer 40 (compute) cannot exist without layer 00 (network). The numbering makes this obvious. |
| Independent state | Each layer has its own state file. A bad apply in monitoring cannot destroy your network. |
| Independent CI/CD | Each layer can have its own pipeline. Network changes do not block compute deployments. |
| Clear mental model | New engineers understand the dependency graph in seconds, not hours. |
| Insert and split | Need to split databases into relational and caches? Insert 35_caches between 30 and 40 without renumbering anything. |
Layers deploy in numerical order. This is the full dependency chain:
tf-root (organization setup, SSO, security delegation)
|
v
00_network (VPCs, subnets, DNS, VPC endpoints)
|
v
10_security (security groups, KMS keys, certificates, WAF)
|
v
20_storage (object storage, file systems)
|
v
30_databases (relational databases, caches, warehouses)
|
v
40_compute (container clusters, auto-scaling groups)
|
v
50_edge (CDN distributions, load balancers)
|
v
60_messaging (message brokers, event buses, queues)
|
v
70_monitoring (metrics collection, dashboards, alerting)
|
v
80_ci_cd (build runners)
|
v
90_shared_services (bastion hosts, service discovery)
Dependencies are strictly forward: a layer may reference any lower-numbered layer via remote state, but never a higher-numbered one. Layer 50 can read from layers 00, 10, or 40 -- but layer 50 cannot depend on layer 60. This ensures the deployment chain is always acyclic and any layer can be planned or applied without waiting for higher layers to exist.
The cardinal rule of Terraform state management: every layer in every environment gets its own state file. No exceptions. No "we will split it later." Split it now.
One state bucket per cloud account (state buckets use <org>-<env>-tfstate as an exception
to the labels module naming -- they are account-global and need globally unique names):
myorg-root-tfstate <-- Root/management account
myorg-security-tfstate <-- Security account
myorg-log-archive-tfstate <-- Log archive account
myorg-dev-tfstate <-- Development account
myorg-prod-tfstate <-- Production account
Within each bucket, one key per layer or service:
myorg-dev-tfstate/
network <-- 00_network/dev state
security <-- 10_security/dev state
storage <-- 20_storage/dev state
databases <-- 30_databases/dev state
compute <-- 40_compute/dev state
messaging <-- 60_messaging/dev state
monitoring <-- 70_monitoring/dev state
shared_services <-- 90_shared_services/dev state
myapp-api <-- Service-owned state (separate repo)
billing-service <-- Service-owned state (separate repo)
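A minimal sketch of how one layer might point at its key in the bucket layout above. The file name `backend.tf` is a convention rather than a requirement, the region is an assumption, and `use_lockfile` assumes Terraform 1.10 or newer.

```hcl
# 00_network/dev/backend.tf -- illustrative sketch, not a definitive configuration
terraform {
  backend "s3" {
    bucket       = "myorg-dev-tfstate" # one bucket per cloud account
    key          = "network"           # one key per layer or service
    region       = "eu-west-1"         # assumed region
    encrypt      = true                # server-side encryption of the state object
    use_lockfile = true                # S3-native locking (assumes Terraform >= 1.10)
  }
}
```

The prod copy of the same layer would differ only in `bucket` (`myorg-prod-tfstate`), which keeps the two environments in separate state files and separate accounts.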
Every state bucket must have all four:
| Property | Setting | Why |
|----------|---------|-----|
| Encryption | AES-256 server-side | State contains secrets (database passwords, API keys) |
| Versioning | Enabled | Recover from accidental state corruption or deletion |
| Locking | Enabled | Prevent concurrent applies that corrupt state |
| Public access | Blocked | State files are the keys to your kingdom |
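One way these bucket-level properties could be expressed on AWS is sketched below. The resource label `tfstate` is hypothetical, and the split into separate resources assumes AWS provider v4 or newer. Locking is not a bucket setting in this sketch; with the S3 backend it is switched on in each layer's backend block via `use_lockfile = true`.

```hcl
# Illustrative sketch: hardening the dev state bucket (resource labels are hypothetical)
resource "aws_s3_bucket" "tfstate" {
  bucket = "myorg-dev-tfstate"
}

# Versioning: recover earlier state after corruption or deletion
resource "aws_s3_bucket_versioning" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  versioning_configuration {
    status = "Enabled"
  }
}

# Encryption: AES-256 server-side encryption by default
resource "aws_s3_bucket_server_side_encryption_configuration" "tfstate" {
  bucket = aws_s3_bucket.tfstate.id

  rule {
    apply_server_side_encryption_by_default {
      sse_algorithm = "AES256"
    }
  }
}

# Public access: block every form of public exposure
resource "aws_s3_bucket_public_access_block" "tfstate" {
  bucket                  = aws_s3_bucket.tfstate.id
  block_public_acls       = true
  block_public_policy     = true
  ignore_public_acls      = true
  restrict_public_buckets = true
}
```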
Higher layers read outputs from lower layers using remote state data sources. This creates explicit, auditable dependency chains.
# 40_compute/dev/main.tf -- Compute reads from network and security
data "terraform_remote_state" "network" {
  backend = "s3"
  config = {
    bucket = "myorg-dev-tfstate"
    key    = "network"
    region = "eu-west-1"
  }
}

data "terraform_remote_state" "security" {
  backend = "s3"
  config = {
    bucket = "myorg-dev-tfstate"
    key    = "security"
    region = "eu-west-1"
  }
}

locals {
  vpc_id          = data.terraform_remote_state.network.outputs.vpc_id
  private_subnets = data.terraform_remote_state.network.outputs.private_subnets
  base_sg_ids     = data.terraform_remote_state.security.outputs.base_security_group_ids
}
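The consuming side above only works if the lower layer publishes matching root-level outputs. A minimal sketch of what the network layer might expose follows; the resource labels `main` and `private`, the subnet count, and the CIDR ranges are assumptions for illustration, not the author's actual layout.

```hcl
# 00_network/dev/main.tf + outputs.tf -- illustrative sketch with hypothetical resources
resource "aws_vpc" "main" {
  cidr_block = "10.0.0.0/16"
}

resource "aws_subnet" "private" {
  count      = 2
  vpc_id     = aws_vpc.main.id
  cidr_block = cidrsubnet(aws_vpc.main.cidr_block, 8, count.index)
}

# Only root-module outputs are visible through terraform_remote_state
output "vpc_id" {
  description = "ID of the dev VPC, read by higher layers"
  value       = aws_vpc.main.id
}

output "private_subnets" {
  description = "IDs of the private subnets, read by higher layers"
  value       = aws_subnet.private[*].id
}
```

The output names form the contract between layers: renaming `vpc_id` here would break every layer that reads `outputs.vpc_id`.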
Dependency chain in practice:
- compute reads from network + security
- databases reads from network + security
- monitoring reads from compute + network
- shared_services reads from network
- edge reads from compute + security

Bad: Monolithic state file
myorg-dev-tfstate/
everything <-- One state file for ALL infrastructure
Problems: blast radius is everything. One bad apply can destroy networking, databases, and compute simultaneously. Plans take minutes as Terraform refreshes hundreds of resources. Two engineers cannot work on different layers in parallel.
Good: State per layer per environment
myorg-dev-tfstate/
network <-- 42 resources, 15-second plan
security <-- 28 resources, 10-second plan
databases <-- 15 resources, 8-second plan
compute <-- 35 resources, 12-second plan
Benefits: blast radius limited to one layer. Plans are fast. Engineers work on different layers in parallel. Recovery from corruption affects only one layer.
Bad: Environment state mixed together
# One state file contains both dev and prod resources
resource "aws_vpc" "dev" { cidr_block = "10.0.0.0/16" }
resource "aws_vpc" "prod" { cidr_block = "10.1.0.0/16" }
Problems: a mistake in dev configuration can destroy prod resources. No way to restrict who can modify prod without restricting dev.
Good: Separate directories, separate state, separate permissions
00_network/
dev/ -> myorg-dev-tfstate/network
prod/ -> myorg-prod-tfstate/network
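Because each environment and layer has its own bucket and key, access can be scoped accordingly. The following is a sketch of an IAM policy document that could limit a role to the dev network state only; the data source label and statement IDs are hypothetical, and the `.tflock` statement assumes S3-native locking is in use.

```hcl
# Illustrative sketch: permissions scoped to one layer's state in one environment
data "aws_iam_policy_document" "dev_network_state" {
  # The backend needs to list the bucket to locate the state object
  statement {
    sid       = "ListStateBucket"
    actions   = ["s3:ListBucket"]
    resources = ["arn:aws:s3:::myorg-dev-tfstate"]
  }

  # Read and write only the network layer's state key
  statement {
    sid       = "ReadWriteNetworkState"
    actions   = ["s3:GetObject", "s3:PutObject"]
    resources = ["arn:aws:s3:::myorg-dev-tfstate/network"]
  }

  # Create and release the lock file that sits next to the state key
  statement {
    sid       = "ManageNetworkLockFile"
    actions   = ["s3:GetObject", "s3:PutObject", "s3:DeleteObject"]
    resources = ["arn:aws:s3:::myorg-dev-tfstate/network.tflock"]
  }
}
```

A role holding only this policy cannot read or modify prod state, and cannot touch other dev layers either.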
| Concept | AWS | GCP | Azure |
|---------|-----|-----|-------|
| State backend | S3 bucket (use_lockfile) | GCS bucket (native locking) | Azure Blob Storage + lease locking |
| State encryption | AES-256 SSE-S3 or SSE-KMS | Default encryption (Google-managed or CMEK) | Storage Service Encryption (Microsoft-managed or CMK) |
| State locking | S3 native locking (use_lockfile = true) | GCS native locking | Blob lease locking |
| Remote state reference | terraform_remote_state with S3 backend | terraform_remote_state with GCS backend | terraform_remote_state with azurerm backend |
| Account isolation | AWS accounts via Organizations | GCP projects via folders | Azure subscriptions via Management Groups |
| State bucket per account | One S3 bucket per AWS account | One GCS bucket per GCP project | One Storage Account per Azure subscription |
Working implementations in examples/:
- examples/numbered-layer-layout.md -- Complete directory structure for a global infrastructure repository with numbered layers, environment subdirectories, and backend configuration for each layer
- examples/cross-layer-state-references.md -- Terraform configurations showing how the compute layer reads outputs from network and security layers via remote state, including the backend configuration and output definitions

When designing or reviewing repository and state architecture:
dev/, prod/)
terraform_remote_state data sources, not hardcoded values
<org>-<env>-tfstate)