The Four Phases

You MUST complete each phase before proceeding to the next.

Phase 1: Architecture & Security (The Blueprint)

BEFORE writing a single line of YAML or Terraform:

Define the Topology
- Draw the network diagram (VPCs, Subnets, Firewalls).
- Where does the data live? Is it public or private?
- Rule: Nothing talks to the internet unless explicitly required. Default to private.
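
A minimal Terraform sketch of the default-to-private rule, assuming GCP. The resource names, region, CIDR range, and rule priority are illustrative placeholders, not values from this guide.

```hcl
# Custom-mode VPC: no subnets are created automatically.
resource "google_compute_network" "main" {
  name                    = "main-vpc" # placeholder name
  auto_create_subnetworks = false
}

# Private subnet: instances can reach Google APIs without external IPs.
resource "google_compute_subnetwork" "private" {
  name                     = "private-subnet" # placeholder name
  region                   = "us-central1"    # assumed region
  ip_cidr_range            = "10.10.0.0/24"   # assumed range
  network                  = google_compute_network.main.id
  private_ip_google_access = true
}

# Explicit low-priority deny so the "default to private" rule is visible in code.
# Any ingress that is really required gets its own higher-priority allow rule.
resource "google_compute_firewall" "deny_all_ingress" {
  name          = "deny-all-ingress"
  network       = google_compute_network.main.name
  direction     = "INGRESS"
  priority      = 65534
  source_ranges = ["0.0.0.0/0"]

  deny {
    protocol = "all"
  }
}
```

GCP already applies an implied deny to ingress; writing the rule down mainly makes the intent reviewable and easy to audit.
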
Identity & Access Management (IAM)
- Define "Who" needs access to "What."
- Principle of Least Privilege: A build service should not have Admin access.
- No Hardcoded Secrets: Plan where keys/passwords live (Secret Manager, Vault).
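
One way to express least privilege and secret storage in Terraform, sketched for GCP. The account ID, role choice, secret name, and the project_id variable are assumptions for illustration; the replication block uses the syntax of recent Google provider versions (5.x and later).

```hcl
variable "project_id" {
  type        = string
  description = "Target GCP project (hypothetical input)."
}

# Dedicated identity for the build service instead of a shared admin account.
resource "google_service_account" "build" {
  account_id   = "ci-build" # placeholder
  display_name = "CI build service"
}

# Narrow role: the build may push artifacts, nothing more.
resource "google_project_iam_member" "build_artifact_writer" {
  project = var.project_id
  role    = "roles/artifactregistry.writer"
  member  = "serviceAccount:${google_service_account.build.email}"
}

# The secret's container lives in Secret Manager; the value itself is added
# out of band and never appears in the repository.
resource "google_secret_manager_secret" "api_key" {
  secret_id = "api-key" # placeholder

  replication {
    auto {}
  }
}

# Only the build identity may read this one secret.
resource "google_secret_manager_secret_iam_member" "build_reads_api_key" {
  secret_id = google_secret_manager_secret.api_key.secret_id
  role      = "roles/secretmanager.secretAccessor"
  member    = "serviceAccount:${google_service_account.build.email}"
}
```
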
Cost Estimation
- Estimate the monthly cost with your cloud provider's pricing calculator.
- Set up Budget Alerts before deploying resources.
- Avoid "Cloud Shock" by defining retention policies (logs, backups) now.

Phase 1.5: FinOps & Cost Optimization (2026)

Cloud costs are the #2 startup killer (after running out of customers):

Tagging Strategy (Cost Accountability)
- Required Tags: environment, team, project, cost-center
- Enable: Cost allocation reports, blame tracking
- Enforcement: Tag policies (reject untagged resources)
```
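resource "google_storage_bucket" "data" {
  # name and location are required arguments; these values are placeholders.
  name     = "example-analytics-data"
  location = "US"

  # Label keys mirror the required tags so cost reports can group by them.
  labels = {
    environment   = var.environment
    team          = "data-platform"
    project       = "analytics"
    "cost-center" = var.cost_center # placeholder variable
  }
}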
```
Right-Sizing & Committed Use
- Spot/Preemptible Instances: 60-90% cheaper for fault-tolerant workloads
- Reserved Instances: 1-3 year commit for predictable loads (40-60% savings)
- Savings Plans: Flexible commitment (AWS/Azure)
- Auto-scaling: Scale down aggressively (nights, weekends)
- Tool: Infracost (estimate before deploy), Kubecost (k8s)
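
A sketch of a Spot VM for a fault-tolerant workload, assuming GCP Compute Engine. The name, machine type, zone, image, and subnetwork are placeholders chosen for illustration.

```hcl
resource "google_compute_instance" "batch_worker" {
  name         = "batch-worker"  # placeholder
  machine_type = "e2-standard-2" # assumed size
  zone         = "us-central1-a" # assumed zone

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  # No access_config block, so the instance gets no external IP.
  network_interface {
    subnetwork = "private-subnet" # placeholder
  }

  # Spot capacity can be reclaimed at any time, so the workload must tolerate it.
  scheduling {
    provisioning_model          = "SPOT"
    preemptible                 = true
    automatic_restart           = false
    instance_termination_action = "STOP"
  }
}
```
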
Cost Alerts & Anomaly Detection
- Budget Alerts: 50%, 80%, 100%, 120% thresholds
- Per-Service Budgets: Separate for Compute, Storage, Network
- Anomaly Detection: Alert when daily spend exceeds 2x the 7-day average
- Example: "BigQuery cost is 10x higher today vs yesterday"
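
The budget thresholds above could be declared roughly like this, assuming GCP's billing budget resource. The billing_account_id variable, display name, and the 1000 USD amount are hypothetical values.

```hcl
variable "billing_account_id" {
  type        = string
  description = "Billing account ID (hypothetical input)."
}

resource "google_billing_budget" "monthly" {
  billing_account = var.billing_account_id
  display_name    = "monthly-platform-budget" # placeholder

  amount {
    specified_amount {
      currency_code = "USD"
      units         = "1000" # assumed monthly budget
    }
  }

  # One threshold rule per alert level: 50%, 80%, 100%, 120%.
  dynamic "threshold_rules" {
    for_each = [0.5, 0.8, 1.0, 1.2]
    content {
      threshold_percent = threshold_rules.value
    }
  }
}
```

Per-service budgets would repeat this resource with a budget_filter block scoped to each service; that part is omitted here because the service IDs are account-specific.
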
Storage Lifecycle Policies
- Hot → Warm → Cold → Archive
  - S3: Standard → IA → Glacier → Deep Archive
  - GCS: Standard → Nearline → Coldline → Archive
- Auto-delete: Logs after 90 days, temp files after 7 days
- Savings: 50-90% on storage costs
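
A sketch of the GCS tiering path and the 90-day log deletion as lifecycle rules. The bucket names, location, and the 30/90/365-day transition ages are assumptions; only the 90-day log deletion comes from the list above.

```hcl
resource "google_storage_bucket" "datasets" {
  name     = "example-datasets" # placeholder
  location = "US"

  # Standard -> Nearline
  lifecycle_rule {
    condition {
      age = 30 # assumed age in days
    }
    action {
      type          = "SetStorageClass"
      storage_class = "NEARLINE"
    }
  }

  # Nearline -> Coldline
  lifecycle_rule {
    condition {
      age = 90 # assumed
    }
    action {
      type          = "SetStorageClass"
      storage_class = "COLDLINE"
    }
  }

  # Coldline -> Archive
  lifecycle_rule {
    condition {
      age = 365 # assumed
    }
    action {
      type          = "SetStorageClass"
      storage_class = "ARCHIVE"
    }
  }
}

resource "google_storage_bucket" "service_logs" {
  name     = "example-service-logs" # placeholder
  location = "US"

  # Auto-delete logs after 90 days.
  lifecycle_rule {
    condition {
      age = 90
    }
    action {
      type = "Delete"
    }
  }
}
```
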

Phase 2: Infrastructure as Code (IaC)

Treat infrastructure like software:

Codify Resources (Terraform/OpenTofu)
- Rule: Never click in the Console to create production resources ("ClickOps").
- Use Modules for repeatable patterns (e.g., a standard bucket configuration).
- Store state remotely and lock it (e.g., GCS bucket with state locking).
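
A minimal sketch of remote state plus a reusable module, assuming a GCS backend (which locks state natively). The bucket name, prefix, module path, and module inputs are hypothetical.

```hcl
terraform {
  required_version = ">= 1.5.0"

  # Backend blocks cannot reference variables, so these are literal values.
  backend "gcs" {
    bucket = "example-terraform-state" # placeholder; create this bucket first
    prefix = "platform/prod"
  }
}

# A standard bucket pattern packaged once and reused everywhere.
module "reports_bucket" {
  source      = "./modules/standard-bucket" # hypothetical local module
  name        = "example-reports"
  environment = "prod"
}
```
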
Immutability Strategy
- Don't SSH into servers to patch them.
- Plan to replace infrastructure, not repair it ("Cattle, not Pets").
- Define your container strategy (Dockerfiles, Base Images).
Environment Parity
- Staging should look like Production (just smaller).
- Use variables to handle differences, not different logic.
- Goal: If it works in Staging, it works in Prod.
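
Environment parity through variables might look like the following sketch: one resource definition, with only sizes differing per environment. Variable names, machine types, the zone, and the tfvars file names are assumptions.

```hcl
variable "environment" {
  type = string

  validation {
    condition     = contains(["staging", "prod"], var.environment)
    error_message = "Environment must be staging or prod."
  }
}

variable "machine_type" {
  type = string
}

variable "instance_count" {
  type = number
}

# staging.tfvars (assumed): environment = "staging", machine_type = "e2-small",      instance_count = 1
# prod.tfvars    (assumed): environment = "prod",    machine_type = "e2-standard-4", instance_count = 3

# Same logic in every environment; only the inputs change.
resource "google_compute_instance" "app" {
  count        = var.instance_count
  name         = "app-${var.environment}-${count.index}"
  machine_type = var.machine_type
  zone         = "us-central1-a" # assumed zone

  boot_disk {
    initialize_params {
      image = "debian-cloud/debian-12"
    }
  }

  network_interface {
    subnetwork = "private-subnet" # placeholder
  }
}
```

Each environment is then applied with its own file, for example terraform apply -var-file=staging.tfvars.
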

Phase 2.5: GitOps & Modern Deployment

The infrastructure deployment evolution:

GitOps Pattern
- Git as Source of Truth: Desired state lives in Git
- Pull-Based Deployment: Cluster pulls config (vs push)
- Auto-Sync: Changes to Git automatically deploy
- Tools: ArgoCD (k8s), Flux, Atlantis (Terraform)
- Benefits: Audit trail, easy rollback, no CI/CD credentials in cluster
Drift Detection
- Problem: Someone clicks in console, state diverges from Git
- Solution: Automated drift detection + alerts
- Terraform: Run terraform plan -detailed-exitcode in CI/CD on a schedule; exit code 2 means the plan found changes, i.e. drift
- ArgoCD: Visual diff of desired vs actual state
Progressive Delivery
- Blue/Green: Deploy new version, switch traffic instantly
- Canary: Route 5% → 25% → 100% traffic gradually
- Feature Flags: Enable features for subset of users
- Tools: Flagger (k8s), LaunchDarkly, Split.io

Phase 3: The Pipeline (CI/CD)

Automate the factory:

Continuous Integration (CI)
- Lint and scan the code (tfsec for Terraform, Hadolint for Dockerfiles).
- Run unit tests.
- Build the artifact (Container Image, Binary).
- Rule: The build fails if the tests fail. No exceptions.
Continuous Deployment (CD)
- Automate the rollout.
- Gatekeeping: Require approval for Prod deployment?
- Strategy: Blue/Green? Canary? Rolling Update? Choose one to minimize downtime.
Rollback Strategy
- "Break Glass" procedure: How do we undo the last deploy instantly?
- Test the rollback mechanism. A rollback that fails is an outage.

Phase 4: Observability & Reliability (SRE)

Flying the plane:

Logging & Tracing
- Centralize logs (don't leave them on the instance).
- Ensure logs are structured (JSON) so they can be queried.
- Correlate logs across services (Trace ID).
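
One possible way to get logs off individual instances and into a central store, sketched for GCP with a project-level log sink. The bucket name, sink name, and severity filter are assumptions; other providers and log platforms have their own equivalents.

```hcl
resource "google_storage_bucket" "central_logs" {
  name     = "example-central-logs" # placeholder
  location = "US"
}

# Route matching project logs to the central bucket.
resource "google_logging_project_sink" "central" {
  name                   = "central-log-sink" # placeholder
  destination            = "storage.googleapis.com/${google_storage_bucket.central_logs.name}"
  filter                 = "severity >= WARNING" # assumed filter
  unique_writer_identity = true
}

# The sink writes with its own identity, which needs permission on the bucket.
resource "google_storage_bucket_iam_member" "sink_writer" {
  bucket = google_storage_bucket.central_logs.name
  role   = "roles/storage.objectCreator"
  member = google_logging_project_sink.central.writer_identity
}
```
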
Metrics & Alerting
- The Golden Signals: Latency, Traffic, Errors, Saturation.
- Alert on symptoms (User sees 500 error), not just causes (CPU high).
- Rule: If an alert wakes you up, it must be actionable. If not, delete the alert.
Disaster Recovery (DR)
- Automate backups.
- Crucial: Test the Restore. A backup you haven't restored is just a file you hope works.

Red Flags - STOP and Follow Process

If you catch yourself thinking:

"I'll just change this firewall rule in the console quickly to test."
"I'll hardcode the API key just for this commit."
"I'll write the Terraform import later."
"It works on my laptop, just ship the container."
"We don't need logs for this service."
"I'll ssh in and restart the service manually."
"I'll just give the service account Admin or Owner permissions for now."

ALL of these mean: STOP. Return to Phase 1/2.

Your Human Partner's Signals You're Doing It Wrong

Watch for these complaints:

Dev: "The build takes 20 minutes." (Your pipeline is bloated).
PM: "Why did Staging break?" (You lacked Environment Parity).
Sec: "Why is port 22 open to the world?" (You failed Phase 1).
Fin: "Why is the bill 3x higher this month?" (You failed Cost Estimation).
Dev: "I can't debug this without SSH." (You failed Observability).

When you see these: STOP. Fix the platform/process.

Common Rationalizations

| Excuse | Reality |
|--------|---------|
| "ClickOps is faster for prototyping" | It creates "Configuration Drift" instantly. |
| "We'll secure the bucket later" | You will be hacked/scraped before "later" comes. |
| "I don't need a budget alert" | One infinite loop can cost $10k overnight. |
| "Terraform is too complex for this" | Manual maintenance is infinitely more complex over time. |
| "I'll document the manual steps" | Docs get outdated. Code (IaC) is the documentation. |
| "Logs are too expensive" | Downtime is more expensive. Sample/filter them. |

Quick Reference

| Phase | Key Activities | Success Criteria |
|-------|---------------|------------------|
| 1. Architecture | Topology, IAM, Cost | Secure design, Budget set |
| 2. IaC | Terraform, Docker, State | Reproducible infra, No manual clicks |
| 3. Pipeline | Lint, Build, Test, Deploy | Automated flow from Git to Prod |
| 4. Observability | Golden Signals, Logs, Backups | You know it's broken before users do |

When Process Reveals "Legacy Drift"

When you inherit a project built via "ClickOps" (manual console work):

Don't nuke it.
Audit: Use tools (like Terraformer or gcloud beta resource-config bulk-export) to read the current state.
Import: Systematically import resources into Terraform state.
Lock Down: Once imported, remove write access to the Console to force usage of the pipeline.
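
The import step can be sketched with Terraform's import block (available in Terraform 1.5 and later). The bucket name and location are placeholders standing in for a resource that was created by hand.

```hcl
# Tell Terraform to adopt the existing bucket instead of creating a new one.
import {
  to = google_storage_bucket.legacy_uploads
  id = "legacy-uploads-bucket" # placeholder: the real bucket's name
}

# The configuration must describe the resource as it exists today;
# otherwise the next plan will propose changes to it.
resource "google_storage_bucket" "legacy_uploads" {
  name     = "legacy-uploads-bucket" # placeholder
  location = "US"                    # must match the real bucket
}
```

Running terraform plan shows the pending import before anything is touched, and terraform plan -generate-config-out=generated.tf can draft the resource block when writing it by hand is impractical.
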

Supporting Techniques

superpowers:infrastructure-as-code - Writing clean Terraform modules.
superpowers:container-security - Scanning images for vulnerabilities.
superpowers:post-mortem - Learning from outages (Blameless RCA).

Real-World Impact

"SysAdmin" Approach: Manual servers, "it works on my machine," fear of Friday deploys, 3 AM wake-up calls.
"DevOps" Approach: Immutable infrastructure, automated pipelines, deploys on Friday at 4 PM without sweating, sleeping through the night.

🛠️ Modern Platform Stack (2026)

Infrastructure as Code

Terraform/OpenTofu: Multi-cloud IaC
Pulumi: Code-native IaC (TypeScript, Python, Go)
AWS CDK: Code-native IaC for AWS (synthesizes to CloudFormation)

Container Orchestration

Kubernetes: Industry standard (GKE, EKS, AKS)
Nomad: Simpler alternative (HashiCorp)
Cloud Run / Fargate: Serverless containers

GitOps

ArgoCD: Kubernetes deployments
Flux: CNCF GitOps
Atlantis: Terraform pull requests

Serverless

Functions: AWS Lambda, Cloud Functions, Cloudflare Workers
Edge Compute: Cloudflare Workers (ms latency globally)
Containers: Cloud Run, Fargate, Fly.io

Observability

OpenTelemetry: Unified standard
Platforms: Datadog, Grafana Cloud, New Relic
Logs: Loki, CloudWatch, Mezmo
Metrics: Prometheus, VictoriaMetrics

CI/CD

GitHub Actions: Popular, integrated
GitLab CI: Full DevOps platform
CircleCI / Buildkite: Specialized

💰 FinOps Best Practices

Cost Optimization Checklist

[ ] Tag all resources with team/project/environment
[ ] Set up budget alerts at 50%/80%/100%/120%
[ ] Use spot instances for dev/test environments
[ ] Implement auto-scaling with aggressive scale-down
[ ] Set lifecycle policies on storage (hot → cold → archive)
[ ] Review and delete unused resources monthly
[ ] Use reserved instances/savings plans for stable workloads
[ ] Monitor egress costs (data transfer between regions)
[ ] Compress and deduplicate backups
[ ] Use CDN to reduce origin server traffic

Cost Monitoring Dashboard

| Metric | Alert Threshold | Why |
|--------|----------------|-----|
| Daily spend | 2x 7-day average | Detect runaway costs |
| Per-service budget | 80% of monthly | Prevent overruns |
| Egress traffic | 10% increase | Network costs spike |
| Storage growth | 20% month-over-month | Lifecycle policies failing |
| Idle resources | > 7 days unused | Wasted spend |

