skills/senior-devops/SKILL.md
Production infrastructure, CI/CD pipelines, container orchestration, and cloud operations. TRIGGER: "deploy", "infrastructure", "terraform", "kubernetes", "docker compose", "CI/CD", "monitoring", "alerting", "SLO", "load balancer", "autoscaling", "helm", "IaC", "container orchestration", "incident response", "rollback", "blue-green", "canary" EXCLUDE: GitHub Actions YAML (use github-actions-creator), Docker basics for development (use docker-expert), cloud pricing/billing questions
| File | Load When | Do NOT Load |
|------|-----------|-------------|
| references/infrastructure-as-code.md | Terraform state issues, module design, IaC migration, drift detection | Simple docker-compose or basic CLI questions |
| references/container-orchestration.md | K8s deployments, pod issues, Helm charts, service mesh, autoscaling | Single-container dev environments |
| references/observability-stack.md | Monitoring setup, alert fatigue, SLO design, incident response, dashboards | Application-level logging only |
IF single service or simple app:
IF microservices (3+ services communicating):
references/container-orchestration.md
IF serverless (event-driven, bursty, low-traffic):
references/infrastructure-as-code.md for module patterns
IF legacy migration (bare metal or manually configured VMs):
ss -tlnp and systemctl list-units --type=service to discover services

These are things experienced DevOps engineers know that contradict common advice:
CPU limits are almost always wrong. K8s CPU limits trigger CFS throttling -- the kernel pauses your container mid-request even when the node has idle cores. Set CPU requests (for scheduling) but omit CPU limits unless you have a specific noisy-neighbor problem. Google's internal clusters run without CPU limits. Memory limits remain mandatory (OOM is unrecoverable; CPU throttling is just slow).
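To see why a CPU limit can pause a container while the node still has idle cores, here is a rough sketch of the CFS quota arithmetic. It assumes the default 100 ms CFS period and that every busy thread stays runnable for the whole period; the function name is illustrative and not part of any Kubernetes API.

```python
CFS_PERIOD_MS = 100.0  # default CFS period (cpu.cfs_period_us = 100000)


def throttled_ms_per_period(cpu_limit_millicores: int, busy_threads: int) -> float:
    """Wall-clock ms per CFS period during which the container is paused.

    Simplifying assumption: `busy_threads` threads are runnable for the
    entire period, so quota is consumed `busy_threads` times faster than
    wall-clock time passes.
    """
    quota_ms = CFS_PERIOD_MS * cpu_limit_millicores / 1000.0
    demand_ms = CFS_PERIOD_MS * busy_threads
    if demand_ms <= quota_ms:
        return 0.0  # the limit is never reached, no throttling
    time_to_exhaust_ms = quota_ms / busy_threads
    return CFS_PERIOD_MS - time_to_exhaust_ms


if __name__ == "__main__":
    # 500m limit, 4 busy threads: quota is spent after 12.5 ms of wall time
    print(throttled_ms_per_period(500, 4))
```

Under these assumptions, a 500m limit with four busy threads leaves the container paused for most of every period, regardless of how idle the rest of the node is.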
Autoscaling is not the default -- right-sizing is. Teams enable HPA before profiling their service. Most services have stable, predictable load 95% of the time. HPA adds complexity (flapping, cold pods, connection pool exhaustion on scale-down). Right-size with static replicas first. Add HPA only when you have measured traffic variability that justifies it.
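The flapping risk is easier to reason about with the scaling rule in hand. The sketch below follows the formula given in the Kubernetes HPA documentation, desiredReplicas = ceil(currentReplicas * currentMetric / targetMetric), with the default 10% tolerance. It ignores stabilization windows, pod readiness handling, and min/max replica bounds, so treat it as a simplified model rather than the controller's full behavior.

```python
import math


def hpa_desired_replicas(
    current_replicas: int,
    current_metric: float,
    target_metric: float,
    tolerance: float = 0.1,
) -> int:
    """Simplified HPA scaling decision for a single metric."""
    ratio = current_metric / target_metric
    if abs(ratio - 1.0) <= tolerance:
        return current_replicas  # close enough to target: no scaling
    return math.ceil(current_replicas * ratio)


if __name__ == "__main__":
    # A spike followed by a lull moves the replica count up, then down.
    print(hpa_desired_replicas(4, current_metric=90, target_metric=60))
    print(hpa_desired_replicas(6, current_metric=30, target_metric=60))
```

A service whose utilization swings between these two readings would bounce between replica counts, which is the flapping pattern that static right-sizing avoids.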
Terraform modules should be boring. The instinct to build clever, reusable, parameterized modules creates "Terraform frameworks" that nobody can debug. A module should do one thing with minimal variables. Copy-paste between environments is acceptable when the alternative is a 40-variable module with conditionals. The DRY principle applies less aggressively to infrastructure than to application code.
Canary deployments are useless without observability. Teams implement canary deploys and declare victory. But if you cannot detect a 1% error rate increase within 5 minutes, the canary gives you zero signal. You need error budget burn rate alerting (see observability companion) before canary adds value. Without it, you are just doing a slow rolling deploy with extra steps.
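As a minimal sketch of the signal a canary needs, the burn rate is the observed error rate divided by the error budget (1 minus the SLO target). The function names and the abort threshold passed in below are illustrative assumptions, not values taken from this skill's observability companion.

```python
def burn_rate(errors: int, requests: int, slo_target: float) -> float:
    """How many times faster than 'sustainable' the error budget is burning.

    A burn rate of 1.0 spends exactly the whole budget over the SLO window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be between 0 and 1, e.g. 0.999")
    if requests == 0:
        return 0.0  # no traffic means no signal, not a failure
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def canary_should_abort(
    errors: int, requests: int, slo_target: float, max_burn_rate: float
) -> bool:
    """True when the canary burns budget faster than the chosen threshold."""
    return burn_rate(errors, requests, slo_target) > max_burn_rate


if __name__ == "__main__":
    # 1% errors against a 99.9% SLO burns budget roughly 10x too fast.
    print(round(burn_rate(errors=10, requests=1000, slo_target=0.999), 2))
```

The same 1% error rate that looks small on a dashboard is a roughly tenfold budget burn against a 99.9% target, which is why the check has to be automated rather than eyeballed.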
The cgroups memory trap. Your container uses 180MB but gets OOMKilled at a 256MB limit. Why? The kernel page cache counts toward the cgroup's memory usage. File-heavy operations (reading config files, log writes, temp file processing) consume page cache that is attributed to your container. Set memory limits at 2x observed RSS, not 1.5x, to account for kernel-managed memory.
GitOps does not mean "put everything in git." GitOps (ArgoCD/Flux) reconciles cluster state to a git repo. But secrets, ephemeral jobs, and debug resources should NOT go through GitOps. Reconciliation loops will delete your kubectl run debug pods. Use GitOps for steady-state resources (deployments, services, config maps). Use imperative commands for operational tasks.
```bash
terraform init      # Download providers, initialize backend
terraform plan      # Preview changes (ALWAYS review before apply)
terraform apply     # Execute changes (requires plan approval)
terraform destroy   # Tear down (use -target for surgical removal)
```
IF solo developer, single environment: Local state with .gitignore (temporary only)
IF team of 2+: Remote state with locking. S3 + DynamoDB (AWS) or GCS (GCP; the gcs backend locks state natively).
IF enterprise or multi-team: Terraform Cloud or Spacelift. State is managed, RBAC on workspaces.
NEVER: Commit state to git. State contains secrets in plaintext.
```
modules/
  networking/          # VPC, subnets, security groups, NAT gateways
    main.tf
    variables.tf
    outputs.tf
  compute/             # EC2/GCE instances, ASGs, launch templates
  data/                # RDS, ElastiCache, S3 buckets
  dns/                 # Route53/Cloud DNS records
environments/
  production/          # Calls modules with prod values
    main.tf            # module "networking" { source = "../../modules/networking" }
    terraform.tfvars
  staging/
    main.tf
    terraform.tfvars
```
Rule: modules contain zero hardcoded values. All configuration flows through variables.
```dockerfile
# Stage 1: Build (1.2 GB with all dev dependencies)
FROM node:20-alpine AS builder
WORKDIR /app
# Copy manifests first so the dependency layer is cached across source changes
COPY package*.json ./
RUN npm ci --ignore-scripts
COPY . .
# --omit=dev replaces the deprecated --production flag in current npm
RUN npm run build && npm prune --omit=dev

# Stage 2: Production (89 MB)
FROM node:20-alpine
RUN addgroup -g 1001 app && adduser -u 1001 -G app -s /bin/sh -D app
WORKDIR /app
COPY --from=builder --chown=app:app /app/dist ./dist
COPY --from=builder --chown=app:app /app/node_modules ./node_modules
USER app
EXPOSE 3000
CMD ["node", "dist/server.js"]
```
Key wins: non-root user, no dev dependencies, no source code in production image.
```yaml
resources:
  requests:          # Scheduler uses this for placement
    cpu: 100m        # 0.1 CPU core -- what the pod normally uses
    memory: 128Mi    # What the pod normally uses
  limits:
    cpu: 500m        # Burst ceiling -- throttled beyond this, NOT killed
    memory: 256Mi    # Hard ceiling -- OOMKilled beyond this
```
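OOMKilled debugging: kubectl describe pod <name> shows Last State: Terminated, Reason: OOMKilled. Fix: increase the memory limit OR find the leak. Run kubectl top pod during a load test to find actual usage. Set the limit at 1.5x observed peak as a floor; for file-heavy workloads, use the 2x observed RSS rule from the cgroups memory trap above, because page cache counts toward the limit.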
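This document gives two sizing rules for memory limits: 1.5x observed peak here, and 2x observed RSS in the cgroups memory trap. One conservative way to apply both, offered as an assumption rather than the author's stated method, is to take the larger of the two. The function name is hypothetical.

```python
import math


def suggest_memory_limit_mib(observed_rss_mib: float, observed_peak_mib: float) -> int:
    """Suggest a memory limit in MiB from load-test observations.

    Takes the larger of 2x RSS (headroom for page cache charged to the
    cgroup) and 1.5x peak usage, rounded up to a whole MiB.
    """
    if observed_rss_mib <= 0 or observed_peak_mib <= 0:
        raise ValueError("observations must be positive")
    return math.ceil(max(2.0 * observed_rss_mib, 1.5 * observed_peak_mib))


if __name__ == "__main__":
    # The 180 MB container from the cgroups example, peaking at 200 MiB
    print(suggest_memory_limit_mib(observed_rss_mib=180, observed_peak_mib=200))
```

For the 180 MB container described earlier, this lands well above the 256 MB limit that was causing OOM kills.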
```yaml
readinessProbe:              # "Can this pod serve traffic?"
  httpGet:
    path: /healthz
    port: 3000
  initialDelaySeconds: 5     # Wait for app startup
  periodSeconds: 10
  failureThreshold: 3        # 3 failures = stop sending traffic
livenessProbe:               # "Is this pod alive at all?"
  httpGet:
    path: /livez
    port: 3000
  initialDelaySeconds: 15    # MUST be longer than readiness initial delay
  periodSeconds: 20
  failureThreshold: 5        # 5 failures = restart pod
```
Timing gotcha: If livenessProbe fires before the app finishes starting, K8s kills and restarts the pod in a crash loop. Set initialDelaySeconds for liveness to at least startup time + buffer. For JVM apps, this can be 60-90s. Use startupProbe for slow-starting apps instead.
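When switching to a startupProbe, the time the app is allowed to take before being restarted is roughly failureThreshold x periodSeconds. The helper below is a small sketch for picking failureThreshold from a measured worst-case startup time. The 50% buffer default is an assumption, and the function names are hypothetical.

```python
import math


def startup_failure_threshold(
    worst_case_startup_s: float, period_s: int, buffer_ratio: float = 0.5
) -> int:
    """failureThreshold so that threshold * period covers startup plus buffer."""
    if worst_case_startup_s <= 0 or period_s <= 0:
        raise ValueError("startup time and period must be positive")
    budget_s = worst_case_startup_s * (1.0 + buffer_ratio)
    return math.ceil(budget_s / period_s)


def startup_budget_s(failure_threshold: int, period_s: int) -> int:
    """Seconds the app has to start before the kubelet restarts it."""
    return failure_threshold * period_s


if __name__ == "__main__":
    # A JVM app that needs up to 90 s, probed every 10 s
    threshold = startup_failure_threshold(90, period_s=10)
    print(threshold, startup_budget_s(threshold, period_s=10))
```

Because liveness checks only begin after the startup probe succeeds, the liveness probe can then keep a short delay without risking a crash loop during boot.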
| Strategy | Rollback Speed | Extra Cost | Risk Level | Use When |
|----------|---------------|------------|------------|----------|
| Blue-Green | Instant (DNS/LB switch) | 2x infra during deploy | Low | Database-compatible releases, need instant rollback |
| Canary | Fast (route 0% to canary) | +1 instance minimum | Low | Validating with real traffic, gradual confidence building |
| Rolling | Slow (redeploy old version) | None | Medium | Stateless services, backward-compatible changes |
| Recreate | Slow (full redeploy) | None | High | Stateful apps that cannot run two versions, dev/staging only |
Blue-green database trap: Both blue and green must work with the same database schema. Deploy schema changes BEFORE the app change. Never deploy a breaking schema change and app change simultaneously.
Deploying directly to production without staging. "It works on my machine" becomes "it crashed in production." Every change goes through: dev -> staging -> production. No exceptions. Staging must mirror production config (same env vars, same resource limits, same network topology).
Storing Terraform state on a local filesystem. One developer runs terraform apply, another runs it from their machine with stale state, and infrastructure gets destroyed or duplicated. Remote state with locking is non-negotiable for any team larger than one person.
Running K8s pods without resource requests or limits. The scheduler cannot make informed placement decisions. One pod consumes all node memory and triggers cascading OOM kills across unrelated services. Every pod gets requests AND a memory limit; CPU limits are the one deliberate exception (see "CPU limits are almost always wrong" above).
Manually SSH-ing into servers to install packages, edit configs, or fix issues. No two servers are identical. When the server dies, nobody knows how to rebuild it. If you SSH into a production server to fix something, your next task is codifying that fix in IaC.
Alerting on every metric crossing any threshold. The on-call engineer receives 50 alerts per shift, learns to ignore all of them, and misses the one that matters. Alert on symptoms (error rate, latency), not causes (CPU usage, disk I/O). See references/observability-stack.md for severity design.
One CI/CD pipeline builds, tests, and deploys all services in a monorepo. A typo in Service A's README triggers a 45-minute rebuild of Services B through E. Each service gets its own pipeline with path-based triggers. Only build what changed.
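Path-based triggering boils down to mapping changed files to the services that own them. The sketch below assumes a hypothetical layout with one directory per service plus shared directories that affect everything; the names `service_roots` and `shared_roots` are illustrative, and a real CI system would supply the changed-file list (for example, from a git diff).

```python
from pathlib import PurePosixPath


def services_to_build(
    changed_files: list[str],
    service_roots: dict[str, str],
    shared_roots: tuple[str, ...] = (),
) -> set[str]:
    """Return the names of services whose pipelines should run."""
    affected: set[str] = set()
    for changed in changed_files:
        path = PurePosixPath(changed)
        # A change in shared code rebuilds every service.
        if any(path.is_relative_to(root) for root in shared_roots):
            return set(service_roots)
        for name, root in service_roots.items():
            # Component-wise match, so services/ab does not match services/a
            if path.is_relative_to(root):
                affected.add(name)
    return affected


if __name__ == "__main__":
    roots = {"a": "services/a", "b": "services/b"}
    print(services_to_build(["services/a/README.md"], roots, ("libs",)))
```

With this mapping, the README typo in Service A triggers only Service A's pipeline, and a change outside any service or shared root triggers nothing.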
Running base images that were pulled months ago without updates. Known CVEs accumulate silently. Schedule weekly base image rebuilds. Pin to a floating version tag (node:20-alpine), not a SHA digest, so patch updates flow in on each rebuild. Scan images with Trivy or Grype in CI -- fail the build on CRITICAL/HIGH CVEs.
Hardcoding secrets in docker-compose files, Kubernetes manifests, or environment variable definitions committed to git. Use a secrets manager (Vault, AWS SSM Parameter Store, GCP Secret Manager) and reference secrets by path, never by value.
SLI (Service Level Indicator): A measured metric. Examples:
SLO (Service Level Objective): Internal target. "p99 latency < 500ms over 30-day window."
SLA (Service Level Agreement): Contractual promise with financial penalties.
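To make the SLI/SLO distinction concrete: the SLI is the measured p99, and the SLO is the comparison of that measurement against the 500 ms target. The sketch below uses the nearest-rank percentile method over raw samples, which is an assumption for illustration; production monitoring systems typically estimate percentiles from histograms instead.

```python
import math


def percentile_nearest_rank(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest value with pct% of samples at or below it."""
    if not samples:
        raise ValueError("need at least one sample")
    ordered = sorted(samples)
    rank = max(1, math.ceil(pct * len(ordered) / 100))
    return ordered[rank - 1]


def meets_latency_slo(
    latencies_ms: list[float], threshold_ms: float = 500.0, pct: float = 99.0
) -> bool:
    """SLO check: is the measured percentile (the SLI) under the target?"""
    return percentile_nearest_rank(latencies_ms, pct) < threshold_ms


if __name__ == "__main__":
    window = [100.0] * 98 + [900.0] * 2  # 2% of requests are slow
    print(percentile_nearest_rank(window, 99), meets_latency_slo(window))
```

One slow request in a hundred still passes a p99 target, while two in a hundred fails it, which shows how much the chosen percentile determines what the SLO tolerates.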
| Level | Definition | Response Time | Example |
|-------|-----------|---------------|---------|
| SEV1 | Total service outage, data loss risk | 15 min | Database corruption, full site down |
| SEV2 | Major feature broken, significant user impact | 30 min | Payment processing failed, auth broken |
| SEV3 | Minor feature degraded, workaround exists | 4 hours | Search slow, non-critical API errors |
| SEV4 | Cosmetic or minor, no user impact | Next business day | Dashboard graph rendering issue |
DETECTED: [timestamp] - How was it detected? (alert/user report/monitoring)
ASSESSED: [timestamp] - Severity level assigned, responders identified
MITIGATED: [timestamp] - Bleeding stopped (rollback/feature flag/scaling)
RESOLVED: [timestamp] - Root cause fixed and deployed
REVIEWED: [date] - Postmortem completed and action items assigned
IF pods are crashing:
kubectl top pod under load reveals actual usage. Check for cgroups page cache contribution (see Counterintuitive Truths).
kubectl logs <pod> --previous for app-level crash. Common: liveness probe fires before startup completes. Use startupProbe for JVM/Python apps with 30s+ startup.
imagePullSecrets.
kubectl describe pod events. Usually: insufficient CPU/memory (node full), no nodes match nodeSelector/affinity, or PVC cannot bind (wrong storage class or AZ mismatch).
IF deploys cause 502 errors:
preStop: sleep 10 so the load balancer drains before the app shuts down. See container-orchestration companion for the full shutdown sequence.
maxSurge: 25% and maxUnavailable: 0 for zero-downtime.
IF Terraform plan shows unexpected destroys:
~ (update in-place) is safe. -/+ (destroy and recreate) is dangerous. Check why: provider version change altering resource schema, or create_before_destroy missing on resources that need it.
terraform refresh then terraform plan to see what changed.
Never terraform apply a plan with unexpected destroys without understanding each one. Use -target for surgical application.
IF CI/CD pipeline takes too long:
COPY package*.json before COPY . .
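For the 502 branch above, the effect of maxSurge: 25% and maxUnavailable: 0 depends on the replica count, because Kubernetes rounds a maxSurge percentage up and a maxUnavailable percentage down. The sketch below works through that arithmetic; the function name and returned keys are illustrative.

```python
import math


def rolling_update_bounds(
    replicas: int, max_surge_pct: int, max_unavailable_pct: int
) -> dict[str, int]:
    """Pod-count bounds during a rolling update with percentage settings."""
    surge = math.ceil(replicas * max_surge_pct / 100)               # rounds up
    unavailable = math.floor(replicas * max_unavailable_pct / 100)  # rounds down
    return {
        "max_pods": replicas + surge,
        "min_available": replicas - unavailable,
    }


if __name__ == "__main__":
    # 4 replicas: one extra pod at a time, never fewer than 4 serving
    print(rolling_update_bounds(4, max_surge_pct=25, max_unavailable_pct=0))
```

With maxUnavailable at 0, capacity never drops below the desired replica count, so the node pool needs room for the surge pods or the rollout will stall.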