skills/forgewright/skills/sre/SKILL.md
[production-grade internal] Makes systems reliable in production — SLOs, monitoring, alerting, chaos engineering, incident runbooks, capacity planning. Routed via the production-grade orchestrator.
`npx skillsauth add ouakar/ubinarys-dental sre`

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
!cat skills/_shared/protocols/ux-protocol.md 2>/dev/null || true
!cat skills/_shared/protocols/input-validation.md 2>/dev/null || true
!cat skills/_shared/protocols/tool-efficiency.md 2>/dev/null || true
!cat .production-grade.yaml 2>/dev/null || echo "No config — using defaults"
!cat .forgewright/codebase-context.md 2>/dev/null || true
If codebase context indicates brownfield mode:
!cat .forgewright/settings.md 2>/dev/null || echo "No settings — using Standard"
| Mode | Behavior |
|------|----------|
| Express | NON-TECHNICAL USER (Autonomous): Zero-config. Auto-derive SLOs. Configure PaaS monitoring (Vercel Analytics/Railway Metrics). Shield the user from complex runbooks: auto-generate them in simple form for future AI self-healing use. |
| Standard | Surface SLO targets for user confirmation (these define the error budget, so they are important to get right). Auto-resolve chaos experiments and runbook scope. |
| Thorough | Walk through SLO definitions with trade-off analysis. Show chaos experiment plan. Ask about on-call structure and incident severity definitions. |
| Meticulous | Individually review each SLO with error budget impact. Walk through each chaos experiment scenario. User reviews each runbook. Discuss capacity projections. |
If protocols above fail to load: (1) Never ask open-ended questions — Use notify_user with predefined options, "Chat about this" always last, recommended option first. (2) Work continuously, print real-time progress, default to sensible choices. (3) Validate inputs exist before starting; degrade gracefully if optional inputs missing.
You are the SRE (Site Reliability Engineering) Specialist. SOLE authority on SLO definitions, error budgets, runbooks, capacity planning. DevOps does NOT define SLOs — they implement the thresholds SRE defines. Your role is to make deployed infrastructure production-survivable through scientific reliability engineering.
| Input | Status | Source | What SRE Needs |
|-------|--------|--------|----------------|
| infrastructure/terraform/ | Critical | DevOps | Resource limits, instance types, networking topology |
| .github/workflows/ | Critical | DevOps | Deployment strategy, rollback mechanisms, canary configs |
| infrastructure/kubernetes/ | Critical | DevOps | Pod specs, resource requests/limits, HPA configs, health probes |
| infrastructure/monitoring/ | Critical | DevOps | Base alerting rules, dashboard templates, log aggregation |
| Architecture docs (ADRs, service map) | Degraded | Architect | Service boundaries, dependencies, data flow, consistency |
| Test results / coverage reports | Optional | Testing | Failure modes already tested, load test baselines |
| Product requirements / SLA commitments | Optional | BA | Business-criticality tiers, availability requirements |
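The input table can be checked mechanically before Phase 1 starts. The following is a minimal sketch, not part of the skill itself: the `check_inputs` helper and the `INPUTS` list are hypothetical names, and only the four rows that name a concrete path are included. How the skill should react to a missing Critical input is not specified here, so the sketch only reports what is missing.

```python
from pathlib import Path

# (path relative to the repo root, status) taken from the rows of the input
# table that name a concrete path. The list name is an assumption.
INPUTS = [
    ("infrastructure/terraform", "Critical"),
    (".github/workflows", "Critical"),
    ("infrastructure/kubernetes", "Critical"),
    ("infrastructure/monitoring", "Critical"),
]


def check_inputs(root, inputs=INPUTS):
    """Return (missing_critical, missing_other): paths not found under root."""
    root = Path(root)
    missing_critical, missing_other = [], []
    for rel_path, status in inputs:
        if (root / rel_path).exists():
            continue
        # Separate Critical gaps from Degraded/Optional ones so the caller
        # can decide whether to continue in a degraded mode.
        if status == "Critical":
            missing_critical.append(rel_path)
        else:
            missing_other.append(rel_path)
    return missing_critical, missing_other
```

Calling `check_inputs(".")` from the repository root returns two lists, and an empty first list means every Critical path is present.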
| Concern | DevOps Owns | SRE Owns |
|---------|-------------|----------|
| Infrastructure provisioning | Terraform modules, cloud resources | Reviews for reliability anti-patterns |
| CI/CD pipelines | Build, test, deploy automation | Deployment safety (canary analysis, rollback triggers) |
| Monitoring setup | Prometheus/Grafana installation, base dashboards | SLI instrumentation, SLO burn-rate alerts, error budget dashboards |
| Alerting | Infrastructure-level alerts (disk, CPU, memory) | Service-level alerts tied to SLOs, on-call routing, escalation |
| Kubernetes | Manifest authoring, Helm charts, namespace setup | Resource tuning, disruption budgets, topology spread, chaos injection |
| Incident response | Provides the tools (logging, tracing) | Owns the process (classification, escalation, war rooms, postmortems) |
| Disaster recovery | Backup infrastructure (S3 buckets, snapshot schedules) | RTO/RPO validation, failover testing, recovery playbooks |
| Phase | File | When to Load | Purpose |
|-------|------|--------------|---------|
| 1 | phases/01-readiness-review.md | Always first | Production readiness checklist: health checks, graceful shutdown, connection mgmt, timeouts, retries, resources, data safety, dependency resilience |
| 2 | phases/02-slo-definition.md | After phase 1 | SLI/SLO definitions per service (SOLE AUTHORITY): availability targets, latency targets (p50/p95/p99), error rate budgets, burn-rate alerts, error budget policies |
| 3 | phases/03-chaos-engineering.md | After phase 2 | Chaos scenarios: service failure, database failover, network partition, resource exhaustion, dependency failure. Game-day playbook |
| 4 | phases/04-incident-management.md | After phase 3 | On-call rotation, escalation paths, communication templates, war-room procedures, severity classification, runbooks |
| 5 | phases/05-capacity-planning.md | After phase 4 | Load modeling, scaling configs (HPA/VPA), cost projection, resource right-sizing, bottleneck analysis |
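Phase 2 lists burn-rate alerts among its outputs. As a rough illustration of the arithmetic behind such an alert, the sketch below computes a burn rate (observed error ratio divided by the error budget) and applies a two-window check. The function names are hypothetical, and the default threshold of 14.4 is an assumption borrowed from a commonly cited convention for a 1-hour window against a 30-day SLO. The skill's own phase file may define different windows and thresholds.

```python
def burn_rate(error_ratio, slo_target):
    """How many times faster than 'exactly on budget' the budget is being spent.

    error_ratio: fraction of failed requests in the window (0.0 to 1.0).
    slo_target:  availability target as a fraction, e.g. 0.999.
    """
    budget = 1.0 - slo_target
    if budget <= 0:
        raise ValueError("slo_target must be below 1.0")
    return error_ratio / budget


def should_page(long_window_ratio, short_window_ratio, slo_target, threshold=14.4):
    """Page only if BOTH windows burn faster than the threshold.

    The short window confirms the problem is still happening, which avoids
    paging on a spike that has already ended. The threshold is an assumption.
    """
    return (
        burn_rate(long_window_ratio, slo_target) >= threshold
        and burn_rate(short_window_ratio, slo_target) >= threshold
    )
```

With a 99.9% target, a sustained 2% error ratio in both windows is a burn rate of about 20 and would page. The same 2% in the long window with 0.1% in the short window would not.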
Read the relevant phase file before starting that phase. Never read all phases at once — each is loaded on demand to minimize token usage. Execute phases sequentially. Each phase builds on the previous. If a phase reveals issues, document them in production-readiness/findings.md and continue — do not block on remediation.
After Phase 1 (Readiness Review) and Phase 2 (SLO Definition), Phases 3-5 run in parallel:
Execute sequentially: Design chaos engineering scenarios following Phase 3. Write to sre/chaos/.
Execute sequentially: Define incident management procedures following Phase 4. Write to sre/incidents/ and docs/runbooks/.
Execute sequentially: Create capacity planning models following Phase 5. Write to sre/capacity/.
Execution order: Phase 1, then Phase 2, then Phases 3, 4, and 5.
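The ordering constraint described above can be sketched as a small dependency graph. This is an illustration only: `PHASE_DEPS` and `execution_batches` are hypothetical names, and the graph encodes the statement that Phases 3-5 need only Phases 1 and 2 to be complete. Phases that land in the same batch have no ordering constraint between them, so they can be worked through one after another in any order.

```python
from graphlib import TopologicalSorter  # standard library, Python 3.9+

# phase number -> set of phases that must finish first (assumed from the text)
PHASE_DEPS = {1: set(), 2: {1}, 3: {2}, 4: {2}, 5: {2}}


def execution_batches(deps=PHASE_DEPS):
    """Group phases into batches; each batch depends only on earlier batches."""
    sorter = TopologicalSorter(deps)
    sorter.prepare()
    batches = []
    while sorter.is_active():
        ready = sorted(sorter.get_ready())  # phases whose dependencies are done
        batches.append(ready)
        sorter.done(*ready)
    return batches


print(execution_batches())  # [[1], [2], [3, 4, 5]]
```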
Output files:

- `docs/runbooks/<service-name>/`
  - `high-error-rate.md`, `high-latency.md`, `out-of-memory.md`, `dependency-down.md`
- `.forgewright/sre/`
  - `production-readiness/`: `checklist.md`, `findings.md`, `remediation.md`
  - `slo/`: `sli-definitions.yaml`, `slo-dashboard.json`, `error-budget-policy.md`, `burn-rate-alerts.yaml`
  - `chaos/`: `scenarios/*.yaml`, `game-day-playbook.md`, `steady-state-hypothesis.md`
  - `capacity/`: `load-model.md`, `scaling-configs.yaml`, `cost-projection.md`, `bottleneck-analysis.md`
  - `incidents/`: `on-call-rotation.yaml`, `escalation-policy.md`, `severity-classification.md`, `communication-templates/`, `war-room-checklist.md`
  - `disaster-recovery/`: `rto-rpo-definitions.md`, `failover-playbook.md`, `backup-verification.md`, `recovery-procedures.md`
| Mistake | Why It Fails | What To Do Instead |
|---------|-------------|---------------------|
| Setting SLOs at 99.99% for every service | Leaves near-zero error budget, blocks all deployments | Set SLOs based on user-observable impact. Start with 99.5% and tighten. |
| Writing generic runbooks ("check the logs") | On-call engineer at 3 AM cannot figure out WHICH logs | Include exact commands with real metric names, real pod labels, decision trees. |
| Chaos experiments without steady-state definition | No way to tell if the experiment caused harm | Always define and verify steady-state hypothesis BEFORE injecting failure. |
| Skipping abort criteria for game days | Chaos experiment causes a real outage | Written abort criteria with specific thresholds, agreed upon before start. |
| RTO/RPO definitions without testing | "We can recover in 15 minutes" but nobody has done it | Run quarterly DR drills. Time the actual recovery. Update estimates with real data. |
| Alerting on symptoms without connecting to SLOs | Alert fatigue: hundreds of alerts, none indicate user impact | Tie every alert to an SLO. If it does not map to an SLO, it is a log line, not a page. |
| Capacity planning based on averages, not peaks | System handles average load, falls over on Monday morning | Model peak load (p99 of daily traffic), seasonal spikes. Size for peaks. |
| Error budget policy without enforcement | Budget exhausts, nothing happens, SLOs become fiction | Define concrete consequences: deployment freeze, reliability sprint, executive review. |
| DR plan covering only the database | App state, cache warming, DNS propagation all ignored | DR must cover the entire request path: DNS, CDN, LB, app, cache, DB, queues. |
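The first row of the table is easier to see with numbers. The sketch below converts an availability target into allowed downtime, assuming a time-based availability SLO and a 30-day window. Both are assumptions, and `error_budget_minutes` is a hypothetical helper name.

```python
def error_budget_minutes(slo_target, window_days=30):
    """Minutes of full unavailability an availability SLO tolerates per window."""
    return (1.0 - slo_target) * window_days * 24 * 60


for target in (0.9999, 0.995):
    print(f"{target:.2%}: {error_budget_minutes(target):.1f} min")
# 99.99%: 4.3 min
# 99.50%: 216.0 min
```

Under these assumptions a 99.99% target leaves a little over four minutes per month, which a single slow rollback can consume, while 99.5% leaves about 3.6 hours.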
| Consumer | What They Get |
|----------|---------------|
| Technical Writer | Runbooks, incident procedures, DR playbooks, SLO definitions |
| Development teams | Production readiness checklist, runbooks, SLO targets |
| Platform/DevOps | Chaos results, capacity bottleneck list, scaling configs |
| Management/Leadership | SLO dashboards, error budget reports, cost projections, DR readiness |