skills/monitoring-stack-deployer/SKILL.md
Production monitoring stack deployer with Prometheus, Grafana, and SLO-based alerting. Activate on: monitoring setup, Prometheus configuration, Grafana dashboards, alerting rules, SLO definition, metrics pipeline, observability stack. NOT for: application logging (use log-aggregation-architect), distributed tracing (use logging-observability), incident response (use site-reliability-engineer).
npx skillsauth add curiositech/windags-skills monitoring-stack-deployer
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Expert in deploying and configuring production monitoring with Prometheus, Grafana, and SLO-driven alerting.
Activate on: "monitoring setup", "Prometheus config", "Grafana dashboard", "alerting rules", "SLO dashboard", "metrics pipeline", "observability stack", "kube-prometheus-stack", "ServiceMonitor"
NOT for: Application logging → log-aggregation-architect | Distributed tracing → logging-observability | Incident response → site-reliability-engineer
| Domain | Technologies |
|--------|-------------|
| Metrics | Prometheus 3.x, Mimir, Thanos, VictoriaMetrics |
| Visualization | Grafana 11, Perses (open-source Grafana alternative) |
| Alerting | Alertmanager, PagerDuty, OpsGenie, Slack integration |
| SLOs | Sloth, Pyrra, Google SRE workbook burn-rate model |
| K8s Native | kube-prometheus-stack, ServiceMonitor, PodMonitor, PrometheusRule |
Traditional (BAD): "Alert if error rate > 1% for 5 minutes"
Problem: Too many false positives, alert fatigue
SLO-Based (GOOD): "Alert if burning SLO budget too fast"
SLO: 99.9% availability over 30 days → 43.2 min error budget
Multi-window burn rate:
┌──────────┬───────────┬─────────────┬──────────────┐
│ Severity │ Burn Rate │ Long Window │ Short Window │
├──────────┼───────────┼─────────────┼──────────────┤
│ Critical │ 14.4x     │ 1 hour      │ 5 min        │
│ Warning  │ 6x        │ 6 hours     │ 30 min       │
│ Ticket   │ 1x        │ 3 days      │ 6 hours      │
└──────────┴───────────┴─────────────┴──────────────┘
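The 43.2-minute budget and the burn-rate tiers in the table can be checked with a little arithmetic. The sketch below is illustrative only: the function names `error_budget_minutes` and `budget_consumed` are made up for this example, and it assumes the 30-day SLO window stated above.

```python
# Illustrative arithmetic only; function names are hypothetical, not part of any tool.

def error_budget_minutes(slo_target: float, window_days: int) -> float:
    """Minutes of allowed unavailability for an SLO over the given window."""
    return window_days * 24 * 60 * (1 - slo_target)


def budget_consumed(burn_rate: float, alert_window_hours: float, window_days: int = 30) -> float:
    """Fraction of the total error budget spent if burn_rate holds for the alert window."""
    return burn_rate * alert_window_hours / (window_days * 24)


# (severity, burn rate, long window in hours), taken from the table above
TIERS = [
    ("critical", 14.4, 1),
    ("warning", 6.0, 6),
    ("ticket", 1.0, 72),
]

if __name__ == "__main__":
    print(f"Error budget: {error_budget_minutes(0.999, 30):.1f} min")
    for severity, rate, hours in TIERS:
        share = budget_consumed(rate, hours)
        print(f"{severity:<8} {rate:>5}x over {hours:>2}h consumes {share:.0%} of the budget")
```

Read this way, the critical tier fires when about 2% of the 30-day budget would be gone within one hour, the warning tier at 5% within six hours, and the ticket tier at 10% within three days.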
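# PrometheusRule for SLO burn rate
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slo-rules
spec:
  groups:
    - name: api-slo-burn-rate
      rules:
        # Burn rate = observed error ratio / error budget (1 - 0.999).
        # The outer parentheses matter: "/" binds tighter than "-" in PromQL.
        - record: slo:api_availability:burn_rate_1h
          expr: |
            (
              1 - (
                sum(rate(http_requests_total{code!~"5.."}[1h]))
                /
                sum(rate(http_requests_total[1h]))
              )
            )
            / (1 - 0.999)
        # Short-window series referenced by the alert below
        - record: slo:api_availability:burn_rate_5m
          expr: |
            (
              1 - (
                sum(rate(http_requests_total{code!~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
              )
            )
            / (1 - 0.999)
        # Page only when both the long and the short window exceed the threshold
        - alert: APIAvailabilityBurnRateCritical
          expr: |
            slo:api_availability:burn_rate_1h > 14.4
            and
            slo:api_availability:burn_rate_5m > 14.4
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "API burning error budget 14.4x faster than allowed"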
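The logic behind the rule above is that burn rate is the observed error ratio divided by the error budget, and the alert fires only when both the long and the short window exceed the threshold. A minimal Python sketch of that condition follows; `burn_rate` and `should_page` are hypothetical helper names, and the request counts are made-up inputs rather than real measurements.

```python
# Illustrative model of a multi-window burn-rate check; not a Prometheus API.
from typing import Tuple

Window = Tuple[int, int]  # (good_requests, total_requests) observed in one window


def burn_rate(good_requests: int, total_requests: int, slo_target: float = 0.999) -> float:
    """Observed error ratio divided by the error budget (1 - slo_target)."""
    if total_requests == 0:
        return 0.0  # no traffic means no budget is being burned
    error_ratio = 1 - good_requests / total_requests
    return error_ratio / (1 - slo_target)


def should_page(long_window: Window, short_window: Window, threshold: float = 14.4) -> bool:
    """Fire only if BOTH windows burn faster than the threshold."""
    return burn_rate(*long_window) > threshold and burn_rate(*short_window) > threshold


if __name__ == "__main__":
    # 2% errors over the last hour, but only 1% over the last 5 minutes:
    # the long window is hot while the short window has dropped below the threshold.
    print(should_page((98_000, 100_000), (990, 1_000)))
    # 2% errors in both windows: the problem is sustained and ongoing.
    print(should_page((98_000, 100_000), (980, 1_000)))
```

Requiring the short window as well keeps the alert from firing long after an incident has ended, since the long-window average stays elevated for a while once errors stop.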
┌─────────────────────────────────────────────────────────┐
│ Service: api-gateway                         SLO: 99.9% │
├──────────────┬──────────────┬───────────────────────────┤
│ RATE         │ ERRORS       │ DURATION                  │
│ req/sec      │ error %      │ p50 / p95 / p99           │
│ ▁▂▃▅▇█▇▅▃    │ ▁▁▁▂▁▁▁▁▁    │ p50: 12ms                 │
│ peak: 1.2k   │ curr: 0.02%  │ p95: 89ms  p99: 240ms     │
├──────────────┴──────────────┴───────────────────────────┤
│ Error Budget: 38.2 min remaining (88% of 43.2 min)      │
│ ████████████████████████████████░░░░                    │
└─────────────────────────────────────────────────────────┘
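The error-budget row of the mockup comes down to one subtraction and one ratio. The sketch below shows one way to derive it; the function names are hypothetical, and the 5.0 minutes of consumed budget is an assumption inferred from the figures shown (43.2 minus 38.2), not a value stated in the panel.

```python
# Illustrative only: derives the "remaining budget" line and a text progress bar.
from typing import Tuple


def remaining_budget(budget_minutes: float, consumed_minutes: float) -> Tuple[float, float]:
    """Return (minutes remaining, fraction of budget remaining), floored at zero."""
    remaining = max(budget_minutes - consumed_minutes, 0.0)
    return remaining, remaining / budget_minutes


def budget_bar(fraction: float, width: int = 36) -> str:
    """Render the remaining fraction as filled and empty block characters."""
    filled = round(fraction * width)
    return "█" * filled + "░" * (width - filled)


if __name__ == "__main__":
    minutes, fraction = remaining_budget(43.2, 5.0)  # 5.0 min consumed is an assumed input
    print(f"Error Budget: {minutes:.1f} min remaining ({fraction:.0%} of 43.2 min)")
    print(budget_bar(fraction))
```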
- [ ] kube-prometheus-stack or equivalent deployed and healthy
- [ ] ServiceMonitors auto-discover all application metrics endpoints
- [ ] SLOs defined for every user-facing service
- [ ] Burn-rate alerts configured (critical, warning, ticket)
- [ ] Recording rules pre-compute expensive queries
- [ ] Grafana dashboards use RED method for services, USE for infrastructure
- [ ] Alertmanager routes to correct channels (PagerDuty/Slack/OpsGenie)
- [ ] Alert grouping and inhibition rules prevent notification storms
- [ ] Every alert has a linked runbook
- [ ] Metrics retention configured (15d local, long-term in Mimir/Thanos)
- [ ] Dashboards provisioned as code (JSON/YAML in Git)
- [ ] Error budget dashboard visible to engineering and product