curiositech/monitoring-stack-deployer

Name: monitoring-stack-deployer
Author: curiositech

skills/monitoring-stack-deployer/SKILL.md

npx skillsauth add curiositech/windags-skills monitoring-stack-deployer

Monitoring Stack Deployer

Expert in deploying and configuring production monitoring with Prometheus, Grafana, and SLO-driven alerting.

Activation Triggers

Activate on: "monitoring setup", "Prometheus config", "Grafana dashboard", "alerting rules", "SLO dashboard", "metrics pipeline", "observability stack", "kube-prometheus-stack", "ServiceMonitor"

NOT for: Application logging → log-aggregation-architect | Distributed tracing → logging-observability | Incident response → site-reliability-engineer

Quick Start

1. Deploy kube-prometheus-stack: Prometheus, Grafana, Alertmanager, and node-exporter in one Helm chart
2. Define SLOs: availability and latency targets per service
3. Create ServiceMonitors: auto-discover application metrics endpoints
4. Build dashboards: USE method (utilization, saturation, errors) for infrastructure; RED method (rate, errors, duration) for services
5. Configure alerting: SLO burn-rate alerts, not threshold alerts
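
As a rough illustration of step 1, a Helm values override for the chart might look like the sketch below. The file name and the choice of overrides are assumptions for illustration. The keys follow the published values of the prometheus-community/kube-prometheus-stack chart, but they should be checked against the chart version actually in use.

```yaml
# values.yaml: hypothetical overrides for the prometheus-community/kube-prometheus-stack chart.
# Verify key names against the chart version you deploy.
prometheus:
  prometheusSpec:
    retention: 15d   # local retention; long-term storage is handled elsewhere
    # Pick up ServiceMonitors and PrometheusRules that are not labelled with this Helm release.
    serviceMonitorSelectorNilUsesHelmValues: false
    ruleSelectorNilUsesHelmValues: false
grafana:
  enabled: true
alertmanager:
  enabled: true
nodeExporter:
  enabled: true
```

Assuming the chart repository has been added with `helm repo add prometheus-community https://prometheus-community.github.io/helm-charts`, a file like this could be applied with `helm install monitoring prometheus-community/kube-prometheus-stack -n monitoring --create-namespace -f values.yaml`, where the release name `monitoring` is a placeholder.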

Core Capabilities

| Domain | Technologies |
|--------|-------------|
| Metrics | Prometheus 3.x, Mimir, Thanos, VictoriaMetrics |
| Visualization | Grafana 11, Perses (open-source Grafana alternative) |
| Alerting | Alertmanager, PagerDuty, OpsGenie, Slack integration |
| SLOs | Sloth, Pyrra, Google SRE workbook burn-rate model |
| K8s Native | kube-prometheus-stack, ServiceMonitor, PodMonitor, PrometheusRule |

Architecture Patterns

SLO-Based Burn-Rate Alerting

Traditional (BAD):  "Alert if error rate > 1% for 5 minutes"
  Problem: Too many false positives, alert fatigue

SLO-Based (GOOD):  "Alert if burning SLO budget too fast"
  SLO: 99.9% availability over 30 days → 43.2 min error budget

  Multi-window burn rate:
  ┌─────────────────────────────────────────────┐
  │ Severity │ Burn Rate │ Long Window │ Short  │
  │ Critical │ 14.4x     │ 1 hour      │ 5 min  │
  │ Warning  │ 6x        │ 6 hours     │ 30 min │
  │ Ticket   │ 1x        │ 3 days      │ 6 hrs  │
  └─────────────────────────────────────────────┘
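
The numbers in the table can be checked by hand. The worked arithmetic below assumes a 30-day (720-hour) SLO period and reads each burn rate as the multiple of the allowed error rate sustained over the long window.

```latex
\begin{align*}
\text{error budget} &= (1 - 0.999) \times 30 \times 24 \times 60\ \text{min} = 43.2\ \text{min} \\
\text{budget consumed} &= \text{burn rate} \times \frac{\text{long window}}{\text{SLO period}} \\
\text{Critical:}\quad & 14.4 \times \frac{1\ \text{h}}{720\ \text{h}} = 2\% \\
\text{Warning:}\quad & 6 \times \frac{6\ \text{h}}{720\ \text{h}} = 5\% \\
\text{Ticket:}\quad & 1 \times \frac{72\ \text{h}}{720\ \text{h}} = 10\%
\end{align*}
```

Read this way, a critical page fires once roughly 2% of the monthly budget has burned within an hour, while the ticket tier only fires after about 10% has burned over three days.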

Prometheus Recording Rules for SLOs

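# PrometheusRule for SLO burn rate (99.9% availability target)
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-slo-rules
spec:
  groups:
    - name: api-slo-burn-rate
      rules:
        # Burn rate = observed error ratio / allowed error ratio (1 - SLO).
        # The outer parentheses matter: without them PromQL divides before subtracting.
        - record: slo:api_availability:burn_rate_1h
          expr: |
            (
              1 - (
                sum(rate(http_requests_total{code!~"5.."}[1h]))
                /
                sum(rate(http_requests_total[1h]))
              )
            )
            / (1 - 0.999)
        # Short-window counterpart referenced by the alert below.
        - record: slo:api_availability:burn_rate_5m
          expr: |
            (
              1 - (
                sum(rate(http_requests_total{code!~"5.."}[5m]))
                /
                sum(rate(http_requests_total[5m]))
              )
            )
            / (1 - 0.999)
        # Fire only when both the long and the short window exceed the burn rate.
        - alert: APIAvailabilityBurnRateCritical
          expr: |
            slo:api_availability:burn_rate_1h > 14.4
            and
            slo:api_availability:burn_rate_5m > 14.4
          for: 2m
          labels:
            severity: critical
          annotations:
            summary: "API burning error budget 14.4x faster than allowed"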
RED Method Dashboard Layout

┌─────────────────────────────────────────────────────────┐
│ Service: api-gateway                     SLO: 99.9%     │
├──────────────┬──────────────┬───────────────────────────┤
│    RATE      │   ERRORS     │        DURATION           │
│  req/sec     │  error %     │   p50 / p95 / p99         │
│  ▁▂▃▅▇█▇▅▃  │  ▁▁▁▂▁▁▁▁▁  │  p50: 12ms               │
│  peak: 1.2k  │  curr: 0.02% │  p95: 89ms  p99: 240ms  │
├──────────────┴──────────────┴───────────────────────────┤
│ Error Budget: 38.2 min remaining (88% of 43.2 min)      │
│ ████████████████████████████████░░░░                     │
└─────────────────────────────────────────────────────────┘
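
The three RED panels could be backed by recording rules along these lines. This is a minimal sketch: the `job="api-gateway"` selector, the recorded series names, and the `http_request_duration_seconds_bucket` histogram are assumptions, since the original only names `http_requests_total`.

```yaml
# Hypothetical recording rules feeding the RATE, ERRORS and DURATION panels.
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: api-gateway-red-rules        # assumed name
spec:
  groups:
    - name: api-gateway-red
      rules:
        # RATE: requests per second over the last 5 minutes.
        - record: job:http_requests:rate5m
          expr: sum(rate(http_requests_total{job="api-gateway"}[5m]))
        # ERRORS: share of requests answered with a 5xx status.
        - record: job:http_request_errors:ratio_rate5m
          expr: |
            sum(rate(http_requests_total{job="api-gateway",code=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="api-gateway"}[5m]))
        # DURATION: p95 latency, assuming the service exports a latency histogram.
        - record: job:http_request_duration_seconds:p95_5m
          expr: |
            histogram_quantile(
              0.95,
              sum by (le) (rate(http_request_duration_seconds_bucket{job="api-gateway"}[5m]))
            )
```

Pointing the dashboard panels at the recorded series instead of the raw expressions keeps dashboard load cheap, in line with the recording-rule advice in this skill.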

Anti-Patterns

Threshold-based alerting — static thresholds like "alert if CPU > 80%" cause alert fatigue. Use SLO burn rates that correlate with user impact.
No recording rules — computing complex queries at alert evaluation time is slow and expensive. Pre-compute with recording rules.
Dashboard sprawl — hundreds of dashboards nobody checks. Build one service dashboard template, parameterize with variables.
Missing service discovery — manually listing scrape targets. Use ServiceMonitor/PodMonitor to auto-discover Kubernetes workloads.
Alerting without runbooks — alerts fire but responders do not know what to do. Every alert must link to a runbook with diagnosis steps.
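
For the service-discovery point above, a ServiceMonitor might look like the following sketch. The names, namespaces, labels, and port name are hypothetical and need to match the Service being scraped.

```yaml
# Hypothetical ServiceMonitor: scrapes every Service labelled app=api-gateway.
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: api-gateway
  namespace: monitoring          # assumed namespace where the monitoring stack runs
spec:
  namespaceSelector:
    matchNames:
      - default                  # assumed namespace of the application
  selector:
    matchLabels:
      app: api-gateway           # must match labels on the Service, not the Pod
  endpoints:
    - port: http-metrics         # name of the Service port, not its number
      path: /metrics
      interval: 30s
```

New replicas behind the Service are then scraped automatically, with no edit to the Prometheus configuration.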

Quality Checklist

[ ] kube-prometheus-stack or equivalent deployed and healthy
[ ] ServiceMonitors auto-discover all application metrics endpoints
[ ] SLOs defined for every user-facing service
[ ] Burn-rate alerts configured (critical, warning, ticket)
[ ] Recording rules pre-compute expensive queries
[ ] Grafana dashboards use RED method for services, USE for infrastructure
[ ] Alertmanager routes to correct channels (PagerDuty/Slack/OpsGenie)
[ ] Alert grouping and inhibition rules prevent notification storms
[ ] Every alert has a linked runbook
[ ] Metrics retention configured (15d local, long-term in Mimir/Thanos)
[ ] Dashboards provisioned as code (JSON/YAML in Git)
[ ] Error budget dashboard visible to engineering and product
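
The routing, grouping, and inhibition items in the checklist could be covered by an Alertmanager configuration roughly like this one. It is a sketch under assumptions: the receiver names, channel, timing values, and the `service` label are illustrative, and the credentials are placeholders.

```yaml
# Hypothetical alertmanager.yml fragment: severity-based routing with grouping and inhibition.
route:
  receiver: slack-default
  group_by: [alertname, service]   # assumes alerts carry a "service" label
  group_wait: 30s
  group_interval: 5m
  repeat_interval: 4h
  routes:
    - matchers:
        - severity="critical"
      receiver: pagerduty-oncall
receivers:
  - name: slack-default
    slack_configs:
      - api_url: https://hooks.slack.com/services/REPLACE/ME   # placeholder webhook
        channel: "#alerts"
  - name: pagerduty-oncall
    pagerduty_configs:
      - routing_key: REPLACE_ME                                # placeholder key
inhibit_rules:
  # While a critical alert fires for a service, mute its warning-level alerts.
  - source_matchers:
      - severity="critical"
    target_matchers:
      - severity="warning"
    equal: [service]
```

Matching on `service` rather than `alertname` in the inhibition rule is deliberate here, because critical and warning burn-rate alerts usually have different alert names.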

Production monitoring stack deployer with Prometheus, Grafana, and SLO-based alerting. Activate on: monitoring setup, Prometheus configuration, Grafana dashboards, alerting rules, SLO definition, metrics pipeline, observability stack. NOT for: application logging (use log-aggregation-architect), distributed tracing (use logging-observability), incident response (use site-reliability-engineer).

testing

Updated Apr 4, 2026

Install

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add curiositech/windags-skills monitoring-stack-deployer

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 4, 2026, 2:15 PM (4.3s, 1 file scanned)

SKILL.md

license: Apache-2.0
name: monitoring-stack-deployer
description: "Production monitoring stack deployer with Prometheus, Grafana, and SLO-based alerting. Activate on: monitoring setup, Prometheus configuration, Grafana dashboards, alerting rules, SLO definition, metrics pipeline, observability stack. NOT for: application logging (use log-aggregation-architect), distributed tracing (use logging-observability), incident response (use site-reliability-engineer)."
allowed-tools: Read,Write,Edit,Bash(docker:*,kubectl:*,terraform:*,npm:*,npx:*)
category: DevOps & Infrastructure
- skill: log-aggregation-architect
  reason: Logging and metrics pipelines often share infrastructure

Install manually

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global)
cp -r windags-skills/skills/monitoring-stack-deployer ~/.claude/skills/

Repository: curiositech/windags-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT