skills/monitoring-ops/SKILL.md
Observability patterns - metrics, logging, tracing, alerting, and infrastructure monitoring. Use for: monitoring, observability, prometheus, grafana, metrics, alerting, structured logging, distributed tracing, opentelemetry, SLO, SLI, dashboard, health check, loki, jaeger, datadog, pagerduty.
Comprehensive observability patterns covering the three pillars (metrics, logging, tracing), alerting strategies, dashboard design, and infrastructure monitoring for production systems.
Use this table to decide which observability signal fits your need:
| Pillar | Best For | Tools | Data Type |
|--------|----------|-------|-----------|
| Metrics | Aggregated numeric measurements, trends, alerting on thresholds | Prometheus, Datadog, CloudWatch, StatsD | Time-series (numeric) |
| Logs | Discrete events, error details, audit trails, debugging context | Loki, ELK, CloudWatch Logs, Fluentd | Unstructured/structured text |
| Traces | Request flow across services, latency breakdown, dependency mapping | Jaeger, Tempo, Zipkin, Datadog APM | Span trees (structured) |
When to use which:
Correlation is key: Connect all three by embedding trace_id in log entries, recording exemplars in metrics, and linking trace spans to log queries.
Use this tree to select the correct metric type:
What are you measuring?
│
├─ A count of events that only goes up?
│  └─ COUNTER
│     Examples: http_requests_total, errors_total, bytes_sent_total
│     Use rate() or increase() to get per-second or per-interval values
│     Never use a counter's raw value; it resets on restart
│
├─ A current value that goes up AND down?
│  └─ GAUGE
│     Examples: temperature_celsius, active_connections, queue_depth
│     Use for snapshots of current state
│     Can use avg_over_time(), max_over_time() for trends
│
├─ A distribution of values (latency, size)?
│  │
│  ├─ Need aggregatable quantiles across instances?
│  │  └─ HISTOGRAM
│  │     Examples: http_request_duration_seconds, response_size_bytes
│  │     Define buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5, 5, 10]
│  │     Use histogram_quantile() for percentiles (p50, p95, p99)
│  │     Aggregatable across instances (histograms can be summed)
│  │
│  └─ Need pre-calculated quantiles on a single instance?
│     └─ SUMMARY
│        Examples: go_gc_duration_seconds
│        Pre-calculates quantiles client-side
│        NOT aggregatable across instances
│        Prefer histogram unless you have a specific reason
│
└─ None of the above?
   └─ INFO metric (labels only, value=1)
      Examples: build_info{version="1.2.3", commit="abc123"}
      Use for metadata exposed as metrics
Rule of thumb: Start with counters and histograms. Add gauges for current state. Avoid summaries unless you have a compelling reason.
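
The advice against reading a counter's raw value can be illustrated with a small sketch. This is a simplified model of reset handling, not Prometheus's actual rate() implementation (which also extrapolates to the window boundaries). The function names and sample values are hypothetical.

```python
from typing import Sequence


def counter_increase(samples: Sequence[float]) -> float:
    """Total increase across consecutive samples, treating any drop as a reset."""
    total = 0.0
    for prev, curr in zip(samples, samples[1:]):
        if curr >= prev:
            total += curr - prev
        else:
            # The value went down, so the process restarted and began
            # counting from zero again; everything in `curr` is new.
            total += curr
    return total


def per_second_rate(samples: Sequence[float], window_seconds: float) -> float:
    """Average per-second rate over the window covered by the samples."""
    return counter_increase(samples) / window_seconds


# Hypothetical scrapes of http_requests_total; the process restarts
# between the third and fourth scrape.
scrapes = [100, 130, 160, 20, 50]

print(counter_increase(scrapes))   # reset-aware increase: 110.0
print(scrapes[-1] - scrapes[0])    # naive difference of raw values: -50
```

The naive difference goes negative across the restart, which is why dashboards and alerts should be built on rate() or increase() rather than the raw counter.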
What type of alert do you need?
│
├─ Known threshold with a fixed boundary?
│  └─ THRESHOLD-BASED
│     Example: CPU > 90% for 5 minutes
│     Pros: Simple, predictable, easy to understand
│     Cons: Requires manual tuning, doesn't adapt to patterns
│     Best for: Resource limits, error rate spikes, queue depth
│
├─ Normal behavior varies by time/season?
│  └─ ANOMALY-BASED
│     Example: Traffic 3 standard deviations below normal for this hour
│     Pros: Adapts to patterns, catches novel failures
│     Cons: Noisy during transitions, requires training data
│     Best for: Traffic patterns, business metrics, gradual degradation
│
└─ Defined reliability targets?
   └─ SLO-BASED (PREFERRED)
      Example: Error budget burn rate > 14.4x for 1 hour
      Pros: Aligned with user impact, reduces noise, principled
      Cons: Requires SLI/SLO definition, more complex setup
      Best for: User-facing services, platform reliability
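
The 14.4x figure in the SLO-based example can be worked through with a little arithmetic. The sketch below assumes a 30-day SLO period and a 99.9% target; both are illustrative assumptions rather than values stated above, and the function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_target: float) -> float:
    """How many times faster than 'exactly on budget' errors are occurring."""
    error_budget = 1.0 - slo_target  # e.g. 0.1% of requests may fail at 99.9%
    return error_ratio / error_budget


def budget_consumed(rate: float, window_hours: float,
                    period_hours: float = 30 * 24) -> float:
    """Fraction of the period's error budget spent if `rate` holds for the window."""
    return rate * window_hours / period_hours


# Assumed scenario: 99.9% SLO, 1.44% of requests failing over the last hour.
rate = burn_rate(error_ratio=0.0144, slo_target=0.999)
print(f"burn rate: {rate:.1f}x")                                 # about 14.4x
print(f"budget spent in 1h: {budget_consumed(rate, 1.0):.1%}")   # about 2%
```

Under these assumptions, a burn rate of 14.4x sustained for one hour spends roughly 2% of a 30-day error budget, which is why it is a common paging threshold.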
| Severity | Response | Examples | Routing |
|----------|----------|----------|---------|
| Critical (P1) | Page on-call immediately | Service down, data loss risk, security breach | PagerDuty high-urgency, phone call |
| Warning (P2) | Investigate within hours | Elevated error rate, disk 80% full, SLO burn rate elevated | PagerDuty low-urgency, Slack alert channel |
| Info (P3) | Review next business day | Deployment completed, certificate expiring in 30 days | Slack info channel, ticket auto-created |
Page (wake someone up) when:
Create ticket (don't page) when:
{
"timestamp": "2026-03-09T14:32:01.123Z",
"level": "ERROR",
"message": "Failed to process payment",
"service": "payment-api",
"version": "1.4.2",
"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736",
"span_id": "00f067aa0ba902b7",
"request_id": "req-abc123",
"user_id": "usr-789",
"error": {
"type": "PaymentGatewayTimeout",
"message": "Gateway response timeout after 30s",
"stack": "..."
},
"duration_ms": 30042,
"http": {
"method": "POST",
"path": "/api/v1/payments",
"status_code": 504
}
}
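
A log entry shaped like the one above could be produced with Python's standard logging module. This is a minimal sketch, not a prescribed implementation: JsonFormatter and build_logger are hypothetical names, and only a subset of the fields shown above is emitted.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single-line JSON object."""

    def __init__(self, service: str, version: str) -> None:
        super().__init__()
        self.service = service
        self.version = version

    def format(self, record: logging.LogRecord) -> str:
        timestamp = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": timestamp.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "message": record.getMessage(),
            "service": self.service,
            "version": self.version,
        }
        # Correlation fields are passed via `extra=`; skip any that are absent.
        for key in ("trace_id", "span_id", "request_id", "duration_ms"):
            value = getattr(record, key, None)
            if value is not None:
                entry[key] = value
        return json.dumps(entry)


def build_logger(name: str, service: str, version: str, stream=None) -> logging.Logger:
    """Create a logger that writes JSON lines to `stream` (stderr by default)."""
    logger = logging.getLogger(name)
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter(service, version))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


log = build_logger("payments", service="payment-api", version="1.4.2")
log.error(
    "Failed to process payment",
    extra={"trace_id": "4bf92f3577b34da6a3ce929d0e0e4736", "request_id": "req-abc123"},
)
```

Passing correlation IDs through `extra=` keeps call sites short; in a real service they would more likely be injected from request context by a logging filter or middleware.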
| Level | When to Use | Examples |
|-------|-------------|----------|
| DEBUG | Development only, verbose internal state | Variable values, SQL queries, cache hits/misses |
| INFO | Normal operations worth recording | Request completed, job started/finished, config loaded |
| WARN | Degraded but still functioning | Retry succeeded, fallback used, approaching limit |
| ERROR | Operation failed, needs attention | Payment failed, API call error, constraint violation |
| FATAL | Process cannot continue, must exit | Database unreachable at startup, invalid config, OOM |
Rules:
- Generate a request_id (UUID v4 or ULID) at the edge/gateway
- Propagate it in a request header (X-Request-ID)
- Include trace_id and span_id from distributed tracing

traceparent: 00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01
             │  │                                │                │
             │  │                                │                └─ flags (01=sampled)
             │  │                                └─ parent span ID (16 hex)
             │  └─ trace ID (32 hex)
             └─ version (00)
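
The field layout above can be checked with a small parser. This sketch only handles headers with the four-field shape shown here and is not a substitute for an OpenTelemetry propagator; TraceParent and parse_traceparent are hypothetical names.

```python
import re
from typing import NamedTuple, Optional

# version (2 hex) - trace ID (32 hex) - parent span ID (16 hex) - flags (2 hex)
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


class TraceParent(NamedTuple):
    version: str
    trace_id: str
    parent_id: str
    sampled: bool


def parse_traceparent(header: str) -> Optional[TraceParent]:
    """Return the parsed header, or None if it is malformed."""
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, parent_id, flags = match.groups()
    # Version ff and all-zero IDs are invalid in W3C Trace Context.
    if version == "ff" or trace_id == "0" * 32 or parent_id == "0" * 16:
        return None
    # The lowest bit of the flags byte is the "sampled" flag.
    return TraceParent(version, trace_id, parent_id, bool(int(flags, 16) & 0x01))


parsed = parse_traceparent("00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01")
print(parsed)
```

Returning None for malformed input lets the caller start a fresh trace instead of propagating a broken context.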
| Strategy | How It Works | Use When |
|----------|--------------|----------|
| Head-based (ratio) | Decide at trace start, propagate decision | Low traffic, need predictable volume |
| Always-on | Sample everything | Development, low-traffic services |
| Parent-based | Follow parent's sampling decision | Default for most services |
| Tail-based | Decide after trace completes (at Collector) | Need error/slow traces, high traffic |
Recommendation: Use parent-based + tail-based at the Collector. This captures all error traces and slow traces while controlling volume.
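
The tail-based decision described above can be sketched as a simple policy: keep every error trace, keep every slow trace, and keep a small share of the rest. This illustrates the logic only and is not how the OpenTelemetry Collector is configured; the threshold, ratio, and names are hypothetical.

```python
import zlib
from dataclasses import dataclass


@dataclass
class CompletedTrace:
    trace_id: str
    duration_ms: float
    has_error: bool


def keep_trace(trace: CompletedTrace,
               slow_threshold_ms: float = 1000.0,
               baseline_ratio: float = 0.05) -> bool:
    """Decide, after the trace has finished, whether to store it."""
    if trace.has_error:
        return True  # always keep failures
    if trace.duration_ms >= slow_threshold_ms:
        return True  # always keep slow requests
    # Hash the trace ID into 10,000 buckets so the decision is deterministic:
    # the same trace gets the same answer wherever it is evaluated.
    bucket = zlib.crc32(trace.trace_id.encode("ascii")) % 10_000
    return bucket < baseline_ratio * 10_000


failed = CompletedTrace("4bf92f3577b34da6a3ce929d0e0e4736", duration_ms=42.0, has_error=True)
print(keep_trace(failed))  # True: error traces are always kept
```

Because the decision needs the whole trace, all spans of a trace must reach the same decision point, which is why this kind of sampling runs at the Collector rather than in each service.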
Always include trace_id in structured log entries. This enables jumping from a log line to the full trace view:
Log entry → trace_id → Jaeger/Tempo → full request waterfall
| Feature | Prometheus + Grafana | Datadog | Grafana Cloud | CloudWatch |
|---------|---------------------|---------|---------------|------------|
| Cost | Free (infra costs) | $$$$ (per host/metric) | $$ (usage-based) | $$ (AWS-native) |
| Setup complexity | High (self-managed) | Low (SaaS agent) | Medium (managed) | Low (AWS-native) |
| Metrics | Prometheus (excellent) | Built-in (excellent) | Mimir (excellent) | Built-in (good) |
| Logs | Loki (good) | Built-in (excellent) | Loki (good) | CloudWatch Logs (good) |
| Traces | Jaeger/Tempo (good) | APM (excellent) | Tempo (good) | X-Ray (adequate) |
| Alerting | Alertmanager (good) | Built-in (excellent) | Grafana Alerting (good) | CloudWatch Alarms (adequate) |
| Dashboards | Grafana (excellent) | Built-in (excellent) | Grafana (excellent) | Dashboards (adequate) |
| Retention | Configurable (unlimited) | 15 months default | Configurable | Up to 15 months |
| Multi-cloud | Yes | Yes | Yes | AWS only |
| Best for | Cost-conscious, control | Full-featured, enterprise | Open-source + managed | AWS-native shops |
Recommendation path:
For every resource (CPU, memory, disk, network):
| Signal | Question | Metric Example |
|--------|----------|----------------|
| Utilization | How busy is it? | node_cpu_seconds_total (% busy) |
| Saturation | How overloaded is it? | node_load1 (run queue length) |
| Errors | Are there error events? | node_network_receive_errs_total |
For every service endpoint:
| Signal | Question | Metric Example |
|--------|----------|----------------|
| Rate | How many requests per second? | rate(http_requests_total[5m]) |
| Errors | How many are failing? | rate(http_requests_total{status=~"5.."}[5m]) |
| Duration | How long do they take? | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
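
The Duration row relies on histogram buckets, which can be summed across instances before a percentile is estimated. The sketch below shows the idea with linear interpolation inside the matching bucket. It is a simplified approximation of what histogram_quantile() does, and the bucket counts are invented for illustration.

```python
import math
from typing import Dict, List

Buckets = Dict[float, int]  # upper bound ("le") -> cumulative observation count


def merge(instances: List[Buckets]) -> Buckets:
    """Sum cumulative bucket counts from several instances with identical bounds."""
    merged: Buckets = {}
    for buckets in instances:
        for le, count in buckets.items():
            merged[le] = merged.get(le, 0) + count
    return merged


def estimate_quantile(q: float, buckets: Buckets) -> float:
    """Estimate the q-quantile (0 < q <= 1) by interpolating within a bucket."""
    bounds = sorted(buckets)
    total = buckets[bounds[-1]]  # the +Inf bucket counts every observation
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0
    for le in bounds:
        count = buckets[le]
        if count >= rank:
            if math.isinf(le) or count == lower_count:
                # Cannot interpolate into the +Inf bucket (or an empty one);
                # fall back to the highest known finite bound.
                return lower_bound
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (le - lower_bound) * fraction
        lower_bound, lower_count = le, count
    return lower_bound


# Hypothetical latency histograms (seconds) from two instances.
instance_a = {0.1: 50, 0.5: 90, 1.0: 100, math.inf: 100}
instance_b = {0.1: 10, 0.5: 60, 1.0: 95, math.inf: 100}

fleet = merge([instance_a, instance_b])
print(f"fleet p95 is roughly {estimate_quantile(0.95, fleet):.3f}s")
```

The same merge step is not possible with summaries, because pre-computed quantiles from different instances cannot be combined into a fleet-wide quantile.
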
| Signal | What to Measure | Alert Threshold Guidance |
|--------|-----------------|--------------------------|
| Latency | Time to serve a request (distinguish success vs error latency) | p99 > 2x baseline |
| Traffic | Demand on the system (requests/sec, sessions, transactions) | Anomaly detection |
| Errors | Rate of failed requests (explicit 5xx, implicit policy violations) | > 0.1% of traffic |
| Saturation | How "full" the service is (CPU, memory, queue depth) | > 80% capacity |
| Gotcha | Why It Happens | Fix |
|--------|----------------|-----|
| Cardinality explosion | Using unbounded label values (user ID, request path, query string) | Use bounded labels only; aggregate high-cardinality data in logs, not metrics |
| Alert fatigue | Too many alerts, too sensitive thresholds, alerts on non-actionable symptoms | Require runbook for every alert; tune thresholds; use SLO-based alerting |
| Missing correlation IDs | Logs, metrics, and traces not linked together | Include trace_id in all log entries; use exemplars in metrics |
| Sampling bias | Head-based sampling decides before the outcome is known, so at low sample rates most error and slow traces are dropped | Use tail-based sampling at the Collector to always capture errors and slow traces |
| Log volume costs | DEBUG or verbose INFO in production, logging full request/response bodies | Set production to INFO minimum; truncate large payloads; use sampling for verbose paths |
| Metric naming inconsistency | Different teams use different naming conventions | Adopt Prometheus/OpenMetrics naming: namespace_subsystem_name_unit, plus a type suffix such as _total for counters (e.g., http_server_request_duration_seconds) |
| Dashboard sprawl | Everyone creates dashboards, nobody maintains them | Standardize with USE/RED templates; review quarterly; delete unused dashboards |
| SLO too aggressive | Setting 99.99% availability without the budget or architecture for it | Start with 99.5% or 99.9%; tighten only when consistently meeting targets with margin |
| Missing baseline | Alerting on absolute thresholds without understanding normal behavior | Collect 2-4 weeks of baseline data before setting alert thresholds |
| Over-instrumentation | Instrumenting every function, creating too many spans/metrics | Instrument at service boundaries; use auto-instrumentation for HTTP/DB/gRPC; add manual spans selectively |
| Ignoring metric staleness | Assuming a metric that stops reporting means zero | Use absent() to detect missing series and up == 0 to detect failed scrapes; distinguish "zero" from "not reporting" |
| Alerting on cause not symptom | Alerting on CPU usage instead of user-facing error rate | Alert on symptoms (error rate, latency); use cause metrics (CPU, memory) for investigation |
| No retention policy | Storing all metrics/logs at full resolution forever | Define retention tiers: 15s resolution for 2 weeks, 1m for 3 months, 5m for 1 year |
| Dashboard without context | Graphs with no units, no description, no threshold lines | Add units to Y-axis, threshold lines for SLOs, panel descriptions explaining what "good" looks like |
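
The cardinality row above comes down to multiplication: the worst-case number of time series is the product of the distinct values of each label. The sketch below uses invented label counts, and the result is an upper bound because not every combination occurs in practice.

```python
from math import prod
from typing import Mapping


def estimated_series(label_cardinalities: Mapping[str, int]) -> int:
    """Upper bound on series for one metric: product of distinct values per label."""
    return prod(label_cardinalities.values())


def histogram_series(label_combinations: int, finite_buckets: int) -> int:
    """A histogram exposes one series per finite bucket, plus +Inf, _sum and _count."""
    return label_combinations * (finite_buckets + 3)


# Hypothetical label sets for an HTTP request metric.
bounded = {"method": 5, "status_class": 5, "route": 40}
unbounded = {**bounded, "user_id": 100_000}

print(estimated_series(bounded))                        # 1000
print(estimated_series(unbounded))                      # 100000000
print(histogram_series(estimated_series(bounded), 11))  # 14000
```

Adding a single unbounded label turns a thousand series into a hundred million, which is why identifiers such as user IDs belong in logs or trace attributes rather than metric labels.
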
| File | Contents | Lines |
|------|----------|-------|
| metrics-alerting.md | Prometheus, Grafana, OpenTelemetry metrics, SLI/SLO/SLA, alert routing, runbooks, uptime monitoring | ~650 |
| logging.md | Structured logging, log levels, correlation IDs, aggregation (Loki, ELK), retention, PII masking, language-specific | ~550 |
| tracing.md | OpenTelemetry, spans, context propagation, sampling, Jaeger, async tracing, DB/HTTP/gRPC instrumentation | ~600 |
| infrastructure.md | Health checks, K8s probes, Docker HEALTHCHECK, infra metrics, APM, cost optimization, incident response | ~550 |