skills/monitoring-observability/SKILL.md
Production monitoring, observability, and incident response practices. Use when the user asks about structured logging, distributed tracing, metrics collection, Prometheus, Grafana dashboards, log aggregation, ELK or Loki, alerting strategy, SLIs and SLOs, error budgets, health checks, RED or USE method, uptime monitoring, synthetic checks, incident response, postmortems, runbooks, on-call rotations, alert fatigue, monitoring infrastructure, APM (application performance monitoring), observability signals, cardinality explosion, or designing an observability stack.
npx skillsauth add 1mangesh1/dev-skills-collection monitoring-observability
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Concepts, tooling, and operational practices for monitoring production systems and responding to incidents effectively.
A metric alert fires (error rate spike). You query logs filtered by service and time window. You pull a trace ID from the logs to see the full request path. Together, metrics, logs, and traces move you from "something is wrong" to "here is why."
Unstructured text logs are difficult to search and aggregate. Structured logs (typically JSON) make every field machine-parseable.
{
  "timestamp": "2025-09-14T08:22:11.403Z",
  "level": "ERROR",
  "service": "payment-service",
  "trace_id": "abc123def456",
  "correlation_id": "order-98765",
  "message": "Charge failed: card declined",
  "duration_ms": 342
}
Log levels: DEBUG (dev only), INFO (normal operations), WARN (recoverable issues), ERROR (request-level failures), FATAL (process must exit).
Correlation IDs: Generate a unique ID at the edge (API gateway). Propagate
it via X-Request-ID header to every downstream call. Include it in every log
line to reconstruct the full request path.
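The logging and correlation-ID practices above can be sketched in Python using only the standard library. This is a minimal illustration, not a reference implementation: `JsonFormatter`, `start_request`, and `outbound_headers` are hypothetical names, and the field set is a subset of the JSON example shown earlier.

```python
import json
import logging
import uuid
from contextvars import ContextVar
from datetime import datetime, timezone

# Correlation ID of the request currently being handled (hypothetical helper state).
correlation_id: ContextVar[str] = ContextVar("correlation_id", default="-")


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        payload = {
            "timestamp": datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            "level": record.levelname,
            "service": self.service,
            "correlation_id": correlation_id.get(),
            "message": record.getMessage(),
        }
        return json.dumps(payload)


def start_request(headers: dict) -> str:
    """Reuse the caller's X-Request-ID, or generate one at the edge."""
    request_id = headers.get("X-Request-ID") or uuid.uuid4().hex
    correlation_id.set(request_id)
    return request_id


def outbound_headers() -> dict:
    """Headers to attach to every downstream call so the ID propagates."""
    return {"X-Request-ID": correlation_id.get()}
```

To use it, attach `JsonFormatter("payment-service")` to a `logging.StreamHandler`, call `start_request(...)` when a request arrives, and merge `outbound_headers()` into each downstream HTTP call.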
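Counter: A cumulative value that only increases (or resets to zero on restart). Use rate() to query. Examples: total requests, total errors.
Histogram: Observations counted into buckets and exposed as _bucket, _sum, and _count series. Use for latency distributions via histogram_quantile().
Prometheus is a pull-based metrics system that scrapes HTTP endpoints exposing metrics in its text format.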
global:
  scrape_interval: 15s
scrape_configs:
  - job_name: "api-server"
    static_configs:
      - targets: ["api-server:8080"]
    metrics_path: /metrics
  - job_name: "kubernetes-pods"
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
rate(http_requests_total[5m]) # request rate
histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) # p99
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) # error ratio
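To make the p99 query less opaque, here is a rough Python sketch of how a quantile can be estimated from cumulative histogram buckets by linear interpolation, which is the general approach behind histogram_quantile(). The function name mirrors the PromQL one for readability only. The input shape (a list of `(upper_bound, cumulative_count)` pairs) is an assumption of this sketch, and edge cases are simplified compared with the real implementation.

```python
import math


def histogram_quantile(q: float, buckets: list) -> float:
    """Estimate the q-quantile from (upper_bound, cumulative_count) pairs.

    Buckets must be sorted by upper bound and end with a +Inf bucket.
    """
    if not 0 < q <= 1:
        raise ValueError("this sketch only handles 0 < q <= 1")
    if not buckets or buckets[-1][0] != math.inf:
        raise ValueError("buckets must end with a +Inf bucket")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total  # how many observations lie at or below the quantile
    prev_bound, prev_count = 0.0, 0.0  # the first bucket is assumed to start at 0
    for bound, count in buckets:
        if count >= rank:
            if bound == math.inf:
                # Quantile falls beyond the last finite bucket: report that bound.
                return prev_bound
            # Assume observations are spread evenly inside the bucket.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return prev_bound
```

Because of the interpolation, the result is only as precise as the bucket boundaries; choosing bounds close to the latency thresholds you care about gives more useful estimates.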
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.05
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Error rate above 5% for 5 minutes"
          runbook_url: "https://wiki.internal/runbooks/high-error-rate"
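The `for: 5m` clause in the rule above means the expression must stay true across consecutive evaluations for five minutes before the alert fires; until then it is pending. The following Python sketch models that state machine under simplified assumptions. `PendingAlert` is a hypothetical class written for illustration, not part of any Prometheus client library.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass
class PendingAlert:
    threshold: float      # e.g. 0.05 for a 5% error ratio
    hold_seconds: float   # e.g. 300 for "for: 5m"
    breach_started: Optional[float] = field(default=None, init=False)

    def evaluate(self, now: float, value: float) -> str:
        """Return 'inactive', 'pending', or 'firing' for one evaluation."""
        if value <= self.threshold:
            # Condition cleared: the hold timer resets.
            self.breach_started = None
            return "inactive"
        if self.breach_started is None:
            self.breach_started = now
        if now - self.breach_started >= self.hold_seconds:
            return "firing"
        return "pending"
```

A short spike that clears before the hold period ends never pages anyone, which is one of the simpler ways to reduce alert noise.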
Data sources: Prometheus, Loki, Elasticsearch, InfluxDB, CloudWatch, others.
Panel types: Time series (metrics over time), Stat (single value), Table (top-N lists), Heatmap (latency distribution), Logs (inline log viewer).
Template variables: Define $namespace, $service, $instance at the
dashboard level. Use in queries to make one dashboard serve many teams.
Alerts: Grafana 9+ uses a unified alerting engine. Define alert rules, then route notifications to Slack, PagerDuty, or Opsgenie via notification policies matched by label.
When to choose: Need full-text search across all fields? ELK. Need cost-effective storage with label-based queries? Loki. Already running Prometheus and Grafana? Loki reduces operational overhead.
Every alert must be actionable. If no one needs to act, remove it. Prefer alerting on symptoms (users affected) over causes (CPU is high).
| Severity | Meaning | Response |
|----------|---------|----------|
| P1 / Critical | Service down or data loss | Page immediately |
| P2 / High | Degraded, partial outage | Page during business hours |
| P3 / Medium | Non-urgent, workaround exists | Ticket, fix within days |
| P4 / Low | Cosmetic or minor | Backlog |
Every alert should link to a runbook containing: what the alert means, how to verify, steps to mitigate, escalation contacts, and relevant dashboard links. Keep runbooks in version control alongside alerting rules.
Route P1/P2 to on-call paging tools. Route P3/P4 to Slack or ticketing systems. Configure escalation policies so the secondary is paged if the primary does not acknowledge within N minutes.
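The routing and escalation rules above could be expressed roughly as follows. This is a sketch with hypothetical names (`Severity`, `route`, `page_target`). In practice this logic usually lives in the paging tool's own configuration rather than in application code.

```python
from enum import Enum


class Severity(Enum):
    P1 = 1
    P2 = 2
    P3 = 3
    P4 = 4


def route(severity: Severity) -> str:
    """P1/P2 go to the paging tool; P3/P4 go to chat or ticketing."""
    return "pager" if severity in (Severity.P1, Severity.P2) else "ticket"


def page_target(minutes_since_page: float, ack_timeout_minutes: float, acknowledged: bool) -> str:
    """Escalate to the secondary when the primary has not acknowledged in time."""
    if acknowledged:
        return "none"
    return "secondary" if minutes_since_page >= ack_timeout_minutes else "primary"
```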
SLI (Service Level Indicator): A quantitative measure of a service aspect. Examples: availability (% successful requests), latency (% requests < threshold).
SLO (Service Level Objective): A target for an SLI over a time window. Example: 99.9% of requests succeed over a 30-day rolling window.
Error budget: The allowed unreliability: 1 - SLO. A 99.9% SLO gives 0.1%
budget (roughly 43 minutes of downtime per 30 days). When the budget is nearly
exhausted, freeze releases and focus on reliability.
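The error budget arithmetic can be checked with a few lines of Python. The function names here are illustrative, and the request-based variant assumes availability is measured as the share of successful requests.

```python
def error_budget_minutes(slo: float, window_days: int = 30) -> float:
    """Allowed downtime in minutes for a time-based SLO over the window."""
    return (1 - slo) * window_days * 24 * 60


def budget_remaining(slo: float, total_requests: int, failed_requests: int) -> float:
    """Fraction of a request-based error budget still unspent (negative if overspent)."""
    allowed_failures = (1 - slo) * total_requests
    if allowed_failures == 0:
        raise ValueError("a 100% SLO (or zero traffic) leaves no error budget")
    return 1 - failed_requests / allowed_failures
```

For example, `error_budget_minutes(0.999)` comes to about 43.2 minutes per 30 days, and a service that has failed 250 of 1,000,000 requests against a 99.9% SLO has roughly 75% of its budget left.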
Tips: Start with one or two SLOs per service. Measure from the client perspective (load balancer logs, synthetic probes). Review in weekly reliability meetings.
livenessProbe:
  httpGet: { path: /healthz, port: 8080 }
  initialDelaySeconds: 5
  periodSeconds: 10
readinessProbe:
  httpGet: { path: /ready, port: 8080 }
  periodSeconds: 5
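On the application side, the two probe paths above might be served like this. It is a minimal sketch using Python's standard library: `HealthHandler` and the `ready` flag are hypothetical names, and a real service would set the flag only after its dependencies (database, caches) are reachable.

```python
import threading
from http.server import BaseHTTPRequestHandler

# Set once startup work is finished; cleared to drain traffic.
ready = threading.Event()


class HealthHandler(BaseHTTPRequestHandler):
    def do_GET(self) -> None:
        if self.path == "/healthz":
            # Liveness: the process is up and able to answer.
            self._reply(200, b"ok")
        elif self.path == "/ready":
            # Readiness: only accept traffic once dependencies are available.
            if ready.is_set():
                self._reply(200, b"ready")
            else:
                self._reply(503, b"not ready")
        else:
            self._reply(404, b"not found")

    def _reply(self, status: int, body: bytes) -> None:
        self.send_response(status)
        self.send_header("Content-Type", "text/plain")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)

    def log_message(self, *args) -> None:
        # Probes run every few seconds; keep them out of the access log.
        pass
```

To try it locally, run `HTTPServer(("", 8080), HealthHandler).serve_forever()` (with `HTTPServer` imported from `http.server`) and call `ready.set()` once initialization completes. Keeping liveness cheap and dependency-free avoids restart loops when a downstream system is the one that is unhealthy.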
For request-driven services, dashboard these three signals:
| Signal | Measure | Example Metric |
|--------|---------|----------------|
| Rate | Requests per second | rate(http_requests_total[5m]) |
| Errors | Failed requests per second | rate(http_requests_total{status=~"5.."}[5m]) |
| Duration | Latency (p50, p95, p99) | histogram_quantile(0.95, ...) |
Alert when error rate or latency exceeds SLO thresholds.
For resources (CPU, memory, disk, network):
| Signal | Definition | Example |
|--------|-----------|---------|
| Utilization | % of resource busy | CPU at 85% |
| Saturation | Queued work beyond capacity | Run queue > core count |
| Errors | Error events | Disk I/O errors, packet drops |
High utilization alone is not a problem. High saturation means the resource is a bottleneck. Infrastructure bottlenecks often manifest as increased request latency (linking USE back to RED).
Synthetic checks: Automated scripts simulating user actions from multiple regions. Tools: Grafana Synthetic Monitoring, Checkly, Pingdom.
Real User Monitoring (RUM): Collects performance data from actual browsers. Measures page load, time to interactive, core web vitals. Captures real network conditions and device diversity that synthetics cannot.
Status pages: Publish service status externally (Statuspage, Instatus). Automate updates from alert state changes when possible.