engineering/devops/skills/monitoring-observability/SKILL.md
This skill should be used when the user asks about "monitoring", "observability", "Prometheus", "Grafana", "Datadog", "Loki", "Jaeger", "distributed tracing", "OpenTelemetry", "metrics", "logs", "traces", "alerting", "SLO", "SLA", "error budget", "on-call", "PagerDuty", "incident alert", "dashboard", "runbook", "MTTR", "MTTD", "log aggregation", "log parsing", "structured logging", "service mesh observability", "Istio metrics", "APM", or "application performance monitoring". Also trigger for "how do I know if my service is healthy", "I can't see what's happening in production", "why is P99 latency high", or "set up alerting for my service".
Build observable systems using the three pillars: metrics, logs, and traces.
| Pillar | What it tells you | Tool examples |
|--------|------------------|---------------|
| Metrics | Aggregated numerical measurements over time | Prometheus, Datadog, CloudWatch |
| Logs | Discrete events with context | Loki, Elasticsearch, CloudWatch Logs, Datadog Logs |
| Traces | End-to-end request flows across services | Jaeger, Tempo, Datadog APM, X-Ray |
You need all three. Metrics tell you something is wrong. Logs tell you what happened. Traces tell you where in the system it went wrong.
Define service reliability targets before building dashboards. Every team should have explicit SLOs.
# slo.yaml
service: api-service
slos:
  - name: availability
    description: "Percentage of requests that succeed (non-5xx)"
    target: 99.9%  # 0.1% error budget = 43.2 min per 30 days
    measurement:
      # sum() aggregates across label sets so the ratio covers the whole service
      metric: sum(rate(http_requests_total{status!~"5.."}[5m])) / sum(rate(http_requests_total[5m]))
  - name: latency-p99
    description: "P99 response time for all requests"
    target: 200ms  # 99% of requests under 200ms
    measurement:
      metric: histogram_quantile(0.99, sum by (le) (rate(http_request_duration_seconds_bucket[5m])))
  - name: throughput
    description: "Service processes at least N requests/second under normal load"
    target: 100 rps
Error budget = (1 - SLO target) × time window
99.9% availability over 30 days = 0.1% × 30 × 24 × 60 = 43.2 minutes
Burn rate = how fast you're consuming the error budget
- Burn rate 1x = consuming budget at exactly the target rate (sustainable)
- Burn rate 14.4x = consuming 2% of the 30-day budget in 1 hour (page immediately)
- Burn rate 6x = consuming 5% of the 30-day budget in 6 hours (open a ticket, respond within 24 hours)
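The arithmetic above can be sketched in a few helper functions. This is a minimal illustration, not part of any library; the function names are hypothetical. It reproduces the 43.2-minute budget and shows how the "2% in 1 hour" and "5% in 6 hours" thresholds used in the alert rules map to burn rates.

```typescript
// Hypothetical helpers for error-budget arithmetic (names are illustrative).

// Error budget in minutes: (1 - SLO target) × time window.
function errorBudgetMinutes(sloTarget: number, windowDays: number): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Fraction of the whole budget consumed when burning at `burnRate`
// for `hours` out of a window that is `windowDays` long.
function budgetConsumed(burnRate: number, hours: number, windowDays: number): number {
  return (burnRate * hours) / (windowDays * 24);
}

// Burn rate required to consume `fraction` of the budget within `hours`.
function burnRateFor(fraction: number, hours: number, windowDays: number): number {
  return (fraction * windowDays * 24) / hours;
}

// Error-ratio threshold an alert expression compares against.
function errorRatioThreshold(burnRate: number, sloTarget: number): number {
  return burnRate * (1 - sloTarget);
}

console.log(errorBudgetMinutes(0.999, 30).toFixed(1)); // "43.2"
console.log(burnRateFor(0.02, 1, 30).toFixed(1));      // "14.4"
console.log(burnRateFor(0.05, 6, 30).toFixed(1));      // "6.0"
```

For a 99.9% SLO, `errorRatioThreshold(14.4, 0.999)` works out to roughly 0.0144, which is the `14.4 * 0.001` term in the paging rule.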
# Alert when error budget burns too fast
# sum() aggregates across label sets so each ratio is the service-wide error rate
# Page immediately: 2% of 30-day budget consumed in 1 hour
- alert: ErrorBudgetBurnRatePage
  expr: |
    (
      sum(rate(http_requests_total{status=~"5.."}[1h])) / sum(rate(http_requests_total[1h])) > (14.4 * 0.001)
    ) and (
      sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) > (14.4 * 0.001)
    )
  for: 2m
  labels:
    severity: page
  annotations:
    summary: "Error budget burning at 14.4x rate: page immediately"
# Ticket: 5% consumed in 6 hours
- alert: ErrorBudgetBurnRateTicket
  expr: |
    sum(rate(http_requests_total{status=~"5.."}[6h])) / sum(rate(http_requests_total[6h])) > (6 * 0.001)
  for: 15m
  labels:
    severity: ticket
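The `for:` clause in these rules delays firing until the expression has been true on every evaluation for the stated duration. The sketch below simulates that pending-to-firing transition. It is a simplified model under stated assumptions (a single series, fixed evaluation timestamps, no `keep_firing_for`), and the type and function names are hypothetical.

```typescript
type AlertState = 'inactive' | 'pending' | 'firing';

interface Evaluation {
  timeSeconds: number;    // when the rule was evaluated
  conditionTrue: boolean; // whether the expression returned a result
}

// Returns the alert state after each evaluation for a given `for:` duration.
function alertStates(evaluations: Evaluation[], forSeconds: number): AlertState[] {
  let activeSince: number | null = null;
  return evaluations.map(({ timeSeconds, conditionTrue }): AlertState => {
    if (!conditionTrue) {
      // Any false evaluation resets the timer.
      activeSince = null;
      return 'inactive';
    }
    if (activeSince === null) activeSince = timeSeconds;
    return timeSeconds - activeSince >= forSeconds ? 'firing' : 'pending';
  });
}

// One evaluation per minute; a brief recovery at t=120 resets the timer.
const samples: Evaluation[] = [true, true, false, true, true, true].map(
  (conditionTrue, i) => ({ timeSeconds: i * 60, conditionTrue }),
);
console.log(alertStates(samples, 120).join(', '));
// pending, pending, inactive, pending, pending, firing
```

The transient spike at the start never pages because the condition clears before two minutes have elapsed.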
The Four Golden Signals (Google SRE):
| Signal | What to measure | Prometheus metric type |
|--------|----------------|----------------------|
| Latency | Time to serve a request (P50, P95, P99) | histogram |
| Traffic | Requests per second | counter |
| Errors | Rate of failed requests | counter |
| Saturation | How full the service is (CPU, queue depth) | gauge |
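Traffic and errors are counters because a counter only ever increases, and a per-second rate is derived from how much it grew over a window. The sketch below shows that derivation, including the handling of counter resets after a process restart. It is a simplification: the real PromQL `rate()` also extrapolates to the window edges, which is omitted here, and the names are hypothetical.

```typescript
interface Sample {
  timeSeconds: number;
  value: number; // raw counter value at scrape time
}

// Total increase across samples. A drop in value is treated as a counter
// reset, so the value after the reset counts as new increase.
function counterIncrease(samples: Sample[]): number {
  let increase = 0;
  for (let i = 1; i < samples.length; i++) {
    const delta = samples[i].value - samples[i - 1].value;
    increase += delta >= 0 ? delta : samples[i].value;
  }
  return increase;
}

// Average per-second rate between the first and last sample.
function perSecondRate(samples: Sample[]): number {
  if (samples.length < 2) return 0;
  const elapsed = samples[samples.length - 1].timeSeconds - samples[0].timeSeconds;
  return elapsed > 0 ? counterIncrease(samples) / elapsed : 0;
}

// The counter restarts between the second and third scrape.
const scrapes: Sample[] = [
  { timeSeconds: 0, value: 100 },
  { timeSeconds: 15, value: 130 },
  { timeSeconds: 30, value: 10 },
  { timeSeconds: 45, value: 40 },
];
console.log(counterIncrease(scrapes)); // 70
```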
import express from 'express';
import { Counter, Histogram, Gauge, register } from 'prom-client';

const app = express();

// HTTP request metrics (latency + traffic + errors)
// The label is named `status` so it matches the `status=~"5.."` selectors in the alert rules.
export const httpRequestDuration = new Histogram({
  name: 'http_request_duration_seconds',
  help: 'HTTP request duration in seconds',
  labelNames: ['method', 'route', 'status'],
  buckets: [0.005, 0.01, 0.025, 0.05, 0.1, 0.25, 0.5, 1, 2.5],
});

export const httpRequestTotal = new Counter({
  name: 'http_requests_total',
  help: 'Total HTTP requests',
  labelNames: ['method', 'route', 'status'],
});

// Queue depth (saturation)
export const jobQueueDepth = new Gauge({
  name: 'job_queue_depth',
  help: 'Number of jobs waiting in the queue',
  labelNames: ['queue_name'],
});

// Business metric
export const ordersProcessed = new Counter({
  name: 'orders_processed_total',
  help: 'Total orders successfully processed',
  labelNames: ['payment_method', 'region'],
});

// Expose metrics endpoint for Prometheus to scrape
app.get('/metrics', async (req, res) => {
  res.set('Content-Type', register.contentType);
  res.end(await register.metrics());
});
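The bucket boundaries on the histogram above limit how precisely a percentile can be reported, because `histogram_quantile` estimates the value by linear interpolation inside the bucket that contains the target rank. The sketch below illustrates that estimate in plain TypeScript. It assumes classic cumulative buckets and a lowest bound of 0, the function name is hypothetical, and the sample counts are made up for illustration.

```typescript
interface Bucket {
  le: number;              // upper bound in seconds; Infinity for the +Inf bucket
  cumulativeCount: number; // observations less than or equal to `le`
}

function histogramQuantile(q: number, buckets: Bucket[]): number {
  const sorted = buckets.slice().sort((a, b) => a.le - b.le);
  if (sorted.length === 0) return NaN;
  const total = sorted[sorted.length - 1].cumulativeCount;
  if (total === 0) return NaN;

  // Find the first bucket whose cumulative count reaches the target rank.
  const rank = q * total;
  let idx = 0;
  while (idx < sorted.length - 1 && sorted[idx].cumulativeCount < rank) idx++;
  const bucket = sorted[idx];

  // Rank falls above the highest finite bound: report that bound.
  if (bucket.le === Infinity) {
    return idx > 0 ? sorted[idx - 1].le : NaN;
  }

  const lowerBound = idx === 0 ? 0 : sorted[idx - 1].le;
  const countBelow = idx === 0 ? 0 : sorted[idx - 1].cumulativeCount;
  const inBucket = bucket.cumulativeCount - countBelow;
  if (inBucket === 0) return bucket.le;
  return lowerBound + (bucket.le - lowerBound) * ((rank - countBelow) / inBucket);
}

// Illustrative counts only.
const latencyBuckets: Bucket[] = [
  { le: 0.1, cumulativeCount: 900 },
  { le: 0.25, cumulativeCount: 980 },
  { le: 0.5, cumulativeCount: 995 },
  { le: Infinity, cumulativeCount: 1000 },
];
console.log(histogramQuantile(0.99, latencyBuckets).toFixed(3)); // "0.417"
console.log(histogramQuantile(0.999, latencyBuckets));           // 0.5
```

The second result shows the practical consequence: once the target rank lands in the `+Inf` bucket, the estimate is capped at the highest finite boundary, so the largest bucket should sit above the latency you need to alert on.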
# prometheus.yml
scrape_configs:
  - job_name: 'api-service'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      # Keep only pods labeled app=api-service
      - source_labels: [__meta_kubernetes_pod_label_app]
        action: keep
        regex: api-service
      # Rewrite the scrape address to use the port from the prometheus.io/port annotation
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        target_label: __address__
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
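Address-rewriting relabel rules are easy to get wrong, so it helps to see what a `replace` action does with its inputs. Prometheus joins the source label values with `;`, matches the result against a fully anchored regex, and leaves the target label unchanged when there is no match. The sketch below mimics that for a common pattern that combines the pod address with the port annotation. The function name is hypothetical and IPv6 addresses are not handled.

```typescript
// Simulates a `replace` relabel action with
//   source_labels: [__address__, <port annotation>]
//   regex: ([^:]+)(?::\d+)?;(\d+)
//   replacement: $1:$2
function rewriteAddress(address: string, portAnnotation?: string): string {
  const joined = [address, portAnnotation === undefined ? '' : portAnnotation].join(';');
  const match = /^([^:]+)(?::\d+)?;(\d+)$/.exec(joined);
  // No match (for example, the annotation is missing): keep the original address.
  return match ? match[1] + ':' + match[2] : address;
}

console.log(rewriteAddress('10.0.0.5:8080', '9090')); // 10.0.0.5:9090
console.log(rewriteAddress('10.0.0.5', '9090'));      // 10.0.0.5:9090
console.log(rewriteAddress('10.0.0.5:8080'));         // 10.0.0.5:8080
```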
Dashboard 1: Service Health Overview
Dashboard 2: Resource Utilization
Dashboard 3: Business Metrics
Always emit structured JSON logs. Never console.log("error: " + err).
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL || 'info',
  base: {
    service: 'api-service',
    version: process.env.APP_VERSION,
    env: process.env.NODE_ENV,
  },
  redact: ['req.headers.authorization', 'body.password', 'body.credit_card'],
});

// Good: structured, searchable fields
logger.info({
  event: 'order.created',
  order_id: order.id,
  user_id: user.id,
  amount: order.total,
  payment_method: order.paymentMethod,
  duration_ms: Date.now() - startTime,
}, 'Order created successfully');

// Good: errors with full context
logger.error({
  event: 'payment.failed',
  order_id: order.id,
  error_code: err.code,
  err, // pino serializes Error objects correctly
}, 'Payment processing failed');
| Level | Use for |
|-------|---------|
| trace | Very detailed debugging (disabled in production) |
| debug | Debug info useful during development |
| info | Normal operational events (request received, job completed) |
| warn | Unexpected but handled situations (retry succeeded, deprecated API used) |
| error | Errors that require attention but didn't crash the service |
| fatal | Errors that cause the process to exit |
In production: Set LOG_LEVEL=info. Only enable debug when actively investigating.
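To make the level and redaction rules concrete, here is a dependency-free sketch of what a structured logger does before it writes a line: drop events below the configured level, mask sensitive fields, and serialize the rest as one JSON object. This is an illustration only, not how pino is implemented; `formatLog` and the `[Redacted]` placeholder are assumptions, and only top-level keys are redacted.

```typescript
const LEVELS = ['trace', 'debug', 'info', 'warn', 'error', 'fatal'] as const;
type Level = (typeof LEVELS)[number];

// Returns one JSON log line, or null when `level` is below `minLevel`.
function formatLog(
  minLevel: Level,
  level: Level,
  fields: Record<string, unknown>,
  message: string,
  redactKeys: string[] = [],
): string | null {
  if (LEVELS.indexOf(level) < LEVELS.indexOf(minLevel)) return null;

  const safeFields: Record<string, unknown> = {};
  for (const key of Object.keys(fields)) {
    // Mask sensitive values instead of dropping the key, so the field stays searchable.
    safeFields[key] = redactKeys.indexOf(key) >= 0 ? '[Redacted]' : fields[key];
  }
  return JSON.stringify({ level, time: new Date().toISOString(), ...safeFields, msg: message });
}

// With LOG_LEVEL=info, debug events are dropped and errors are kept.
console.log(formatLog('info', 'debug', { cache: 'miss' }, 'Cache lookup')); // null
console.log(
  formatLog('info', 'error', { event: 'payment.failed', order_id: 'o-123', password: 'hunter2' },
    'Payment processing failed', ['password']),
);
```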
// tracing.ts: must be loaded before everything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { OTLPTraceExporter } from '@opentelemetry/exporter-trace-otlp-http';
import { HttpInstrumentation } from '@opentelemetry/instrumentation-http';
import { ExpressInstrumentation } from '@opentelemetry/instrumentation-express';
import { PgInstrumentation } from '@opentelemetry/instrumentation-pg';

const sdk = new NodeSDK({
  serviceName: 'api-service',
  traceExporter: new OTLPTraceExporter({
    // `url` is used verbatim, so read the traces-specific variable that holds the full /v1/traces URL
    url: process.env.OTEL_EXPORTER_OTLP_TRACES_ENDPOINT || 'http://jaeger:4318/v1/traces',
  }),
  instrumentations: [
    new HttpInstrumentation(),
    new ExpressInstrumentation(),
    new PgInstrumentation(),
  ],
});

sdk.start();
import { trace, SpanStatusCode } from '@opentelemetry/api';

const tracer = trace.getTracer('api-service');

async function processOrder(orderId: string) {
  return tracer.startActiveSpan('processOrder', async (span) => {
    span.setAttributes({
      'order.id': orderId,
      'order.source': 'api',
    });
    try {
      const result = await chargePayment(orderId);
      span.setStatus({ code: SpanStatusCode.OK });
      return result;
    } catch (err) {
      // `err` is `unknown` in strict TypeScript, so normalize it before recording
      const error = err instanceof Error ? err : new Error(String(err));
      span.recordException(error);
      span.setStatus({ code: SpanStatusCode.ERROR, message: error.message });
      throw err;
    } finally {
      // Always end the span, or it will never be exported
      span.end();
    }
  });
}
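Spans from different services join into one trace because the trace ID travels with each request. OpenTelemetry's default propagator uses the W3C Trace Context `traceparent` header for this. The sketch below parses that header by hand, which can be useful when adding the trace ID to log lines or debugging a broken trace. It handles only version `00`, and the function and type names are hypothetical; in application code the SDK's propagation API does this for you.

```typescript
interface TraceContext {
  traceId: string;      // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;     // lowest bit of the trace-flags byte
}

// Parses a version-00 `traceparent` header: 00-<trace-id>-<parent-id>-<flags>
function parseTraceparent(header: string): TraceContext | null {
  const match = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/.exec(header.trim());
  if (!match) return null;
  const traceId = match[1];
  const parentSpanId = match[2];
  // All-zero IDs are invalid in W3C Trace Context.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(match[3], 16) & 0x01) === 1 };
}

const ctx = parseTraceparent('00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01');
console.log(ctx);
// { traceId: '4bf92f3577b34da6a3ce929d0e0e4736', parentSpanId: '00f067aa0ba902b7', sampled: true }
```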
Symptom-based alerts over cause-based alerts:
Thresholds that eliminate noise:
- `for: 5m` so transient spikes don't page

Every alert must link to a runbook:
# Alert: HighErrorRate
**When does this fire?** Error rate > 1.44% for 5 minutes (14.4x burn rate against a 99.9% SLO)
**Severity:** P1 — Page immediately
## 1. Initial Assessment (< 2 min)
- Check [Grafana dashboard](https://grafana.example.com/d/api-errors)
- Which endpoints are erroring? `sum by (route) (rate(http_requests_total{status=~"5.."}[5m]))`
- When did it start? Look at the deployment history
## 2. Common Causes
### Bad deployment
- Check if deploy happened with: `kubectl rollout history deployment/api-service -n production`
- If yes: `kubectl rollout undo deployment/api-service -n production`
### Database connectivity
- Check DB errors: `kubectl logs -l app=api-service -n production --since=5m | grep "connection"`
- Check RDS health in AWS console
- If DB is down: enable read-only mode via feature flag
### Downstream service failure
- Check dependencies: `[list of dependencies and their status pages]`
## 3. Escalation
- 15 min without mitigation: page engineering lead
- 30 min without mitigation: page VP Engineering
## 4. Post-Incident
- Open postmortem issue within 24 hours
- Complete postmortem within 5 business days
For complete alerting rule sets and distributed tracing integration guides, see:
- references/metrics-alerting.md: Prometheus recording rules, alert thresholds, and Grafana dashboard JSON for the four golden signals
- references/tracing-patterns.md: OpenTelemetry SDK setup, span attribute conventions, and Tempo/Jaeger query patterns for distributed trace analysis