plugins/backend-toolkit/skills/observability-setup/SKILL.md
Instrument a backend with the three signals unified by one correlation context — structured logs, metrics (RED for services, USE for resources), and distributed tracing (OpenTelemetry + W3C Trace Context). Use before production, when debugging is blind, or when an incident has no trail. Not for diagnosing specific bottlenecks (use performance-profiling) or AI-specific token/cost metrics (use ai-llm-backend on top of this backbone).
npx skillsauth add jaykim88/claude-ai-engineering observability-setup
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Turn production silence into signal by emitting three correlated signals — logs, metrics, traces — joined by one propagated context, so any request can be reconstructed end-to-end and any regression is visible.
Universal — the three-signals model, RED/USE methods, and trace-context propagation are vendor-neutral (OpenTelemetry is the CNCF standard); implemented across Node/Python/Go.
Propagate ONE correlation context
- traceId (W3C Trace Context header) at the edge

Structured logging (JSON)
- { level, timestamp, traceId, route, userId(anonymized), msg, ...context }
- No console.log; one logger utility; levels (debug/info/warn/error)

Metrics: RED for services, USE for resources
- Putting userId, requestId, email, or unbounded ids in labels explodes the TSDB. Keep labels low-cardinality (route, status code, region); push per-request detail into logs/traces, not metrics

Distributed tracing (OpenTelemetry), and sample it
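A minimal sketch of one possible sampling policy: a deterministic head decision derived from the trace id, plus an always-keep override for errors and slow traces. The override needs the outcome of the request, so it is a tail-side decision. The function names, the 10% ratio, and the 1000 ms threshold are assumptions for illustration, not part of any OpenTelemetry API.

```python
SAMPLE_RATIO = 0.10        # assumed: keep roughly 10% of ordinary traces
SLOW_THRESHOLD_MS = 1000   # assumed: traces at or above this are always kept


def head_sampled(trace_id_hex: str, ratio: float = SAMPLE_RATIO) -> bool:
    """Deterministic head decision.

    Every service that sees the same trace id reaches the same answer,
    so a trace is kept or dropped as a whole.
    """
    # Treat the low 64 bits of the 128-bit trace id as a uniform value.
    low_bits = int(trace_id_hex[-16:], 16)
    return low_bits < ratio * 2**64


def keep_trace(trace_id_hex: str, had_error: bool, duration_ms: float) -> bool:
    """Tail-side decision: errors and slow traces are always kept;
    everything else falls back to the head decision."""
    if had_error or duration_ms >= SLOW_THRESHOLD_MS:
        return True
    return head_sampled(trace_id_hex)
```

Deriving the decision from the trace id, rather than from a fresh random number per service, is what keeps sampled traces complete across service boundaries.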
Capture errors with context, and scrub PII
- A catch that returns gracefully must still record the error (silent catch = invisible failure)
- beforeSend scrubbing + platform-side data scrubbing; treat the trace/log/error pipeline as the same trust boundary as your API responses

Alert on patterns, not noise
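Alerting on a pattern rather than on each error can be sketched as a burn-rate check: the observed error ratio divided by the error budget (1 minus the SLO target). The 99.9% target and the function names are assumptions for illustration. The 14.4 threshold is the commonly cited fast-burn value for a 1-hour window on a 30-day SLO, used here as a default, not a recommendation.

```python
def burn_rate(errors: int, requests: int, slo_target: float = 0.999) -> float:
    """How fast the error budget is being spent; 1.0 means exactly on budget."""
    if requests == 0:
        return 0.0  # no traffic in the window, nothing burned
    error_budget = 1.0 - slo_target
    return (errors / requests) / error_budget


def should_page(errors: int, requests: int, threshold: float = 14.4) -> bool:
    """Page only when the window burns budget faster than the threshold."""
    return burn_rate(errors, requests) >= threshold
```

With these defaults, 1 error in 1000 requests sits at a burn rate of about 1 and stays quiet, while 20 errors in 1000 requests (about 20) would page.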
Validate (validation loop)
| ❌ Anti-pattern | ✅ Correct |
|---|---|
| console.log everywhere | Structured JSON logger with traceId |
| Logs/metrics/traces with no shared id | One propagated W3C trace context |
| Averages only | p95/p99 (tail latency is what users feel) |
| RED applied to infra / USE to endpoints | RED→services, USE→resources |
| Alert on every error | Alert on SLO burn / threshold breach |
| Silent catch (no record) | Record every handled error with context |
| userId / requestId / email in metric labels (TSDB explosion) | Low-cardinality labels (route, status, region); per-request detail in logs/traces |
| 100 % trace capture in production | Head- or tail-based sampling; always-on for errors + slow traces |
| Trusting SDK auto-captured payloads to be PII-free | beforeSend scrubbing + platform data scrubbing; treat the pipeline as a trust boundary |
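The first two rows of the table (a structured logger carrying traceId, and one propagated W3C trace context) can be sketched with the standard library alone. A version-00 traceparent header has the form `version-traceid-parentid-flags`. The helper names and the example route are hypothetical, header keys are assumed to be lowercased already, and a real service would normally let an OpenTelemetry SDK handle parsing and propagation.

```python
import json
import re
import secrets
from datetime import datetime, timezone

# W3C Trace Context, version 00: 2 hex - 32 hex - 16 hex - 2 hex.
TRACEPARENT_RE = re.compile(r"00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def trace_id_from_headers(headers: dict) -> str:
    """Accept the caller's trace id if valid, otherwise start a new trace."""
    match = TRACEPARENT_RE.fullmatch(headers.get("traceparent", ""))
    if match and match.group(1) != "0" * 32:  # an all-zero trace id is invalid
        return match.group(1)
    return secrets.token_hex(16)  # 16 random bytes -> 32 hex characters


def log_line(level: str, msg: str, trace_id: str, **context) -> str:
    """Render one JSON log record that carries the trace id."""
    record = {
        "level": level,
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "traceId": trace_id,
        "msg": msg,
        **context,
    }
    return json.dumps(record)


# Usage: resolve the id once at the edge, then pass it to every log call.
incoming = {"traceparent": "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01"}
trace_id = trace_id_from_headers(incoming)
print(log_line("info", "order created", trace_id, route="/orders"))
```

Because every record carries the same traceId as the span context, a log search by that id returns the same request that the trace view shows.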
| Tier | Examples | Action SLA |
|---|---|---|
| Critical | No error tracking in production; silent catches hiding failures; no correlation id (can't trace incidents) | Block release; fix immediately |
| Major | Metrics on averages only (no p99); traces not propagated to jobs/queues; PII / auth headers leaking into traces or breadcrumbs (no beforeSend scrubbing) | Fix this sprint |
| Minor | Alert noise (per-error alerts); missing USE metrics on a resource; metric cardinality unbudgeted (label explosion risk); trace sampling rate unset | Schedule within 2 sprints |
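The PII and auth-header leak listed in the Major tier is usually closed with a scrubbing hook that runs before an event leaves the process. A minimal sketch, assuming events are plain nested dicts and lists: the `scrub_event` name and the key list are hypothetical, and the function would be wired into whatever pre-send hook the error tracker provides (Sentry names it `beforeSend` in its JavaScript SDK and `before_send` in its Python SDK).

```python
# Assumed deny-list; extend it to match what the service actually handles.
SENSITIVE_KEYS = {"authorization", "cookie", "set-cookie", "password", "email", "token"}
REDACTED = "[redacted]"


def scrub_event(value):
    """Return a copy of a nested dict/list with sensitive keys redacted.

    Keys are matched case-insensitively; the input is not mutated.
    """
    if isinstance(value, dict):
        return {
            key: REDACTED if str(key).lower() in SENSITIVE_KEYS else scrub_event(item)
            for key, item in value.items()
        }
    if isinstance(value, list):
        return [scrub_event(item) for item in value]
    return value
```

A key deny-list only catches values stored under known names; free-text fields such as messages can still carry PII, which is why the source pairs this hook with platform-side data scrubbing.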
- beforeSend + platform data scrubbing)
- docs/observability-alerts.md: SLOs, thresholds, runbooks
- feat(obs): wire OpenTelemetry tracing / feat(obs): RED metrics for <service>

- @opentelemetry/sdk-node + auto-instrumentations (HTTP, Prisma, Redis, BullMQ); export to Tempo/Jaeger/Datadog
- pino (JSON) with a traceId field from the active span context
- prom-client exposing /metrics for Prometheus; Grafana dashboards
- @sentry/node with tracesSampleRate; attach traceId
- opentelemetry-instrumentation-fastapi; structlog; prometheus_client
- go.opentelemetry.io/otel; slog (stdlib structured logging); prometheus/client_golang

- performance-profiling: traces/metrics surface the bottlenecks profiling then drills into
- background-jobs: jobs need the same correlation context as requests
- ai-llm-backend: token/cost/latency are AI-specific metrics on this backbone