jaykim88/observability-setup

Name: observability-setup
Author: jaykim88

plugins/backend-toolkit/skills/observability-setup/SKILL.md

npx skillsauth add jaykim88/claude-ai-engineering observability-setup

Observability Setup (Backend)

Purpose

Turn production silence into signal by emitting three correlated signals — logs, metrics, traces — joined by one propagated context, so any request can be reconstructed end-to-end and any regression is visible.

Universal: the three-signals model, the RED/USE methods, and trace-context propagation are vendor-neutral (OpenTelemetry is a CNCF project and the de facto instrumentation standard) and apply equally to Node, Python, and Go.

Procedure

Propagate ONE correlation context
- Generate or accept a traceId (carried in the W3C Trace Context traceparent header) at the edge
- Include it in every log line, span, and downstream call (incl. background jobs + queue messages)
- This is what lets logs/metrics/traces join — without it you have three disconnected silos
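
The propagation step above can be sketched as follows. This is a minimal illustration, assuming version 00 of the W3C traceparent header and a plain dict of lowercased request headers; the function names (parse_traceparent, accept_or_start_trace, outgoing_headers) are hypothetical and not taken from any SDK. In practice an OpenTelemetry propagator would normally do this work.

```python
import re
import secrets

# traceparent (version 00): version-traceid-parentid-flags, all lowercase hex.
_TRACEPARENT = re.compile(r"([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})")


def parse_traceparent(header):
    """Return (trace_id, parent_span_id, flags), or None if the header is invalid."""
    if not header:
        return None
    match = _TRACEPARENT.fullmatch(header.strip())
    if match is None:
        return None
    version, trace_id, span_id, flags = match.groups()
    # Version ff and all-zero trace or span ids are invalid.
    if version == "ff" or set(trace_id) == {"0"} or set(span_id) == {"0"}:
        return None
    return trace_id, span_id, flags


def accept_or_start_trace(headers):
    """Edge entry point: continue the caller's trace, or start a new one."""
    parsed = parse_traceparent(headers.get("traceparent"))
    if parsed is None:
        trace_id, flags = secrets.token_hex(16), "01"
    else:
        trace_id, _parent_span_id, flags = parsed
    # Every hop gets its own span id; the trace id stays the same.
    return {"trace_id": trace_id, "span_id": secrets.token_hex(8), "flags": flags}


def outgoing_headers(ctx):
    """Headers to attach to downstream HTTP calls, job payloads, and queue messages."""
    return {"traceparent": f"00-{ctx['trace_id']}-{ctx['span_id']}-{ctx['flags']}"}
```

The same outgoing_headers value can be stored in a job payload or message envelope, so a worker can call accept_or_start_trace on it and continue the trace instead of starting a new one.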
Structured logging (JSON)
- { level, timestamp, traceId, route, userId(anonymized), msg, ...context }
- No console.log; one logger utility; levels (debug/info/warn/error)
- Never log secrets/PII
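
A minimal sketch of the logging rules above, using only the Python standard library. The names trace_id_var, REDACT_KEYS, JsonFormatter, and build_logger are hypothetical, the deny-list is illustrative rather than complete, and a real service would more likely use structlog (or pino on Node) with the trace id read from the active span.

```python
import contextvars
import io
import json
import logging
from datetime import datetime, timezone

# Set once per request (or job) so every log line picks it up automatically.
trace_id_var = contextvars.ContextVar("trace_id", default=None)

# Illustrative deny-list; a real one would be longer and reviewed.
REDACT_KEYS = {"password", "token", "authorization", "email"}


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed set of core fields."""

    def format(self, record):
        context = getattr(record, "context", {})
        entry = {
            key: ("[REDACTED]" if key.lower() in REDACT_KEYS else value)
            for key, value in context.items()
        }
        # Core fields are written last so caller context cannot overwrite them.
        entry.update(
            level=record.levelname.lower(),
            timestamp=datetime.fromtimestamp(record.created, tz=timezone.utc).isoformat(),
            traceId=trace_id_var.get(),
            msg=record.getMessage(),
        )
        return json.dumps(entry, default=str)


def build_logger(stream):
    """The single logger utility the rest of the code base imports."""
    logger = logging.getLogger("app")
    logger.setLevel(logging.INFO)
    logger.handlers.clear()
    logger.propagate = False
    handler = logging.StreamHandler(stream)
    handler.setFormatter(JsonFormatter())
    logger.addHandler(handler)
    return logger


buffer = io.StringIO()
log = build_logger(buffer)
trace_id_var.set("4bf92f3577b34da6a3ce929d0e0e4736")
log.info("order created", extra={"context": {"route": "/orders", "token": "s3cr3t"}})
print(buffer.getvalue(), end="")
```

A context variable is used here so the trace id follows the request through async code without being passed to every call.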
Metrics — RED for services, USE for resources
- RED (services/endpoints): Rate, Errors, Duration (p50/p95/p99)
- USE (resources: DB pool, CPU, queue): Utilization, Saturation, Errors
- Apply RED to request handlers, USE to infrastructure — different questions, different methods
- Cardinality budget: every unique label combination is a stored time series — putting userId, requestId, email, or unbounded ids in labels explodes the TSDB. Keep labels low-cardinality (route, status code, region); push per-request detail into logs/traces, not metrics
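
To make the RED and cardinality points concrete, here is a small in-memory sketch. It is not how prom-client or prometheus_client work internally; RedMetrics and ALLOWED_LABELS are hypothetical names, percentiles use the nearest-rank method, and treating 5xx statuses as errors is an assumption.

```python
from collections import defaultdict

# The label allow-list is the cardinality budget: route x status x region only.
ALLOWED_LABELS = ("route", "status", "region")


class RedMetrics:
    def __init__(self):
        # One entry per unique label combination, i.e. one stored time series.
        self._durations = defaultdict(list)

    def observe(self, duration_s, **labels):
        unknown = set(labels) - set(ALLOWED_LABELS)
        if unknown:
            # userId, requestId, email and similar ids belong in logs and traces.
            raise ValueError(f"labels outside the cardinality budget: {sorted(unknown)}")
        key = tuple(labels.get(name, "") for name in ALLOWED_LABELS)
        self._durations[key].append(duration_s)

    def series_count(self):
        return len(self._durations)

    def summary(self, route):
        """Requests, errors, and p50/p95/p99 duration for one route."""
        samples, errors = [], 0
        for (series_route, status, _region), durations in self._durations.items():
            if series_route != route:
                continue
            samples.extend(durations)
            if status.startswith("5"):
                errors += len(durations)
        if not samples:
            return None
        samples.sort()

        def percentile(p):
            # Nearest-rank: the smallest sample with at least p% of samples at or below it.
            rank = max(1, (p * len(samples) + 99) // 100)
            return samples[rank - 1]

        return {
            "requests": len(samples),  # divide by the window length to get Rate
            "errors": errors,
            "p50": percentile(50),
            "p95": percentile(95),
            "p99": percentile(99),
        }
```

Rejecting unknown labels at the call site turns the budget into something a code review or test can enforce, rather than something discovered when the TSDB bill arrives.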
Distributed tracing (OpenTelemetry) — and sample it
- Auto-instrument the framework + DB + HTTP client; add manual spans around key business operations
- Spans carry the traceId; a trace shows the full request waterfall (where time went)
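- Sampling: 100% trace capture is unaffordable at scale. Choose: head-based (decide at request start, simplest), tail-based (decide after the trace completes; keeps all errors + slow traces, drops the boring middle), or adaptive. Default to head-based at a low rate + always-on for errors. A head-based decision is made before the outcome is known, so the always-on-for-errors part needs a tail-based stage (for example in a collector) or an error tracker that captures failures independently of the trace sampler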
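
The two sampling styles can be contrasted in a few lines. This is a sketch under stated assumptions: spans are plain dicts with trace_id, error, and duration_ms keys, a trace has at least one span, and the function names are hypothetical. The head decision is similar in spirit to a ratio-based sampler keyed on the trace id, but it is not a copy of any SDK's algorithm.

```python
def head_sample(trace_id, rate):
    """Head-based: decide at request start from the trace id alone.

    The decision is deterministic, so every service that sees the same
    trace id makes the same keep-or-drop choice.
    """
    # Treat the low 64 bits of the trace id as a uniform number in [0, 2**64).
    return int(trace_id[-16:], 16) < int(rate * 2**64)


def tail_sample(spans, slow_ms, baseline_rate):
    """Tail-based: decide once the whole trace is available."""
    if any(span["error"] for span in spans):
        return True  # always keep failed traces
    if max(span["duration_ms"] for span in spans) >= slow_ms:
        return True  # always keep slow traces
    # The uneventful majority is sampled at the baseline rate.
    return head_sample(spans[0]["trace_id"], baseline_rate)
```

Only tail_sample can inspect the error flag and duration, which is why keeping every failed or slow trace requires buffering complete traces somewhere before the decision is made.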
Capture errors with context — and scrub PII
- Errors → error tracker (Sentry) WITH traceId + anonymized user context
- Any catch that returns gracefully must still record the error (silent catch = invisible failure)
- The SDK captures more than you set: request URLs, headers, breadcrumbs (and span attributes in tracing) can leak tokens, auth headers, and PII. Configure beforeSend scrubbing + platform-side data scrubbing; treat the trace/log/error pipeline as the same trust boundary as your API responses
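
A scrubbing hook along the lines described above might look like this. It is a sketch, not Sentry's API: the event is assumed to be a nested dict shaped loosely like an error-tracker payload, SENSITIVE_KEYS is an illustrative deny-list, and the function is only named before_send to mirror the hook it would be registered as (the Python SDK passes an event and a hint to that callback).

```python
from urllib.parse import parse_qsl, urlencode, urlsplit, urlunsplit

# Illustrative deny-list, matched case-insensitively against dict keys and query parameters.
SENSITIVE_KEYS = {
    "authorization", "cookie", "set-cookie", "x-api-key",
    "password", "token", "access_token", "email",
}
MASK = "REDACTED"


def scrub_url(url):
    """Mask sensitive query parameters while keeping the rest of the URL."""
    parts = urlsplit(url)
    query = [
        (key, MASK if key.lower() in SENSITIVE_KEYS else value)
        for key, value in parse_qsl(parts.query, keep_blank_values=True)
    ]
    return urlunsplit(parts._replace(query=urlencode(query)))


def scrub(value, key=""):
    """Recursively copy a payload, masking sensitive keys and URL parameters."""
    if key.lower() in SENSITIVE_KEYS:
        return MASK
    if isinstance(value, dict):
        return {k: scrub(v, str(k)) for k, v in value.items()}
    if isinstance(value, list):
        return [scrub(item) for item in value]
    if isinstance(value, str) and value.startswith(("http://", "https://")):
        return scrub_url(value)
    return value


def before_send(event, hint=None):
    """Return a scrubbed copy of the event; the original is left untouched."""
    return scrub(event)
```

The same scrub function can be applied to span attributes and log context, which keeps one deny-list for the whole trace/log/error pipeline.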
Alert on patterns, not noise
- Alert on RED/USE threshold breaches (error rate spike, p99 regression, queue saturation), not every error
- Define SLOs; alert on burn rate
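
Burn rate is the observed error rate divided by the error budget (1 minus the SLO target); a burn rate of 1 spends the budget exactly over the SLO period. The sketch below assumes a two-window check and uses 14.4 as the paging threshold, a value commonly cited from the Google SRE Workbook for a 1-hour window against a 30-day SLO (it spends 2% of the budget in that hour). The function names and the (errors, requests) tuple shape are hypothetical.

```python
def burn_rate(errors, requests, slo_target):
    """How fast the error budget is being spent relative to plan."""
    if requests == 0:
        return 0.0
    error_budget = 1.0 - slo_target  # e.g. 0.001 for a 99.9% SLO
    return (errors / requests) / error_budget


def should_page(long_window, short_window, slo_target, threshold=14.4):
    """Page only if both windows burn fast.

    The long window shows the problem is significant; the short window
    shows it is still happening. Each window is (errors, requests).
    """
    return (
        burn_rate(*long_window, slo_target) >= threshold
        and burn_rate(*short_window, slo_target) >= threshold
    )
```

A handful of errors in an otherwise healthy hour produces a burn rate far below the threshold, so it does not page anyone, which is the difference between alerting on patterns and alerting on every error.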
Validate (validation loop)
- Trigger a test error + a slow request → verify you can find them by traceId across logs + trace + error tracker
- If the three signals don't join on traceId → context propagation is broken; fix and re-test
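
The validation loop can be automated as a smoke test. The sketch below assumes the three backends have already been queried and their results loaded as lists of dicts; the field names (traceId on log lines, trace_id on spans, a traceId tag on error events) and the function name are hypothetical and would need to match the real exporters.

```python
def missing_signals(trace_id, log_lines, spans, error_events):
    """Return the names of the signals in which the trace id was NOT found."""
    found = {
        "logs": any(line.get("traceId") == trace_id for line in log_lines),
        "trace": any(span.get("trace_id") == trace_id for span in spans),
        "errors": any(
            event.get("tags", {}).get("traceId") == trace_id for event in error_events
        ),
    }
    return [name for name, present in found.items() if not present]
```

An empty result means the test error and slow request can be followed across all three signals; any name in the list points at the place where context propagation broke.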

Anti-patterns

| ❌ Anti-pattern | ✅ Correct |
|---|---|
| console.log everywhere | Structured JSON logger with traceId |
| Logs/metrics/traces with no shared id | One propagated W3C trace context |
| Averages only | p95/p99 (tail latency is what users feel) |
| RED applied to infra / USE to endpoints | RED→services, USE→resources |
| Alert on every error | Alert on SLO burn / threshold breach |
| Silent catch (no record) | Record every handled error with context |
| userId / requestId / email in metric labels (TSDB explosion) | Low-cardinality labels (route, status, region); per-request detail in logs/traces |
| 100% trace capture in production | Head- or tail-based sampling; always-on for errors + slow traces |
| Trusting SDK auto-captured payloads to be PII-free | beforeSend scrubbing + platform data scrubbing; treat the pipeline as a trust boundary |

Severity tiers

| Tier | Examples | Action SLA |
|---|---|---|
| Critical | No error tracking in production; silent catches hiding failures; no correlation id (can't trace incidents) | Block release; fix immediately |
| Major | Metrics on averages only (no p99); traces not propagated to jobs/queues; PII / auth headers leaking into traces or breadcrumbs (no beforeSend scrubbing) | Fix this sprint |
| Minor | Alert noise (per-error alerts); missing USE metrics on a resource; metric cardinality unbudgeted (label explosion risk); trace sampling rate unset | Schedule within 2 sprints |

Completion Criteria

[ ] One correlation context propagated through requests + jobs + queues
[ ] Structured JSON logging (no console.log; traceId on every line)
[ ] RED metrics on services, USE on resources, p95/p99 tracked
[ ] OpenTelemetry tracing wired (auto + key manual spans)
[ ] Errors captured with traceId; SLO-based alerts
[ ] Trace sampling rate set (head- or tail-based); always-on for errors / slow traces
[ ] Metric labels low-cardinality (no userId / unbounded ids); a cardinality budget documented
[ ] PII auto-capture scrubbed (beforeSend + platform data scrubbing)
[ ] Three signals verified to join on traceId

Output

Instrumentation: OTel setup + structured logger + metrics
Dashboards: RED + USE
Alert config: docs/observability-alerts.md — SLOs, thresholds, runbooks
Commit format: feat(obs): wire OpenTelemetry tracing / feat(obs): RED metrics for <service>

Implementation

TypeScript + NestJS (default)

Tracing: @opentelemetry/sdk-node + auto-instrumentations (HTTP, Prisma, Redis, BullMQ); export to Tempo/Jaeger/Datadog
Logging: pino (JSON) with a traceId field from the active span context
Metrics: prom-client exposing /metrics for Prometheus; Grafana dashboards
Errors: @sentry/node with tracesSampleRate; attach traceId
Propagate traceId into BullMQ job data

Other stacks

Python / FastAPI: opentelemetry-instrumentation-fastapi; structlog; prometheus_client
Go: go.opentelemetry.io/otel; slog (stdlib structured logging); prometheus/client_golang
Universal: OpenTelemetry + W3C Trace Context are the cross-language standard; RED/USE are methodologies, not tools

Related skills

performance-profiling — traces/metrics surface the bottlenecks profiling then drills into
background-jobs — jobs need the same correlation context as requests
ai-llm-backend — token/cost/latency are AI-specific metrics on this backbone

Reference

Key insight encoded: Propagate one correlation context (traceId in every log + W3C Trace Context header) so logs/metrics/traces join; apply RED to services, USE to infrastructure. Three cost-and-safety gates often missed: trace sampling (100% is unaffordable at scale), a metric-cardinality budget (one bad label explodes the TSDB), and SDK-level PII scrubbing — the pipeline auto-captures more than you set.

Description: Instrument a backend with the three signals unified by one correlation context: structured logs, metrics (RED for services, USE for resources), and distributed tracing (OpenTelemetry + W3C Trace Context). Use before production, when debugging is blind, or when an incident has no trail. Not for diagnosing specific bottlenecks (use performance-profiling) or AI-specific token/cost metrics (use ai-llm-backend on top of this backbone).

Category: development

Updated: Jun 9, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

Trivy, container and dependency vulnerability scanner: Clean, 95%
Semgrep, static code analysis for vulnerabilities: Clean, 95%
mcp-scan (Snyk), Model Context Protocol security validation: Clean, 95%
Snyk (dep), open source security scanning: Skipped, 50%
Socket.dev, supply chain security analysis: Skipped, 50%
VirusTotal, multi-engine malware detection: Skipped, 50%
CrowdStrike, advanced threat intelligence: Skipped, 50%
OSV-Scanner, Open Source Vulnerability database check: Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Jun 10, 2026, 2:48 AM (106.3s, 1 file scanned)

SKILL.md

license: MIT

Related Skills

jaykim88/webhook-design
Category: development
Design webhooks correctly on both sides: sending (HMAC signing, retries with backoff, at-least-once) and receiving (verify signature on raw body, enqueue + 200 fast, dedupe on event id). Use when adding webhook delivery or consuming a provider's webhooks. Not for internal service-to-service events (use async-messaging) or general outbound-call retry policy (use resilience-patterns).
SKILL.md, updated Jun 9, 2026

jaykim88/transaction-management
Category: testing
Use transactions and isolation levels correctly: keep them short, no network calls inside, explicit isolation, retry on serialization conflicts, and choose optimistic vs pessimistic locking. Use when a write spans multiple tables, when concurrent updates corrupt data, or when designing money/inventory flows. Not for cross-service event delivery (use async-messaging Outbox) or schema-level constraints (use schema-design).
SKILL.md, updated Jun 9, 2026

jaykim88/test-strategy
Category: development
Backend testing pyramid: unit for pure logic, integration against a real DB (Testcontainers), and consumer-driven contract testing (Pact) for service boundaries. Use before a feature, after a bug fix, or when services break each other on deploy. Not for load testing (use performance-profiling) or security testing (use backend-security-audit).
SKILL.md, updated Jun 9, 2026

jaykim88/schema-design
Category: data-ai
Design a relational schema: normalize to 3NF then denormalize with justification, choose the right Postgres index type per data shape, enforce constraints at the DB. Use when modeling a new domain, when queries are slow, or before a migration. Not for diagnosing slow queries (use query-optimization) or shipping the change without downtime (use migration-strategy).
SKILL.md, updated Jun 9, 2026

Install manually

# Clone the repo
git clone https://github.com/jaykim88/claude-ai-engineering.git

# Copy into Claude Code skills folder (global)
cp -r claude-ai-engineering/plugins/backend-toolkit/skills/observability-setup ~/.claude/skills/

Repository: jaykim88/claude-ai-engineering

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT