skills/logging-observability/SKILL.md
Structured logging, distributed tracing, and metrics for production applications. [What: OpenTelemetry setup, log level strategy, correlation IDs, SLI/SLO alerting thresholds, Grafana dashboard design, PagerDuty integration] [When: setting up production logging, adding observability to a service, debugging distributed systems, designing alerting, implementing traces/metrics/logs] [Keywords: logging, observability, OpenTelemetry, OTel, structured logs, distributed tracing, correlation ID, metrics, Grafana, Prometheus, PagerDuty, Winston, Pino, structlog, log levels, SLI, SLO, alerting] NOT for application performance profiling (use a profiler), load testing, or database query optimization.
npx skillsauth add curiositech/windags-skills logging-observability
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Structured logging, distributed tracing, and metrics for production systems. Covers the full observability stack from log formatting to alert routing.
1. Log Level Assignment by Event Type
Event occurs →
├── System failure?
│   ├── YES → Service cannot continue?
│   │   ├── YES → FATAL (page immediately)
│   │   └── NO → ERROR (operation failed, will retry)
│   └── NO → Unexpected condition?
│       ├── YES → WARN (circuit breaker, deprecation)
│       └── NO → Business event?
│           ├── YES → INFO (user action, payment processed)
│           └── NO → Debug helper?
│               ├── YES → DEBUG (DB queries, cache hits)
│               └── NO → TRACE (spans, fine-grained flow)
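The tree above can be sketched as a small function. This is only an illustration: the boolean flags on the `event` object are hypothetical names that mirror the questions in the tree, and the returned strings follow pino's lowercase level names.

```javascript
// Hypothetical event shape: each flag answers one question from the tree.
function chooseLogLevel(event) {
  if (event.systemFailure) {
    // FATAL only when the service cannot continue; otherwise ERROR.
    return event.serviceCannotContinue ? 'fatal' : 'error';
  }
  if (event.unexpectedCondition) return 'warn'; // circuit breaker, deprecation
  if (event.businessEvent) return 'info';       // user action, payment processed
  if (event.debugHelper) return 'debug';        // DB queries, cache hits
  return 'trace';                               // fine-grained flow
}
```

A call site could then use `logger[chooseLogLevel(event)](fields, message)`, assuming the logger exposes a method per level.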
2. Observability Stack Choice by Scale
Request volume →
├── < 1000/min → Structured logs + simple metrics
├── < 10k/min → Add distributed tracing (10% sampling)
├── < 100k/min → Full OTel + head-based sampling
└── > 100k/min → Tail-based sampling + cardinality limits
3. Alert Threshold Setting
SLI established →
├── User-facing service?
│   ├── YES → Start with 99% SLO (about 7.2hr/month error budget; 44min/month corresponds to 99.9%)
│   └── NO → Start with 95% SLO (36hr/month error budget)
└── Historical data available?
    ├── YES → Set threshold at 95th percentile of normal operation
    └── NO → Set conservative threshold, tune weekly for 1 month
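The error budget that goes with an availability SLO is plain arithmetic, so it is worth checking rather than memorizing. A minimal sketch, assuming a 30-day month (the function name is illustrative):

```javascript
// Minutes of allowed unavailability for an SLO over a window of `days` days.
function errorBudgetMinutes(sloPercent, days = 30) {
  const totalMinutes = days * 24 * 60; // 43,200 minutes in 30 days
  return (totalMinutes * (100 - sloPercent)) / 100;
}

// Roughly: 95% gives 2,160 min (36 hr), 99% gives 432 min (7.2 hr),
// and 99.9% gives about 43 min.
```

Each extra nine shrinks the budget tenfold, which is why the starting SLO deserves care before alert thresholds are derived from it.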
4. Trace Sampling Decision
Performance impact →
├── Latency sensitive service?
│   ├── YES → 1-5% sampling rate
│   └── NO → 10-20% sampling rate
└── Error debugging needed?
    ├── YES → Always sample errors (status=error)
    └── NO → Uniform probability sampling
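The two branches above combine into one rule: keep every errored trace, and keep the rest with a fixed probability. The sketch below is a hypothetical helper, not an OpenTelemetry API; the `span.status` field and the injectable `random` parameter are assumptions made for illustration and testing.

```javascript
// `rate` is the sampling probability, e.g. 0.01-0.05 for latency-sensitive
// services and 0.10-0.20 otherwise.
function shouldSample(span, rate, random = Math.random) {
  if (span.status === 'error') return true; // always keep errors
  return random() < rate;                   // uniform probability sampling
}
```

Note that a span's error status is only known once it ends, so a rule like this would have to run at span end or in a tail-based sampling stage rather than when the trace starts.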
5. PII Handling Strategy
Field contains sensitive data →
├── Required for debugging?
│   ├── YES → Hash or tokenize (preserve cardinality)
│   └── NO → Complete redaction
└── Regulatory compliance?
    ├── GDPR/CCPA → Allowlist approach only
    └── PCI → Redact payment fields specifically
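One way to "hash or tokenize" while preserving cardinality is a keyed hash: the same input always maps to the same token, so records can still be grouped, but the raw value never reaches the logs. The sketch below combines that with an allowlist. The function names, the field lists, and the 16-character truncation are all illustrative assumptions, and the key is assumed to be a secret managed outside the code.

```javascript
const { createHmac } = require('node:crypto');

// Keyed hash of a value; truncated to 16 hex characters to keep logs compact.
function tokenize(value, key) {
  return createHmac('sha256', key).update(String(value)).digest('hex').slice(0, 16);
}

// Allowlist approach: fields in `allow` pass through, fields in `hash`
// are replaced by a token, and every other field is dropped entirely.
function sanitize(record, { allow = [], hash = [] }, key) {
  const out = {};
  for (const [field, value] of Object.entries(record)) {
    if (allow.includes(field)) out[field] = value;
    else if (hash.includes(field)) out[field] = tokenize(value, key);
  }
  return out;
}
```

For example, `sanitize({ email, plan, ssn }, { allow: ['plan'], hash: ['email'] }, key)` would keep `plan`, tokenize `email`, and drop `ssn`.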
1. Alert Fatigue
2. PII Leakage
"password":, "ssn":, credit card regex
3. Trace Orphaning
traceparent header propagation on outbound HTTP calls
4. Log-and-Throw Duplication
5. Cardinality Explosion
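For the trace orphaning item, the `traceparent` header follows the W3C Trace Context format `version-traceid-parentid-flags`. OpenTelemetry instrumentation normally propagates it automatically; the sketch below only illustrates what propagation on an outbound call amounts to. The function names are hypothetical.

```javascript
// Version 00 format: 32 hex chars of trace ID, 16 of parent span ID, 2 of flags.
const TRACEPARENT_RE = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header) {
  const match = TRACEPARENT_RE.exec(header ?? '');
  if (!match) return null;
  const [, traceId, parentId, flags] = match;
  // All-zero IDs are invalid per the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null;
  return { traceId, parentId, flags };
}

// Outbound call: keep the trace ID, replace the parent with this
// service's current span ID so the downstream span attaches correctly.
function outboundHeaders(incomingTraceparent, currentSpanId) {
  const ctx = parseTraceparent(incomingTraceparent);
  if (!ctx) return {};
  return { traceparent: `00-${ctx.traceId}-${currentSpanId}-${ctx.flags}` };
}
```

If the header is dropped anywhere along the chain, the downstream service starts a fresh trace and its spans become orphans.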
Scenario: Payment service returning 500s sporadically. Need to trace through API Gateway → Payment Service → Bank API.
Step 1: Trace ID Recovery
# Customer reports failed payment at 14:35 UTC
# Find trace ID from customer-facing logs
grep -A5 -B5 "payment_failed" /var/log/api-gateway.log | grep "14:3[0-9]"
# Extract: trace_id: "abc123def456"
Step 2: Cross-Service Trace Following
# Follow trace through each service
kubectl logs deployment/payment-service | grep "abc123def456"
# Shows: bank_api_call_failed, status_code: 502, bank_error: "insufficient_funds"
# Verify bank API logs (if accessible)
curl -H "X-Trace-ID: abc123def456" https://bank-api/logs
Decision Point: Sampling trade-off encountered
Step 3: Root Cause Analysis
// Found in payment service code
logger.error({
  trace_id,
  bank_response_code: 502,
  bank_error: "insufficient_funds",
  our_retry_count: 3
}, "payment_processing_failed");
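Resolution: The bank API returns 502 for business logic errors such as insufficient funds, so the payment service retried a request that could never succeed. Change the error handling to inspect bank_error and return 400 to the caller for business errors, rather than retrying every 502.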
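The resolution can be sketched as a classification step that runs before any retry logic. This is an illustration under assumptions: the bank is assumed to return a machine-readable code in `bank_error`, and both the function name and the list of non-retryable codes are hypothetical.

```javascript
// Business errors that retrying cannot fix (illustrative list).
const BUSINESS_ERRORS = new Set(['insufficient_funds']);

function classifyBankFailure({ statusCode, bankError }) {
  if (BUSINESS_ERRORS.has(bankError)) {
    // Not a transient fault: surface a client error, do not retry.
    return { retry: false, responseStatus: 400 };
  }
  if (statusCode >= 500) {
    // Genuine upstream failure: retrying is reasonable.
    return { retry: true, responseStatus: 502 };
  }
  return { retry: false, responseStatus: statusCode };
}
```

Checking the error code before the status code matters here, because the status code alone (502) is what misled the original retry logic.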
Node.js Payment Service Implementation:
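import { AsyncLocalStorage } from 'node:async_hooks';
import { randomUUID } from 'node:crypto';
import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  redact: {
    paths: ['req.headers.authorization', 'body.cardNumber', '*.ssn'],
    censor: '[REDACTED]'
  }
});

// Per-request context store, readable from any async code the request triggers
const requestContext = new AsyncLocalStorage();

// Correlation middleware (Express-style signature)
export function correlationMiddleware(req, res, next) {
  // Reuse the caller's trace ID if present, otherwise start a new one
  const traceId = req.headers['x-trace-id'] ?? randomUUID();
  res.setHeader('x-trace-id', traceId);

  // Run the rest of the request inside the AsyncLocalStorage context
  requestContext.run({ traceId }, () => {
    logger.info({
      traceId,
      method: req.method,
      path: req.path,
      userAgent: req.headers['user-agent']
    }, 'request_received');
    next();
  });
}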
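The point of running the request inside `requestContext` is that code further down the call chain can recover the trace ID without having it passed as an argument. A minimal sketch of that read side follows. It declares its own `AsyncLocalStorage` instance and uses `require` so that it runs standalone as a script; in the service it would share the store the middleware uses. The helper name `withTraceId` is hypothetical.

```javascript
const { AsyncLocalStorage } = require('node:async_hooks');

// Stand-in for the store the middleware populates.
const requestContext = new AsyncLocalStorage();

// Merge the current request's traceId into a set of log fields.
// Outside a request context the fields are returned unchanged.
function withTraceId(fields) {
  const store = requestContext.getStore();
  return store ? { traceId: store.traceId, ...fields } : fields;
}
```

A handler could then call `logger.info(withTraceId({ orderId }), 'payment_started')`. pino's `mixin` option may be a way to apply the same merge to every log call, though that wiring is not shown here.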
Alert Noise Syndrome
Schema Drift
Sampling Blind Spots
This skill handles: Production observability, structured logging, distributed tracing, alerting strategy
Delegate elsewhere:
performance-optimization skill instead
infrastructure-scaling skill
database-architect skill
security-architect skill
cost-optimization skill