skills/observability-apm-expert/SKILL.md
OpenTelemetry, distributed tracing, Grafana, and Datadog for full-stack observability. Activate on: observability, tracing, OpenTelemetry, Grafana, Datadog, metrics, logging, APM, SLO, alerting. NOT for: application error handling (use relevant language skill), security monitoring (use relevant security skill).
Implement comprehensive observability with distributed tracing, metrics, structured logging, and SLO-based alerting using OpenTelemetry and modern backends.
Error Rate Analysis:
├── Error rate < 0.1%
│   ├── Low-volume service (< 10k spans/min) → 100% sampling
│   └── High-volume service (≥ 10k spans/min) → Tail-based sampling
│       ├── Keep all error traces (100%)
│       ├── Keep slow traces > P95 latency (100%)
│       └── Sample successful traces (1-10%)
└── Error rate ≥ 0.1%
    ├── Critical service → Keep all errors + 50% successful
    └── Non-critical service → Keep all errors + 10% successful
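The sampling tree above can be sketched as a small pure function. This is only an illustration: the type and function names are hypothetical, the 5% success ratio is an assumed midpoint of the 1-10% range, and a service at exactly 10k spans/min is treated as high volume.

```typescript
// Hypothetical types; thresholds are copied from the decision tree above.
type SamplingPolicy =
  | { kind: "head"; ratio: number }
  | { kind: "tail"; keepErrors: true; keepSlowerThan?: "p95"; successRatio: number };

interface ServiceProfile {
  errorRate: number;      // fraction of failed requests, e.g. 0.001 for 0.1%
  spansPerMinute: number; // span volume, not metric cardinality
  critical: boolean;
}

const ERROR_RATE_THRESHOLD = 0.001;      // 0.1%
const HIGH_VOLUME_SPANS_PER_MIN = 10000; // 10k spans/min

function chooseSamplingPolicy(p: ServiceProfile): SamplingPolicy {
  if (p.errorRate < ERROR_RATE_THRESHOLD) {
    if (p.spansPerMinute < HIGH_VOLUME_SPANS_PER_MIN) {
      // Healthy and cheap enough to keep every trace.
      return { kind: "head", ratio: 1.0 };
    }
    // Healthy but busy: keep errors and slow traces, sample the rest.
    // 0.05 is an assumed value inside the 1-10% range.
    return { kind: "tail", keepErrors: true, keepSlowerThan: "p95", successRatio: 0.05 };
  }
  // Elevated error rate: keep all errors plus a share of successful traces.
  return { kind: "tail", keepErrors: true, successRatio: p.critical ? 0.5 : 0.1 };
}

console.log(chooseSamplingPolicy({ errorRate: 0.02, spansPerMinute: 50000, critical: true }));
```

Keeping the thresholds as named constants makes it easier to tune them per environment without touching the branching logic.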
If self-hosted tolerance = high AND cost sensitivity = high:
├── Use Grafana stack (Tempo + Mimir + Loki)
└── Export via OTLP to unified collector
If operational overhead tolerance = low OR compliance = strict:
├── Cloud vendors (Datadog, New Relic, Honeycomb)
└── Direct SDK exports + OTLP fallback
If hybrid requirements:
├── Critical services → SaaS backend
└── Development/staging → Self-hosted stack
For each SLO:
├── Define error budget (e.g., 99.9% over 30 days = 43.2 min of allowed downtime)
├── Calculate burn rates:
│   ├── Fast burn (14.4x) over 1h → Critical alert (2min delay)
│   ├── Medium burn (6x) over 6h → Warning alert (15min delay)
│   └── Slow burn (3x) over 24h → Info alert (1h delay)
└── Link each alert to specific runbook action
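The arithmetic behind the error budget and burn-rate tiers above can be checked with a few helper functions. This is a minimal sketch assuming a 30-day SLO window; the function names are illustrative and not part of any library.

```typescript
// Minutes of allowed downtime for an availability SLO over the window.
function errorBudgetMinutes(sloTarget: number, windowDays: number = 30): number {
  return (1 - sloTarget) * windowDays * 24 * 60;
}

// Burn rate = observed error ratio divided by the error ratio the SLO allows.
function burnRate(observedErrorRatio: number, sloTarget: number): number {
  return observedErrorRatio / (1 - sloTarget);
}

// Fraction of the whole budget consumed if `rate` is sustained for `hours`.
function budgetConsumed(rate: number, hours: number, windowDays: number = 30): number {
  return (rate * hours) / (windowDays * 24);
}

console.log(errorBudgetMinutes(0.999).toFixed(1), "minutes of budget at 99.9%");

// The three alert tiers from the list above.
const tiers = [
  { name: "fast", rate: 14.4, hours: 1 },
  { name: "medium", rate: 6, hours: 6 },
  { name: "slow", rate: 3, hours: 24 },
];
for (const t of tiers) {
  const pct = (budgetConsumed(t.rate, t.hours) * 100).toFixed(1);
  console.log(`${t.name}: ${pct}% of the 30-day budget`);
}
```

With a 30-day window, the fast, medium, and slow tiers correspond to roughly 2%, 5%, and 10% of the budget being spent by the time the alert fires, which is why the faster tiers carry the higher severity.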
Symptom: Metrics cardinality > 10M series, query timeouts, high storage costs
Detection: prometheus_tsdb_head_series (active series in the head block) growing exponentially
Fix: Add label cardinality limits, aggregate high-cardinality labels, use recording rules
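To see why aggregating away a single high-cardinality label helps so much, the worst-case series count for one metric can be estimated as the product of the distinct values of each label. The sketch below uses hypothetical label names and value counts; real series counts are usually lower because not every combination occurs.

```typescript
// Upper bound on time series for one metric: the product of the number of
// distinct values each label can take.
function worstCaseSeries(labelCardinalities: Record<string, number>): number {
  return Object.keys(labelCardinalities).reduce(
    (product, label) => product * labelCardinalities[label],
    1,
  );
}

// Hypothetical labels on an orders counter.
const withUserId = worstCaseSeries({ status: 2, payment_method: 5, user_id: 100000 });
const withoutUserId = worstCaseSeries({ status: 2, payment_method: 5 });

console.log(`with user_id: up to ${withUserId} series`);
console.log(`without user_id: up to ${withoutUserId} series`);
```

Identifiers such as user or order IDs generally belong on span attributes or log fields rather than on metric labels.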
Symptom: Spans appearing disconnected, missing parent-child relationships
Detection: Spans with same trace_id but no parent reference in service map
Fix: Verify context propagation headers (traceparent/tracestate), check async context handling
Symptom: > 10 alerts per incident, team ignoring notifications
Detection: Alert:incident ratio > 5:1, MTTA (time to acknowledge) > 30min
Fix: Implement alert dependencies, use SLO burn rate instead of threshold alerts
Symptom: Critical errors not captured in traces, debugging impossible
Detection: Error logs present but corresponding traces missing
Fix: Switch to tail-based sampling, increase error trace retention to 100%
Symptom: Traces terminate at service boundaries, no cross-service correlation
Detection: Spans from downstream services have different trace_id
Fix: Verify HTTP headers propagation, add OTel middleware to all services
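When checking propagation by hand, it helps to parse the traceparent header on both sides of a service boundary and compare trace IDs. The sketch below covers only version 00 of the W3C Trace Context format; the helper names are hypothetical, and in production the OpenTelemetry propagators do this work.

```typescript
interface TraceContext {
  traceId: string;      // 32 lowercase hex characters
  parentSpanId: string; // 16 lowercase hex characters
  sampled: boolean;     // lowest bit of the trace-flags field
}

// version "00" - trace-id - parent-id - trace-flags
const TRACEPARENT_V00 = /^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string | undefined): TraceContext | null {
  if (!header) return null;
  const match = TRACEPARENT_V00.exec(header.trim());
  if (!match) return null;
  const [, traceId, parentSpanId, flags] = match;
  // All-zero IDs are invalid according to the specification.
  if (/^0+$/.test(traceId) || /^0+$/.test(parentSpanId)) return null;
  return { traceId, parentSpanId, sampled: (parseInt(flags, 16) & 0x01) === 1 };
}

// True when both hops carry a valid header with the same trace ID.
function sameTrace(upstream: string | undefined, downstream: string | undefined): boolean {
  const a = parseTraceparent(upstream);
  const b = parseTraceparent(downstream);
  return a !== null && b !== null && a.traceId === b.traceId;
}

const incoming = "00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01";
const outgoing = "00-4bf92f3577b34da6a3ce929d0e0e4736-b7ad6b7169203331-01";
console.log(sameTrace(incoming, outgoing)); // same trace ID, different span IDs
```

A missing header or a changed trace ID between two hops points at the service that dropped or regenerated the context.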
Scenario: Customer reports 5-second checkout timeouts starting 2 hours ago
Step 1 - Triage with SLO dashboard:
P99 latency jumped from 200ms → 5000ms at 14:30 UTC
Error rate spiked from 0.1% → 2.3%
SLO burn rate: 46x (well above the 14.4x critical threshold)
Step 2 - Trace analysis:
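# Find slow checkout requests in the incident window and print their trace IDs
# (assumes JSON logs with duration and trace_id fields)
{service_name="checkout-service"} |= "POST /checkout"
| json | duration > 2s | line_format "{{.trace_id}}"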
Expert insight: Filter by duration first, then sample traces - don't analyze all traces
Step 3 - Root cause drill-down:
Selected trace_id: abc123
├── checkout-service: 50ms (normal)
├── payment-service: 4.8s (🚨 anomaly)
│ ├── validate_card: 45ms
│ ├── fraud_check: 12ms
│ └── database_query: 4.7s (🚨 root cause)
└── inventory-service: 100ms
Step 4 - Correlate with infrastructure:
Database span attributes show:
- db.statement: "SELECT * FROM transactions WHERE user_id = ?"
- db.connection.pool.idle: 0
- db.connection.pool.max: 10
Expert insight: Connection pool exhaustion - scale pool or optimize queries
Resolution: Increased connection pool from 10 → 50, added query timeout
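The drill-down in Step 3 amounts to following the slowest child at each level of the span tree. Here is a rough sketch using the durations from the example trace; the SpanNode shape, the synthetic root node, and its 5000 ms duration are assumptions made for illustration.

```typescript
interface SpanNode {
  name: string;
  durationMs: number;
  children: SpanNode[];
}

// Walk from the root, always descending into the slowest child,
// and return the names along that path.
function slowestPath(root: SpanNode): string[] {
  const path = [root.name];
  let current = root;
  while (current.children.length > 0) {
    current = current.children.reduce((slowest, span) =>
      span.durationMs > slowest.durationMs ? span : slowest,
    );
    path.push(current.name);
  }
  return path;
}

// Durations taken from the Step 3 trace; the root node is synthetic.
const checkoutTrace: SpanNode = {
  name: "trace abc123",
  durationMs: 5000,
  children: [
    { name: "checkout-service", durationMs: 50, children: [] },
    {
      name: "payment-service",
      durationMs: 4800,
      children: [
        { name: "validate_card", durationMs: 45, children: [] },
        { name: "fraud_check", durationMs: 12, children: [] },
        { name: "database_query", durationMs: 4700, children: [] },
      ],
    },
    { name: "inventory-service", durationMs: 100, children: [] },
  ],
};

console.log(slowestPath(checkoutTrace).join(" → "));
// trace abc123 → payment-service → database_query
```

This greedy walk works when a single span dominates the latency, as in the scenario; traces with parallel or overlapping spans need a proper critical-path analysis.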
Scenario: Add business metrics for order processing pipeline
// 1. Initialize custom meter
import { metrics, trace, SpanStatusCode } from '@opentelemetry/api';

const meter = metrics.getMeter('order-service', '1.0.0');

// 2. Define business metrics
const ordersTotal = meter.createCounter('orders_total', {
  description: 'Total orders processed',
  unit: '1'
});
const orderValue = meter.createHistogram('order_value_dollars', {
  description: 'Order value distribution',
  unit: 'USD'
});

// 3. Instrument business logic
// Order, validateOrder, and chargePayment are assumed to be defined elsewhere in the service.
async function processOrder(order: Order) {
  // Attach order details to the span that is already active for this request
  const span = trace.getActiveSpan();
  span?.setAttributes({
    'order.id': order.id,
    'order.user_id': order.userId,
    'order.value': order.totalValue
  });
  try {
    // Business logic here
    await validateOrder(order);
    await chargePayment(order);
    // Record success metrics
    ordersTotal.add(1, {
      status: 'success',
      payment_method: order.paymentMethod
    });
    orderValue.record(order.totalValue);
  } catch (error) {
    // `error` is `unknown` in strict TypeScript, so narrow it before recording
    span?.recordException(error instanceof Error ? error : String(error));
    span?.setStatus({ code: SpanStatusCode.ERROR });
    ordersTotal.add(1, { status: 'error' });
    throw error;
  }
}
traceparent header propagation
trace_id and span_id fields
Application Error Handling → Use relevant language skill (node-js-expert, python-expert, etc.) for try/catch, error boundaries, graceful degradation
Security Event Monitoring → Use security-expert skill for SIEM, threat detection, compliance logging
Log Storage Infrastructure → Use kubernetes-expert or cloud-expert for ELK stack deployment, log retention policies
Performance Testing → Use load-testing-expert for generating telemetry during performance validation
Cost Optimization → Use finops-expert for observability spend analysis and retention tuning