curiositech/observability-apm-expert

Name: observability-apm-expert
Author: curiositech

skills/observability-apm-expert/SKILL.md

npx skillsauth add curiositech/windags-skills observability-apm-expert

Observability & APM Expert

Implement comprehensive observability with distributed tracing, metrics, structured logging, and SLO-based alerting using OpenTelemetry and modern backends.

Decision Points

1. Sampling Strategy Selection

Error Rate Analysis:
├── Error rate < 0.1%
│   ├── Low-volume service (< 10k spans/min) → 100% sampling
│   └── High-volume service (≥ 10k spans/min) → Tail-based sampling
│       ├── Keep all error traces (100%)
│       ├── Keep slow traces > P95 latency (100%)
│       └── Sample successful traces (1-10%)
└── Error rate ≥ 0.1%
    ├── Critical service → Keep all errors + 50% successful
    └── Non-critical service → Keep all errors + 10% successful
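The decision tree above can be written as a small function. This is only a sketch: the type names, the `chooseSamplingPolicy` function, and the 5% success ratio (picked from the 1-10% range) are assumptions for illustration, not part of any OpenTelemetry API.

```typescript
// Hypothetical types; thresholds are copied from the decision tree above.
type SamplingPolicy =
  | { kind: "head"; ratio: number }
  | { kind: "tail"; keepErrors: true; keepSlowerThan?: "p95"; successRatio: number };

interface ServiceProfile {
  errorRate: number;      // fraction of failed requests, e.g. 0.001 = 0.1%
  spansPerMinute: number; // span volume, not label cardinality
  critical: boolean;
}

function chooseSamplingPolicy(s: ServiceProfile): SamplingPolicy {
  if (s.errorRate < 0.001) {
    // Healthy and low volume: keeping everything is affordable.
    if (s.spansPerMinute < 10_000) return { kind: "head", ratio: 1.0 };
    // Healthy and high volume: keep errors and slow traces, sample the rest.
    // 0.05 is an assumed midpoint of the 1-10% range.
    return { kind: "tail", keepErrors: true, keepSlowerThan: "p95", successRatio: 0.05 };
  }
  // Elevated error rate: keep all errors, vary the share of successful traces.
  return { kind: "tail", keepErrors: true, successRatio: s.critical ? 0.5 : 0.1 };
}
```

Returning a plain data object keeps the policy easy to log and compare across services before it is translated into collector configuration.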

2. Backend Selection Strategy

If self-hosted tolerance = high AND cost sensitivity = high:
├── Use Grafana stack (Tempo + Mimir + Loki)
└── Export via OTLP to unified collector

If operational overhead tolerance = low OR compliance = strict:
├── Cloud vendors (Datadog, New Relic, Honeycomb)
└── Direct SDK exports + OTLP fallback

If hybrid requirements:
├── Critical services → SaaS backend
└── Development/staging → Self-hosted stack

3. Alert Configuration Logic

For each SLO:
├── Define error budget (e.g., 99.9% = 43.2min downtime/month)
├── Calculate burn rates:
│   ├── Fast burn (14.4x) over 1h → Critical alert (2min delay)
│   ├── Medium burn (6x) over 6h → Warning alert (15min delay)
│   └── Slow burn (3x) over 24h → Info alert (1h delay)
└── Link each alert to specific runbook action
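The arithmetic behind these numbers is short enough to check in code. The sketch below assumes a 30-day SLO window (which is what makes 99.9% equal 43.2 minutes); the function names are illustrative.

```typescript
const MINUTES_PER_30_DAYS = 30 * 24 * 60; // 43,200 minutes

// Allowed downtime for a given availability target over the window.
function errorBudgetMinutes(sloTarget: number, windowMinutes: number = MINUTES_PER_30_DAYS): number {
  return (1 - sloTarget) * windowMinutes;
}

// Fraction of the whole budget consumed if `burnRate` is sustained for `hours`.
function budgetConsumed(burnRate: number, hours: number, windowHours: number = 30 * 24): number {
  return (burnRate * hours) / windowHours;
}

// The three tiers from the tree above.
const burnTiers = [
  { name: "fast", burnRate: 14.4, hours: 1 },
  { name: "medium", burnRate: 6, hours: 6 },
  { name: "slow", burnRate: 3, hours: 24 },
].map((t) => ({ ...t, consumed: budgetConsumed(t.burnRate, t.hours) }));
```

Under the 30-day assumption the three tiers correspond to roughly 2%, 5%, and 10% of the monthly budget being consumed by the time each alert condition is met.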

Failure Modes

Schema Bloat

Symptom: Metric cardinality > 10M series, query timeouts, high storage costs
Detection: prometheus_tsdb_head_series growing rapidly
Fix: Add label cardinality limits, aggregate or drop high-cardinality labels, use recording rules
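One way to enforce a label cardinality limit in application code is to cap the distinct values a label may take and fold the rest into a catch-all bucket. The `BoundedLabel` class and `worstCaseSeries` helper below are hypothetical names for a minimal sketch of that idea.

```typescript
// Caps the distinct values of one label; overflow values collapse into a catch-all.
class BoundedLabel {
  private readonly seen = new Set<string>();
  private readonly maxValues: number;
  private readonly overflow: string;

  constructor(maxValues: number, overflow: string = "other") {
    this.maxValues = maxValues;
    this.overflow = overflow;
  }

  normalize(value: string): string {
    if (this.seen.has(value)) return value;
    if (this.seen.size < this.maxValues) {
      this.seen.add(value);
      return value;
    }
    return this.overflow;
  }
}

// Worst-case series count for one metric is the product of its label cardinalities.
function worstCaseSeries(labelCardinalities: number[]): number {
  return labelCardinalities.reduce((acc, n) => acc * n, 1);
}
```

Passing every label value through `normalize` before recording a metric bounds that label at `maxValues + 1` distinct values, at the cost of losing detail for values that arrive after the cap is reached.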

Trace Orphaning

Symptom: Spans appearing disconnected, missing parent-child relationships
Detection: Spans with same trace_id but no parent reference in service map
Fix: Verify context propagation headers (traceparent/tracestate), check async context handling
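The detection rule can be expressed as a check over exported span records. This sketch assumes a simplified `SpanRecord` shape (hypothetical, not an OpenTelemetry type) and flags spans whose parent ID never appears in the same trace.

```typescript
// Simplified span record; field names are assumptions for illustration.
interface SpanRecord {
  traceId: string;
  spanId: string;
  parentSpanId?: string; // undefined for root spans
  name: string;
}

// Returns spans that reference a parent that is absent from their own trace.
function findOrphanSpans(spans: SpanRecord[]): SpanRecord[] {
  const idsByTrace = new Map<string, Set<string>>();
  for (const s of spans) {
    const ids = idsByTrace.get(s.traceId) ?? new Set<string>();
    ids.add(s.spanId);
    idsByTrace.set(s.traceId, ids);
  }
  return spans.filter(
    (s) => s.parentSpanId !== undefined && !idsByTrace.get(s.traceId)!.has(s.parentSpanId)
  );
}
```

A missing parent is not always a propagation bug: the parent may have been dropped by sampling or may simply not have been exported yet, so results are best read over a settled time window.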

Alert Fatigue Storm

Symptom: > 10 alerts per incident, team ignoring notifications
Detection: Alert-to-incident ratio > 5:1, MTTA (mean time to acknowledge) > 30min
Fix: Implement alert dependencies, use SLO burn rate instead of threshold alerts

Sampling Blind Spots

Symptom: Critical errors not captured in traces, debugging impossible
Detection: Error logs present but corresponding traces missing
Fix: Switch to tail-based sampling, increase error trace retention to 100%
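A tail-based decision is made after all spans of a trace are known, which is what lets it keep every error. The sketch below is an illustration with assumed names (`keepTrace`, `FinishedSpan`); in practice this logic usually lives in a collector rather than in application code.

```typescript
interface FinishedSpan {
  durationMs: number;
  isError: boolean;
}

// FNV-1a (32-bit) hash of the trace ID mapped to [0, 1), so every
// evaluator reaches the same decision for the same trace.
function hashToUnitInterval(key: string): number {
  let h = 0x811c9dc5;
  for (let i = 0; i < key.length; i++) {
    h ^= key.charCodeAt(i);
    h = Math.imul(h, 0x01000193);
  }
  return (h >>> 0) / 0x100000000;
}

function keepTrace(
  traceId: string,
  spans: FinishedSpan[],
  slowThresholdMs: number,
  successRatio: number
): boolean {
  if (spans.some((s) => s.isError)) return true;                      // keep 100% of error traces
  if (spans.some((s) => s.durationMs > slowThresholdMs)) return true; // keep 100% of slow traces
  return hashToUnitInterval(traceId) < successRatio;                  // sample healthy traces
}
```

Hashing the trace ID instead of calling a random number generator keeps the decision reproducible, which also helps when the sampling decision itself needs to be debugged.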

Context Propagation Gaps

Symptom: Traces terminate at service boundaries, no cross-service correlation
Detection: Spans from downstream services have different trace_id
Fix: Verify HTTP headers propagation, add OTel middleware to all services
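When checking header propagation by hand, it helps to validate the `traceparent` value a downstream service actually received. The parser below is a minimal sketch of the W3C Trace Context four-field layout (version, trace ID, parent ID, flags); `parseTraceparent` and `TraceParent` are illustrative names, and the sketch does not handle extra fields that future versions may append.

```typescript
interface TraceParent {
  version: string;
  traceId: string;
  parentId: string;
  sampled: boolean;
}

// version(2 hex)-trace-id(32 hex)-parent-id(16 hex)-flags(2 hex), lowercase only.
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

function parseTraceparent(header: string): TraceParent | null {
  const m = TRACEPARENT_RE.exec(header.trim());
  if (m === null) return null;
  const [, version, traceId, parentId, flags] = m;
  if (version === "ff") return null; // "ff" is an invalid version
  if (/^0+$/.test(traceId) || /^0+$/.test(parentId)) return null; // all-zero IDs are invalid
  return { version, traceId, parentId, sampled: (parseInt(flags, 16) & 1) === 1 };
}
```

Comparing the parsed `traceId` on both sides of a service boundary shows quickly whether the downstream service continued the trace or started a new one.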

Worked Examples

Production Incident Investigation

Scenario: Customer reports 5-second checkout timeouts starting 2 hours ago

Step 1 - Triage with SLO dashboard:

P99 latency jumped from 200ms → 5000ms at 14:30 UTC
Error rate spiked from 0.1% → 2.3%
SLO burn rate: 46x (well above the 14.4x critical threshold)
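Burn rate is the observed error rate divided by the error rate the SLO allows. The scenario does not state the SLO target, so the targets below are assumptions: a 2.3% error rate gives about 46x against a 99.95% target and about 23x against 99.9%.

```typescript
// Burn rate = observed error rate / allowed error rate (1 - SLO target).
function burnRate(observedErrorRate: number, sloTarget: number): number {
  return observedErrorRate / (1 - sloTarget);
}

// Assumed targets for illustration; the scenario does not name one.
const burnAt9995 = burnRate(0.023, 0.9995); // approximately 46
const burnAt999 = burnRate(0.023, 0.999);   // approximately 23
```

Either value exceeds the 14.4x fast-burn tier, so the incident would page under both assumptions.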

Step 2 - Trace analysis:

# Find slow checkout requests in the incident window (LogQL) and print their trace IDs
{service_name="checkout-service"} |= "POST /checkout"
| json | duration > 2s | line_format "{{.trace_id}}"

Expert insight: Filter by duration first, then sample traces - don't analyze all traces

Step 3 - Root cause drill-down:

Selected trace_id: abc123
├── checkout-service: 50ms (normal)
├── payment-service: 4.8s (🚨 anomaly)
│   ├── validate_card: 45ms
│   ├── fraud_check: 12ms  
│   └── database_query: 4.7s (🚨 root cause)
└── inventory-service: 100ms
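The drill-down amounts to finding the span with the most self time, meaning its own duration minus the time spent in its children. The sketch below uses a hypothetical `SpanNode` shape and the payment-service numbers from the trace above, and assumes child spans run sequentially rather than in parallel.

```typescript
interface SpanNode {
  name: string;
  durationMs: number;
  children: SpanNode[];
}

// Time spent in the span itself, excluding its direct children.
function selfTimeMs(span: SpanNode): number {
  const childTotal = span.children.reduce((sum, c) => sum + c.durationMs, 0);
  return Math.max(0, span.durationMs - childTotal);
}

// Walks the subtree and returns the span with the largest self time.
function slowestBySelfTime(root: SpanNode): SpanNode {
  let worst = root;
  const stack: SpanNode[] = [root];
  while (stack.length > 0) {
    const span = stack.pop()!;
    if (selfTimeMs(span) > selfTimeMs(worst)) worst = span;
    stack.push(...span.children);
  }
  return worst;
}

const paymentSpan: SpanNode = {
  name: "payment-service",
  durationMs: 4800,
  children: [
    { name: "validate_card", durationMs: 45, children: [] },
    { name: "fraud_check", durationMs: 12, children: [] },
    { name: "database_query", durationMs: 4700, children: [] },
  ],
};

const rootCause = slowestBySelfTime(paymentSpan); // database_query
```

Ranking by self time rather than total duration is what separates payment-service (4.8s total, 43ms of its own) from the database query that actually consumed the time.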

Step 4 - Correlate with infrastructure:

Database span attributes show:
- db.statement: "SELECT * FROM transactions WHERE user_id = ?"
- db.connection.pool.idle: 0
- db.connection.pool.max: 10

Expert insight: Connection pool exhaustion - scale pool or optimize queries

Resolution: Increased connection pool from 10 → 50, added query timeout
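Little's law gives a quick sanity check on the pool size: the average number of connections in use equals the request rate multiplied by how long each request holds a connection. The request rate below is a hypothetical figure, since the scenario does not give one; only the 4.7s query time and the pool sizes come from the trace.

```typescript
// Little's law: average connections in use = arrival rate x hold time.
function connectionsInUse(requestsPerSecond: number, holdSeconds: number): number {
  return requestsPerSecond * holdSeconds;
}

function isPoolExhausted(requestsPerSecond: number, holdSeconds: number, poolMax: number): boolean {
  return connectionsInUse(requestsPerSecond, holdSeconds) > poolMax;
}

// Assumed load of 5 checkout requests per second (not from the source).
const assumedRps = 5;
const exhaustedAt10 = isPoolExhausted(assumedRps, 4.7, 10);    // slow query, original pool
const exhaustedAt50 = isPoolExhausted(assumedRps, 4.7, 50);    // slow query, enlarged pool
const exhaustedWhenFast = isPoolExhausted(assumedRps, 0.05, 10); // healthy 50ms query
```

Under that assumed load a 4.7s query needs about 23.5 connections on average, which explains why a pool of 10 drains while the same pool is ample for a 50ms query. It also suggests the query timeout matters as much as the larger pool, because it caps the hold time.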

Custom Instrumentation Setup

Scenario: Add business metrics for order processing pipeline

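// 1. Initialize custom meter (trace and SpanStatusCode are used in step 3)
import { metrics, trace, SpanStatusCode } from '@opentelemetry/api';
const meter = metrics.getMeter('order-service', '1.0.0');

// Assumed shape of the order object; adapt to your domain model
interface Order {
  id: string;
  userId: string;
  totalValue: number;
  paymentMethod: string;
}

// Placeholders for the service's existing business logic
declare function validateOrder(order: Order): Promise<void>;
declare function chargePayment(order: Order): Promise<void>;

// 2. Define business metrics
const ordersTotal = meter.createCounter('orders_total', {
  description: 'Total orders processed',
  unit: '1'
});

const orderValue = meter.createHistogram('order_value_dollars', {
  description: 'Order value distribution',
  unit: 'USD'
});

// 3. Instrument business logic
async function processOrder(order: Order) {
  // High-cardinality identifiers belong on the span, not on metric labels
  const span = trace.getActiveSpan();
  span?.setAttributes({
    'order.id': order.id,
    'order.user_id': order.userId,
    'order.value': order.totalValue
  });

  try {
    // Business logic here
    await validateOrder(order);
    await chargePayment(order);

    // Record success metrics
    ordersTotal.add(1, {
      status: 'success',
      payment_method: order.paymentMethod
    });
    orderValue.record(order.totalValue);

  } catch (error) {
    // Catch variables are `unknown` in strict TypeScript; narrow before recording
    span?.recordException(error instanceof Error ? error : String(error));
    span?.setStatus({ code: SpanStatusCode.ERROR });
    // Use the same label set as the success path so the series stay comparable
    ordersTotal.add(1, {
      status: 'error',
      payment_method: order.paymentMethod
    });
    throw error;
  }
}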

Quality Gates

[ ] All HTTP requests include traceparent header propagation
[ ] Every log statement contains trace_id and span_id fields
[ ] RED metrics (Rate/Errors/Duration) available for each service endpoint
[ ] P99 latency SLO defined, with burn-rate alerting that fires before 2% of the error budget is consumed
[ ] Tail-based sampling retains 100% of error traces and slow traces
[ ] Database queries instrumented with connection pool and query performance spans
[ ] Custom business metrics tagged with bounded cardinality labels (< 1000 values)
[ ] Alert runbooks linked and tested for each SLO violation scenario
[ ] Trace sampling decision logged and queryable for debugging
[ ] Cross-service dependency map auto-generated from span relationships
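The log correlation gate is easiest to meet by routing every log call through one helper that always attaches the trace context. The sketch below uses hypothetical names (`logLine`, `TraceContext`); in a service instrumented with `@opentelemetry/api`, the IDs would typically come from `trace.getActiveSpan()?.spanContext()`.

```typescript
interface TraceContext {
  traceId: string;
  spanId: string;
}

type LogLevel = "info" | "warn" | "error";

// Builds one structured JSON log line that always carries trace_id and span_id.
function logLine(
  level: LogLevel,
  message: string,
  ctx: TraceContext,
  fields: Record<string, unknown> = {}
): string {
  // Caller fields are spread first so they cannot overwrite the correlation IDs.
  return JSON.stringify({
    ...fields,
    level,
    message,
    trace_id: ctx.traceId,
    span_id: ctx.spanId,
  });
}
```

With the IDs present on every line, a log search for an error can be joined directly to the corresponding trace.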

NOT-FOR Boundaries

Application Error Handling → Use relevant language skill (node-js-expert, python-expert, etc.) for try/catch, error boundaries, graceful degradation

Security Event Monitoring → Use security-expert skill for SIEM, threat detection, compliance logging

Log Storage Infrastructure → Use kubernetes-expert or cloud-expert for ELK stack deployment, log retention policies

Performance Testing → Use load-testing-expert for generating telemetry during performance validation

Cost Optimization → Use finops-expert for observability spend analysis and retention tuning

OpenTelemetry, distributed tracing, Grafana, and Datadog for full-stack observability. Activate on: observability, tracing, OpenTelemetry, Grafana, Datadog, metrics, logging, APM, SLO, alerting. NOT for: application error handling (use relevant language skill), security monitoring (use relevant security skill).

testing

Updated Apr 4, 2026

npx skillsauth add curiositech/windags-skills observability-apm-expert

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 24, 2026, 7:47 AM · 4.2s · 1 file scanned

SKILL.md

license: Apache-2.0
name: observability-apm-expert
description: OpenTelemetry, distributed tracing, Grafana, and Datadog for full-stack observability. Activate on: observability, tracing, OpenTelemetry, Grafana, Datadog, metrics, logging, APM, SLO, alerting. NOT for: application error handling (use relevant language skill), security monitoring (use relevant security skill).
allowed-tools: Read,Write,Edit,Bash(npm:*,npx:*,docker:*)
category: DevOps & Infrastructure
- skill: distributed-transaction-manager
  reason: Saga observability across distributed transactions

Related Skills

curiositech/revisiting-interview-data-analysing-turn

data-ai

Verified · Trusted · Community

license: Apache-2.0 NOT for unrelated tasks outside this domain.

8 · SKILL.md · Updated Jul 19, 2026

curiositech/redis-patterns-expert

development

Verified · Trusted · Community

Use when designing caching strategies (cache-aside, write-through, write-behind), implementing distributed locks, building rate limiters, leaderboards, real-time streams (XADD/consumer groups), pub/sub, or tuning eviction policies. Triggers: thundering-herd on cache miss, dogpile on key expiry, Redlock vs SET-NX-PX choice, sliding-window rate limiter, hot-key on a single cluster slot, big-key blowup, MULTI/EXEC across slots, KEYS in production. NOT for Redis Cluster operations/admin (different domain), embedded KV (SQLite, leveldb), in-process LRU caches, or Memcached.

8 · SKILL.md · Updated Jul 19, 2026

curiositech/react-server-components-boundary

tools

Verified · Trusted · Community

Drawing the `'use client'` boundary correctly in React Server Components apps (Next.js App Router, RSC frameworks): leaf-pushing, slot composition, serialization rules, and environment poisoning prevention. Grounded in react.dev and Next.js 16 docs.

8 · SKILL.md · Updated Jul 19, 2026

curiositech/rate-limiting-strategy

development

Verified · Trusted · Community

Use when designing rate limiting for an API, choosing between token bucket / sliding window / leaky bucket / fixed window, implementing it in Redis, deciding edge (Cloudflare/Upstash) vs origin enforcement, sizing per-user vs per-IP vs per-endpoint quotas, returning the right 429 response with Retry-After, or fixing the boundary-burst bug in fixed-window limiters. Triggers: 429 too many requests, INCR + EXPIRE, ZADD + ZREMRANGEBYSCORE + ZCARD, X-RateLimit-Remaining header, Cloudflare WAF rate limiting rules, Upstash @upstash/ratelimit, leaky bucket shaping vs policing, distributed rate limiter consistency. NOT for DDoS mitigation specifically (different scale), CAPTCHA / bot management, full WAF design, or per-user quota billing.

8 · SKILL.md · Updated Jul 19, 2026

Install manually

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r windags-skills/skills/observability-apm-expert ~/.claude/skills/

Repository

curiositech/windags-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT