.github/skills/tsh-implementing-observability/SKILL.md
Observability patterns for logging, monitoring, alerting, and distributed tracing. Use when implementing metrics collection, log aggregation, alerting rules, or distributed tracing across services.

| Pillar | Purpose | Tools |
|--------|---------|-------|
| Metrics | Quantitative measurements over time | Prometheus, CloudWatch, Datadog, Grafana |
| Logs | Discrete events with context | ELK, Loki, CloudWatch Logs, Splunk |
| Traces | Request flow across services | Jaeger, Zipkin, X-Ray, Tempo |

Check which observability stack the project uses:

- `prometheus.yml` or `ServiceMonitor` → Prometheus
- `fluent-bit.conf` or `fluentd.conf` → Fluent Bit/Fluentd
- `otel-collector-config.yaml` → OpenTelemetry
- `aws_cloudwatch_*` resources → CloudWatch
- `datadog-agent` or `DD_*` env vars → Datadog

Use context7 to look up stack-specific configuration syntax.
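
The detection hints above can be sketched as a small scan of the repository. This is only an illustration: the function name `detect_stacks`, the substring checks, and the assumption that `aws_cloudwatch_*` resources live in Terraform `.tf` files are not part of the skill itself.

```python
import os
from pathlib import Path

# Marker file names mapped to the stack they suggest (from the list above).
FILE_MARKERS = {
    "prometheus.yml": "Prometheus",
    "fluent-bit.conf": "Fluent Bit/Fluentd",
    "fluentd.conf": "Fluent Bit/Fluentd",
    "otel-collector-config.yaml": "OpenTelemetry",
}


def detect_stacks(root, env=None):
    """Return a sorted list of observability stacks hinted at under root."""
    env = os.environ if env is None else env
    found = set()
    for path in Path(root).rglob("*"):
        if not path.is_file():
            continue
        if path.name in FILE_MARKERS:
            found.add(FILE_MARKERS[path.name])
        elif path.suffix in {".yaml", ".yml"}:
            text = path.read_text(errors="ignore")
            if "kind: ServiceMonitor" in text:
                found.add("Prometheus")
            if "datadog-agent" in text:
                found.add("Datadog")
        elif path.suffix == ".tf":
            # Assumption: CloudWatch resources are declared in Terraform files.
            if "aws_cloudwatch_" in path.read_text(errors="ignore"):
                found.add("CloudWatch")
    # DD_* environment variables suggest a Datadog agent or client.
    if any(name.startswith("DD_") for name in env):
        found.add("Datadog")
    return sorted(found)
```

A result with more than one entry is common (for example an OpenTelemetry Collector exporting to Prometheus), so treat the output as a list of leads to confirm, not a final answer.
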

| Scenario | Recommended Metrics Solution |
|----------|------------------------------|
| Kubernetes-native, cost-sensitive | Prometheus + Grafana |
| AWS-native, simple setup | CloudWatch Metrics |
| Multi-cloud, enterprise | Datadog or New Relic |
| OpenTelemetry-first | Prometheus with OTLP receiver |

| Scenario | Recommended Logging Solution |
|----------|------------------------------|
| Kubernetes, cost-sensitive | Loki + Grafana |
| AWS-native | CloudWatch Logs |
| High volume, complex queries | Elasticsearch (ELK) |
| Multi-cloud, managed | Datadog Logs or Splunk |

| Scenario | Recommended Tracing Solution |
|----------|------------------------------|
| Kubernetes, open-source | Jaeger or Tempo |
| AWS-native | X-Ray |
| Multi-cloud, correlated | Datadog APM |
| Vendor-agnostic | OpenTelemetry → any backend |

```text
┌─────────────────────────────────────────────────────┐
│                    Applications                     │
│  (instrumented with OpenTelemetry SDK or auto-inst) │
└──────────────────────┬──────────────────────────────┘
                       │ OTLP
                       ▼
┌─────────────────────────────────────────────────────┐
│               OpenTelemetry Collector               │
│      (receives, processes, exports telemetry)       │
└───────┬─────────────────┬─────────────────┬─────────┘
        │                 │                 │
        ▼                 ▼                 ▼
   Prometheus           Loki          Tempo/Jaeger
    (metrics)          (logs)           (traces)
        │                 │                 │
        └────────────────┬┴─────────────────┘
                         ▼
                      Grafana
                   (visualization)
```

| Metric | Description | Example SLI |
|--------|-------------|-------------|
| Rate | Requests per second | rate(http_requests_total[5m]) |
| Errors | Failed requests | rate(http_requests_total{status=~"5.."}[5m]) |
| Duration | Latency distribution | histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) |
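
To make the three RED signals concrete, here is a minimal sketch that computes them from raw request samples held in memory. It assumes a hypothetical `Request` record and a nearest-rank percentile. Prometheus derives the same signals from counters and histogram buckets, so its numbers will not match this calculation exactly.

```python
from dataclasses import dataclass


@dataclass
class Request:
    status: int        # HTTP status code
    duration_s: float  # request latency in seconds


def red_summary(requests, window_s):
    """Compute Rate, Errors and Duration (p99) for one observation window."""
    total = len(requests)
    if total == 0:
        return {"rate": 0.0, "error_ratio": 0.0, "p99_s": 0.0}
    errors = sum(1 for r in requests if 500 <= r.status <= 599)
    durations = sorted(r.duration_s for r in requests)
    # Nearest-rank p99: smallest sample with at least 99% of samples at or
    # below it. Integer arithmetic avoids floating-point rounding of the rank.
    rank = (99 * total + 99) // 100
    return {
        "rate": total / window_s,        # requests per second
        "error_ratio": errors / total,   # share of 5xx responses
        "p99_s": durations[rank - 1],    # 99th percentile latency
    }
```
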

| Metric | Description | Example |
|--------|-------------|---------|
| Utilization | % time resource is busy | CPU usage, memory usage |
| Saturation | Queue depth, waiting | Pod pending, connection pool |
| Errors | Error count | OOM kills, disk errors |

```yaml
# Example: API availability SLO
slo:
  name: api-availability
  description: "API returns successful responses"
  sli:
    metric: |
      sum(rate(http_requests_total{status!~"5.."}[5m]))
      /
      sum(rate(http_requests_total[5m]))
  target: 99.9%
  window: 30d
  error_budget: 0.1%  # ~43 minutes/month downtime allowed
```

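
The "~43 minutes" figure follows from the target and the window: 0.1% of 30 days is 43.2 minutes. The sketch below reproduces that arithmetic and adds a request-based view of how much budget is left. The function names `error_budget` and `budget_remaining` are illustrative, not part of any SLO tool.

```python
from datetime import timedelta


def error_budget(target_percent, window):
    """Allowed time out of SLO within the window, as a timedelta."""
    if not 0 < target_percent <= 100:
        raise ValueError("target_percent must be in (0, 100]")
    return window * ((100 - target_percent) / 100)


def budget_remaining(target_percent, good, total):
    """Fraction of the error budget left, from good and total request counts."""
    allowed_bad = total * (100 - target_percent) / 100
    if allowed_bad == 0:
        raise ValueError("needs a target below 100 and at least one request")
    return 1 - (total - good) / allowed_bad


budget = error_budget(99.9, timedelta(days=30))
print(round(budget.total_seconds() / 60, 1))  # 43.2 (minutes per 30 days)
```

A negative result from `budget_remaining` means the budget for the window is already overspent.
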

| Severity | Response | Example |
|----------|----------|---------|
| Critical | Page on-call immediately | Service down, data loss risk |
| Warning | Investigate within hours | Error rate elevated, disk 80% |
| Info | Review during business hours | Deployment completed, scaling event |

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          /
          sum(rate(http_requests_total[5m])) > 0.01
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate detected"
          description: "Error rate is {{ $value | humanizePercentage }} (threshold: 1%)"
          runbook_url: "https://runbooks.example.com/high-error-rate"
```

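
The `for: 5m` clause means the expression must stay above the threshold for five minutes of consecutive evaluations before the alert fires. Until then the alert is pending, and one evaluation below the threshold resets the timer. A simplified model of that state machine, using a hypothetical `AlertState` class rather than Prometheus's own implementation, might look like this:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class AlertState:
    threshold: float      # e.g. 0.01 for a 1% error ratio
    for_seconds: float    # how long the condition must hold before firing
    active_since: Optional[float] = None

    def evaluate(self, value, now):
        """Return 'inactive', 'pending' or 'firing' for one evaluation."""
        if value <= self.threshold:
            self.active_since = None   # condition cleared: reset the timer
            return "inactive"
        if self.active_since is None:
            self.active_since = now    # first breach starts the pending period
        if now - self.active_since >= self.for_seconds:
            return "firing"
        return "pending"
```

Holding alerts in a pending state like this filters out short spikes, which supports the goal of paging only on sustained, user-impacting symptoms.
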
```json
{
  "timestamp": "2024-01-15T10:30:00Z",
  "level": "error",
  "message": "Payment processing failed",
  "service": "payment-api",
  "trace_id": "abc123",
  "span_id": "def456",
  "user_id": "user-789",
  "error": {
    "type": "PaymentGatewayError",
    "message": "Connection timeout"
  },
  "context": {
    "payment_id": "pay-123",
    "amount": 99.99
  }
}
```

| Field | Purpose | Correlation |
|-------|---------|-------------|
| timestamp | When event occurred | Time-based queries |
| level | Severity (debug/info/warn/error) | Filtering |
| service | Source service name | Service filtering |
| trace_id | Distributed trace identifier | Cross-service correlation |
| message | Human-readable description | Search |
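
As one way to emit these fields, the sketch below plugs a JSON formatter into Python's standard `logging` module. The `JsonFormatter` class and its `service` argument are illustrative. Correlation fields such as `trace_id` are assumed to arrive through logging's `extra=` argument.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as one JSON object per line."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        level = "warn" if record.levelno == logging.WARNING else record.levelname.lower()
        entry = {
            "timestamp": created.isoformat(timespec="seconds").replace("+00:00", "Z"),
            "level": level,
            "message": record.getMessage(),
            "service": self.service,
        }
        # Optional correlation fields, supplied via `extra=` on the log call.
        for key in ("trace_id", "span_id", "user_id", "context"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


if __name__ == "__main__":
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter("payment-api"))
    logger = logging.getLogger("payment-api")
    logger.addHandler(handler)
    logger.error(
        "Payment processing failed",
        extra={"trace_id": "abc123", "context": {"payment_id": "pay-123"}},
    )
```

Writing one JSON object per line keeps the output easy for collectors such as Fluent Bit to parse without multi-line handling.
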

| Don't | Do |
|-------|-----|
| Alert on every metric threshold | Alert on user-impacting symptoms |
| Log everything at DEBUG in production | Use appropriate log levels |
| Unstructured log messages | Structured JSON logging |
| Missing trace context | Propagate trace IDs across services |
| Dashboards with 50+ panels | Focused dashboards per service/domain |
| Alerts without runbooks | Every alert links to response procedure |
| Store logs indefinitely | Define retention based on compliance needs |

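
"Propagate trace IDs across services" usually means reading the caller's trace ID from the incoming request and attaching it to every outgoing call. The sketch below does this by hand with the W3C Trace Context `traceparent` header. The helper names are hypothetical, header keys are assumed to be lowercased, and in a real service an OpenTelemetry SDK would normally handle this step.

```python
import re
import secrets

# W3C Trace Context: version "00", 32-hex trace id, 16-hex parent span id,
# 2-hex flags, joined by dashes.
_TRACEPARENT = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id():
    """Generate a span id: 8 random bytes rendered as 16 hex characters."""
    return secrets.token_hex(8)


def extract_trace_id(headers):
    """Return the caller's trace id, or start a new trace if none is valid."""
    match = _TRACEPARENT.match(headers.get("traceparent", ""))
    if match and match.group(1) != "0" * 32:  # an all-zero trace id is invalid
        return match.group(1)
    return secrets.token_hex(16)


def outgoing_headers(trace_id, span_id, sampled=True):
    """Build the header to attach to a downstream request."""
    flags = "01" if sampled else "00"
    return {"traceparent": f"00-{trace_id}-{span_id}-{flags}"}
```

Logging the same `trace_id` in every service is what makes the cross-service correlation in the field table possible.
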

- `tsh-implementing-kubernetes` - For K8s-native observability setup
- `tsh-implementing-ci-cd` - For pipeline observability integration
- `tsh-managing-secrets` - For secure credential storage for observability tools