skills/council/operator/observability-design/SKILL.md
Use when designing monitoring, alerting, logging, tracing, and SLI/SLO strategies for services or systems. Covers metric collection, structured logging, distributed tracing, dashboard design, and error budget management. Do not use for deployment pipeline design (use deployment-plan) or infrastructure cost modeling (use cost-analysis).
npx skillsauth add dtsong/my-claude-setup observability-designInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Design a comprehensive observability strategy covering metrics, logging, tracing, alerting, and SLI/SLO definitions. Produces a monitoring architecture that enables rapid incident detection, diagnosis, and resolution.
Reads system architecture documentation, existing monitoring configurations, and service definitions for observability analysis. Does not modify files, deploy monitoring agents, or access production telemetry data directly.
No user-provided values are used in commands or file paths. All inputs are treated as read-only analysis targets.
Establish the three pillars for this system:
Define the metric taxonomy:
Design the alerting strategy:
Design the logging architecture:
Plan request flow visibility:
service.operation format, consistent across servicesDefine dashboard hierarchy:
Establish reliability targets:
Compaction resilience: If context was lost during a long session, re-read the Inputs section to reconstruct what system is being analyzed, check the Progress Checklist for completed steps, then resume from the earliest incomplete step.
# Observability Design: [Service/Feature Name]
## Observability Architecture
[Application] → [Metrics Agent] → [Metrics Store] → [Dashboards] ↓ ↓ [Structured Logs] → [Log Aggregator] → [Log Search] [Alerts] → [On-call] ↓ [Trace SDK] → [Trace Collector] → [Trace UI]
## Metric Catalog
| Metric Name | Type | Labels | Description | Alert Threshold |
|-------------|------|--------|-------------|-----------------|
| http_requests_total | counter | method, path, status | Request count | N/A |
| http_request_duration_ms | histogram | method, path | Request latency | p95 > 500ms |
| ... | ... | ... | ... | ... |
## Alert Catalog
| Alert Name | Severity | Condition | Duration | Runbook |
|------------|----------|-----------|----------|---------|
| HighErrorRate | critical | error_rate > 5% | 5m | [link] |
| LatencyDegraded | warning | p95 > 500ms | 10m | [link] |
| ... | ... | ... | ... | ... |
## Logging Schema
```json
{
"timestamp": "ISO8601",
"level": "INFO",
"service": "api",
"request_id": "uuid",
"user_id": "string (optional)",
"action": "string",
"duration_ms": "number",
"status": "number",
"message": "string"
}
| SLI | Measurement | SLO Target | Error Budget (30d) | |-----|-------------|------------|-------------------| | Availability | successful requests / total | 99.9% | 43.2 min downtime | | Latency | requests < 200ms / total | 99.0% | 432 min slow | | Error Rate | non-error requests / total | 99.9% | 0.1% errors |
| Dashboard | Audience | Key Panels | Refresh | |-----------|----------|------------|---------| | Service Health | On-call | Traffic, errors, latency, saturation | 30s | | SLO Status | Engineering | Error budget, burn rate, compliance | 5m | | Business Metrics | Product | Adoption, conversions, revenue | 1h |
## Handoff
- Hand off to deployment-plan if observability findings reveal deployment pipeline gaps (e.g., missing health checks, no canary metrics integration).
- Hand off to cost-analysis if telemetry storage, metric cardinality, or log retention volumes raise infrastructure cost concerns.
## Quality Checks
- [ ] All three observability pillars (metrics, logs, traces) are covered
- [ ] Every alert has a defined severity, threshold, and linked runbook
- [ ] Structured logging schema includes correlation IDs for distributed tracing
- [ ] PII fields are identified with redaction strategy
- [ ] SLIs are measurable and SLO targets are realistic for the service tier
- [ ] Error budgets are calculated with burn rate alert thresholds
- [ ] Dashboard hierarchy covers operational, business, and SLO views
- [ ] Sampling strategy balances trace coverage with storage costs
## Evolution Notes
<!-- Observations appended after each use -->
testing
Use to convert a Word .docx file to PDF and/or verify its page count. Triggers on: converting docx to pdf, rendering a document, checking how many pages a docx produces, or asserting a page-count constraint (e.g. a resume must stay 2 pages). Wraps LibreOffice headless conversion.
development
Security audit checklist for web applications. Use when reviewing, auditing, or hardening a web app's security posture. Covers rate limiting, auth headers, IP blocking, CORS, security middleware, input validation, file upload limits, ORM usage, and password hashing. Triggers on requests like "review security", "harden this app", "security audit", "check for vulnerabilities", or when building/reviewing API endpoints.
development
Interactive wizard to craft effective prompts using Claude Code best practices
tools
Use when batch labeling, prioritizing, and assigning GitHub issues during triage sessions.