skills/mav-bp-observability/SKILL.md
Observability conventions for deployed applications. Covers metrics collection, distributed tracing, health checks, SLIs/SLOs, and dashboards. Complements the logging and alerting skills to complete the observability picture. Applied when designing or reviewing operational aspects of services.
Ensure all deployed services are observable -- problems are detected, diagnosed, and correlated across system boundaries. Observability is not just monitoring; it is the ability to understand internal system state from external outputs.
Before applying these standards, load the project-specific observability implementation:
digraph lookup {
"docs/maverick/skills/observability/SKILL.md exists?" [shape=diamond];
"Read and use alongside these standards" [shape=box];
"Invoke upskill" [shape=box];
"Read generated skill" [shape=box];
"docs/maverick/skills/observability/SKILL.md exists?" -> "Read and use alongside these standards" [label="yes"];
"docs/maverick/skills/observability/SKILL.md exists?" -> "Invoke upskill" [label="no"];
"Invoke upskill" -> "Read generated skill";
"Read generated skill" -> "Read and use alongside these standards";
}
If `docs/maverick/skills/observability/SKILL.md` does not exist, invoke the `do-upskill` skill with:

- Search pattern: `meter|tracer|span|histogram|counter|gauge|HealthCheck|readiness|liveness|SLO|SLI`
- File globs: `**/metrics*.*`, `**/tracing*.*`, `**/telemetry*.*`, `**/health*.*`, `**/observability*.*`, `**/instrumentation*.*`

Observability is a composite discipline. Logs and alerts are covered by dedicated skills; metrics and traces are covered here:
| Pillar  | Covered by      | This skill's role                             |
| ------- | --------------- | --------------------------------------------- |
| Logs    | mav-bp-logging  | Defer to the logging skill for log standards  |
| Metrics | This skill      | Defines metric collection and instrumentation |
| Traces  | This skill      | Defines distributed tracing standards         |
| Alerts  | mav-bp-alerting | Defer to the alerting skill for notification  |
This skill focuses on metrics, tracing, health checks, SLIs/SLOs, and dashboards. It assumes the logging and alerting skills are applied alongside it.
Metrics are numeric measurements collected over time. They answer "how much" and "how fast" questions.
| Category            | What it measures                      | Examples                                               |
| ------------------- | ------------------------------------- | ------------------------------------------------------ |
| Application metrics | Business logic and request processing | Request count, error count, response time, queue depth |
| Business metrics    | Domain-level outcomes                 | Orders placed, sign-ups completed, payments processed  |
| Resource metrics    | Infrastructure utilisation            | CPU usage, memory usage, disk I/O, network throughput  |
For services that handle requests, track these three signals:
| Signal | What it measures | Example metric |
| ----------- | ------------------------- | ---------------------------------- |
| Rate | Requests per second | http_requests_total |
| Errors | Failed requests per second | http_errors_total |
| Duration | Time per request | http_request_duration_seconds |
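The three signals in the table above can be recorded at a single choke point around each request handler. The following is a minimal sketch using an in-memory stand-in rather than a real metrics library; `RedMetrics`, `instrumented`, the `route` label, and the convention that handlers return a `(status, body)` tuple are all assumptions made for illustration. The counters hold running totals; a metrics backend would normally derive the per-second rates from them.

```python
import time
from collections import defaultdict


class RedMetrics:
    """In-memory stand-in for a metrics library (hypothetical, illustration only)."""

    def __init__(self):
        self.counters = defaultdict(int)    # (metric name, labels) -> running total
        self.durations = defaultdict(list)  # (metric name, labels) -> observed seconds

    def observe_request(self, route, status_code, duration_seconds):
        labels = (("route", route),)  # low-cardinality label only
        self.counters[("http_requests_total", labels)] += 1
        if status_code >= 500:
            self.counters[("http_errors_total", labels)] += 1
        self.durations[("http_request_duration_seconds", labels)].append(duration_seconds)


def instrumented(metrics, route, handler):
    """Wrap a handler so every call records rate, errors, and duration."""

    def wrapper(*args, **kwargs):
        start = time.perf_counter()
        status = 500  # counted as an error if the handler raises
        try:
            status, body = handler(*args, **kwargs)
            return status, body
        finally:
            # Runs on success and on exception, so no request goes unrecorded.
            metrics.observe_request(route, status, time.perf_counter() - start)

    return wrapper
```

Wrapping the handler, instead of calling the recorder inside each handler body, keeps the instrumentation uniform and ensures that requests which fail with an exception are still counted.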
For infrastructure components (databases, queues, caches), track:
| Signal | What it measures | Example metric |
| --------------- | ----------------------------------- | ---------------------------------- |
| Utilisation | Percentage of resource capacity used | cpu_utilisation_percent |
| Saturation | Degree of queuing or contention | thread_pool_queue_depth |
| Errors | Error events from the resource | db_connection_errors_total |
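For a resource such as a worker pool, the three signals above can be derived from a point-in-time snapshot. This is a rough sketch under stated assumptions: `PoolSnapshot`, its fields, and the metric names `thread_pool_utilisation_percent` and `thread_pool_errors_total` are hypothetical; only `thread_pool_queue_depth` appears in the table.

```python
from dataclasses import dataclass


@dataclass
class PoolSnapshot:
    """Point-in-time view of a hypothetical worker pool."""

    busy_workers: int
    total_workers: int
    queued_tasks: int
    failed_tasks: int


def use_metrics(snapshot):
    """Map a snapshot onto utilisation, saturation, and error metrics."""
    if snapshot.total_workers <= 0:
        raise ValueError("total_workers must be positive")
    utilisation = 100.0 * snapshot.busy_workers / snapshot.total_workers
    return {
        # Utilisation: share of capacity currently in use.
        "thread_pool_utilisation_percent": utilisation,
        # Saturation: work waiting because capacity is exhausted.
        "thread_pool_queue_depth": snapshot.queued_tasks,
        # Errors: failures reported by the resource itself.
        "thread_pool_errors_total": snapshot.failed_tasks,
    }
```

Utilisation and saturation are reported separately because a pool can sit at 100% utilisation with an empty queue (busy but keeping up) or with a growing queue (falling behind).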
Metric names follow the pattern `<namespace>_<name>_<unit>` (e.g., `http_request_duration_seconds`).

Traces follow a single request as it crosses service boundaries. They answer "where did the time go" and "which service failed."
| Concept        | Definition                                                             |
| -------------- | ---------------------------------------------------------------------- |
| Trace          | End-to-end record of a request across all services it touches          |
| Span           | A single unit of work within a trace (e.g., an HTTP call, a DB query)  |
| Trace ID       | Unique identifier shared by all spans in a trace                       |
| Span context   | Propagated metadata (trace ID, span ID, flags) passed between services |
| Correlation ID | Application-level identifier linking logs, metrics, and traces         |
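To make the span context row concrete, here is a minimal sketch of propagating trace ID, span ID, and flags between services. It assumes the W3C Trace Context `traceparent` header layout (`00-<trace-id>-<parent-id>-<flags>`), which this skill does not itself mandate; the function names and the dictionary shape are hypothetical. In practice a tracing library would normally do this through middleware.

```python
import re
import secrets

# version "00", 32 hex chars of trace ID, 16 hex chars of span ID, 2 hex chars of flags
TRACEPARENT_RE = re.compile(r"^00-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$")


def new_span_id():
    return secrets.token_hex(8)  # 8 random bytes -> 16 hex characters


def start_trace():
    """Create the root span context of a new trace."""
    return {"trace_id": secrets.token_hex(16), "span_id": new_span_id(), "flags": "01"}


def inject(context, headers):
    """Write the span context into outbound request headers."""
    headers["traceparent"] = (
        f"00-{context['trace_id']}-{context['span_id']}-{context['flags']}"
    )
    return headers


def extract(headers):
    """Continue the caller's trace if the header is valid, else start a new one."""
    match = TRACEPARENT_RE.match(headers.get("traceparent", ""))
    if match is None:
        return start_trace()
    trace_id, parent_span_id, flags = match.groups()
    if set(trace_id) == {"0"} or set(parent_span_id) == {"0"}:
        return start_trace()  # all-zero IDs are not valid
    return {
        "trace_id": trace_id,              # shared by every span in the trace
        "span_id": new_span_id(),          # new span for this service's work
        "parent_span_id": parent_span_id,  # links back to the caller's span
        "flags": flags,
    }
```

The receiving service keeps the trace ID unchanged and mints only a new span ID, which is what lets a tracing backend stitch spans from different services into one trace.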
The trace ID must appear in:
Health checks expose the current state of a service to orchestrators, load balancers, and monitoring systems.
| Type      | Question it answers                    | Failure consequence                         |
| --------- | -------------------------------------- | ------------------------------------------- |
| Liveness  | Is the process running and not stuck?  | Orchestrator restarts the instance          |
| Readiness | Can the service handle requests?       | Load balancer stops sending traffic         |
| Startup   | Has the service finished initialising? | Orchestrator waits before checking liveness |
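The difference between liveness and readiness in the table above can be sketched as two framework-independent functions. This is an illustration under assumptions: the function names, the response shape, and the 200/503 status codes are not prescribed by this skill, and `dependency_checks` is a hypothetical mapping from a dependency name to a callable that returns `True` when the dependency is reachable.

```python
def liveness():
    """Process-only check: deliberately does not touch any dependency."""
    return 200, {"status": "alive"}


def readiness(dependency_checks):
    """Report ready only when every dependency check passes."""
    results = {}
    for name, check in dependency_checks.items():
        try:
            results[name] = bool(check())
        except Exception:
            # A check that raises counts as an unavailable dependency.
            results[name] = False
    ready = all(results.values())
    body = {"status": "ready" if ready else "not_ready", "checks": results}
    return (200 if ready else 503), body
```

Keeping dependency checks out of `liveness` matters because a liveness failure triggers a restart: if a shared database goes down, restarting every instance that depends on it does not help, whereas failing readiness only removes the instance from traffic until the dependency recovers.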
`/health`, `/healthz`, or `/ready` are common conventions for health endpoint paths. Do not require authentication.

Service Level Indicators (SLIs) and Service Level Objectives (SLOs) define what "healthy" means and measure against targets.
| Term         | Definition                                                     | Example                                                |
| ------------ | -------------------------------------------------------------- | ------------------------------------------------------ |
| SLI          | A quantitative measure of a specific aspect of service quality | Proportion of requests completing in under 200ms       |
| SLO          | A target value or range for an SLI over a time window          | 99.5% of requests complete in under 200ms over 30 days |
| Error budget | The allowed amount of SLO violation before action is required  | 0.5% of requests may exceed 200ms per 30-day window    |
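The error budget follows arithmetically from the SLO: with a 99.5% target, 0.5% of the requests in the window may miss the threshold. A small sketch of that calculation, where the function name, parameters, and returned keys are hypothetical:

```python
def error_budget(slo_target, total_requests, bad_requests):
    """Compute error budget usage for one SLO window.

    slo_target is a fraction (0.995 for 99.5%); bad_requests counts the
    requests that violated the SLI threshold within the window.
    """
    if not 0.0 < slo_target < 1.0:
        raise ValueError("slo_target must be strictly between 0 and 1")
    allowed_bad = (1.0 - slo_target) * total_requests
    consumed = bad_requests / allowed_bad if allowed_bad > 0 else 0.0
    return {
        "allowed_bad": allowed_bad,               # size of the budget in requests
        "remaining": allowed_bad - bad_requests,  # negative once the SLO is breached
        "consumed_fraction": consumed,            # 1.0 means the budget is spent
    }
```

For example, a service handling 1,000,000 requests in a 30-day window against a 99.5% target has a budget of about 5,000 slow requests; 1,250 slow requests would consume roughly a quarter of it.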
| SLI          | Measurement                                        |
| ------------ | -------------------------------------------------- |
| Availability | Proportion of successful (non-5xx) responses       |
| Latency      | Proportion of requests faster than a threshold     |
| Throughput   | Requests processed per unit of time                |
| Correctness  | Proportion of responses that return the right data |
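The availability and latency rows above are both "good events over total events" ratios. A minimal sketch of computing them from raw observations, assuming hypothetical helper names and that the inputs are plain lists of status codes and durations in seconds:

```python
def availability_sli(status_codes):
    """Proportion of responses that are not 5xx; None when there is no traffic."""
    if not status_codes:
        return None
    good = sum(1 for code in status_codes if code < 500)
    return good / len(status_codes)


def latency_sli(durations_seconds, threshold_seconds):
    """Proportion of requests faster than the threshold; None when there is no traffic."""
    if not durations_seconds:
        return None
    fast = sum(1 for duration in durations_seconds if duration < threshold_seconds)
    return fast / len(durations_seconds)
```

Returning `None` for an empty window avoids reporting either 0% or 100% when there was simply no traffic to measure. Note that a 404 counts as a successful response here, matching the non-5xx definition in the table.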
Dashboards provide at-a-glance visibility into service health. They are not replacements for alerting but support investigation and awareness.
Every deployed service should have a dashboard showing the four golden signals:
| Signal     | What it shows                        | Typical visualisation           |
| ---------- | ------------------------------------ | ------------------------------- |
| Latency    | Response time distribution           | Histogram or heatmap            |
| Traffic    | Request volume                       | Time-series line graph          |
| Errors     | Error rate or error count            | Time-series with threshold line |
| Saturation | Resource utilisation and queue depth | Gauge or time-series            |
| Pattern                                              | Issue                               | Fix                                                          |
| ---------------------------------------------------- | ----------------------------------- | ------------------------------------------------------------ |
| Service with no metrics instrumentation              | Blind to performance issues         | Add RED/USE metrics using a standard library                 |
| No trace context propagation on outbound calls       | Broken traces                       | Add middleware/interceptor for automatic context propagation |
| Health endpoint checks all dependencies for liveness | Cascading restarts                  | Liveness: process only. Readiness: check dependencies.       |
| High-cardinality metric labels (user ID, request ID) | Metric explosion, cost growth       | Use trace IDs in logs/traces, not in metric labels           |
| No correlation ID in log entries                     | Cannot link logs to traces          | Include trace ID in structured log context                   |
| SLO defined but not measured                         | False confidence in reliability     | Instrument the SLI and track against the target              |
| Dashboard with no link to traces or logs             | Slow investigation                  | Add drill-down links from dashboard panels                   |
| Custom metric collection instead of standard library | Maintenance burden, incompatibility | Migrate to OpenTelemetry or equivalent                       |
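One way to guard against the high-cardinality label pattern in the table above is an allow-list applied before labels reach the metrics layer. This is a sketch only: `ALLOWED_LABEL_KEYS`, its contents, and `split_labels` are hypothetical, and the right set of bounded labels depends on the service.

```python
# Hypothetical allow-list of labels whose set of possible values is small and bounded.
ALLOWED_LABEL_KEYS = frozenset({"route", "method", "status_class"})


def split_labels(attributes):
    """Separate bounded metric labels from high-cardinality attributes.

    Returns (metric_labels, trace_attributes): the first is safe to attach to
    a metric, the second belongs on spans or in structured log context.
    """
    metric_labels = {k: v for k, v in attributes.items() if k in ALLOWED_LABEL_KEYS}
    trace_attributes = {k: v for k, v in attributes.items() if k not in ALLOWED_LABEL_KEYS}
    return metric_labels, trace_attributes
```

An allow-list is used instead of a deny-list so that a newly added attribute such as a user ID stays out of metric labels by default.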
<!-- maverick-plugin-version: 3.3.7 -->