skills/observability-and-instrumentation/SKILL.md
Instruments code so production behavior is visible and diagnosable. Use when adding logging, metrics, tracing, or alerting. Use when shipping any feature that runs in production and you need evidence it works. Use when production issues are reported but you can't tell what happened from the available data.
npx skillsauth add noahacgn/codex-config observability-and-instrumentationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Code you can't observe is code you can't operate. Observability is the ability to answer "what is the system doing and why?" from the outside, using the telemetry the code emits. Instrumentation is not a post-launch add-on — it's written alongside the feature, the same way tests are. If a feature ships without telemetry, the first user-reported bug becomes archaeology instead of a query.
NOT for:
debugging-and-error-recovery skill (observability is what makes that skill fast next time)performance-optimization skillshipping-and-launch skill; this skill covers the instrumentation that feeds themTelemetry without a question is noise. Before adding any instrumentation, write down 2–4 questions an on-call engineer will ask about this feature:
FEATURE: checkout payment retry
QUESTIONS ON-CALL WILL ASK:
1. What fraction of payments succeed on first attempt vs after retry?
2. When a payment fails permanently, why? (provider error? timeout? validation?)
3. Is the payment provider slower than usual?
→ Every signal below must help answer one of these.
If you can't name the questions, you're not ready to instrument — you'll log everything and learn nothing.
| Signal | Answers | Cost profile | Example |
|---|---|---|---|
| Structured log | "What happened in this specific case?" | Per-event; grows with traffic | payment_failed with provider error code |
| Metric | "How often / how fast, in aggregate?" | Fixed per series; cheap to query | p99 latency of provider calls |
| Trace | "Where did time go across services?" | Per-request; usually sampled | One slow checkout, broken down by hop |
Rule of thumb: metrics tell you that something is wrong, traces tell you where, logs tell you why.
Log events, not prose. Every log line is a JSON object with a stable event name and machine-readable fields:
// BAD: string interpolation — unqueryable, inconsistent
logger.info(`Payment ${id} failed for user ${userId} after ${n} retries`);
// GOOD: stable event name + structured fields
logger.warn({
event: 'payment_failed',
paymentId: id,
provider: 'stripe',
errorCode: err.code,
attempt: n,
}, 'payment failed');
Log levels — use them consistently:
| Level | Meaning | On-call action |
|---|---|---|
| error | Invariant broken; someone may need to act | Investigate |
| warn | Degraded but handled (retry succeeded, fallback used) | Watch for trends |
| info | Significant business event (order placed, job finished) | None |
| debug | Diagnostic detail | Off in production by default |
Correlation IDs are mandatory. Generate (or accept) a request ID at the system boundary and attach it to every log line, span, and outbound call. Without it, you cannot reconstruct a single request from interleaved logs:
// Express: child logger per request, ID propagated downstream
app.use((req, res, next) => {
req.id = req.headers['x-request-id'] ?? crypto.randomUUID();
req.log = logger.child({ requestId: req.id });
res.setHeader('x-request-id', req.id);
next();
});
Never log secrets, tokens, passwords, or full PII. This is a hard rule from the security-and-hardening skill — telemetry pipelines are a classic data-leak path. Allowlist fields; don't log whole request bodies.
For request-driven services, instrument RED on every endpoint and every external dependency: Rate (requests/sec), Errors (failure rate), Duration (latency histogram, not average). For resources (queues, pools, hosts), use USE: Utilization, Saturation, Errors.
As with tracing, the vendor-neutral path is the OpenTelemetry metrics API (same SDK and context as step 5). The example below uses Prometheus' prom-client — one common backend choice, not the only one; the RED/USE and cardinality rules are identical either way.
import { Histogram } from 'prom-client';
const httpDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration',
labelNames: ['method', 'route', 'status_class'], // '2xx', not '200'
buckets: [0.05, 0.1, 0.25, 0.5, 1, 2.5, 5],
});
Cardinality is the failure mode. Every unique label combination is a separate time series. Labels must come from small, fixed sets (route template, status class, provider name). Never use user IDs, raw URLs, error messages, or other unbounded values as labels — that belongs in logs and traces.
OK as label: route="/api/tasks/:id" status_class="5xx" provider="stripe"
NEVER a label: user_id, email, request_id, full URL, error message text
Track averages never, percentiles always: an average hides the 1% of users having a terrible time. Use histograms and read p50/p95/p99.
Use OpenTelemetry — it's the vendor-neutral standard, and auto-instrumentation covers HTTP, gRPC, and common DB clients with near-zero code:
// tracing.ts — must be imported before anything else
import { NodeSDK } from '@opentelemetry/sdk-node';
import { getNodeAutoInstrumentations } from '@opentelemetry/auto-instrumentations-node';
const sdk = new NodeSDK({
serviceName: 'checkout-service',
instrumentations: [getNodeAutoInstrumentations()],
});
sdk.start();
Add manual spans only around meaningful internal units of work (e.g., applyDiscounts, chargeProvider) and attach the attributes on-call will filter by. Propagate context across every async boundary — HTTP headers, queue message metadata — or the trace dies at the gap. Sample head-based at a low rate by default; keep 100% of errors if your backend supports tail sampling.
Alert on symptoms users feel, not on causes:
SYMPTOM (page-worthy): CAUSE (dashboard, not a page):
error rate > 1% for 5 min CPU at 85%
p99 latency > 2s one pod restarted
queue age > 10 min disk at 70%
Cause-based alerts fire when nothing is wrong and miss failures you didn't predict. Symptom-based alerts fire exactly when users are hurt, regardless of the cause.
Rules for every alert you create:
Instrumentation is code; it can be wrong. Before calling the work done, trigger the paths and look at the actual output:
requestId, confirm fields are structured (not [object Object])| Rationalization | Reality | |---|---| | "I'll add logging after it works" | "After" becomes "after the first incident", which is the most expensive moment to discover you're blind. Instrument as you build. | | "More logs = more observability" | Unstructured noise makes incidents slower, not faster. Three queryable events beat three hundred prose lines. | | "console.log is fine for now" | Unstructured output can't be filtered, correlated, or alerted on. The structured logger costs five extra minutes once. | | "We can just look at the dashboards when something breaks" | Dashboards built without defined questions show you everything except the answer. Start from on-call questions. | | "Alert on everything important, we'll tune later" | A noisy pager trains people to ignore it. The tuning never happens; the missed real page does. | | "User ID as a metric label makes debugging easier" | It also makes your metrics backend fall over. High-cardinality lookups belong in logs and traces. | | "Tracing is overkill for our two services" | Two services already means cross-service latency questions logs can't answer. Auto-instrumentation makes the cost trivial. |
After instrumenting a feature, confirm:
development
Only when explicitly invoked, use an ExecPlan from design to implementation for complex features or significant refactors. Do not use it automatically.
development
Extracts what the user actually wants instead of what they think they should want. Achieves this through one-question-at-a-time interview until ~95% confidence about the underlying intent. Use when an ask is underspecified ("build me X" without "for whom" or "why now"), when the user explicitly invokes ("interview me", "grill me", "are we sure?", "stress-test my thinking"), or when you catch yourself silently filling in ambiguous requirements before any plan, spec, or code exists.
development
Subjects every non-trivial decision to a fresh-context adversarial review before it stands. Use when correctness matters more than speed, when working in unfamiliar code, when stakes are high (production, security-sensitive logic, irreversible operations), or any time a confident output would be cheaper to verify now than to debug later.
development
Discovers and invokes agent skills. Use when starting a session or when you need to discover which skill applies to the current task. This is the meta-skill that governs how all other skills are discovered and invoked.