skills/logging-metrics/SKILL.md
Structured logging and metrics design for analytics pipelines and dashboards. Use when: implementing application logging, designing log schemas, adding metrics collection, handling exception logging, building observability into services, integrating with log aggregation and dashboarding tools, auditing existing logging for structure and completeness, or improving observability in an existing service.
Logs are not debug printf statements. They are structured data events that feed analytics pipelines, dashboards, alerts, and incident investigation. Design them like you would design a database schema.
All logs must be JSON-formatted, one JSON object per line. Never use unstructured text, multi-line log messages, or custom delimiters.
Every log entry must include these base fields:
| Field | Type | Description |
|-------|------|-------------|
| timestamp | string | ISO 8601 with timezone: 2026-03-19T23:45:00.123Z |
| level | string | DEBUG, INFO, WARNING, ERROR, CRITICAL |
| service | string | Service/application name |
| environment | string | dev, staging, prod |
| message | string | Human-readable description of the event |

Include these contextual fields when they apply:

| Field | Type | When to include |
|-------|------|-----------------|
| request_id | string (UUID) | Any event within an HTTP request lifecycle |
| user_id | string | Any event tied to a user action (never log PII beyond ID) |
| trace_id | string | Distributed tracing across services |
| span_id | string | Sub-operation within a trace |
| route | string | HTTP endpoint path |
| method | string | HTTP method |
| status | integer | HTTP response status code |
| duration_ms | number | Operation timing |
| component | string | Module, class, or function name |
| action | string | What the code is doing: fetch_users, send_email, sync_cache |
Add fields relevant to the specific event. Use consistent names across the codebase:
```json
{
  "timestamp": "2026-03-19T23:45:00.123Z",
  "level": "INFO",
  "service": "project-api",
  "environment": "prod",
  "request_id": "a1b2c3d4-e5f6-7890-abcd-ef1234567890",
  "action": "create_project",
  "user_id": "usr_42",
  "project_id": "prj_789",
  "duration_ms": 142.5,
  "message": "Project created successfully"
}
```
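A schema like this is easier to keep honest with an automated check. The following is a minimal sketch, assuming the base fields and levels from the tables above; `validate_entry` is a hypothetical helper name, not part of any library.

```python
from datetime import datetime

# Base fields and level names copied from the tables above.
BASE_FIELDS = ("timestamp", "level", "service", "environment", "message")
LEVELS = {"DEBUG", "INFO", "WARNING", "ERROR", "CRITICAL"}


def validate_entry(entry: dict) -> list[str]:
    """Return a list of schema problems; an empty list means the entry passes."""
    problems = []
    for field in BASE_FIELDS:
        if field not in entry:
            problems.append(f"missing field: {field}")
        elif not isinstance(entry[field], str):
            problems.append(f"not a string: {field}")

    level = entry.get("level")
    if isinstance(level, str) and level not in LEVELS:
        problems.append(f"unknown level: {level}")

    timestamp = entry.get("timestamp")
    if isinstance(timestamp, str):
        try:
            # Older Python versions do not parse a trailing "Z", so normalize it.
            parsed = datetime.fromisoformat(timestamp.replace("Z", "+00:00"))
            if parsed.tzinfo is None:
                problems.append("timestamp has no timezone")
        except ValueError:
            problems.append("timestamp is not ISO 8601")
    return problems
```

A check like this could run in unit tests against captured log output, so schema drift is caught before it reaches the pipeline.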
- Use consistent suffixes: `_id` for identifiers, `_ms` for milliseconds, `_at` for timestamps, `_count` for counts, `_bytes` for sizes.
- Keep types consistent: if `duration_ms` is a number in one log, it must always be a number.
- Use flat snake_case keys: `user_id` not `user.id`.

Use levels consistently across all services:
| Level | When to use | Example |
|-------|-------------|---------|
| DEBUG | Detailed diagnostic info for development. Disabled in production. | "Resolved user cache key: usr_42" |
| INFO | Normal operations, milestones, state transitions. | "Project created", "Sync completed: 42 items" |
| WARNING | Recoverable issue that may need attention. | "Rate limit approaching: 85% of quota" |
| ERROR | Operation failed but service continues. | "Failed to send notification email" |
| CRITICAL | Service-level failure, requires immediate action. | "Database connection pool exhausted" |
Rules:
When logging exceptions, include enough context to reproduce and diagnose the issue without accessing the running system:
```python
import logging

logger = logging.getLogger(__name__)

try:
    result = process_order(order_id=order_id, user_id=user_id)
except ExternalServiceError as exc:
    logger.error(
        "Order processing failed",
        extra={
            "action": "process_order",
            "order_id": order_id,
            "user_id": user_id,
            "error_type": type(exc).__name__,
            "error_message": str(exc),
            "retry_count": attempt,
        },
        exc_info=True,  # attach the stack trace to the log record
    )
    raise  # re-raise so the caller still sees the failure
```
- Include the stack trace: `exc_info=True` (Python) or `error.stack` (JS). Include in dev/staging; optionally truncate in production.

Build a sanitization layer that runs before logging:
```python
SENSITIVE_KEYS = {"password", "token", "secret", "api_key", "authorization", "credit_card", "ssn"}


def sanitize_context(context: dict) -> dict:
    """Remove or mask sensitive fields before logging."""
    return {
        k: "***REDACTED***" if any(s in k.lower() for s in SENSITIVE_KEYS) else v
        for k, v in context.items()
    }
```
For JavaScript/TypeScript:
```typescript
const SENSITIVE_PATTERNS = /password|token|secret|api_key|authorization|credit_card|ssn/i;

function sanitize(context: Record<string, unknown>): Record<string, unknown> {
  return Object.fromEntries(
    Object.entries(context).map(([k, v]) =>
      SENSITIVE_PATTERNS.test(k) ? [k, "***REDACTED***"] : [k, v],
    ),
  );
}
```
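Both helpers above only inspect top-level keys, so a secret nested inside a payload would pass through untouched. One possible extension is a recursive variant. This is a sketch that reuses the same key list, and `sanitize_nested` is a hypothetical name.

```python
SENSITIVE_KEYS = {"password", "token", "secret", "api_key", "authorization", "credit_card", "ssn"}
REDACTED = "***REDACTED***"


def _is_sensitive(key) -> bool:
    """Substring match on the lowercased key, as in the flat helper."""
    lowered = str(key).lower()
    return any(s in lowered for s in SENSITIVE_KEYS)


def sanitize_nested(value):
    """Return a copy of value with sensitive keys masked at any depth."""
    if isinstance(value, dict):
        return {
            k: REDACTED if _is_sensitive(k) else sanitize_nested(v)
            for k, v in value.items()
        }
    if isinstance(value, (list, tuple)):
        # Lists and tuples both come back as lists, which is what JSON emits anyway.
        return [sanitize_nested(item) for item in value]
    return value
```

Substring matching is deliberately broad: a key such as `classname` contains `ssn` and would be masked too. Over-redaction is usually the safer failure mode for logs, but an exact-match allowlist is an alternative if it causes problems.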
Use the stdlib logging module with a JSON formatter. Never use print().
```python
import json
import logging
import os
import sys
from datetime import datetime, timezone

# Attribute names every LogRecord carries; anything else was passed via `extra`.
_STANDARD_ATTRS = set(logging.LogRecord("", 0, "", 0, "", (), None).__dict__) | {"message", "asctime"}


class JSONFormatter(logging.Formatter):
    def format(self, record: logging.LogRecord) -> str:
        # Use the record's creation time, not the time of formatting
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname,
            "service": os.environ.get("SERVICE_NAME", "unknown"),
            "environment": os.environ.get("ENVIRONMENT", "dev"),
            "component": record.name,
            "message": record.getMessage(),
        }
        # Merge extra fields without overwriting the base fields
        for key, value in record.__dict__.items():
            if key not in _STANDARD_ATTRS and key not in entry:
                entry[key] = value
        # Include exception info
        if record.exc_info and record.exc_info[1]:
            entry["error_type"] = type(record.exc_info[1]).__name__
            entry["error_message"] = str(record.exc_info[1])
            entry["traceback"] = self.formatException(record.exc_info)
        return json.dumps(entry, default=str)


def setup_logging(level: str = "INFO") -> None:
    handler = logging.StreamHandler(sys.stdout)
    handler.setFormatter(JSONFormatter())
    logging.root.handlers = [handler]
    logging.root.setLevel(getattr(logging, level.upper()))
```
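The `request_id` contextual field has to reach every record emitted during a request, and passing it through `extra` at each call site is easy to forget. One common approach is a `logging.Filter` backed by `contextvars`. This is a sketch, and `request_id_var`, `RequestIdFilter` and `start_request` are hypothetical names.

```python
import contextvars
import logging
import uuid
from typing import Optional

# Holds the current request's ID; each thread or asyncio task sees its own value.
request_id_var: contextvars.ContextVar[Optional[str]] = contextvars.ContextVar(
    "request_id", default=None
)


class RequestIdFilter(logging.Filter):
    """Copy the current request ID onto each record as an extra attribute."""

    def filter(self, record: logging.LogRecord) -> bool:
        request_id = request_id_var.get()
        if request_id is not None:
            record.request_id = request_id
        return True  # never drop the record, only annotate it


def start_request() -> str:
    """Generate a request ID at the entry point and store it for this context."""
    request_id = str(uuid.uuid4())
    request_id_var.set(request_id)
    return request_id
```

Attaching the filter to the handler (`handler.addFilter(RequestIdFilter())` inside `setup_logging`) rather than to a single logger would make it apply to records from every logger. A formatter that merges extra record attributes would then emit `request_id` automatically.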
Use a structured logger like pino for Node.js services:
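```typescript
import pino from "pino";

// Map pino's level labels onto the level names used in the table above
const LEVEL_NAMES: Record<string, string> = { warn: "WARNING", fatal: "CRITICAL" };

const logger = pino({
  level: process.env.LOG_LEVEL ?? "info",
  // Emit the message under "message" (pino's default key is "msg")
  messageKey: "message",
  formatters: {
    level: (label) => ({ level: LEVEL_NAMES[label] ?? label.toUpperCase() }),
  },
  base: {
    service: process.env.SERVICE_NAME ?? "unknown",
    environment: process.env.ENVIRONMENT ?? "dev",
  },
  // Emit the time under "timestamp" (pino.stdTimeFunctions.isoTime uses the key "time")
  timestamp: () => `,"timestamp":"${new Date().toISOString()}"`,
});

// Usage
logger.info({ action: "create_project", project_id: "prj_789", duration_ms: 142 }, "Project created");
```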
Instrument the Four Golden Signals for every service:
| Signal | Metric | Example |
|--------|--------|---------|
| Latency | Request duration | p50, p95, p99 of response time in ms |
| Traffic | Request throughput | Requests per second by endpoint |
| Errors | Failure rate | Errors per second, error rate percentage |
| Saturation | Resource utilization | CPU, memory, connection pool usage, queue depth |
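
The latency percentiles in the table can be computed from raw request durations in several ways. The sketch below uses the nearest-rank method, which is one common definition; metrics backends often use interpolation or bucketed approximations instead. The `percentile` helper is hypothetical.

```python
import math


def percentile(samples: list[float], pct: float) -> float:
    """Nearest-rank percentile: the smallest sample with at least pct% of samples at or below it."""
    if not samples:
        raise ValueError("no samples")
    if not 0 < pct <= 100:
        raise ValueError("pct must be in (0, 100]")
    ordered = sorted(samples)
    # Multiply before dividing to avoid float error on whole-number ranks.
    rank = math.ceil(pct * len(ordered) / 100)
    return ordered[rank - 1]


durations_ms = [12.0, 15.0, 18.0, 22.0, 25.0, 31.0, 40.0, 55.0, 120.0, 900.0]
p50 = percentile(durations_ms, 50)
p95 = percentile(durations_ms, 95)
```

With these ten samples the median is 25.0 ms while p95 lands on the 900.0 ms outlier, which is why tail percentiles, not averages, are the usual latency signal.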
Use a consistent naming scheme:
{service}_{component}_{metric}_{unit}
Examples:
- `api_http_request_duration_ms`
- `api_http_request_total`
- `api_http_error_total`
- `worker_queue_depth_count`
- `cache_hit_total`, `cache_miss_total`

| Type | Use when | Example |
|------|----------|---------|
| Counter | Value only increases | Total requests, total errors |
| Gauge | Value goes up and down | Queue depth, active connections, memory usage |
| Histogram | Distribution of values | Request latency, response size |
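
To make the naming scheme and the three metric types concrete, here is a minimal in-memory sketch. The classes are illustrative stand-ins, not the API of any particular metrics library, and `metric_name` is a hypothetical helper.

```python
import re
from bisect import bisect_left

# Each name part must be lowercase snake_case.
_NAME_PART = re.compile(r"^[a-z][a-z0-9]*(_[a-z0-9]+)*$")


def metric_name(service: str, component: str, metric: str, unit: str) -> str:
    """Build a name following {service}_{component}_{metric}_{unit}."""
    parts = [service, component, metric, unit]
    for part in parts:
        if not _NAME_PART.match(part):
            raise ValueError(f"invalid name part: {part!r}")
    return "_".join(parts)


class Counter:
    """Value only increases."""

    def __init__(self) -> None:
        self.value = 0

    def inc(self, amount: int = 1) -> None:
        if amount < 0:
            raise ValueError("counters only increase")
        self.value += amount


class Gauge:
    """Value goes up and down."""

    def __init__(self) -> None:
        self.value = 0.0

    def set(self, value: float) -> None:
        self.value = value


class Histogram:
    """Counts observations per bucket; the last bucket catches overflow."""

    def __init__(self, upper_bounds: list[float]) -> None:
        self.upper_bounds = sorted(upper_bounds)
        self.counts = [0] * (len(self.upper_bounds) + 1)

    def observe(self, value: float) -> None:
        # bisect_left puts a value equal to a bound into that bound's bucket.
        self.counts[bisect_left(self.upper_bounds, value)] += 1
```

In a real service these roles would normally be filled by a metrics client library, with the same naming helper guarding metric registration.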
- Use the same `service`, `environment`, `route`, and `status` field names in both logs and metrics for join-ability.

Design logs to flow through analytics pipelines:
Application → stdout (JSON) → Log collector → Storage → Query/Dashboard
- Generate a `request_id` (UUID) at the entry point of every request. Pass it through all downstream calls and include it in every log entry for that request.
- Propagate `trace_id` and `span_id` in headers (W3C Trace Context or similar).
- Return the `request_id` in API error responses so users can report it for debugging: `{"error": "internal_error", "request_id": "abc-123"}`.

Alert on symptoms, not causes. Page when user-facing behavior is degraded, not on every noisy internal signal.
| Signal | Condition | Severity |
|--------|-----------|----------|
| Error rate | >1% of requests return 5xx over a 5-minute window | PAGE |
| Error rate | >0.1% sustained over 30 minutes | WARN |
| p95 latency | >2× baseline over a 10-minute window | PAGE |
| p95 latency | >1.5× baseline over 30 minutes | WARN |
| Queue depth | >80% of max capacity | WARN |
| Queue depth | >95% of max capacity (unbounded growth) | PAGE |
| Availability | Health check fails for 2 consecutive minutes | PAGE |
| Error budget | SLO error budget <20% remaining in billing period | WARN |
An alert that fires constantly becomes noise — it trains engineers to ignore it.
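
The two error-rate rows of the table can be sketched as a small classifier. This assumes the caller already has 5xx and total request counts for each window, and it simplifies "sustained over 30 minutes" to the aggregate rate across the window. The function name `error_rate_severity` is hypothetical.

```python
from typing import Optional

PAGE_THRESHOLD = 0.01   # >1% of requests over the 5-minute window
WARN_THRESHOLD = 0.001  # >0.1% over the 30-minute window


def _rate(errors: int, requests: int) -> float:
    """Error fraction; an idle window counts as healthy."""
    return errors / requests if requests else 0.0


def error_rate_severity(
    errors_5m: int, requests_5m: int, errors_30m: int, requests_30m: int
) -> Optional[str]:
    """Return "PAGE", "WARN", or None for the given window counts."""
    if _rate(errors_5m, requests_5m) > PAGE_THRESHOLD:
        return "PAGE"
    if _rate(errors_30m, requests_30m) > WARN_THRESHOLD:
        return "WARN"
    return None
```

Checking the paging condition first means a short severe spike pages even when the 30-minute rate is still low, while a slow burn only warns.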
```yaml
resources:
  Resources:
    LambdaErrorRateAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: ${self:service}-${self:custom.stage}-error-rate
        AlarmDescription: "Lambda errors >5 in 5 min. Runbook: docs/runbooks/lambda-errors.md"
        MetricName: Errors
        Namespace: AWS/Lambda
        Dimensions:
          - Name: FunctionName
            Value: !Ref ApiLambdaFunction
        Statistic: Sum
        Period: 300
        EvaluationPeriods: 1
        Threshold: 5
        # Strictly greater than, to match the ">5" in the description
        ComparisonOperator: GreaterThanThreshold
        TreatMissingData: notBreaching
        AlarmActions: [!Ref OpsAlertTopic]
        OKActions: [!Ref OpsAlertTopic]

    LambdaLatencyAlarm:
      Type: AWS::CloudWatch::Alarm
      Properties:
        AlarmName: ${self:service}-${self:custom.stage}-latency-p95
        AlarmDescription: "Lambda p95 latency >2000ms. Runbook: docs/runbooks/latency.md"
        MetricName: Duration
        Namespace: AWS/Lambda
        Dimensions:
          - Name: FunctionName
            Value: !Ref ApiLambdaFunction
        ExtendedStatistic: p95
        Period: 600
        EvaluationPeriods: 1
        Threshold: 2000
        ComparisonOperator: GreaterThanThreshold
        TreatMissingData: notBreaching
        AlarmActions: [!Ref OpsAlertTopic]

    OpsAlertTopic:
      Type: AWS::SNS::Topic
      Properties:
        TopicName: ${self:service}-${self:custom.stage}-ops-alerts
        # Subscribe Slack webhook via Lambda or PagerDuty endpoint via HTTPS subscription
```
For services queried via Grafana (Athena, InfluxDB, Prometheus):
- `severity=page` → PagerDuty, `severity=warn` → Slack.