priority: medium

Monitoring & Observability Standards

Structured Logging with Tracing Crate

Use tracing crate: One API for structured logging and span-based tracing in Rust
Structured events: key=value fields instead of values interpolated into a format string
Span context: Automatic propagation of request IDs, user info, etc.
Multiple layers: Compose log output, metrics, and distributed-trace export on one subscriber

Basic setup in Rust:

use tracing::{debug, info, warn, error, instrument, span, Level};
use tracing_subscriber::{fmt, prelude::*};

#[tokio::main]
async fn main() {
    // Initialize tracing with multiple layers
    tracing_subscriber::registry()
        .with(
            fmt::layer()
                .with_writer(std::io::stdout)
                .with_target(true)
                .with_thread_ids(true)
                .with_level(true)
                .json()  // Structured JSON output for log aggregation
        )
        .with(
            tracing_subscriber::EnvFilter::try_from_default_env()
                .unwrap_or_else(|_| "info".parse().unwrap())
        )
        .init();

    // Fields must come before the message in tracing macros
    info!(version = env!("CARGO_PKG_VERSION"), "Application starting");

    // Run application
    run().await;
}

#[instrument(skip(client), fields(user_id = %user_id))]
async fn process_request(user_id: i64, client: &HttpClient) -> Result<String> {
    debug!("Processing user request");

    let response = client.get(&format!("/users/{}", user_id))
        .await
        .map_err(|e| {
            error!(error = ?e, "Failed to fetch user");
            e
        })?;

    info!(status = response.status().as_u16(), "Request successful");
    Ok(response.text().await?)
}

Span Instrumentation

Automatic span creation: Use #[instrument] macro on functions
Manual spans: Create spans for logical operations
Nested spans: Parent-child relationship tracked automatically
Span fields: Add context about operation (user_id, request_id, etc.)

use tracing::{debug, debug_span, info, instrument, Instrument};

#[instrument(skip(db, cache))]
async fn fetch_user_data(
    user_id: i64,
    db: &Database,
    cache: &Cache,
) -> Result<User> {
    // Span created automatically from function name
    // Fields: user_id added to span context

    // Do not hold a span's `enter()` guard across an `.await`: the guard
    // stays entered while the task is suspended and corrupts the trace.
    // Attach the span to the future with `.instrument()` instead.
    let cached = cache
        .get(user_id)
        .instrument(debug_span!("cache_lookup"))
        .await;

    if let Some(user) = cached {
        debug!("Cache hit");
        return Ok(user);
    }

    // Fall back to the database lookup in its own child span
    let user = db
        .query_user(user_id)
        .instrument(debug_span!("db_lookup"))
        .await?;
    cache.set(user_id, user.clone()).await;

    info!("User loaded from database");
    Ok(user)
}

Log Levels

ERROR: Application error, unrecoverable (invalid request, DB connection failed)
WARN: Unexpected condition, but application continues (deprecated API usage, missing optional config)
INFO: Important state changes (startup complete, request processed, migration started)
DEBUG: Detailed execution flow for debugging (entering function, variable values)
TRACE: Very detailed, typically off in production (field assignments, loop iterations)

Best practices:

// ERROR: User made invalid request or system failure
error!(error = ?e, "Failed to process payment");

// WARN: Degraded service or unusual state
warn!(deprecated_field = true, "Using deprecated API");

// INFO: Progress milestones
info!(user_count = count, "Database migration completed");

// DEBUG: Entry/exit and important values
debug!("Entering user validation");
debug!(validation_result = ?result, "Validation complete");

// TRACE: Detailed loops and assignments (usually disabled)
trace!(index = i, value = ?item, "Processing item");
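
The level descriptions above imply a simple threshold rule: an event is emitted only when it is at least as severe as the configured level. The following is a minimal std-only sketch of that rule. `Severity`, `parse_severity`, and `is_enabled` are hypothetical names introduced for illustration and are not the tracing crate's own types or filter implementation.

```rust
/// Severity levels in ascending order of severity, mirroring the five
/// levels listed above (illustrative type, not `tracing::Level`).
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Severity {
    Trace,
    Debug,
    Info,
    Warn,
    Error,
}

/// Parse a level name such as "info" (case-insensitive, surrounding
/// whitespace ignored). Returns `None` for unknown names.
pub fn parse_severity(name: &str) -> Option<Severity> {
    match name.trim().to_ascii_lowercase().as_str() {
        "trace" => Some(Severity::Trace),
        "debug" => Some(Severity::Debug),
        "info" => Some(Severity::Info),
        "warn" => Some(Severity::Warn),
        "error" => Some(Severity::Error),
        _ => None,
    }
}

/// An event passes the filter when it is at least as severe as the
/// configured threshold. The derived `Ord` follows declaration order.
pub fn is_enabled(event: Severity, threshold: Severity) -> bool {
    event >= threshold
}
```

With a threshold of `Severity::Info`, INFO, WARN, and ERROR events pass while DEBUG and TRACE events are dropped, which matches the usual production setting.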

Python Structured Logging

Use structlog for structured logging in Python:

import structlog
from typing import Any

# Configure structlog
structlog.configure(
    processors=[
        structlog.processors.add_log_level,  # include "level" in every event
        structlog.processors.TimeStamper(fmt="iso"),
        structlog.processors.format_exc_info,  # render exc_info as a traceback string
        structlog.processors.JSONRenderer(),
    ],
    context_class=dict,
    logger_factory=structlog.PrintLoggerFactory(),
    cache_logger_on_first_use=True,
)

logger = structlog.get_logger()

def process_request(user_id: int) -> dict[str, Any]:
    logger.info("Processing user request", user_id=user_id)

    try:
        # fetch_user_data is the application's own lookup function
        result = fetch_user_data(user_id)
        logger.info("Request successful", user_id=user_id, status=200)
        return result
    except Exception as e:
        # logger.exception logs at error level and attaches the traceback
        logger.exception("Request failed", user_id=user_id, error=str(e))
        raise

Metrics Collection with Prometheus

Counter: Monotonically increasing value (requests, errors, bytes sent)
Gauge: Current value that can go up/down (CPU usage, queue size, connections)
Histogram: Distribution of values (request latency, payload size)
Summary: Quantiles computed in the client (a histogram instead exports bucket counts and lets the server estimate quantiles; summaries cannot be aggregated across instances)
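
To make the histogram entry above concrete, here is a minimal std-only sketch of cumulative bucket counting, which is what a histogram exports. `SketchHistogram` and its methods are hypothetical names for illustration. This is not the prometheus crate's `Histogram`, and the bucket bounds are arbitrary.

```rust
/// A minimal cumulative-bucket histogram (illustrative only).
pub struct SketchHistogram {
    /// Bucket upper bounds in ascending order.
    bounds: Vec<f64>,
    /// counts[i] is the number of observations <= bounds[i] (cumulative).
    counts: Vec<u64>,
    /// Total number of observations, i.e. the implicit +Inf bucket.
    total: u64,
    /// Sum of all observed values.
    sum: f64,
}

impl SketchHistogram {
    /// Create a histogram with the given upper bounds (sorted on entry).
    pub fn new(mut bounds: Vec<f64>) -> Self {
        bounds.sort_by(|a, b| a.partial_cmp(b).expect("bounds must not be NaN"));
        let counts = vec![0; bounds.len()];
        Self { bounds, counts, total: 0, sum: 0.0 }
    }

    /// Record one value: every bucket whose bound is >= value is incremented.
    pub fn observe(&mut self, value: f64) {
        for (bound, count) in self.bounds.iter().zip(self.counts.iter_mut()) {
            if value <= *bound {
                *count += 1;
            }
        }
        self.total += 1;
        self.sum += value;
    }

    pub fn cumulative_counts(&self) -> &[u64] {
        &self.counts
    }

    pub fn count(&self) -> u64 {
        self.total
    }

    pub fn sum(&self) -> f64 {
        self.sum
    }
}
```

Because each instance exports plain bucket counts, the counts from several instances can be added together before a quantile is estimated, which is not possible with client-side summary quantiles.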

Rust with prometheus crate:

use once_cell::sync::Lazy;
use prometheus::{
    register_counter_vec, register_gauge, register_histogram_vec,
    CounterVec, Gauge, HistogramVec,
};
use tracing::instrument;

// Define metrics; the register_* macros also add them to the default
// registry, so they appear in the scrape output
pub static REQUEST_COUNTER: Lazy<CounterVec> = Lazy::new(|| {
    register_counter_vec!(
        "http_requests_total",
        "Total HTTP requests",
        &["method", "status"]
    )
    .unwrap()
});

pub static REQUEST_DURATION: Lazy<HistogramVec> = Lazy::new(|| {
    register_histogram_vec!(
        "http_request_duration_seconds",
        "HTTP request duration in seconds",
        &["method", "path"]
    )
    .unwrap()
});

pub static ACTIVE_CONNECTIONS: Lazy<Gauge> = Lazy::new(|| {
    register_gauge!("connections_active", "Active connections").unwrap()
});

// Use metrics in handler
#[instrument(skip(db))]
async fn handle_request(method: &str, path: &str, db: &Database) -> Result<Response> {
    // `path` should be a route template (e.g. "/users/:id"), not the raw URL,
    // to keep label cardinality bounded
    let timer = REQUEST_DURATION.with_label_values(&[method, path]).start_timer();

    ACTIVE_CONNECTIONS.inc();

    // `process_request` stands in for the application-specific handler
    let result = process_request(db).await;

    ACTIVE_CONNECTIONS.dec();

    let status = if result.is_ok() { "200" } else { "500" };
    REQUEST_COUNTER.with_label_values(&[method, status]).inc();

    timer.observe_duration();
    result
}

Prometheus scrape config (prometheus.yml):

global:
  scrape_interval: 15s
  evaluation_interval: 15s

scrape_configs:
  - job_name: 'myapp'
    static_configs:
      # Must be the application's metrics host:port (9090 is also Prometheus's own default port)
      - targets: ['localhost:9090']
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance

Integration with Cloud Observability

Datadog

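// Illustrative only: confirm the crate name and constructor against the Datadog integration in use
use ddtrace::Tracer;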

let tracer = Tracer::new();

#[instrument(skip(db))]
async fn fetch_data(db: &Database) {
    // Automatically traced and sent to Datadog
}

New Relic

// Illustrative only: confirm type and method names against the New Relic SDK version in use
use newrelic::{Config, App};

let config = Config::new("My App", "license-key");
let app = App::new(config)?;

let txn = app.start_transaction("fetch_user");
// Work happens here
txn.notice_error(error);

Google Cloud Trace

// Illustrative only: exporter crate and builder signatures vary between OpenTelemetry releases
use opentelemetry_gcp::trace::CloudTraceExporter;
use opentelemetry_sdk::trace::TracerProvider;

let exporter = CloudTraceExporter::new();
let provider = TracerProvider::builder()
    .with_batch_exporter(exporter)
    .build();

let tracer = provider.tracer("myapp");

Health Check Endpoints

Always expose a /health endpoint for orchestrators:

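// Assumes Rocket 0.5 with the "json" feature
use rocket::{get, http::Status, serde::json::{json, Json}, State};

#[get("/health")]
async fn health(db: &State<Database>) -> (Status, Json<HealthStatus>) {
    let db_healthy = db.ping().await.is_ok();
    // HTTP probes judge health by status code, so report failure as 503
    let (code, status) = if db_healthy {
        (Status::Ok, "healthy")
    } else {
        (Status::ServiceUnavailable, "unhealthy")
    };

    (code, Json(HealthStatus {
        status,
        checks: json!({
            "database": db_healthy,
            "uptime_seconds": uptime(),
        }),
    }))
}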

Kubernetes liveness probe (pod.yaml):

livenessProbe:
  httpGet:
    path: /health
    port: 3000
  initialDelaySeconds: 10
  periodSeconds: 10

Distributed Tracing

Propagate trace IDs across services using W3C Trace Context:

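// Module paths follow recent opentelemetry releases and may differ by version
use http::HeaderMap;
use opentelemetry::propagation::TextMapPropagator;
use opentelemetry_http::{HeaderExtractor, HeaderInjector};
use opentelemetry_sdk::propagation::TraceContextPropagator;
use tracing::{debug_span, Instrument};
use tracing_opentelemetry::OpenTelemetrySpanExt;

async fn call_downstream_service(
    headers: &HeaderMap,
    client: &HttpClient,
) -> Result<Response> {
    // Extract parent trace context from incoming request
    let propagator = TraceContextPropagator::new();
    let parent_context = propagator.extract(&HeaderExtractor(headers));

    // tracing's `parent:` argument expects a tracing span, not an
    // OpenTelemetry context, so attach the remote parent explicitly
    let span = debug_span!("downstream_call", service = "user-service");
    let _ = span.set_parent(parent_context);

    // Inject this span's context into the outgoing request
    let mut outgoing_headers = HeaderMap::new();
    propagator.inject_context(
        &span.context(),
        &mut HeaderInjector(&mut outgoing_headers),
    );

    // Instrument the future instead of holding an `enter()` guard across `.await`
    client.get("/api/users")
        .headers(outgoing_headers)
        .send()
        .instrument(span)
        .await
}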
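
W3C Trace Context carries the trace identity in a `traceparent` header of the form `version-traceid-parentid-flags`. The sketch below parses and formats version `00` of that header with only the standard library, to show what the propagator reads and writes. `TraceParent`, `parse_traceparent`, and `format_traceparent` are hypothetical names, and a real service would rely on the OpenTelemetry propagator rather than hand-rolled parsing.

```rust
/// The fields of a version-00 `traceparent` header (illustrative type).
#[derive(Debug, PartialEq, Eq)]
pub struct TraceParent {
    /// 32 lowercase hex characters identifying the whole trace.
    pub trace_id: String,
    /// 16 lowercase hex characters identifying the calling span.
    pub parent_id: String,
    /// Lowest bit of the flags byte: the caller sampled this trace.
    pub sampled: bool,
}

/// Parse a `traceparent` value. Returns `None` for anything that is not a
/// well-formed version-00 header.
pub fn parse_traceparent(header: &str) -> Option<TraceParent> {
    let parts: Vec<&str> = header.trim().split('-').collect();
    if parts.len() != 4 {
        return None;
    }
    let (version, trace_id, parent_id, flags) = (parts[0], parts[1], parts[2], parts[3]);

    let is_lower_hex = |s: &str, len: usize| {
        s.len() == len && s.bytes().all(|b| matches!(b, b'0'..=b'9' | b'a'..=b'f'))
    };
    if version != "00"
        || !is_lower_hex(trace_id, 32)
        || !is_lower_hex(parent_id, 16)
        || !is_lower_hex(flags, 2)
    {
        return None;
    }

    // All-zero identifiers are defined as invalid
    let all_zero = |s: &str| s.bytes().all(|b| b == b'0');
    if all_zero(trace_id) || all_zero(parent_id) {
        return None;
    }

    let flag_bits = u8::from_str_radix(flags, 16).ok()?;
    Some(TraceParent {
        trace_id: trace_id.to_string(),
        parent_id: parent_id.to_string(),
        sampled: (flag_bits & 0x01) == 0x01,
    })
}

/// Format the header for an outgoing request.
pub fn format_traceparent(tp: &TraceParent) -> String {
    format!("00-{}-{}-{:02x}", tp.trace_id, tp.parent_id, u8::from(tp.sampled))
}
```

When calling a downstream service, the trace ID stays the same while the parent ID is replaced with the ID of the calling span, which is how the backend reassembles the call tree.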

Anti-Patterns

Format-string logging: Use structured key=value fields instead of interpolating values into the message

// BAD
info!("User {} processed with status {}", user_id, status);

// GOOD (fields come before the message)
info!(user_id, status, "User processed");

No span context: Always wrap request handling in spans with request ID
Synchronous logging in hot path: Use a non-blocking writer (for example tracing-appender's non_blocking) in high-throughput services
Hardcoded log levels: Respect environment variable configuration
Logging sensitive data: Never log passwords, tokens, PII without redaction
No metrics: Always instrument critical paths (requests, errors, latency)
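
For the sensitive-data item above, one possible approach is to mask values by field name before they reach the logger. The following std-only sketch assumes a fixed list of sensitive key fragments. `SENSITIVE_KEYS` and `redact_fields` are hypothetical, and the list would need to reflect the application's own data.

```rust
/// Key fragments treated as sensitive (assumed list, adjust per application).
const SENSITIVE_KEYS: [&str; 4] = ["password", "token", "secret", "authorization"];

/// Return a copy of the fields in which the value of every field whose key
/// contains a sensitive fragment (case-insensitive) is replaced by a marker.
pub fn redact_fields(fields: &[(&str, &str)]) -> Vec<(String, String)> {
    fields
        .iter()
        .map(|(key, value)| {
            let lowered = key.to_ascii_lowercase();
            let sensitive = SENSITIVE_KEYS.iter().any(|frag| lowered.contains(*frag));
            let shown = if sensitive {
                "[REDACTED]".to_string()
            } else {
                (*value).to_string()
            };
            ((*key).to_string(), shown)
        })
        .collect()
}
```

Matching on key names only catches fields that are labeled as sensitive. Free-text messages and PII stored under neutral keys still need review.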

High cardinality labels: Avoid unbounded label values (user_id in label)

// BAD: Unbounded cardinality
REQUEST_COUNTER.with_label_values(&[method, &user_id.to_string()]).inc();

// GOOD: Fixed cardinality
REQUEST_COUNTER.with_label_values(&[method, "success"]).inc();
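
One way to keep label values bounded is to normalize them before they are used as labels. The std-only sketch below groups status codes into classes and replaces numeric path segments with a placeholder. `status_class` and `route_template` are hypothetical helpers, and a web framework's matched route pattern would be a better source for the path label when it is available.

```rust
/// Map an HTTP status code to one of a fixed set of class labels.
pub fn status_class(status: u16) -> &'static str {
    match status {
        100..=199 => "1xx",
        200..=299 => "2xx",
        300..=399 => "3xx",
        400..=499 => "4xx",
        500..=599 => "5xx",
        _ => "other",
    }
}

/// Replace purely numeric path segments with ":id" so that
/// "/users/1" and "/users/2" share one label value.
pub fn route_template(path: &str) -> String {
    path.split('/')
        .map(|segment| {
            let numeric = !segment.is_empty() && segment.bytes().all(|b| b.is_ascii_digit());
            if numeric { ":id" } else { segment }
        })
        .collect::<Vec<_>>()
        .join("/")
}
```

For example, `route_template("/users/12345/orders/7")` yields `/users/:id/orders/:id`, so the number of distinct label values depends on the number of routes rather than the number of users.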

No health checks: Orchestrators can't detect unhealthy instances
Sampling off: Exporting every trace is costly at production volume; enable sampling (tail-based where the backend supports it)
