Log Analysis

Find answers in logs efficiently. Structured analysis techniques for local files, cloud log aggregators, and production incident investigation.

Principles of Effective Log Analysis

Start with time — narrow to the window around the incident before anything else
Use trace/request IDs — follow a single request across services rather than scanning all logs
Group and count — find patterns before reading individual entries
Work backwards from the symptom — errors first, then the events that preceded them
Correlation beats isolation — logs alone tell half the story; combine with metrics and traces
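The first three principles (narrow by time, filter, then group and count) can be sketched in a few lines. This is a minimal illustration, assuming newline-delimited JSON logs whose `timestamp`, `level`, and `error.type` fields match the examples used later in this document; the function names are hypothetical.

```typescript
// Hypothetical entry shape; field names follow the examples in this document.
interface LogEntry {
  timestamp: string; // ISO 8601 in one timezone, so string order matches time order
  level: string;
  message: string;
  error?: { type?: string };
}

// Parse newline-delimited JSON, skipping lines that are not valid JSON.
function parseLines(raw: string): LogEntry[] {
  const entries: LogEntry[] = [];
  for (const line of raw.split("\n")) {
    if (line.trim() === "") continue;
    try {
      entries.push(JSON.parse(line) as LogEntry);
    } catch {
      // Non-JSON line (banner, stack trace fragment): ignore it.
    }
  }
  return entries;
}

// Narrow to the window [from, to), keep errors, then group and count by type.
function countErrorsByType(
  entries: LogEntry[],
  from: string,
  to: string,
): Array<[string, number]> {
  const counts = new Map<string, number>();
  for (const e of entries) {
    if (e.timestamp < from || e.timestamp >= to) continue;
    if (e.level !== "error") continue;
    const key = e.error?.type ?? "unknown";
    counts.set(key, (counts.get(key) ?? 0) + 1);
  }
  // Most frequent first.
  return Array.from(counts.entries()).sort((a, b) => b[1] - a[1]);
}
```

Reading the grouped counts first usually shows whether one error type dominates before any individual entry needs to be opened.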

Command-Line Log Analysis

Essential Unix Tools

# --- BASIC FILTERING ---

# Filter by log level
grep -i "error\|warn\|fatal" app.log

# Filter by time window (ISO timestamps in logs)
grep "2024-01-15T14:[3-5][0-9]" app.log  # 14:30–14:59

# Filter by field value in JSON logs
grep '"status":5' app.log
grep '"user_id":"u_123"' app.log

# Combine filters (AND)
grep '"level":"error"' app.log | grep "payment-service"

# --- STRUCTURED JSON LOG PARSING ---

# jq — best tool for JSON log files
cat app.log | jq 'select(.level == "error") | {time: .timestamp, msg: .message, err: .error}'

# Count errors by type
cat app.log | jq -r 'select(.level == "error") | .error.type // "unknown"' | sort | uniq -c | sort -rn | head -n 20

# Summarize errors in a time window
cat app.log | jq -r 'select(.timestamp > "2024-01-15T14:00" and .level == "error") | .message' | \
  sort | uniq -c | sort -rn | head -n 30

# Extract all unique request IDs that hit errors
cat app.log | jq -r 'select(.level == "error") | .trace_id' | sort -u

# --- FREQUENCY AND PATTERN ANALYSIS ---

# Count log lines per minute (ISO timestamps)
grep . app.log | cut -c1-16 | sort | uniq -c

# Top error messages
grep '"level":"error"' app.log | jq -r '.message' | sort | uniq -c | sort -rn | head -n 20

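# Error count per 5-minute bucket (assumes JSON logs with an ISO 8601 "timestamp" field)
jq -r 'select(.level == "error") | .timestamp[:16]' app.log | \
  awk '{ printf "%s%02d\n", substr($0, 1, 14), int(substr($0, 15, 2) / 5) * 5 }' | sort | uniq -c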

# --- FOLLOWING A SINGLE REQUEST ---

# Find request ID from an error, then show all its log lines
grep "trace_id_here" app.log | jq .

# Follow multiple services for same request
# (sorts on the 4th quote-delimited field, so this assumes "timestamp" is the first JSON key)
grep "req-abc-123" service-a.log service-b.log service-c.log | sort -t'"' -k4

Multi-File / Multi-Service Analysis

# Interleave logs from multiple services, sorted by timestamp
# (sort -m merges files that are already sorted; lines must start with the timestamp)
sort -m api.log worker.log db-proxy.log | \
  grep "req-abc-123"

# Or with awk to prepend service name
for f in api.log worker.log db.log; do
  awk -v svc="${f%.log}" '{print svc" "$0}' "$f"
done | sort -k2 | grep "req-abc-123"

# Watch live logs with filtering (streaming)
tail -f app.log | grep --line-buffered '"level":"error"' | jq .
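When the timestamp is not the first field on each line, the `sort`-based merges above become fragile. A sketch of the same idea that parses the timestamp instead, assuming JSON lines with a `timestamp` field; `mergeByTimestamp` and `TaggedLine` are hypothetical names, and the log lines are passed in as arrays rather than read from disk:

```typescript
interface TaggedLine {
  service: string;
  timestamp: string; // empty when the line has no parseable timestamp
  line: string;
}

// Collect lines mentioning the trace ID from every service, tag each with its
// service name, and order them by the parsed timestamp field.
function mergeByTimestamp(
  sources: Record<string, string[]>,
  traceId: string,
): TaggedLine[] {
  const merged: TaggedLine[] = [];
  for (const service of Object.keys(sources)) {
    for (const line of sources[service]) {
      if (!line.includes(traceId)) continue;
      let timestamp = "";
      try {
        const parsed = JSON.parse(line) as { timestamp?: unknown };
        if (typeof parsed.timestamp === "string") timestamp = parsed.timestamp;
      } catch {
        // Not JSON: keep the line, it sorts before timestamped lines.
      }
      merged.push({ service, timestamp, line });
    }
  }
  // ISO 8601 timestamps in the same timezone sort correctly as strings.
  merged.sort((a, b) =>
    a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0,
  );
  return merged;
}
```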

Loki / LogQL

Loki uses LogQL — a log query language similar to PromQL.

LogQL Fundamentals

# Stream selector — REQUIRED, selects which log streams
{app="api-server", env="production"}

# Add a filter expression
{app="api-server"} |= "error"           # Contains string
{app="api-server"} != "healthcheck"     # Does not contain
{app="api-server"} |~ `user_id=\d+`    # Regex match (backticks avoid escaping the backslash)
{app="api-server"} !~ "GET /health"    # Regex not match

# JSON parser — extract fields from JSON logs
{app="api-server"} | json
{app="api-server"} | json | level="error"
{app="api-server"} | json | duration > 1000   # duration > 1000ms

# Label filter after parsing
{app="api-server"} | json | status_code >= 500 | line_format "{{.timestamp}} [{{.level}}] {{.message}}"

# Pattern parser (for non-JSON structured logs)
{app="nginx"} | pattern `<ip> - - [<ts>] "<method> <path> <proto>" <status> <size>`
             | status >= 500

Metric Queries (LogQL aggregations)

# Error rate (errors per second, averaged over 1-minute windows)
rate({app="api-server"} | json | level="error" [1m])

# Request rate
rate({app="api-server"} | json [1m])

# Error ratio
sum(rate({app="api-server"} | json | level="error" [5m])) 
  / 
sum(rate({app="api-server"} | json [5m]))

# Top 10 most frequent error messages
topk(10,
  sum by (message) (
    count_over_time({app="api-server"} | json | level="error" [1h])
  )
)

# P99 latency from logs (when metrics aren't available)
quantile_over_time(0.99,
  {app="api-server"} | json | unwrap duration [5m]
) by (path)

Useful Loki Patterns for Incidents

# All errors across all services in prod (set the query time range to the last 15 minutes)
{env="production"} | json | level="error" | line_format "{{.service}} | {{.message}} | {{.error}}"

# Slow requests (duration over 2000 ms), with trace IDs for follow-up
{app="api-server"} | json 
  | duration > 2000 
  | line_format "{{.method}} {{.path}} {{.duration}}ms {{.trace_id}}"

# Follow a specific user's session
{env="production"} | json | user_id="u_123456"

# Find all log lines for a distributed trace
{env="production"} | json | trace_id="abc-def-123"
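The time window for a log query is not part of the LogQL expression itself; it is supplied by the client (the Grafana time picker, or `start` and `end` parameters on Loki's `query_range` HTTP endpoint). A minimal sketch of building such a request URL, assuming a hypothetical base URL and helper name; authentication and the actual HTTP call are left out:

```typescript
// Build a Loki query_range URL covering the `windowMinutes` before `end`.
// Timestamps are sent as Unix epoch nanoseconds, written as strings to avoid
// losing precision in a JavaScript number.
function buildLokiRangeUrl(
  baseUrl: string, // hypothetical, e.g. "http://loki.internal:3100"
  logql: string,
  end: Date,
  windowMinutes: number,
  limit: number = 100,
): string {
  const endMs = end.getTime();
  const startMs = endMs - windowMinutes * 60 * 1000;
  const url = new URL("/loki/api/v1/query_range", baseUrl);
  url.searchParams.set("query", logql);
  url.searchParams.set("start", `${startMs}000000`); // ms -> ns
  url.searchParams.set("end", `${endMs}000000`);
  url.searchParams.set("limit", String(limit));
  url.searchParams.set("direction", "backward"); // newest entries first
  return url.toString();
}

// Usage: all production errors in the 15 minutes before a known incident time.
const incidentUrl = buildLokiRangeUrl(
  "http://loki.internal:3100",
  '{env="production"} | json | level="error"',
  new Date("2024-01-15T15:00:00Z"),
  15,
);
```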

Elasticsearch / OpenSearch Query DSL

Basic Queries

// Match logs with level=error in the last 15 minutes
POST /logs-*/_search
{
  "query": {
    "bool": {
      "must": [
        { "term": { "level": "error" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-15m" } } }
      ]
    }
  },
  "sort": [{ "@timestamp": { "order": "desc" } }],
  "size": 50
}

// Full-text search in message field
{
  "query": {
    "bool": {
      "must": [
        { "match": { "message": "connection refused" } }
      ],
      "filter": [
        { "range": { "@timestamp": { "gte": "now-1h" } } },
        { "term": { "service.name": "payment-api" } }
      ]
    }
  }
}

Aggregations (for pattern analysis)

// Error counts by service and hour
POST /logs-*/_search
{
  "size": 0,
  "query": { 
    "bool": { 
      "filter": [
        { "term": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-24h" } } }
      ] 
    }
  },
  "aggs": {
    "by_service": {
      "terms": { "field": "service.name", "size": 20 },
      "aggs": {
        "over_time": {
          "date_histogram": {
            "field": "@timestamp",
            "calendar_interval": "1h"
          }
        }
      }
    }
  }
}

// Top error messages with most frequent first
{
  "size": 0,
  "query": {
    "bool": {
      "filter": [
        { "term": { "level": "error" } },
        { "range": { "@timestamp": { "gte": "now-1h" } } }
      ]
    }
  },
  "aggs": {
    "top_errors": {
      "terms": {
        "field": "message.keyword",
        "size": 25,
        "order": { "_count": "desc" }
      }
    }
  }
}
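The top-errors aggregation above can be parameterised and its response flattened into a short report. A sketch only, assuming the field names used in the queries in this section and the standard `terms` aggregation response shape (`buckets` with `key` and `doc_count`); the helper names are hypothetical and no client library is involved:

```typescript
interface TermsBucket {
  key: string;
  doc_count: number;
}

interface TopErrorsResponse {
  aggregations?: { top_errors?: { buckets: TermsBucket[] } };
}

// Build the request body for "top error messages since <since>".
function buildTopErrorsQuery(since: string, size: number): Record<string, unknown> {
  return {
    size: 0, // aggregation only, no hits
    query: {
      bool: {
        filter: [
          { term: { level: "error" } },
          { range: { "@timestamp": { gte: since } } },
        ],
      },
    },
    aggs: {
      top_errors: { terms: { field: "message.keyword", size } },
    },
  };
}

// Turn the aggregation buckets into "count<TAB>message" lines.
function formatTopErrors(response: TopErrorsResponse): string[] {
  const buckets = response.aggregations?.top_errors?.buckets ?? [];
  return buckets.map((b) => `${b.doc_count}\t${b.key}`);
}
```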

Structured Logging Best Practices

The Right Way to Structure Logs

// Bad — hard to parse, hard to alert on
logger.info(`User ${userId} purchased ${itemId} for ${price} in ${duration}ms`);

// Good — every field is queryable
logger.info({
  event: 'purchase.completed',
  user_id: userId,
  item_id: itemId,
  amount_cents: price,
  duration_ms: duration,
  trace_id: ctx.traceId,
  span_id: ctx.spanId,
});

Required Fields for Every Log Entry

interface BaseLogContext {
  timestamp: string;        // ISO 8601
  level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
  message: string;          // Human-readable summary
  service: string;          // Service name
  version: string;          // Service version / git SHA
  environment: string;      // production, staging, dev
  trace_id?: string;        // Distributed trace ID (from W3C traceparent)
  span_id?: string;         // Current span ID
  user_id?: string;         // When in user context
  request_id?: string;      // HTTP request ID
  
  // For error logs, always include:
  error?: {
    type: string;           // Error class name
    message: string;
    stack?: string;
  };
}
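Filling the `error` object consistently is easier with one helper that normalises whatever was thrown. A minimal sketch, assuming the field names from the interface above; `toErrorFields` and `buildErrorEntry` are hypothetical names, not part of any logging library:

```typescript
interface ErrorFields {
  type: string; // error class name
  message: string;
  stack?: string;
}

// Normalise any thrown value (Error or not) into the error object shape.
function toErrorFields(err: unknown): ErrorFields {
  if (err instanceof Error) {
    return {
      type: err.constructor.name || err.name,
      message: err.message,
      stack: err.stack,
    };
  }
  // Strings, numbers, plain objects: record what kind of value was thrown.
  return { type: typeof err, message: String(err) };
}

// Assemble a complete error-level entry from the service-wide base fields.
function buildErrorEntry(
  message: string,
  err: unknown,
  base: { service: string; version: string; environment: string },
  now: Date = new Date(),
) {
  return {
    timestamp: now.toISOString(),
    level: "error" as const,
    message,
    ...base,
    error: toErrorFields(err),
  };
}
```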

Using pino (Node.js)

import pino from 'pino';

const logger = pino({
  level: process.env.LOG_LEVEL ?? 'info',
  base: {
    service: 'payment-api',
    version: process.env.GIT_SHA ?? 'unknown',
    env: process.env.NODE_ENV ?? 'production',
  },
  // Redact sensitive fields before writing
  redact: {
    paths: ['req.headers.authorization', 'body.password', 'body.credit_card'],
    censor: '[REDACTED]',
  },
  // Production: write JSON; Development: pretty-print
  transport: process.env.NODE_ENV === 'development'
    ? { target: 'pino-pretty', options: { colorize: true } }
    : undefined,
});

export default logger;

Log Levels — When to Use Each

| Level | When to use | Example |
|-------|------------|---------|
| debug | Internal state during development | Loop iteration values, intermediate results |
| info | Normal events important to track | Request handled, job completed, record created |
| warn | Something unexpected but handled | Retry attempt, fallback used, deprecated endpoint called |
| error | Something failed but service continues | Failed to process a specific request, DB query failed |
| fatal | Service cannot continue | Cannot connect to database, out of memory |

In production, set the threshold to info or warn. Avoid running at debug in production: the extra log volume drives up ingestion and storage cost.
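A threshold simply means "emit this level and everything more severe". A small sketch of that comparison, assuming the five levels listed above in increasing severity; loggers such as pino implement this internally, so the `isEnabled` helper is purely illustrative:

```typescript
// Levels in increasing order of severity.
const LEVELS = ["debug", "info", "warn", "error", "fatal"] as const;
type Level = (typeof LEVELS)[number];

// True when an entry at `level` should be written given the configured threshold.
function isEnabled(level: Level, threshold: Level): boolean {
  return LEVELS.indexOf(level) >= LEVELS.indexOf(threshold);
}
```

With a threshold of `info`, `isEnabled("debug", "info")` is false while `isEnabled("warn", "info")` is true.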

Tracing Requests Across Services

The Correlation ID Pattern

Every external request should get a unique ID that propagates through all downstream calls.

// Express middleware — assign or forward trace ID
app.use((req, res, next) => {
  // Use incoming W3C traceparent, or generate new trace
  const traceParent = req.headers['traceparent'];
  const traceId = traceParent
    ? extractTraceId(traceParent)
    : crypto.randomUUID().replace(/-/g, '');
  
  // Store on request context
  req.traceId = traceId;
  
  // Return the trace context to the caller; downstream calls set the header themselves
  res.setHeader('traceparent', buildTraceParent(traceId));
  
  next();
});

// Pass trace ID to all downstream HTTP calls
async function callPaymentService(traceId: string, payload: PaymentPayload) {
  return fetch('https://payment.internal/charge', {
    method: 'POST',
    headers: { 
      'traceparent': buildTraceParent(traceId),
      'Content-Type': 'application/json',
    },
    body: JSON.stringify(payload),
  });
}
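The snippets above call `extractTraceId` and `buildTraceParent` without defining them. One possible implementation, offered as a sketch: it assumes the W3C Trace Context header layout `version-traceid-parentid-flags` (2, 32, 16, and 2 lowercase hex characters), always emits version `00`, and generates a fresh span ID unless one is passed in. Returning `null` for a malformed header is a design choice made here so the caller can fall back to generating a new trace ID.

```typescript
import { randomBytes } from "node:crypto";

// version - trace-id (32 hex) - parent-id (16 hex) - flags (2 hex)
const TRACEPARENT_RE = /^([0-9a-f]{2})-([0-9a-f]{32})-([0-9a-f]{16})-([0-9a-f]{2})$/;

// Return the trace ID from a traceparent header, or null if it is malformed.
function extractTraceId(traceParent: string): string | null {
  const match = TRACEPARENT_RE.exec(traceParent.trim());
  if (match === null) return null;
  const traceId = match[2];
  // An all-zero trace ID is defined as invalid.
  if (/^0+$/.test(traceId)) return null;
  return traceId;
}

// Build a version-00 traceparent header for an outgoing call.
function buildTraceParent(
  traceId: string,
  spanId: string = randomBytes(8).toString("hex"), // 16 hex characters
  sampled: boolean = true,
): string {
  return `00-${traceId}-${spanId}-${sampled ? "01" : "00"}`;
}
```

The fallback in the middleware, `crypto.randomUUID().replace(/-/g, '')`, yields 32 hex characters, so IDs generated that way fit the same header format.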

With trace IDs in every log entry, finding all logs for one request across 10 services is a single query:

{env="production"} | json | trace_id="ab12cd34ef56"

Incident Log Investigation Playbook

Step 1: Establish the Timeline

# When did errors start?
cat app.log | jq -r 'select(.level == "error") | .timestamp' | sort | head -5

# What was error rate before vs. during incident?
# Count errors per minute, then compare the counts before and during the incident
cat app.log | jq -r 'select(.level == "error") | .timestamp[:16]' | sort | uniq -c

Step 2: Identify the First Failure

# Find the very first error of the incident type
grep -m1 '"level":"error"' app.log | jq .

# Look at logs 30 seconds BEFORE the first error
# (What changed? What was the last normal request?)
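The "30 seconds before the first error" step has no command attached. A sketch of one way to pull that context window, assuming entries already parsed from JSON lines with ISO 8601 timestamps; `contextBeforeFirstError` is a hypothetical helper name:

```typescript
interface TimedEntry {
  timestamp: string; // ISO 8601
  level: string;
  message: string;
}

// Return the entries logged in the `seconds` before the earliest error or
// fatal entry, in time order. Returns [] when there is no error at all.
function contextBeforeFirstError(
  entries: TimedEntry[],
  seconds: number,
): TimedEntry[] {
  const sorted = entries
    .slice()
    .sort((a, b) => Date.parse(a.timestamp) - Date.parse(b.timestamp));
  const firstError = sorted.find(
    (e) => e.level === "error" || e.level === "fatal",
  );
  if (firstError === undefined) return [];
  const errorMs = Date.parse(firstError.timestamp);
  const fromMs = errorMs - seconds * 1000;
  return sorted.filter((e) => {
    const t = Date.parse(e.timestamp);
    return t >= fromMs && t < errorMs;
  });
}
```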

Step 3: Characterize the Failure

# What fraction of requests are failing?
echo "Total requests:"; grep '"event":"request.completed"' app.log | wc -l
echo "Errors:"; grep '"level":"error"' app.log | wc -l

# What error types are present?
jq -r 'select(.level == "error") | .error.type // .message' app.log | \
  sort | uniq -c | sort -rn | head -n 20

# Which endpoints / code paths?
jq -r 'select(.level == "error") | .path // .handler // "unknown"' app.log | \
  sort | uniq -c | sort -rn | head -n 10
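Comparing raw line counts can overstate the failure rate, because one failing request may log several error lines. A sketch that counts distinct requests instead, assuming every request's entries carry a `trace_id`; the helper name is hypothetical:

```typescript
interface RequestLog {
  trace_id?: string;
  level: string;
}

// Fraction of distinct trace IDs that logged at least one error or fatal entry.
// Entries without a trace ID are ignored because they cannot be attributed.
function failingRequestFraction(entries: RequestLog[]): number {
  const all = new Set<string>();
  const failed = new Set<string>();
  for (const e of entries) {
    if (e.trace_id === undefined) continue;
    all.add(e.trace_id);
    if (e.level === "error" || e.level === "fatal") failed.add(e.trace_id);
  }
  return all.size === 0 ? 0 : failed.size / all.size;
}
```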

Step 4: Sample a Failing Request End-to-End

# Get a trace ID from a failed request
TRACE_ID=$(jq -r 'select(.level == "error") | .trace_id' app.log | head -1)

# Find all log lines for that trace across all log files
grep "$TRACE_ID" *.log | sort -t'"' -k4  # sort by timestamp field

# Build a timeline: what happened before the error?

Step 5: Look for Correlated Changes

# Were there deployments around the incident time?  
git log --oneline --after="2 hours ago" --before="now"

# Kubernetes — any pod restarts?
kubectl get events --sort-by='.lastTimestamp' -n production | grep -i "restart\|backoff\|oom\|kill"

# Configuration changes?
git log --oneline -p -- config/ "*.yaml" "*.env" | head -n 50

Log Sampling and Volume Management

At high throughput (>100k req/s), logging every request becomes expensive and noisy.

// Sample debug/info logs; always log errors
const SAMPLE_RATE = 0.01; // Log 1% of successful requests

function shouldLog(level: string): boolean {
  if (level === 'error' || level === 'warn' || level === 'fatal') return true;
  return Math.random() < SAMPLE_RATE;
}

// Or use head-based sampling tied to the trace
function shouldLogFromTrace(traceId: string, level: string): boolean {
  if (level === 'error' || level === 'fatal') return true;
  // Hash the trace ID — deterministic sampling
  // All services will log or skip the same trace
  const hash = parseInt(traceId.slice(0, 8), 16);
  return (hash % 100) < (SAMPLE_RATE * 100);
}

Always log 100% of:

Errors and fatal events
Authentication events (login, logout, token refresh, failures)
Authorization failures (403s)
State-changing mutations (write operations)
Slow requests (duration > threshold)
External API calls (request + response status)
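Some of the always-log rules can be folded into the sampling decision itself. A sketch under stated assumptions: the `SampleInput` shape and the `SLOW_MS` threshold are hypothetical, only the error, 403, and slow-request rules are covered, and the trace ID is assumed to be hex. Dividing the first 32 bits of the trace ID by 2^32 gives a value in [0, 1), which also supports sample rates below 1%.

```typescript
interface SampleInput {
  traceId: string; // hex string, e.g. a W3C trace ID
  level: string;
  durationMs?: number;
  statusCode?: number;
}

const SLOW_MS = 1000; // hypothetical slow-request threshold

// Decide whether to keep a log entry: always-log rules first, then
// deterministic sampling keyed on the trace ID.
function shouldKeep(input: SampleInput, sampleRate: number): boolean {
  if (input.level === "error" || input.level === "fatal") return true;
  if (input.statusCode === 403) return true; // authorization failure
  if ((input.durationMs ?? 0) > SLOW_MS) return true; // slow request
  // Map the first 8 hex characters to [0, 1). Every service computes the same
  // value for the same trace, so a trace is kept or dropped as a whole.
  const bucket = parseInt(input.traceId.slice(0, 8), 16) / 0x100000000;
  if (Number.isNaN(bucket)) return true; // unparseable ID: keep rather than lose data
  return bucket < sampleRate;
}
```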

Deeper Reference

For structured log query patterns and correlation techniques, see:

references/log-patterns.md — structlog/zerolog/zap configuration, Loki LogQL query library, alert rule templates, and log sampling strategies for high-throughput services
references/correlation-techniques.md — distributed request correlation via request_id propagation, session correlation, event timeline reconstruction, and head-based log sampling implementation
