engineering/debugging/skills/log-analysis/SKILL.md
This skill should be used when the user asks about reading, searching, or analyzing logs. Trigger on "logs", "log message", "log file", "grep logs", "log parsing", "log aggregation", "structured logs", "JSON logs", "log level", "debug log", "error in logs", "warning in logs", "LogQL", "Loki", "Elasticsearch query", "Kibana", "Splunk", "Datadog logs", "Cloudwatch Logs", "log correlation", "trace ID", "request ID", "log shipping", "log rotation", "log sampling", "high-cardinality logging". Also trigger for "I see this error in logs", "what does this log message mean", "how do I find related log entries", or "why is there a spike in errors".
Find answers in logs efficiently. Structured analysis techniques for local files, cloud log aggregators, and production incident investigation.
# --- BASIC FILTERING ---
# Filter by log level
grep -i "error\|warn\|fatal" app.log
# Filter by time window (ISO timestamps in logs)
grep "2024-01-15T14:[3-5][0-9]" app.log # 14:30–14:59
# Filter by field value in JSON logs
grep '"status":5' app.log
grep '"user_id":"u_123"' app.log
# Combine filters (AND)
grep '"level":"error"' app.log | grep "payment-service"
# --- STRUCTURED JSON LOG PARSING ---
# jq — best tool for JSON log files
cat app.log | jq 'select(.level == "error") | {time: .timestamp, msg: .message, err: .error}'
# Count errors by type
cat app.log | jq -r 'select(.level == "error") | .error.type // "unknown"' | sort | uniq -c | sort -rn | head -n 20
# Summarize errors in a time window (ISO timestamps compare correctly as strings)
cat app.log | jq -r 'select(.level == "error" and .timestamp >= "2024-01-15T14:00" and .timestamp < "2024-01-15T15:00") | .message' | \
sort | uniq -c | sort -rn | head -n 30
# Extract all unique request IDs that hit errors
cat app.log | jq -r 'select(.level == "error") | .trace_id' | sort -u
# --- FREQUENCY AND PATTERN ANALYSIS ---
# Count log lines per minute (ISO timestamps)
grep . app.log | cut -c1-16 | sort | uniq -c
# Top error messages
grep '"level":"error"' app.log | jq -r '.message' | sort | uniq -c | sort -rn | head -n 20
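# Error count per 5-minute bucket (round the minute down to a multiple of 5)
grep '"level":"error"' app.log | jq -r '.timestamp' | \
awk '{ printf "%s%02d\n", substr($0, 1, 14), int(substr($0, 15, 2) / 5) * 5 }' | sort | uniq -c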
# --- FOLLOWING A SINGLE REQUEST ---
# Find request ID from an error, then show all its log lines
grep "trace_id_here" app.log | jq .
# Follow multiple services for same request
grep "req-abc-123" service-a.log service-b.log service-c.log | sort -t'"' -k4 # sort by the 4th quote-delimited field (the timestamp value when "timestamp" is the first JSON key)
# Interleave logs from multiple services by timestamp
# (sort -m only merges: each file must already be sorted and each line must start with its timestamp)
sort -m api.log worker.log db-proxy.log | \
grep "req-abc-123"
# Or with awk to prepend service name
for f in api.log worker.log db.log; do
awk -v svc="${f%.log}" '{print svc" "$0}' "$f"
done | sort -k2 | grep "req-abc-123"
# Watch live logs with filtering (streaming)
tail -f app.log | grep --line-buffered '"level":"error"' | jq .
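When the same per-minute error count is needed inside a script rather than a shell pipeline, it can be sketched in TypeScript as below. The function name `errorsPerMinute` is hypothetical, and the sketch assumes one JSON object per line with `level` and ISO 8601 `timestamp` fields, as in the examples above.

```typescript
// Hypothetical helper: count error-level entries per minute from JSON log lines.
// Assumes each line is a JSON object with `level` and ISO 8601 `timestamp` fields.
interface LogLine {
  timestamp?: string;
  level?: string;
}

function errorsPerMinute(lines: string[]): Map<string, number> {
  const counts = new Map<string, number>();
  for (const line of lines) {
    let parsed: unknown;
    try {
      parsed = JSON.parse(line);
    } catch {
      continue; // skip non-JSON lines such as stack-trace fragments
    }
    if (typeof parsed !== 'object' || parsed === null) continue;
    const entry = parsed as LogLine;
    if (entry.level !== 'error' || typeof entry.timestamp !== 'string') continue;
    const minute = entry.timestamp.slice(0, 16); // "YYYY-MM-DDTHH:MM"
    counts.set(minute, (counts.get(minute) ?? 0) + 1);
  }
  return counts;
}

// Example input: three JSON lines and one plain-text line that is skipped.
const sampleLines = [
  '{"timestamp":"2024-01-15T14:30:01Z","level":"error","message":"db timeout"}',
  '{"timestamp":"2024-01-15T14:30:45Z","level":"error","message":"db timeout"}',
  '{"timestamp":"2024-01-15T14:31:10Z","level":"info","message":"request handled"}',
  'not json at all',
];
for (const [minute, count] of errorsPerMinute(sampleLines)) {
  console.log(`${minute} ${count}`);
}
```

Slicing the timestamp string is equivalent to the `cut -c1-16` and `.timestamp[:16]` tricks used in the shell examples, and it only works when all timestamps share one format and time zone.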
Loki uses LogQL — a log query language similar to PromQL.
# Stream selector — REQUIRED, selects which log streams
{app="api-server", env="production"}
# Add a filter expression
{app="api-server"} |= "error" # Contains string
{app="api-server"} != "healthcheck" # Does not contain
{app="api-server"} |~ "user_id=\d+" # Regex match
{app="api-server"} !~ "GET /health" # Regex not match
# JSON parser — extract fields from JSON logs
{app="api-server"} | json
{app="api-server"} | json | level="error"
{app="api-server"} | json | duration > 1000 # duration > 1000ms
# Label filter after parsing
{app="api-server"} | json | status_code >= 500 | line_format "{{.timestamp}} [{{.level}}] {{.message}}"
# Pattern parser (for non-JSON structured logs)
{app="nginx"} | pattern `<ip> - - [<ts>] "<method> <path> <proto>" <status> <size>`
| status >= 500
# Error rate (errors per second, averaged over a 1-minute window)
rate({app="api-server"} | json | level="error" [1m])
# Request rate
rate({app="api-server"} | json [1m])
# Error ratio
sum(rate({app="api-server"} | json | level="error" [5m]))
/
sum(rate({app="api-server"} | json [5m]))
# Top 10 most frequent error messages
topk(10,
sum by (message) (
count_over_time({app="api-server"} | json | level="error" [1h])
)
)
# P99 latency from logs (when metrics aren't available)
quantile_over_time(0.99,
{app="api-server"} | json | unwrap duration [5m]
) by (path)
# All errors in the last 15 minutes across all services in prod
{env="production"} | json | level="error" | line_format "{{.service}} | {{.message}} | {{.error}}"
# Slowest requests (P99 path analysis)
{app="api-server"} | json
| duration > 2000
| line_format "{{.method}} {{.path}} {{.duration}}ms {{.trace_id}}"
# Follow a specific user's session
{env="production"} | json | user_id="u_123456"
# Find all log lines for a distributed trace
{env="production"} | json | trace_id="abc-def-123"
// Match logs with level=error in the last 15 minutes
POST /logs-*/_search
{
"query": {
"bool": {
"must": [
{ "term": { "level": "error" } }
],
"filter": [
{ "range": { "@timestamp": { "gte": "now-15m" } } }
]
}
},
"sort": [{ "@timestamp": { "order": "desc" } }],
"size": 50
}
// Full-text search in message field
POST /logs-*/_search
{
"query": {
"bool": {
"must": [
{ "match": { "message": "connection refused" } }
],
"filter": [
{ "range": { "@timestamp": { "gte": "now-1h" } } },
{ "term": { "service.name": "payment-api" } }
]
}
}
}
// Error counts by service and hour
POST /logs-*/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "term": { "level": "error" } },
{ "range": { "@timestamp": { "gte": "now-24h" } } }
]
}
},
"aggs": {
"by_service": {
"terms": { "field": "service.name", "size": 20 },
"aggs": {
"over_time": {
"date_histogram": {
"field": "@timestamp",
"calendar_interval": "1h"
}
}
}
}
}
}
// Top error messages with most frequent first
POST /logs-*/_search
{
"size": 0,
"query": {
"bool": {
"filter": [
{ "term": { "level": "error" } },
{ "range": { "@timestamp": { "gte": "now-1h" } } }
]
}
},
"aggs": {
"top_errors": {
"terms": {
"field": "message.keyword",
"size": 25,
"order": { "_count": "desc" }
}
}
}
}
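The queries above share one shape: a `bool` query with `term` and `range` filters. A minimal sketch of building that request body in code follows. The function `buildLevelQuery` and its option names are hypothetical; the field names (`level`, `@timestamp`, `service.name`) are taken from the examples above and may differ in your index mapping.

```typescript
// Hypothetical builder for the "level in a time range" search body shown above.
// Field names are assumptions copied from the example queries.
interface LevelQueryOptions {
  level: string;    // e.g. "error"
  since: string;    // Elasticsearch date math, e.g. "now-15m"
  service?: string; // optional exact match on service.name
  size?: number;    // number of hits to return (defaults to 50)
}

function buildLevelQuery(opts: LevelQueryOptions): Record<string, unknown> {
  const filter: Record<string, unknown>[] = [
    { term: { level: opts.level } },
    { range: { '@timestamp': { gte: opts.since } } },
  ];
  if (opts.service !== undefined) {
    filter.push({ term: { 'service.name': opts.service } });
  }
  return {
    query: { bool: { filter } },
    sort: [{ '@timestamp': { order: 'desc' } }],
    size: opts.size ?? 50,
  };
}

// Print a body that could be sent to POST /logs-*/_search.
console.log(JSON.stringify(buildLevelQuery({ level: 'error', since: 'now-15m' }), null, 2));
```

Everything goes in `filter` rather than `must` here because exact matches and ranges do not need relevance scoring; a full-text `match` clause would still belong in `must`.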
// Bad — hard to parse, hard to alert on
logger.info(`User ${userId} purchased ${itemId} for ${price} in ${duration}ms`);
// Good — every field is queryable
logger.info({
event: 'purchase.completed',
user_id: userId,
item_id: itemId,
amount_cents: price,
duration_ms: duration,
trace_id: ctx.traceId,
span_id: ctx.spanId,
});
interface BaseLogContext {
timestamp: string; // ISO 8601
level: 'debug' | 'info' | 'warn' | 'error' | 'fatal';
message: string; // Human-readable summary
service: string; // Service name
version: string; // Service version / git SHA
environment: string; // production, staging, dev
trace_id?: string; // Distributed trace ID (from W3C traceparent)
span_id?: string; // Current span ID
user_id?: string; // When in user context
request_id?: string; // HTTP request ID
// For error logs, always include:
error?: {
type: string; // Error class name
message: string;
stack?: string;
};
}
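One way to populate the `error` field consistently is a small normalizing helper, sketched below. The name `toErrorField` is hypothetical, and the sketch uses `Error.name` as the "class name", which is only accurate for custom error classes that set `name` themselves.

```typescript
// Shape of the `error` field from the interface above.
interface ErrorField {
  type: string;
  message: string;
  stack?: string;
}

// Hypothetical helper: turn any thrown value into the structured `error` field.
function toErrorField(err: unknown): ErrorField {
  if (err instanceof Error) {
    return {
      type: err.name, // "Error", "TypeError", or whatever the class sets as `name`
      message: err.message,
      ...(err.stack ? { stack: err.stack } : {}),
    };
  }
  // Non-Error values (strings, numbers, plain objects) can also be thrown.
  return { type: typeof err, message: String(err) };
}

try {
  throw new TypeError('amount must be a number');
} catch (err) {
  console.log(JSON.stringify({ level: 'error', message: 'charge failed', error: toErrorField(err) }));
}
```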
import pino from 'pino';
const logger = pino({
level: process.env.LOG_LEVEL ?? 'info',
base: {
service: 'payment-api',
version: process.env.GIT_SHA ?? 'unknown',
environment: process.env.NODE_ENV ?? 'production',
},
// Redact sensitive fields before writing
redact: {
paths: ['req.headers.authorization', 'body.password', 'body.credit_card'],
censor: '[REDACTED]',
},
// Production: write JSON; Development: pretty-print
transport: process.env.NODE_ENV === 'development'
? { target: 'pino-pretty', options: { colorize: true } }
: undefined,
});
export default logger;
| Level | When to use | Example |
|-------|------------|---------|
| debug | Internal state during development | Loop iteration values, intermediate results |
| info | Normal events important to track | Request handled, job completed, record created |
| warn | Something unexpected but handled | Retry attempt, fallback used, deprecated endpoint called |
| error | Something failed but service continues | Failed to process a specific request, DB query failed |
| fatal | Service cannot continue | Cannot connect to database, out of memory |
Production: set the threshold to info or warn. Avoid leaving debug enabled in production, because the extra log volume drives up ingestion and storage cost.
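The threshold rule can be sketched as a simple ordering check. The names `LEVEL_ORDER` and `isEnabled` are hypothetical, and the numeric values are arbitrary as long as they preserve the order in the table; real logging libraries ship their own equivalent.

```typescript
type Level = 'debug' | 'info' | 'warn' | 'error' | 'fatal';

// Arbitrary ascending weights; only the relative order matters.
const LEVEL_ORDER: Record<Level, number> = {
  debug: 10,
  info: 20,
  warn: 30,
  error: 40,
  fatal: 50,
};

// A message is emitted when its level is at or above the configured threshold.
function isEnabled(level: Level, threshold: Level): boolean {
  return LEVEL_ORDER[level] >= LEVEL_ORDER[threshold];
}

const threshold: Level = 'info'; // a typical production setting
for (const level of ['debug', 'info', 'warn', 'error', 'fatal'] as Level[]) {
  console.log(`${level}: ${isEnabled(level, threshold) ? 'logged' : 'dropped'}`);
}
```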
Every external request should get a unique ID that propagates through all downstream calls.
// Express middleware — assign or forward trace ID
app.use((req, res, next) => {
// Use incoming W3C traceparent, or generate new trace
const traceParent = req.headers['traceparent'];
const traceId = traceParent
? extractTraceId(traceParent)
: crypto.randomUUID().replace(/-/g, '');
// Store on request context
req.traceId = traceId;
// Echo the trace context back to the caller; downstream calls must set the header themselves (see callPaymentService)
res.setHeader('traceparent', buildTraceParent(traceId));
next();
});
// Pass trace ID to all downstream HTTP calls
async function callPaymentService(traceId: string, payload: PaymentPayload) {
return fetch('https://payment.internal/charge', {
method: 'POST',
headers: {
'traceparent': buildTraceParent(traceId),
'Content-Type': 'application/json',
},
body: JSON.stringify(payload),
});
}
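The snippets above call `extractTraceId` and `buildTraceParent` without defining them. A minimal sketch follows, assuming the W3C Trace Context header layout `version-traceid-parentid-flags` (2, 32, 16, and 2 lowercase hex characters). The `randomHex` helper and the choice to return `null` for a malformed header are assumptions of this sketch, not part of the original middleware.

```typescript
// version(2 hex) - trace-id(32 hex) - parent-id(16 hex) - flags(2 hex)
const TRACEPARENT_RE = /^[0-9a-f]{2}-([0-9a-f]{32})-[0-9a-f]{16}-[0-9a-f]{2}$/;

// Returns the 32-character trace ID, or null when the header is malformed.
function extractTraceId(traceParent: string): string | null {
  const match = TRACEPARENT_RE.exec(traceParent.trim());
  const traceId = match?.[1];
  // An all-zero trace ID is invalid per the W3C spec.
  if (!traceId || /^0+$/.test(traceId)) return null;
  return traceId;
}

// Hypothetical helper. Math.random is fine for a sketch, but prefer a
// cryptographic source of randomness for real span IDs.
function randomHex(length: number): string {
  let out = '';
  for (let i = 0; i < length; i++) {
    out += Math.floor(Math.random() * 16).toString(16);
  }
  return out;
}

// Builds a version-00 header with the "sampled" flag (01) set.
function buildTraceParent(traceId: string, spanId: string = randomHex(16)): string {
  return `00-${traceId}-${spanId}-01`;
}

const incoming = '00-4bf92f3577b34da6a3ce929d0e0e4736-00f067aa0ba902b7-01';
const traceId = extractTraceId(incoming);
if (traceId !== null) {
  console.log(buildTraceParent(traceId)); // same trace ID, fresh span ID
}
```

Because this `extractTraceId` can return `null`, a caller such as the middleware above would need to fall back to generating a new trace ID when the incoming header is invalid.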
With trace IDs in every log entry, finding all logs for one request across 10 services is a single query:
{env="production"} | json | trace_id="ab12cd34ef56"
# When did errors start?
cat app.log | jq -r 'select(.level == "error") | .timestamp' | sort | head -5
# What was error rate before vs. during incident?
# Count errors per minute for last 2 hours
cat app.log | jq -r 'select(.level == "error") | .timestamp[:16]' | sort | uniq -c
# Find the very first error of the incident type
grep -m1 '"level":"error"' app.log | jq .
# Look at logs 30 seconds BEFORE the first error
# (What changed? What was the last normal request?)
# What fraction of requests are failing?
echo "Total requests:"; grep '"event":"request.completed"' app.log | wc -l
echo "Errors:"; grep '"level":"error"' app.log | wc -l
# What error types are present?
jq -r 'select(.level == "error") | .error.type // .message' app.log | \
sort | uniq -c | sort -rn | head -n 20
# Which endpoints / code paths?
jq -r 'select(.level == "error") | .path // .handler // "unknown"' app.log | \
sort | uniq -c | sort -rn | head -n 10
# Get a trace ID from a failed request
TRACE_ID=$(jq -r 'select(.level == "error") | .trace_id' app.log | head -1)
# Find all log lines for that trace across all log files
grep "$TRACE_ID" *.log | sort -t'"' -k4 # sort by the timestamp value (assumes "timestamp" is the first JSON key)
# Build a timeline: what happened before the error?
# Were there deployments around the incident time?
git log --oneline --after="2 hours ago" --before="now"
# Kubernetes — any pod restarts?
kubectl get events --sort-by='.lastTimestamp' -n production | grep -i "restart\|backoff\|oom\|kill"
# Configuration changes?
git log --oneline -p -- 'config/*.yaml' 'config/*.env' | head -n 50
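The "build a timeline" step can also be done on parsed entries instead of raw text. The sketch below is one way to do it; the `TimelineEntry` shape and the `buildTimeline` name are hypothetical, and it assumes every service writes ISO 8601 timestamps in the same time zone so that string order equals time order.

```typescript
// Hypothetical parsed log entry; field names follow the JSON examples above.
interface TimelineEntry {
  timestamp: string; // ISO 8601, same zone across services
  service: string;
  level: string;
  message: string;
  trace_id?: string;
}

// Collect every entry for one trace and order it chronologically.
function buildTimeline(entries: TimelineEntry[], traceId: string): string[] {
  return entries
    .filter((e) => e.trace_id === traceId) // filter() copies, so the input is not mutated
    .sort((a, b) => (a.timestamp < b.timestamp ? -1 : a.timestamp > b.timestamp ? 1 : 0))
    .map((e) => `${e.timestamp} [${e.service}] ${e.level}: ${e.message}`);
}

const entries: TimelineEntry[] = [
  { timestamp: '2024-01-15T14:30:02.120Z', service: 'worker', level: 'error', message: 'charge failed', trace_id: 'abc' },
  { timestamp: '2024-01-15T14:30:01.900Z', service: 'api', level: 'info', message: 'request received', trace_id: 'abc' },
  { timestamp: '2024-01-15T14:30:02.000Z', service: 'api', level: 'info', message: 'unrelated request', trace_id: 'xyz' },
];
for (const line of buildTimeline(entries, 'abc')) {
  console.log(line);
}
```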
At high throughput (>100k req/s), logging every request becomes expensive and noisy.
// Sample debug/info logs; always log errors
const SAMPLE_RATE = 0.01; // Log 1% of successful requests
function shouldLog(level: string): boolean {
if (level === 'error' || level === 'warn' || level === 'fatal') return true;
return Math.random() < SAMPLE_RATE;
}
// Or use head-based sampling tied to the trace
function shouldLogFromTrace(traceId: string, level: string): boolean {
if (level === 'error' || level === 'warn' || level === 'fatal') return true;
// Hash the trace ID — deterministic sampling
// All services will log or skip the same trace
const hash = parseInt(traceId.slice(0, 8), 16);
return (hash % 100) < (SAMPLE_RATE * 100);
}
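The modulo-100 check above cannot express sample rates below 1%. A variant that maps the first 32 bits of the trace ID onto the range [0, 1) is sketched here. The function name `shouldSampleTrace` and the fail-open behavior for non-hex IDs are assumptions of this sketch, and it relies on trace IDs being roughly uniformly distributed hex strings.

```typescript
// Deterministic sampling: every service that sees the same trace ID makes the same decision.
function shouldSampleTrace(traceId: string, sampleRate: number): boolean {
  const bucket = parseInt(traceId.slice(0, 8), 16); // 0 .. 2^32 - 1
  if (Number.isNaN(bucket)) return true; // fail open: keep logs for unparseable IDs
  return bucket / 0x100000000 < sampleRate; // scale into [0, 1) and compare
}

const rate = 0.001; // keep roughly 0.1% of traces
console.log(shouldSampleTrace('4bf92f3577b34da6a3ce929d0e0e4736', rate));
console.log(shouldSampleTrace('00000a3577b34da6a3ce929d0e0e4736', rate));
```

Errors and other always-log levels should still bypass this check, as in the functions above; the sketch covers only the sampling decision itself.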
Always log 100% of:
For structured log query patterns and correlation techniques, see:
references/log-patterns.md: structlog/zerolog/zap configuration, Loki LogQL query library, alert rule templates, and log sampling strategies for high-throughput services
references/correlation-techniques.md: distributed request correlation via request_id propagation, session correlation, event timeline reconstruction, and head-based log sampling implementation