skills/log-aggregation-architect/SKILL.md
Centralized log pipeline architect with structured logging, Fluentd/Vector, and retention policies. Activate on: log aggregation, structured logging, Fluentd, Vector, Loki, ELK stack, log pipeline, log retention, centralized logging. NOT for: metrics and dashboards (use monitoring-stack-deployer), distributed tracing (use logging-observability), alerting rules (use site-reliability-engineer).
Install this skill globally with one command: `npx skillsauth add curiositech/windags-skills log-aggregation-architect`. Works with Claude Code, Cursor, and Windsurf.
Expert in designing centralized log pipelines with structured logging, efficient collection, and cost-effective retention.
Activate on: "log aggregation", "structured logging", "Fluentd config", "Vector pipeline", "Loki setup", "ELK stack", "log pipeline", "log retention policy", "centralized logging", "log shipping"
NOT for: Metrics/dashboards → monitoring-stack-deployer | Distributed tracing → logging-observability | Alerting → site-reliability-engineer
| Domain | Technologies |
|--------|-------------|
| Collection | Vector 0.43, Fluentd 1.17, Fluent Bit 3.2, OTEL Collector |
| Storage | Grafana Loki 3.x, Elasticsearch 8.x, ClickHouse, S3 archive |
| Structured Logging | JSON, logfmt, OpenTelemetry Logs, pino, winston, slog (Go) |
| Pipeline | Transform, filter, route, sample, deduplicate, redact PII |
| Visualization | Grafana (Loki), Kibana (Elastic), Grafana Explore |
# vector.toml — collect, transform, route
[sources.kubernetes]
type = "kubernetes_logs"
auto_partial_merge = true
[transforms.structured]
type = "remap"
inputs = ["kubernetes"]
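source = '''
# Parse JSON logs and merge the fields into the event so the Kubernetes
# metadata survives; non-JSON lines keep their raw message
parsed = parse_json(.message) ?? null
if is_object(parsed) {
  . = merge(., object!(parsed))
}
# Keep the timestamp already on the event; only fill it in when absent
if !exists(.timestamp) {
  .timestamp = now()
}
.service = .kubernetes.pod_labels."app.kubernetes.io/name" || "unknown"
.environment = .kubernetes.pod_namespace
# Redact email addresses (PII) from the message text
if is_string(.message) {
  .message = redact(string!(.message), filters: [r'\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b'])
}
'''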
[transforms.sampler]
type = "sample"
inputs = ["structured"]
rate = 10 # Keep 1 in 10 of the events subject to sampling (debug logs)
exclude = '.level != "debug"' # Non-debug events bypass sampling and are always kept
[sinks.loki]
type = "loki"
inputs = ["sampler"]
endpoint = "http://loki-gateway:3100"
labels.service = "{{ service }}"
labels.level = "{{ level }}"
encoding.codec = "json"
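To make the intent of the remap and sample stages easier to reason about, here is a minimal Python sketch of the same per-event logic. It is not Vector code: the function names `structure` and `keep`, the `[REDACTED]` placeholder, and the counter-based 1-in-N selection are assumptions chosen for illustration.

```python
import json
import re

# Same email pattern as the pipeline's redaction step.
EMAIL = re.compile(r"\b[A-Za-z0-9._%+-]+@[A-Za-z0-9.-]+\.[A-Za-z]{2,}\b")


def structure(event: dict) -> dict:
    """Merge a JSON message into the event, else keep the raw message."""
    out = dict(event)
    try:
        parsed = json.loads(event.get("message", ""))
    except (TypeError, ValueError):
        parsed = None  # not JSON: leave the event as it is
    if isinstance(parsed, dict):
        out.update(parsed)  # parsed fields win over existing ones
    if isinstance(out.get("message"), str):
        out["message"] = EMAIL.sub("[REDACTED]", out["message"])
    return out


def keep(event: dict, counter: int, rate: int = 10) -> bool:
    """Keep every non-debug event; keep one in `rate` debug events."""
    if event.get("level") != "debug":
        return True
    return counter % rate == 0
```

Running a handful of sample events through `structure` and `keep` is a cheap way to check field names and redaction before deploying a pipeline change.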
{
"timestamp": "2026-03-20T14:30:00.000Z",
"level": "info",
"message": "Order processed successfully",
"service": "order-api",
"trace_id": "abc123def456",
"span_id": "789ghi",
"user_id": "usr_masked",
"order_id": "ord_12345",
"duration_ms": 142,
"http": {
"method": "POST",
"path": "/api/v1/orders",
"status": 201
}
}
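A service can emit records in this shape with little code. The sketch below uses only Python's standard `logging` module; the `JsonFormatter` class, the `build_logger` helper, and the convention of passing extra fields under a `fields` key are hypothetical choices for this example, not part of any library named above.

```python
import json
import logging
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each record as one JSON object with a fixed core schema."""

    def __init__(self, service: str) -> None:
        super().__init__()
        self.service = service

    def format(self, record: logging.LogRecord) -> str:
        ts = datetime.fromtimestamp(record.created, tz=timezone.utc)
        doc = {
            "timestamp": ts.isoformat(timespec="milliseconds").replace("+00:00", "Z"),
            "level": record.levelname.lower(),
            "message": record.getMessage(),
            "service": self.service,
        }
        # Fields passed via extra={"fields": {...}} are merged at the top level.
        doc.update(getattr(record, "fields", {}))
        return json.dumps(doc)


def build_logger(service: str) -> logging.Logger:
    """Return a logger that writes one JSON document per line to stderr."""
    logger = logging.getLogger(service)
    handler = logging.StreamHandler()
    handler.setFormatter(JsonFormatter(service))
    logger.handlers = [handler]
    logger.setLevel(logging.INFO)
    logger.propagate = False
    return logger


if __name__ == "__main__":
    log = build_logger("order-api")
    log.info(
        "Order processed successfully",
        extra={"fields": {"order_id": "ord_12345", "duration_ms": 142}},
    )
```

Note that Python names its warning level `warning`, so a pipeline that matches on `warn` would need a mapping step.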
HOT (0-7 days):
├─ Full-text searchable in Loki/Elasticsearch
├─ Instant query response (<1s)
└─ Cost: $$$ (SSD, indexed)
WARM (7-30 days):
├─ Compressed, queryable with delay
├─ Query response 5-30s
└─ Cost: $$ (HDD, partial index)
COLD (30-365 days):
├─ S3/GCS archive, queryable via Athena/BigQuery
├─ Query response: minutes
└─ Cost: $ (object storage, no index)
DELETED (365+ days):
└─ Lifecycle policy auto-deletes (compliance permitting)
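The tier boundaries above can be expressed as a small lookup, which is handy when estimating storage cost or testing a lifecycle policy. This is a sketch under one assumption the text leaves open: each boundary day belongs to the next tier, so a log that is exactly 7 days old counts as warm. The names `TIERS` and `tier_for` are illustrative.

```python
from datetime import datetime, timedelta, timezone

# (exclusive upper bound in days, tier name), taken from the tiers above.
TIERS = [(7, "hot"), (30, "warm"), (365, "cold")]


def tier_for(log_time: datetime, now: datetime) -> str:
    """Return the retention tier for a log written at `log_time`."""
    age_days = (now - log_time) / timedelta(days=1)
    for upper, name in TIERS:
        if age_days < upper:
            return name
    return "deleted"  # past 365 days: removed by lifecycle policy


if __name__ == "__main__":
    now = datetime.now(timezone.utc)
    print(tier_for(now - timedelta(days=12), now))
```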
`console.log("User " + id + " did thing")` is unsearchable. Use structured JSON with consistent field names.
[ ] All services emit JSON structured logs
[ ] Consistent field schema across services (timestamp, level, service, trace_id)
[ ] Log collection agents deployed as DaemonSet (Vector or Fluent Bit)
[ ] PII redaction applied in pipeline before storage
[ ] Debug logs sampled (not all collected in production)
[ ] Retention tiers defined: hot, warm, cold with lifecycle policies
[ ] Trace IDs propagated into log context
[ ] Log-based alerts configured for error rate spikes
[ ] Grafana Explore or Kibana connected for log search
[ ] Storage costs monitored and budget-capped
[ ] Log pipeline has backpressure handling (no data loss under load)
[ ] Compliance requirements met (GDPR right-to-erasure for logs with PII)
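The schema item in the checklist can be spot-checked automatically, for example in CI against sample log lines from each service. The following sketch assumes the four required fields named in the checklist and an allowed level set of debug, info, warn, and error; `schema_problems` and both constants are hypothetical names.

```python
import json

REQUIRED_FIELDS = ("timestamp", "level", "service", "trace_id")
ALLOWED_LEVELS = ("debug", "info", "warn", "error")  # assumed level vocabulary


def schema_problems(line: str) -> list[str]:
    """Return a list of schema violations for one log line (empty if clean)."""
    try:
        doc = json.loads(line)
    except ValueError:
        return ["not valid JSON"]
    if not isinstance(doc, dict):
        return ["not a JSON object"]
    problems = [f"missing field: {name}" for name in REQUIRED_FIELDS if name not in doc]
    if "level" in doc and doc["level"] not in ALLOWED_LEVELS:
        problems.append(f"unknown level: {doc['level']}")
    return problems
```

An unstructured line such as `User 42 did thing` fails immediately, which makes the check a simple gate for the first two checklist items.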