devops/observability/opentelemetry/SKILL.md
Instrument applications and infrastructure with OpenTelemetry for unified traces, metrics, and logs. Use when implementing distributed tracing, service-level troubleshooting, or vendor-neutral observability.
npx skillsauth add bagelhole/devops-security-agent-skills opentelemetryInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Adopt vendor-neutral telemetry with consistent instrumentation across services.
# otel-collector-config.yaml
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
# Scrape Prometheus endpoints
prometheus:
config:
scrape_configs:
- job_name: "kubernetes-pods"
kubernetes_sd_configs:
- role: pod
relabel_configs:
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
action: keep
regex: "true"
- source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_port]
action: replace
target_label: __address__
regex: (.+)
replacement: $$1
# Host metrics for infrastructure monitoring
hostmetrics:
collection_interval: 30s
scrapers:
cpu: {}
memory: {}
disk: {}
network: {}
load: {}
processors:
batch:
send_batch_size: 1024
timeout: 5s
memory_limiter:
check_interval: 1s
limit_mib: 512
spike_limit_mib: 128
attributes:
actions:
- key: deployment.environment
value: production
action: upsert
# Drop high-cardinality attributes to control cost
filter/drop-debug:
traces:
span:
- 'attributes["http.request.header.x-debug"] == "true"'
# Reduce cardinality on URL paths
transform/normalize-routes:
trace_statements:
- context: span
statements:
- replace_pattern(attributes["url.path"], "/users/[0-9]+", "/users/{id}")
- replace_pattern(attributes["url.path"], "/orders/[0-9]+", "/orders/{id}")
# Resource detection for cloud environments
resourcedetection:
detectors: [env, system, gcp, aws, azure]
timeout: 5s
exporters:
# Send traces to Tempo/Jaeger
otlp/traces:
endpoint: tempo:4317
tls:
insecure: true
# Send metrics to Prometheus via remote write
prometheusremotewrite:
endpoint: http://mimir:9009/api/v1/push
tls:
insecure: true
# Send logs to Loki
otlp/logs:
endpoint: loki:4317
tls:
insecure: true
# Debug exporter for development
debug:
verbosity: basic
service:
telemetry:
logs:
level: info
metrics:
address: 0.0.0.0:8888
pipelines:
traces:
receivers: [otlp]
processors: [memory_limiter, resourcedetection, transform/normalize-routes, batch, attributes]
exporters: [otlp/traces]
metrics:
receivers: [otlp, prometheus, hostmetrics]
processors: [memory_limiter, resourcedetection, batch, attributes]
exporters: [prometheusremotewrite]
logs:
receivers: [otlp]
processors: [memory_limiter, resourcedetection, batch, attributes]
exporters: [otlp/logs]
# otel-collector-deployment.yaml
apiVersion: apps/v1
kind: Deployment
metadata:
name: otel-collector
namespace: observability
spec:
replicas: 2
selector:
matchLabels:
app: otel-collector
template:
metadata:
labels:
app: otel-collector
spec:
containers:
- name: collector
image: otel/opentelemetry-collector-contrib:0.98.0
args: ["--config=/etc/otel/config.yaml"]
ports:
- containerPort: 4317
name: otlp-grpc
- containerPort: 4318
name: otlp-http
- containerPort: 8888
name: metrics
resources:
requests:
cpu: 200m
memory: 256Mi
limits:
cpu: "1"
memory: 512Mi
volumeMounts:
- name: config
mountPath: /etc/otel
livenessProbe:
httpGet:
path: /
port: 13133
readinessProbe:
httpGet:
path: /
port: 13133
volumes:
- name: config
configMap:
name: otel-collector-config
---
apiVersion: v1
kind: Service
metadata:
name: otel-collector
namespace: observability
spec:
selector:
app: otel-collector
ports:
- name: otlp-grpc
port: 4317
targetPort: 4317
- name: otlp-http
port: 4318
targetPort: 4318
- name: metrics
port: 8888
targetPort: 8888
# tracing_setup.py
"""Initialize OpenTelemetry tracing and metrics for a Python service."""
from opentelemetry import trace, metrics
from opentelemetry.sdk.trace import TracerProvider
from opentelemetry.sdk.trace.export import BatchSpanProcessor
from opentelemetry.sdk.metrics import MeterProvider
from opentelemetry.sdk.metrics.export import PeriodicExportingMetricReader
from opentelemetry.sdk.resources import Resource
from opentelemetry.exporter.otlp.proto.grpc.trace_exporter import OTLPSpanExporter
from opentelemetry.exporter.otlp.proto.grpc.metric_exporter import OTLPMetricExporter
from opentelemetry.instrumentation.requests import RequestsInstrumentor
from opentelemetry.instrumentation.flask import FlaskInstrumentor
from opentelemetry.instrumentation.sqlalchemy import SQLAlchemyInstrumentor
import os
def init_telemetry(service_name: str, service_version: str):
"""Initialize OTel SDK with traces and metrics."""
resource = Resource.create({
"service.name": service_name,
"service.version": service_version,
"deployment.environment": os.getenv("DEPLOY_ENV", "development"),
})
# Traces
trace_exporter = OTLPSpanExporter(
endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector:4317"),
insecure=True,
)
tracer_provider = TracerProvider(resource=resource)
tracer_provider.add_span_processor(BatchSpanProcessor(trace_exporter))
trace.set_tracer_provider(tracer_provider)
# Metrics
metric_exporter = OTLPMetricExporter(
endpoint=os.getenv("OTEL_EXPORTER_OTLP_ENDPOINT", "http://otel-collector:4317"),
insecure=True,
)
metric_reader = PeriodicExportingMetricReader(metric_exporter, export_interval_millis=15000)
meter_provider = MeterProvider(resource=resource, metric_readers=[metric_reader])
metrics.set_meter_provider(meter_provider)
# Auto-instrument common libraries
RequestsInstrumentor().instrument()
SQLAlchemyInstrumentor().instrument()
return trace.get_tracer(service_name), metrics.get_meter(service_name)
# Usage example
tracer, meter = init_telemetry("order-service", "1.2.0")
# Custom span
with tracer.start_as_current_span("process_order") as span:
span.set_attribute("order.id", order_id)
span.set_attribute("order.total", total)
# ... business logic ...
# Custom metric
request_counter = meter.create_counter(
"app.requests",
description="Total application requests",
)
request_counter.add(1, {"route": "/api/orders", "method": "POST"})
// tracing.js
// Initialize OpenTelemetry for a Node.js service.
// Load this file BEFORE any other imports: node -r ./tracing.js app.js
const { NodeSDK } = require("@opentelemetry/sdk-node");
const { OTLPTraceExporter } = require("@opentelemetry/exporter-trace-otlp-grpc");
const { OTLPMetricExporter } = require("@opentelemetry/exporter-metrics-otlp-grpc");
const { PeriodicExportingMetricReader } = require("@opentelemetry/sdk-metrics");
const { getNodeAutoInstrumentations } = require("@opentelemetry/auto-instrumentations-node");
const { Resource } = require("@opentelemetry/resources");
const { ATTR_SERVICE_NAME, ATTR_SERVICE_VERSION } = require("@opentelemetry/semantic-conventions");
const resource = new Resource({
[ATTR_SERVICE_NAME]: process.env.SERVICE_NAME || "node-service",
[ATTR_SERVICE_VERSION]: process.env.SERVICE_VERSION || "1.0.0",
"deployment.environment": process.env.DEPLOY_ENV || "development",
});
const sdk = new NodeSDK({
resource,
traceExporter: new OTLPTraceExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://otel-collector:4317",
}),
metricReader: new PeriodicExportingMetricReader({
exporter: new OTLPMetricExporter({
url: process.env.OTEL_EXPORTER_OTLP_ENDPOINT || "http://otel-collector:4317",
}),
exportIntervalMillis: 15000,
}),
instrumentations: [
getNodeAutoInstrumentations({
"@opentelemetry/instrumentation-http": {
ignoreIncomingPaths: ["/health", "/ready"],
},
"@opentelemetry/instrumentation-express": { enabled: true },
"@opentelemetry/instrumentation-pg": { enabled: true },
"@opentelemetry/instrumentation-redis": { enabled: true },
}),
],
});
sdk.start();
process.on("SIGTERM", () => sdk.shutdown());
# otel-auto-instrumentation.yaml
# Install the OTel Operator first:
# helm install opentelemetry-operator open-telemetry/opentelemetry-operator \
# --namespace observability --create-namespace
# Define instrumentation for Python services
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
name: python-instrumentation
namespace: default
spec:
exporter:
endpoint: http://otel-collector.observability:4317
propagators:
- tracecontext
- baggage
sampler:
type: parentbased_traceidratio
argument: "0.25"
python:
image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-python:0.44b0
env:
- name: OTEL_PYTHON_LOG_CORRELATION
value: "true"
---
# Define instrumentation for Node.js services
apiVersion: opentelemetry.io/v1alpha1
kind: Instrumentation
metadata:
name: nodejs-instrumentation
namespace: default
spec:
exporter:
endpoint: http://otel-collector.observability:4317
propagators:
- tracecontext
- baggage
sampler:
type: parentbased_traceidratio
argument: "0.25"
nodejs:
image: ghcr.io/open-telemetry/opentelemetry-operator/autoinstrumentation-nodejs:0.49.1
To instrument a pod, add the annotation:
# For Python:
metadata:
annotations:
instrumentation.opentelemetry.io/inject-python: "true"
# For Node.js:
metadata:
annotations:
instrumentation.opentelemetry.io/inject-nodejs: "true"
# Tail-based sampling config (in collector)
processors:
tail_sampling:
decision_wait: 10s
num_traces: 100000
policies:
# Always keep error traces
- name: errors
type: status_code
status_code:
status_codes: [ERROR]
# Always keep slow traces (> 2s)
- name: slow-traces
type: latency
latency:
threshold_ms: 2000
# Sample 10% of successful traces
- name: normal-traffic
type: probabilistic
probabilistic:
sampling_percentage: 10
# Always keep traces with specific attributes
- name: important-users
type: string_attribute
string_attribute:
key: user.tier
values: [enterprise, premium]
# Rate limit per service to prevent one service from dominating
- name: rate-limit
type: rate_limiting
rate_limiting:
spans_per_second: 500
service.name, service.version, and deployment.environment.| Symptom | Check | Fix |
|---------|-------|-----|
| No traces arriving at backend | Collector logs for export errors | Verify endpoint URL and network policy |
| Missing spans in a trace | Propagation headers stripped by proxy | Configure proxy to pass traceparent header |
| High memory on collector | Too many in-flight traces for tail sampling | Reduce num_traces or increase memory limit |
| Metric cardinality explosion | Unbounded label values (user IDs, URLs) | Add transform processor to normalize values |
| Auto-instrumentation not working | Pod annotation missing or operator not running | Verify operator is healthy and annotation is correct |
| Duplicate metrics | Both SDK and auto-instrumentation active | Use only one instrumentation method per signal |
development
Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
testing
Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.
devops
Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone — collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.
testing
Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.