.claude/skills/observability-engineer/SKILL.md
---
name: observability-engineer
description: Design and implement observability stack with metrics, logs, and traces. Use for Prometheus, Grafana, Loki, Tempo, OpenTelemetry, alerting, and SLO/SLI design. Keywords: observability, monitoring, tracing, Prometheus, Grafana, Loki, Tempo, OpenTelemetry, OTEL, alerting, SLO, SLI.
---

# Observability Engineer
Expert in designing and implementing comprehensive observability solutions for Kubernetes environments. Covers the three pillars: metrics, logs, and traces.
```
┌─────────────────────────────────────────────────────────┐
│                       Grafana UI                        │
│   (Unified visualization for all observability data)    │
└─────────────────────────────────────────────────────────┘
       │                     │                     │
       ▼                     ▼                     ▼
┌─────────────┐       ┌─────────────┐       ┌─────────────┐
│ Prometheus  │       │    Loki     │       │    Tempo    │
│  (Metrics)  │       │   (Logs)    │       │  (Traces)   │
└─────────────┘       └─────────────┘       └─────────────┘
       │                     │                     │
       └─────────────────────┼─────────────────────┘
                             │
                  ┌─────────────────────┐
                  │    OpenTelemetry    │
                  │      Collector      │
                  │ (Unified ingestion) │
                  └─────────────────────┘
                             │
                  ┌─────────────────────┐
                  │    Applications     │
                  │     (OTEL SDK)      │
                  └─────────────────────┘
```
| Component | Tool | Purpose |
|-----------|------|---------|
| Metrics | Prometheus + kube-prometheus-stack | Time-series metrics |
| Logs | Loki + Promtail | Log aggregation |
| Traces | Tempo | Distributed tracing |
| Collection | OpenTelemetry Collector | Unified data ingestion |
| Visualization | Grafana | Dashboards and exploration |
| Alerting | Alertmanager | Alert routing and silencing |
The easiest way to deploy a complete metrics stack is the kube-prometheus-stack Helm chart:

```yaml
# values.yaml for kube-prometheus-stack
prometheus:
  prometheusSpec:
    retention: 15d
    resources:
      requests:
        memory: 512Mi
        cpu: 250m
      limits:
        memory: 2Gi
    # Enable exemplars for trace correlation
    enableFeatures:
      - exemplar-storage
    # ServiceMonitor selector
    serviceMonitorSelector: {}
    serviceMonitorNamespaceSelector: {}

grafana:
  defaultDashboardsEnabled: true
  # Helm does not expand ${...} placeholders; substitute them before installing
  adminPassword: ${GRAFANA_ADMIN_PASSWORD}
  datasources:
    datasources.yaml:
      apiVersion: 1
      datasources:
        - name: Prometheus
          type: prometheus
          url: http://prometheus-operated:9090
          isDefault: true
        - name: Loki
          type: loki
          url: http://loki:3100
        - name: Tempo
          type: tempo
          url: http://tempo:3100

alertmanager:
  config:
    global:
      resolve_timeout: 5m
    route:
      receiver: 'slack'
      group_by: ['alertname', 'namespace']
      group_wait: 30s
      group_interval: 5m
      repeat_interval: 4h
    receivers:
      - name: 'slack'
        slack_configs:
          # Placeholder; must also be substituted before installing
          - api_url: '${SLACK_WEBHOOK_URL}'
            channel: '#alerts'
```
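Helm reads `values.yaml` literally, so the `${GRAFANA_ADMIN_PASSWORD}` and `${SLACK_WEBHOOK_URL}` placeholders have to be filled in before the file reaches `helm install`. One possible approach is a small pre-render step. The sketch below uses Python's standard `string.Template`; the function name `render_values` is illustrative and not part of Helm or the chart.

```python
import os
from string import Template
from typing import Mapping, Optional


def render_values(text: str, env: Optional[Mapping[str, str]] = None) -> str:
    """Replace ${VAR} placeholders in a values file with concrete values.

    Raises KeyError when a placeholder has no value, so a missing secret
    fails loudly instead of being installed as a literal "${...}" string.
    """
    source = os.environ if env is None else env
    return Template(text).substitute(source)


# Example with an explicit mapping instead of the real environment.
sample = "adminPassword: ${GRAFANA_ADMIN_PASSWORD}\n"
print(render_values(sample, {"GRAFANA_ADMIN_PASSWORD": "s3cret"}), end="")
```

`string.Template` treats every `$` as special, so a values file that contains other `$` sequences would need them escaped as `$$` before rendering.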
Install:

```bash
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install kube-prometheus prometheus-community/kube-prometheus-stack \
  -n monitoring --create-namespace \
  -f values.yaml
```
Enable Prometheus scraping for any service:

```yaml
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
  name: my-app
  namespace: monitoring
  labels:
    release: kube-prometheus  # Must match stack's selector
spec:
  namespaceSelector:
    matchNames:
      - my-app-namespace
  selector:
    matchLabels:
      app: my-app
  endpoints:
    - port: metrics
      interval: 30s
      path: /metrics
```
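The `ServiceMonitor` expects the target Service to expose a port named `metrics` that serves the Prometheus text exposition format at `/metrics`. A real service would normally use a Prometheus client library. The standard-library sketch below only illustrates what such an endpoint returns; the counter values and the `render_metrics` helper are made up for illustration.

```python
from http.server import BaseHTTPRequestHandler

# Hypothetical in-memory counters keyed by (method, status).
REQUESTS = {("GET", "200"): 42, ("GET", "500"): 3}


def render_metrics(counters: dict) -> str:
    """Render counters in the Prometheus text exposition format."""
    lines = [
        "# HELP http_requests_total Total HTTP requests.",
        "# TYPE http_requests_total counter",
    ]
    for (method, status), value in sorted(counters.items()):
        lines.append(
            f'http_requests_total{{method="{method}",status="{status}"}} {value}'
        )
    return "\n".join(lines) + "\n"


class MetricsHandler(BaseHTTPRequestHandler):
    """Serves the rendered counters at /metrics and 404s everything else."""

    def do_GET(self):
        if self.path != "/metrics":
            self.send_error(404)
            return
        body = render_metrics(REQUESTS).encode("utf-8")
        self.send_response(200)
        self.send_header("Content-Type", "text/plain; version=0.0.4")
        self.send_header("Content-Length", str(len(body)))
        self.end_headers()
        self.wfile.write(body)
```

To try it locally, pass the handler to `http.server.HTTPServer(("0.0.0.0", 8080), MetricsHandler)` and call `serve_forever()`. Port 8080 is an arbitrary choice; what matters to the `ServiceMonitor` is that the Service names the port `metrics`.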
```yaml
# loki-values.yaml
loki:
  auth_enabled: false
  storage:
    type: filesystem
  compactor:
    retention_enabled: true
    retention_delete_delay: 2h
  limits_config:
    retention_period: 30d

promtail:
  config:
    clients:
      - url: http://loki:3100/loki/api/v1/push
    scrape_configs:
      - job_name: kubernetes-pods
        kubernetes_sd_configs:
          - role: pod
        relabel_configs:
          - source_labels: [__meta_kubernetes_pod_label_app]
            target_label: app
          - source_labels: [__meta_kubernetes_namespace]
            target_label: namespace
```
```yaml
# tempo-values.yaml
tempo:
  storage:
    trace:
      backend: local
      local:
        path: /var/tempo/traces
  receivers:
    otlp:
      protocols:
        grpc:
          endpoint: 0.0.0.0:4317
        http:
          endpoint: 0.0.0.0:4318
  # Enable search
  search_enabled: true
  # Metrics generator for trace-derived metrics
  # (Prometheus must have its remote-write receiver enabled to accept these writes)
  metricsGenerator:
    enabled: true
    remoteWriteUrl: http://prometheus:9090/api/v1/write
```
Central collection point for all telemetry:

```yaml
apiVersion: v1
kind: ConfigMap
metadata:
  name: otel-collector-config
data:
  config.yaml: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318

    processors:
      batch:
        timeout: 1s
        send_batch_size: 1024
      memory_limiter:
        check_interval: 1s
        limit_mib: 512

    exporters:
      prometheus:
        endpoint: "0.0.0.0:8889"
      loki:
        endpoint: http://loki:3100/loki/api/v1/push
      otlp:
        endpoint: tempo:4317
        tls:
          insecure: true

    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [otlp]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [prometheus]
        logs:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [loki]
```
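```go
package tracing

import (
	"context"

	"go.opentelemetry.io/otel"
	"go.opentelemetry.io/otel/exporters/otlp/otlptrace/otlptracegrpc"
	"go.opentelemetry.io/otel/sdk/resource"
	sdktrace "go.opentelemetry.io/otel/sdk/trace"
	// Pin the semconv version that matches your SDK release; v1.21.0 is an example.
	semconv "go.opentelemetry.io/otel/semconv/v1.21.0"
)

// initTracer registers a global TracerProvider that batches spans and
// exports them over OTLP/gRPC to the collector.
func initTracer() (*sdktrace.TracerProvider, error) {
	exporter, err := otlptracegrpc.New(context.Background(),
		otlptracegrpc.WithEndpoint("otel-collector:4317"),
		otlptracegrpc.WithInsecure(), // plaintext gRPC, matching the collector config
	)
	if err != nil {
		return nil, err
	}

	tp := sdktrace.NewTracerProvider(
		sdktrace.WithBatcher(exporter),
		sdktrace.WithResource(resource.NewWithAttributes(
			semconv.SchemaURL,
			semconv.ServiceNameKey.String("my-service"),
		)),
	)
	otel.SetTracerProvider(tp)
	// The caller should invoke tp.Shutdown on exit to flush pending spans.
	return tp, nil
}
```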
Configure via environment (no code changes):

```yaml
env:
  - name: OTEL_EXPORTER_OTLP_ENDPOINT
    value: "http://otel-collector:4317"
  - name: OTEL_SERVICE_NAME
    value: "my-service"
  - name: OTEL_TRACES_SAMPLER
    value: "parentbased_traceidratio"
  - name: OTEL_TRACES_SAMPLER_ARG
    value: "0.1"  # 10% sampling
```
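With `parentbased_traceidratio` and an argument of `0.1`, a span that has a parent reuses the parent's sampling decision, while a root span is kept for roughly 10% of traces, chosen deterministically from the trace ID. The sketch below illustrates that decision logic only. Real SDKs differ in exactly which trace ID bits they compare, and `should_sample` is a hypothetical name, not an OpenTelemetry API.

```python
from typing import Optional


def should_sample(trace_id: int, ratio: float,
                  parent_sampled: Optional[bool] = None) -> bool:
    """Illustrative parent-based, trace-ID-ratio sampling decision."""
    if parent_sampled is not None:
        # Child span: follow the parent's decision so traces stay complete.
        return parent_sampled
    # Root span: map the low 64 bits of the trace ID onto [0, 2**63) and keep
    # the trace when the value falls below ratio * 2**63.
    low_bits = (trace_id & 0xFFFFFFFFFFFFFFFF) >> 1
    return low_bits < int(ratio * (1 << 63))
```

Because the decision depends only on the trace ID and the ratio, the same trace ID always yields the same answer for a given ratio.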
```yaml
# PrometheusRule for RED method SLOs
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
  name: my-app-slos
spec:
  groups:
    - name: my-app.slos
      rules:
        # Error rate SLI (target: 99.9% success)
        - record: my_app:error_rate:5m
          expr: |
            sum(rate(http_requests_total{job="my-app",status=~"5.."}[5m]))
            /
            sum(rate(http_requests_total{job="my-app"}[5m]))

        # Latency SLI (target: p99 < 500ms)
        - record: my_app:latency_p99:5m
          expr: |
            histogram_quantile(0.99,
              sum(rate(http_request_duration_seconds_bucket{job="my-app"}[5m])) by (le)
            )

        # Burn rate alert: fires when errors exceed 14.4x the 0.1% error budget
        # (at that pace, one hour spends about 2% of a 30-day budget)
        - alert: HighErrorBurnRate
          expr: my_app:error_rate:5m > 0.001 * 14.4
          for: 5m
          labels:
            severity: critical
          annotations:
            summary: "High error burn rate for my-app"
```
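The threshold `0.001 * 14.4` in the alert comes from error-budget arithmetic: a 99.9% success target leaves a 0.1% error budget, and a burn rate of 14.4 means errors arrive 14.4 times faster than the budget allows. The sketch below works through that arithmetic. The 30-day SLO window is an assumption, since the rule itself does not state one.

```python
def burn_rate_threshold(slo_target: float, burn_rate: float) -> float:
    """Error ratio at which the budget burns `burn_rate` times faster than allowed."""
    error_budget = 1.0 - slo_target
    return error_budget * burn_rate


def budget_consumed(burn_rate: float, hours: float, window_days: int = 30) -> float:
    """Fraction of the error budget spent after `hours` at a constant burn rate."""
    return burn_rate * hours / (window_days * 24)


threshold = burn_rate_threshold(0.999, 14.4)
print(f"alert threshold: {threshold:.4f}")                      # 0.0144
print(f"budget spent in 1h: {budget_consumed(14.4, 1):.0%}")    # 2%
```

At the alert threshold, 1.44% of requests are failing, and sustaining that for one hour would spend about 2% of a 30-day budget.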
```promql
# Request rate by method and status
sum(rate(http_requests_total{job="my-app"}[5m])) by (method, status)

# Error ratio (5xx responses / all requests)
sum(rate(http_requests_total{job="my-app",status=~"5.."}[5m]))
/
sum(rate(http_requests_total{job="my-app"}[5m]))

# p99 latency
histogram_quantile(0.99,
  sum(rate(http_request_duration_seconds_bucket{job="my-app"}[5m])) by (le)
)

# Memory usage as a fraction of the container limit
container_memory_working_set_bytes{namespace="my-ns", pod=~"my-app.*"}
/ container_spec_memory_limit_bytes{namespace="my-ns", pod=~"my-app.*"}

# CPU throttling (seconds throttled per second)
rate(container_cpu_cfs_throttled_seconds_total{namespace="my-ns"}[5m])
```
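The p99 latency query depends on how `histogram_quantile` estimates a quantile from cumulative bucket counts: it finds the bucket containing the target rank and interpolates linearly inside it. The sketch below is a simplified model of that estimation for classic histograms, ignoring edge cases such as NaN inputs and native histograms. The bucket values are invented for illustration, and `estimate_quantile` is a hypothetical name.

```python
import math


def estimate_quantile(q: float, buckets: list) -> float:
    """Estimate quantile q (0 < q <= 1) from (upper_bound, cumulative_count) pairs.

    `buckets` must be sorted by upper bound and end with the +Inf bucket.
    """
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    lower_bound, lower_count = 0.0, 0.0
    for upper_bound, count in buckets:
        if count >= rank:
            if math.isinf(upper_bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return lower_bound
            # Linear interpolation inside the matching bucket.
            fraction = (rank - lower_count) / (count - lower_count)
            return lower_bound + (upper_bound - lower_bound) * fraction
        lower_bound, lower_count = upper_bound, count
    return lower_bound


# Invented example: 100 requests, bucket bounds in seconds.
example = [(0.1, 50), (0.25, 90), (0.5, 98), (1.0, 100), (math.inf, 100)]
print(estimate_quantile(0.99, example))  # approximately 0.75
```

The estimate can only be as precise as the bucket layout: here the 99th request falls somewhere between 0.5s and 1.0s, and interpolation places it at the midpoint. Bucket boundaries near the SLO target (500ms in this document) keep that error small.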
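```logql
# Error-level JSON logs for my-app
{namespace="my-ns", app="my-app"} |= "error" | json | level="error"

# All log lines that belong to one trace
{namespace="my-ns"} | json | trace_id="abc123"

# Per-second rate of lines containing "error", over 5 minutes
sum(rate({namespace="my-ns"} |= "error" [5m]))
```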