.claude/skills/eks-observability/SKILL.md
EKS observability with metrics, logging, and tracing. Use when setting up monitoring, configuring logging pipelines, implementing distributed tracing, building production dashboards, troubleshooting EKS issues, optimizing observability costs, or establishing SLOs.
Complete observability solution for Amazon EKS using AWS-native managed services and open-source tools. This skill implements the three-pillar approach (metrics, logs, traces) with 2025 best practices including ADOT, Amazon Managed Prometheus, Fluent Bit, and OpenTelemetry.
Keywords: EKS monitoring, CloudWatch Container Insights, Prometheus, Grafana, ADOT, Fluent Bit, X-Ray, OpenTelemetry, distributed tracing, log aggregation, metrics collection, observability stack
Status: Production-ready with 2025 best practices
Metrics: CloudWatch Container Insights + Amazon Managed Prometheus (AMP)
Logs: Fluent Bit → CloudWatch Logs
Traces: ADOT → AWS X-Ray
Using EKS Add-on (Recommended):
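# Optional: create a customer-managed policy only if you need custom permissions.
# The IRSA step below attaches the AWS managed CloudWatchAgentServerPolicy, so this step can be skipped.
aws iam create-policy \
--policy-name CloudWatchAgentServerPolicy \
--policy-document file://cloudwatch-policy.json
# Create IRSA for CloudWatch (the role name matches the ARN passed to the add-on below)
eksctl create iamserviceaccount \
--name cloudwatch-agent \
--namespace amazon-cloudwatch \
--cluster my-cluster \
--role-name CloudWatchAgentRole \
--attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
--approve \
--override-existing-serviceaccounts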
# Install Container Insights add-on
aws eks create-addon \
--cluster-name my-cluster \
--addon-name amazon-cloudwatch-observability \
--service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/CloudWatchAgentRole
Verify Installation:
# Check add-on status
aws eks describe-addon \
--cluster-name my-cluster \
--addon-name amazon-cloudwatch-observability
# Verify pods running
kubectl get pods -n amazon-cloudwatch
What You Get:
Create AMP Workspace:
# Create workspace
aws amp create-workspace \
--alias my-cluster-metrics \
--region us-west-2
# Get workspace ID
WORKSPACE_ID=$(aws amp list-workspaces \
--alias my-cluster-metrics \
--query 'workspaces[0].workspaceId' \
--output text)
# Create IRSA for AMP ingestion
eksctl create iamserviceaccount \
--name amp-ingest \
--namespace prometheus \
--cluster my-cluster \
--attach-policy-arn arn:aws:iam::aws:policy/AmazonPrometheusRemoteWriteAccess \
--approve
Deploy kube-prometheus-stack:
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
# Install with AMP remote write, reusing the amp-ingest service account created by eksctl above.
# The IAM role created by eksctl trusts only that service account, so the chart must not create its own.
# The --set arguments are quoted so the shell does not try to expand [0].
helm install kube-prometheus-stack prometheus-community/kube-prometheus-stack \
--namespace prometheus \
--create-namespace \
--set "prometheus.prometheusSpec.remoteWrite[0].url=https://aps-workspaces.us-west-2.amazonaws.com/workspaces/${WORKSPACE_ID}/api/v1/remote_write" \
--set "prometheus.prometheusSpec.remoteWrite[0].sigv4.region=us-west-2" \
--set prometheus.serviceAccount.create=false \
--set prometheus.serviceAccount.name=amp-ingest
What You Get:
Create IRSA for Fluent Bit:
eksctl create iamserviceaccount \
--name fluent-bit \
--namespace logging \
--cluster my-cluster \
--attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
--approve
Deploy Fluent Bit:
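# The cloudWatch.* values below belong to the aws-for-fluent-bit chart in the eks-charts repository,
# not to the upstream fluent/fluent-bit chart. Check the value names against the chart version you install.
helm repo add eks https://aws.github.io/eks-charts
# Reuse the fluent-bit service account created by eksctl above so the IAM role trust policy matches
helm install fluent-bit eks/aws-for-fluent-bit \
--namespace logging \
--create-namespace \
--set serviceAccount.create=false \
--set serviceAccount.name=fluent-bit \
--set cloudWatch.enabled=true \
--set cloudWatch.region=us-west-2 \
--set cloudWatch.logGroupName=/aws/eks/my-cluster/logs \
--set cloudWatch.autoCreateGroup=true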
What You Get:
Install ADOT Operator:
# Create IRSA for ADOT (the role name matches the ARN passed to the add-on below)
eksctl create iamserviceaccount \
--name adot-collector \
--namespace adot \
--cluster my-cluster \
--role-name ADOTCollectorRole \
--attach-policy-arn arn:aws:iam::aws:policy/AWSXRayDaemonWriteAccess \
--attach-policy-arn arn:aws:iam::aws:policy/CloudWatchAgentServerPolicy \
--approve
# Install ADOT add-on
aws eks create-addon \
--cluster-name my-cluster \
--addon-name adot \
--service-account-role-arn arn:aws:iam::ACCOUNT_ID:role/ADOTCollectorRole
Deploy ADOT Collector:
# adot-collector.yaml
apiVersion: opentelemetry.io/v1alpha1
kind: OpenTelemetryCollector
metadata:
  name: adot-collector
  namespace: adot
spec:
  mode: deployment
  serviceAccount: adot-collector
  config: |
    receivers:
      otlp:
        protocols:
          grpc:
            endpoint: 0.0.0.0:4317
          http:
            endpoint: 0.0.0.0:4318
    processors:
      batch:
        timeout: 30s
        send_batch_size: 50
      memory_limiter:
        check_interval: 1s
        limit_mib: 512
    exporters:
      awsxray:
        region: us-west-2
      awsemf:
        region: us-west-2
        namespace: EKS/Observability
    service:
      pipelines:
        traces:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [awsxray]
        metrics:
          receivers: [otlp]
          processors: [memory_limiter, batch]
          exporters: [awsemf]
kubectl apply -f adot-collector.yaml
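Applications reach this collector over OTLP on the two ports configured above. The sketch below builds the environment variables an instrumented pod would typically need. It assumes the OpenTelemetry Operator's usual convention of exposing a collector resource named `X` through a Service named `X-collector`; verify the actual Service name with `kubectl get svc -n adot`. The helper `otlp_env` and the service name `checkout` are hypothetical.

```python
def otlp_env(service_name: str,
             collector_name: str = "adot-collector",
             namespace: str = "adot",
             protocol: str = "grpc") -> dict[str, str]:
    """Build standard OpenTelemetry SDK environment variables for a workload.

    Assumption: the operator exposes the collector as a Service called
    "<collector_name>-collector" in the collector's namespace.
    """
    # Ports match the otlp receiver in adot-collector.yaml.
    ports = {"grpc": 4317, "http/protobuf": 4318}
    if protocol not in ports:
        raise ValueError(f"unsupported OTLP protocol: {protocol!r}")
    host = f"{collector_name}-collector.{namespace}.svc.cluster.local"
    return {
        "OTEL_SERVICE_NAME": service_name,
        "OTEL_EXPORTER_OTLP_PROTOCOL": protocol,
        "OTEL_EXPORTER_OTLP_ENDPOINT": f"http://{host}:{ports[protocol]}",
    }


if __name__ == "__main__":
    # "checkout" is a hypothetical application name.
    for key, value in otlp_env("checkout").items():
        print(f"{key}={value}")
```

The printed pairs can be copied into the `env` section of a Deployment; the SDK reads them at startup without code changes.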
What You Get:
Create AMG Workspace:
# Create workspace (via AWS Console recommended)
# Or use AWS CLI:
aws grafana create-workspace \
--workspace-name my-cluster-grafana \
--account-access-type CURRENT_ACCOUNT \
--authentication-providers AWS_SSO \
--permission-type SERVICE_MANAGED
Add Data Sources:
Import Dashboards:
# EKS Container Insights Dashboard
Dashboard ID: 16028
# Node Exporter Full Dashboard
Dashboard ID: 1860
# Kubernetes Cluster Monitoring
Dashboard ID: 15760
┌─────────────────────────────────────────────────────────────┐
│                         EKS Cluster                         │
│                                                             │
│  ┌──────────────┐   ┌──────────────┐   ┌──────────────┐     │
│  │ Application  │   │ Application  │   │ Application  │     │
│  │  + OTel SDK  │   │  + OTel SDK  │   │  + OTel SDK  │     │
│  └──────┬───────┘   └──────┬───────┘   └──────┬───────┘     │
│         │                  │                  │             │
│         └──────────────────┴──────────────────┘             │
│                            │                                │
│                   ┌────────▼────────┐                       │
│                   │ ADOT Collector  │                       │
│                   │     (OTel)      │                       │
│                   └────────┬────────┘                       │
│                            │                                │
│         ┌──────────────────┼──────────────────┐             │
│         │                  │                  │             │
│    ┌────▼─────┐       ┌────▼─────┐       ┌────▼─────┐       │
│    │Prometheus│       │Fluent Bit│       │Container │       │
│    │ (local)  │       │DaemonSet │       │ Insights │       │
│    └────┬─────┘       └────┬─────┘       └────┬─────┘       │
└─────────┼──────────────────┼──────────────────┼─────────────┘
          │                  │                  │
          │                  │                  │
    ┌─────▼─────┐       ┌────▼─────┐       ┌────▼─────┐
    │    AMP    │       │CloudWatch│       │  X-Ray   │
    │ (Managed  │       │   Logs   │       │          │
    │Prometheus)│       └────┬─────┘       └────┬─────┘
    └─────┬─────┘            │                  │
          │                  │                  │
          └──────────────────┴──────────────────┘
                             │
                    ┌────────▼────────┐
                    │ Amazon Managed  │
                    │     Grafana     │
                    └─────────────────┘
For comprehensive guides on each observability component:
Metrics Collection: references/metrics.md
Centralized Logging: references/logging.md
Distributed Tracing: references/tracing.md
Typical Monthly Costs:
Metrics → Logs:
# Find pods with high error rates
rate(http_requests_total{status=~"5.."}[5m]) > 0.1
# Then search CloudWatch Logs for those pod names
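The second step can be scripted by turning the pod names returned by the PromQL query into a CloudWatch Logs Insights query. This is a minimal sketch, not a definitive implementation: it assumes log records carry a `kubernetes.pod_name` field (as added by Fluent Bit's Kubernetes metadata filter), and the function name `pod_log_query` and the pod names are hypothetical.

```python
import json


def pod_log_query(pod_names: list[str], limit: int = 100) -> str:
    """Build a Logs Insights query that returns recent lines for the given pods.

    Assumption: each log record has a `kubernetes.pod_name` field.
    """
    if not pod_names:
        raise ValueError("at least one pod name is required")
    # json.dumps yields a double-quoted array literal, e.g. ["a", "b"],
    # which is the form the Logs Insights `in` operator accepts.
    pods = json.dumps(sorted(set(pod_names)))
    return (
        "fields @timestamp, kubernetes.pod_name, @message\n"
        f"| filter kubernetes.pod_name in {pods}\n"
        "| sort @timestamp desc\n"
        f"| limit {limit}"
    )


if __name__ == "__main__":
    # Hypothetical pod names taken from the PromQL result.
    print(pod_log_query(["api-7d9f-abcde", "api-7d9f-fghij"]))
```

Paste the resulting text into the Logs Insights console with the cluster's log group selected.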
Logs → Traces:
// Include trace_id in structured logs
{
"timestamp": "2025-01-27T10:30:00Z",
"level": "error",
"message": "Database connection failed",
"trace_id": "1-67a2f3b1-123456789abcdef012345678",
"span_id": "abcdef0123456789"
}
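An X-Ray trace ID has the form `1-<8 hex digits>-<24 hex digits>`, while OpenTelemetry SDKs expose trace IDs as 32 hex digits. The sketch below shows one way to convert between the two when writing structured logs, so that a log line can be looked up in X-Ray. The function names `to_xray_trace_id` and `log_line` are hypothetical, and the sketch assumes the trace ID was generated in an X-Ray-compatible way (first 8 hex digits are the epoch timestamp).

```python
import json
import re

# An OpenTelemetry trace ID rendered as text: exactly 32 lowercase hex digits.
_OTEL_TRACE_ID = re.compile(r"[0-9a-f]{32}")


def to_xray_trace_id(otel_trace_id: str) -> str:
    """Reformat a 32-hex-digit OTel trace ID as 1-<8 hex>-<24 hex>."""
    trace_id = otel_trace_id.lower()
    if not _OTEL_TRACE_ID.fullmatch(trace_id):
        raise ValueError(f"not a 32-hex-digit trace ID: {otel_trace_id!r}")
    return f"1-{trace_id[:8]}-{trace_id[8:]}"


def log_line(level: str, message: str, otel_trace_id: str, span_id: str) -> str:
    """Serialize one structured log record carrying the trace context."""
    return json.dumps({
        "level": level,
        "message": message,
        "trace_id": to_xray_trace_id(otel_trace_id),
        "span_id": span_id,
    })


if __name__ == "__main__":
    print(log_line("error", "Database connection failed",
                   "67a2f3b1123456789abcdef012345678", "abcdef0123456789"))
```

Searching X-Ray for the `trace_id` value in a log record then opens the matching trace directly.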
Traces → Metrics:
Unified view combining:
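# ServiceLens is enabled automatically once Container Insights and X-Ray are both active.
# It is a CloudWatch console view with no dedicated AWS CLI command; the X-Ray service
# graph behind it can be queried directly (a short window keeps the response small):
aws xray get-service-graph \
--start-time 2025-01-27T10:00:00Z \
--end-time 2025-01-27T11:00:00Z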
| Issue | Cause | Fix |
|-------|-------|-----|
| No metrics in AMP | Missing IRSA or remote write config | Check Prometheus pod logs, verify IAM role |
| Logs not appearing | Fluent Bit not running or wrong IAM | kubectl logs -n logging fluent-bit-xxx |
| Traces not in X-Ray | ADOT not deployed or app not instrumented | Verify ADOT pods, check OTel SDK setup |
| High costs | Too much data ingestion | Enable sampling, reduce log verbosity |
| Missing pod metrics | kube-state-metrics not running | Check kube-prometheus-stack installation |
| Grafana can't connect | Data source IAM permissions | Add CloudWatch/AMP read policies to AMG role |
Define SLIs (Service Level Indicators):
# Availability SLI
- metric: probe_success
  target: 99.9%
  window: 30d
# Latency SLI
- metric: http_request_duration_seconds
  percentile: p99
  target: < 500ms
  window: 30d
# Error Rate SLI
- metric: http_requests_total{status=~"5.."}
  target: < 0.1%
  window: 30d
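An error-rate SLI is a ratio of failed to total requests over the window, compared against the target. A minimal sketch of that check, assuming the two counts have already been read from `http_requests_total`; the function names and sample counts are hypothetical.

```python
def error_rate_sli(total_requests: int, failed_requests: int) -> float:
    """Return the fraction of requests that failed (5xx) in the window."""
    if total_requests <= 0:
        raise ValueError("total_requests must be positive")
    if not 0 <= failed_requests <= total_requests:
        raise ValueError("failed_requests must be between 0 and total_requests")
    return failed_requests / total_requests


def meets_target(error_rate: float, max_error_rate: float = 0.001) -> bool:
    """Check the SLI against a '< 0.1%' style target (0.001 as a fraction)."""
    return error_rate < max_error_rate


if __name__ == "__main__":
    # Hypothetical 30-day counts: 1,000,000 requests, 500 of them 5xx.
    rate = error_rate_sli(1_000_000, 500)
    print(f"error rate {rate:.4%}, within target: {meets_target(rate)}")
```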
Calculate Error Budget:
Error Budget = 100% - SLO Target
Example: 99.9% SLO = 0.1% error budget
= 43.2 minutes downtime/month
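The same arithmetic as a small helper, useful for comparing SLO targets. This is a sketch assuming a 30-day window unless stated otherwise; the function name `error_budget_minutes` is hypothetical.

```python
def error_budget_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Minutes of full downtime allowed by an availability SLO over the window."""
    if not 0 < slo_percent < 100:
        raise ValueError("slo_percent must be between 0 and 100 (exclusive)")
    budget_fraction = (100.0 - slo_percent) / 100.0  # e.g. 99.9 -> 0.001
    return budget_fraction * window_days * 24 * 60


if __name__ == "__main__":
    for slo in (99.0, 99.9, 99.99):
        print(f"{slo}% SLO -> {error_budget_minutes(slo):.2f} minutes per 30 days")
```

For a 99.9% target this reproduces the 43.2 minutes above; each additional nine divides the budget by ten.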
Burn Rate Alerts:
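# Burn rate = (fraction of budget consumed) × (SLO window / alert window).
# Alert when the error ratio exceeds burn rate × error budget (0.001 for a 99.9% SLO).
# Fast burn (5% of the 30-day budget in 1 hour): burn rate = 0.05 × 720h / 1h = 36
(1 - slo:availability:ratio_rate_1h) > (36 * 0.001)
# Slow burn (10% of the 30-day budget in 6 hours): burn rate = 0.10 × 720h / 6h = 12
(1 - slo:availability:ratio_rate_6h) > (12 * 0.001)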
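Burn rate is conventionally the multiple of the "exactly on budget" error rate at which the budget is being spent, so an alert threshold on the error ratio is the burn rate times the error budget. The sketch below derives both from the budget fraction and alert window, assuming a 30-day SLO window; the function names are hypothetical.

```python
def burn_rate(budget_fraction: float, alert_window_hours: float,
              slo_window_days: int = 30) -> float:
    """Burn rate that consumes `budget_fraction` of the budget in the alert window."""
    if alert_window_hours <= 0:
        raise ValueError("alert_window_hours must be positive")
    return budget_fraction * (slo_window_days * 24) / alert_window_hours


def error_ratio_threshold(slo_target: float, budget_fraction: float,
                          alert_window_hours: float,
                          slo_window_days: int = 30) -> float:
    """Error ratio above which the alert should fire (slo_target as a fraction)."""
    error_budget = 1.0 - slo_target  # e.g. 0.999 -> 0.001
    return burn_rate(budget_fraction, alert_window_hours, slo_window_days) * error_budget


if __name__ == "__main__":
    # 99.9% SLO: 5% of budget in 1 hour, and 10% of budget in 6 hours.
    for fraction, hours in ((0.05, 1), (0.10, 6)):
        rate = burn_rate(fraction, hours)
        threshold = error_ratio_threshold(0.999, fraction, hours)
        print(f"{fraction:.0%} in {hours}h: burn rate {rate:.1f}, "
              f"error ratio > {threshold:.4f}")
```

Pairing a fast and a slow window this way catches both sudden outages and slow leaks without paging on brief blips.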
Stack: CloudWatch Container Insights, AMP, Fluent Bit, ADOT, AMG, X-Ray
Standards: OpenTelemetry, IRSA, EKS Add-ons
Last Updated: January 2025 (2025 Best Practices)