skills/service-mesh-observability/SKILL.md
Implement comprehensive observability for service meshes including distributed tracing, metrics, and visualization. Use when setting up mesh monitoring, debugging latency issues, or implementing SLOs for service communication.
npx skillsauth add agent-skills-hub/agent-skills-hub service-mesh-observabilityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Complete guide to observability patterns for Istio, Linkerd, and service mesh deployments.
resources/implementation-playbook.md.┌─────────────────────────────────────────────────────┐
│ Observability │
├─────────────────┬─────────────────┬─────────────────┤
│ Metrics │ Traces │ Logs │
│ │ │ │
│ • Request rate │ • Span context │ • Access logs │
│ • Error rate │ • Latency │ • Error details │
│ • Latency P50 │ • Dependencies │ • Debug info │
│ • Saturation │ • Bottlenecks │ • Audit trail │
└─────────────────┴─────────────────┴─────────────────┘
| Signal | Description | Alert Threshold | |--------|-------------|-----------------| | Latency | Request duration P50, P99 | P99 > 500ms | | Traffic | Requests per second | Anomaly detection | | Errors | 5xx error rate | > 1% | | Saturation | Resource utilization | > 80% |
# Install Prometheus
apiVersion: v1
kind: ConfigMap
metadata:
name: prometheus
namespace: istio-system
data:
prometheus.yml: |
global:
scrape_interval: 15s
scrape_configs:
- job_name: 'istio-mesh'
kubernetes_sd_configs:
- role: endpoints
namespaces:
names:
- istio-system
relabel_configs:
- source_labels: [__meta_kubernetes_service_name]
action: keep
regex: istio-telemetry
---
# ServiceMonitor for Prometheus Operator
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: istio-mesh
namespace: istio-system
spec:
selector:
matchLabels:
app: istiod
endpoints:
- port: http-monitoring
interval: 15s
# Request rate by service
sum(rate(istio_requests_total{reporter="destination"}[5m])) by (destination_service_name)
# Error rate (5xx)
sum(rate(istio_requests_total{reporter="destination", response_code=~"5.."}[5m]))
/ sum(rate(istio_requests_total{reporter="destination"}[5m])) * 100
# P99 latency
histogram_quantile(0.99,
sum(rate(istio_request_duration_milliseconds_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
# TCP connections
sum(istio_tcp_connections_opened_total{reporter="destination"}) by (destination_service_name)
# Request size
histogram_quantile(0.99,
sum(rate(istio_request_bytes_bucket{reporter="destination"}[5m]))
by (le, destination_service_name))
# Jaeger installation for Istio
apiVersion: install.istio.io/v1alpha1
kind: IstioOperator
spec:
meshConfig:
enableTracing: true
defaultConfig:
tracing:
sampling: 100.0 # 100% in dev, lower in prod
zipkin:
address: jaeger-collector.istio-system:9411
---
# Jaeger deployment
apiVersion: apps/v1
kind: Deployment
metadata:
name: jaeger
namespace: istio-system
spec:
selector:
matchLabels:
app: jaeger
template:
metadata:
labels:
app: jaeger
spec:
containers:
- name: jaeger
image: jaegertracing/all-in-one:1.50
ports:
- containerPort: 5775 # UDP
- containerPort: 6831 # Thrift
- containerPort: 6832 # Thrift
- containerPort: 5778 # Config
- containerPort: 16686 # UI
- containerPort: 14268 # HTTP
- containerPort: 14250 # gRPC
- containerPort: 9411 # Zipkin
env:
- name: COLLECTOR_ZIPKIN_HOST_PORT
value: ":9411"
# Install Linkerd viz extension
linkerd viz install | kubectl apply -f -
# Access dashboard
linkerd viz dashboard
# CLI commands for observability
# Top requests
linkerd viz top deploy/my-app
# Per-route metrics
linkerd viz routes deploy/my-app --to deploy/backend
# Live traffic inspection
linkerd viz tap deploy/my-app --to deploy/backend
# Service edges (dependencies)
linkerd viz edges deployment -n my-namespace
{
"dashboard": {
"title": "Service Mesh Overview",
"panels": [
{
"title": "Request Rate",
"type": "graph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (destination_service_name)",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(istio_requests_total{response_code=~\"5..\"}[5m])) / sum(rate(istio_requests_total[5m])) * 100"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"value": 0, "color": "green"},
{"value": 1, "color": "yellow"},
{"value": 5, "color": "red"}
]
}
}
}
},
{
"title": "P99 Latency",
"type": "graph",
"targets": [
{
"expr": "histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket{reporter=\"destination\"}[5m])) by (le, destination_service_name))",
"legendFormat": "{{destination_service_name}}"
}
]
},
{
"title": "Service Topology",
"type": "nodeGraph",
"targets": [
{
"expr": "sum(rate(istio_requests_total{reporter=\"destination\"}[5m])) by (source_workload, destination_service_name)"
}
]
}
]
}
}
# Kiali installation
apiVersion: kiali.io/v1alpha1
kind: Kiali
metadata:
name: kiali
namespace: istio-system
spec:
auth:
strategy: anonymous # or openid, token
deployment:
accessible_namespaces:
- "**"
external_services:
prometheus:
url: http://prometheus.istio-system:9090
tracing:
url: http://jaeger-query.istio-system:16686
grafana:
url: http://grafana.istio-system:3000
# OpenTelemetry Collector for mesh
apiVersion: v1
kind: ConfigMap
metadata:
name: otel-collector-config
data:
config.yaml: |
receivers:
otlp:
protocols:
grpc:
endpoint: 0.0.0.0:4317
http:
endpoint: 0.0.0.0:4318
zipkin:
endpoint: 0.0.0.0:9411
processors:
batch:
timeout: 10s
exporters:
jaeger:
endpoint: jaeger-collector:14250
tls:
insecure: true
prometheus:
endpoint: 0.0.0.0:8889
service:
pipelines:
traces:
receivers: [otlp, zipkin]
processors: [batch]
exporters: [jaeger]
metrics:
receivers: [otlp]
processors: [batch]
exporters: [prometheus]
---
# Istio Telemetry v2 with OTel
apiVersion: telemetry.istio.io/v1alpha1
kind: Telemetry
metadata:
name: mesh-default
namespace: istio-system
spec:
tracing:
- providers:
- name: otel
randomSamplingPercentage: 10
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: mesh-alerts
namespace: istio-system
spec:
groups:
- name: mesh.rules
rules:
- alert: HighErrorRate
expr: |
sum(rate(istio_requests_total{response_code=~"5.."}[5m])) by (destination_service_name)
/ sum(rate(istio_requests_total[5m])) by (destination_service_name) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate for {{ $labels.destination_service_name }}"
- alert: HighLatency
expr: |
histogram_quantile(0.99, sum(rate(istio_request_duration_milliseconds_bucket[5m]))
by (le, destination_service_name)) > 1000
for: 5m
labels:
severity: warning
annotations:
summary: "High P99 latency for {{ $labels.destination_service_name }}"
- alert: MeshCertExpiring
expr: |
(certmanager_certificate_expiration_timestamp_seconds - time()) / 86400 < 7
labels:
severity: warning
annotations:
summary: "Mesh certificate expiring in less than 7 days"
tools
Multi-agent autonomous startup system for Claude Code. Triggers on "Loki Mode". Orchestrates 100+ specialized agents across engineering, QA, DevOps, security, data/ML, business operations, marketing, HR, and customer success. Takes PRD to fully deployed, revenue-generating product with zero human intervention. Features Task tool for subagent dispatch, parallel code review with 3 specialized reviewers, severity-based issue triage, distributed task queue with dead letter handling, automatic deployment to cloud providers, A/B testing, customer feedback loops, incident response, circuit breakers, and self-healing. Handles rate limits via distributed state checkpoints and auto-resume with exponential backoff. Requires --dangerously-skip-permissions flag.
tools
Formula WorkPaper runtime and MCP server for AI agents and Node.js services. Use when an agent needs spreadsheet-style formulas, cell edits, recalculation, readback verification, or persisted WorkPaper JSON without driving Excel UI.
data-ai
Project scaffolding templates for new applications. Use when creating new projects from scratch. Contains 12 templates for various tech stacks.
development
Main application building orchestrator. Creates full-stack applications from natural language requests. Determines project type, selects tech stack, coordinates agents.