skills/observability/monitoring/SKILL.md
Cloud monitoring with Prometheus, Grafana, and cloud-native tools. Use when setting up metrics, alerts, dashboards, or troubleshooting performance issues.
npx skillsauth add liauw-media/codeassist cloud-monitoringInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Comprehensive observability with metrics, logs, and traces.
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm install prometheus prometheus-community/kube-prometheus-stack \
--namespace monitoring --create-namespace
apiVersion: monitoring.coreos.com/v1
kind: ServiceMonitor
metadata:
name: myapp
labels:
release: prometheus
spec:
selector:
matchLabels:
app: myapp
endpoints:
- port: metrics
interval: 30s
path: /metrics
apiVersion: monitoring.coreos.com/v1
kind: PrometheusRule
metadata:
name: myapp-alerts
spec:
groups:
- name: myapp
rules:
- alert: HighErrorRate
expr: |
sum(rate(http_requests_total{status=~"5.."}[5m])) /
sum(rate(http_requests_total[5m])) > 0.05
for: 5m
labels:
severity: critical
annotations:
summary: "High error rate ({{ $value | humanizePercentage }})"
- alert: HighLatency
expr: |
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m])) > 1
for: 5m
labels:
severity: warning
annotations:
summary: "P95 latency above 1s"
- alert: PodCrashLooping
expr: |
increase(kube_pod_container_status_restarts_total[1h]) > 5
for: 5m
labels:
severity: critical
annotations:
summary: "Pod {{ $labels.pod }} is crash looping"
# Request rate
rate(http_requests_total[5m])
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m])) / sum(rate(http_requests_total[5m])) * 100
# P95 latency
histogram_quantile(0.95, rate(http_request_duration_seconds_bucket[5m]))
# CPU usage by container
sum(rate(container_cpu_usage_seconds_total[5m])) by (container)
# Memory usage
container_memory_working_set_bytes / container_spec_memory_limit_bytes * 100
# Saturation - CPU throttling
rate(container_cpu_cfs_throttled_periods_total[5m]) / rate(container_cpu_cfs_periods_total[5m]) * 100
{
"title": "Application Overview",
"panels": [
{
"title": "Request Rate",
"type": "stat",
"targets": [
{
"expr": "sum(rate(http_requests_total[5m]))",
"legendFormat": "req/s"
}
]
},
{
"title": "Error Rate",
"type": "gauge",
"targets": [
{
"expr": "sum(rate(http_requests_total{status=~\"5..\"}[5m])) / sum(rate(http_requests_total[5m])) * 100"
}
],
"fieldConfig": {
"defaults": {
"thresholds": {
"steps": [
{"color": "green", "value": null},
{"color": "yellow", "value": 1},
{"color": "red", "value": 5}
]
}
}
}
}
]
}
import { Registry, Counter, Histogram, collectDefaultMetrics } from 'prom-client';
const register = new Registry();
collectDefaultMetrics({ register });
// Custom metrics
const httpRequestsTotal = new Counter({
name: 'http_requests_total',
help: 'Total HTTP requests',
labelNames: ['method', 'path', 'status'],
registers: [register],
});
const httpRequestDuration = new Histogram({
name: 'http_request_duration_seconds',
help: 'HTTP request duration in seconds',
labelNames: ['method', 'path'],
buckets: [0.01, 0.05, 0.1, 0.5, 1, 2, 5],
registers: [register],
});
// Middleware
app.use((req, res, next) => {
const end = httpRequestDuration.startTimer({ method: req.method, path: req.path });
res.on('finish', () => {
httpRequestsTotal.inc({ method: req.method, path: req.path, status: res.statusCode });
end();
});
next();
});
// Metrics endpoint
app.get('/metrics', async (req, res) => {
res.set('Content-Type', register.contentType);
res.end(await register.metrics());
});
from prometheus_client import Counter, Histogram, generate_latest, REGISTRY
from functools import wraps
import time
REQUEST_COUNT = Counter(
'http_requests_total',
'Total HTTP requests',
['method', 'endpoint', 'status']
)
REQUEST_LATENCY = Histogram(
'http_request_duration_seconds',
'HTTP request duration',
['method', 'endpoint'],
buckets=[.01, .05, .1, .5, 1, 2, 5]
)
def track_requests(func):
@wraps(func)
def wrapper(*args, **kwargs):
start = time.time()
response = func(*args, **kwargs)
REQUEST_LATENCY.labels(
method=request.method,
endpoint=request.endpoint
).observe(time.time() - start)
REQUEST_COUNT.labels(
method=request.method,
endpoint=request.endpoint,
status=response.status_code
).inc()
return response
return wrapper
@app.route('/metrics')
def metrics():
return generate_latest(REGISTRY), 200, {'Content-Type': 'text/plain'}
| SLI | Definition | Target | |-----|------------|--------| | Availability | Successful requests / Total requests | 99.9% | | Latency | P95 response time | < 200ms | | Error Rate | 5xx errors / Total requests | < 0.1% |
# Error budget remaining
1 - (
sum(rate(http_requests_total{status=~"5.."}[30d])) /
sum(rate(http_requests_total[30d]))
) / (1 - 0.999) # 99.9% SLO
| Metric | Warning | Critical | |--------|---------|----------| | Error Rate | > 1% | > 5% | | P95 Latency | > 500ms | > 2s | | CPU Usage | > 70% | > 90% | | Memory Usage | > 80% | > 95% | | Disk Usage | > 80% | > 90% |
resource "aws_cloudwatch_metric_alarm" "cpu_high" {
alarm_name = "${var.project}-cpu-high"
comparison_operator = "GreaterThanThreshold"
evaluation_periods = 2
metric_name = "CPUUtilization"
namespace = "AWS/ECS"
period = 300
statistic = "Average"
threshold = 80
alarm_actions = [aws_sns_topic.alerts.arn]
}
resource "google_monitoring_alert_policy" "cpu_high" {
display_name = "${var.project}-cpu-high"
combiner = "OR"
conditions {
display_name = "CPU utilization"
condition_threshold {
filter = "resource.type=\"gce_instance\" AND metric.type=\"compute.googleapis.com/instance/cpu/utilization\""
comparison = "COMPARISON_GT"
threshold_value = 0.8
duration = "300s"
}
}
notification_channels = [google_monitoring_notification_channel.email.id]
}
Works with:
/devops - Deployment monitoring/k8s - Kubernetes observability/aws, /gcp, /azure - Cloud monitoring/security - Security monitoringdevelopment
Use when decomposing complex work. Dispatch fresh subagent per task, review between tasks. Flow: Load plan → Dispatch task → Review output → Apply feedback → Mark complete → Next task. No skipping reviews, no parallel dispatch.
development
# Server Documentation System Set up a documentation system that tracks changes and maintains server/project documentation with Claude Code hooks. ## When to Use - Setting up a new server or development environment - Need to track configuration changes over time - Want automatic documentation of work sessions - Maintaining changelog for infrastructure ## Directory Structure ``` ~/docs/ # User home directory (cross-platform) ├── changelog.md # Global over
development
Delegate tasks to remote Claude Code agent containers for parallel execution, long-running analysis, or resource-intensive operations.
development
Use when working on multiple features simultaneously. Creates isolated workspaces without branch switching, enabling parallel development.