skills_antigravity/skills/prometheus-configuration/SKILL.md
Set up Prometheus for comprehensive metric collection, storage, and monitoring of infrastructure and applications. Use when implementing metrics collection, setting up monitoring infrastructure, or configuring alerting systems.
Complete guide to Prometheus setup, metric collection, scrape configuration, and recording rules.
See resources/implementation-playbook.md.
Configure Prometheus for comprehensive metric collection, alerting, and monitoring of infrastructure and applications.
┌──────────────┐
│ Applications │ ← Instrumented with client libraries
└──────┬───────┘
       │ /metrics endpoint
       ↓
┌──────────────┐
│  Prometheus  │ ← Scrapes metrics periodically
│    Server    │
└──────┬───────┘
       │
       ├─→ AlertManager (alerts)
       ├─→ Grafana (visualization)
       └─→ Long-term storage (Thanos/Cortex)
helm repo add prometheus-community https://prometheus-community.github.io/helm-charts
helm repo update
helm install prometheus prometheus-community/kube-prometheus-stack \
  --namespace monitoring \
  --create-namespace \
  --set prometheus.prometheusSpec.retention=30d \
  --set prometheus.prometheusSpec.storageSpec.volumeClaimTemplate.spec.resources.requests.storage=50Gi
version: '3.8'
services:
  prometheus:
    image: prom/prometheus:latest
    ports:
      - "9090:9090"
    volumes:
      - ./prometheus.yml:/etc/prometheus/prometheus.yml
      - prometheus-data:/prometheus
    command:
      - '--config.file=/etc/prometheus/prometheus.yml'
      - '--storage.tsdb.path=/prometheus'
      - '--storage.tsdb.retention.time=30d'

volumes:
  prometheus-data:
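The 30-day retention set above has to fit on the volume that backs `/prometheus`. As a rough sizing sketch, the Prometheus storage documentation estimates needed disk as retention seconds × ingested samples per second × bytes per sample, at roughly 1 to 2 bytes per sample. The function names and the 150,000-series workload below are hypothetical, and 2 bytes per sample is a deliberately conservative assumption:

```python
SECONDS_PER_DAY = 86_400


def samples_per_second(active_series, scrape_interval_seconds):
    """Ingestion rate if every active series is scraped once per interval."""
    return active_series / scrape_interval_seconds


def estimate_disk_bytes(retention_days, ingest_rate, bytes_per_sample=2.0):
    """Rough TSDB size: retention (s) * samples/s * bytes/sample."""
    return retention_days * SECONDS_PER_DAY * ingest_rate * bytes_per_sample


if __name__ == "__main__":
    # Hypothetical workload: 150k active series at the 15s scrape interval.
    rate = samples_per_second(150_000, 15)
    needed = estimate_disk_bytes(30, rate)
    print(f"{rate:.0f} samples/s needs about {needed / 2**30:.1f} GiB for 30d")
```

Under these assumptions the estimate comes out at roughly 48 GiB, so a 50Gi volume leaves little headroom at that series count. Real usage depends on churn and compression, so treat the number as a starting point and check it against the actual on-disk size.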
prometheus.yml:
global:
  scrape_interval: 15s
  evaluation_interval: 15s
  external_labels:
    cluster: 'production'
    region: 'us-west-2'

# Alertmanager configuration
alerting:
  alertmanagers:
    - static_configs:
        - targets:
            - alertmanager:9093

# Load rules files
rule_files:
  - /etc/prometheus/rules/*.yml

# Scrape configurations
scrape_configs:
  # Prometheus itself
  - job_name: 'prometheus'
    static_configs:
      - targets: ['localhost:9090']

  # Node exporters
  - job_name: 'node-exporter'
    static_configs:
      - targets:
          - 'node1:9100'
          - 'node2:9100'
          - 'node3:9100'
    relabel_configs:
      - source_labels: [__address__]
        target_label: instance
        regex: '([^:]+)(:[0-9]+)?'
        replacement: '${1}'

  # Kubernetes pods with annotations
  - job_name: 'kubernetes-pods'
    kubernetes_sd_configs:
      - role: pod
    relabel_configs:
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_pod_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
      - source_labels: [__address__, __meta_kubernetes_pod_annotation_prometheus_io_port]
        action: replace
        regex: ([^:]+)(?::\d+)?;(\d+)
        replacement: $1:$2
        target_label: __address__
      - source_labels: [__meta_kubernetes_namespace]
        action: replace
        target_label: namespace
      - source_labels: [__meta_kubernetes_pod_name]
        action: replace
        target_label: pod

  # Application metrics
  - job_name: 'my-app'
    static_configs:
      - targets:
          - 'app1.example.com:9090'
          - 'app2.example.com:9090'
    metrics_path: '/metrics'
    scheme: 'https'
    tls_config:
      ca_file: /etc/prometheus/ca.crt
      cert_file: /etc/prometheus/client.crt
      key_file: /etc/prometheus/client.key
Reference: See assets/prometheus.yml.template
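The `replace` relabel rules in the configuration above are easier to reason about when tried against sample values. The sketch below approximates what Prometheus does for `action: replace`: source label values are joined with `;`, the regex must match the whole joined string, and capture groups are substituted into the replacement. `relabel_replace` is a hypothetical helper, and Python's `re` module stands in for the RE2 engine Prometheus uses, so treat it as an approximation for simple patterns like these.

```python
import re


def relabel_replace(source_values, regex, replacement, separator=";"):
    """Approximate a Prometheus `replace` relabel step.

    Returns the new label value, or None when the regex does not match
    (Prometheus then leaves the target label unchanged).
    """
    joined = separator.join(source_values)
    # Prometheus anchors relabel regexes at both ends, hence fullmatch.
    match = re.fullmatch(regex, joined)
    if match is None:
        return None
    # Prometheus writes groups as $1 or ${1}; Python's expand() wants \g<1>.
    template = re.sub(r"\$\{?(\d+)\}?", r"\\g<\1>", replacement)
    return match.expand(template)


if __name__ == "__main__":
    # node-exporter job: strip the port from __address__ for `instance`.
    print(relabel_replace(["node1:9100"], r"([^:]+)(:[0-9]+)?", "${1}"))
    # kubernetes-pods job: swap in the port from the pod annotation.
    print(relabel_replace(["10.0.0.5:8080", "9102"],
                          r"([^:]+)(?::\d+)?;(\d+)", "$1:$2"))
```

When a pod has no port annotation, the joined value ends in a bare `;`, the regex fails to match, and `__address__` keeps its discovered value.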
scrape_configs:
  - job_name: 'static-targets'
    static_configs:
      - targets: ['host1:9100', 'host2:9100']
        labels:
          env: 'production'
          region: 'us-west-2'

scrape_configs:
  - job_name: 'file-sd'
    file_sd_configs:
      - files:
          - /etc/prometheus/targets/*.json
          - /etc/prometheus/targets/*.yml
        refresh_interval: 5m
targets/production.json:
[
  {
    "targets": ["app1:9090", "app2:9090"],
    "labels": {
      "env": "production",
      "service": "api"
    }
  }
]
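Target files like the one above are usually generated by a script or inventory tool rather than edited by hand. A minimal sketch of such a generator follows; `write_file_sd` is a hypothetical helper, and the output path is only an example. It writes to a temporary file first and then renames it, so Prometheus never reads a half-written file.

```python
import json
import os
import tempfile


def write_file_sd(path, groups):
    """Atomically write a file_sd target list as JSON.

    `groups` is a list of {"targets": [...], "labels": {...}} dicts.
    """
    directory = os.path.dirname(os.path.abspath(path))
    # The .tmp suffix keeps the scratch file outside the *.json / *.yml globs.
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(groups, handle, indent=2)
        # Rename within the same directory, so readers see old or new content.
        os.replace(tmp_path, path)
    except BaseException:
        if os.path.exists(tmp_path):
            os.unlink(tmp_path)
        raise


if __name__ == "__main__":
    write_file_sd("production.json", [
        {"targets": ["app1:9090", "app2:9090"],
         "labels": {"env": "production", "service": "api"}},
    ])
```

Prometheus watches the matched files for changes and also re-reads them every `refresh_interval`, so no reload is needed after the file is rewritten.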
scrape_configs:
  - job_name: 'kubernetes-services'
    kubernetes_sd_configs:
      - role: service
    relabel_configs:
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scrape]
        action: keep
        regex: true
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_scheme]
        action: replace
        target_label: __scheme__
        regex: (https?)
      - source_labels: [__meta_kubernetes_service_annotation_prometheus_io_path]
        action: replace
        target_label: __metrics_path__
        regex: (.+)
Reference: See references/scrape-configs.md
Create pre-computed metrics for frequently queried expressions:
# /etc/prometheus/rules/recording_rules.yml
groups:
  - name: api_metrics
    interval: 15s
    rules:
      # HTTP request rate per service
      - record: job:http_requests:rate5m
        expr: sum by (job) (rate(http_requests_total[5m]))

      # Error rate percentage
      - record: job:http_requests_errors:rate5m
        expr: sum by (job) (rate(http_requests_total{status=~"5.."}[5m]))

      - record: job:http_requests_error_rate:percentage
        expr: |
          (job:http_requests_errors:rate5m / job:http_requests:rate5m) * 100

      # P95 latency
      - record: job:http_request_duration:p95
        expr: |
          histogram_quantile(0.95,
            sum by (job, le) (rate(http_request_duration_seconds_bucket[5m]))
          )

  - name: resource_metrics
    interval: 30s
    rules:
      # CPU utilization percentage
      - record: instance:node_cpu:utilization
        expr: |
          100 - (avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])) * 100)

      # Memory utilization percentage
      - record: instance:node_memory:utilization
        expr: |
          100 - ((node_memory_MemAvailable_bytes / node_memory_MemTotal_bytes) * 100)

      # Disk usage percentage
      - record: instance:node_disk:utilization
        expr: |
          100 - ((node_filesystem_avail_bytes / node_filesystem_size_bytes) * 100)
Reference: See references/recording-rules.md
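The P95 rule above depends on how `histogram_quantile` turns cumulative bucket counts into a single value. The sketch below is a simplified re-implementation for classic histograms, written only to show the linear interpolation inside the matching bucket; the real function handles more edge cases, and the bucket boundaries and counts are made-up illustration data.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate a quantile from cumulative (upper_bound, count) buckets.

    `buckets` must be sorted by upper bound and end with (math.inf, total).
    Simplified sketch: assumes 0 < q <= 1 and positive bucket bounds.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1]")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    # First bucket whose cumulative count reaches the target rank.
    index = next(i for i, (_, count) in enumerate(buckets) if count >= rank)
    upper, count = buckets[index]
    if math.isinf(upper):
        # Quantile lies in the +Inf bucket: report the highest finite bound.
        return buckets[-2][0]
    lower, lower_count = (0.0, 0) if index == 0 else buckets[index - 1]
    # Assume observations are spread evenly inside the bucket.
    return lower + (upper - lower) * (rank - lower_count) / (count - lower_count)


if __name__ == "__main__":
    # Hypothetical request-duration buckets (seconds, cumulative count).
    sample = [(0.1, 50), (0.5, 90), (1.0, 98), (math.inf, 100)]
    print(histogram_quantile(0.95, sample))
```

Because the result is interpolated within a bucket, its accuracy is bounded by the bucket layout: a threshold such as 1s is only meaningful if a bucket boundary sits at or near that value.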
# /etc/prometheus/rules/alert_rules.yml
groups:
  - name: availability
    interval: 30s
    rules:
      - alert: ServiceDown
        expr: up{job="my-app"} == 0
        for: 1m
        labels:
          severity: critical
        annotations:
          summary: "Service {{ $labels.instance }} is down"
          description: "{{ $labels.job }} has been down for more than 1 minute"

      - alert: HighErrorRate
        expr: job:http_requests_error_rate:percentage > 5
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High error rate for {{ $labels.job }}"
          description: "Error rate is {{ $value }}% (threshold: 5%)"

      - alert: HighLatency
        expr: job:http_request_duration:p95 > 1
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High latency for {{ $labels.job }}"
          description: "P95 latency is {{ $value }}s (threshold: 1s)"

  - name: resources
    interval: 1m
    rules:
      - alert: HighCPUUsage
        expr: instance:node_cpu:utilization > 80
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High CPU usage on {{ $labels.instance }}"
          description: "CPU usage is {{ $value }}%"

      - alert: HighMemoryUsage
        expr: instance:node_memory:utilization > 85
        for: 5m
        labels:
          severity: warning
        annotations:
          summary: "High memory usage on {{ $labels.instance }}"
          description: "Memory usage is {{ $value }}%"

      - alert: DiskSpaceLow
        expr: instance:node_disk:utilization > 90
        for: 5m
        labels:
          severity: critical
        annotations:
          summary: "Low disk space on {{ $labels.instance }}"
          description: "Disk usage is {{ $value }}%"
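Each rule above combines a threshold with a `for:` duration, which means the alert sits in a pending state until the expression has been true at every evaluation for that long. The sketch below models that state machine for a single series. It is a simplified model that ignores `keep_firing_for` and anything Alertmanager does afterwards; `alert_states` and the sample values are hypothetical.

```python
def alert_states(samples, threshold, for_seconds):
    """Return the alert state at each evaluation.

    `samples` is a list of (timestamp_seconds, value) pairs, one per rule
    evaluation. The alert is active while value > threshold.
    """
    states = []
    active_since = None
    for timestamp, value in samples:
        if value > threshold:
            if active_since is None:
                active_since = timestamp  # condition just became true
            held = timestamp - active_since
            states.append("firing" if held >= for_seconds else "pending")
        else:
            active_since = None  # any false evaluation resets the timer
            states.append("inactive")
    return states


if __name__ == "__main__":
    # Hypothetical error-rate readings every 30s against the 5% threshold,
    # with a shortened 60s hold so the transitions are visible.
    readings = [(0, 6.0), (30, 6.2), (60, 7.1), (90, 4.0), (120, 6.5)]
    print(alert_states(readings, threshold=5, for_seconds=60))
```

The dip at the fourth evaluation resets the timer, which is why short `for:` values produce flapping alerts on noisy metrics and longer values delay notification of real outages.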
# Validate configuration
promtool check config prometheus.yml
# Validate rules
promtool check rules /etc/prometheus/rules/*.yml
# Test query
promtool query instant http://localhost:9090 'up'
Reference: See scripts/validate-prometheus.sh
Check scrape targets:
curl http://localhost:9090/api/v1/targets
Check configuration:
curl http://localhost:9090/api/v1/status/config
Test query:
curl 'http://localhost:9090/api/v1/query?query=up'
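The raw JSON from the targets endpoint is verbose, so it can help to reduce it to just the targets that are failing. The following sketch does that with the standard library only. It assumes the documented response shape (`data.activeTargets[]` with `labels`, `health` and `lastError`); `fetch_targets` and `unhealthy_targets` are hypothetical helper names, and the default URL matches the curl commands above.

```python
import json
import urllib.request


def fetch_targets(base_url="http://localhost:9090"):
    """Fetch the targets API response from a running Prometheus server."""
    url = f"{base_url}/api/v1/targets"
    with urllib.request.urlopen(url, timeout=10) as response:
        return json.load(response)


def unhealthy_targets(payload):
    """Return (job, instance, last_error) for every target not reported up."""
    problems = []
    for target in payload["data"]["activeTargets"]:
        if target["health"] != "up":
            labels = target["labels"]
            problems.append((labels.get("job"), labels.get("instance"),
                             target.get("lastError", "")))
    return problems


if __name__ == "__main__":
    for job, instance, error in unhealthy_targets(fetch_targets()):
        print(f"{job} {instance}: {error}")
```

An empty result only means every discovered target is healthy. A target that was dropped by relabeling never appears in `activeTargets`, so check `droppedTargets` in the same response when an expected target is missing.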
assets/prometheus.yml.template - Complete configuration template
references/scrape-configs.md - Scrape configuration patterns
references/recording-rules.md - Recording rule examples
scripts/validate-prometheus.sh - Validation script

grafana-dashboards - For visualization
slo-implementation - For SLO monitoring
distributed-tracing - For request tracing