i18n/de/skills/define-slo-sli-sla/SKILL.md
Define Service Level Objectives (SLOs), Service Level Indicators (SLIs), and Service Level Agreements (SLAs) with error budget tracking, burn-rate alerts, and automated reporting using Prometheus and tools such as Sloth or Pyrra. Use when setting reliability targets for customer-facing services, balancing feature velocity against system reliability through error budgets, migrating from arbitrary uptime targets to data-driven metrics, or implementing Site Reliability Engineering practices.
Set measurable reliability targets with Service Level Objectives, track them with indicators, and manage error budgets.
Complete configuration files and templates are available under Extended Examples.
Learn the relationships and differences between these three concepts.
Definitions:
SLI (Service Level Indicator)
- **What**: A quantitative measure of service behavior
- **Example**: Request success rate, request latency, system throughput
- **Measurement**: `successful_requests / total_requests * 100`
SLO (Service Level Objective)
- **What**: Target value or range for an SLI over a time window
- **Example**: 99.9% of requests succeed within a 30-day window
- **Purpose**: Internal reliability target to guide operations
SLA (Service Level Agreement)
- **What**: Contractual commitment with consequences for missing the agreed target
- **Example**: 99.9% uptime SLA with refunds if breached
- **Purpose**: External promise to customers with penalties
Hierarchy:
SLA (99.9% uptime, customer refunds)
├─ SLO (99.95% success rate, internal target)
│ └─ SLI (actual measured: 99.97% success rate)
└─ Error Budget (0.05% failures allowed per month)
Key principle: The SLO should be stricter than the SLA, so that there is a buffer before customers are affected.
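The relationship between the three concepts can be sketched in a few lines of Python. This is only an illustration: the function names `availability_sli` and `classify` are hypothetical, and the request counts are made up to match the 99.97% figure in the hierarchy above.

```python
def availability_sli(successful_requests: int, total_requests: int) -> float:
    """Return the availability SLI as a percentage of successful requests."""
    if total_requests == 0:
        return 100.0  # assumption: a window without traffic counts as available
    return successful_requests / total_requests * 100


def classify(sli: float, slo: float, sla: float) -> str:
    """Compare a measured SLI against the internal SLO and the external SLA."""
    if slo < sla:
        raise ValueError("the SLO should be at least as strict as the SLA")
    if sli >= slo:
        return "meeting SLO"
    if sli >= sla:
        return "SLO missed, SLA still met"
    return "SLA breached"


# Illustrative numbers only: 999,700 of 1,000,000 requests succeeded.
sli = availability_sli(successful_requests=999_700, total_requests=1_000_000)
print(f"{sli:.2f}% -> {classify(sli, slo=99.95, sla=99.9)}")
```

The middle state ("SLO missed, SLA still met") is the buffer the key principle describes: the team can react before any contractual consequence applies.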
Example:
Expected: The team understands the differences, agrees on which metrics become SLIs, and is aligned on SLO targets.
On failure:
Select SLIs that reflect user experience and business impact.
The four golden signals (Google SRE):
Latency: time taken to serve a request
# P95 latency
histogram_quantile(0.95,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
)
Traffic: demand placed on the system
# Requests per second
sum(rate(http_requests_total[5m]))
Errors: rate of failed requests
# Error rate percentage
sum(rate(http_requests_total{status=~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])) * 100
Saturation: how "full" the system is
# CPU saturation (fraction of CPU time not spent idle, averaged over all cores)
1 - avg(rate(node_cpu_seconds_total{mode="idle"}[5m]))
Common SLI patterns:
# Availability SLI
availability:
description: "Percentage of successful requests"
query: |
sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m]))
good_threshold: 0.999 # 99.9%
# Latency SLI
latency:
description: "P99 request latency under 500ms"
query: |
histogram_quantile(0.99,
sum(rate(http_request_duration_seconds_bucket[5m])) by (le)
) < 0.5
good_threshold: 0.95 # 95% of windows meet target
# Throughput SLI
throughput:
description: "Requests processed per second"
query: |
sum(rate(http_requests_total[5m]))
good_threshold: 1000 # Minimum 1000 req/s
# Data freshness SLI (for batch jobs)
freshness:
description: "Data updated within last hour"
query: |
(time() - max(data_last_updated_timestamp)) < 3600
good_threshold: 1 # Always fresh
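The latency pattern above treats `good_threshold: 0.95` as "95% of evaluation windows meet the target". A minimal sketch of that window-based reading, assuming the P99 values have already been pulled out of Prometheus (the sample list and the function name `fraction_of_good_windows` are hypothetical):

```python
def fraction_of_good_windows(p99_latencies_s: list[float], target_s: float = 0.5) -> float:
    """Return the share of evaluation windows whose P99 latency is under the target."""
    if not p99_latencies_s:
        raise ValueError("need at least one window")
    good = sum(1 for value in p99_latencies_s if value < target_s)
    return good / len(p99_latencies_s)


# Hypothetical P99 latency (seconds) for ten consecutive 5-minute windows.
samples = [0.21, 0.35, 0.48, 0.62, 0.30, 0.44, 0.51, 0.29, 0.33, 0.40]
ratio = fraction_of_good_windows(samples)
print(f"good windows: {ratio:.0%}, meets 95% threshold: {ratio >= 0.95}")
```

Counting windows rather than individual requests is a simplification; a request-based SLI (good requests divided by total requests) weights busy periods more heavily and is often preferred when histogram buckets are available.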
SLI selection criteria:
Avoid:
Expected: 2-4 SLIs selected per service, covering at least availability and latency, with team agreement on the measurement queries.
On failure:
Define realistic and achievable reliability targets.
SLO specification format:
service: user-api
slos:
- name: availability
objective: 99.9
description: |
99.9% of requests return non-5xx status codes
# ... (see EXAMPLES.md for complete configuration)
Choosing the time window:
Common windows:
Example error budget for a 30-day window:
SLO: 99.9% availability over 30 days
Allowed failures: 0.1%
Total requests per month: 100M
Error budget: 100,000 failed requests
Daily budget: ~3,333 failed requests
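The arithmetic above can be reproduced with a short Python sketch. `Fraction` is used so that 99.9 is handled exactly rather than as a binary float; the function names and the 22,000 failed requests in the usage line are illustrative assumptions.

```python
from fractions import Fraction


def error_budget(slo_percent: str, total_requests: int, window_days: int = 30):
    """Return (total allowed failures, average allowed failures per day)."""
    allowed_fraction = (100 - Fraction(slo_percent)) / 100  # 99.9 -> 1/1000
    total_budget = int(allowed_fraction * total_requests)
    return total_budget, total_budget / window_days


def budget_remaining(total_budget: int, failed_so_far: int) -> float:
    """Return the unspent share of the error budget (negative once overspent)."""
    return 1 - failed_so_far / total_budget


total, daily = error_budget("99.9", total_requests=100_000_000)
print(f"Error budget: {total:,} failed requests (~{int(daily):,} per day)")
print(f"Remaining after 22,000 failures: {budget_remaining(total, 22_000):.0%}")
```

Passing the SLO as a string is a deliberate choice here: it keeps the budget an exact integer for targets such as 99.95 or 99.99.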
Setting realistic targets:
Use current performance as the starting point:
# Check actual availability over past 90 days
avg_over_time(
(sum(rate(http_requests_total{status!~"5.."}[5m]))
/ sum(rate(http_requests_total[5m])))[90d:5m]
)
# Result: 99.95% → Set SLO at 99.9% (slightly below current performance, leaving headroom)
Calculate the cost of nines:
99% → 7.2 hours downtime/month (low reliability)
99.9% → 43 minutes downtime/month (good)
99.95% → 22 minutes downtime/month (very good)
99.99% → 4.3 minutes downtime/month (expensive)
99.999% → 26 seconds downtime/month (very expensive)
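The downtime figures in this table follow directly from the window length. A small sketch, assuming a 30-day month (43,200 minutes); the function name is hypothetical:

```python
def allowed_downtime_minutes(slo_percent: float, window_days: int = 30) -> float:
    """Return the downtime (in minutes) a given availability target permits."""
    window_minutes = window_days * 24 * 60
    return window_minutes * (100 - slo_percent) / 100


for target in (99.0, 99.9, 99.95, 99.99, 99.999):
    minutes = allowed_downtime_minutes(target)
    if minutes >= 60:
        print(f"{target}% -> {minutes / 60:.1f} hours/month")
    elif minutes >= 1:
        print(f"{target}% -> {minutes:.1f} minutes/month")
    else:
        print(f"{target}% -> {minutes * 60:.0f} seconds/month")
```

Each additional nine cuts the allowance by a factor of ten, which is why the engineering cost rises so steeply toward the bottom of the table.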
Weigh user happiness against engineering cost:
Expected: SLO targets are set with approval from business stakeholders, documented with rationale, and the error budget is calculated.
On failure:
Use Sloth to generate Prometheus recording rules and alerts from SLO specifications.
Install Sloth:
# Binary installation
wget https://github.com/slok/sloth/releases/download/v0.11.0/sloth-linux-amd64
chmod +x sloth-linux-amd64
sudo mv sloth-linux-amd64 /usr/local/bin/sloth
# Or Docker
docker pull ghcr.io/slok/sloth:latest
Create the Sloth SLO specification (slos/user-api.yml):
version: "prometheus/v1"
service: "user-api"
labels:
team: "platform"
tier: "1"
slos:
# ... (see EXAMPLES.md for complete configuration)
Generate Prometheus rules:
# Generate recording and alerting rules
sloth generate -i slos/user-api.yml -o prometheus/rules/user-api-slo.yml
# Validate generated rules
promtool check rules prometheus/rules/user-api-slo.yml
Generated recording rules (excerpt):
groups:
- name: sloth-slo-sli-recordings-user-api-requests-availability
interval: 30s
rules:
# SLI: ratio of bad (error) events, as the sli_error rule name indicates
- record: slo:sli_error:ratio_rate5m
# ... (see EXAMPLES.md for complete configuration)
Generated alerts:
groups:
- name: sloth-slo-alerts-user-api-requests-availability
rules:
# Fast burn: 2% budget consumed in 1 hour
- alert: UserAPIHighErrorRate
expr: |
# ... (see EXAMPLES.md for complete configuration)
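The "2% budget consumed in 1 hour" comment in the alert excerpt comes from burn-rate arithmetic. The sketch below is an illustration only: the 14.4 threshold is the fast-burn factor commonly cited for a 30-day window in the Google SRE Workbook and is assumed here, not read from the generated rules, and all function names are hypothetical.

```python
def burn_rate(error_ratio: float, slo_percent: float) -> float:
    """Return how many times faster than 'exactly on budget' errors are occurring."""
    budget_fraction = (100 - slo_percent) / 100
    return error_ratio / budget_fraction


def budget_consumed(rate: float, window_hours: float, period_days: int = 30) -> float:
    """Return the share of the period's error budget used at this burn rate."""
    return rate * window_hours / (period_days * 24)


def should_page(error_ratio_long: float, error_ratio_short: float,
                slo_percent: float, threshold: float = 14.4) -> bool:
    """Multi-window check: both the long and the short window must exceed the threshold."""
    return (burn_rate(error_ratio_long, slo_percent) > threshold
            and burn_rate(error_ratio_short, slo_percent) > threshold)


print(f"{budget_consumed(14.4, window_hours=1):.0%} of the budget in 1 hour")
print(should_page(error_ratio_long=0.02, error_ratio_short=0.03, slo_percent=99.9))
```

Requiring the short window to agree with the long one keeps the alert from firing long after a brief spike has already ended.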
Load the rules into Prometheus:
# prometheus.yml
rule_files:
- "rules/user-api-slo.yml"
Reload Prometheus (the reload endpoint is only available when Prometheus runs with --web.enable-lifecycle):
curl -X POST http://localhost:9090/-/reload
Expected: Sloth generates multi-window, multi-burn-rate alerts, recording rules evaluate successfully, and alerts fire appropriately during incidents.
On failure:
- Validate with `yamllint slos/user-api.yml`
- `curl http://localhost:9090/api/v1/rules`
Visualize SLO compliance and error budget consumption in Grafana.
Grafana dashboard JSON (excerpt):
{
"dashboard": {
"title": "SLO Dashboard - User API",
"panels": [
{
"type": "stat",
# ... (see EXAMPLES.md for complete configuration)
Key metrics to visualize:
Error budget policy dashboard (Markdown panel):
## Error Budget Policy
**Current Status**: 78% budget remaining
### If Error Budget > 50%
- ✅ Full speed ahead on new features
# ... (see EXAMPLES.md for complete configuration)
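A policy like this one maps the remaining budget to an action. In the sketch below only the "above 50%" tier comes from the panel; the 20% and 0% tiers and their wording are hypothetical placeholders, since the full policy is in EXAMPLES.md.

```python
def policy_action(budget_remaining: float) -> str:
    """Map the remaining error budget (0.0 to 1.0) to a policy tier."""
    if budget_remaining > 0.50:
        return "full speed ahead on new features"
    if budget_remaining > 0.20:  # hypothetical tier, not from the source policy
        return "ship with additional review"
    if budget_remaining > 0.0:   # hypothetical tier, not from the source policy
        return "prioritize reliability work"
    return "freeze feature deployments"  # hypothetical: budget exhausted


print(policy_action(0.78))
```

Keeping the tiers in one function makes it easy to reuse the same thresholds in a dashboard annotation and in a deployment gate.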
Expected: Dashboards show real-time SLO compliance, error budget burn-down is visible, and the team can make informed decisions about feature velocity.
On failure:
- `curl http://localhost:9090/api/v1/rules | jq '.data.groups[].rules[] | select(.name | contains("slo:"))'`
Define an organizational process for managing error budgets.
Error budget policy template:
service: user-api
slo:
availability: 99.9%
latency_p99: 200ms
window: 30 days
# ... (see EXAMPLES.md for complete configuration)
Automate policy enforcement:
# Example: Deployment gate script
import requests
import sys
def check_error_budget(service):
# Query Prometheus for error budget
# ... (see EXAMPLES.md for complete configuration)
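Since the script above is elided, here is one possible shape for such a gate, not the author's implementation. It uses the Prometheus instant-query endpoint (`/api/v1/query`) and only the standard library. The metric name `slo:period_error_budget_remaining:ratio` and the `sloth_service` label are assumptions about what the generated rules expose, and the Prometheus address is taken from the reload example; adjust all three to your setup.

```python
import json
import sys
import urllib.parse
import urllib.request

PROMETHEUS_URL = "http://localhost:9090"  # assumption: same host as the reload example


def parse_instant_value(payload: dict) -> float:
    """Extract the first sample value from an instant-query response body."""
    if payload.get("status") != "success":
        raise RuntimeError(f"query failed: {payload.get('error', 'unknown error')}")
    result = payload["data"]["result"]
    if not result:
        raise RuntimeError("query returned no series")
    return float(result[0]["value"][1])  # value is [timestamp, "string value"]


def query_prometheus(expr: str) -> float:
    """Run an instant query against Prometheus and return a single number."""
    url = f"{PROMETHEUS_URL}/api/v1/query?" + urllib.parse.urlencode({"query": expr})
    with urllib.request.urlopen(url, timeout=10) as response:
        return parse_instant_value(json.load(response))


def deployment_allowed(budget_remaining: float, minimum: float = 0.0) -> bool:
    """Allow deployments only while some error budget is left."""
    return budget_remaining > minimum


def main(argv: list[str]) -> int:
    service = argv[1] if len(argv) > 1 else "user-api"
    # Assumed metric and label names; check your generated rules first.
    expr = f'min(slo:period_error_budget_remaining:ratio{{sloth_service="{service}"}})'
    remaining = query_prometheus(expr)
    print(f"{service}: {remaining:.1%} error budget remaining")
    return 0 if deployment_allowed(remaining) else 1

# In a real script, finish with: sys.exit(main(sys.argv))
```

A non-zero exit code is what makes the gate work in CI: a failing step stops the job before any later deploy step runs.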
Integrate into the CI/CD pipeline:
# .github/workflows/deploy.yml
jobs:
check-error-budget:
runs-on: ubuntu-latest
steps:
- name: Check SLO Error Budget
run: |
python scripts/check_error_budget.py user-api
- name: Deploy
if: success()
run: |
kubectl apply -f deploy/
Expected: A clear policy is documented, automated gates prevent risky deployments while the budget is exhausted, and the team is aligned on reliability priorities.
On failure:
Related skills:
- setup-prometheus-monitoring - Configure Prometheus to collect metrics for SLI calculation
- configure-alerting-rules - Integrate SLO burn-rate alerts with Alertmanager for on-call notifications
- build-grafana-dashboards - Visualize SLO compliance and error budget consumption
- write-incident-runbook - Include SLO impact in runbooks to prioritize incident response