areas/software/platform/skills/observability-setup/SKILL.md
# Skill: Observability Setup

## When to load

When setting up monitoring for a new service, configuring alerts, or debugging production issues.

## Golden Signals (Mandatory)

Every service must expose:

1. **Latency**: p50, p95, p99 response times
2. **Traffic**: requests per second
3. **Errors**: 4xx/5xx rate
4. **Saturation**: CPU %, memory %, queue depth
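The four golden signals can be derived from raw request samples. The following is a minimal sketch using only the Python standard library, not a production exporter; `RequestSample`, `golden_signals`, and every parameter name are hypothetical and chosen for illustration only.

```python
from dataclasses import dataclass
from statistics import quantiles


@dataclass
class RequestSample:
    """One observed request (hypothetical shape for this sketch)."""
    duration_s: float
    status: int


def golden_signals(samples, window_s, cpu_pct, memory_pct, queue_depth):
    """Summarize the four golden signals for one observation window.

    Saturation values are passed in by the caller because they come from
    the host or runtime, not from the request samples themselves.
    Needs at least two samples, which statistics.quantiles requires.
    """
    durations = [s.duration_s for s in samples]
    # n=100 yields 99 cut points; index 49 is p50, 94 is p95, 98 is p99.
    cuts = quantiles(durations, n=100, method="inclusive")
    # Errors are counted as any 4xx or 5xx response.
    errors = sum(1 for s in samples if s.status >= 400)
    return {
        "latency": {"p50": cuts[49], "p95": cuts[94], "p99": cuts[98]},
        "traffic_rps": len(samples) / window_s,
        "error_rate": errors / len(samples),
        "saturation": {
            "cpu_pct": cpu_pct,
            "memory_pct": memory_pct,
            "queue_depth": queue_depth,
        },
    }
```

In a real service these values would more likely be exported as counters and histograms and aggregated by the monitoring backend; computing them in-process is shown here only to make the definitions concrete.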
## Prometheus Alert Rules

```yaml
groups:
  - name: api-alerts
    rules:
      - alert: HighErrorRate
        expr: |
          sum(rate(http_requests_total{status=~"5.."}[5m]))
          / sum(rate(http_requests_total[5m])) > 0.01
        for: 2m
        labels: { severity: critical }
        annotations:
          summary: "Error rate > 1% for 2 minutes"
          runbook: "https://runbooks.internal/high-error-rate"
      - alert: HighLatency
        expr: histogram_quantile(0.99, rate(http_request_duration_seconds_bucket[5m])) > 2
        for: 5m
        labels: { severity: warning }
```
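The `HighLatency` rule depends on `histogram_quantile`, which estimates a quantile from cumulative bucket counts by linear interpolation within the bucket that contains the target rank. The sketch below is a simplified Python illustration of that idea, not Prometheus's actual implementation; it assumes the lowest bucket starts at 0, and the bucket counts in the usage note are made up.

```python
import math


def histogram_quantile(q, buckets):
    """Estimate the q-quantile from cumulative histogram buckets.

    buckets: list of (upper_bound, cumulative_count) sorted by bound,
    ending with a (math.inf, total_count) entry, mirroring `le` buckets.
    """
    if not 0 < q <= 1:
        raise ValueError("q must be in (0, 1] for this sketch")
    total = buckets[-1][1]
    if total == 0:
        return math.nan
    rank = q * total
    prev_bound, prev_count = 0.0, 0.0  # assumption: lowest bucket starts at 0
    for bound, count in buckets:
        if count >= rank:
            if math.isinf(bound):
                # Rank falls in the +Inf bucket: report the highest finite bound.
                return buckets[-2][0]
            # Interpolate linearly between the bucket's lower and upper bound.
            fraction = (rank - prev_count) / (count - prev_count)
            return prev_bound + (bound - prev_bound) * fraction
        prev_bound, prev_count = bound, count
    return math.nan
```

For example, with hypothetical buckets `[(0.5, 900), (1.0, 980), (2.0, 995), (math.inf, 1000)]`, the p99 estimate is about 1.67 seconds, so the `> 2` condition would not hold. Because the estimate is interpolated, its accuracy depends on how closely the bucket boundaries bracket the alert threshold.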
```json
{
  "timestamp": "2026-02-16T10:30:00Z",
  "level": "ERROR",
  "service": "payments-api",
  "trace_id": "abc123",
  "message": "Payment processing failed",
  "error": { "type": "PaymentGatewayError", "code": "CARD_DECLINED" },
  "duration_ms": 1240
}
```
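A log entry with the fields shown above could be produced with Python's standard `logging` module. This is a minimal sketch under assumptions: `JsonFormatter` and `build_logger` are hypothetical helpers, and passing `trace_id`, `error`, and `duration_ms` through `extra` is one possible convention, not something the original prescribes.

```python
import json
import logging
import sys
from datetime import datetime, timezone


class JsonFormatter(logging.Formatter):
    """Render each log record as a single JSON object (hypothetical helper)."""

    def __init__(self, service):
        super().__init__()
        self.service = service

    def format(self, record):
        created = datetime.fromtimestamp(record.created, tz=timezone.utc)
        entry = {
            "timestamp": created.strftime("%Y-%m-%dT%H:%M:%SZ"),
            "level": record.levelname,
            "service": self.service,
            "trace_id": getattr(record, "trace_id", None),
            "message": record.getMessage(),
        }
        # Optional fields are included only when the caller supplied them via `extra`.
        for key in ("error", "duration_ms"):
            if hasattr(record, key):
                entry[key] = getattr(record, key)
        return json.dumps(entry)


def build_logger(service, stream=None):
    """Return a logger that writes one JSON line per record to `stream`."""
    logger = logging.getLogger(service)
    logger.handlers.clear()   # avoid duplicate handlers on repeated calls
    logger.propagate = False  # keep records out of the root logger
    handler = logging.StreamHandler(stream or sys.stderr)
    handler.setFormatter(JsonFormatter(service))
    logger.addHandler(handler)
    logger.setLevel(logging.INFO)
    return logger
```

A call such as `build_logger("payments-api").error("Payment processing failed", extra={"trace_id": "abc123", "error": {"type": "PaymentGatewayError", "code": "CARD_DECLINED"}, "duration_ms": 1240})` would emit one line with the same keys as the entry above. Keeping `trace_id` at the top level makes it straightforward to correlate log lines with traces.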