bagelhole/sre-dashboards

Name: sre-dashboards
Author: bagelhole

devops/observability/sre-dashboards/SKILL.md

npx skillsauth add bagelhole/devops-security-agent-skills sre-dashboards

SRE Dashboards

Build dashboards that help teams detect, triage, and prevent reliability incidents.

When to Use This Skill

Use this skill when:

Defining service-level dashboards for production systems
Tracking SLO health and error-budget burn
Creating incident command-center views
Standardizing dashboard patterns across teams

Prerequisites

Metrics pipeline (Prometheus, OpenTelemetry, or vendor equivalent)
Logs/traces linked to services and environments
Agreed service taxonomy (team, service, tier, environment)

Dashboard Architecture

Structure dashboards in layers:

Executive Reliability View: SLO attainment, incident counts, MTTR trends.
Service Health View: RED/USE metrics, dependency health, release markers.
Deep-Dive View: Per-endpoint latency, resource saturation, error categories.

Keep each view answer-oriented:

Are customers impacted?
What changed?
Where is the bottleneck?

Core SRE Panels

Golden Signals

Latency: p50/p95/p99 request duration by endpoint
Traffic: request throughput and queue depth
Errors: 5xx rate, failed jobs, timeout ratio
Saturation: CPU, memory, disk I/O, thread/connection pool exhaustion
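As a rough sketch, the traffic and saturation panels might be queried as follows. This assumes the `http_requests_total` counter and `route` label used in this skill's PromQL examples and, for CPU, that node_exporter's `node_cpu_seconds_total` is being scraped (an assumption; other exporters name this metric differently).

```promql
# Traffic: requests per second by route
sum by (route) (rate(http_requests_total[5m]))

# Saturation: CPU utilization (%) per instance, derived from idle time
# (assumes node_exporter is the source of node_cpu_seconds_total)
100 * (1 - avg by (instance) (rate(node_cpu_seconds_total{mode="idle"}[5m])))
```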

SLO Panels

Current SLI value (rolling windows: 5m, 1h, 24h, 30d)
Error-budget remaining (%)
Burn-rate panels (fast and slow windows)
Multi-window burn alert status
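One possible way to express the SLI and error-budget panels, assuming an availability SLO of 99.9% over 30 days (the target is an assumption; substitute the agreed value) and treating 5xx responses as the only failures:

```promql
# Availability SLI over the 30d window (fraction of non-5xx requests)
1 - (
  sum(rate(http_requests_total{status=~"5.."}[30d]))
  / sum(rate(http_requests_total[30d]))
)

# Error budget remaining (%): 100 with no errors, 0 when the budget is spent,
# negative when it is overspent (assumes a 99.9% SLO, so the budget is 0.1%)
100 * (
  1 - (
    sum(rate(http_requests_total{status=~"5.."}[30d]))
    / sum(rate(http_requests_total[30d]))
  ) / (1 - 0.999)
)
```

Range selectors this long are expensive to evaluate on every dashboard refresh, so they are usually good candidates for precomputation.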

Change Correlation

Deployment markers and config-change annotations
Feature flag state overlays
Upstream/downstream dependency error rates
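Where no dedicated deployment event source exists, recent process starts can act as an approximate change marker, since restarts often coincide with deploys. The sketch below assumes the service exposes `process_start_time_seconds` (exported by default by many Prometheus client libraries); the `job` value `api` is hypothetical.

```promql
# Instances whose process started within the last 5 minutes
# (job="api" is a placeholder for the service under inspection)
time() - process_start_time_seconds{job="api"} < 300
```

Used as an annotation query, each returned series marks an instance that came up recently. It does not distinguish a deploy from a crash loop, so it complements explicit deployment markers rather than replacing them.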

Example PromQL Snippets

# API error rate (%)
100 * sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))

# p95 latency by route
histogram_quantile(0.95,
  sum by (le, route) (rate(http_request_duration_seconds_bucket[5m]))
)

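# Fast burn rate, 5m window (error ratio / error budget; assumes a 99.9% SLO)
(
  sum(rate(http_requests_total{status=~"5.."}[5m]))
  / sum(rate(http_requests_total[5m]))
)
/ (1 - 0.999)

# Fast burn rate, 1h window (same SLO assumption)
(
  sum(rate(http_requests_total{status=~"5.."}[1h]))
  / sum(rate(http_requests_total[1h]))
)
/ (1 - 0.999)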

Operational Guidelines

Use consistent color semantics (green=healthy, yellow=degrading, red=breach)
Label units explicitly (ms, req/s, %, cores)
Default time windows to incident-friendly ranges (15m, 1h, 6h, 24h)
Minimize panel count per dashboard to reduce cognitive load
Add runbook links directly in panel descriptions

Troubleshooting

Panel appears flat or empty

Verify that the panel's label filters (service, env, region) match label names and values the metric actually exports
Confirm scrape/ingest latency is within the expected range
Check for metric renames introduced by instrumentation updates
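These checks could be run as ad hoc queries along the following lines. The label values `checkout` and `prod` and the `job` value are hypothetical.

```promql
# Returns 1 if no series match the panel's filters
# (a typo in a label name or value is a common cause)
absent(http_requests_total{service="checkout", env="prod"})

# Lists the label combinations that actually exist for the metric
count by (service, env, region) (http_requests_total)

# Shows whether scrape targets are reachable (1 = up, 0 = down)
up{job="checkout"}
```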

High cardinality slows dashboards

Aggregate by stable dimensions (service, route_group) instead of raw IDs
Use recording rules for expensive percentile and ratio queries
Split deep-dive dashboards from NOC summary dashboards
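A sketch of the aggregation and recording-rule advice from the query side. `route_group` is the label named above. The recorded series names `job:http_errors:ratio_rate5m` and `job:http_errors:ratio_rate1h` are hypothetical and assume recording rules that precompute the 5xx ratio over each window. The 99.9% SLO and the 14.4 threshold are also assumptions (14.4 is the burn rate that spends 2% of a 30-day budget in one hour).

```promql
# Aggregate by stable dimensions rather than raw IDs
sum by (service, route_group) (rate(http_requests_total[5m]))

# Multi-window fast-burn status built on precomputed ratios:
# true only while both the long and the short window exceed the threshold
(job:http_errors:ratio_rate1h / (1 - 0.999) > 14.4)
and
(job:http_errors:ratio_rate5m / (1 - 0.999) > 14.4)
```

Requiring the short window as well lets the status clear quickly once the error rate recovers, while the long window filters out brief spikes.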

Related Skills

prometheus-grafana - Dashboard implementation and PromQL
opentelemetry - Standardized telemetry instrumentation
alerting-oncall - Reliability alert routing and escalation
agent-observability - AI workload reliability telemetry

Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.

28 stars

development

Updated May 23, 2026

npx skillsauth add bagelhole/devops-security-agent-skills sre-dashboards

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean, 95%
Semgrep: Static code analysis for vulnerabilities. Clean, 95%
mcp-scan (Snyk): Model Context Protocol security validation. Clean, 95%
Snyk (dep): Open source security scanning. Skipped, 50%
Socket.dev: Supply chain security analysis. Skipped, 50%
VirusTotal: Multi-engine malware detection. Skipped, 50%
CrowdStrike: Advanced threat intelligence. Skipped, 50%
OSV-Scanner: Open Source Vulnerability database check. Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 23, 2026, 3:02 AM; 92.0s; 1 file scanned

SKILL.md

name: sre-dashboards
description: Design and operationalize SRE dashboards that surface reliability, latency, error, saturation, and capacity signals across services. Use when building observability views for SLOs, incident response, and executive reliability reporting.
license: MIT
author: devops-skills
version: 1.0

Related Skills

bagelhole/openclaw-security-hardening

testing

Verified, Trusted, Community

Harden OpenClaw self-hosted environments with baseline host controls, auth tightening, secret handling, network segmentation, and safe update/rollback workflows. Use when deploying OpenClaw in home labs, startups, or production-like local AI infrastructure.

28 stars, SKILL.md, updated Apr 3, 2026

bagelhole/vector-database-ops

devops

Verified, Trusted, Community

Deploy, manage, and optimize vector databases for AI applications. Covers Qdrant, Weaviate, pgvector, and Pinecone: collection management, indexing strategies, backup, and performance tuning for production RAG and semantic search workloads.

28 stars, SKILL.md, updated Apr 3, 2026

bagelhole/model-serving-kubernetes

testing

Verified, Trusted, Community

Deploy ML models on Kubernetes with KServe (formerly KFServing) and NVIDIA Triton Inference Server. Includes canary deployments, autoscaling, model versioning, A/B testing, and GPU resource management for production model serving.

28 stars, SKILL.md, updated Apr 3, 2026

bagelhole/llm-cost-optimization

development

Verified, Trusted, Community

Reduce LLM API and infrastructure costs through model selection, prompt caching, batching, caching, quantization, and self-hosting strategies. Track spend by team and model, set budgets, and implement cost-aware routing.

28 stars, SKILL.md, updated Apr 3, 2026

Install manually

# Clone the repo
git clone https://github.com/bagelhole/devops-security-agent-skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r devops-security-agent-skills/devops/observability/sre-dashboards ~/.claude/skills/

Repository

bagelhole/devops-security-agent-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT