skills/holmesgpt/SKILL.md
Guide for implementing HolmesGPT - an AI agent for troubleshooting cloud-native environments. Use when investigating Kubernetes issues, analyzing alerts from Prometheus/AlertManager/PagerDuty, performing root cause analysis, configuring HolmesGPT installations (CLI/Helm/Docker), setting up AI providers (OpenAI/Anthropic/Azure), creating custom toolsets, or integrating with observability platforms (Grafana, Loki, Tempo, DataDog).
npx skillsauth add julianobarbosa/claude-code-skills holmesgpt
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
AI-powered troubleshooting for Kubernetes and cloud-native environments.
HolmesGPT is a CNCF Sandbox project that connects AI models with live observability data to investigate infrastructure problems, find root causes, and suggest remediations. It operates with read-only access and respects RBAC permissions, making it safe for production environments.
| Topic | Reference |
|-------|-----------|
| Installation | references/installation.md |
| Configuration | references/configuration.md |
| Data Sources | references/data-sources.md |
| Commands | references/commands.md |
| Troubleshooting | references/troubleshooting.md |
| HTTP API | references/http-api.md |
| Integrations | references/integrations.md |
/run, /show, /clear
# Homebrew (CLI)
brew tap robusta-dev/homebrew-holmesgpt
brew install holmesgpt
export ANTHROPIC_API_KEY="your-key" # or OPENAI_API_KEY
holmes ask "what pods are unhealthy?"
# Helm (Kubernetes)
helm repo add robusta https://robusta-charts.storage.googleapis.com
helm repo update
helm install holmesgpt robusta/holmes -f values.yaml
# Docker
docker run -it --net=host \
  -e OPENAI_API_KEY="your-key" \
  -v ~/.kube/config:/root/.kube/config \
  us-central1-docker.pkg.dev/genuine-flight-317411/devel/holmes \
  ask "what pods are crashing?"
# Basic investigation
holmes ask "what pods are unhealthy and why?"
holmes ask "why is my deployment failing?"
# Interactive mode
holmes ask "investigate issue" --interactive
# Alert investigation
holmes investigate alertmanager --alertmanager-url http://localhost:9093
holmes investigate pagerduty --pagerduty-api-key <KEY> --update
# With file context
holmes ask "summarize the key points" -f ./logs.txt
# CI/CD integration
holmes ask "why did deployment fail?" --destination slack --slack-token <TOKEN>
| Provider | Environment Variable | Models |
|----------|---------------------|--------|
| Anthropic | ANTHROPIC_API_KEY | Sonnet 4, Opus 4.5 |
| OpenAI | OPENAI_API_KEY | GPT-4.1, GPT-4o |
| Azure OpenAI | AZURE_API_KEY | GPT-4.1 |
| AWS Bedrock | AWS credentials | Claude 3.5 Sonnet |
| Google Gemini | GEMINI_API_KEY | Gemini 1.5 Pro |
| Vertex AI | VERTEXAI_PROJECT | Gemini 1.5 Pro |
| Ollama | Local install | Llama 3.1, Mistral |
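Because each provider in the table is keyed to a different environment variable, a small pre-flight check can report which credentials the current shell exposes. This is a minimal sketch: `detect_provider` is a hypothetical helper name and the precedence order is an arbitrary assumption. It only inspects the variables listed in the table and says nothing about which model HolmesGPT itself will select. AWS Bedrock and Ollama are omitted because they do not rely on a single key variable.

```shell
# Hypothetical helper: print the first provider whose variable is set.
# The order below is an assumption for illustration, not HolmesGPT behavior.
detect_provider() {
  if [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo "anthropic"
  elif [ -n "${OPENAI_API_KEY:-}" ]; then
    echo "openai"
  elif [ -n "${AZURE_API_KEY:-}" ]; then
    echo "azure"
  elif [ -n "${GEMINI_API_KEY:-}" ]; then
    echo "gemini"
  elif [ -n "${VERTEXAI_PROJECT:-}" ]; then
    echo "vertex_ai"
  else
    # Nothing from the table is set; signal failure to the caller.
    echo "none"
    return 1
  fi
}
```

Running `detect_provider || echo "export a provider key first"` before `holmes ask` gives an early, readable failure instead of an error from deeper in the stack.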
# values.yaml for Kubernetes deployment
image:
  repository: robustadev/holmes
  tag: latest
env:
  - name: ANTHROPIC_API_KEY
    valueFrom:
      secretKeyRef:
        name: holmesgpt-secrets
        key: anthropic-api-key
# Model configuration
modelList:
  sonnet:
    api_key: "{{ env.ANTHROPIC_API_KEY }}"
    model: anthropic/claude-sonnet-4-20250514
    temperature: 0
# Toolsets to enable
toolsets:
  kubernetes/core:
    enabled: true
  kubernetes/logs:
    enabled: true
  prometheus/metrics:
    enabled: true
# Resources
resources:
  requests:
    memory: "1024Mi"
    cpu: "100m"
  limits:
    memory: "1024Mi"
# RBAC (read-only by default)
createServiceAccount: true
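The values file reads the API key from a Secret named holmesgpt-secrets with the key anthropic-api-key, so that Secret has to exist before the chart is installed. One way to create it is sketched below. The function name `create_holmes_secret`, the `KUBECTL` override, and the default namespace are assumptions for illustration; the Secret name and key come from the values file above.

```shell
# Sketch: create the Secret referenced by secretKeyRef in values.yaml.
# $1 is the target namespace (defaulting to "default" is an assumption).
create_holmes_secret() {
  namespace="${1:-default}"
  # Abort with a message if the key has not been exported.
  : "${ANTHROPIC_API_KEY:?export ANTHROPIC_API_KEY first}"
  # KUBECTL may be overridden (for example KUBECTL=echo) to preview the command.
  ${KUBECTL:-kubectl} create secret generic holmesgpt-secrets \
    --namespace "$namespace" \
    --from-literal=anthropic-api-key="$ANTHROPIC_API_KEY"
}
```

The Secret must live in the same namespace as the Helm release, so pass the same namespace to both `create_holmes_secret` and `helm install`.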
| Command | Description |
|---------|-------------|
| /clear | Reset context when changing topics |
| /run | Execute custom commands and share output with AI |
| /show | Display complete tool outputs |
| /context | Review accumulated investigation information |
# custom-toolset.yaml
toolsets:
  my-custom-tool:
    description: "Custom diagnostic tool"
    tools:
      - name: check_service_health
        description: "Check health of a specific service"
        command: |
          curl -s http://{{ service_name }}.{{ namespace }}.svc.cluster.local/health
        parameters:
          - name: service_name
            description: "Name of the service"
          - name: namespace
            description: "Kubernetes namespace"
Use with: holmes ask "check health" -t custom-toolset.yaml
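Before handing a templated command to the agent, it can help to run its rendered form by hand and confirm the endpoint answers. The sketch below mirrors the URL template from the toolset with ordinary shell variables; `service_health_url` is a hypothetical helper, and the service and namespace in the usage note are placeholders.

```shell
# Hypothetical helper: build the URL that the toolset's command template
# would produce for a given service name ($1) and namespace ($2).
service_health_url() {
  service_name="$1"
  namespace="$2"
  echo "http://${service_name}.${namespace}.svc.cluster.local/health"
}
```

From a pod inside the cluster, `curl -s "$(service_health_url payment-service default)"` should return the same output the tool would pass to the model.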
# Add to Services/Deployments for HolmesGPT context
metadata:
  annotations:
    holmesgpt.dev/runbook: |
      This service handles payment processing.
      Common issues: database connectivity, API rate limits.
      Check: kubectl logs -l app=payment-service
| Variable | Description | Default |
|----------|-------------|---------|
| HOLMES_CONFIG_PATH | Config file path | ~/.holmes/config.yaml |
| HOLMES_LOG_LEVEL | Log verbosity | INFO |
| PROMETHEUS_URL | Prometheus server URL | - |
| GITHUB_TOKEN | GitHub API token | - |
| DATADOG_API_KEY | DataDog API key | - |
| CONFLUENCE_BASE_URL | Confluence URL | - |
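The defaults in the table can be mirrored in wrapper scripts with shell parameter expansion, which makes it easy to print the effective settings before a run. This is a sketch under the assumption that the table's defaults are accurate; both function names are hypothetical.

```shell
# Print the config path a wrapper script would use, falling back to the
# default listed in the table when HOLMES_CONFIG_PATH is unset or empty.
holmes_config_path() {
  echo "${HOLMES_CONFIG_PATH:-$HOME/.holmes/config.yaml}"
}

# Same idea for log verbosity, with INFO as the documented default.
holmes_log_level() {
  echo "${HOLMES_LOG_LEVEL:-INFO}"
}
```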
- Keep RBAC read-only (`get`, `list`, `watch` only).
- If an investigation needs Secrets, grant `secrets:get` on a specific namespace or accept the blind spot; don't broaden access cluster-wide.
- Set `temperature: 0` in `modelList`; otherwise answers vary between runs, and comparing them erodes trust.
- `holmes investigate alertmanager --alertmanager-url http://localhost:9093` works from a laptop but fails inside the cluster; use the in-cluster service DNS (`http://kube-prometheus-stack-alertmanager.monitoring:9093`).
- `/clear` doesn't reset toolset context, only conversation history: cached tool outputs from prior investigations persist within the session. Long interactive sessions accumulate stale Prometheus data that contaminates new questions. Restart the CLI between unrelated incidents.
- Anthropic models in `modelList` need the `anthropic/` prefix: `model: claude-sonnet-4-20250514` fails with provider-not-found; the correct form is `model: anthropic/claude-sonnet-4-20250514`. The LiteLLM error message says "model not found" without naming the missing prefix.
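Two of these pitfalls, the Alertmanager URL and the missing provider prefix, are easy to guard against in a wrapper script. The sketch below is illustrative only: both function names are hypothetical, and it treats the presence of `KUBERNETES_SERVICE_HOST` (which Kubernetes injects into pods) as the signal that the script is running in-cluster. The Alertmanager service name is the one quoted above and will differ in other installations.

```shell
# Pick an Alertmanager URL based on where the script runs.
# Kubernetes sets KUBERNETES_SERVICE_HOST inside pods, so a non-empty
# value is used here as the in-cluster signal.
alertmanager_url() {
  if [ -n "${KUBERNETES_SERVICE_HOST:-}" ]; then
    echo "http://kube-prometheus-stack-alertmanager.monitoring:9093"
  else
    echo "http://localhost:9093"
  fi
}

# Succeed only when a model name has the provider/model shape,
# for example anthropic/claude-sonnet-4-20250514.
has_provider_prefix() {
  case "$1" in
    ?*/?*) return 0 ;;
    *) return 1 ;;
  esac
}
```

A wrapper could then call `holmes investigate alertmanager --alertmanager-url "$(alertmanager_url)"`, and run `has_provider_prefix "$model" || echo "add a provider prefix to $model"` before writing a model name into `modelList`.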