skills/az-aks-agent-skill/SKILL.md
Azure AKS Agentic CLI - AI-powered troubleshooting and insights tool for Azure Kubernetes Service. Use when diagnosing AKS cluster issues, getting cluster health insights, troubleshooting networking/storage/security problems, or analyzing cluster configuration with natural language queries.
The Agentic CLI for Azure Kubernetes Service (AKS) is an AI-powered troubleshooting and insights tool (currently in preview) that brings advanced diagnostics directly to your terminal. It allows you to ask natural language questions about your cluster's health, configuration, and issues without requiring deep Kubernetes expertise or knowledge of complex command syntax.
Primary Command: az aks agent
# Prerequisites: Azure CLI version 2.76 or higher
az version
# Install the extension (takes 5-10 minutes)
az extension add --name aks-agent --debug
# Verify installation
az extension list
az aks agent --help
# Initialize LLM configuration (interactive wizard)
az aks agent-init
# Remove extension if needed
az extension remove --name aks-agent --debug
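The version prerequisite above can be checked in a script rather than by eye. The following is a minimal sketch, not part of the extension: `version_ge` and `MIN_AZ_VERSION` are hypothetical names, and the comparison assumes a `sort` that supports `-V` (GNU coreutils or a recent BSD/macOS `sort`).

```shell
# Minimum Azure CLI version stated in the prerequisites above.
MIN_AZ_VERSION="2.76"

# Hypothetical helper: succeeds when dotted version $1 >= $2.
# Assumes sort supports -V (version sort).
version_ge() {
  [ "$(printf '%s\n%s\n' "$1" "$2" | sort -V | head -n 1)" = "$2" ]
}

# Only query the CLI when it is actually installed.
if command -v az >/dev/null 2>&1; then
  installed="$(az version --query '"azure-cli"' --output tsv)"
  if version_ge "$installed" "$MIN_AZ_VERSION"; then
    echo "Azure CLI $installed meets the $MIN_AZ_VERSION minimum"
  else
    echo "Azure CLI $installed is older than $MIN_AZ_VERSION; run: az upgrade" >&2
  fi
else
  echo "Azure CLI (az) not found on PATH" >&2
fi
```

Running this before `az extension add` avoids waiting through a long install only to find that the CLI itself is too old.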
# Get cluster credentials first
az aks get-credentials --resource-group <rg-name> --name <cluster-name>
# Start interactive troubleshooting
az aks agent -g <resource-group> -n <cluster-name>
# Ask a specific question
az aks agent -g <resource-group> -n <cluster-name> --query "What's wrong with my cluster?"
# Non-interactive mode (batch processing)
az aks agent -g <resource-group> -n <cluster-name> --no-interactive --query "Check pod health"
What do you need to do?
├── Cluster Health Check?
│ └── Use: az aks agent --query "What's the health status of my cluster?"
├── Troubleshoot Pod Issues?
│ └── Use: az aks agent --query "Why are my pods failing?"
├── Networking Problems?
│ └── Use: az aks agent --query "Diagnose networking issues"
├── Storage Issues?
│ └── Use: az aks agent --query "Check storage configuration"
├── Security/RBAC Issues?
│ └── Use: az aks agent --query "Review RBAC configuration"
├── Node Pool Problems?
│ └── Use: az aks agent --query "Check node pool health"
└── Configuration Review?
    └── Use: az aks agent --query "Review cluster configuration"
| Command | Description |
|---------|-------------|
| az aks agent | Start interactive AI-powered troubleshooting |
| az aks agent-init | Initialize LLM provider configuration |
| az aks agent --help | Show help and available options |
| Parameter | Description | Default |
|-----------|-------------|---------|
| -g, --resource-group | Resource group name | Required |
| -n, --name | AKS cluster name | Required |
| --api-key | LLM API key | From env or config |
| --config-file | Config file path | ~/.azure/aksAgent.config |
| --max-steps | Max investigation steps | 10 |
| --model | LLM model specification | From config |
| --no-interactive | Run in batch mode | false |
| --show-tool-output | Display tool call outputs | false |
| --refresh-toolsets | Refresh toolsets status | false |
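The optional parameters in the table above can be combined on one command line. As an illustrative sketch, the hypothetical helper `aks_agent_cmd` below assembles such a command from a resource group, cluster name, optional model, and optional step limit. The `AKS_AGENT_DRY_RUN` variable is also an assumption of this sketch. It defaults to printing the command instead of running it, so the result can be reviewed first.

```shell
# Hypothetical helper: builds an az aks agent command from the parameters
# in the table above. The body runs in a subshell so its variables do not
# leak into the caller.
aks_agent_cmd() (
  rg=$1
  cluster=$2
  model=${3:-}          # optional, e.g. azure/gpt-4o
  max_steps=${4:-10}    # 10 is the default listed in the table

  set -- az aks agent -g "$rg" -n "$cluster" --max-steps "$max_steps"
  if [ -n "$model" ]; then
    set -- "$@" --model "$model"
  fi

  # Dry run by default: print the command rather than execute it.
  if [ "${AKS_AGENT_DRY_RUN:-1}" = "1" ]; then
    printf '%s\n' "$*"
  else
    "$@"
  fi
)

aks_agent_cmd myRG myCluster azure/gpt-4o 15
```

Setting `AKS_AGENT_DRY_RUN=0` makes the helper execute the command, which starts the interactive session.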
# Azure OpenAI
--model "azure/gpt-4o"
--model "azure/gpt-4o-mini"
# OpenAI
--model "gpt-4o"
--model "gpt-4o-mini"
# Anthropic
--model "anthropic/claude-sonnet-4"
--model "anthropic/claude-3-5-sonnet"
# Gemini
--model "gemini/gemini-pro"
# Azure OpenAI API Key
export AZURE_API_KEY="your-azure-openai-key"
# OpenAI API Key
export OPENAI_API_KEY="your-openai-key"
# Anthropic API Key
export ANTHROPIC_API_KEY="your-anthropic-key"
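Before starting a session, it can help to confirm that one of these variables is actually exported. The sketch below reports which provider key is present without printing its value. `detect_llm_provider` is a hypothetical helper, and the order of the checks is an arbitrary choice for illustration, not the agent's own resolution logic.

```shell
# Hypothetical helper: prints the provider whose API key is set in the
# environment, or fails if none of the three variables is set.
# The precedence (azure, openai, anthropic) is an assumption of this sketch.
detect_llm_provider() {
  if [ -n "${AZURE_API_KEY:-}" ]; then
    echo azure
  elif [ -n "${OPENAI_API_KEY:-}" ]; then
    echo openai
  elif [ -n "${ANTHROPIC_API_KEY:-}" ]; then
    echo anthropic
  else
    echo "no LLM API key found in the environment" >&2
    return 1
  fi
}

if provider="$(detect_llm_provider)"; then
  echo "Found an API key for provider: $provider"
fi
```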
# Azure OpenAI Configuration
llm_provider: azure
azure_api_base: https://<your-endpoint>.openai.azure.com/
azure_api_version: 2025-04-01-preview
model: gpt-4o
# OR OpenAI Configuration
llm_provider: openai
model: gpt-4o
# OR Anthropic Configuration
llm_provider: anthropic
model: claude-sonnet-4
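The `azure_api_base` value in the Azure OpenAI configuration above follows the pattern `https://<your-endpoint>.openai.azure.com/`. A loose shape check such as the sketch below can catch a mistyped or wrong-service endpoint before it reaches the config file. `is_azure_openai_base` is a hypothetical helper. It only matches the URL pattern and does not contact the endpoint.

```shell
# Hypothetical helper: succeeds when $1 looks like
# https://<name>.openai.azure.com with or without a trailing slash.
# This is a loose pattern match, not a full URL validation.
is_azure_openai_base() {
  case "$1" in
    https://?*.openai.azure.com | https://?*.openai.azure.com/) return 0 ;;
    *) return 1 ;;
  esac
}

# Example values are placeholders, not real endpoints.
for url in "https://my-resource.openai.azure.com/" "https://example.com/openai"; do
  if is_azure_openai_base "$url"; then
    echo "ok:       $url"
  else
    echo "rejected: $url"
  fi
done
```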
Note: the Azure OpenAI API base must be in the format https://{endpoint}.openai.azure.com/ (NOT an AI Foundry URI).
# General health check
az aks agent -g myRG -n myCluster --query "What's the overall health of my cluster?"
# Node status
az aks agent -g myRG -n myCluster --query "Are all nodes healthy and ready?"
# Resource utilization
az aks agent -g myRG -n myCluster --query "Show me resource utilization across nodes"
# Failed pods analysis
az aks agent -g myRG -n myCluster --query "Why are pods in CrashLoopBackOff?"
# Pending pods
az aks agent -g myRG -n myCluster --query "Why are some pods stuck in Pending state?"
# OOMKilled pods
az aks agent -g myRG -n myCluster --query "Investigate OOMKilled containers"
# Network policy review
az aks agent -g myRG -n myCluster --query "Are there network policies blocking traffic?"
# DNS troubleshooting
az aks agent -g myRG -n myCluster --query "Diagnose DNS resolution issues"
# Service connectivity
az aks agent -g myRG -n myCluster --query "Why can't pods reach external services?"
# PVC issues
az aks agent -g myRG -n myCluster --query "Why are PersistentVolumeClaims pending?"
# Storage class review
az aks agent -g myRG -n myCluster --query "Review storage class configuration"
# RBAC review
az aks agent -g myRG -n myCluster --query "Are RBAC permissions configured correctly?"
# Security best practices
az aks agent -g myRG -n myCluster --query "What security improvements do you recommend?"
# Get cluster credentials first
az aks get-credentials --resource-group $RESOURCE_GROUP --name $AKS_CLUSTER
# List all events
kubectl get events
# Filter by namespace
kubectl get events --namespace default
# Watch auto-repair events
kubectl get events --field-selector=source=aks-auto-repair --watch
# Detailed pod events
kubectl describe pod $POD_NAME
| Type | Description |
|------|-------------|
| Normal | Routine operations and expected activities |
| Warning | Potentially problematic situations requiring attention |
| Reason | Description |
|--------|-------------|
| FailedScheduling | Pod failed to be scheduled on a node |
| CrashLoopBackOff | Container is in a restart loop |
| Scheduled | Pod successfully assigned to a node |
| Pulled | Container image successfully pulled |
| Created | Container created |
| Started | Container started |
| OOMKilled | Container killed due to out of memory |
| Field | Description |
|-------|-------------|
| type | Warning or Normal |
| reason | Short reason code |
| message | Human-readable description |
| namespace | Kubernetes namespace |
| firstSeen | First observation timestamp |
| lastSeen | Most recent observation |
| object | Associated Kubernetes object |
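The `type` and `reason` fields described above are enough to get a quick picture of what is going wrong in a cluster. The sketch below tallies Warning events by reason. `count_warning_reasons` is a hypothetical helper that reads two whitespace-separated columns (type, reason) on standard input. The sample lines are made up for illustration, including the `BackOff` reason.

```shell
# Hypothetical helper: counts Warning events per reason.
# Input: one event per line, "<type> <reason>".
# Output: "<count> <reason>", most frequent first.
count_warning_reasons() {
  awk '$1 == "Warning" { n[$2]++ } END { for (r in n) print n[r], r }' |
    sort -k1,1nr -k2,2
}

# Illustrative sample input (not real cluster output).
printf '%s\n' \
  "Warning FailedScheduling" \
  "Normal Scheduled" \
  "Warning BackOff" \
  "Warning FailedScheduling" |
  count_warning_reasons

# Against a live cluster, the same two columns can be produced with:
#   kubectl get events --all-namespaces --no-headers \
#     -o custom-columns=TYPE:.type,REASON:.reason | count_warning_reasons
```

A reason with a high count is a good candidate for the first focused question to put to the agent.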
Start broad, then narrow
# Start with general health
"What's wrong with my cluster?"
# Then focus on specific issues
"Why are pods in namespace X failing?"
Provide context about symptoms
"Pods are restarting frequently in the production namespace"
"Services are experiencing intermittent timeouts"
Ask for specific recommendations
"What changes do you recommend to improve cluster performance?"
"How can I fix the networking issues you identified?"
Request historical analysis
"What patterns do you see in recent pod failures?"
"Have there been any unusual events in the last 24 hours?"
# Verify Azure CLI version
az version
# Upgrade Azure CLI if needed
az upgrade
# Force reinstall extension
az extension remove --name aks-agent
az extension add --name aks-agent --debug
# Verify Azure login
az account show
# Re-authenticate
az login
# Select the subscription that contains the cluster
az account set --subscription <subscription-id>
# Reinitialize LLM configuration
az aks agent-init
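# Check that the API key environment variable is set (without printing the key)
[ -n "$AZURE_API_KEY" ] && echo "AZURE_API_KEY is set" || echo "AZURE_API_KEY is not set"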
# Test with explicit API key
az aks agent -g myRG -n myCluster --api-key "your-key"
Run az login before using the agent.
references/cli-commands.md - Complete CLI command reference
references/troubleshooting.md - Extended troubleshooting guide
references/examples.md - Practical usage examples
references/diagnostics.md - AKS Diagnose and Solve Problems guide
references/monitoring.md - Comprehensive AKS monitoring guide
references/control-plane-metrics.md - Control plane metrics (API Server, etcd)
references/kubelet-logs.md - Kubelet logs access and analysis
references/memory-saturation.md - Memory saturation identification and resolution
references/node-auto-repair.md - Node auto-repair process and monitoring
references/api-server-etcd.md - API server and etcd troubleshooting