kubernetes-skills/claude/k8s-troubleshoot/SKILL.md
Debug Kubernetes pods, nodes, and workloads. Use when pods are failing, containers crash, nodes are unhealthy, or users mention debugging, troubleshooting, or diagnosing Kubernetes issues.
npx skillsauth add rohitg00/kubectl-mcp-server k8s-troubleshoot

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Expert debugging and diagnostics for Kubernetes clusters using kubectl-mcp-server tools.
Use this skill when pods are failing, containers are crashing, or nodes are unhealthy.
| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check pod status first | CRITICAL | get_pods, describe_pod |
| 2 | View recent events | CRITICAL | get_events |
| 3 | Inspect logs (including previous) | HIGH | get_pod_logs |
| 4 | Check resource metrics | HIGH | get_pod_metrics |
| 5 | Verify endpoints | MEDIUM | get_endpoints |
| 6 | Review network policies | MEDIUM | get_network_policies |
| 7 | Examine node status | LOW | get_nodes, describe_node |
| Symptom | First Tool | Next Steps |
|---------|------------|------------|
| Pod Pending | describe_pod | Check events, node capacity, resource requests |
| CrashLoopBackOff | get_pod_logs(previous=True) | Check exit code, resources, liveness probes |
| ImagePullBackOff | describe_pod | Verify image name, registry auth, network |
| OOMKilled | get_pod_metrics | Increase memory limits, check for memory leaks |
| ContainerCreating | describe_pod | Check PVC binding, secrets, configmaps |
| Terminating (stuck) | describe_pod | Check finalizers, PDBs, preStop hooks |
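
The symptom table can be read as a lookup keyed on fields of the Pod object. The following is a minimal sketch, assuming the pod is available as a plain dict shaped like the Kubernetes API object (for example, parsed from `kubectl get pod -o json`). The names `FIRST_TOOL` and `detect_symptom` are hypothetical and are not part of kubectl-mcp-server.

```python
from typing import Optional

# Symptom -> first tool, copied from the table above.
FIRST_TOOL = {
    "Pending": "describe_pod",
    "CrashLoopBackOff": "get_pod_logs",
    "ImagePullBackOff": "describe_pod",
    "OOMKilled": "get_pod_metrics",
    "ContainerCreating": "describe_pod",
    "Terminating": "describe_pod",
}


def detect_symptom(pod: dict) -> Optional[str]:
    """Return the table's symptom name for a Pod dict, or None if none applies."""
    # A pod being deleted carries a deletionTimestamp in its metadata.
    if pod.get("metadata", {}).get("deletionTimestamp"):
        return "Terminating"
    status = pod.get("status", {})
    for cs in status.get("containerStatuses", []):
        # An OOM kill shows up as the reason of the current or previous termination.
        terminated = (cs.get("lastState", {}).get("terminated")
                      or cs.get("state", {}).get("terminated"))
        if terminated and terminated.get("reason") == "OOMKilled":
            return "OOMKilled"
        waiting = cs.get("state", {}).get("waiting")
        if waiting and waiting.get("reason") in FIRST_TOOL:
            return waiting["reason"]
    if status.get("phase") == "Pending":
        return "Pending"
    return None
```

OOMKilled is checked before the waiting reason on purpose: a container killed for memory usually also reports CrashLoopBackOff, and the memory signal is the more specific of the two.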
1. get_pods(namespace, label_selector) - Get pod status
2. describe_pod(name, namespace) - See events and conditions
3. get_events(namespace, field_selector="involvedObject.name=<pod>") - Check events
4. get_pod_logs(name, namespace, previous=True) - For crash loops
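
The four steps above can be written down as an ordered call plan. This sketch only builds the list of tool names and arguments and does not call any tool. The function `pod_debug_plan` and its `crash_looping` flag are hypothetical, and the argument names simply mirror the signatures shown in the steps.

```python
from typing import Any, Dict, List, Optional, Tuple

Step = Tuple[str, Dict[str, Any]]


def pod_debug_plan(pod_name: str, namespace: str,
                   label_selector: Optional[str] = None,
                   crash_looping: bool = False) -> List[Step]:
    """Return the tool calls for one pod, in the order the steps list them."""
    plan: List[Step] = [
        ("get_pods", {"namespace": namespace, "label_selector": label_selector}),
        ("describe_pod", {"name": pod_name, "namespace": namespace}),
        # Narrow events to this pod with a field selector on the involved object.
        ("get_events", {"namespace": namespace,
                        "field_selector": f"involvedObject.name={pod_name}"}),
    ]
    if crash_looping:
        # previous=True asks for logs of the container instance that crashed.
        plan.append(("get_pod_logs", {"name": pod_name, "namespace": namespace,
                                      "previous": True}))
    return plan


for tool, args in pod_debug_plan("myapp-xyz", "prod", crash_looping=True):
    print(tool, args)
```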
| State | Likely Cause | Tools to Use |
|-------|-------------|--------------|
| Pending | Scheduling issues | describe_pod, get_nodes, get_events |
| ImagePullBackOff | Registry/auth | describe_pod, check image name |
| CrashLoopBackOff | App crash | get_pod_logs(previous=True) |
| OOMKilled | Memory limit | get_pod_metrics, adjust limits |
| ContainerCreating | Volume/network | describe_pod, get_pvc |
1. get_nodes() - List nodes and status
2. describe_node(name) - See conditions and capacity
3. Check: Ready, MemoryPressure, DiskPressure, PIDPressure
4. node_logs_tool(name, "kubelet") - Kubelet logs
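
For step 3, a healthy node reports `Ready` as `"True"` and each pressure condition as `"False"`. A minimal sketch of that check, assuming the node is a plain dict shaped like the Kubernetes Node object (`status.conditions` is a list of `type`/`status` pairs). The helper name `unhealthy_conditions` is hypothetical.

```python
from typing import List

# Healthy value for each condition named in step 3.
EXPECTED = {
    "Ready": "True",
    "MemoryPressure": "False",
    "DiskPressure": "False",
    "PIDPressure": "False",
}


def unhealthy_conditions(node: dict) -> List[str]:
    """List conditions whose status differs from the healthy value."""
    found = {c["type"]: c["status"]
             for c in node.get("status", {}).get("conditions", [])}
    problems = []
    for ctype, healthy in EXPECTED.items():
        # A condition that is missing entirely is reported as Unknown.
        actual = found.get(ctype, "Unknown")
        if actual != healthy:
            problems.append(f"{ctype}={actual}")
    return problems
```

An empty result means all four conditions look healthy. Anything else is a reason to continue to the kubelet logs in step 4.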
1. get_pod_logs(name, namespace, previous=True) - See why it crashed
2. describe_pod(name, namespace) - Check resource limits, probes
3. get_pod_metrics(name, namespace) - Memory/CPU at crash time
4. If OOM: compare requests/limits to actual usage
5. If app error: check logs for stack trace
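
The branch in steps 4 and 5 can be sketched from the container's last termination record and its memory limit. This example assumes a container status dict shaped like the Kubernetes API (`lastState.terminated.reason` and `exitCode`). The helpers `classify_crash`, `parse_memory`, and `memory_usage_ratio` are hypothetical, and the quantity parser handles only plain byte counts and the common `Ki`/`Mi`/`Gi`/`Ti` and `k`/`M`/`G`/`T` suffixes.

```python
_UNITS = {
    "Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40,
    "k": 10**3, "M": 10**6, "G": 10**9, "T": 10**12,
}


def parse_memory(quantity: str) -> int:
    """Convert a memory quantity such as '512Mi' to bytes (common suffixes only)."""
    for suffix, factor in _UNITS.items():
        if quantity.endswith(suffix):
            return int(float(quantity[:-len(suffix)]) * factor)
    return int(quantity)


def classify_crash(container_status: dict) -> str:
    """Classify the previous termination of a container."""
    term = container_status.get("lastState", {}).get("terminated")
    if not term:
        return "no-previous-termination"
    if term.get("reason") == "OOMKilled":
        return "oom"
    if term.get("exitCode") == 137:
        # 137 = 128 + SIGKILL; it may be an OOM kill, but it is not proof of one.
        return "sigkill"
    return "clean-exit" if term.get("exitCode") == 0 else "app-error"


def memory_usage_ratio(usage: str, limit: str) -> float:
    """Fraction of the memory limit in use; values near 1.0 suggest raising it."""
    return parse_memory(usage) / parse_memory(limit)
```

For an `oom` result, compare observed usage against the limit with `memory_usage_ratio`. For `app-error`, the previous logs from step 1 are the place to look for a stack trace.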
1. get_services(namespace) - Verify service exists
2. get_endpoints(namespace) - Check endpoint backends
3. If endpoints are empty: no Ready pods match the Service selector
4. get_network_policies(namespace) - Check traffic rules
5. For Cilium: cilium_endpoints_list_tool(), hubble_flows_query_tool()
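
When endpoints are empty (step 3), the usual check is whether any pod's labels satisfy the Service's `spec.selector`. A minimal sketch of that equality-based match, assuming Service and Pod objects as plain dicts in Kubernetes API shape. The function names are hypothetical.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True when every selector key/value pair is present in the pod labels."""
    # A Service without a selector selects no pods at all.
    return bool(selector) and all(labels.get(k) == v for k, v in selector.items())


def pods_matching_service(service: dict, pods: List[dict]) -> List[str]:
    """Names of the pods whose labels satisfy the Service selector."""
    selector = service.get("spec", {}).get("selector") or {}
    return [
        pod["metadata"]["name"]
        for pod in pods
        if selector_matches(selector, pod["metadata"].get("labels", {}))
    ]
```

If this returns pod names but the endpoints are still empty, the matching pods are probably not Ready, so their readiness probes are the next thing to check.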
1. get_pvc(namespace) - Check PVC status
2. describe_pvc(name, namespace) - See binding issues
3. get_storage_classes() - Verify provisioner exists
4. If Pending: check storage class, access modes
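
The storage class part of step 4 can be sketched as a comparison between the PVC and the list of storage classes. This assumes PVC and StorageClass objects as plain dicts in Kubernetes API shape. `pvc_pending_hints` is a hypothetical helper, and it does not validate access modes, which depend on the volume driver.

```python
from typing import List

DEFAULT_ANNOTATION = "storageclass.kubernetes.io/is-default-class"


def pvc_pending_hints(pvc: dict, storage_classes: List[dict]) -> List[str]:
    """Suggest likely reasons why a PVC is stuck in Pending."""
    if pvc.get("status", {}).get("phase") != "Pending":
        return []
    hints = []
    by_name = {sc["metadata"]["name"]: sc for sc in storage_classes}
    wanted = pvc.get("spec", {}).get("storageClassName")
    if wanted is None:
        # Without an explicit class, the cluster default class is used, if any.
        has_default = any(
            sc["metadata"].get("annotations", {}).get(DEFAULT_ANNOTATION) == "true"
            for sc in storage_classes
        )
        if not has_default:
            hints.append("no storageClassName set and no default StorageClass exists")
    elif wanted and wanted not in by_name:
        hints.append(f"StorageClass '{wanted}' not found")
    elif wanted and by_name[wanted].get("volumeBindingMode") == "WaitForFirstConsumer":
        hints.append("binding waits until a pod using this PVC is scheduled")
    return hints
```

The `WaitForFirstConsumer` case matters because a Pending PVC is expected there until a consuming pod is scheduled, so it is not necessarily a fault.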
1. kubectl_exec(pod, namespace, "nslookup kubernetes.default") - Test DNS
2. If it fails: check the CoreDNS pods in kube-system
3. get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
4. get_pod_logs(name="coredns-*", namespace="kube-system")
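
Once the CoreDNS pods are listed (step 3), the ones worth pulling logs from are those that are not Ready. A minimal sketch, assuming the pod list is available as plain dicts in Kubernetes API shape. `not_ready_pods` is a hypothetical helper and is not part of kubectl-mcp-server.

```python
from typing import List


def not_ready_pods(pods: List[dict]) -> List[str]:
    """Names of pods whose Ready condition is anything other than 'True'."""
    names = []
    for pod in pods:
        conditions = {c["type"]: c["status"]
                      for c in pod.get("status", {}).get("conditions", [])}
        # A missing Ready condition is treated as not ready.
        if conditions.get("Ready") != "True":
            names.append(pod["metadata"]["name"])
    return names
```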
All tools accept a context parameter for targeting a specific cluster:
get_pods(namespace="kube-system", context="production-cluster")
get_events(namespace="default", context="staging-cluster")
describe_pod(name="myapp-xyz", namespace="prod", context="prod-east")
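
Because the context is just another argument, the same check can be repeated across clusters in a loop. The sketch below assumes each tool is exposed as a Python callable that accepts `context` as a keyword argument. That binding, the helper `across_contexts`, and the stand-in `fake_get_pods` are all hypothetical.

```python
from typing import Any, Callable, Dict, Iterable


def across_contexts(tool: Callable[..., Any], contexts: Iterable[str],
                    **kwargs: Any) -> Dict[str, Any]:
    """Call the same tool once per cluster context and collect results by context."""
    results: Dict[str, Any] = {}
    for ctx in contexts:
        try:
            results[ctx] = tool(context=ctx, **kwargs)
        except Exception as exc:
            # One unreachable cluster should not hide results from the others.
            results[ctx] = exc
    return results


# Stand-in for a real tool binding so the sketch runs on its own.
def fake_get_pods(namespace: str, context: str) -> str:
    return f"pods in {namespace} on {context}"


results = across_contexts(fake_get_pods,
                          ["staging-cluster", "production-cluster"],
                          namespace="kube-system")
for ctx, value in results.items():
    print(ctx, "->", value)
```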
For comprehensive diagnostics, run the bundled scripts:
See references/DECISION-TREE.md for visual troubleshooting flowcharts.
See references/COMMON-ERRORS.md for error message explanations and fixes.
- get_pods, describe_pod, get_pod_logs, get_pod_metrics
- get_events, get_nodes, describe_node
- get_resource_usage, compare_namespaces
- cilium_endpoints_list_tool, hubble_flows_query_tool
- istio_proxy_status_tool, istio_analyze_tool