kubernetes-skills/claude/k8s-incident/SKILL.md
Respond to Kubernetes incidents with runbooks and diagnostics. Use for outages, pod failures, node issues, network problems, and emergency response.
Install this skill globally with one command: `npx skillsauth add rohitg00/kubectl-mcp-server k8s-incident`. Works with Claude Code, Cursor, and Windsurf.
Runbooks and diagnostic workflows for common Kubernetes incidents.
Use this skill when responding to outages, pod failures, node issues, network problems, or other emergencies.
| Priority | Rule | Impact | Tools |
|----------|------|--------|-------|
| 1 | Check control plane first | CRITICAL | get_pods(namespace="kube-system") |
| 2 | Assess node health | CRITICAL | get_nodes |
| 3 | Gather events before changes | HIGH | get_events |
| 4 | Document timeline | HIGH | Manual notes |
| 5 | Rollback if safe | MEDIUM | rollback_deployment |
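The priority order above can be sketched as a small runner that executes read-only checks before any change is made. This is a minimal sketch, not part of the skill: `run_triage` and the callables passed to it are hypothetical stand-ins for the MCP tools named in the table.

```python
from typing import Callable, Dict, List, Tuple

# (priority, rule, impact) rows copied from the table above.
TRIAGE_RULES: List[Tuple[int, str, str]] = [
    (1, "Check control plane first", "CRITICAL"),
    (2, "Assess node health", "CRITICAL"),
    (3, "Gather events before changes", "HIGH"),
    (4, "Document timeline", "HIGH"),
    (5, "Rollback if safe", "MEDIUM"),
]


def run_triage(checks: Dict[int, Callable[[], str]]) -> List[str]:
    """Run the supplied checks in priority order and return a timeline.

    `checks` maps a priority number to a zero-argument callable
    (hypothetical wrapper around a tool call). Priorities without a
    callable are recorded as manual steps.
    """
    timeline = []
    for priority, rule, impact in sorted(TRIAGE_RULES):
        check = checks.get(priority)
        result = check() if check else "manual step"
        timeline.append(f"{priority}. [{impact}] {rule}: {result}")
    return timeline
```

Passing the checks as a dictionary keeps the order fixed by the table rather than by the caller, so a control plane check always runs before a node check even if they are registered out of order.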
| Incident | First Tool | Next Steps |
|----------|------------|------------|
| Pod failure | get_pod_logs(previous=True) | describe_pod, get_events |
| Node down | describe_node | Check kubelet logs |
| Service unreachable | get_endpoints | get_network_policies |
| Control plane | get_pods(namespace="kube-system") | Check API server logs |
get_nodes()
get_pods(namespace="kube-system")
get_events(namespace)
| Indicator | Severity | Action |
|-----------|----------|--------|
| Multiple nodes NotReady | Critical | Escalate immediately |
| kube-system pods failing | Critical | Control plane issue |
| Single pod CrashLoop | Medium | Debug pod |
| High latency | Medium | Check resources |
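The severity indicators above could be encoded as a first-match classifier. The sketch below is illustrative only: `ClusterSnapshot`, `classify`, and the `Unclassified` fallback are hypothetical names, and the counts are assumed to have been derived from the `get_nodes`, `get_pods`, and `get_events` output by the caller.

```python
from dataclasses import dataclass
from typing import Tuple


@dataclass
class ClusterSnapshot:
    """Hypothetical summary of what the triage calls returned."""
    not_ready_nodes: int = 0
    failing_kube_system_pods: int = 0
    crashlooping_pods: int = 0
    high_latency: bool = False


def classify(snapshot: ClusterSnapshot) -> Tuple[str, str]:
    """Return (severity, action) for the first matching indicator."""
    if snapshot.not_ready_nodes > 1:
        return ("Critical", "Escalate immediately")
    if snapshot.failing_kube_system_pods > 0:
        return ("Critical", "Control plane issue")
    if snapshot.crashlooping_pods == 1:
        return ("Medium", "Debug pod")
    if snapshot.high_latency:
        return ("Medium", "Check resources")
    # Assumption: anything the table does not cover is left for a human.
    return ("Unclassified", "Continue triage")
```

Critical indicators are checked first so that a crash-looping pod never masks a control plane problem.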
get_pod_logs(name, namespace, previous=True)
describe_pod(name, namespace)
get_events(namespace, field_selector="involvedObject.name=<pod>")
get_pod_metrics(name, namespace)
Common Causes:
describe_pod(name, namespace)
get_secrets(namespace)
describe_pod(name, namespace)
get_nodes()
get_events(namespace)
describe_node(name)
get_events(namespace="", field_selector="involvedObject.name=<node>")
node_logs_tool(name, "kubelet")
describe_node(name)
get_pods(field_selector="spec.nodeName=<node>")
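The `<pod>` and `<node>` placeholders in the `field_selector` arguments above have to be filled in before the call is made. A minimal sketch of a helper that builds the selector string follows; `field_selector` is a hypothetical helper name, and the only Kubernetes behavior it relies on is that field selectors are comma-separated `key=value` pairs.

```python
from typing import Dict


def field_selector(fields: Dict[str, str]) -> str:
    """Join field/value pairs into a Kubernetes field selector string."""
    parts = []
    for key, value in fields.items():
        # Reject empty values and unfilled template placeholders such as "<pod>".
        if not value or "<" in value or ">" in value:
            raise ValueError(f"unfilled value for {key!r}: {value!r}")
        parts.append(f"{key}={value}")
    return ",".join(parts)
```

For example, `field_selector({"spec.nodeName": "node-a"})` yields the string to pass as `field_selector` when listing the pods scheduled on `node-a`.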
get_services(namespace)
get_endpoints(namespace)
get_pods(namespace, label_selector="<service-selector>")
get_network_policies(namespace)
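A frequent reason for an empty endpoints list is a Service selector that matches no pod labels, which is what comparing the `get_services` and `get_pods` output is meant to reveal. The sketch below shows that comparison on plain dictionaries; the function names and the pod dictionary shape are assumptions for illustration, not the tools' actual return format.

```python
from typing import Dict, List


def selector_matches(selector: Dict[str, str], labels: Dict[str, str]) -> bool:
    """True if every key/value in the Service selector appears in the pod labels."""
    if not selector:
        # A Service without a selector gets no automatically managed endpoints.
        return False
    return all(labels.get(key) == value for key, value in selector.items())


def matching_pods(selector: Dict[str, str], pods: List[dict]) -> List[str]:
    """Return the names of pods whose labels satisfy the selector."""
    return [p["name"] for p in pods if selector_matches(selector, p["labels"])]
```

If `matching_pods` returns an empty list, the selector or the pod labels are the likely culprit; if it returns pods but the endpoints are still empty, pod readiness is worth checking next.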
get_pods(namespace="kube-system", label_selector="k8s-app=kube-dns")
get_pod_logs("coredns-xxx", "kube-system")
cilium_status_tool()
cilium_endpoints_list_tool(namespace)
hubble_flows_query_tool(namespace)
istio_analyze_tool(namespace)
istio_proxy_status_tool()
describe_pvc(name, namespace)
get_storage_classes()
get_events(namespace)
describe_pod(name, namespace)
get_pvc(namespace)
get_events(namespace)
get_pods(namespace="kube-system", label_selector="component=kube-apiserver")
get_events(namespace="kube-system")
get_pods(namespace="kube-system", label_selector="component=etcd")
get_pod_logs("etcd-xxx", "kube-system")
delete_pod(name, namespace, grace_period=0, force=True)
rollback_deployment(name, namespace, revision=0)
rollback_helm_release(name, namespace, revision=1)
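Force deletion and rollbacks change cluster state, so it can help to gate them behind an explicit confirmation and record each decision in the incident timeline. The wrapper below is a hypothetical sketch of that idea: `guarded` and its parameters are not part of the skill, and `action` stands in for a call such as `delete_pod` or `rollback_deployment`.

```python
from datetime import datetime, timezone
from typing import Any, Callable, List, Optional


def guarded(
    action_name: str,
    action: Callable[[], Any],
    confirmed: bool,
    audit_log: List[str],
) -> Optional[Any]:
    """Run a destructive action only when confirmed, logging either outcome."""
    timestamp = datetime.now(timezone.utc).isoformat()
    if not confirmed:
        audit_log.append(f"{timestamp} SKIPPED {action_name}")
        return None
    result = action()
    audit_log.append(f"{timestamp} RAN {action_name}")
    return result
```

The audit log doubles as the manual timeline called for in the priority table, since every attempted change is timestamped whether or not it ran.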
For comprehensive incident diagnostics, see scripts/collect-diagnostics.py.
Check all clusters:
for context in ["prod-1", "prod-2", "staging"]:
    get_nodes(context=context)
    get_pods(namespace="kube-system", context=context)
    get_events(namespace="kube-system", context=context)
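During an incident one cluster may be unreachable, and a plain loop would stop at the first failure. The sketch below runs the same three checks per context but records errors instead of aborting. It assumes the tools are exposed as Python callables accepting the keyword arguments shown above; `check_all_clusters` itself is a hypothetical helper.

```python
from typing import Any, Callable, Dict, List


def check_all_clusters(
    contexts: List[str],
    get_nodes: Callable[..., Any],
    get_pods: Callable[..., Any],
    get_events: Callable[..., Any],
) -> Dict[str, Dict[str, Any]]:
    """Collect control plane health per context, tolerating per-cluster failures."""
    report: Dict[str, Dict[str, Any]] = {}
    for context in contexts:
        try:
            report[context] = {
                "nodes": get_nodes(context=context),
                "kube_system_pods": get_pods(namespace="kube-system", context=context),
                "kube_system_events": get_events(namespace="kube-system", context=context),
            }
        except Exception as exc:
            # Keep going so one unreachable cluster does not hide the others.
            report[context] = {"error": str(exc)}
    return report
```

A context whose entry contains only an `error` key is itself a signal worth escalating, since it usually means the API server for that cluster could not be reached.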