k8s-troubleshooter/skills/SKILL.md
Systematic Kubernetes troubleshooting and incident response. Use when diagnosing pod failures, cluster issues, performance problems, networking issues, storage failures, or responding to production incidents. Provides diagnostic workflows, automated health checks, and comprehensive remediation guidance for common Kubernetes problems.
npx skillsauth add ahmedasmar/devops-claude-skills k8s-troubleshooter
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Systematic approach to diagnosing and resolving Kubernetes issues in production environments.
Use this skill when:
Follow this systematic approach for any Kubernetes issue:
Run cluster health check:
python3 scripts/cluster_health.py
This provides an overview of:
Based on triage results, focus investigation:
For Namespace-Level Issues:
python3 scripts/check_namespace.py <namespace>
This provides comprehensive namespace health:
For Pod Issues:
python3 scripts/diagnose_pod.py <namespace> <pod-name>
This analyzes:
For specific investigations:
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
Consult references/common_issues.md for detailed information on:
Each issue includes:
Follow the remediation steps in references/common_issues.md that match the root cause you identified.
Always:
After applying fix:
For production incidents, follow structured response in references/incident_response.md:
Severity Assessment:
Incident Phases:
Common Incident Scenarios:
See references/incident_response.md for detailed playbooks.
kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl get pod <pod> -n <namespace> -o yaml
kubectl describe node <node>
kubectl top nodes
kubectl top pods --all-namespaces
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"
kubectl describe svc <service> -n <namespace>
kubectl get endpoints <service> -n <namespace>
kubectl get networkpolicies --all-namespaces
kubectl get pvc,pv --all-namespaces
kubectl describe pvc <pvc> -n <namespace>
kubectl get storageclass
kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get rolebindings,clusterrolebindings -n <namespace>
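The `grep -v Running` filter in the cluster commands above is a quick first pass, but it also lists completed pods and misses pods whose phase is Running while a container is crash-looping. The following is a minimal sketch of a stricter filter, assuming the input is the parsed output of `kubectl get pods --all-namespaces -o json`. The function name `find_unhealthy_pods` and the sample data are hypothetical and are not part of this skill's scripts.

```python
# Hypothetical helper: not part of the skill's scripts directory.
# Input is assumed to be the parsed JSON from:
#   kubectl get pods --all-namespaces -o json

HEALTHY_PHASES = {"Running", "Succeeded"}


def find_unhealthy_pods(pod_list):
    """Return (namespace, name, reason) tuples for pods that need attention."""
    unhealthy = []
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        namespace = meta.get("namespace", "")
        name = meta.get("name", "")
        phase = status.get("phase", "Unknown")

        # Pending, Failed, and Unknown pods are reported by phase.
        if phase not in HEALTHY_PHASES:
            unhealthy.append((namespace, name, phase))
            continue

        # A Running pod can still have containers that are not ready,
        # for example one sitting in CrashLoopBackOff.
        if phase == "Running":
            for cs in status.get("containerStatuses", []):
                if not cs.get("ready", False):
                    waiting = cs.get("state", {}).get("waiting", {})
                    unhealthy.append(
                        (namespace, name, waiting.get("reason", "NotReady"))
                    )
                    break
    return unhealthy


# Illustrative sample data only; not real cluster output.
SAMPLE = {
    "items": [
        {
            "metadata": {"namespace": "prod", "name": "web-0"},
            "status": {
                "phase": "Running",
                "containerStatuses": [{"ready": True, "state": {"running": {}}}],
            },
        },
        {
            "metadata": {"namespace": "prod", "name": "api-1"},
            "status": {
                "phase": "Running",
                "containerStatuses": [
                    {
                        "ready": False,
                        "state": {"waiting": {"reason": "CrashLoopBackOff"}},
                    }
                ],
            },
        },
        {
            "metadata": {"namespace": "prod", "name": "worker-2"},
            "status": {"phase": "Pending"},
        },
    ]
}

for namespace, name, reason in find_unhealthy_pods(SAMPLE):
    print(f"{namespace}/{name}: {reason}")
```

Against a live cluster, the same function could be fed `json.loads(...)` of the kubectl output. The readiness check is a design choice: it is stricter than the phase alone, so pods that are still starting up will also appear briefly.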
Comprehensive cluster health check covering:
Usage: python3 scripts/cluster_health.py
Best used as the first diagnostic step to get an overall snapshot of cluster health.
Namespace-level health check and diagnostics:
Usage:
# Human-readable output
python3 scripts/check_namespace.py <namespace>
# JSON output for automation
python3 scripts/check_namespace.py <namespace> --json
# Include more events
python3 scripts/check_namespace.py <namespace> --events 20
Best used when troubleshooting issues in a specific namespace or assessing overall namespace health.
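For automation, the `--json` mode can be driven from another Python program. The sketch below only builds the command line and parses whatever JSON the script prints. It assumes that `--json` and `--events` can be combined and that the script writes a single JSON document to stdout; neither is confirmed here, and the structure of that JSON is not documented in this section, so the result is returned unparsed beyond `json.loads`. The function names are hypothetical.

```python
import json
import subprocess
from typing import Any, List, Optional


def build_namespace_check_command(
    namespace: str, events: Optional[int] = None
) -> List[str]:
    """Build the argv for scripts/check_namespace.py in JSON mode."""
    cmd = ["python3", "scripts/check_namespace.py", namespace, "--json"]
    if events is not None:
        # Assumption: --events may be used together with --json.
        cmd += ["--events", str(events)]
    return cmd


def run_namespace_check(namespace: str, events: Optional[int] = None) -> Any:
    """Run the check and return the parsed JSON output.

    Assumption: the script prints one JSON document to stdout. The exit
    code is not checked because its meaning is not documented here.
    """
    result = subprocess.run(
        build_namespace_check_command(namespace, events),
        capture_output=True,
        text=True,
        check=False,
    )
    return json.loads(result.stdout)


print(build_namespace_check_command("<namespace>", events=20))
```

Keeping command construction separate from execution makes the argv easy to log or review before anything runs against a production cluster.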
Detailed pod-level diagnostics:
Usage: python3 scripts/diagnose_pod.py <namespace> <pod-name>
Best used when investigating specific pod failures or behavior.
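When the script is not available, the manual pod commands listed earlier (describe, logs, previous logs, namespace events) can be collected in one pass. This is a rough sketch under that assumption and is not how `diagnose_pod.py` itself works; the function names `pod_diagnostic_commands` and `collect_pod_diagnostics` are hypothetical.

```python
import shlex
import subprocess
from typing import Dict, List


def pod_diagnostic_commands(namespace: str, pod: str) -> List[List[str]]:
    """Return the read-only kubectl commands used for manual pod triage."""
    return [
        ["kubectl", "describe", "pod", pod, "-n", namespace],
        ["kubectl", "logs", pod, "-n", namespace],
        ["kubectl", "logs", pod, "-n", namespace, "--previous"],
        ["kubectl", "get", "events", "-n", namespace,
         "--sort-by=.lastTimestamp"],
    ]


def collect_pod_diagnostics(namespace: str, pod: str) -> Dict[str, str]:
    """Run each command and map its printable form to the captured output.

    Failures are kept rather than raised: `kubectl logs --previous`
    returns an error when the container has never restarted, which is
    itself useful information.
    """
    outputs = {}
    for cmd in pod_diagnostic_commands(namespace, pod):
        result = subprocess.run(cmd, capture_output=True, text=True, check=False)
        outputs[shlex.join(cmd)] = result.stdout or result.stderr
    return outputs


for cmd in pod_diagnostic_commands("<namespace>", "<pod>"):
    print(shlex.join(cmd))
```

All four commands are read-only, so the collection step is safe to run during an incident before any remediation is attempted.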
Comprehensive guide to common Kubernetes issues with:
Covers:
Read this when you identify a specific issue type but need detailed remediation steps.
Structured incident response framework including:
Read this when responding to production incidents or planning incident response procedures.
Comprehensive performance diagnosis and optimization guide covering:
Read this when:
Complete guide to Helm troubleshooting including:
Read this when:
Always:
Never:
Key Principles: