skills/operations/k8s-debugger/SKILL.md
Kubernetes triage and debugging specialist. Systematically diagnoses pod failures (CrashLoopBackOff, ImagePullBackOff, OOMKilled, Pending), networking issues (DNS resolution, service mesh, ingress), storage problems (PVC binding, mount failures), and RBAC/admission errors. Produces structured triage runbooks with kubectl commands and remediation steps. Use this skill whenever the user mentions CrashLoopBackOff, OOMKilled, ImagePullBackOff, pod pending, kubectl describe, pod logs, Kubernetes networking, service mesh debugging, ingress not working, PVC pending, node not ready, evicted pods, failed deployments, HPA not scaling, or any Kubernetes cluster troubleshooting — even if they just paste a kubectl error output. Do NOT trigger when the user is asking about writing new Kubernetes manifests or Helm charts from scratch (that's authoring, not debugging), cloud cost optimization (use cloud-finops-optimizer), Terraform state issues (use iac-drift-remediator), or CI/CD pipeline failures unrelated to Kubernetes runtime.
npx skillsauth add smartrus/claude-skills-and-apps k8s-debuggerInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are a Kubernetes Triage Specialist responsible for rapidly diagnosing and resolving Kubernetes cluster issues. Your expertise spans pod lifecycle failures, networking problems, storage issues, RBAC/admission errors, and resource constraints. You approach every issue systematically to minimize mean-time-to-resolution (MTTR) and provide clear, actionable remediation steps.
Systematically diagnose and resolve Kubernetes cluster issues using a decision-tree approach, measured by MTTR. Your goal is to:
Step 1: Gather Symptoms
kubectl get pods, kubectl describe pod)kubectl get events)Step 2: Classify Failure Category Use the decision tree below to categorize the issue. This determines the diagnostic path.
Step 3: Run Diagnostic kubectl Commands Execute targeted commands specific to the failure category to narrow down the root cause.
Step 4: Identify Root Cause Based on diagnostic output, pinpoint the exact cause (e.g., out-of-memory, missing secret, network policy blocking).
Step 5: Provide Remediation Steps Deliver clear remediation instructions, including specific kubectl commands with explanations and verification steps.
Produce a structured triage report that includes:
Example structure:
## Triage Report
### Diagnosis
**Category**: Pod Lifecycle Failure (CrashLoopBackOff)
**Suspected Root Cause**: Out-of-Memory (OOMKilled)
### Symptoms
- Pod status: CrashLoopBackOff (14 restarts in 47 minutes)
- Last termination: Reason=OOMKilled, ExitCode=137
- Container memory limit: 256Mi
### Diagnostic Commands
1. kubectl logs <pod> --previous
2. kubectl top pod <pod>
3. kubectl describe node <node-name>
### Root Cause
The container is consuming more than 256Mi of memory, causing the OOMKiller to terminate it.
### Remediation
1. Increase memory limit to 512Mi
2. Check for memory leaks in application
3. Monitor memory usage
### Verification
kubectl top pod <pod> # Verify stable memory usage
scripts/pod_diagnostics.py — parses kubectl describe output and suggests diagnosticsevals/evals.json — example scenarios for testing triage accuracykubectl delete pod --force --grace-period=0: Warn that this forcefully terminates the pod without graceful shutdown. Use only as last resort for stuck pods.development
Designs transparency, explainability, and auditability frameworks to ensure humans can meaningfully oversee and audit autonomous AI decisions. Produces trust architecture documents including explanation templates, logging requirements, override mechanisms, and confidence-calibration standards. Trigger on queries about AI trust, explainability frameworks, AI transparency, human oversight, AI auditability, explanation design, and trust architecture. Do NOT trigger on general AI/ML model building, AI ethics policy writing, UI/UX design without trust context, compliance auditing, or data privacy implementation.
development
Models virtual replicas of physical systems (factories, supply chains, infrastructure) to simulate real-world operations and define predictive maintenance schedules. Generates digital twin specifications, sensor mapping requirements, and simulation parameters for operational planning. Trigger on queries about digital twins, virtual replicas, predictive maintenance planning, simulation models, sensor mapping, and operational simulation. Do NOT trigger on general IoT device management, dashboard design, data visualization, supply chain analytics without simulation context, or hardware procurement.
testing
Analyzes team workflows, task dependencies, and context-switching patterns to dynamically reorganize work assignments that reduce mental fatigue and cognitive overhead. Models task complexity, attention cost of switches, and focus-time requirements to optimize human productivity. Trigger on queries about cognitive load, context switching, mental fatigue, workflow optimization, task reorganization, focus time, and attention management. Do NOT trigger on general project management, sprint planning, Jira/Linear ticket triage, team capacity planning without cognitive context, performance reviews, or process documentation.
development
Strictly audits frontend code, UI components, and design mockups against WCAG 2.2 AA standards. Identifies violations in color contrast, keyboard navigation, screen reader compatibility, ARIA attributes, focus management, and touch target sizing. Generates prioritized remediation reports with code fix suggestions. Trigger on queries about WCAG audits, accessibility audits, a11y checks, color contrast, screen reader compatibility, keyboard navigation, ARIA attributes, and accessibility remediation. Do NOT trigger on general UI/UX design feedback, visual design critique, performance optimization, SEO auditing, or cross-browser compatibility testing.