
ahmedasmar/k8s-troubleshooter

Name: k8s-troubleshooter
Author: ahmedasmar

k8s-troubleshooter/skills/SKILL.md

npx skillsauth add ahmedasmar/devops-claude-skills k8s-troubleshooter


Kubernetes Troubleshooter & Incident Response

Systematic approach to diagnosing and resolving Kubernetes issues in production environments.

When to Use This Skill

Use this skill when:

Investigating pod failures (CrashLoopBackOff, ImagePullBackOff, Pending, etc.)
Responding to production incidents or outages
Troubleshooting cluster health issues
Diagnosing networking or service connectivity problems
Investigating storage/volume issues
Analyzing performance degradation
Conducting post-incident analysis

Core Troubleshooting Workflow

Follow this systematic approach for any Kubernetes issue:

1. Gather Context

What is the observed symptom?
When did it start?
What changed recently (deployments, config, infrastructure)?
What is the scope (single pod, service, node, cluster)?
What is the business impact (severity level)?

2. Initial Triage

Run the cluster health check:

python3 scripts/cluster_health.py

This provides an overview of:

Node health status
System pod health
Pending pods
Failed pods
Crash loop pods
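
The pod categories in this overview can be derived from the JSON printed by kubectl get pods --all-namespaces -o json. The following is a minimal sketch, not the actual cluster_health.py; the function name classify_pods is hypothetical, and only the pending, failed, and crash-loop categories are covered.

```python
from collections import defaultdict


def classify_pods(pod_list):
    """Group pods from a parsed `kubectl get pods -o json` result by problem type.

    Returns a dict with the keys "pending", "failed" and "crash_loop"
    (only keys that have at least one pod are present).
    """
    problems = defaultdict(list)
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        status = pod.get("status", {})
        name = f'{meta.get("namespace", "default")}/{meta.get("name", "?")}'

        phase = status.get("phase")
        if phase == "Pending":
            problems["pending"].append(name)
        elif phase == "Failed":
            problems["failed"].append(name)

        # A crash-looping pod usually stays in phase Running; the signal is
        # a container whose waiting reason is CrashLoopBackOff.
        for cs in status.get("containerStatuses", []):
            waiting = cs.get("state", {}).get("waiting") or {}
            if waiting.get("reason") == "CrashLoopBackOff":
                problems["crash_loop"].append(name)
                break
    return dict(problems)
```

To try it against a live cluster, parse the kubectl output with json.loads and pass the resulting dict to classify_pods.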

3. Deep Dive Investigation

Based on the triage results, focus the investigation:

For Namespace-Level Issues:

python3 scripts/check_namespace.py <namespace>

This provides comprehensive namespace health:

Pod status (running, pending, failed, crashlooping)
Service health and endpoints
Deployment availability
PVC status
Resource quota usage
Recent events
Actionable recommendations

For Pod Issues:

python3 scripts/diagnose_pod.py <namespace> <pod-name>

This analyzes:

Pod phase and readiness
Container statuses and states
Restart counts
Recent events
Resource usage
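
As an illustration of the container-level checks listed above, the sketch below summarizes state, restart count, and last termination reason from a pod object as returned by kubectl get pod <pod> -n <namespace> -o json. It is an assumption about how such a check could look, not the code of diagnose_pod.py; summarize_containers is a hypothetical name.

```python
def summarize_containers(pod):
    """Return one summary dict per container in a parsed pod object."""
    summaries = []
    for cs in pod.get("status", {}).get("containerStatuses", []):
        state = cs.get("state", {})
        # Normally exactly one of waiting / running / terminated is set.
        state_name = next(iter(state), "unknown")
        detail = state.get(state_name) or {}
        # lastState.terminated describes the previous exit, e.g. OOMKilled.
        last_term = cs.get("lastState", {}).get("terminated") or {}
        summaries.append({
            "container": cs.get("name"),
            "ready": cs.get("ready", False),
            "state": state_name,
            "reason": detail.get("reason"),
            "restarts": cs.get("restartCount", 0),
            "last_exit_reason": last_term.get("reason"),
        })
    return summaries
```

A container that is waiting with reason CrashLoopBackOff and has a last exit reason of OOMKilled points at memory limits rather than at an application bug.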

For specific investigations:

Review pod details: kubectl describe pod <pod> -n <namespace>
Check logs: kubectl logs <pod> -n <namespace>
Check previous logs if restarting: kubectl logs <pod> -n <namespace> --previous
Check events: kubectl get events -n <namespace> --sort-by='.lastTimestamp'

4. Identify Root Cause

Consult references/common_issues.md for detailed information on:

ImagePullBackOff / ErrImagePull
CrashLoopBackOff
Pending Pods
OOMKilled
Node issues (NotReady, DiskPressure)
Networking failures
Storage/PVC issues
Resource quotas and throttling
RBAC permission errors

Each issue includes:

Symptoms
Common causes
Diagnostic commands
Remediation steps
Prevention strategies

5. Apply Remediation

Follow the remediation steps in references/common_issues.md that match the identified root cause.

Always:

Test fixes in non-production first if possible
Document actions taken
Monitor for effectiveness
Have a rollback plan ready

6. Verify & Monitor

After applying the fix:

Verify the issue is resolved
Monitor for recurrence (at least 15-30 minutes)
Check related systems
Update documentation
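
One way to make the recurrence check concrete is to compare container restart counts from two pod snapshots, taken at the start and at the end of the monitoring window. This is a sketch under that assumption; restart_counts and new_restarts are hypothetical helper names, and the input is the parsed output of kubectl get pods -o json.

```python
def restart_counts(pod_list):
    """Map namespace/name to the total restart count of all its containers."""
    counts = {}
    for pod in pod_list.get("items", []):
        meta = pod.get("metadata", {})
        key = f'{meta.get("namespace")}/{meta.get("name")}'
        counts[key] = sum(
            cs.get("restartCount", 0)
            for cs in pod.get("status", {}).get("containerStatuses", [])
        )
    return counts


def new_restarts(before, after):
    """Return pods whose restart count grew between two snapshots.

    Pods that only exist in the later snapshot are compared against zero.
    """
    return {
        pod: count - before.get(pod, 0)
        for pod, count in after.items()
        if count > before.get(pod, 0)
    }
```

An empty result from new_restarts at the end of the window suggests the fix is holding; any entry is a reason to reopen the investigation.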

Incident Response

For production incidents, follow the structured response in references/incident_response.md:

Severity Assessment:

SEV-1 (Critical): Complete outage, data loss, security breach
SEV-2 (High): Major degradation, significant user impact
SEV-3 (Medium): Minor impairment, workaround available
SEV-4 (Low): Cosmetic, minimal impact
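
The severity levels above can be encoded so that tooling and humans classify incidents the same way. The sketch below is only an illustration of that mapping; the Severity enum, the assess_severity function, and its flag names are hypothetical and are not taken from references/incident_response.md.

```python
from enum import IntEnum


class Severity(IntEnum):
    SEV1 = 1  # Critical: complete outage, data loss, security breach
    SEV2 = 2  # High: major degradation, significant user impact
    SEV3 = 3  # Medium: minor impairment, workaround available
    SEV4 = 4  # Low: cosmetic, minimal impact


def assess_severity(complete_outage=False, data_loss=False,
                    security_breach=False, major_degradation=False,
                    minor_impairment=False):
    """Pick the most severe level whose condition is met."""
    if complete_outage or data_loss or security_breach:
        return Severity.SEV1
    if major_degradation:
        return Severity.SEV2
    if minor_impairment:
        return Severity.SEV3
    return Severity.SEV4
```

Checking the most severe conditions first means an incident that matches several levels is always reported at the highest one.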

Incident Phases:

Detection - Identify and assess
Triage - Determine scope and impact
Investigation - Find root cause
Resolution - Apply fix
Post-Incident - Document and improve

Common Incident Scenarios:

Complete cluster outage
Service degradation
Node failure
Storage issues
Security incidents

See references/incident_response.md for detailed playbooks.

Quick Reference Commands

Cluster Overview

kubectl cluster-info
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20

Pod Diagnostics

kubectl describe pod <pod> -n <namespace>
kubectl logs <pod> -n <namespace>
kubectl logs <pod> -n <namespace> --previous
kubectl exec -it <pod> -n <namespace> -- /bin/sh
kubectl get pod <pod> -n <namespace> -o yaml

Node Diagnostics

kubectl describe node <node>
kubectl top nodes
kubectl top pods --all-namespaces
ssh <node> "systemctl status kubelet"
ssh <node> "journalctl -u kubelet -n 100"

Service & Network

kubectl describe svc <service> -n <namespace>
kubectl get endpoints <service> -n <namespace>
kubectl get networkpolicies --all-namespaces
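
A Service whose Endpoints object has no ready addresses cannot route traffic, which usually means a selector mismatch or failing readiness probes. The sketch below finds such services in the parsed output of kubectl get endpoints --all-namespaces -o json; the function name is hypothetical.

```python
def services_without_ready_endpoints(endpoints_list):
    """Return namespace/name for Endpoints objects with no ready addresses."""
    empty = []
    for ep in endpoints_list.get("items", []):
        # Each subset lists ready pods under "addresses"; pods that fail
        # readiness appear under "notReadyAddresses" and are not counted.
        ready = sum(
            len(subset.get("addresses") or [])
            for subset in ep.get("subsets") or []
        )
        if ready == 0:
            meta = ep.get("metadata", {})
            empty.append(f'{meta.get("namespace")}/{meta.get("name")}')
    return empty
```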

Storage

kubectl get pvc,pv --all-namespaces
kubectl describe pvc <pvc> -n <namespace>
kubectl get storageclass
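
When a cluster has many claims, it can help to filter for those that are not bound. The following sketch works on the parsed output of kubectl get pvc --all-namespaces -o json and relies on the PVC status.phase field (Pending, Bound, or Lost); unbound_pvcs is a hypothetical name.

```python
def unbound_pvcs(pvc_list):
    """Return (namespace/name, phase) for every PVC that is not Bound."""
    result = []
    for pvc in pvc_list.get("items", []):
        phase = pvc.get("status", {}).get("phase", "Unknown")
        if phase != "Bound":
            meta = pvc.get("metadata", {})
            result.append((f'{meta.get("namespace")}/{meta.get("name")}', phase))
    return result
```

Each hit is a candidate for kubectl describe pvc, whose events typically explain why binding failed.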

Resource & Configuration

kubectl describe resourcequota -n <namespace>
kubectl describe limitrange -n <namespace>
kubectl get rolebindings -n <namespace>
kubectl get clusterrolebindings

Diagnostic Scripts

cluster_health.py

Comprehensive cluster health check covering:

Node status and health
System pod status (kube-system, etc.)
Pending pods across all namespaces
Failed pods
Pods in crash loops

Usage: python3 scripts/cluster_health.py

Best used as the first diagnostic step to get an overall snapshot of cluster health.

check_namespace.py

Namespace-level health check and diagnostics:

Pod health (running, pending, failed, crashlooping, image pull errors)
Service health and endpoints
Deployment availability status
PersistentVolumeClaim status
Resource quota usage and limits
Recent namespace events
Health status assessment
Actionable recommendations

Usage:

# Human-readable output
python3 scripts/check_namespace.py <namespace>

# JSON output for automation
python3 scripts/check_namespace.py <namespace> --json

# Include more events
python3 scripts/check_namespace.py <namespace> --events 20

Best used when troubleshooting issues in a specific namespace or assessing overall namespace health.
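
For automation, the --json output can be consumed from another Python program. The sketch below only runs a command and parses its stdout as JSON; it makes no assumption about the keys the script emits, since the output schema is not documented here. The helper name run_json_command is hypothetical, and treating a non-zero exit code as an error is an assumption about the script's behavior.

```python
import json
import subprocess


def run_json_command(cmd):
    """Run a command that prints JSON to stdout and return the parsed value."""
    result = subprocess.run(cmd, capture_output=True, text=True)
    if result.returncode != 0:
        raise RuntimeError(
            f"{cmd[0]} exited with {result.returncode}: {result.stderr.strip()}"
        )
    return json.loads(result.stdout)
```

A call could look like run_json_command(["python3", "scripts/check_namespace.py", "my-namespace", "--json"]), where my-namespace is a placeholder.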

diagnose_pod.py

Detailed pod-level diagnostics:

Pod phase and status
Container states (waiting, running, terminated)
Restart counts and patterns
Resource configuration issues
Recent events
Actionable recommendations

Usage: python3 scripts/diagnose_pod.py <namespace> <pod-name>

Best used when investigating specific pod failures or behavior.

Reference Documentation

references/common_issues.md

Comprehensive guide to common Kubernetes issues with:

Detailed symptom descriptions
Root cause analysis
Step-by-step diagnostic procedures
Remediation instructions
Prevention strategies

Covers:

Pod issues (ImagePullBackOff, CrashLoopBackOff, Pending, OOMKilled)
Node issues (NotReady, DiskPressure)
Networking issues (pod-to-pod communication, service access)
Storage issues (PVC pending, volume mount failures)
Resource issues (quota exceeded, CPU throttling)
Security issues (vulnerabilities, RBAC)

Read this when you identify a specific issue type but need detailed remediation steps.

references/incident_response.md

Structured incident response framework including:

Incident response phases (Detection → Triage → Investigation → Resolution → Post-Incident)
Severity level definitions
Detailed playbooks for common incident scenarios
Communication guidelines
Post-incident review template
Best practices for prevention, preparedness, response, and recovery

Read this when responding to production incidents or planning incident response procedures.

references/performance_troubleshooting.md

Comprehensive performance diagnosis and optimization guide covering:

High Latency Issues - API response time, request latency troubleshooting
CPU Performance - Throttling detection, profiling, optimization
Memory Performance - OOM issues, leak detection, heap profiling
Network Performance - Latency, packet loss, DNS resolution
Storage I/O Performance - Disk performance testing, optimization
Application-Level Metrics - Prometheus integration, distributed tracing
Cluster-Wide Performance - Control plane, scheduler, resource utilization

Read this when:

Investigating slow application response times
Diagnosing CPU or memory performance issues
Troubleshooting network latency or connectivity
Optimizing storage I/O performance
Setting up performance monitoring
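
As a small example of throttling detection, the Linux cgroup file cpu.stat reports nr_periods and nr_throttled, and their ratio shows how often a container hit its CPU limit. The sketch below parses that file's text; it is an illustration and not taken from references/performance_troubleshooting.md. The name throttle_ratio is hypothetical, and the file location differs between cgroup v1 and v2, so how the text is obtained is left to the caller.

```python
def throttle_ratio(cpu_stat_text):
    """Return the fraction of CFS periods in which the cgroup was throttled."""
    stats = {}
    for line in cpu_stat_text.splitlines():
        parts = line.split()
        # Keep only "name value" lines with an integer value.
        if len(parts) == 2 and parts[1].isdigit():
            stats[parts[0]] = int(parts[1])
    periods = stats.get("nr_periods", 0)
    if periods == 0:
        return 0.0
    return stats.get("nr_throttled", 0) / periods
```

A ratio that stays well above zero under normal load suggests the CPU limit is too low for the workload.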

references/helm_troubleshooting.md

Complete guide to Helm troubleshooting including:

Release Issues - Stuck releases, missing resources, state problems
Installation Failures - Chart conflicts, validation errors, template rendering
Upgrade and Rollback - Failed upgrades, immutable field errors, rollback procedures
Values and Configuration - Values not applied, parsing errors, secret handling
Chart Dependencies - Dependency updates, version conflicts, subchart values
Hooks and Lifecycle - Hook failures, cleanup issues
Repository Issues - Chart access problems, version mismatches

Read this when:

Working with Helm-deployed applications
Troubleshooting chart installations or upgrades
Debugging Helm release states
Managing chart dependencies

Best Practices

Always:

Start with a high-level health check before deep diving
Document symptoms and findings as you investigate
Check recent changes (deployments, config, infrastructure)
Preserve logs and state before making destructive changes
Test fixes in non-production when possible
Monitor after applying fixes to verify resolution

Never:

Make production changes without understanding their impact
Delete resources without confirming they're safe to remove
Restart pods repeatedly without investigating the root cause
Apply fixes without documentation
Skip post-incident review

Key Principles:

Systematic over random troubleshooting
Evidence-based diagnosis
Fix root cause, not symptoms
Learn and improve from each incident
Prevention is better than reaction


Systematic Kubernetes troubleshooting and incident response. Use when diagnosing pod failures, cluster issues, performance problems, networking issues, storage failures, or responding to production incidents. Provides diagnostic workflows, automated health checks, and comprehensive remediation guidance for common Kubernetes problems.

108 stars

testing

Updated Mar 27, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add ahmedasmar/devops-claude-skills k8s-troubleshooter

Security Scan Results

4 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Status   Scanner          Description                                     Reported %
Clean    Trivy            Container and dependency vulnerability scanner  95%
Clean    Semgrep          Static code analysis for vulnerabilities        95%
Clean    mcp-scan (Snyk)  Model Context Protocol security validation      95%
Clean    VirusTotal       Multi-engine malware detection                  95%
Skipped  Snyk (dep)       Open source security scanning                   50%
Skipped  Socket.dev       Supply chain security analysis                  50%
Skipped  CrowdStrike      Advanced threat intelligence                    50%
Skipped  OSV-Scanner      Open Source Vulnerability database check        50%
Skipped  OWASP Dep-Check                                                  50%

Last scanned: Apr 1, 2026, 8:21 PM (51.5s, 1 file scanned)


Related Skills

ahmedasmar/monitoring-observability

tools

Verified, Trusted, Community

Monitoring and observability strategy, implementation, and troubleshooting. Use for designing metrics/logs/traces systems, setting up Prometheus/Grafana/Loki, creating alerts and dashboards, calculating SLOs and error budgets, analyzing performance issues, and comparing monitoring tools (Datadog, ELK, CloudWatch). Covers the Four Golden Signals, RED/USE methods, OpenTelemetry instrumentation, log aggregation patterns, and distributed tracing.

108 stars, SKILL.md, updated Mar 27, 2026

ahmedasmar/iac-terraform

development

Verified, Trusted, Community

Infrastructure as Code with Terraform and Terragrunt. Use for creating, validating, troubleshooting, and managing Terraform configurations, modules, and state. Covers Terraform workflows, best practices, module development, state management, Terragrunt patterns, and common issue resolution.

108 stars, SKILL.md, updated Mar 27, 2026

ahmedasmar/gitops-workflows

development

Verified, Trusted, Community

GitOps deployment workflows with ArgoCD and Flux. Use for setting up GitOps (ArgoCD 3.x, Flux 2.7), designing repository structures (monorepo/polyrepo, app-of-apps), multi-cluster deployments (ApplicationSets, hub-spoke), secrets management (SOPS+age, Sealed Secrets, External Secrets Operator), progressive delivery (Argo Rollouts, Flagger), troubleshooting sync issues, and OCI artifact management. Covers latest 2024-2025 features: ArgoCD annotation-based t

108 stars, SKILL.md, updated Mar 27, 2026

ahmedasmar/ci-cd

development

Verified, Trusted, Community

CI/CD pipeline design, optimization, DevSecOps security scanning, and troubleshooting. Use for creating workflows, debugging pipeline failures, implementing SAST/DAST/SCA, optimizing build performance, implementing caching strategies, setting up deployments, securing pipelines with OIDC/secrets management, and troubleshooting common issues across GitHub Actions, GitLab CI, and other platforms.

108 stars, SKILL.md, updated Mar 27, 2026


Install manually


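# Clone the repo
git clone https://github.com/ahmedasmar/devops-claude-skills.git

# Copy into the Claude Code skills folder (global).
# The target directory is named after the skill so that SKILL.md ends up at
# ~/.claude/skills/k8s-troubleshooter/SKILL.md (assumes that directory does not exist yet).
mkdir -p ~/.claude/skills
cp -r devops-claude-skills/k8s-troubleshooter/skills ~/.claude/skills/k8s-troubleshooter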


Repository

ahmedasmar/devops-claude-skills

108 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT