latestaiagents/kubernetes-troubleshooting

Name: kubernetes-troubleshooting
Author: latestaiagents

plugins/devops-sre/skills/automation/kubernetes-troubleshooting/SKILL.md

npx skillsauth add latestaiagents/agent-skills kubernetes-troubleshooting

Kubernetes Troubleshooting

Systematic approaches to diagnose and fix common Kubernetes issues.

Troubleshooting Framework

1. What's the symptom? (pod not starting, service unreachable, etc.)
2. Where's the problem? (pod, service, ingress, node, cluster)
3. What do the events say?
4. What do the logs say?
5. What changed recently?
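
The five questions can be asked in order with a small wrapper function. This is a minimal sketch rather than part of the skill: the name `k8s_triage` is hypothetical, it assumes the pod lives in the current kubectl context and namespace, and it treats recently created ReplicaSets as a rough proxy for "what changed".

```shell
# Hypothetical helper (not part of kubectl): walk the framework for one pod.
# Assumes the pod is in the current kubectl context and namespace.
# Usage: k8s_triage <pod-name>
k8s_triage() {
  echo "== 1-2. Symptom and location =="
  kubectl get pod "$1" -o wide
  echo "== 3. Events for this pod =="
  kubectl get events --field-selector "involvedObject.name=$1" --sort-by='.lastTimestamp'
  echo "== 4. Logs (last 50 lines) =="
  kubectl logs "$1" --tail=50
  echo "== 5. Recent changes (newest ReplicaSets last) =="
  kubectl get rs --sort-by='.metadata.creationTimestamp'
}
```

Each step keeps running even if an earlier one fails, so a pod whose container never started still gets its events printed.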

Pod Issues

Pod Status Quick Reference

| Status | Meaning | First Check |
|--------|---------|-------------|
| Pending | Can't be scheduled | kubectl describe pod |
| ContainerCreating | Image pulling or volume mounting | Events, kubectl get events |
| CrashLoopBackOff | Container crashes repeatedly | kubectl logs --previous |
| ImagePullBackOff | Can't pull container image | Image name, credentials |
| Error | Container exited with error | kubectl logs |
| OOMKilled | Out of memory | Increase memory limits |
| Evicted | Node under pressure | Node resources, pod priority |

Debugging Commands

# Get pod status
kubectl get pod <pod-name> -o wide

# Describe pod (events, conditions)
kubectl describe pod <pod-name>

# Get logs
kubectl logs <pod-name>
kubectl logs <pod-name> -c <container>  # specific container
kubectl logs <pod-name> --previous       # previous crash

# Execute into pod
kubectl exec -it <pod-name> -- /bin/sh

# Get all events sorted by time
kubectl get events --sort-by='.lastTimestamp'

CrashLoopBackOff

Symptoms: Pod restarts repeatedly
Common Causes:
├─ Application error on startup
├─ Missing config/secrets
├─ Liveness probe failing too soon
├─ Resource limits too low
└─ Dependency not ready

Debug Steps:
1. kubectl logs <pod> --previous
2. kubectl describe pod <pod>  # check events
3. Check liveness probe configuration
4. Check resource limits
5. Verify ConfigMaps/Secrets exist
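
When the previous logs are empty, the last termination state often says more than the events do. The sketch below reads it with kubectl's JSONPath output; the function names `last_exit` and `explain_exit_code` are hypothetical, and the exit-code hints are general conventions (128 plus the signal number), not a diagnosis.

```shell
# Hypothetical helper: name, restart count, last termination reason and
# exit code for every container in a pod (current namespace assumed).
# Usage: last_exit <pod-name>
last_exit() {
  kubectl get pod "$1" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Hypothetical helper: map a few common exit codes to a likely direction.
# Usage: explain_exit_code <code>
explain_exit_code() {
  case "$1" in
    0)   echo "exited cleanly: check why the main process ends (command/args)" ;;
    1)   echo "application error: read kubectl logs --previous" ;;
    137) echo "SIGKILL (128+9): commonly OOMKilled, check memory limits" ;;
    139) echo "SIGSEGV (128+11): the process crashed with a segmentation fault" ;;
    143) echo "SIGTERM (128+15): the container was asked to stop, check probes" ;;
    *)   echo "unrecognized: check the application's own exit codes" ;;
  esac
}
```

For example, `last_exit <pod>` followed by `explain_exit_code 137` points toward the "Resource limits too low" cause in the list above.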

ImagePullBackOff

Symptoms: Container image can't be pulled
Common Causes:
├─ Image doesn't exist
├─ Wrong image name/tag
├─ Private registry, missing credentials
├─ Registry rate limiting
└─ Network issues

Debug Steps:
1. Verify image name: kubectl describe pod <pod>
2. Try pulling manually: docker pull <image>
3. Check imagePullSecrets in pod spec
4. Verify secret exists: kubectl get secret <secret-name>
5. Check registry status
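
Steps 3 and 4 can be combined: list the pull secrets the pod references and confirm each one exists. A minimal sketch, assuming the pod and its secrets are in the current namespace; `check_pull_secrets` is a hypothetical name.

```shell
# Hypothetical helper: report whether each imagePullSecret referenced by a
# pod exists in the current namespace.
# Usage: check_pull_secrets <pod-name>
check_pull_secrets() {
  for secret in $(kubectl get pod "$1" -o jsonpath='{.spec.imagePullSecrets[*].name}'); do
    if kubectl get secret "$secret" >/dev/null 2>&1; then
      echo "ok: $secret"
    else
      echo "missing: $secret"
    fi
  done
}
```

No output means the pod spec lists no pull secrets at all, which for a private registry is itself a likely cause unless the service account supplies them.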

Pending Pods

Symptoms: Pod stuck in Pending state
Common Causes:
├─ Insufficient resources (CPU/memory)
├─ No nodes match nodeSelector/affinity
├─ PVC can't be bound
├─ Taint with no toleration
└─ ResourceQuota exceeded

Debug Steps:
1. kubectl describe pod <pod>  # check Events
2. kubectl get nodes -o wide   # check node capacity
3. kubectl describe node <node> # check allocatable
4. kubectl get pvc              # check volume claims
5. kubectl get resourcequota    # check quotas
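
The scheduler records why it could not place a pod in `FailedScheduling` events. The sketch below filters for those events and maps the message to one of the causes listed above. Both function names are hypothetical, and the substring patterns are assumptions about typical scheduler wording, which varies between Kubernetes versions.

```shell
# Hypothetical helper: show only scheduler failures for one pod.
# Usage: pending_events <pod-name>
pending_events() {
  kubectl get events \
    --field-selector "involvedObject.kind=Pod,involvedObject.name=$1,reason=FailedScheduling" \
    --sort-by='.lastTimestamp'
}

# Hypothetical helper: classify a FailedScheduling message by substring.
# Only the first matching pattern is reported.
# Usage: classify_scheduling_message "<message text>"
classify_scheduling_message() {
  case "$1" in
    *Insufficient*)          echo "resources: compare requests with node allocatable" ;;
    *taint*)                 echo "taints: add a toleration or remove the taint" ;;
    *affinity*|*selector*)   echo "placement: compare nodeSelector/affinity with node labels" ;;
    *PersistentVolumeClaim*) echo "storage: check kubectl get pvc" ;;
    *)                       echo "unclassified: read the full event message" ;;
  esac
}
```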

Service & Networking Issues

Service Not Working

# Check service exists and has endpoints
kubectl get svc <service>
kubectl get endpoints <service>

# If no endpoints, check selector matches pods
kubectl get pods -l <selector-from-service>

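# Test from inside cluster (--restart=Never creates a one-off pod)
kubectl run test --rm -it --restart=Never --image=busybox:1.28 -- wget -qO- <service>:<port>

# Check DNS resolution (busybox is pinned to 1.28 because nslookup in some later busybox images is unreliable)
kubectl run test --rm -it --restart=Never --image=busybox:1.28 -- nslookup <service>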

Debugging Checklist

□ Service exists and has correct port
□ Endpoints exist (pods are selected)
□ Pod selector labels match
□ Pods are Running and Ready
□ Container is listening on correct port
□ NetworkPolicy isn't blocking traffic
□ DNS resolves correctly
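
The "Pod selector labels match" item can be checked mechanically by turning the Service's selector into a label query. A minimal sketch, assuming the Service is in the current namespace; `svc_selector` and `svc_pods` are hypothetical names.

```shell
# Hypothetical helper: print a Service's selector as key=value,key=value.
# Usage: svc_selector <service>
svc_selector() {
  sel="$(kubectl get svc "$1" -o go-template='{{range $k, $v := .spec.selector}}{{$k}}={{$v}},{{end}}')"
  if [ -z "$sel" ]; then
    echo "service $1 has no selector" >&2
    return 1
  fi
  echo "${sel%,}"   # drop the trailing comma left by the template
}

# Hypothetical helper: list the pods that the Service's selector matches.
# Usage: svc_pods <service>
svc_pods() {
  sel="$(svc_selector "$1")" || return 1
  kubectl get pods -l "$sel" -o wide
}
```

If `svc_pods` returns no pods while the workload is running, the labels on the pods and the selector on the Service disagree, which also explains an empty endpoints list.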

Ingress Issues

# Check ingress configuration
kubectl describe ingress <ingress-name>

# Check ingress controller logs
kubectl logs -n ingress-nginx -l app.kubernetes.io/name=ingress-nginx

# Verify backend service
kubectl get svc <backend-service>

# Check TLS secret
kubectl get secret <tls-secret>

Deployment Issues

Deployment Not Rolling Out

# Check rollout status
kubectl rollout status deployment/<name>

# Check deployment events
kubectl describe deployment <name>

# Check replicaset
kubectl get rs -l app=<name>
kubectl describe rs <replicaset-name>

# Rollback if needed
kubectl rollout undo deployment/<name>
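
The status check and the rollback can be chained so a rollout that does not finish within a deadline is reverted. This is a sketch of one possible policy, not a recommendation for every cluster: `rollout_or_undo` is a hypothetical name and the 120s default is an arbitrary assumption.

```shell
# Hypothetical helper: wait for a rollout, roll back if it does not finish.
# Usage: rollout_or_undo <deployment> [timeout, default 120s]
rollout_or_undo() {
  if kubectl rollout status "deployment/$1" --timeout="${2:-120s}"; then
    echo "rollout complete"
  else
    echo "rollout did not finish; rolling back"
    kubectl rollout undo "deployment/$1"
  fi
}
```

Automatic rollback trades investigation time for availability; when the failing pods need to be inspected first, run the two commands by hand instead.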

Common Deployment Problems

Symptom: New pods not being created
Check:
├─ ResourceQuota limits
├─ PodDisruptionBudget blocking
└─ Node capacity

Symptom: Old pods not terminating
Check:
├─ terminationGracePeriodSeconds
├─ PreStop hooks stuck
└─ Finalizers blocking deletion

Symptom: Rollout stuck
Check:
├─ maxUnavailable settings
├─ Readiness probe never passes
└─ PVC can't be detached

Node Issues

Node Not Ready

# Check node status
kubectl get nodes
kubectl describe node <node>

# Check node conditions (type and status, one per line)
kubectl get node <node> -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}'

# Check kubelet logs (on node)
journalctl -u kubelet -f

# Drain node if needed
kubectl drain <node> --ignore-daemonsets --delete-emptydir-data
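
Node conditions are easier to scan when only the unhealthy ones are shown. The sketch below prints conditions whose status differs from the healthy value: `Ready` is expected to be `True`, and it assumes every other condition (the pressure conditions and `NetworkUnavailable`) is healthy at `False`. `node_problems` is a hypothetical name.

```shell
# Hypothetical helper: print only node conditions that are not healthy.
# Ready should be True; all other conditions are assumed healthy at False.
# Usage: node_problems <node>
node_problems() {
  kubectl get node "$1" -o jsonpath='{range .status.conditions[*]}{.type}={.status}{"\n"}{end}' |
    awk -F= 'NF == 2 && (($1 == "Ready" && $2 != "True") || ($1 != "Ready" && $2 != "False")) { print }'
}
```

Empty output means every condition is at its healthy value, so the problem is more likely in the workloads than in the node.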

Resource Pressure

# Check node resources
kubectl top nodes

# Check which pods are using resources
kubectl top pods --all-namespaces

# Find pods on specific node
kubectl get pods --all-namespaces -o wide --field-selector spec.nodeName=<node>

Quick Diagnostic Commands

# Overall cluster health
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
kubectl get events --sort-by='.lastTimestamp' | tail -20

# Specific namespace health
kubectl get all -n <namespace>
kubectl get events -n <namespace> --sort-by='.lastTimestamp'

# Resource usage
kubectl top nodes
kubectl top pods -n <namespace>

# Network debugging pod
kubectl run netshoot --rm -it --image=nicolaka/netshoot -- /bin/bash
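
`grep -v Running` hides pods that are Running but not Ready and still lists finished Jobs. A slightly stricter filter, sketched with awk under the assumption that the default `kubectl get pods --all-namespaces` columns are in use (NAMESPACE, NAME, READY, STATUS, ...); `unhealthy_pods` is a hypothetical name.

```shell
# Hypothetical helper: list pods that are not fully healthy.
# Keeps the header, skips Completed pods, and includes Running pods whose
# READY column (for example 0/1) shows containers that are not ready.
unhealthy_pods() {
  kubectl get pods --all-namespaces |
    awk 'NR == 1 { print; next }
         $4 == "Completed" { next }
         { split($3, ready, "/") }
         $4 != "Running" || ready[1] != ready[2] { print }'
}
```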

Systematic Debug Template

## Issue: [Brief description]

### Symptom
[What's happening]

### Affected Resources
- Namespace:
- Deployment/Pod:
- Service:

### Investigation

#### Step 1: Check Status

kubectl get pod <pod>
kubectl describe pod <pod>

Findings: [...]

#### Step 2: Check Logs

kubectl logs <pod>

Findings: [...]

#### Step 3: Check Events

kubectl get events --sort-by='.lastTimestamp'

Findings: [...]

### Root Cause
[What caused the issue]

### Resolution
[What fixed it]

### Prevention
[How to prevent recurrence]

Emergency Procedures

Force Delete Stuck Pod

# Only use when pod is truly stuck
kubectl delete pod <pod> --grace-period=0 --force
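
Before forcing deletion, it can help to see why the pod is stuck. The sketch below prints when deletion was requested, any finalizers still attached, and the node the pod is bound to; `stuck_pod_info` is a hypothetical name and the pod is assumed to be in the current namespace.

```shell
# Hypothetical helper: show the fields that usually explain a stuck deletion.
# Usage: stuck_pod_info <pod-name>
stuck_pod_info() {
  kubectl get pod "$1" -o jsonpath='deleting since: {.metadata.deletionTimestamp}{"\n"}finalizers: {.metadata.finalizers}{"\n"}node: {.spec.nodeName}{"\n"}'
}
```

A non-empty finalizers list suggests a controller has not finished its cleanup, while an unreachable node suggests the kubelet never confirmed termination. Force deletion only removes the API object; it does not guarantee the container has stopped on the node.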

Emergency Rollback

# Immediate rollback
kubectl rollout undo deployment/<name>

# Rollback to specific revision
kubectl rollout history deployment/<name>
kubectl rollout undo deployment/<name> --to-revision=<n>

Scale Down Quickly

# Scale to zero
kubectl scale deployment/<name> --replicas=0

# Scale back up (use the replica count the deployment had before scaling down)
kubectl scale deployment/<name> --replicas=<original-count>
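
Scaling back up needs the original replica count, which is easy to lose during an incident. One way to keep it is to record it on the Deployment before scaling to zero. This is a sketch under assumptions: the function names and the annotation key `example.com/previous-replicas` are hypothetical, and a HorizontalPodAutoscaler targeting the Deployment may override manual scaling.

```shell
# Hypothetical helper: remember the current replica count, then scale to zero.
# Usage: scale_down_saving <deployment>
scale_down_saving() {
  name="$1"
  replicas="$(kubectl get deployment "$name" -o jsonpath='{.spec.replicas}')"
  # The annotation key is an illustrative placeholder, not a Kubernetes convention.
  kubectl annotate deployment "$name" "example.com/previous-replicas=$replicas" --overwrite
  kubectl scale "deployment/$name" --replicas=0
}

# Hypothetical helper: scale back to the recorded count (1 if none recorded).
# Usage: scale_restore <deployment>
scale_restore() {
  name="$1"
  replicas="$(kubectl get deployment "$name" -o jsonpath='{.metadata.annotations.example\.com/previous-replicas}')"
  kubectl scale "deployment/$name" --replicas="${replicas:-1}"
}
```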

Diagnose and fix common Kubernetes issues with systematic debugging approaches. Use this skill when troubleshooting K8s clusters, pods not starting, deployments failing, or networking issues. Activate when: kubernetes, k8s, pod, deployment, kubectl, container, crashloopbackoff, imagepullbackoff, pending pods, kubernetes networking, service not working, ingress issues.

2 stars

development

Updated Apr 23, 2026

npx skillsauth add latestaiagents/agent-skills kubernetes-troubleshooting

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | % |
|---------|-------------|--------|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 24, 2026, 2:54 AM · 37.0s · 1 file scanned

Related Skills

latestaiagents/skill-testing

development

Verified · Trusted · Community

Test skills for correct activation, content quality, and regression — both automated checks (frontmatter validity, lint) and manual verification (query-suite activation testing). Covers CI integration and how to catch skill regressions before users do. Use this skill when adding skills to a repo, setting up CI for a skill library, or debugging "the skill exists but doesn't work". Activate when: test skills, validate skills, skill CI, skill linting, skill activation test, skill regression.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/skill-frontmatter

documentation

Verified · Trusted · Community

Write the YAML frontmatter for a SKILL.md file so it activates reliably — name, description, and activation keywords that the model matches against. Covers length, tone, and the most common frontmatter mistakes. Use this skill when authoring a new skill, fixing a skill that isn't auto-activating, or reviewing skills for publication. Activate when: SKILL.md frontmatter, skill description, skill activation, skill YAML, write a skill, author a skill.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/skill-activation-patterns

development

Verified · Trusted · Community

Design skills that fire at the right moment — neither over-eager (noise) nor under-eager (silent). Covers activation specificity, trigger phrases, disambiguation between overlapping skills, and debugging activation. Use this skill when multiple skills could fire on the same query, a skill never fires, or a skill fires too often. Activate when: skill won't activate, skill over-activates, overlapping skills, skill triggers, skill selection, skill disambiguation.

2 · SKILL.md · Updated Apr 23, 2026

latestaiagents/progressive-disclosure

development

Verified · Trusted · Community

Structure SKILL.md content so the model reads just enough — concise summary up front, progressively deeper detail, examples on demand. Covers section ordering, length budgets, when to split into multiple skills. Use this skill when writing or refactoring a skill body, one skill has grown too long, or a skill is wordy but not useful. Activate when: SKILL.md structure, skill content, skill too long, split skill, progressive disclosure, skill body.

2 · SKILL.md · Updated Apr 23, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/latestaiagents/agent-skills.git

# Copy into Claude Code skills folder (global)
cp -r agent-skills/plugins/devops-sre/skills/automation/kubernetes-troubleshooting ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

latestaiagents/agent-skills

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT