.claude/skills/eks-troubleshooting/SKILL.md
EKS troubleshooting and debugging guide covering pod failures, cluster issues, networking problems, and performance diagnostics. Use when diagnosing cluster issues, debugging pod failures (CrashLoopBackOff, Pending, OOMKilled), resolving networking problems, investigating performance issues, troubleshooting IAM/IRSA permissions, fixing image pull errors, or analyzing EKS cluster health.
Comprehensive troubleshooting guide for Amazon EKS clusters covering control plane issues, node problems, pod failures, networking, storage, security, and performance diagnostics. Based on 2025 AWS best practices.
Keywords: EKS, Kubernetes, kubectl, debugging, troubleshooting, pod failure, node issues, networking, DNS, AWS, diagnostics
Pod Issues:
Cluster Issues:
Networking Problems:
Security & Permissions:
Performance Issues:
# Check cluster accessibility
kubectl cluster-info
# Get overall cluster status
kubectl get nodes
kubectl get pods --all-namespaces | grep -v Running
# Check recent events
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -20
# Check control plane components
kubectl get --raw /healthz
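kubectl get --raw '/readyz?verbose' # Per-check API server readiness
kubectl get componentstatuses # Deprecated since 1.19; may not reflect the AWS-managed control plane on EKS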
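The `grep -v Running` filter above also passes the header row and `Completed` job pods. A minimal sketch of a stricter filter follows, assuming the default column layout of `kubectl get pods --all-namespaces --no-headers` (STATUS in the fourth column). The function name `unhealthy_pods` is hypothetical, and a pod that is `Running` but not ready is not caught by it.

```shell
# Hypothetical helper: read `kubectl get pods --all-namespaces --no-headers`
# output on stdin and print only pods that are neither Running nor Completed.
# Output format: <namespace>/<pod-name><TAB><status>
unhealthy_pods() {
  awk '$4 != "Running" && $4 != "Completed" { print $1 "/" $2 "\t" $4 }'
}

# Usage (requires cluster access):
#   kubectl get pods --all-namespaces --no-headers | unhealthy_pods
```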
Pod Issues:
# Get pod status
kubectl get pods -n <namespace>
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --previous # Previous container logs
Node Issues:
# Check node status
kubectl get nodes -o wide
kubectl describe node <node-name>
kubectl top nodes # Requires metrics-server
Cluster-Wide Issues:
# Check all resources
kubectl get all --all-namespaces
kubectl get events --all-namespaces --sort-by='.lastTimestamp'
# Check EKS cluster health (AWS CLI)
aws eks describe-cluster --name <cluster-name> --query 'cluster.health'
references/workload-issues.md
references/cluster-issues.md
references/networking-issues.md
Symptoms:
Quick Check:
kubectl describe pod <pod-name> -n <namespace> | grep -A 10 Events
Common Causes:
Quick Fixes:
# Check resource requests
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 5 resources
# Check node capacity
kubectl describe nodes | grep -A 5 "Allocated resources"
# For Karpenter clusters - check controller logs
# (adjust -n if Karpenter is installed in kube-system, as on newer installs)
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter
Symptoms:
Quick Check:
# View current logs
kubectl logs <pod-name> -n <namespace>
# View previous container logs (most useful)
kubectl logs <pod-name> -n <namespace> --previous
# Get exit code
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}'
Common Exit Codes:
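Exit codes above 128 conventionally mean the process was killed by a signal, with the signal number equal to the code minus 128. A small sketch that decodes the value returned by the jsonpath query above follows. The function name `explain_exit_code` is hypothetical, and the descriptions are general conventions rather than anything EKS-specific.

```shell
# Hypothetical helper: turn a container exit code into a short explanation.
# Codes above 128 are treated as "terminated by signal (code - 128)".
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "0: exited successfully" ;;
    1)   echo "1: general application error" ;;
    137) echo "137: SIGKILL (128+9), often OOMKilled" ;;
    139) echo "139: SIGSEGV (128+11), segmentation fault" ;;
    143) echo "143: SIGTERM (128+15), termination requested" ;;
    *)
      # Non-numeric input falls through to the else branch.
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "$code: terminated by signal $((code - 128))"
      else
        echo "$code: application-defined exit code"
      fi
      ;;
  esac
}

# Usage (requires cluster access):
#   explain_exit_code "$(kubectl get pod <pod-name> -n <namespace> \
#     -o jsonpath='{.status.containerStatuses[0].lastState.terminated.exitCode}')"
```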
Quick Fixes:
# Check for OOMKilled
kubectl describe pod <pod-name> -n <namespace> | grep -i oom
# Increase memory limit
kubectl set resources deployment <deployment-name> \
  -n <namespace> \
  -c <container-name> \
  --limits=memory=512Mi
# Check liveness/readiness probes
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 Probe
Symptoms:
Quick Check:
kubectl describe pod <pod-name> -n <namespace> | grep -A 5 "Failed to pull image"
Common Causes:
Quick Fixes:
# Check if image exists (for ECR)
aws ecr describe-images --repository-name <repo-name> --image-ids imageTag=<tag>
# Verify IRSA role has ECR permissions
kubectl describe serviceaccount <sa-name> -n <namespace> | grep Annotations
# For ECR - ensure IAM role has this policy:
# AmazonEC2ContainerRegistryReadOnly or ecr:GetAuthorizationToken, ecr:BatchGetImage
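# Test image pull manually on node (EKS 1.24+ nodes run containerd, not Docker)
kubectl debug node/<node-name> -it --image=busybox
# Then, inside the debug pod (the host filesystem is mounted at /host):
#   chroot /host ctr -n k8s.io images pull <image>
# Private ECR images also need credentials: --user "AWS:$(aws ecr get-login-password --region <region>)"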
Symptoms:
Quick Check:
# Check memory limits
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 3 "limits:"
# Check actual memory usage
kubectl top pod <pod-name> -n <namespace>
Quick Fix:
# Increase memory limit
kubectl set resources deployment <deployment-name> \
  -n <namespace> \
  --limits=memory=1Gi \
  --requests=memory=512Mi
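Before raising a limit, it can help to see how close actual usage is to it. The sketch below compares a usage figure from `kubectl top pod` with the configured limit. The function names are hypothetical, and it assumes whole-number quantities with `Ki`, `Mi`, or `Gi` suffixes (values such as `1.5Gi` or `500M` are not handled).

```shell
# Hypothetical helpers: express memory usage as a percentage of the limit.
# Assumes whole-number quantities with Ki, Mi, or Gi suffixes (e.g. 412Mi, 1Gi).
mem_to_mib() {
  case "$1" in
    *Gi) echo $(( ${1%Gi} * 1024 )) ;;
    *Mi) echo "${1%Mi}" ;;
    *Ki) echo $(( ${1%Ki} / 1024 )) ;;
    *)   echo "unsupported quantity: $1" >&2; return 1 ;;
  esac
}

mem_usage_pct() {
  local used limit
  used="$(mem_to_mib "$1")" || return 1
  limit="$(mem_to_mib "$2")" || return 1
  # Integer division: the result is rounded down.
  echo $(( 100 * $used / $limit ))
}

# Usage: mem_usage_pct <usage-from-kubectl-top> <limit-from-pod-spec>
#   mem_usage_pct 412Mi 512Mi
```

Usage that sits persistently near 100% suggests the limit is too low for the workload; usage that climbs steadily over time points more toward a leak than an undersized limit.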
Quick Check:
kubectl get nodes
kubectl describe node <node-name> | grep -A 10 Conditions
Common Causes:
Quick Fixes:
# Check node conditions
kubectl describe node <node-name> | grep -E "MemoryPressure|DiskPressure|PIDPressure"
# For EKS managed nodes - check EC2 instance health
aws ec2 describe-instance-status --instance-ids <instance-id>
# Drain and delete node (if managed node group - ASG will replace)
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data
kubectl delete node <node-name>
Symptoms:
Quick Check:
# Check VPC CNI logs
kubectl logs -n kube-system -l k8s-app=aws-node --tail=100
# Check IP addresses per node
kubectl get nodes -o custom-columns='NAME:.metadata.name,ADDRESSES:.status.addresses[*].address'
# Check ENI usage
aws ec2 describe-instances --instance-ids <instance-id> \
--query 'Reservations[].Instances[].NetworkInterfaces[].PrivateIpAddresses'
Quick Fixes:
# Enable prefix delegation (for new nodes)
kubectl set env daemonset aws-node \
-n kube-system \
ENABLE_PREFIX_DELEGATION=true
# Increase warm pool targets
kubectl set env daemonset aws-node \
-n kube-system \
WARM_IP_TARGET=5
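How many pod IPs a node can hold is bounded by its instance type's ENI limits. For the VPC CNI in its default mode, AWS documents the per-node ceiling as ENIs × (IPv4 addresses per ENI − 1) + 2; with prefix delegation, each secondary address slot holds a /28 prefix of 16 addresses. A minimal sketch of that arithmetic follows. The function name `max_pods` is hypothetical, and the ENI figures have to be looked up per instance type (an m5.large, for example, supports 3 ENIs with 10 addresses each).

```shell
# Hypothetical helper: estimate the VPC CNI pod ceiling for one node.
#   $1 = max ENIs for the instance type
#   $2 = IPv4 addresses per ENI
#   $3 = "true" if prefix delegation is enabled (default: false)
max_pods() {
  local enis="$1" ips_per_eni="$2" prefix_delegation="${3:-false}"
  if [ "$prefix_delegation" = "true" ]; then
    # Each secondary slot carries a /28 prefix (16 addresses).
    echo $(( $enis * ($ips_per_eni - 1) * 16 + 2 ))
  else
    echo $(( $enis * ($ips_per_eni - 1) + 2 ))
  fi
}

# Usage:
#   max_pods 3 10        # m5.large, secondary-IP mode
#   max_pods 3 10 true   # m5.large, prefix delegation
```

The prefix-delegation figure is a theoretical IP ceiling; the kubelet's max-pods setting is normally capped well below it.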
# Pod debugging
kubectl get pods -n <namespace> -o wide
kubectl describe pod <pod-name> -n <namespace>
kubectl logs <pod-name> -n <namespace> --tail=100 -f
kubectl logs <pod-name> -n <namespace> -c <container-name> # Multi-container pods
kubectl logs <pod-name> -n <namespace> --previous # Previous crash logs
# Node debugging
kubectl get nodes -o wide
kubectl describe node <node-name>
kubectl top nodes
kubectl top pods -n <namespace>
# Events (VERY useful)
kubectl get events -n <namespace> --sort-by='.lastTimestamp'
kubectl get events --all-namespaces --sort-by='.lastTimestamp' | tail -50
# Resource usage
kubectl top pods -n <namespace> --containers
kubectl describe node <node-name> | grep -A 5 "Allocated resources"
# Execute command in running pod
kubectl exec -it <pod-name> -n <namespace> -- /bin/sh
# Port forward for local testing
kubectl port-forward <pod-name> -n <namespace> 8080:80
# Debug with ephemeral container (K8s 1.23+)
kubectl debug <pod-name> -n <namespace> -it --image=busybox
# Debug node issues
kubectl debug node/<node-name> -it --image=ubuntu
# Copy files from pod
kubectl cp <namespace>/<pod-name>:/path/to/file ./local-file
# Get pod YAML
kubectl get pod <pod-name> -n <namespace> -o yaml
# Get all resources in namespace
kubectl get all -n <namespace>
kubectl api-resources --verbs=list --namespaced -o name \
| xargs -n 1 kubectl get --show-kind --ignore-not-found -n <namespace>
# Get pods not running
kubectl get pods --all-namespaces --field-selector=status.phase!=Running
# Get pods by label
kubectl get pods -n <namespace> -l app=myapp
# Custom columns
kubectl get pods -n <namespace> -o custom-columns=\
NAME:.metadata.name,\
STATUS:.status.phase,\
NODE:.spec.nodeName,\
IP:.status.podIP
# JSONPath queries
kubectl get pods -n <namespace> -o jsonpath='{.items[*].metadata.name}'
# Get pod restart count
kubectl get pods -n <namespace> -o jsonpath=\
'{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}'
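The restart-count query above prints one `name<TAB>count` line per pod, reading only the first container's status. A small sketch for narrowing that list to pods at or above a restart threshold follows. The function name `restarting_pods` and the default threshold of 5 are assumptions for illustration.

```shell
# Hypothetical helper: read "name<TAB>restartCount" lines on stdin and print
# only pods whose restart count is at or above the threshold (default 5).
restarting_pods() {
  local threshold="${1:-5}"
  awk -F '\t' -v t="$threshold" '$2 + 0 >= t + 0 { print $1 "\t" $2 }'
}

# Usage (requires cluster access):
#   kubectl get pods -n <namespace> -o jsonpath=\
#   '{range .items[*]}{.metadata.name}{"\t"}{.status.containerStatuses[0].restartCount}{"\n"}{end}' \
#     | restarting_pods 3
```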
# Check EKS cluster status
aws eks describe-cluster --name <cluster-name> \
--query 'cluster.{Status:status,Health:health,Version:version}'
# List all EKS clusters
aws eks list-clusters
# Check addon status
aws eks list-addons --cluster-name <cluster-name>
aws eks describe-addon --cluster-name <cluster-name> --addon-name vpc-cni
# Update kubeconfig
aws eks update-kubeconfig --name <cluster-name> --region <region>
# Check service account IRSA annotation
kubectl get sa <sa-name> -n <namespace> -o yaml | grep eks.amazonaws.com/role-arn
# Verify pod has correct service account
kubectl get pod <pod-name> -n <namespace> -o yaml | grep serviceAccountName
# Check if pod has AWS credentials
kubectl exec <pod-name> -n <namespace> -- env | grep AWS
# Test IAM permissions from pod
kubectl exec <pod-name> -n <namespace> -- aws sts get-caller-identity
kubectl exec <pod-name> -n <namespace> -- aws s3 ls # Test S3 access
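One cheap check before digging into trust policies is whether the service account annotation holds a well-formed IAM role ARN at all. A minimal sketch follows. The function name `is_iam_role_arn` is hypothetical, and the pattern only checks the ARN's shape (standard, China, and GovCloud partitions); it does not check whether the role exists or trusts the cluster's OIDC provider.

```shell
# Hypothetical helper: succeed only if the argument looks like an IAM role ARN.
# Checks shape only: partition, 12-digit account ID, and a role/ resource.
is_iam_role_arn() {
  printf '%s\n' "$1" | grep -Eq '^arn:aws(-cn|-us-gov)?:iam::[0-9]{12}:role/.+$'
}

# Usage (requires cluster access):
#   arn="$(kubectl get sa <sa-name> -n <namespace> \
#     -o jsonpath='{.metadata.annotations.eks\.amazonaws\.com/role-arn}')"
#   is_iam_role_arn "$arn" || echo "annotation missing or malformed: '$arn'"
```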
# Get ECR login password
aws ecr get-login-password --region <region>
# Test ECR access
aws ecr describe-repositories --region <region>
# Check if IAM role/user has ECR permissions
aws iam get-role --role-name <role-name>
aws iam list-attached-role-policies --role-name <role-name>
# Check Karpenter logs
kubectl logs -n karpenter -l app.kubernetes.io/name=karpenter --tail=100 -f
# Check NodePool configuration (Karpenter v0.32+; older releases used the Provisioner API)
kubectl get nodepools -o yaml
# Check Karpenter controller status
kubectl get pods -n karpenter
# Debug why nodes not provisioning
kubectl describe nodepool <nodepool-name>
kubectl get nodeclaims
kubectl get events -n karpenter --sort-by='.lastTimestamp'
# Check node resource usage
kubectl top nodes
# Check pod resource usage
kubectl top pods -n <namespace> --sort-by=memory
kubectl top pods -n <namespace> --sort-by=cpu
# Check resource requests vs limits
# Column specs are quoted so the shell does not treat [*] as a glob pattern
kubectl get pods -n <namespace> -o custom-columns=\
'NAME:.metadata.name,'\
'CPU_REQ:.spec.containers[*].resources.requests.cpu,'\
'CPU_LIM:.spec.containers[*].resources.limits.cpu,'\
'MEM_REQ:.spec.containers[*].resources.requests.memory,'\
'MEM_LIM:.spec.containers[*].resources.limits.memory'
# Check pod readiness/liveness probes
kubectl get pods -n <namespace> -o yaml | grep -A 10 Probe
# Check pod startup time
kubectl describe pod <pod-name> -n <namespace> | grep Started
# Profile application in pod
kubectl exec <pod-name> -n <namespace> -- top
kubectl exec <pod-name> -n <namespace> -- netstat -tuln
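# Enable control plane logging to CloudWatch Logs
# (Container Insights is enabled separately, e.g. with the amazon-cloudwatch-observability EKS add-on)
# Via AWS CLI: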
aws eks update-cluster-config \
--name <cluster-name> \
--logging '{"clusterLogging":[{"types":["api","audit","authenticator","controllerManager","scheduler"],"enabled":true}]}'
# Check control plane logs in CloudWatch
# Log groups:
# - /aws/eks/<cluster-name>/cluster
# - /aws/containerinsights/<cluster-name>/application
# - /aws/containerinsights/<cluster-name>/host
# - /aws/containerinsights/<cluster-name>/dataplane
# Check Fluent Bit daemonset
kubectl get pods -n amazon-cloudwatch -l k8s-app=fluent-bit
# Check Fluent Bit logs
kubectl logs -n amazon-cloudwatch -l k8s-app=fluent-bit --tail=50
# Force delete pod
kubectl delete pod <pod-name> -n <namespace> --grace-period=0 --force
# If still stuck, remove finalizers
kubectl patch pod <pod-name> -n <namespace> -p '{"metadata":{"finalizers":null}}'
# Force drain
kubectl drain <node-name> --ignore-daemonsets --delete-emptydir-data --force
# If still stuck, delete pods directly
kubectl delete pods -n <namespace> --field-selector spec.nodeName=<node-name> --force --grace-period=0
# Check API server health
kubectl get --raw /healthz
# Check control plane logs (CloudWatch)
aws logs tail /aws/eks/<cluster-name>/cluster --follow
# Restart coredns if DNS issues
kubectl rollout restart deployment coredns -n kube-system
# Check etcd health (EKS manages this, but can check API responsiveness)
time kubectl get nodes # Should be < 1 second
# Always set resource requests and limits
resources:
  requests:
    memory: "256Mi"
    cpu: "250m"
  limits:
    memory: "512Mi"
    cpu: "500m" # CPU limits optional, can cause throttling
# Implement probes for all applications
livenessProbe:
  httpGet:
    path: /healthz
    port: 8080
  initialDelaySeconds: 30
  periodSeconds: 10
  timeoutSeconds: 5
  failureThreshold: 3
readinessProbe:
  httpGet:
    path: /ready
    port: 8080
  initialDelaySeconds: 5
  periodSeconds: 5
  timeoutSeconds: 3
  failureThreshold: 3
startupProbe: # For slow-starting apps
  httpGet:
    path: /healthz
    port: 8080
  failureThreshold: 30
  periodSeconds: 10
# Pod disruption budgets
apiVersion: policy/v1
kind: PodDisruptionBudget
metadata:
  name: app-pdb
spec:
  minAvailable: 2
  selector:
    matchLabels:
      app: myapp
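A budget like this is a common reason `kubectl drain` appears to hang: evictions are allowed only while the number of healthy matching pods exceeds `minAvailable`. A minimal sketch of that arithmetic follows. The function name `pdb_allowed_disruptions` is hypothetical.

```shell
# Hypothetical helper: how many voluntary evictions a minAvailable-style
# PodDisruptionBudget permits right now.
#   $1 = currently healthy pods matching the selector
#   $2 = minAvailable
pdb_allowed_disruptions() {
  local healthy="$1" min_available="$2"
  local allowed=$(( $healthy - $min_available ))
  # Never negative: at or below minAvailable, no evictions are permitted.
  if [ "$allowed" -lt 0 ]; then
    allowed=0
  fi
  echo "$allowed"
}

# Usage:
#   pdb_allowed_disruptions 3 2   # three healthy replicas, minAvailable: 2
# Compare with the ALLOWED DISRUPTIONS column of:
#   kubectl get pdb app-pdb -n <namespace>
```

With `minAvailable: 2` and only two replicas, the result is 0, so a drain blocks until another replica becomes healthy elsewhere.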
For detailed troubleshooting of specific areas:
AWS Documentation:
Kubernetes Documentation:
Tools:
Quick Start: Use diagnostic workflow above → Identify issue type → Jump to relevant reference guide
Last Updated: November 27, 2025 (2025 AWS Best Practices)