skills/core/pod-crash-debug/SKILL.md
Diagnose pod crash failures (CrashLoopBackOff, OOMKilled, Error, RunContainerError). Checks pod status, events, and previous logs to identify root cause.
npx skillsauth add scitix/siclaw pod-crash-debug
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When a pod is stuck in CrashLoopBackOff, Error, OOMKilled, or RunContainerError, follow this flow to identify the root cause.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to fix the application code or modify resource limits — that should be left to the user.
Step 1. Check the pod status:
kubectl get pod <pod> -n <ns> -o wide
Note the STATUS, RESTARTS count, and NODE. A high restart count confirms the pod is crash-looping.
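When the pod has several containers, the RESTARTS column alone does not show which one is restarting. A minimal sketch, assuming a POSIX shell; `restart_counts` is a hypothetical helper name, not part of the skill:

```shell
# Print "<container><TAB><restartCount>" for each container in the pod.
# $1 = pod name, $2 = namespace.
restart_counts() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.restartCount}{"\n"}{end}'
}

# Usage: restart_counts <pod> <ns>
```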
Step 2. Describe the pod:
kubectl describe pod <pod> -n <ns>
Focus on:
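- The container's Last State: reason, exit code, and signal
- Events: BackOff, Failed, Unhealthy, or OOMKilling events
Step 3. Get the logs from the previous (crashed) container:
kubectl logs <pod> -n <ns> --previous --tail=200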
If the pod has multiple containers, specify the crashing container:
kubectl logs <pod> -n <ns> -c <container> --previous --tail=200
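To find which container to pass to `-c`, the last terminated state of every container can be listed in one call. This is a sketch under the assumption that the crashing container has already restarted at least once, so `lastState.terminated` is populated; `last_terminations` is a hypothetical helper name:

```shell
# Print "<container><TAB><reason><TAB><exitCode>" from each container's last
# terminated state. Containers that never terminated show empty fields.
# $1 = pod name, $2 = namespace.
last_terminations() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.lastState.terminated.reason}{"\t"}{.lastState.terminated.exitCode}{"\n"}{end}'
}

# Usage: last_terminations <pod> <ns>
```

The container with a non-empty reason and exit code is the one to inspect with `--previous`.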
If --previous fails with "previous terminated container not found", try current logs:
kubectl logs <pod> -n <ns> --tail=200
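The try-previous-then-current sequence above can be wrapped in one function. A minimal sketch, assuming a POSIX shell; `crash_logs` is a hypothetical helper name, and it falls back on any failure of `--previous`, not only the "previous terminated container not found" error:

```shell
# Fetch logs of the previous container instance, falling back to current logs.
# $1 = pod name, $2 = namespace, $3 = optional container name.
crash_logs() {
  if [ -n "${3:-}" ]; then
    kubectl logs "$1" -n "$2" -c "$3" --previous --tail=200 2>/dev/null ||
      kubectl logs "$1" -n "$2" -c "$3" --tail=200
  else
    kubectl logs "$1" -n "$2" --previous --tail=200 2>/dev/null ||
      kubectl logs "$1" -n "$2" --tail=200
  fi
}

# Usage: crash_logs <pod> <ns> [<container>]
```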
Match the information from steps 2-3 against the patterns below. Once a pattern matches, report the root cause to the user and stop.
OOMKilled / exit code 137 (from OOM): Out of Memory
The container exceeded its memory limit and was killed by the kernel OOM killer.
Check the container's resource limits:
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].resources}'
Advise the user to either increase the memory limit or investigate the application's memory usage (possible memory leak).
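For the report to the user, it can help to show only the memory figures per container rather than the whole resources object. A sketch, assuming a POSIX shell; `memory_limits` is a hypothetical helper name:

```shell
# Print "<container><TAB><memory request><TAB><memory limit>" per container.
# Empty fields mean the request or limit is not set on that container.
# $1 = pod name, $2 = namespace.
memory_limits() {
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.resources.requests.memory}{"\t"}{.resources.limits.memory}{"\n"}{end}'
}

# Usage: memory_limits <pod> <ns>
```

This only reads the pod spec, which keeps it within the diagnosis-only scope of the skill.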
Exit code 137 (not from OOM): killed by SIGKILL
The container was killed by SIGKILL but not due to OOM. A common cause is a failing liveness probe: check kubectl describe pod for Unhealthy events with "Liveness probe failed".
If liveness probe failures are present, advise the user to adjust probe timing (initialDelaySeconds, timeoutSeconds, periodSeconds) or fix the health endpoint.
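The probe events and the configured probe timing can be gathered together. A minimal sketch, assuming a POSIX shell and that the pod's events have not yet expired from the cluster; `liveness_check` is a hypothetical helper name:

```shell
# Show Unhealthy events for the pod, then each container's liveness probe timing
# as "<container><TAB><initialDelaySeconds><TAB><timeoutSeconds><TAB><periodSeconds>".
# $1 = pod name, $2 = namespace.
liveness_check() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,reason=Unhealthy"
  kubectl get pod "$1" -n "$2" \
    -o jsonpath='{range .spec.containers[*]}{.name}{"\t"}{.livenessProbe.initialDelaySeconds}{"\t"}{.livenessProbe.timeoutSeconds}{"\t"}{.livenessProbe.periodSeconds}{"\n"}{end}'
}

# Usage: liveness_check <pod> <ns>
```

Empty timing fields mean the container has no liveness probe, or the field is unset.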
Exit code 1: application error
The application exited with a generic error code. The root cause is in the container logs from step 3. Report the relevant error lines to the user.
Often indicates a missing binary, incorrect command syntax, or shell script error. Check the container's command and args:
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[0].command} {.spec.containers[0].args}'
Exit code 126: command not executable
The entrypoint binary exists but is not executable. Advise the user to check file permissions in the container image.
Exit code 127: command not found
The entrypoint binary does not exist in the container image. Advise the user to verify the image contains the expected binary and the command field is correct.
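The exit-code patterns above can be summarized as a small lookup. This is an illustrative sketch, not part of the skill: `describe_exit_code` is a hypothetical helper, the wording of each message is an assumption, and the "128 + signal number" rule is the usual shell convention rather than something the source states.

```shell
# Map a container exit code to a short hint.
# $1 = exit code as reported in the container's last state.
describe_exit_code() {
  code="$1"
  case "$code" in
    0)   echo "exited cleanly" ;;
    1)   echo "generic application error; check logs" ;;
    126) echo "command found but not executable" ;;
    127) echo "command not found" ;;
    137) echo "killed by SIGKILL (signal 9)" ;;
    *)
      # By convention, codes above 128 mean "terminated by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "killed by signal $((code - 128))"
      else
        echo "application-specific exit code $code"
      fi ;;
  esac
}

# Usage: describe_exit_code 137
```

Exit code 137 alone does not distinguish OOM from other SIGKILL causes; the reason field from step 2 is still needed for that.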
RunContainerError: Container failed to start
The container runtime failed to start the container. Check events from step 2 for the specific error message and report it to the user.
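If the describe output is long, the warning events for the pod can be listed on their own. A sketch, assuming a POSIX shell; `pod_warnings` is a hypothetical helper name:

```shell
# List Warning events that reference the pod, oldest first.
# $1 = pod name, $2 = namespace.
pod_warnings() {
  kubectl get events -n "$2" \
    --field-selector "involvedObject.name=$1,type=Warning" \
    --sort-by=.lastTimestamp
}

# Usage: pod_warnings <pod> <ns>
```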
CreateContainerConfigError: Invalid container configuration
A referenced ConfigMap, Secret, or other resource does not exist or is misconfigured. Check the events for the specific missing resource name and report it to the user.
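When the events are no longer available, the ConfigMaps the pod references can be checked directly. A minimal sketch, assuming a POSIX shell and read access to ConfigMaps in the namespace; `missing_configmaps` is a hypothetical helper name. It covers volume, envFrom, and env references, and it will also flag references marked optional, which are not errors.

```shell
# Report ConfigMaps referenced by the pod spec that do not exist in the namespace.
# $1 = pod name, $2 = namespace.
missing_configmaps() {
  names=$(kubectl get pod "$1" -n "$2" \
    -o jsonpath='{.spec.volumes[*].configMap.name} {.spec.containers[*].envFrom[*].configMapRef.name} {.spec.containers[*].env[*].valueFrom.configMapKeyRef.name}')
  # De-duplicate, then test each name with a read-only lookup.
  for name in $(printf '%s\n' $names | sort -u); do
    kubectl get configmap "$name" -n "$2" >/dev/null 2>&1 ||
      echo "missing ConfigMap: $name"
  done
}

# Usage: missing_configmaps <pod> <ns>
```

The same approach should work for Secrets by swapping in the `secret.secretName`, `secretRef`, and `secretKeyRef` fields.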
PostStartHookError: Lifecycle hook failed
The container's postStart hook failed, causing the container to be killed. Check events and logs for the hook's error output.
- If --previous logs are empty and the container exits immediately, the issue is likely with the entrypoint command. Check the image's ENTRYPOINT/CMD and the pod's command/args override.
- For init containers, pass -c <init-container-name> to get the specific init container's logs.
- The --previous logs are usually sufficient.