skills/core/image-pull-debug/SKILL.md
Diagnose container image pull failures (ErrImagePull / ImagePullBackOff). Checks pod status, containerd logs, and events to identify root cause.
When a pod is stuck in ErrImagePull or ImagePullBackOff, follow this flow to identify the root cause.
Important: ErrImagePull, ImagePullBackOff, and Back-off pulling image are NOT causes — they only indicate the pull failed. You MUST proceed through all steps below to find the actual cause. Never conclude with just these status messages.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt network-level debugging (ping, curl, iptables, traceroute, etc.) — that is outside the scope of this skill and should be left to the user or network administrator.
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.nodeName}'
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[*].image}'
kubectl get pod <pod> -n <ns> -o jsonpath='{.status.containerStatuses[0].state.waiting.message}'
Note the node name, image name, and waiting message (may already contain the root cause).
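As a convenience, the three lookups above can be wrapped in one helper. This is a minimal sketch, not part of the skill: `pull_debug_facts` is a hypothetical name, and the `range` template is a variation that prints one line per container (name, image, waiting reason, waiting message) instead of reading only index 0, which may help on multi-container pods.

```bash
# Hypothetical helper: prints the node name, then one tab-separated line per container.
pull_debug_facts() {
  local pod=$1 ns=$2
  # Node the pod was scheduled to.
  printf 'node: %s\n' "$(kubectl get pod "$pod" -n "$ns" -o jsonpath='{.spec.nodeName}')"
  # Name, image, waiting reason, and waiting message for every container.
  kubectl get pod "$pod" -n "$ns" -o jsonpath='{range .status.containerStatuses[*]}{.name}{"\t"}{.image}{"\t"}{.state.waiting.reason}{"\t"}{.state.waiting.message}{"\n"}{end}'
}

# Usage (placeholders as in the commands above):
#   pull_debug_facts <pod> <ns>
```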
Also check the image registry:
- If the image has no registry prefix (e.g. nginx:latest, envoyproxy/gateway:v1.2.8), it pulls from Docker Hub (docker.io).
- If the image has a registry prefix (e.g. registry.example.com/app:v1), it pulls from that registry.

Containerd logs are the authoritative source for the root cause. Pod events are often generic ("Failed to pull image") and do not contain the actual error, so always check containerd logs.
Use the node-logs skill:
bash skills/core/node-logs/scripts/get-node-logs.sh \
--node <nodeName> --unit containerd --grep "<image>" --since "1h ago"
Replace <image> with the image name or a unique substring. Adjust --since to cover the pod's creation time.
If journalctl returns nothing, try log files:
bash skills/core/node-logs/scripts/get-node-logs.sh \
--node <nodeName> --file /var/log/messages --grep "<image>"
Match the error from containerd logs (or the state.waiting.message from step 1) against the patterns below. Once a pattern matches, report the root cause to the user and stop. Do not continue with further diagnostic commands.
If containerd logs have no relevant entries, check events as a supplementary source:
kubectl get events -n <ns> --field-selector involvedObject.name=<pod>
If still no match, report whatever error information you have found and let the user decide next steps. Do NOT start autonomous network investigation.
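The lookup order described above (containerd journal, then log files, then pod events) can be sketched as a single function. This is an illustration under assumptions: `find_pull_evidence` and the `NODE_LOGS` variable are hypothetical names, and the sketch assumes `get-node-logs.sh` prints nothing when no log line matches.

```bash
# Path to the node-logs script; override NODE_LOGS if it lives elsewhere (hypothetical variable).
NODE_LOGS=${NODE_LOGS:-skills/core/node-logs/scripts/get-node-logs.sh}

# Hypothetical helper: returns the first non-empty evidence source.
find_pull_evidence() {
  local node=$1 image=$2 ns=$3 pod=$4 since=${5:-1h ago} out
  # 1. containerd journal entries mentioning the image.
  out=$(bash "$NODE_LOGS" --node "$node" --unit containerd --grep "$image" --since "$since")
  # 2. Fall back to file-based logs if the journal returned nothing.
  if [ -z "$out" ]; then
    out=$(bash "$NODE_LOGS" --node "$node" --file /var/log/messages --grep "$image")
  fi
  # 3. Fall back to pod events as a supplementary source.
  if [ -z "$out" ]; then
    out=$(kubectl get events -n "$ns" --field-selector "involvedObject.name=$pod")
  fi
  printf '%s\n' "$out"
}

# Usage: find_pull_evidence <nodeName> <image> <ns> <pod> "24h ago"
```

Whatever this returns is only input for the pattern matching; it performs no network diagnostics.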
not found / manifest unknown: Image does not exist. The image name or tag does not exist in the registry. Inform the user to verify the image name and tag.
unauthorized / access denied / denied: Authentication failed. The registry rejected the request. Advise the user to:
x509 / certificate / tls: Certificate not trusted. The node's containerd does not trust the registry's CA certificate. Advise the user to add the registry CA to the node's containerd trust config (/etc/containerd/certs.d/) or system trust store.
no such host / lookup.*failed: DNS resolution failed. The registry hostname cannot be resolved. Advise the user to check the hostname spelling and node DNS config.
connection reset by peer: Remote reset. TCP reached the registry but was reset by the remote end, which points to a server-side or intermediary issue.
i/o timeout / dial tcp.*timeout: Connection timed out. The node cannot establish a TCP connection to the registry. Common causes: firewall blocking, proxy misconfiguration, or registry unreachable from the node's network.
Docker Hub specific: If the image is from Docker Hub (docker.io) and the node is in mainland China, this is almost certainly caused by network restrictions (GFW). Advise the user to use a registry mirror or re-tag the image to a domestically accessible registry.
Report the timeout to the user and stop. Do NOT attempt network diagnostics (ping, curl, iptables, etc.).
connection refused: Connection refused. TCP reached the host but the port is not listening. The registry service is down or on a different port.
too many requests / 429 / rate limit: Rate limited. The registry is throttling requests. Inform the user to wait for the rate limit window to expire, or configure a registry mirror to reduce direct requests.
no space left on device: Disk full. Containerd cannot unpack image layers due to insufficient disk space on the node.
invalid reference format: Malformed image name. The image reference contains illegal characters or has incorrect format. Inform the user to fix the image field in the pod spec.
server gave HTTP response to HTTPS client: Protocol mismatch. The registry serves HTTP but containerd expects HTTPS. Advise the user to configure the registry as insecure in containerd config, or enable TLS on the registry.
does not match the specified platform: Architecture mismatch. The image exists but has no manifest for this node's CPU architecture. Inform the user to use a multi-arch image or the correct platform-specific tag.
ErrImageNeverPull: Pull policy forbids pulling. imagePullPolicy is Never but the image is not present on the node. Inform the user to pre-load the image or change imagePullPolicy.
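The pattern list above can be approximated with a `case` statement. This is a rough sketch rather than the skill's own implementation: `classify_pull_error` and `image_registry` are hypothetical names, matching is plain case-insensitive substring matching, and the first matching branch wins. `image_registry` supports the Docker Hub specific note by applying the usual convention that the first path component is a registry host only if it contains a dot or colon or is `localhost`.

```bash
# Hypothetical helper: map an error message to one of the root-cause labels above.
classify_pull_error() {
  local msg
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"server gave http response to https client"*) echo "protocol mismatch" ;;
    *"invalid reference format"*)                  echo "malformed image name" ;;
    *"does not match the specified platform"*)     echo "architecture mismatch" ;;
    *"errimageneverpull"*)                         echo "pull policy forbids pulling" ;;
    *"no space left on device"*)                   echo "disk full" ;;
    # 429 is matched with surrounding spaces to avoid hits inside image digests.
    *"too many requests"*|*"rate limit"*|*" 429 "*) echo "rate limited" ;;
    *"x509"*|*"certificate"*|*"tls"*)              echo "certificate not trusted" ;;
    *"no such host"*|*lookup*failed*)              echo "DNS resolution failed" ;;
    *"connection reset by peer"*)                  echo "remote reset" ;;
    *"connection refused"*)                        echo "connection refused" ;;
    *"i/o timeout"*|*"dial tcp"*timeout*)          echo "connection timed out" ;;
    *"unauthorized"*|*"access denied"*|*"denied"*) echo "authentication failed" ;;
    *"not found"*|*"manifest unknown"*)            echo "image does not exist" ;;
    *)                                             echo "unknown" ;;
  esac
}

# Hypothetical helper: print the registry an image reference would be pulled from.
image_registry() {
  local ref=$1 first
  case "$ref" in
    */*) first=${ref%%/*} ;;             # text before the first slash
    *)   echo "docker.io"; return ;;     # bare name such as nginx:latest
  esac
  case "$first" in
    *.*|*:*|localhost) echo "$first" ;;  # looks like a registry host
    *)                 echo "docker.io" ;;  # Docker Hub namespace such as envoyproxy/
  esac
}
```

An "unknown" result means no pattern matched, in which case the raw error text is what gets reported to the user.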
- The --since parameter should cover the pod's creation time. If the pod was created long ago, increase it accordingly (e.g. --since "24h ago").
- When no other source yields an error, the state.waiting.message from step 1 is your best available information; report it directly.