skills/core/deployment-rollout-debug/SKILL.md
Diagnose Deployment rollout failures (stuck rollouts, ProgressDeadlineExceeded, replica mismatch). Checks rollout status, ReplicaSets, and new pod health to identify why an update is failing.
When a Deployment rollout is stuck, not progressing, or shows replica mismatches, follow this flow to identify the root cause.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to roll back, scale, or modify the Deployment — that should be left to the user.
kubectl rollout status deployment/<name> -n <ns> --timeout=5s
This shows whether the rollout is progressing, complete, or stuck. A timeout indicates the rollout is not making progress.
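The exit code of this command makes the check scriptable. The following is a minimal sketch, assuming `kubectl rollout status` exits non-zero when the timeout expires; the function name `rollout_state` and the labels it prints are illustrative and not part of the skill.

```shell
# rollout_state NAME NAMESPACE   (hypothetical helper)
# Prints "complete" if the rollout finishes within 5s, else "not-complete".
# Read-only: it only queries status and never modifies the Deployment.
rollout_state() {
  if kubectl rollout status "deployment/$1" -n "$2" --timeout=5s >/dev/null 2>&1; then
    echo "complete"
  else
    echo "not-complete"
  fi
}

# Usage (placeholders as in the command above):
#   rollout_state <name> <ns>
```

A "not-complete" result can also mean the command failed for another reason (for example, the Deployment does not exist), so treat it as a prompt to continue with the next checks rather than as a diagnosis.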
kubectl get deployment <name> -n <ns> -o wide
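Compare the READY, UP-TO-DATE, and AVAILABLE columns against the desired replica count (the second number in READY; older kubectl versions print it as a separate DESIRED column).
If UP-TO-DATE or AVAILABLE is lower than the desired count, the rollout is incomplete.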
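The same comparison can be expressed as a small shell check. This is a sketch only: the helper name `rollout_incomplete` is hypothetical, and it assumes the three counts have already been read from the Deployment (the API fields `.spec.replicas`, `.status.updatedReplicas`, and `.status.availableReplicas`).

```shell
# rollout_incomplete DESIRED UPDATED AVAILABLE   (hypothetical helper)
# Succeeds (exit 0) when the rollout is incomplete, i.e. either count is
# below the desired replica count. Empty values are treated as 0, because
# the API omits these status fields when their value is zero.
rollout_incomplete() {
  desired=${1:-0}
  updated=${2:-0}
  available=${3:-0}
  [ "$updated" -lt "$desired" ] || [ "$available" -lt "$desired" ]
}

# Example: 3 desired, 1 updated, 2 available
#   rollout_incomplete 3 1 2 && echo "rollout incomplete"
```

Each count can be read individually with a read-only query such as `kubectl get deployment <name> -n <ns> -o jsonpath='{.status.updatedReplicas}'`.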
kubectl describe deployment <name> -n <ns>
Focus on:
- Conditions: Progressing (status and reason) and Available
- Strategy: maxSurge and maxUnavailable settings
kubectl get rs -n <ns> -l app=<name>
If the label app=<name> doesn't match, find ReplicaSets owned by the Deployment:
kubectl get rs -n <ns> --sort-by='.metadata.creationTimestamp' | grep <name>
You should see the old RS (with reduced replicas) and the new RS (scaling up). If the new RS has 0 ready replicas, its pods are failing.
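Matching on the name with `grep` can pick up unrelated ReplicaSets that share a prefix. As a possible alternative, the sketch below filters on each ReplicaSet's owner instead. The helper name `owned_rs` and its tab-separated input layout are assumptions made for this example, and it assumes every ReplicaSet in the namespace has an owner reference.

```shell
# owned_rs DEPLOYMENT   (hypothetical helper)
# Reads tab-separated lines "rs-name<TAB>owner<TAB>replicas<TAB>ready" on
# stdin and prints the ReplicaSets whose owner is DEPLOYMENT. A missing
# ready count (the API omits it when zero) is printed as 0.
owned_rs() {
  awk -F '\t' -v owner="$1" \
    '$2 == owner { printf "%s replicas=%s ready=%s\n", $1, $3 + 0, $4 + 0 }'
}

# Usage (placeholders as in the commands above):
#   kubectl get rs -n <ns> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.ownerReferences[0].name}{"\t"}{.status.replicas}{"\t"}{.status.readyReplicas}{"\n"}{end}' | owned_rs <name>
```

A ReplicaSet that reports `replicas` above 0 but `ready=0` is the one whose pods need inspection.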
Find pods from the new RS:
kubectl get pods -n <ns> -l app=<name> --sort-by='.metadata.creationTimestamp'
Check the status of the newest pods. Based on their status:
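- Pods stuck in Pending: use the pod-pending-debug skill
- Pods crashing or restarting: use the pod-crash-debug skill
- Image pull failures: use the image-pull-debug skill
ProgressDeadlineExceeded: Rollout timed out
The Deployment's spec.progressDeadlineSeconds (default 600s) has been exceeded without progress.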
This is a symptom, not a cause. The root cause is in the new pods — check step 5 for why new pods are not becoming ready.
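To confirm that this is the condition being reported, the reason can be read from the Deployment's status conditions rather than from the `describe` output. The following is a sketch; the helper name `progress_reason` and the tab-separated input layout are assumptions made for this example.

```shell
# progress_reason   (hypothetical helper)
# Reads tab-separated lines "type<TAB>status<TAB>reason" on stdin and prints
# the reason of the Progressing condition (e.g. ProgressDeadlineExceeded).
progress_reason() {
  awk -F '\t' '$1 == "Progressing" { print $3 }'
}

# Usage (placeholders as in the commands above):
#   kubectl get deployment <name> -n <ns> -o jsonpath='{range .status.conditions[*]}{.type}{"\t"}{.status}{"\t"}{.reason}{"\n"}{end}' | progress_reason
```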
MinimumReplicasUnavailable: Not enough replicas available
The Deployment cannot maintain the minimum number of available replicas during the rollout.
Check the new pods' status (step 5). The new version's pods are failing to start or pass readiness probes.
Running but not Ready: Readiness probe failing
The new pods started successfully but are failing their readiness probe. The rollout will not progress because the new pods are not considered available.
Check the readiness probe configuration and failures:
kubectl describe pod <new-pod> -n <ns>
Look for "Readiness probe failed" events. Common causes:
- initialDelaySeconds too low
maxSurge: 0 and maxUnavailable: 0 (invalid strategy)
If both maxSurge and maxUnavailable are 0, the rollout cannot proceed because it cannot create extra pods and cannot remove existing pods. At least one must be greater than 0.
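The strategy check above can be sketched as a small shell test. The helper names `is_zero` and `strategy_blocked` are hypothetical, and the sketch assumes the two values are passed in as they appear in the Deployment spec, either as absolute numbers or as percentages.

```shell
# is_zero VALUE   (hypothetical helper)
# Succeeds when VALUE is zero, written as "0" or "0%".
is_zero() {
  [ "$1" = "0" ] || [ "$1" = "0%" ]
}

# strategy_blocked MAX_SURGE MAX_UNAVAILABLE   (hypothetical helper)
# Succeeds (exit 0) when both settings are zero, so no pod can be added
# or removed during the rollout.
strategy_blocked() {
  is_zero "$1" && is_zero "$2"
}

# Example:
#   strategy_blocked 0 0 && echo "maxSurge and maxUnavailable are both 0"
```

The values themselves can be read with read-only queries such as `kubectl get deployment <name> -n <ns> -o jsonpath='{.spec.strategy.rollingUpdate.maxSurge}'`.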
The new ReplicaSet exists but has 0 replicas or cannot create pods:
kubectl describe rs <new-rs> -n <ns>
Check events for:
- quota-debug for native ResourceQuota/LimitRange diagnosis, or volcano-queue-diagnose for Volcano gang scheduling clusters
The old ReplicaSet keeps its replicas because the new RS pods are not ready yet. This is expected behavior: Kubernetes will not remove old pods until new pods are available. Fix the new pods first.
- kubectl rollout status waits for the rollout to complete by default. Use --timeout to avoid blocking.
- Rollback command: kubectl rollout undo deployment/<name> -n <ns> (but let the user decide, do not execute this).