skills/core/statefulset-debug/SKILL.md
Diagnose StatefulSet rollout and scaling failures (ordered update stuck, OnDelete not updating, partition misconfiguration, PVC binding deadlocks). Checks update strategy, pod ordinal progression, PVC bindings, and ordered startup to identify why a StatefulSet is not progressing.
When a StatefulSet rollout is stuck, pods are not updating, or scaling is not progressing, follow this flow to identify the root cause.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify the StatefulSet, delete pods, or change PVCs — that should be left to the user.
When to use: A StatefulSet is not progressing — pods are not updating to the new version, scaling up/down is stuck, or specific ordinal pods are not becoming ready.
Not for Deployments: Deployment rollouts have different semantics (parallel, unordered). Use deployment-rollout-debug for Deployments.
StatefulSets differ fundamentally from Deployments:
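- Pods are created in forward ordinal order and updated in reverse ordinal order
- Each pod gets its own PVC from volumeClaimTemplates
kubectl get statefulset <name> -n <ns> -o wide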
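Compare the output:
- READY: ready pods versus the desired count (spec.replicas)
- Update progress: whether the update has finished (currentRevision == updateRevision)
If READY is below the desired replica count, or currentRevision and updateRevision differ, the rollout or scaling is incomplete.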
kubectl describe statefulset <name> -n <ns>
Focus on:
- Update Strategy: RollingUpdate or OnDelete
First get the StatefulSet's pod selector to reliably find its pods:
kubectl get statefulset <name> -n <ns> -o jsonpath='{.spec.selector.matchLabels}'
Then use the returned labels to list pods:
kubectl get pods -n <ns> -l <key>=<value> --sort-by='.metadata.name'
Identify which ordinal pod is stuck. In a StatefulSet with OrderedReady policy, the stuck pod blocks all subsequent operations.
The StatefulSet uses updateStrategy.type: OnDelete. In this mode, Kubernetes does not automatically update pods — the user must manually delete each pod for it to be recreated with the new spec.
kubectl get statefulset <name> -n <ns> -o jsonpath='{.spec.updateStrategy}'
If the output shows {"type":"OnDelete"} or no rollingUpdate field:
Check if the current and update revisions differ:
kubectl get statefulset <name> -n <ns> -o jsonpath='current={.status.currentRevision} update={.status.updateRevision}'
If they differ, the StatefulSet spec has been updated but pods are still running the old version. This is expected behavior for OnDelete — pods must be manually deleted to pick up the new version.
Check which pods are still on the old revision (use the selector from step 3):
kubectl get pods -n <ns> -l <key>=<value> -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.metadata.labels.controller-revision-hash}{"\n"}{end}'
Pods whose controller-revision-hash matches currentRevision (not updateRevision) are still on the old version.
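The revision comparison above can be sketched as a small filter. This is a minimal illustration, not part of the skill: the function name pods_on_old_revision and the sample revision hashes are hypothetical, and it assumes the tab-separated "name, hash" output produced by the jsonpath command above is piped in on stdin.

```shell
# Hypothetical helper: reads "pod-name<TAB>controller-revision-hash" lines on
# stdin and prints the pods whose hash differs from the given updateRevision.
pods_on_old_revision() {
  local update_revision="$1"
  awk -F '\t' -v target="$update_revision" 'NF >= 2 && $2 != target { print $1 }'
}

# Made-up sample data, so no cluster access is needed to try it:
printf 'web-0\tweb-7d9f\nweb-1\tweb-7d9f\nweb-2\tweb-5c4b\n' \
  | pods_on_old_revision web-5c4b
# prints:
# web-0
# web-1
```

In practice the printf would be replaced by the read-only kubectl get pods command shown earlier, and the argument by the value of status.updateRevision.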
In RollingUpdate mode, StatefulSet updates pods in reverse ordinal order (N-1 → N-2 → ... → 0). By default (one-at-a-time), if pod at ordinal K is not Ready, the update stops — pods K-1, K-2, ..., 0 will not be updated.
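To make the reverse-order rule concrete, the following sketch lists the ordinals that stay on the old version when the pod at a given ordinal is stuck. It assumes the default one-at-a-time behavior described above; the function name blocked_ordinals is hypothetical.

```shell
# Hypothetical helper: given the ordinal K of the stuck pod, print the
# ordinals K-1 ... 0 that will not be updated until pod K becomes Ready.
blocked_ordinals() {
  local stuck="$1"
  local i=$((stuck - 1))
  local out=""
  while [ "$i" -ge 0 ]; do
    out="${out:+$out }$i"
    i=$((i - 1))
  done
  printf '%s\n' "$out"
}

blocked_ordinals 3   # prints: 2 1 0
```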
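Check maxUnavailable (introduced as an alpha field in Kubernetes 1.24 behind the MaxUnavailableStatefulSet feature gate, so it only takes effect on clusters where that feature is enabled):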
kubectl get statefulset <name> -n <ns> -o jsonpath='{.spec.updateStrategy.rollingUpdate.maxUnavailable}'
If maxUnavailable is set (e.g., 3), multiple pods can be updated simultaneously instead of strict one-at-a-time. In this case, seeing 2-3 pods updating at once is normal — not a sign of being stuck. Only investigate if the number of updating pods is below maxUnavailable for an extended period, or if specific pods are stuck in a non-Ready state.
Find pods that are not Ready:
kubectl get pods -n <ns> -l <key>=<value> --sort-by='.metadata.name'
Check the stuck pod's status:
- Pending: use pod-pending-debug
- Crashing or restarting: use pod-crash-debug
- Image pull failing: use image-pull-debug
If the pod is Running but not Ready:
kubectl describe pod <stuck-pod> -n <ns>
Look for Readiness probe failed events.
The StatefulSet has spec.updateStrategy.rollingUpdate.partition set. Only pods with ordinal ≥ partition are updated; pods with ordinal < partition remain on the old version.
kubectl get statefulset <name> -n <ns> -o jsonpath='{.spec.updateStrategy.rollingUpdate.partition}'
If this returns a number (e.g., 3), then pods 0, 1, 2 will NOT be updated. This is often used intentionally for canary rollouts — update a subset first, verify, then lower the partition to 0 to roll out fully.
If the user expects all pods to be updated, the partition value needs to be set to 0 or removed.
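The partition rule can be checked by hand with a short sketch like the one below. It only does arithmetic on the replica count and partition value read from the StatefulSet; the function name partition_split is hypothetical, and ordinals are assumed to start at 0.

```shell
# Hypothetical helper: split ordinals 0..replicas-1 into those that a
# RollingUpdate will touch (ordinal >= partition) and those it holds back.
partition_split() {
  local replicas="$1" partition="$2"
  local i=0 updated="" held=""
  while [ "$i" -lt "$replicas" ]; do
    if [ "$i" -ge "$partition" ]; then
      updated="${updated:+$updated }$i"
    else
      held="${held:+$held }$i"
    fi
    i=$((i + 1))
  done
  printf 'updated: %s\nheld back: %s\n' "$updated" "$held"
}

partition_split 5 3
# prints:
# updated: 3 4
# held back: 0 1 2
```

If the "held back" list is non-empty and the user expected a full rollout, the partition value is the finding to report.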
When scaling up, StatefulSet creates pods in forward ordinal order (0 → 1 → 2 → ...). Pod at ordinal K+1 is not created until pod K is Running and Ready.
kubectl get pods -n <ns> | grep <statefulset-name>
Find the highest ordinal pod that exists — the next ordinal is waiting for this pod to become Ready.
Check why the current highest pod is not Ready (same diagnosis as the "stuck at specific ordinal" pattern above).
Check the podManagementPolicy field:
kubectl get statefulset <name> -n <ns> -o jsonpath='{.spec.podManagementPolicy}'
If the policy is Parallel and pods are still stuck, the issue is not ordering — check individual pod status.
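Finding the highest existing ordinal, and therefore the ordinal that is waiting, can be sketched as a filter over the pod listing. This is an illustrative helper only: the name next_ordinal is hypothetical, it assumes the pod name is the first column of each input line (as in kubectl get pods output), and it assumes the ordered (non-Parallel) policy.

```shell
# Hypothetical helper: reads pod listing lines on stdin, keeps names of the
# form <statefulset-name>-<number>, and reports the highest ordinal found
# plus the next ordinal that is waiting to be created.
next_ordinal() {
  local sts="$1"
  awk -v prefix="$sts-" '
    BEGIN { max = -1 }
    index($1, prefix) == 1 {
      rest = substr($1, length(prefix) + 1)
      if (rest ~ /^[0-9]+$/ && rest + 0 > max) max = rest + 0
    }
    END { if (max >= 0) printf "highest=%d next=%d\n", max, max + 1 }
  '
}

# Made-up sample listing:
printf 'web-0 1/1 Running\nweb-1 1/1 Running\nweb-2 0/1 Pending\nother-5 1/1 Running\n' \
  | next_ordinal web
# prints: highest=2 next=3
```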
StatefulSet pods use volumeClaimTemplates to create per-pod PVCs. If the PVC is bound to a PV in a specific availability zone (AZ) or node, but that node/AZ has no resources, the pod cannot be scheduled.
Check PVC status for the stuck pod:
kubectl get pvc -n <ns> | grep <statefulset-name>
kubectl describe pvc <pvc-name> -n <ns>
Check the StorageClass's volumeBindingMode:
kubectl get storageclass $(kubectl get pvc <pvc-name> -n <ns> -o jsonpath='{.spec.storageClassName}') -o jsonpath='{.volumeBindingMode}'
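If the mode is WaitForFirstConsumer, the PVC is not bound until a pod that uses it has been scheduled. If that pod cannot be scheduled, the PVC stays Pending and the pod stays Pending: a deadlock.
Check if the PV has a node affinity constraint: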
kubectl get pv <pv-name> -o jsonpath='{.spec.nodeAffinity}'
If the PV is locked to a specific node/zone:
- Check the node's remaining resources and taints: kubectl describe node <node>
- Check that the node still exists and is Ready: kubectl get node <node>
Common scenario: A node was replaced or drained, but the PV is still bound to the old node's zone. The new pod can only be scheduled to nodes that can access this PV, but those nodes may be full or tainted.
For further PVC diagnosis, use the pvc-debug skill.
When a StatefulSet is scaled down, pods are deleted in reverse ordinal order (N-1 → N-2 → ...). However, Kubernetes does not automatically delete the associated PVCs.
kubectl get pvc -n <ns> | grep <statefulset-name>
If there are PVCs for ordinals that no longer exist (e.g., data-myapp-3 when replicas is 2), these are orphaned PVCs from a previous scale-down.
This is by design to prevent data loss. But when scaling back up, the new pod will reattach to the old PVC with stale data, which may cause application issues.
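Spotting orphaned PVCs can be sketched as a read-only filter over the PVC listing. The helper below is illustrative: the name orphaned_pvcs is hypothetical, it assumes the PVC name is the first column of each input line, and it assumes ordinals start at 0 so that any ordinal at or above the replica count is orphaned. It only prints names and changes nothing.

```shell
# Hypothetical helper: reads PVC listing lines on stdin and prints the PVCs
# named <template>-<statefulset>-<ordinal> whose ordinal is >= replicas.
orphaned_pvcs() {
  local template="$1" sts="$2" replicas="$3"
  awk -v prefix="$template-$sts-" -v replicas="$replicas" '
    index($1, prefix) == 1 {
      ordinal = substr($1, length(prefix) + 1)
      if (ordinal ~ /^[0-9]+$/ && ordinal + 0 >= replicas + 0) print $1
    }
  '
}

# The example from the text: replicas is 2, but a PVC for ordinal 3 exists.
printf 'data-myapp-0\ndata-myapp-1\ndata-myapp-3\n' | orphaned_pvcs data myapp 2
# prints: data-myapp-3
```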
Check the StatefulSet's persistentVolumeClaimRetentionPolicy (Kubernetes 1.27+):
kubectl get statefulset <name> -n <ns> -o jsonpath='{.spec.persistentVolumeClaimRetentionPolicy}'
During an update or scale-down, if a pod is stuck in Terminating, the next operation cannot proceed.
kubectl describe pod <terminating-pod> -n <ns>
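First check whether a PodDisruptionBudget (PDB) is involved:
kubectl get pdb -n <ns>
kubectl describe pdb <pdb-name> -n <ns>
A PDB limits only evictions made through the Eviction API (for example kubectl drain). Check status.disruptionsAllowed: if it is 0, the PDB's minAvailable or maxUnavailable limit has been reached and further evictions are refused until other pods become Ready, which can stall a node drain running alongside the rollout. The StatefulSet controller's own pod deletions during an update or scale-down are not subject to the PDB, so a pod that is already Terminating is not being held by it.
If the PDB is not the issue, check other common causes: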
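- Finalizers on the pod (metadata.finalizers) that have not been removed
- The container is not exiting and the pod is waiting for terminationGracePeriodSeconds to expire
Check the grace period: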
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.terminationGracePeriodSeconds}'
- OnDelete is frequently used in database StatefulSets (MySQL, PostgreSQL, etc.) where the operator wants manual control over when each replica is restarted. If a user complains that pods are not updating, check the strategy before assuming there is a bug.
- The partition field is for canary rollouts. A common workflow: set partition=N-1 to update only the last pod, verify, then set partition=0 to roll out to all pods. If a user sees partial updates, check partition before investigating further.
- PVCs created from volumeClaimTemplates follow the naming convention <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>. Use this pattern to find PVCs for specific ordinals.
- If the stuck pod is Pending, use pod-pending-debug. If it is crashing, use pod-crash-debug. If PVCs are not binding, use pvc-debug.
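The PVC naming convention above can be turned into a list of expected PVC names for comparison against the real listing. This is a small sketch with a hypothetical function name (expected_pvcs), assuming a single volumeClaimTemplate and ordinals starting at 0.

```shell
# Hypothetical helper: print the PVC names a StatefulSet is expected to own,
# following <volumeClaimTemplate-name>-<statefulset-name>-<ordinal>.
expected_pvcs() {
  local template="$1" sts="$2" replicas="$3"
  local i=0
  while [ "$i" -lt "$replicas" ]; do
    printf '%s-%s-%s\n' "$template" "$sts" "$i"
    i=$((i + 1))
  done
}

expected_pvcs data myapp 2
# prints:
# data-myapp-0
# data-myapp-1
```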