skills/democratic-csi-nvmeof-attach-failure/SKILL.md
Debug democratic-csi NVMe-oF volume attachment failures. Use when: (1) Pods are stuck in Init:0/1 and events show "MountVolume.MountDevice failed: unable to attach any nvme devices", (2) democratic-csi NodeStageVolume returns "unable to attach any nvme devices", (3) PVC provisioning succeeds but pods never start because volumes can't be mounted, (4) Node plugin logs show "connecting to transport: tcp://<IP>:4420" where the IP differs from the driver config. Covers four distinct root causes: nvme binary hostname resolution failure (libnss), NVMe-oF TCP port not reachable (wrong interface or service not running), kernel module availability, and stale transport IP stored in PV volumeAttributes at provision time.
Pods that require NVMe-oF PVCs are stuck in Init:0/1. PVCs provision successfully
(TrueNAS creates volumes), but attachment at the node fails. The democratic-csi node
plugin logs "unable to attach any nvme devices" on every NodeStageVolume call.
- Pod events show: MountVolume.MountDevice failed ... unable to attach any nvme devices
- Node plugin logs show: handler error - method: NodeStageVolume error: {"message":"unable to attach any nvme devices"}
- PVCs are Bound (provisioning OK) but pods never leave Init state
- Controller logs show CreateVolume responses from TrueNAS

Work through four failure modes in order:
Check controller logs to confirm volumes were created on TrueNAS:
kubectl logs -n democratic-csi <controller-pod> -c csi-driver --tail=50 | grep -i "CreateVolume\|error"
If CreateVolume succeeded but NodeStageVolume fails, the problem is on the node-to-storage transport path (network, kernel modules, or transport address), not the storage API credentials.
kubectl exec -n democratic-csi <node-pod> -c csi-driver -- cat /proc/modules | grep nvme
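Required modules: nvme_tcp, nvme_fabrics, nvme_core. /proc/modules lists only loaded modules, so if any are missing they are either not loaded on the host (try modprobe nvme_tcp there) or the host kernel lacks NVMe-oF support.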
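The module check above can be scripted. This is a minimal sketch, assuming the three module names listed above are the full requirement; the function name missing_nvme_modules is hypothetical. Modules compiled into the kernel do not appear in /proc/modules, so a "missing" result is a prompt to look closer, not proof.

```shell
# missing_nvme_modules: reads /proc/modules-formatted text on stdin and prints
# each required module that is not listed, one per line. Prints nothing when
# all three are present.
missing_nvme_modules() {
    loaded=$(awk '{print $1}')   # first column of /proc/modules is the module name
    for mod in nvme_tcp nvme_fabrics nvme_core; do
        printf '%s\n' "$loaded" | grep -qx "$mod" || printf '%s\n' "$mod"
    done
}

# Usage (placeholders as in the commands above):
#   kubectl exec -n democratic-csi <node-pod> -c csi-driver -- cat /proc/modules | missing_nvme_modules
```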
kubectl exec -n democratic-csi <node-pod> -c csi-driver -- \
nvme discover -t tcp -a <shareHost-value> -s <sharePort>
If output contains "No support for hostname IP address resolution; recompile with libnss support":
- The nvme binary in the container is statically compiled and can't resolve hostnames via NSS.
- Use an IP address, not a hostname, for shareHost.
Resolve the hostname from inside the cluster:
kubectl run -it --rm dns-test --image=busybox --restart=Never -- nslookup <hostname>
Then retry discovery with the resolved IP:
kubectl exec -n democratic-csi <node-pod> -c csi-driver -- \
nvme discover -t tcp -a <resolved-IP> -s <sharePort>
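Before running discovery, it can help to check whether the configured shareHost value is a hostname at all. The helper below is a rough sketch: the name share_host_kind is hypothetical, it only recognises dotted-quad IPv4 literals, and it does not validate octet ranges.

```shell
# share_host_kind: prints "ip" for a dotted-quad IPv4 literal and "hostname"
# for anything else. IPv6 literals are not handled in this sketch.
share_host_kind() {
    if printf '%s\n' "$1" | grep -Eq '^([0-9]{1,3}\.){3}[0-9]{1,3}$'; then
        echo ip
    else
        echo hostname
    fi
}
```

A result of "hostname" means the libnss limitation described above applies and the value should be replaced with the resolved IP.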
If output is "Connection refused" or "failed to get transport address":
Confirm with a busybox TCP probe:
kubectl run -it --rm tcp-test --image=busybox --restart=Never -- \
sh -c "nc -z -w 3 <IP> <port> && echo OPEN || echo REFUSED"
Common cause: the storage server (e.g., TrueNAS) binds the NVMe-oF target to a LAN
interface but shareHost in the driver config points to a different IP (e.g., Tailscale
overlay IP). The storage controller API (HTTP) works on the overlay but NVMe-oF TCP
doesn't.
Fix options:
- Set shareHost in the driver config to the LAN/storage-network IP.
If the config is managed via ESO (ExternalSecrets), update the backing secret in the secret store, then force a sync:
kubectl annotate externalsecret <name> -n <ns> force-sync=$(date +%s) --overwrite
After updating, restart the democratic-csi node plugin pods to pick up the new config.
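The two nvme discover failure signatures described above can be told apart mechanically. This sketch matches only the message fragments quoted in this document; the function name and the labels it prints are hypothetical, and other nvme-cli versions may word their errors differently.

```shell
# classify_discover_output: maps captured `nvme discover` output to a failure mode.
#   hostname-resolution -> the libnss message (use an IP for shareHost)
#   port-unreachable    -> connection refused / no transport address
#   unknown             -> neither signature matched
classify_discover_output() {
    case "$1" in
        *"recompile with libnss support"*)
            echo hostname-resolution ;;
        *"Connection refused"*|*"failed to get transport address"*)
            echo port-unreachable ;;
        *)
            echo unknown ;;
    esac
}
```

Capture the output of the discover command, including stderr, into a variable and pass it as the single argument.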
If steps 1–5 pass (kernel modules present, transport reachable at the correct IP, driver config looks right) but NodeStageVolume still fails, check whether the PV itself stores the wrong transport IP:
kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeAttributes.transports}'
Diagnostic signal: node plugin logs will show
connecting to transport: tcp://<wrong-IP>:4420 where <wrong-IP> differs from the
transports in the current driver config. This happens because volumeAttributes are
written at provision time from the storage backend API response and take precedence
over the current driver config for existing volumes.
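Comparing the stored transport against the expected storage-network IP can be sketched as follows. The assumptions are that the transports attribute holds URIs of the form tcp://<host>:<port> (as in the log line above) and that multiple entries, if present, are comma-separated; both function names are hypothetical.

```shell
# transport_hosts: prints the host part of each transport URI, one per line.
# Assumes tcp://<host>:<port> entries, comma-separated when there are several.
transport_hosts() {
    printf '%s\n' "$1" | tr ',' '\n' \
        | sed -E 's#^[[:space:]]*[a-z]+://##; s#:[0-9]+.*$##'
}

# pv_transport_is_stale <transports> <expected-ip>: succeeds (exit 0) when
# none of the stored transport hosts equals the expected IP.
pv_transport_is_stale() {
    ! transport_hosts "$1" | grep -qxF "$2"
}

# Usage (placeholders as in the command above):
#   t=$(kubectl get pv <pv-name> -o jsonpath='{.spec.csi.volumeAttributes.transports}')
#   pv_transport_is_stale "$t" <storage-network-IP> && echo "PV stores a stale transport"
```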
Fix: spec.persistentvolumesource is immutable, so the PV cannot be patched in place.
See kubernetes-csi-pv-spec-update for the replace --force + finalizer removal
procedure.
Root fix: Correct the NVMe-oF port binding on TrueNAS so future volumes are provisioned with the correct storage-network IP. Existing PVs must still be replaced manually.
# Confirm NodeStageVolume no longer errors
kubectl logs -n democratic-csi <node-pod> -c csi-driver --tail=20 | grep -i "error\|NodeStage"
# Check pod status
kubectl get pods -n <app-namespace>
Pods should transition from Init:0/1 to Running.
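For a quick pass/fail signal, the attach error can be counted in the node plugin log tail. This is a small sketch that matches only the message text quoted in this document; the function name attach_error_count is hypothetical.

```shell
# attach_error_count: reads log text on stdin and prints how many lines contain
# the NVMe attach failure message. Always exits 0, even when the count is 0.
attach_error_count() {
    grep -c 'unable to attach any nvme devices' || true
}

# Usage (placeholders as in the commands above):
#   kubectl logs -n democratic-csi <node-pod> -c csi-driver --tail=200 | attach_error_count
```

A count of 0 after pods have been rescheduled suggests that staging now succeeds; older errors can still appear if the log tail reaches back before the fix.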
- hostPID: true on the node DaemonSet is required for nsenter-based nvme operations if you go that route.
- The driver config secret key must be driver-config-file.yaml (democratic-csi chart requirement).