skills/core/volcano-resource-insufficient/SKILL.md
Resource shortage diagnostic guide for Volcano. Use when seeing Insufficient cpu/memory events, OOMKilled pods, or nodes with zero allocatable resources.
This guide helps diagnose resource shortage issues in Volcano-scheduled workloads. Resource insufficiency is one of the most common causes of scheduling failures.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify resource quotas or delete workloads.
Use this skill when:
- Events report Insufficient cpu or Insufficient memory
- Pods stay Pending with resource-related events
- Pods are OOMKilled (Out of Memory)
- FailedScheduling events mention resource constraints

Resource shortages take four main forms:
- The entire cluster lacks sufficient resources to meet the workload demands.
- Total resources exist but are distributed across too many nodes to satisfy specific scheduling constraints (like Gang scheduling).
- Individual nodes lack enough resources, even though the cluster as a whole has capacity.
- The Queue has reached its deserved resource limit, preventing new pods from being scheduled.
Check the specific error message in events:
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'
Common error patterns:
| Error Message | Resource Type | Scope |
|--------------|---------------|-------|
| Insufficient cpu | CPU | Node-level |
| Insufficient memory | Memory | Node-level |
| Insufficient nvidia.com/gpu | GPU | Node-level |
| 0/N nodes are available | General | Cluster-level |
| exceeded quota | Queue limit | Queue-level |
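
The table above can be turned into a small lookup for triaging event messages. The following is a minimal sketch: the function name `classify_event` and the labels it prints are illustrative assumptions, not Volcano or kubectl output. Specific patterns are checked before the general "nodes are available" pattern, because a real scheduler message often contains both.

```shell
#!/bin/sh
# classify_event MESSAGE
# Prints the scope suggested by the error-pattern table for one event message.
# Hypothetical helper: the name and the printed labels are for illustration only.
classify_event() {
  case "$1" in
    *"Insufficient nvidia.com/gpu"*) echo "node-level (GPU)" ;;
    *"Insufficient cpu"*)            echo "node-level (CPU)" ;;
    *"Insufficient memory"*)         echo "node-level (memory)" ;;
    *"exceeded quota"*)              echo "queue-level" ;;
    *"nodes are available"*)         echo "cluster-level" ;;
    *)                               echo "unknown" ;;
  esac
}

# Example with a literal message; in practice the message would come from the events query.
classify_event "0/5 nodes are available: 5 Insufficient cpu."
```

A message that matches none of the patterns is reported as unknown rather than guessed at.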
Determine how much of each resource the pod is requesting:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'
For detailed breakdown:
kubectl get pod <pod-name> -n <namespace> -o yaml | grep -A 10 "resources:"
Key fields:
- resources.requests.cpu - CPU cores requested (e.g., "100m" = 0.1 core, "2" = 2 cores)
- resources.requests.memory - Memory requested (e.g., "1Gi", "512Mi")
- resources.requests.nvidia.com/gpu - GPUs requested
- resources.limits - Maximum allowed (may be different from requests)

View total allocatable resources per node:
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory,GPU:.status.allocatable.nvidia\.com/gpu,PODS:.status.allocatable.pods'
For detailed node information:
kubectl describe node <node-name>
Key concepts:
- allocatable = Total capacity - System reserved - Kube reserved - Eviction threshold
- capacity = Total hardware capacity

If metrics-server is available:
kubectl top nodes
For the pods with the highest usage across all namespaces:
kubectl top pods --all-namespaces --sort-by=cpu | head -20
kubectl top pods --all-namespaces --sort-by=memory | head -20
Note: If metrics-server is not available, you can still see resource allocation (requests) but not actual usage.
For each node, calculate available resources:
Available CPU = allocatable.cpu - sum(all pod requests on node)
Available Memory = allocatable.memory - sum(all pod requests on node)
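
The CPU formula above can be sketched in plain shell once the quantities are normalized to millicores. This is an illustrative sketch only: `to_millicores` and `available_millicores` are hypothetical helper names, and the request values are assumed to have been collected already from the pods on the node.

```shell
#!/bin/sh
# to_millicores QUANTITY
# Converts a Kubernetes CPU quantity to millicores: "100m" -> 100, "2" -> 2000.
to_millicores() {
  case "$1" in
    *m) echo "${1%m}" ;;
    *)  awk -v c="$1" 'BEGIN { printf "%.0f\n", c * 1000 }' ;;
  esac
}

# available_millicores ALLOCATABLE REQUEST...
# Implements: available = allocatable - sum(all pod requests on node).
available_millicores() {
  avail=$(to_millicores "$1")
  shift
  for req in "$@"; do
    avail=$((avail - $(to_millicores "$req")))
  done
  echo "$avail"
}

# A 16-core node running pods that request 4, 500m and 2500m.
available_millicores 16 4 500m 2500m   # prints 9000
```

The same subtraction applies to memory after converting Ki, Mi and Gi suffixes to a common unit.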
Quick check with:
kubectl describe node <node-name> | grep -A 20 "Allocated resources"
Look for:
- cpu-requests vs cpu-capacity
- memory-requests vs memory-capacity

For Gang scheduling or affinity constraints, fragmentation is critical:
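# List nodes whose allocatable resources can fit a single pod
# (compares against allocatable, not against what is currently free)
NODE_CPU_REQ="4"
NODE_MEM_REQ="8Gi"
kubectl get nodes -o json | jq -r --arg cpu "$NODE_CPU_REQ" --arg mem "$NODE_MEM_REQ" '
def cores: if endswith("m") then (rtrimstr("m") | tonumber) / 1000 else tonumber end;
def gib: if endswith("Ki") then (rtrimstr("Ki") | tonumber) / 1048576
         elif endswith("Mi") then (rtrimstr("Mi") | tonumber) / 1024
         elif endswith("Gi") then (rtrimstr("Gi") | tonumber)
         else tonumber / 1073741824 end;
.items[] |
select((.status.allocatable.cpu | cores) >= ($cpu | cores)) |
select((.status.allocatable.memory | gib) >= ($mem | gib)) |
.metadata.name'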
Fragmentation indicators:
Symptom: Insufficient cpu or Insufficient memory on all nodes
Diagnosis:
# Check pod request
kubectl get pod <pod> -n <namespace> -o jsonpath='{.spec.containers[0].resources.requests.cpu}'
# Output: 32
# Check largest node's allocatable
kubectl get nodes -o jsonpath='{range .items[*]}{.metadata.name}{"\t"}{.status.allocatable.cpu}{"\n"}{end}' | sort -k2 -n | tail -1
# Output: node-1 16
Analysis: Pod requests 32 CPUs, but largest node only has 16 allocatable.
Solution:
Symptom: Most nodes show high allocation percentage
Diagnosis:
kubectl describe node <node-name> | grep -A 8 "Allocated resources"
# cpu-requests: 14900m (93%)
# memory-requests: 55000Mi (85%)
Analysis: Nodes are 85-93% allocated, leaving little room for new pods.
Solution:
Symptom: Gang scheduling fails despite total resources being sufficient
Diagnosis:
# Total cluster CPU
kubectl get nodes -o jsonpath='{range .items[*]}{.status.allocatable.cpu}{"\n"}{end}' | awk '{ sum += ($1 ~ /m$/) ? $1 / 1000 : $1 } END { print sum }'
# Output: 64
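# Nodes whose allocatable CPU fits at least one 4-CPU pod
kubectl get nodes --no-headers -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu' | awk '{ c = ($2 ~ /m$/) ? $2 / 1000 : $2 + 0; if (c >= 4) count++ } END { print count + 0 " nodes can fit the pod" }'
# Output: 2 nodes can fit the pod
# But we need 8 pods for Gang
# If those 2 nodes cannot hold all 8 pods between them, Gang fails
Analysis: The cluster has 64 CPUs in total, but only 2 nodes have 4 or more allocatable CPUs. The Gang needs 8 pods of 4 CPUs each, so unless those 2 nodes can host all 8 pods, the remaining CPUs are too fragmented to use.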
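
Counting nodes alone can mislead, because a large node may hold several pods of the gang. A closer estimate is to sum the pod slots per node, where each node contributes floor(free CPU / request). The sketch below assumes the free CPU per node (in whole or fractional cores) has already been worked out; `gang_slots` is a hypothetical helper name.

```shell
#!/bin/sh
# gang_slots REQUEST_CORES FREE_CORES...
# Prints how many pods requesting REQUEST_CORES fit across the given nodes,
# summing floor(free / request) per node. REQUEST_CORES must be greater than 0.
gang_slots() {
  req="$1"
  shift
  printf '%s\n' "$@" | awk -v req="$req" '{ slots += int($1 / req) } END { print slots + 0 }'
}

# Eight nodes with 32 free CPUs in total, but only two can host a 4-CPU pod.
gang_slots 4 7 7 3 3 3 3 3 3   # prints 2

# Two nodes with 16 free CPUs each hold all 8 pods of the gang.
gang_slots 4 16 16             # prints 8
```

If the printed slot count is below the gang size, the gang cannot be placed even though the summed free CPU looks sufficient.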
Solution:
- Enable the binpack plugin to concentrate pods

Symptom: Events mention queue limits, PodGroup stays in Pending
Diagnosis:
# Check queue status
kubectl get queue <queue-name> -o yaml
Look for:
- status.allocated >= status.deserved
- state is Open but no capacity available

Analysis: Queue has used all its deserved resources.
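
The allocated-versus-deserved comparison can be sketched as follows for CPU. This is an illustration under assumptions: the two quantities are taken to have been read from the queue fields listed above, and `queue_cpu_state` is a hypothetical helper name, not part of Volcano.

```shell
#!/bin/sh
# queue_cpu_state ALLOCATED DESERVED
# Prints "saturated" when ALLOCATED >= DESERVED, otherwise "headroom".
# Both arguments are CPU quantities such as "32" or "32000m".
queue_cpu_state() {
  awk -v a="$1" -v d="$2" '
    # Convert a CPU quantity to cores, accepting an optional "m" suffix.
    function cores(q) { return (q ~ /m$/) ? substr(q, 1, length(q) - 1) / 1000 : q + 0 }
    BEGIN {
      state = (cores(a) >= cores(d)) ? "saturated" : "headroom"
      print state
    }'
}

queue_cpu_state 32000m 32   # prints saturated
queue_cpu_state 16 32       # prints headroom
```

A saturated result for any single resource is enough to explain a PodGroup that stays in Pending.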
Solution:
- Use volcano-queue-diagnose for detailed analysis

Symptom: Insufficient nvidia.com/gpu in events
Diagnosis:
# Check GPU allocatable
kubectl get nodes -o custom-columns='NAME:.metadata.name,GPU:.status.allocatable.nvidia\.com/gpu'
# Check GPU capacity, allocatable and requests per node
# (kubectl top reports only CPU and memory, so it cannot show GPU usage)
kubectl describe nodes | grep -E '^Name:|nvidia\.com/gpu'
Analysis: GPU resources are fully allocated or fragmented across nodes.
Solution:
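# Total allocatable CPU and memory across the cluster
kubectl get nodes -o json | jq -r '
def cores: if endswith("m") then (rtrimstr("m") | tonumber) / 1000 else tonumber end;
def kib: if endswith("Ki") then (rtrimstr("Ki") | tonumber)
         elif endswith("Mi") then (rtrimstr("Mi") | tonumber) * 1024
         elif endswith("Gi") then (rtrimstr("Gi") | tonumber) * 1048576
         else tonumber / 1024 end;
[.items[].status.allocatable] |
{cpu_cores: (map(.cpu | cores) | add),
 memory_gib: ((map(.memory | kib) | add) / 1048576)}'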
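# Pods with at least one container requesting more than 4 CPU cores
kubectl get pods --all-namespaces -o json | jq -r '
def cores: if endswith("m") then (rtrimstr("m") | tonumber) / 1000 else tonumber end;
.items[] |
select(any(.spec.containers[]; (.resources.requests.cpu // "0" | cores) > 4)) |
"\(.metadata.namespace)/\(.metadata.name): \([.spec.containers[].resources.requests])"'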
# High-level view
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU_ALLOC:.status.allocatable.cpu,MEM_ALLOC:.status.allocatable.memory'
Right-size resource requests
Use cluster autoscaler
Enable binpack plugin
Monitor resource quotas; use volcano-queue-diagnose proactively
Regular capacity planning
Related skills:
- volcano-diagnose-pod - General Pod scheduling diagnosis
- volcano-gang-scheduling - Gang scheduling constraint issues
- volcano-queue-diagnose - Queue resource analysis
- volcano-node-resources - Node resource querying