skills/core/volcano-gang-scheduling/SKILL.md
Gang Scheduling diagnostic guide for Volcano. Use when PodGroup cannot schedule completely, member Pods remain Pending, or minAvailable/minMember constraints are not satisfied.
This is a diagnostic guide for Gang scheduling issues in Volcano. Gang scheduling requires that at least minMember pods of a PodGroup be schedulable at the same time. If the cluster cannot satisfy the minMember requirement, none of the pods will be scheduled.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify PodGroups or resource configurations.
Use this skill when:
- PodGroup is Inqueue but member Pods remain Pending
- minMember related errors
- minAvailable or minMember that cannot be satisfied
- FailedScheduling events mentioning Gang constraints

Gang scheduling in Volcano ensures that either all members of a workload are scheduled, or none are. This is crucial for distributed workloads such as MPI, TensorFlow, and PyTorch, where partial scheduling wastes resources.
Key Concepts:
- minMember (in PodGroup spec): Minimum number of pods that must be scheduled simultaneously
- minResources (in PodGroup spec): Aggregate resource floor (e.g., total GPUs) that must be available. If both minMember and minResources are set, both must be satisfied
- minAvailable (in Job spec): Similar concept at Job level

Find the PodGroup associated with the pending pods:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.metadata.annotations.scheduling\.volcano\.sh/pod-group}'
Get detailed PodGroup information:
kubectl get podgroup <podgroup-name> -n <namespace> -o yaml
Key fields to examine:
| Field | Meaning | What to Look For |
|-------|---------|------------------|
| spec.minMember | Minimum pods required | Is this number achievable? |
| spec.minResources | Aggregate resource floor | Is total cluster capacity sufficient? |
| status.phase | Current scheduling phase | Inqueue means the PodGroup is ready to schedule |
| status.running | Currently running pods | Compare to minMember |
| status.pending | Pending pods | These are waiting for the Gang constraint to be met |
| spec.queue | Queue name | Check if queue has sufficient resources |
Common scenarios:
- status.phase: Pending - PodGroup is waiting to be enqueued
- status.phase: Inqueue - Ready for scheduling but constraint not met
- status.running < spec.minMember - Gang constraint not satisfied

Calculate the total resources needed for the Gang:
Total CPU = minMember × single Pod CPU request
Total Memory = minMember × single Pod Memory request
Total GPU = minMember × single Pod GPU request (if applicable)
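The three formulas above can be sketched as a small shell helper. This is only an illustration: the function name `gang_total`, the example values, and the units (CPU in millicores, memory in Mi, GPUs as whole devices) are assumptions, not part of Volcano or this skill.

```shell
# Sketch: total resources a Gang needs = minMember x per-pod request.
# Assumed units (hypothetical): CPU in millicores, memory in Mi, GPU in whole devices.

# gang_total MIN_MEMBER PER_POD_AMOUNT -> prints the product
gang_total() {
  local min_member="$1" per_pod="$2"
  echo $(( min_member * per_pod ))
}

# Hypothetical Gang: 4 workers, each requesting 2000m CPU, 8192Mi memory, 1 GPU
min_member=4
echo "cpu_millicores=$(gang_total "$min_member" 2000)"
echo "memory_mi=$(gang_total "$min_member" 8192)"
echo "gpus=$(gang_total "$min_member" 1)"
```

Plug in the real minMember from the PodGroup and the per-pod requests returned by the command that follows.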
Get a pod's resource requests:
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.containers[*].resources.requests}'
View available resources across nodes:
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory,GPU:.status.allocatable.nvidia\.com/gpu'
Check current resource usage:
kubectl top nodes
If pods target specific nodes:
kubectl get nodes -l <label-key>=<label-value> -o wide
Look for Gang-specific scheduling errors:
kubectl get events -n <namespace> --field-selector involvedObject.name=<pod-name> --sort-by='.lastTimestamp'
Common Gang-related event messages:
| Message | Meaning | Investigation |
|---------|---------|---------------|
| minMember not satisfied | Gang constraint preventing scheduling | Check if total resources >= minMember requirements |
| gang member not ready | Some pods in the gang are not ready | Check individual pod status |
| resource insufficient | Not enough resources for all members | Use volcano-resource-insufficient skill |
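When many events are returned, it can help to bucket messages by the rows of the table above. The following is a minimal sketch: the function name `classify_gang_event` and the bucket labels are hypothetical, and the patterns are taken only from the messages listed in the table, so real scheduler messages may be worded differently.

```shell
# Sketch: map an event message to one of the categories from the table above.
# The bucket labels are hypothetical names chosen for this example.
classify_gang_event() {
  case "$1" in
    *"minMember not satisfied"*) echo "gang-constraint" ;;
    *"gang member not ready"*)   echo "member-not-ready" ;;
    *"resource insufficient"*)   echo "resource-shortage" ;;
    *)                           echo "other" ;;
  esac
}

# Hypothetical event message, for illustration only
classify_gang_event "pod group is not ready: minMember not satisfied"
```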
If the PodGroup is in a Queue, check if the queue has sufficient deserved resources:
kubectl get queue <queue-name>
kubectl describe queue <queue-name>
Look for:
- status.deserved vs status.allocated
- status.state is Open (not Closing or Closed)

Symptom: minMember is larger than the number of available nodes, or requires more resources than any single node can provide.
Example:
Solution:
- Reduce minMember in PodGroup spec

Symptom: Total cluster resources are sufficient, but not concentrated on enough nodes to satisfy simultaneous scheduling.
Example:
Solution:
- Enable the binpack plugin to concentrate pods on fewer nodes

Symptom: Resources exist but are being used by lower-priority workloads that should be preempted.
Check:
Solution:
- Verify that the priority plugin is enabled in scheduler config

Symptom: The PodGroup's queue has used all its deserved resources.
Check:
kubectl get queue <queue-name> -o jsonpath='{.status.allocated}'
kubectl get queue <queue-name> -o jsonpath='{.status.deserved}'
Solution:
- Use volcano-queue-diagnose for detailed analysis

Symptom: Queue shows available capacity, but Gang still blocks. Pod scheduling constraints narrow the effective node pool below what Gang requires.
Diagnosis — compute the effective node pool:
# 1. Check pod's nodeSelector
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.nodeSelector}'
# 2. Check matching nodes
kubectl get nodes -l <selector-key>=<selector-value> -o custom-columns="NAME:.metadata.name,CPU:.status.allocatable.cpu,GPU:.status.allocatable.nvidia\.com/gpu"
# 3. Check tolerations (tainted nodes require matching tolerations)
kubectl get pod <pod-name> -n <namespace> -o jsonpath='{.spec.tolerations}'
kubectl get nodes -o custom-columns="NAME:.metadata.name,TAINTS:.spec.taints[*].key"
Volcano scheduling is two-phase: first queue-level admission (capacity check), then node-level placement. A job can pass the queue check but fail node placement if all matching nodes are occupied.
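To see why a Gang can pass the queue check and still fail node placement, count how many whole pods fit on each matching node instead of summing free capacity. The sketch below is a simplification that looks at a single resource dimension and takes the free amounts as plain integers. The function name `pods_that_fit` and the numbers are hypothetical, and it does not reproduce the scheduler's actual placement logic.

```shell
# Sketch: count how many pods of a given size fit across a set of nodes.
# pods_that_fit PER_POD FREE_ON_NODE... -> prints the number of placeable pods
pods_that_fit() {
  local per_pod="$1"
  shift
  local total=0 free
  for free in "$@"; do
    # Integer division: a node with less free capacity than one pod contributes 0
    total=$(( total + free / per_pod ))
  done
  echo "$total"
}

# Hypothetical pool: each pod needs 4 GPUs; free GPUs per node are 2, 2, 2, 2, 8
echo "aggregate_free=$(( 2 + 2 + 2 + 2 + 8 ))"        # 16 GPUs free in total
echo "placeable_pods=$(pods_that_fit 4 2 2 2 2 8)"    # only 2 pods actually fit
```

In this example 16 free GPUs suggest room for four pods, but only two can be placed, so a Gang with minMember of 4 stays Pending.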
Solution:
Symptom: Queue allocated < deserved, PodGroup is Inqueue, but pods remain Pending.
Check — verify remaining capacity vs Gang requirement:
# Queue remaining capacity
kubectl get queue <queue> -o jsonpath='{"deserved: "}{.status.deserved}{"\nallocated: "}{.status.allocated}'
# PodGroup minMember and minResources
kubectl get podgroup <pg> -n <ns> -o jsonpath='{"minMember: "}{.spec.minMember}{"\nminResources: "}{.spec.minResources}'
Calculate: remaining = deserved - allocated. If remaining < minMember × per-pod-resources, the Gang cannot be satisfied even though the queue is not fully used.
If minResources is set, also verify: remaining >= minResources for each resource dimension.
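The headroom check above can be sketched as follows for one resource dimension. The function name `gang_fits_queue` and the example numbers are hypothetical, and the inputs are assumed to be already converted to plain integers in a common unit.

```shell
# Sketch: does the queue's remaining capacity cover the whole Gang?
# gang_fits_queue DESERVED ALLOCATED MIN_MEMBER PER_POD
# Prints the two quantities and returns 0 only if remaining >= needed.
gang_fits_queue() {
  local deserved="$1" allocated="$2" min_member="$3" per_pod="$4"
  local remaining=$(( deserved - allocated ))
  local needed=$(( min_member * per_pod ))
  echo "remaining=${remaining} needed=${needed}"
  [ "$remaining" -ge "$needed" ]
}

# Hypothetical queue: 16 GPUs deserved, 12 allocated; Gang of 8 pods, 1 GPU each
gang_fits_queue 16 12 8 1 || echo "Gang blocked: queue headroom is too small"
```

Repeat the check for each resource dimension (CPU, memory, GPU); the Gang is blocked if any one of them falls short.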
Solution:
- Reduce minMember or minResources if the job can tolerate partial scheduling

Symptom: Job was Running, then moves to Aborted. Running pod count dropped below minMember.
This happens when pods are evicted (preemption, node failure, OOM) and the remaining count falls below the Gang constraint, causing the entire group to be torn down.
Check:
# Current running vs required
kubectl get podgroup <pg> -n <ns> -o jsonpath='{"running: "}{.status.running}{"\nminMember: "}{.spec.minMember}'
# Check for eviction/preemption events
kubectl get events -n <ns> --field-selector reason=Preempted
kubectl get events -n <ns> --field-selector reason=Evicted
Solution:
- Set reclaimable: false on the queue to prevent preemption

After identifying the issue, verify your analysis:
Check if issue is Gang-specific:
Calculate minimum requirements:
Check scheduler logs:
# Use volcano-scheduler-logs skill
bash skills/core/volcano-scheduler-logs/scripts/get-scheduler-logs.sh --keyword gang
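Gang Scheduling constraint: The cluster must have enough free resources to schedule minMember Pods at the same time. The Pods do not have to land on different nodes unless other constraints (such as anti-affinity) require it.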
Even if total cluster resources are sufficient, if resources are released gradually over time (as other pods complete), the "simultaneous" requirement may not be met.
Distinguish between:
- volcano-diagnose-pod - General Pod scheduling diagnosis
- volcano-queue-diagnose - Queue status and resource analysis
- volcano-resource-insufficient - Resource shortage diagnosis
- volcano-scheduler-logs - Scheduler log analysis