skills/core/volcano-queue-diagnose/SKILL.md
Diagnose Volcano Queue status and resource allocation. Check queue weights, deserved resources, allocated resources, and identify queue-related scheduling bottlenecks.
Diagnose Volcano Queue status, resource allocation, and scheduling bottlenecks. This skill helps understand how resources are distributed across queues and why workloads may be pending due to queue constraints.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify queue configurations or delete queues.
Not applicable to native ResourceQuota: Volcano Queue and Kubernetes ResourceQuota are independent mechanisms. If the cluster does not use Volcano, use quota-debug instead. To check: kubectl get queue 2>/dev/null — if it returns an error or empty, Volcano is not installed.
bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh [options]
| Parameter | Required | Description |
|-----------|----------|-------------|
| --queue QUEUE | no | Queue name to diagnose (default: all queues) |
| --show-pods | no | Show pods associated with each queue |
| --verbose | no | Show detailed resource breakdown |
Diagnose all queues:
bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh
Diagnose specific queue:
bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh --queue training-queue
Show verbose output with pod information:
bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh --queue training-queue --show-pods --verbose
In Volcano, a Queue is a cluster-level resource allocation unit. Jobs and PodGroups are submitted to queues, and the scheduler distributes resources among queues based on:
Important: A Queue is a cluster-scoped resource. PodGroups from any namespace can reference the same queue, so cross-namespace resource competition within a queue is expected.
| Field | Meaning |
|-------|---------|
| state | Queue state: Open, Closed, Closing |
| deserved | Resources the queue should receive based on weight |
| allocated | Resources currently allocated to jobs in this queue |
| used | Resources actually used by running pods (≤ allocated) |
| pending | Number of PodGroups waiting in the queue |
| running | Number of running PodGroups |
Get an overview of all queues:
kubectl get queue
Output columns:
Get detailed information about a specific queue:
kubectl get queue <queue-name> -o yaml
kubectl describe queue <queue-name>
Key sections to examine:
spec:
  weight: 10              # Relative weight (default: 1)
  capability:             # Max resources allowed
    cpu: "100"
    memory: "200Gi"
  reclaimable: true       # Allow resource reclamation
status:
  state: Open             # Open, Closed, or Closing
  pending: 5              # PodGroups waiting
  running: 10             # Running PodGroups
  deserved:               # Resources this queue should get
    cpu: "40"
    memory: "80Gi"
  allocated:              # Resources actually allocated
    cpu: "35"
    memory: "70Gi"
Calculate utilization ratios:
Allocation Ratio = allocated / deserved
Utilization Ratio = used / allocated
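The ratios above can be computed from the values reported in the queue status. The following is a minimal sketch, not part of diagnose-queue.sh: the function names `cpu_to_milli` and `quantity_ratio` are hypothetical, and it only handles CPU quantities (plain cores or millicores such as `500m`). Memory quantities such as `80Gi` would need their own unit conversion.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the skill's script).

# Convert a Kubernetes CPU quantity ("35", "1.5", or "500m") to millicores.
cpu_to_milli() {
  case "$1" in
    *m) printf '%s\n' "${1%m}" ;;
    *)  awk -v v="$1" 'BEGIN { printf "%.0f\n", v * 1000 }' ;;
  esac
}

# Print numerator/denominator as a two-decimal ratio,
# or "n/a" when the denominator is zero or missing.
quantity_ratio() {
  local num den
  num=$(cpu_to_milli "${1:-0}")
  den=$(cpu_to_milli "${2:-0}")
  awk -v n="$num" -v d="$den" 'BEGIN {
    if (d == 0) { print "n/a"; exit }
    printf "%.2f\n", n / d
  }'
}

# Allocation ratio for the sample status: allocated cpu 35, deserved cpu 40.
quantity_ratio 35 40
```

A result at or above 1.00 corresponds to a queue that has reached or exceeded its deserved share for that resource.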
Interpretation:
- allocated >= deserved: Queue is at or over its fair share
- allocated < deserved: Queue has room to grow
- used << allocated: Jobs have reserved resources but are not using them

Find workloads associated with a queue:
# Find all PodGroups in a queue
kubectl get podgroups --all-namespaces -o json | \
jq -r '.items[] | select(.spec.queue=="<queue-name>") | "\(.metadata.namespace)/\(.metadata.name)"'
# Check pending PodGroups
kubectl get podgroups --all-namespaces -o json | \
jq -r '.items[] | select(.spec.queue=="<queue-name>" and .status.phase=="Pending") |
"\(.metadata.namespace)/\(.metadata.name): \(.status.phase)"'
Look for queue-related events:
kubectl get events --all-namespaces --field-selector reason=FailedScheduling | grep -i queue
Symptom: allocated >= deserved, new PodGroups stay in Pending
Check:
kubectl get queue <queue> -o jsonpath='{"\nDeserved: "}{.status.deserved}{"\nAllocated: "}{.status.allocated}{"\nRatio: "}{.status.allocated.cpu}{"/"}{.status.deserved.cpu}{"\n"}'
For GPU-specific checks (GPU is often the bottleneck):
kubectl get queue -o custom-columns="NAME:.metadata.name,GPU_CAP:.spec.capability['nvidia.com/gpu'],GPU_ALLOC:.status.allocated['nvidia.com/gpu']"
Also cross-validate capability against actual cluster capacity — a common misconfiguration is setting spec.capability higher than the cluster's physical resources:
kubectl get nodes -o custom-columns="NAME:.metadata.name,GPU:.status.allocatable['nvidia.com/gpu'],CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory"
If the sum of all nodes' allocatable GPUs is less than the queue's spec.capability, the queue can never be fully utilized. When allocation reaches the cluster's physical limit, the queue appears to have remaining capacity but no more resources can actually be scheduled.
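The comparison described above can be sketched as follows. This is an illustration under assumptions: the function names `sum_node_gpus` and `check_gpu_capability` are hypothetical, the input is the node listing above with `--no-headers` added (so the GPU count is the second column), and the capability is passed in as a plain integer.

```bash
#!/usr/bin/env bash
# Hypothetical helpers (not part of the skill's script).

# Sum the GPU column (second field) of the node listing read from stdin.
# Non-numeric cells such as "<none>" are counted as 0.
sum_node_gpus() {
  awk '$2 ~ /^[0-9]+$/ { total += $2 } END { print total + 0 }'
}

# Compare a queue's GPU capability with the cluster-wide allocatable total.
check_gpu_capability() {
  local capability="$1" cluster_total="$2"
  if [ "$capability" -gt "$cluster_total" ]; then
    echo "WARN: capability $capability exceeds cluster allocatable $cluster_total"
  else
    echo "OK: capability $capability fits within cluster allocatable $cluster_total"
  fi
}

# Made-up sample rows standing in for real `kubectl get nodes --no-headers` output.
total=$(printf '%s\n' 'node-a 8 64 256Gi' 'node-b 8 64 256Gi' 'node-c <none> 32 128Gi' | sum_node_gpus)
check_gpu_capability 32 "$total"
```

With the sample rows, the cluster total is 16 GPUs, so a capability of 32 is flagged as unreachable.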
Solution:
Symptom: status.state: Closed, new PodGroups rejected
Check:
kubectl get queue <queue> -o jsonpath='{.status.state}'
Solution:
Symptom: One queue gets all resources, others starve
Check:
kubectl get queue -o custom-columns='NAME:.metadata.name,WEIGHT:.spec.weight,STATE:.status.state,CPU_DESERVED:.status.deserved.cpu,CPU_ALLOC:.status.allocated.cpu,MEM_DESERVED:.status.deserved.memory,MEM_ALLOC:.status.allocated.memory'
Analysis: Volcano distributes resources proportionally by weight. For example:
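A minimal sketch of that arithmetic, assuming a hypothetical 100-CPU cluster and three made-up queues. The function name `weighted_shares` is illustrative only, and the calculation is deliberately simplified: the scheduler typically also caps each queue by its capability and by what its jobs actually request, so the reported deserved values can differ from a pure weight split.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill's script).

# Print each queue's proportional share of a resource total.
# Arguments: the total, then one name:weight pair per queue.
weighted_shares() {
  local total="$1"; shift
  printf '%s\n' "$@" | awk -F: -v total="$total" '
    { name[NR] = $1; weight[NR] = $2; sum += $2 }
    END {
      if (sum == 0) exit 1          # no usable weights
      for (i = 1; i <= NR; i++)
        printf "%s %.1f\n", name[i], total * weight[i] / sum
    }'
}

# 100 CPUs shared by queues with weights 1, 2, and 7.
weighted_shares 100 research:1 staging:2 training:7
```

In this sample the weight-7 queue is entitled to 70 of the 100 CPUs, so a large gap between queues can be the expected result of the configured weights rather than a fault.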
Solution:
Symptom: Queue is over-allocated but reclaim is not triggered
Check:
# Check reclaim is enabled in scheduler config
kubectl get cm volcano-scheduler-configmap -n volcano-system -o yaml | grep reclaim
Reclaim troubleshooting checklist (all must be true):
- reclaim action must be in scheduler actions
- proportion plugin must be enabled
- the over-allocated queue must have reclaimable: true

Check the reclaimable flag on the specific queue:
kubectl get queue <queue> -o jsonpath='{.spec.reclaimable}'
If reclaimable is false (or unset), the queue's resources cannot be reclaimed even if it's over-allocated.
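The scheduler-side items of the checklist can be checked against the configuration text in one pass. This is a rough sketch: the function name `check_reclaim_config` is hypothetical, and it assumes the configuration lists actions on an `actions:` line and plugins as `name: <plugin>` entries. The sample configuration is illustrative only.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill's script).

# Read a scheduler configuration from stdin and report whether the
# reclaim action and the proportion plugin both appear in it.
check_reclaim_config() {
  local conf
  conf=$(cat)
  if printf '%s\n' "$conf" | grep -Eq '^[[:space:]]*actions:.*reclaim'; then
    echo "reclaim action: present"
  else
    echo "reclaim action: MISSING"
  fi
  if printf '%s\n' "$conf" | grep -Eq 'name:[[:space:]]*proportion'; then
    echo "proportion plugin: present"
  else
    echo "proportion plugin: MISSING"
  fi
}

# Illustrative configuration with no reclaim action. Against a cluster, pipe in:
#   kubectl get cm volcano-scheduler-configmap -n volcano-system -o yaml
check_reclaim_config <<'EOF'
actions: "enqueue, allocate, backfill"
tiers:
- plugins:
  - name: priority
  - name: gang
- plugins:
  - name: drf
  - name: proportion
EOF
```

For the sample input, the sketch reports the reclaim action as missing and the proportion plugin as present.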
Solution:
volcano-scheduler-logs --keyword reclaim

If using hierarchical queues:
# Check parent-child relationships
kubectl get queue -o custom-columns='NAME:.metadata.name,PARENT:.spec.parent,WEIGHT:.spec.weight'
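To make the parent-child listing easier to read, the rows can be grouped by parent. This is a sketch under assumptions: the function name `list_children` is hypothetical, the input is the command above with `--no-headers` added, and queues without a parent show `<none>` in the PARENT column. The queue names in the sample are made up.

```bash
#!/usr/bin/env bash
# Hypothetical helper (not part of the skill's script).

# Read "NAME PARENT WEIGHT" rows from stdin and print each parent
# followed by the queues that reference it.
list_children() {
  awk '{
         parent = ($2 == "<none>" || $2 == "") ? "(no parent)" : $2
         children[parent] = children[parent] " " $1
       }
       END { for (p in children) print p ":" children[p] }' | sort
}

# Made-up sample rows standing in for real cluster output.
printf '%s\n' \
  'root <none> 1' \
  'team-a root 4' \
  'team-b root 6' \
  'job-x team-a 1' | list_children
```

Each output line names one parent and its direct children, which helps when comparing the weights of sibling queues.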
Key points:
The diagnose-queue.sh script provides:
Queue Summary Table
Resource Breakdown (with --verbose)
Warning Flags
- [OVER] - Queue allocated > deserved
- [FULL] - Queue at capacity
- [CLOSED] - Queue not accepting new jobs
- [HIGH_PEND] - Many pending PodGroups

| Variable | Default | Description |
|----------|---------|-------------|
| VOLCANO_NAMESPACE | default | Default namespace for pod lookup |
- volcano-diagnose-pod - Diagnose individual pod scheduling
- volcano-gang-scheduling - Gang constraint issues
- volcano-resource-insufficient - Resource shortage diagnosis
- volcano-scheduler-logs - Check scheduler decisions
- quota-debug - Native Kubernetes ResourceQuota/LimitRange diagnosis (non-Volcano)