skills/core/volcano-diagnose-pod/SKILL.md
Diagnose Volcano-managed Pod scheduling issues. Checks Pod status, PodGroup, events, and Queue to identify scheduling failures.
Diagnose Volcano-managed Pod scheduling issues. This skill checks Pod status, associated PodGroup, scheduling events, and Queue configuration to identify why a Pod cannot be scheduled.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify pod specs, PodGroups, or Queues; leave those changes to the user.
bash skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh --pod <pod-name> --namespace <namespace>
| Parameter | Required | Description |
|-----------|----------|-------------|
| --pod POD | yes | Pod name to diagnose |
| --namespace NS | no | Namespace (default: default) |
| --verbose | no | Show detailed output including node resources |
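
For illustration, the parameter table could be handled by a parser along the following lines. This is a hypothetical sketch, not the contents of diagnose-pod.sh: the function name parse_args and the variables POD, NAMESPACE, and VERBOSE are assumptions made for the example.

```bash
# Hypothetical parser mirroring the parameter table; the real
# diagnose-pod.sh may be implemented differently.
parse_args() {
  POD=""
  NAMESPACE="default"   # documented default for --namespace
  VERBOSE=false
  while [ $# -gt 0 ]; do
    case "$1" in
      --pod)
        [ $# -ge 2 ] || { echo "--pod needs a value" >&2; return 2; }
        POD="$2"; shift 2 ;;
      --namespace)
        [ $# -ge 2 ] || { echo "--namespace needs a value" >&2; return 2; }
        NAMESPACE="$2"; shift 2 ;;
      --verbose)
        VERBOSE=true; shift ;;
      *)
        echo "unknown argument: $1" >&2; return 2 ;;
    esac
  done
  # --pod is the only required parameter.
  [ -n "$POD" ] || { echo "--pod is required" >&2; return 2; }
}
```

Calling `parse_args --pod my-job-0` leaves NAMESPACE at `default` and VERBOSE at `false`; omitting --pod returns a non-zero status.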
Diagnose a pending pod in the default namespace:
bash skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh --pod my-job-0
Diagnose a pod in a specific namespace:
bash skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh --pod my-job-0 --namespace training
Verbose mode with node resource information:
bash skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh --pod my-job-0 --namespace training --verbose
The script performs the following checks in order:
Check the Pod's current phase and conditions.
kubectl get pod <pod> -n <ns> -o wide
kubectl describe pod <pod> -n <ns>
Check if the Pod is associated with a PodGroup and its scheduling status.
kubectl get pod <pod> -n <ns> -o jsonpath='{.metadata.annotations.scheduling\.volcano\.sh/pod-group}'
kubectl get podgroup <podgroup> -n <ns>
Key fields to check:
- spec.minMember: Minimum members required for Gang scheduling
- status.phase: Pending, Inqueue, Running, Unknown
- status.running: Number of running pods
- status.pending: Number of pending pods

Check scheduling events for failure reasons.
kubectl get events -n <ns> --field-selector involvedObject.name=<pod> --sort-by='.lastTimestamp'
Look for these event patterns:
FailedScheduling - General scheduling failure

The scheduler attempted but failed to schedule the pod. Check the message for specific reasons.
Volcano-specific sub-patterns:
| Event Message | Meaning | Next Step |
|---------------|---------|-----------|
| 0/N nodes are available + minMember | Gang constraint not satisfied | Use volcano-gang-scheduling |
| exceeded quota / queue resource exceeded | Queue deserved resources exhausted | Use volcano-queue-diagnose |
| Insufficient cpu/memory + Gang mention | Resource shortage blocking Gang | Use volcano-resource-insufficient |
| pod group is not ready | PodGroup not in Inqueue phase | Check PodGroup status |
| task <name> is not ready | Task dependencies not met | Check dependent tasks |
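
The lookup in the table above can be sketched as a small shell function. The function name suggest_next_step and the exact substring rules are assumptions for illustration; real event messages vary between Volcano versions, so treat the matches as a starting point rather than a complete classifier.

```bash
# Hypothetical helper: map a scheduling event message to the next step
# suggested in the table above. The substring rules are assumptions.
suggest_next_step() {
  local msg
  # Lowercase the message so matching is case-insensitive.
  msg=$(printf '%s' "$1" | tr '[:upper:]' '[:lower:]')
  case "$msg" in
    *"exceeded quota"*|*"queue resource exceeded"*)
      echo "volcano-queue-diagnose" ;;
    *insufficient*gang*|*gang*insufficient*)
      echo "volcano-resource-insufficient" ;;
    *"nodes are available"*minmember*|*minmember*"nodes are available"*)
      echo "volcano-gang-scheduling" ;;
    *"pod group is not ready"*)
      echo "check PodGroup status" ;;
    *task*"is not ready"*)
      echo "check dependent tasks" ;;
    *)
      echo "no known pattern; read the full event message" ;;
  esac
}
```

For example, `suggest_next_step "0/5 nodes are available: minMember not satisfied"` prints `volcano-gang-scheduling`.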
Quick Reference vs Detailed Analysis: The table above provides a quick lookup for common patterns. The sections below provide detailed analysis, additional context, and more diagnostic commands for each pattern.
Insufficient cpu / Insufficient memory - Resource shortage

No node has enough allocatable resources. Check:
kubectl top nodes
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.containers[*].resources.requests}'

Volcano context: If this is a Gang-scheduled pod, even if total cluster resources are sufficient, you need enough resources simultaneously on enough nodes. Use volcano-resource-insufficient to check fragmentation.
minMember not satisfied - Gang constraint

The PodGroup requires minMember pods to be scheduled simultaneously, but the cluster cannot satisfy this. Use the volcano-gang-scheduling skill for detailed diagnosis.
Key insight: Even if kubectl top nodes shows enough total resources, Gang requires simultaneous availability on different nodes.
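
A small worked example makes the fragmentation point concrete. The sketch below is a simplification under stated assumptions: it looks only at CPU in millicores, lets several pods share a node when there is room, and ignores memory, affinity, and taints. The function name gang_fits is hypothetical.

```bash
# Illustrative only: count how many pods of a given CPU request fit
# across nodes, then compare that with minMember.
# Usage: gang_fits <request_millicores> <min_member> <free_millicores_per_node...>
gang_fits() {
  local request="$1" min_member="$2"
  [ "$request" -gt 0 ] || { echo "request must be positive" >&2; return 2; }
  shift 2
  local slots=0 total=0 free
  for free in "$@"; do
    total=$(( total + free ))
    # A node contributes only whole pods; leftover capacity is stranded.
    slots=$(( slots + free / request ))
  done
  echo "total_free=${total}m slots=${slots} minMember=${min_member}"
  # Succeed only if every gang member can be placed at the same time.
  [ "$slots" -ge "$min_member" ]
}
```

With four nodes that each have 3000m free, `gang_fits 4000 3 3000 3000 3000 3000` prints `total_free=12000m slots=0 minMember=3` and returns non-zero: the cluster total covers three 4000m pods, yet no single node can host even one.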
queue resource exceeded - Queue quota limit

The Queue associated with this Pod has exceeded its deserved resources. Check Queue status with the volcano-queue-diagnose skill.
Volcano-specific terms you might see:
- overused - Queue has exceeded its fair share
- deserved resources - Calculated from queue weight proportion
- allocated resources - Currently used by jobs in this queue

reclaim events - Resource reclamation triggered

If you see events mentioning reclaim:
- Check whether the queue is over-allocated (allocated > deserved)
- Run volcano-queue-diagnose --queue <queue>

preempt events - Priority preemption

A higher-priority workload is evicting this pod. Check:
kubectl get pod <pod> -n <ns> -o jsonpath='{.spec.priorityClassName}'
volcano-scheduler-logs --keyword preempt

enqueue related events

- PodGroup is enqueued - PodGroup admitted to queue, ready for scheduling
- PodGroup is pending - Waiting for queue admission (capacity or resource check)
- enqueue failed - Failed admission check (overcommit, queue closed, etc.)

Check the Queue configuration and resource allocation.
kubectl get podgroup <podgroup> -n <ns> -o jsonpath='{.spec.queue}'
kubectl get queue <queue>
kubectl describe queue <queue>
Key fields:
- spec.weight: Queue weight for resource sharing
- spec.capability: Maximum resources the queue can use
- status.state: Open, Closed, or Closing
- status.deserved: Resources deserved by this queue
- status.allocated: Resources currently allocated

When --verbose is specified, also check node allocatable resources.
kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory'
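
To compare a job's total request against the cluster, the CPU column of the command above can be summed. The following is a minimal sketch, assuming the three-column NAME/CPU/MEM layout produced by that command and CPU values written either as whole cores (`8`) or millicores (`7910m`). The function name sum_allocatable_cpu is hypothetical.

```bash
# Sum the CPU column (second field) of the custom-columns output,
# normalising whole cores and millicore values to millicores.
sum_allocatable_cpu() {
  awk 'NR > 1 {                       # skip the header row
         cpu = $2
         if (cpu ~ /m$/) { sub(/m$/, "", cpu); total += cpu }
         else            { total += cpu * 1000 }
       }
       END { printf "%dm\n", total }'
}
```

Pipe the kubectl output into it, for example `kubectl get nodes -o custom-columns='NAME:.metadata.name,CPU:.status.allocatable.cpu,MEM:.status.allocatable.memory' | sum_allocatable_cpu`. The result is total allocatable capacity, not free capacity, so it is an upper bound on what can be scheduled.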
kubectl get pods -n volcano-system -l app=volcano-scheduler
kubectl get pods -n volcano-system -l app=volcano-controller-manager
Check the scheduler logs with the volcano-scheduler-logs skill.

kubectl get podgroup <pg> -n <ns> -o jsonpath='{.spec.queue}'
kubectl get queue <queue-name>
If the queue name is empty, the job uses the default queue; verify that it exists and is Open.

- Check whether the minMember constraint is not satisfied
- Use volcano-queue-diagnose for detailed analysis

| Variable | Default | Description |
|----------|---------|-------------|
| VOLCANO_NAMESPACE | default | Default namespace for Pod lookup |
| VOLCANO_SCHEDULER_NS | volcano-system | Namespace where volcano scheduler runs |
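
The defaults in the table above correspond to the usual shell fallback idiom. This is a sketch of how a script might resolve them, not an excerpt from diagnose-pod.sh; the function name resolve_volcano_env is an assumption.

```bash
# Resolve the two variables, falling back to the documented defaults.
resolve_volcano_env() {
  # ${VAR:-default} applies the default when VAR is unset or empty.
  local ns="${VOLCANO_NAMESPACE:-default}"
  local sched_ns="${VOLCANO_SCHEDULER_NS:-volcano-system}"
  echo "pod_namespace=${ns} scheduler_namespace=${sched_ns}"
}
```

Running `VOLCANO_NAMESPACE=training resolve_volcano_env` prints `pod_namespace=training scheduler_namespace=volcano-system`.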
- volcano-gang-scheduling - Detailed Gang scheduling diagnosis
- volcano-queue-diagnose - Queue status and quota analysis
- volcano-scheduler-logs - Scheduler log analysis
- volcano-resource-insufficient - Resource shortage diagnosis
- quota-debug - Native Kubernetes ResourceQuota/LimitRange diagnosis (non-Volcano)