skills/core/volcano-job-diagnose/SKILL.md
Diagnose Volcano Job status and issues. Check Job phases, task statuses, PodGroup associations, and overall job health.
Diagnose Volcano Job (batch.volcano.sh/v1alpha1) status and issues. This skill checks Job phases, task statuses, PodGroup associations, and overall job health.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to modify job specs or restart jobs — that should be left to the user.
bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job <job-name> --namespace <namespace>
| Parameter | Required | Description |
|-----------|----------|-------------|
| --job JOB | yes | Job name to diagnose |
| --namespace NS | no | Namespace (default: default) |
| --verbose | no | Show detailed task and pod information |
Diagnose a Volcano Job:
bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job my-training-job --namespace training
Verbose mode with task details:
bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job my-training-job --namespace training --verbose
apiVersion: batch.volcano.sh/v1alpha1
kind: Job
spec:
  schedulerName: volcano
  tasks:
  - name: worker
    replicas: 4
    template:
      spec:
        containers:
        - name: worker
          resources:
            requests:
              cpu: "4"
              memory: "8Gi"
  maxRetry: 3 # Max retries before job is Aborted
  policies:
  - event: PodFailed
    action: RestartJob
Note: Volcano Jobs can also be queried using the short name vcjob (e.g., kubectl get vcjob). This is an alias for job.batch.volcano.sh. Be careful not to confuse it with the native Kubernetes batch/v1 Job; always use job.batch.volcano.sh or vcjob for Volcano Jobs.
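One way to confirm which resources a cluster serves under the Volcano batch group, and therefore what vcjob resolves to, is to list that API group. The sketch below assumes kubectl is on the PATH and the Volcano CRDs are installed; the KUBECTL variable and the function name are illustrative and not part of this skill.

```shell
# Hypothetical helper: list resources served by the Volcano batch API group.
# KUBECTL may point at a wrapper; it defaults to plain kubectl.
KUBECTL="${KUBECTL:-kubectl}"

list_volcano_batch_resources() {
  # --api-group restricts the listing to one group; -o name prints <resource>.<group>
  "$KUBECTL" api-resources --api-group=batch.volcano.sh -o name
}
```

If the listing is empty, the Volcano CRDs are probably not installed, and any Job lookup would only ever hit the native batch/v1 resource.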
| Phase | Meaning |
|-------|---------|
| Pending | Job is waiting for resources or admission |
| Running | Job is executing |
| Completing | Job tasks are completing |
| Completed | Job finished successfully |
| Failed | Job failed |
| Restarting | Job is being restarted due to policy |
| Terminating | Job is being terminated |
| Aborted | Job was aborted |
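When scripting around these phases, it can help to separate phases that will not change on their own from those still in motion. The sketch below is one possible grouping of the table above; treating Completed, Failed, and Aborted as final is an assumption made for illustration, and the function name is hypothetical.

```shell
# phase_category <phase>
# Groups the phases from the table above.
# Assumption: Completed, Failed, and Aborted are treated as final states;
# every other phase in the table is treated as still in progress.
phase_category() {
  case "$1" in
    Completed|Failed|Aborted) echo "final" ;;
    Pending|Running|Completing|Restarting|Terminating) echo "in-progress" ;;
    *) echo "unknown" ;;
  esac
}
```

A diagnosis loop could, for example, stop polling once `phase_category` prints `final`.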
Each task within a job has its own status:
- Pending - Task pods not yet scheduled
- Running - Task pods are running
- Completed - Task finished
- Failed - Task failed

Get the Job status:
kubectl get job.batch.volcano.sh <job-name> -n <namespace> -o yaml
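As a narrower alternative to reading the full YAML, the phase and counters can be selected with a JSONPath template. This is a minimal sketch: the function name and the KUBECTL override are illustrative, and counters that are absent from the status simply print as empty.

```shell
# KUBECTL may point at a wrapper; it defaults to plain kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# job_status_line <job-name> [namespace]
# Prints the job phase and the status counters on a single line.
job_status_line() {
  "$KUBECTL" get job.batch.volcano.sh "$1" -n "${2:-default}" \
    -o jsonpath='{.status.state.phase}{" pending="}{.status.pending}{" running="}{.status.running}{" succeeded="}{.status.succeeded}{" failed="}{.status.failed}{"\n"}'
}
```

For example, `job_status_line my-training-job training` queries the job used in the earlier examples.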
Key fields to check:
- status.state.phase - Current job phase
- status.failed - Number of failed tasks
- status.succeeded - Number of succeeded tasks
- status.running - Number of running tasks
- status.pending - Number of pending tasks

List all tasks and their statuses:
kubectl get pods -n <namespace> -l volcano.sh/job-name=<job-name> -o wide
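For jobs with many replicas, a per-status count is often quicker to read than the full pod list. The sketch below tallies the STATUS column from kubectl's table output; it assumes the listing was produced with --no-headers for a single namespace, so that STATUS is the third column. The function name is hypothetical.

```shell
# count_pod_statuses: reads `kubectl get pods --no-headers` output on stdin
# and prints one "<status> <count>" line per distinct STATUS value.
# Assumption: single-namespace listing, so STATUS is column 3.
count_pod_statuses() {
  awk '{ count[$3]++ } END { for (s in count) print s, count[s] }' | sort
}
```

Usage, with the same selector as above: `kubectl get pods -n <namespace> -l volcano.sh/job-name=<job-name> --no-headers | count_pod_statuses`.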
What to look for:
Find the PodGroup created for this Job:
kubectl get podgroups -n <namespace> -l volcano.sh/job-name=<job-name>
Or check the Job's tasks for PodGroup annotations:
kubectl get pods -n <namespace> -l volcano.sh/job-name=<job-name> \
-o jsonpath='{.items[0].metadata.annotations.scheduling\.volcano\.sh/pod-group}'
Next step: If PodGroup status is problematic, use volcano-diagnose-pod for detailed PodGroup analysis.
Review job policies that may affect behavior:
kubectl get job.batch.volcano.sh <job-name> -n <namespace> -o jsonpath='{.spec.policies}'
Common policies:
- PodFailed → RestartJob - Restart entire job on any pod failure
- PodFailed → RestartTask - Restart only the failed task
- PodEvicted → RestartTask - Restart evicted tasks
- PodEvicted → AbortJob - Abort entire job when a pod is evicted (can cause unexpected aborts during preemption)
- TaskCompleted → CompleteJob - Complete job when task finishes

Also check maxRetry; when retries are exhausted, the job moves to Aborted:
kubectl get job.batch.volcano.sh <job-name> -n <namespace> -o jsonpath='{.spec.maxRetry}'
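To judge how close a job is to exhausting its retries, the configured limit can be compared with the number of restarts so far. The sketch below does only the arithmetic; it assumes the restart count is read from .status.retryCount on the Job, which should be verified against the installed Volcano version, and the function name is hypothetical.

```shell
# retries_left <maxRetry> <retryCount>
# Prints the number of retries remaining, never below 0.
# Assumption: retryCount comes from .status.retryCount and may be empty on a
# job that has never restarted; empty values are treated as 0.
retries_left() {
  left=$(( ${1:-0} - ${2:-0} ))
  if [ "$left" -lt 0 ]; then
    left=0
  fi
  echo "$left"
}
```

Both inputs can be fetched with the same jsonpath pattern as the maxRetry command above; a result of 0 suggests the next failure would move the job to Aborted.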
Check job-related events:
kubectl get events -n <namespace> --field-selector involvedObject.name=<job-name>
Common event patterns:
- JobFailed - Job has failed. Check the reason and message for failure details.
- JobRestarting - Job is being restarted. Check the restart policy and previous failure reason.
- TaskFailed - Individual task failed. May or may not cause the entire job to fail, depending on policy.
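These patterns are easier to follow when the events are narrowed to the Job object and listed in time order. The sketch below adds a kind filter and a sort to the event query; the function name and KUBECTL override are illustrative. Note that a native batch/v1 Job with the same name would also match the kind filter.

```shell
# KUBECTL may point at a wrapper; it defaults to plain kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# job_events <job-name> [namespace]
# Lists events whose involved object is a Job with the given name, oldest first.
job_events() {
  "$KUBECTL" get events -n "${2:-default}" \
    --field-selector "involvedObject.name=$1,involvedObject.kind=Job" \
    --sort-by=.lastTimestamp
}
```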
Symptom: Job phase is Pending, no pods created.
Check:
- kubectl get podgroups -n <ns>
- kubectl get queue <queue>
- kubectl get events -n <ns> | grep <job-name>

Likely causes:
Symptom: Partial task scheduling (e.g., 2/4 tasks running).
Check:
Likely causes:
volcano-gang-scheduling)

Symptom: Job keeps restarting, never completes.
Check:
- kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.spec.policies}'
- kubectl describe pod <pod> -n <ns>
- kubectl logs <pod> -n <ns>

Likely causes:
Symptom: Some tasks succeeded, but job marked as Failed.
Check:
Likely causes:
Symptom: Job was Running, then moved to Aborted.
Check:
# Check maxRetry
kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.spec.maxRetry}'
# Check for preemption/eviction events
kubectl get events -n <ns> --field-selector reason=Preempted
kubectl get events -n <ns> --field-selector reason=Evicted
# Check if running pod count dropped below minMember (Gang breakage)
kubectl get podgroup -n <ns> -l volcano.sh/job-name=<job> -o jsonpath='{"running: "}{.items[0].status.running}{"\nminMember: "}{.items[0].spec.minMember}'
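The two values printed by the PodGroup command can be compared mechanically. The sketch below reads that command's "running: N" and "minMember: M" lines on stdin and reports whether the running count is below minMember; the function name is hypothetical, and a missing running value is treated as 0.

```shell
# check_gang: reads "running: <n>" and "minMember: <m>" lines on stdin.
# Prints "below minMember (n/m)" when fewer pods are running than required,
# otherwise "ok (n/m)". Empty values count as 0.
check_gang() {
  awk -F': ' '
    $1 == "running"   { running = $2 }
    $1 == "minMember" { min = $2 }
    END {
      r = running + 0
      m = min + 0
      if (r < m) print "below minMember (" r "/" m ")"
      else       print "ok (" r "/" m ")"
    }'
}
```

Piping the output of the PodGroup command above into `check_gang` gives a one-line verdict on gang breakage.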
Likely causes:
- maxRetry exhausted: job restarted too many times
- A pod was evicted or preempted under a PodEvicted → AbortJob policy
- Running pod count dropped below minMember, tearing down the entire group
- Policy set to PodEvicted → AbortJob when RestartTask would be more appropriate

Volcano controls task coordination through lifecycle policies, not explicit task dependencies.
spec:
  tasks:
  - name: master
    replicas: 1
    policies:
    - event: TaskCompleted
      action: CompleteJob
  - name: worker
    replicas: 4
    policies:
    - event: PodFailed
      action: RestartTask
Diagnosis:
# Check per-task status counts
kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.status.taskStatusCount}'
# Check configured policies
kubectl get job.batch.volcano.sh <job> -n <ns> -o jsonpath='{.spec.tasks[*].policies}'
Look for mismatched events/actions that could cause unexpected restarts or premature completion.
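A small filter can make this review less error-prone. The sketch below prints job-level policies as one "event action" pair per line using a JSONPath range, then flags the PodEvicted → AbortJob combination that this guide identifies as a source of unexpected aborts. Both function names and the KUBECTL override are illustrative, and the flagging rule is an assumption that can be extended.

```shell
# KUBECTL may point at a wrapper; it defaults to plain kubectl.
KUBECTL="${KUBECTL:-kubectl}"

# list_job_policies <job-name> [namespace]
# Prints one "<event> <action>" line per job-level policy.
list_job_policies() {
  "$KUBECTL" get job.batch.volcano.sh "$1" -n "${2:-default}" \
    -o jsonpath='{range .spec.policies[*]}{.event}{" "}{.action}{"\n"}{end}'
}

# flag_policies: reads "<event> <action>" lines on stdin and marks the
# combination that aborts the whole job when a single pod is evicted.
flag_policies() {
  while read -r event action; do
    [ -n "$event" ] || continue
    if [ "$event" = "PodEvicted" ] && [ "$action" = "AbortJob" ]; then
      echo "WARN $event -> $action (aborts the whole job on eviction)"
    else
      echo "OK   $event -> $action"
    fi
  done
}
```

Usage: `list_job_policies <job> <ns> | flag_policies`. Task-level policies would need a similar range over .spec.tasks[*].policies.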
Use this skill in combination with others:
# 1. Job-level diagnosis
bash skills/core/volcano-job-diagnose/scripts/diagnose-job.sh --job my-job --namespace training
# 2. If PodGroup issues found → Pod-level diagnosis
bash skills/core/volcano-diagnose-pod/scripts/diagnose-pod.sh --pod my-job-worker-0 --namespace training
# 3. If Gang issues → Gang scheduling analysis
# (refer to volcano-gang-scheduling skill)
# 4. If Queue issues → Queue diagnosis
bash skills/core/volcano-queue-diagnose/scripts/diagnose-queue.sh --queue training-queue
# 5. Check scheduler logs for decisions
bash skills/core/volcano-scheduler-logs/scripts/get-scheduler-logs.sh --pod my-job-worker-0 --since 1h
| Variable | Default | Description |
|----------|---------|-------------|
| VOLCANO_NAMESPACE | default | Default namespace for job lookup |
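How the script combines this variable with the --namespace flag is not spelled out here. A plausible precedence, shown purely as an assumption and not confirmed against diagnose-job.sh, is: explicit argument first, then VOLCANO_NAMESPACE, then default. The function name is hypothetical.

```shell
# resolve_namespace [explicit-namespace]
# Assumed precedence (not confirmed against diagnose-job.sh):
#   1. the explicit argument, 2. $VOLCANO_NAMESPACE, 3. "default".
resolve_namespace() {
  echo "${1:-${VOLCANO_NAMESPACE:-default}}"
}
```

Under that assumption, exporting VOLCANO_NAMESPACE=training would let the --namespace flag be omitted for jobs in the training namespace.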
- volcano-diagnose-pod - Pod-level scheduling diagnosis
- volcano-gang-scheduling - Gang scheduling constraint analysis
- volcano-queue-diagnose - Queue resource analysis
- volcano-scheduler-logs - Scheduler decision logs
- deployment-rollout-debug - (Similar concept for Deployments)