skills/core/job-debug/SKILL.md
Diagnose Job and CronJob failures (BackoffLimitExceeded, DeadlineExceeded, pods failing, CronJob not triggering). Checks Job status, pod logs, and CronJob schedule to identify why batch workloads are failing.
When a Job has failed, is stuck, or a CronJob is not triggering as expected, follow this flow to identify the root cause.
Scope: This skill is for diagnosis only. Once you identify the root cause, report it to the user and stop. Do NOT attempt to delete, restart, or modify Jobs or CronJobs — that should be left to the user.
kubectl get jobs -n <ns>
Note the COMPLETIONS (succeeded/total) and DURATION columns. For a specific Job:
kubectl describe job <job-name> -n <ns>
Focus on:
- Conditions: Complete (success) or Failed (failure), with the reason
- Events: BackoffLimitExceeded, DeadlineExceeded, or pod creation errors

Next, list the Job's pods, oldest first:

kubectl get pods -n <ns> -l job-name=<job-name> --sort-by='.metadata.creationTimestamp'
Note pod statuses. For failed pods, check logs:
kubectl logs <pod-name> -n <ns>
If the pod was terminated, check the exit code:
kubectl get pod <pod-name> -n <ns> -o jsonpath='{.status.containerStatuses[*].state.terminated}'
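The terminated state includes an exitCode field. As a rough aid for reading it, the sketch below maps a few conventional exit codes to their usual meaning. The function name explain_exit_code is hypothetical, and the mapping follows general shell and signal conventions (128 + signal number), not anything defined by this skill, so treat the result as a hint to confirm against the pod logs.

```shell
#!/usr/bin/env bash
# Hypothetical helper: translate a container exit code into a likely cause.
# The mapping reflects common conventions only; always confirm with the logs.
explain_exit_code() {
  local code="$1"
  case "$code" in
    0)   echo "0: container exited successfully" ;;
    1)   echo "1: generic application error" ;;
    126) echo "126: command found but not executable" ;;
    127) echo "127: command not found" ;;
    137) echo "137: killed by SIGKILL (128+9), often OOMKilled" ;;
    143) echo "143: terminated by SIGTERM (128+15)" ;;
    *)
      # Codes above 128 conventionally mean "killed by signal (code - 128)".
      if [ "$code" -gt 128 ] 2>/dev/null; then
        echo "$code: killed by signal $((code - 128))"
      else
        echo "$code: application-specific error"
      fi
      ;;
  esac
}

# Example (placeholders as in the commands above):
#   explain_exit_code "$(kubectl get pod <pod-name> -n <ns> \
#     -o jsonpath='{.status.containerStatuses[0].state.terminated.exitCode}')"
```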
kubectl get cronjobs -n <ns>
Note the SCHEDULE, SUSPEND, ACTIVE, and LAST SCHEDULE columns.
For details:
kubectl describe cronjob <cronjob-name> -n <ns>
Focus on:
- Suspend: if True, the CronJob will not create new Jobs

List Jobs created by the CronJob:
kubectl get jobs -n <ns> -l job-name --sort-by='.metadata.creationTimestamp' | grep <cronjob-name>
BackoffLimitExceeded: Too many pod failures

The Job's pods have failed more times than the backoffLimit (default: 6). The Job is marked as Failed.
Check why the pods are failing: look at the logs of the most recent failed pod (step 2). For crashing pods, use the pod-crash-debug skill for deeper analysis.

Advise the user to fix the underlying pod failure, then create a new Job.
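Finding the most recent failed pod can be sketched as a small read-only helper. The function name latest_failed_pod is hypothetical; it relies only on `kubectl get` with the job-name label used earlier and the status.phase field selector. One assumption to keep in mind: with restartPolicy: OnFailure, containers restart inside the same pod, so a failing pod may not show up in the Failed phase.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the most recently created failed pod of a Job.
# Read-only: it only runs `kubectl get`.
latest_failed_pod() {
  local ns="$1" job="$2"
  kubectl get pods -n "$ns" -l "job-name=$job" \
    --field-selector=status.phase=Failed \
    --sort-by=.metadata.creationTimestamp \
    -o name | tail -n 1   # pods are sorted oldest first, so the last is newest
}

# Example (placeholders as in the commands above):
#   kubectl logs "$(latest_failed_pod <ns> <job-name>)" -n <ns>
```

The output has the form pod/<pod-name>, which `kubectl logs` accepts directly.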
DeadlineExceeded: Job took too long

The Job did not complete within its activeDeadlineSeconds limit. All running pods are terminated.
kubectl get job <job-name> -n <ns> -o jsonpath='{.spec.activeDeadlineSeconds}'
Check whether the Job's pods are slow or stuck, then advise the user to either increase activeDeadlineSeconds or investigate why the workload is slow.
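To compare the deadline with how long the Job has actually been running, the relevant fields can be read in one call. This is a minimal sketch: the function name job_deadline_summary is hypothetical, and it assumes the standard Job fields status.startTime, status.active, and status.failed are populated (missing fields print as empty).

```shell
#!/usr/bin/env bash
# Hypothetical helper: show a Job's start time next to its deadline and
# pod counters. Read-only: it only runs `kubectl get`.
job_deadline_summary() {
  local ns="$1" job="$2"
  kubectl get job "$job" -n "$ns" \
    -o jsonpath='start={.status.startTime} deadline={.spec.activeDeadlineSeconds}s active={.status.active} failed={.status.failed}{"\n"}'
}

# Example (placeholders as in the commands above):
#   job_deadline_summary <ns> <job-name>
```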
Pending: Scheduling issues

Job pods are created but cannot be scheduled. Use the pod-pending-debug skill to diagnose the scheduling failure.
ImagePullBackOff: Image issue

The Job pod cannot pull its container image. Use the image-pull-debug skill.
The Job completed (status Complete) but produced incorrect results. This is not a Kubernetes issue — check the application logs:
kubectl logs <succeeded-pod> -n <ns>
If the pod has been garbage collected, check if ttlSecondsAfterFinished caused it to be deleted:
kubectl get job <job-name> -n <ns> -o jsonpath='{.spec.ttlSecondsAfterFinished}'
The CronJob exists but is not creating Jobs on schedule.
Common causes:

- Suspended: spec.suspend is true. No new Jobs will be created until it is unsuspended.
- Concurrency policy: with concurrencyPolicy: Forbid and a previous Job still running, the new Job will be skipped.

Check these fields together:

kubectl get cronjob <cronjob-name> -n <ns> -o jsonpath='suspend={.spec.suspend} concurrencyPolicy={.spec.concurrencyPolicy} schedule={.spec.schedule}'
Check if there are active Jobs blocking the next run:
kubectl get jobs -n <ns> | grep <cronjob-name>
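The two checks above (suspend and concurrencyPolicy) can be combined into one decision. The sketch below is illustrative only: cronjob_block_reason is a hypothetical name, and it expects the values you already collected, namely spec.suspend, spec.concurrencyPolicy, and the ACTIVE count from `kubectl get cronjobs`.

```shell
#!/usr/bin/env bash
# Hypothetical helper: explain why a CronJob may not be creating Jobs.
# Arguments: <spec.suspend> <spec.concurrencyPolicy> <number of active Jobs>
cronjob_block_reason() {
  local suspend="$1" policy="$2" active="${3:-0}"
  if [ "$suspend" = "true" ]; then
    echo "suspended: no new Jobs until spec.suspend is false"
  elif [ "$policy" = "Forbid" ] && [ "$active" -gt 0 ]; then
    echo "blocked: concurrencyPolicy is Forbid and $active Job(s) still active"
  else
    echo "not blocked by suspend or concurrencyPolicy: check schedule and startingDeadlineSeconds"
  fi
}

# Example with values read from the commands above:
#   cronjob_block_reason false Forbid 1
```

The helper only interprets values; it does not change the CronJob, in line with the diagnosis-only scope of this skill.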
If every Job created by the CronJob fails, the root cause is in the Job template. Diagnose the most recent Job using steps 1-2 above.
Check the startingDeadlineSeconds — if set, Jobs that miss their window (e.g., because the previous Job ran too long) are skipped:
kubectl get cronjob <cronjob-name> -n <ns> -o jsonpath='{.spec.startingDeadlineSeconds}'
Also check successfulJobsHistoryLimit and failedJobsHistoryLimit — if these are set to 0, completed/failed Jobs and their pods are deleted immediately, making diagnosis harder.
- Jobs created by a CronJob are named <cronjob-name>-<timestamp>. Use this pattern to find related Jobs.
- backoffLimit controls how many retries happen before the Job is marked as Failed.
- completions and parallelism control how many pods need to succeed and how many run concurrently. A Job with completions: 5, parallelism: 2 runs 2 pods at a time until 5 have succeeded.
- For Indexed Jobs (completionMode: Indexed), each pod gets a unique index. A failure in one index does not affect others, so check which specific index is failing.
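The completions and parallelism example can be worked through as simple arithmetic. This is a back-of-the-envelope sketch under stated assumptions: no pod fails and all pods take about the same time. The real controller starts a new pod as soon as one finishes, so strict "rounds" are only an approximation, and the name min_rounds is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: minimum number of rounds needed to reach the
# requested completions, assuming no failures and equal pod durations.
min_rounds() {
  local completions="$1" parallelism="$2"
  # Integer ceiling of completions / parallelism.
  echo $(( (completions + parallelism - 1) / parallelism ))
}

# completions: 5, parallelism: 2 -> rounds of 2, 2, and 1 pods
min_rounds 5 2
```

A Job that has been running far longer than this estimate times a typical pod duration is a candidate for the DeadlineExceeded or stuck-pod checks above.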