Wait For Job

Multi-agent collaboration

Default to using subagents when they are likely to improve speed, quality, or confidence, or to keep the main context clean.
Use subagents to widen coverage, dig deeper on one thread, get a fresh second opinion, or keep the main thread clean while side work runs.
Split work into clear packets with owners, inputs, acceptance checks, and a synthesis step when parallelizing.
Keep the main agent focused on synthesis, unblockers, and the next critical-path step; let subagents handle bounded side work that can run in parallel.
Use single-agent execution only when scope is small or coordination overhead outweighs gains.

Wait-for-job-specific subagent split

Use subagents when the wait may last a while or when a simple success signal can still hide broken outputs.
Suggested split: status-poller owns the primary command, regex, timeout state, and wait progress, and updates plan/current/wait-for-job.md.
Suggested split: artifact-checker owns secondary logs, outputs, and contradictory failure signals, and updates the same note with evidence.
The main agent owns the final call on whether success is real, whether the wait is blocked, and whether the task should escalate to a richer monitor skill.
Use subagents only when the work splits cleanly; otherwise stay single-agent.

Proactive autonomy and knowledge compounding

Be proactive: immediately take the next highest-value in-scope action when it is clear.
Default to autonomous execution: do not pause for confirmation between normal in-scope steps.
Request user input only when absolutely necessary: ambiguous requirements, material risk tradeoffs, missing required data/access, or destructive/irreversible actions outside policy.
If blocked by command/tool/env failures, attempt high-confidence fallbacks autonomously before escalating (for example rg -> find/grep, python -> python3, alternate repo-native scripts).
When the workflow uses plan/, ensure required plan directories exist before reading/writing them (create when edits are allowed; otherwise use an in-memory fallback and call it out).
Treat transient external failures (network/SSH/remote APIs/timeouts) as retryable by default: run bounded retries with backoff and capture failure evidence before concluding blocked.
On repeated invocations for the same objective, resume from prior findings/artifacts and prioritize net-new progress over rerunning identical work unless verification requires reruns.
Drive work to complete outcomes with verification, not partial handoffs.
Treat iterative execution as the default for non-trivial work; run adaptive loop passes. Example loops (adapt as needed, not rigid): issue-resolution investigate -> plan -> fix -> verify -> battletest -> organise-docs -> git-commit -> re-review; cleanup scan -> prioritize -> clean -> verify -> re-scan; docs audit -> update -> verify -> re-audit.
Keep looping until actual completion criteria are met: no actionable in-scope items remain, verification is green, and confidence is high.
Run organise-docs frequently during execution to capture durable decisions and learnings, not only at the end.
Create small checkpoint commits frequently with git-commit when changes are commit-eligible, checks are green, and repo policy permits commits.
Never squash commits; always use merge commits when integrating branches.
Prefer simplification over added complexity: aggressively remove bloat, redundancy, and over-engineering while preserving correctness.
When you touch code, leave the touched area in a better state than you found it: clearer, simpler, tidier, and at least as performant unless the task requires an explicit trade-off.
Use simple, plain English in user messages, docs, notes, reports, code comments, and other explanatory writing. Avoid jargon, fancy wording, and complex phrasing. When a technical term is needed for correctness, explain it in simple words the first time. Default to short user-facing responses. Think about what the user most wants to know, and lead with that. Do not dump every detail by default. Always include important changes, blockers, verification gaps, and any important assumptions, nuances, principles, or decisions that shaped the work. Add more detail only when the user asks for it or when uncertainty or risk makes it necessary.
Compound knowledge continuously: keep docs/ accurate and up to date, and promote durable learnings and decisions from work into docs.

Long-task checkpoint cadence

For any non-trivial task (including long efforts), run recurring checkpoint cycles instead of waiting for a single end-of-task wrap-up.
At each meaningful milestone with commit-eligible changes, and at least once per major phase, invoke git-commit to create a small logical checkpoint commit once relevant checks are green and repo policy permits commits.
At the same cadence, invoke organise-docs whenever durable learnings/decisions appear, and prune stale plan/ scratch artifacts.
If either checkpoint is blocked (for example failing checks or low-confidence documentation), resolve or record the blocker immediately and retry before expanding scope.

Terminal state contract (must follow)

The skill is complete only when all of the following are true:

Objective completion: the user-requested outcome is achieved, or explicitly marked blocked with concrete blocker evidence.
Workflow completion: every required workflow step is resolved as done, blocked, or not-applicable, with brief evidence or rationale.
Step-level terminal completion: each numbered subtask must have explicit completion evidence (artifact, command output, or written rationale) before advancing.
Verification completion: required checks/validations for this skill are executed, or any unavailable checks are explicitly called out with impact.
Findings completion (where applicable): report only evidence-backed findings; if no high-confidence critical findings are present, explicitly state that.
Loop completion: no actionable in-scope next step remains under the current objective.

Stop only after this terminal contract is satisfied; otherwise continue iterating.

Terminal state examples (adapt to skill)

done: poller reaches terminal success condition (exit-code or completion regex), waiting or queued states were watched through to finish, and the best available logs/outputs/results do not contradict successful completion.
blocked: timeout cap, explicit failure regex, or repeated command failure occurs after bounded retries; failure evidence and retry/unblock command are reported.
not-applicable: downstream action is intentionally skipped when user requested wait-only behavior (for example status reported with no follow-on execution).

Overview

Block until an external task is fully complete, then continue with downstream analysis or execution. Treat waiting, queued, and running states as active monitoring time, not as a reason to stop early. Use the bundled poller to avoid racing ahead while jobs are still active, and check the best available logs, outputs, and results before calling the task done.

Workflow

Define completion, waiting, and failure signals.

Choose completion mode:
- Exit-code mode: command exits 0 only when the task is complete.
- Regex mode: command output matches a completion regex.
Identify any waiting or queued states that should still count as active work rather than failure.
Define optional failure regexes (Failed, Error, Cancelled) for hard-stop conditions.
Identify best-available secondary evidence to inspect while the task is active, such as logs, stdout/stderr, output directories, result files, or status fields beyond the primary success check.
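The signal definitions above can be sketched as a small classifier. This is only an illustration of the ordering (failure patterns first, then success, then waiting); it is not the logic of the bundled script, and the function name `classify_poll` and its parameters are hypothetical.

```python
import re


def classify_poll(exit_code, output, success_regex=None,
                  failure_regexes=(), waiting_regexes=()):
    """Return 'failure', 'success', 'waiting', or 'running' for one poll.

    Hypothetical helper: the names and return values are assumptions made
    for this sketch, not part of poll_until_done.py.
    """
    # Failure patterns are checked first so a hard-stop signal always wins.
    for pattern in failure_regexes:
        if re.search(pattern, output):
            return "failure"
    if success_regex is None:
        # Exit-code mode: exit 0 means the task is complete.
        return "success" if exit_code == 0 else "running"
    # Regex mode: completion is decided by the output, not the exit code.
    if re.search(success_regex, output):
        return "success"
    # Waiting or queued states count as active work, not as failure.
    for pattern in waiting_regexes:
        if re.search(pattern, output):
            return "waiting"
    return "running"
```

Checking failure patterns before the success pattern means output that contains both, such as `Complete Failed`, is treated as a failure rather than a false success.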

Run the poller with explicit limits.

Keep --interval-seconds at 120 unless the user asks for a different polling cadence.
Use --timeout-seconds with an upper bound of 28800 (8 hours). The script enforces this cap and defaults to 28800.
For checks that depend on unstable transports (SSH/remote APIs), treat transient command failures as retryable and rerun the poller with bounded retries before hard-failing.
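A minimal sketch of bounded retries with backoff, assuming the caller has already decided that a non-zero exit is a transient transport failure. The helper `run_with_retries` and its parameters are hypothetical and are not part of the bundled script; the `sleep` parameter exists only so the delay can be replaced in tests.

```python
import subprocess
import time


def run_with_retries(cmd, max_attempts=3, base_delay=1.0, sleep=time.sleep):
    """Run a shell command, retrying non-zero exits with exponential backoff.

    Returns the last CompletedProcess and a list of
    (attempt, returncode, stderr) tuples kept as failure evidence.
    """
    failures = []
    for attempt in range(1, max_attempts + 1):
        result = subprocess.run(cmd, shell=True, capture_output=True, text=True)
        if result.returncode == 0:
            return result, failures
        # Keep evidence of every failed attempt for the final report.
        failures.append((attempt, result.returncode, result.stderr.strip()))
        if attempt < max_attempts:
            # Delays grow as base_delay, 2 * base_delay, 4 * base_delay, ...
            sleep(base_delay * 2 ** (attempt - 1))
    return result, failures
```

The sketch retries every non-zero exit. A real wrapper would first separate transient failures (for example a dropped SSH connection) from genuine job failures, which should not be retried.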

Inspect secondary evidence while polling.

At each meaningful poll, check the best-available logs/outputs/results for signs that the task is healthy and still producing the intended outcome.
If the task is only waiting or queued, keep monitoring until it advances or a real blocker appears.
If secondary evidence is unavailable, say so explicitly and treat the primary status command as the only trustworthy signal.

Continue only on validated success.

Exit code 0 or matching success regex: proceed only if logs/outputs/results do not show contradictory failure or broken output.
Non-zero exit after bounded retries: stop and report the reason; do not continue silently.
Success signal plus missing, empty, corrupt, or clearly wrong outputs/results: treat as blocked or escalate to the richer monitor skill instead of silently calling it done.
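The validation rule above can be illustrated with a small check on expected output files. This is a sketch under assumptions: the helper names `check_outputs` and `final_state` are hypothetical, and a size check is only the simplest possible test for "missing or empty". Detecting corrupt or wrong results needs task-specific checks.

```python
from pathlib import Path


def check_outputs(paths, min_bytes=1):
    """Return a list of problems found in the expected output files."""
    problems = []
    for raw in paths:
        path = Path(raw)
        if not path.is_file():
            problems.append(f"missing: {path}")
        elif path.stat().st_size < min_bytes:
            problems.append(f"too small ({path.stat().st_size} bytes): {path}")
    return problems


def final_state(primary_success, problems):
    """Combine the primary signal with secondary evidence.

    A success signal only counts as 'done' when no output problems exist.
    """
    if primary_success and not problems:
        return "done"
    return "blocked"
```

For example, `final_state(True, check_outputs(["results/metrics.json"]))` returns `"blocked"` when the file is missing or empty, even though the poller reported success. The path is a placeholder.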

Cluster defaults (Slurm)

For Slurm job waits, filter polling to the current project/user scope whenever that context is available.
If project cluster env is required, prefer loading it explicitly (for example via CLUSTER_ENV_FILE="$PWD/.env").
If .env is missing, attempt high-confidence reconstruction from project conventions first; if confidence is low, fail fast and ask.
Stay attached through both PENDING and RUNNING until all scoped jobs reach terminal state and the best-available outputs/logs/results look sane.
If waits expose clearly lingering/stuck jobs or suspicious outputs, hand off to cluster triage/cancellation workflow before resuming normal execution.
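A sketch of user-scoped Slurm polling, assuming `squeue` is on the PATH and that filtering by user is the right scope for the project. The helper names and the set of states treated as active are assumptions made for illustration.

```python
import getpass

# States treated as "still active" in this sketch (an assumption; adjust per site).
ACTIVE_STATES = {"PENDING", "CONFIGURING", "RUNNING", "COMPLETING"}


def squeue_command(user=None):
    """Build a squeue call limited to one user, printing 'jobid STATE' per line."""
    user = user or getpass.getuser()
    return ["squeue", "--user", user, "--noheader", "--format", "%i %T"]


def active_jobs(squeue_output):
    """Map job id -> state for jobs that are still pending or running."""
    jobs = {}
    for line in squeue_output.splitlines():
        parts = line.split()
        if len(parts) == 2 and parts[1] in ACTIVE_STATES:
            jobs[parts[0]] = parts[1]
    return jobs
```

An empty result only shows that no scoped job is still queued or running. Finished jobs drop out of `squeue`, so it says nothing about whether they succeeded; the outputs and logs still need to be checked.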

Quick Start

Exit-code mode

Use when the status command returns 0 only once the task is complete.

python "/path/to/wait-for-job/scripts/poll_until_done.py" \
  --check-cmd "test -f /tmp/job.done" \
  --interval-seconds 120

Regex mode (cluster-style polling)

Use when the status command always exits 0 but its output changes over time.

python "/path/to/wait-for-job/scripts/poll_until_done.py" \
  --check-cmd "kubectl get job my-job -n my-namespace -o jsonpath='{.status.conditions[*].type}'" \
  --success-regex "Complete" \
  --failure-regex "Failed" \
  --interval-seconds 120

Script

Use scripts/poll_until_done.py.

Arguments:

--check-cmd: shell command to evaluate each poll.
--success-regex: optional regex that marks completion from command output.
--failure-regex: optional repeatable regex for terminal failure output.
--interval-seconds: polling interval (default 120).
--timeout-seconds: wall-clock timeout in seconds, required range 1..28800 (default 28800).
--max-attempts: max polling attempts (0 disables cap).
--retry-on-nonzero: in regex mode, continue polling when command exits non-zero.
--quiet: print only terminal outcome messages.

Exit codes:

0: task completed.
1: timeout or max attempts reached.
2: failure condition detected or fatal command error.
130: interrupted.
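The exit codes listed above can be mapped to the terminal states used in this skill. The mapping below is a sketch: the function name `interpret_exit` and the state labels are assumptions, and exit code 0 is deliberately not reported as done until secondary evidence has been checked.

```python
def interpret_exit(code):
    """Translate a poller exit code into (state, next_step)."""
    outcomes = {
        0: ("success-signal", "validate logs/outputs/results before calling it done"),
        1: ("blocked", "timeout or max attempts reached; report evidence and a retry command"),
        2: ("blocked", "failure condition or fatal command error; report the matched evidence"),
        130: ("interrupted", "polling was interrupted; resume or report wait-only status"),
    }
    # Codes not listed above are treated as blocked.
    return outcomes.get(code, ("blocked", f"unexpected exit code {code}; report it"))
```

Treating unknown codes as blocked keeps the agent from continuing silently after an exit the skill does not describe.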

Agent Rules

Wait for poller completion before doing downstream steps.
Do not treat queued, waiting, or quiet states as completion.
Do not treat the primary success signal as enough when the best-available logs/outputs/results disagree.
Report the exact check command and completion signal used.
Report what secondary evidence was checked, or say explicitly when none was available.
If polling fails for transient transport reasons, retry up to 3 attempts with short backoff before reporting blocked.
Ask for clarification only when the completion signal cannot be inferred.

olliecrow/wait-for-job