cluster-monitor

Multi-agent collaboration

Default to using subagents when they are likely to improve speed, quality, or confidence, or to keep the main context clean.
Use subagents to widen coverage, dig deeper on one thread, get a fresh second opinion, or keep the main thread clean while side work runs.
Split work into clear packets with owners, inputs, acceptance checks, and a synthesis step when parallelizing.
Keep the main agent focused on synthesis, unblockers, and the next critical-path step; let subagents handle bounded side work that can run in parallel.
Use single-agent execution only when scope is small or coordination overhead outweighs gains.

Cluster-monitor-specific subagent split

Default to subagents for long waits or when scheduler state and output health both need steady checking.
Suggested split: queue-watcher owns scheduler deltas, timing changes, and stuck-state signals, and updates plan/current/cluster-monitor.md.
Suggested split: artifact-watcher owns logs, outputs, result files, and sanity checks, and updates the same note with evidence paths and latest health.
Suggested split: fresh-checker does a clean second pass before intervention, cancel/resubmit, or final closeout when the state is messy or surprising.
The main agent owns intervention thresholds, scoped cancel or resubmit decisions, and the final merged story.
Use subagents only when the work splits cleanly; otherwise stay single-agent.

Overview

cluster-monitor is the primary cluster skill and subsumes the former cluster-check behavior.

Use it for:

quick operational status (squeue/sacct/sinfo/QoS),
long-running queue-to-completion monitoring for hours or days,
deep triage and microscope-level analysis of logs, outputs, and results,
intervention workflows when continuing would likely generate invalid results or expensive reruns.

Default objective: maximize correct completion and valid learning throughput while minimizing wasted wall-clock and duplicate reruns.

Proactive autonomy and knowledge compounding

Be proactive: immediately take the next highest-value in-scope action when it is clear.
Default to autonomous execution: do not pause for confirmation between normal in-scope steps.
Request user input only when absolutely necessary: ambiguous requirements, material risk tradeoffs, missing required data/access, or destructive/irreversible actions outside policy.
If blocked by command/tool/env failures, attempt high-confidence fallbacks autonomously before escalating (for example rg -> find/grep, python -> python3, alternate repo-native scripts).
When the workflow uses plan/, ensure required plan directories exist before reading/writing them (create when edits are allowed; otherwise use an in-memory fallback and call it out).
Treat transient external failures (network/SSH/remote APIs/timeouts) as retryable by default: run bounded retries with backoff and capture failure evidence before concluding blocked.
On repeated invocations for the same objective, resume from prior findings/artifacts and prioritize net-new progress over rerunning identical work unless verification requires reruns.
Drive work to complete outcomes with verification, not partial handoffs.
Treat iterative execution as the default for non-trivial work; run adaptive loop passes. Example loops (adapt as needed, not rigid): issue-resolution investigate -> plan -> fix -> verify -> battletest -> organise-docs -> git-commit -> re-review; cleanup scan -> prioritize -> clean -> verify -> re-scan; docs audit -> update -> verify -> re-audit.
Keep looping until actual completion criteria are met: no actionable in-scope items remain, verification is green, and confidence is high.
Run organise-docs frequently during execution to capture durable decisions and learnings, not only at the end.
Create small checkpoint commits frequently with git-commit when changes are commit-eligible, checks are green, and repo policy permits commits.
Never squash commits; always use merge commits when integrating branches.
Prefer simplification over added complexity: aggressively remove bloat, redundancy, and over-engineering while preserving correctness.
When you touch code, leave the touched area in a better state than you found it: clearer, simpler, tidier, and at least as performant unless the task requires an explicit trade-off.
Use simple, plain English in user messages, docs, notes, reports, code comments, and other explanatory writing. Avoid jargon, fancy wording, and complex phrasing. When a technical term is needed for correctness, explain it in simple words the first time. Default to short user-facing responses. Think about what the user most wants to know, and lead with that. Do not dump every detail by default. Always include important changes, blockers, verification gaps, and any important assumptions, nuances, principles, or decisions that shaped the work. Add more detail only when the user asks for it or when uncertainty or risk makes it necessary.
Compound knowledge continuously: keep docs/ accurate and up to date, and promote durable learnings and decisions from work into docs.

Long-task checkpoint cadence

For any non-trivial task (including long efforts), run recurring checkpoint cycles instead of waiting for a single end-of-task wrap-up.
At each meaningful milestone with commit-eligible changes, and at least once per major phase, invoke git-commit to create a small logical checkpoint commit once relevant checks are green and repo policy permits commits.
At the same cadence, invoke organise-docs whenever durable learnings/decisions appear, and prune stale plan/ scratch artifacts.
If either checkpoint is blocked (for example failing checks or low-confidence documentation), resolve or record the blocker immediately and retry before expanding scope.

Interruption handoff (required for long runs)

Before stopping, yielding to a fresh session, or ending due to external interruption, write a concise handoff in plan/handoffs/ or plan/current/notes.md.
The handoff must include:
- monitored scope (project, cluster_user, job IDs/batches),
- latest state counts and most recent progress table,
- last evidence checked (logs, outputs, metrics, queue snapshot time),
- current blocker or reason for pause,
- exact next step and next recommended poll window.
On repeat invocations, read the latest handoff first and continue from deltas rather than rebuilding context from scratch.
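
A minimal sketch of a handoff note that carries the five required items. The `Handoff` class, its field names, and `render_handoff` are made up for illustration; the real note can take any shape as long as the same items are present.

```python
from dataclasses import dataclass
from typing import Dict, List


@dataclass
class Handoff:
    """Hypothetical container for the five required handoff items."""
    scope: str                    # project, cluster_user, job IDs/batches
    state_counts: Dict[str, int]  # latest state counts
    last_evidence: List[str]      # logs, outputs, metrics, queue snapshot time
    blocker: str                  # current blocker or reason for pause
    next_step: str                # exact next step
    next_poll_window: str         # next recommended poll window


def render_handoff(h: Handoff) -> str:
    """Render the handoff as a short bullet list, one line per required item."""
    counts = ", ".join(f"{name}={value}" for name, value in h.state_counts.items())
    evidence = "; ".join(h.last_evidence) or "none"
    lines = [
        f"- monitored scope: {h.scope}",
        f"- latest state counts: {counts}",
        f"- last evidence checked: {evidence}",
        f"- blocker or reason for pause: {h.blocker}",
        f"- next step: {h.next_step} (next poll: {h.next_poll_window})",
    ]
    return "\n".join(lines) + "\n"
```

The returned text can be written to plan/handoffs/ or appended to plan/current/notes.md.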

Terminal state contract (must follow)

The skill is complete only when all of the following are true:

Objective completion: the user-requested outcome is achieved, or explicitly marked blocked with concrete blocker evidence.
Workflow completion: every required workflow step is resolved as done, blocked, or not-applicable, with brief evidence or rationale.
Step-level terminal completion: each numbered subtask must have explicit completion evidence (artifact, command output, or written rationale) before advancing.
Verification completion: required checks/validations for this skill are executed, or any unavailable checks are explicitly called out with impact.
Findings completion (where applicable): report only evidence-backed findings; if no high-confidence critical findings are present, explicitly state that.
Loop completion: no actionable in-scope next step remains under the current objective.

Stop only after this terminal contract is satisfied; otherwise continue iterating.

Terminal state examples (adapt to skill)

done: monitored job set reaches the requested end condition, every scoped job was watched through queue/pending and running to terminal state, and the best-available outputs/results/logs were checked and found consistent with the intended run.
blocked: scheduler/cluster access or required project wiring is unavailable after bounded retries; blocker evidence and exact unblock command are reported.
not-applicable: intervention steps are skipped with rationale when high-bar intervention criteria are not met.

Scope and identity (must establish first)

Determine and record:

project_root: current repo root.
project_name: inferred from repo basename unless overridden by explicit user instruction.
cluster_user: from env/config/project scripts; fall back to an inferred value only when confidence is high.
job_prefix or batch identifiers: from project scripts, submitted job IDs, or naming conventions.
cluster_host: from project cluster wrappers/env/ssh config.

Build the monitored set from both scopes:

current conversation job IDs (if already referenced in the thread/artifacts), and
current project jobs for the current cluster user (derived from prefix/project filters).

Always scope cancellation and cleanup to the current project + cluster user + selected monitored set.
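
The two scopes above can be merged roughly as follows. This is a sketch only: `build_monitored_set` is a hypothetical helper, and it assumes project jobs arrive as (job ID, job name) pairs already filtered to the current cluster user.

```python
from typing import Dict, Iterable, Tuple


def build_monitored_set(
    conversation_ids: Iterable[str],
    project_jobs: Iterable[Tuple[str, str]],
    job_prefix: str,
) -> Dict[str, str]:
    """Return job_id -> origin for every job that should be monitored.

    project_jobs is assumed to hold (job_id, job_name) pairs for the
    current cluster user only.
    """
    monitored: Dict[str, str] = {}
    for job_id, job_name in project_jobs:
        if job_name.startswith(job_prefix):
            monitored[job_id] = "project-derived"
    # Conversation-linked jobs are always included and keep that label,
    # even when the prefix filter also matched them.
    for job_id in conversation_ids:
        monitored[job_id] = "conversation-linked"
    return monitored
```

Keeping the origin label per job makes it easy to limit later cancel or cleanup steps to this set.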

Modes

Quick-status mode (must support)

Use for fast operational answers such as queue usage, node capacity, QoS, and completion checks.

Requirements:

Keep it read-only and fast.
Prefer live scheduler evidence (squeue, sinfo, sacct, sacctmgr show qos).
Return concrete timestamps and units.
If access is blocked, provide blocker evidence and exact unblock command.
Stop after requested status unless user asks to escalate.
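
As an illustration of a read-only status answer with a timestamp, the sketch below counts job states from scheduler text. It assumes the text came from something like `squeue -h -u <cluster_user> -o "%T"` (one state name per line); the helper names are hypothetical and the command itself is not run here.

```python
from collections import Counter
from datetime import datetime, timezone
from typing import Optional


def count_states(squeue_output: str) -> Counter:
    """Count job states, assuming one state name per non-empty line."""
    return Counter(
        line.strip() for line in squeue_output.splitlines() if line.strip()
    )


def status_line(squeue_output: str, now: Optional[datetime] = None) -> str:
    """Build a one-line summary that starts with a UTC timestamp."""
    now = now or datetime.now(timezone.utc)
    counts = count_states(squeue_output)
    parts = ", ".join(f"{state}={n}" for state, n in sorted(counts.items()))
    return f"{now.isoformat(timespec='seconds')} {parts or 'no jobs'}"
```

Passing `now` explicitly keeps the function easy to check; in normal use it defaults to the current UTC time.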

Deep-monitor mode (default for long runs)

Use for long-running monitoring, deep diagnosis, and intervention decisions.

Requirements:

Stay attached from PENDING/queued through RUNNING to terminal completion; queue time is still active monitoring time, not a reason to stop waiting.
Monitor over time with low-noise polling and state deltas.
Inspect any accessible logs/outputs/results/files under a microscope in addition to scheduler state; scheduler state alone is not enough.
Do not jump straight to final result digging while jobs are still queued or running; keep monitoring until they actually finish, and intervene if needed.
Do not stop at scheduler terminal state alone; keep going until the finished outputs/results/logs have been sanity-checked against the intended job outcome.
Intervene only when high-bar criteria are met.

Trigger phrases

Use quick-status mode for prompts like:

check current cluster usage
per-node cpu/gpu usage
what is the qos
are jobs finished
cluster configuration right now

Use deep-monitor mode for prompts like:

monitor these jobs until done
watch this slurm batch and intervene only if needed
diagnose cluster failures and resubmit if required
microscope-check logs, outputs, results, and files while jobs wait or run

Prompt templates

Use these copy-paste templates:

[$cluster-monitor] quick-status: per-node cpu/gpu usage, queue counts by state, and qos values with timestamp.
[$cluster-monitor] quick-status: are all conversation/project jobs finished? include a submitted/running/pending/completed/failed/canceled summary table.
[$cluster-monitor] deep-monitor: monitor current conversation jobs + current project jobs from queue/pending through running to terminal completion, inspect scheduler/log/output/result/file evidence under a microscope, do not stop at scheduler completion alone, include a submitted/running/pending/completed/failed/canceled/intervened summary table at each material update, and intervene only if invalid-output or costly-rerun risk is high.
[$cluster-monitor] deep-monitor: if intervention is warranted, cancel scoped jobs, clean logs/outputs/cache/temp + disk pressure, apply verified fixes, resubmit, and continue monitoring until the finished outputs, logs, and results have been checked and still make sense.

Monitoring posture and intervention bar

Be patient by default; queue waiting alone is not intervention-worthy.
Accept isolated exploratory failures when resulting outputs remain usable.
Maintain a high intervention bar: intervene only when continuing is likely to:
- produce invalid/corrupt/unusable outputs, or
- force costly reruns that duplicate substantial work/time.

Default policy bands (batch-level, same failure pattern):

watch band: up to 10% similar failures,
escalation band: >10% similar failures,
intervention band: >=15% similar failures plus evidence of invalid output risk or rerun inevitability.

Hard-stop override (intervene earlier):

deterministic configuration bug affecting most jobs,
widespread invalid artifacts (NaNs/empty/corrupt metrics),
deterministic crash loop showing current configuration cannot produce valid results,
disk pressure from failed/canceled artifacts that risks breaking active/future runs.
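
The policy bands and the hard-stop override can be read as a small decision function. This is one possible reading, not a fixed rule: `classify_band` and its arguments are hypothetical names, and whether a hard-stop condition holds is left to the caller.

```python
def classify_band(
    failed: int,
    total: int,
    invalid_or_rerun_evidence: bool,
    hard_stop: bool = False,
) -> str:
    """Map a batch's similar-failure count to watch / escalation / intervention."""
    if total <= 0:
        raise ValueError("total must be positive")
    if hard_stop:
        # Hard-stop conditions justify intervening below the 15% band.
        return "intervention"
    # Integer comparisons avoid floating-point edge cases at 10% and 15%.
    if failed * 100 >= 15 * total and invalid_or_rerun_evidence:
        return "intervention"
    if failed * 100 > 10 * total:
        return "escalation"
    return "watch"
```

Note that 15% or more failures without evidence of invalid output or an unavoidable rerun stays in the escalation band.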

Workflow

1) Preflight and wiring

Confirm pwd, repo root, branch, and required tools.
Prefer project-native cluster wrappers/scripts; fall back to ssh + squeue/sacct/scontrol only when needed.
Validate connectivity; retry transient failures with bounded backoff.
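
A minimal sketch of bounded retries with backoff for the connectivity check. The function name, the default attempt count, and the delays are assumptions chosen for illustration; tune them to the project.

```python
import time
from typing import Callable, List, Tuple, Type, TypeVar

T = TypeVar("T")


def retry_with_backoff(
    action: Callable[[], T],
    attempts: int = 3,
    base_delay: float = 2.0,
    max_delay: float = 30.0,
    retryable: Tuple[Type[BaseException], ...] = (OSError,),
    sleep: Callable[[float], None] = time.sleep,
) -> Tuple[T, List[str]]:
    """Run action up to `attempts` times, doubling the wait after each failure.

    Returns the result plus the failure evidence collected along the way.
    Raises RuntimeError with all evidence once the attempts are used up.
    """
    errors: List[str] = []
    for attempt in range(attempts):
        try:
            return action(), errors
        except retryable as exc:
            errors.append(f"attempt {attempt + 1}: {exc!r}")
            if attempt + 1 < attempts:
                sleep(min(max_delay, base_delay * 2 ** attempt))
    raise RuntimeError("blocked after bounded retries: " + "; ".join(errors))
```

`ConnectionError` and `TimeoutError` are subclasses of `OSError`, so the default covers common network failures. The collected error strings are the blocker evidence to report if the check never succeeds.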

2) Build monitored inventory

Collect live queue for scoped jobs (RUNNING, PENDING, reasons, nodes).
Collect recent states/history (sacct) for monitored batches.
Resolve batch grouping using project manifests when available; otherwise use prefix/time windows.
Record which jobs are conversation-linked vs project-derived.

3) Low-noise monitoring loop

Cadence defaults:

pending only with legitimate reasons: every 600-900s,
active running jobs: every 180-300s,
degraded/suspect state: every 60-120s.
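
The cadence defaults above amount to a small lookup. In this sketch `poll_window_seconds` is a hypothetical helper, and deciding whether the state is suspect (for example, a pending reason that does not look legitimate) is left to the caller.

```python
from typing import Tuple


def poll_window_seconds(running: int, suspect: bool) -> Tuple[int, int]:
    """Return the (min, max) wait in seconds before the next poll."""
    if suspect:
        return (60, 120)    # degraded/suspect state
    if running > 0:
        return (180, 300)   # active running jobs
    return (600, 900)       # pending only, with legitimate reasons
```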

At each poll, gather and compare deltas:

scheduler state counts and transitions,
queue reasons and queue-time changes for pending jobs,
new stderr/stdout error signatures,
new output/result artifacts and integrity signals,
progress and completion markers,
signals that the produced work still matches the intended run shape (expected files, expected metric ranges, expected artifact freshness, or other project-specific sanity clues when available).

Even when every monitored job is still pending, remain in the loop and keep checking the accessible scheduler/log/output/result/file evidence until jobs finish or intervention is required.

Prefer compact snapshots over repeated full table dumps. Every material update must include a standardized progress table with at least:

submitted
running
pending
completed
failed
canceled
intervened

If a field is unknown, say unknown explicitly rather than omitting it.
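
One way to render the standardized progress table so that no field is ever left out. The layout and the `render_progress_table` name are illustrative assumptions; only the seven field names come from the list above.

```python
from typing import Mapping, Optional

FIELDS = (
    "submitted", "running", "pending", "completed",
    "failed", "canceled", "intervened",
)


def render_progress_table(counts: Mapping[str, Optional[int]]) -> str:
    """Render one row per required field; missing or None values show as 'unknown'."""
    rows = []
    for field in FIELDS:
        value = counts.get(field)
        shown = "unknown" if value is None else str(value)
        rows.append(f"{field:<10} {shown}")
    return "\n".join(rows)
```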

4) Microscope-level diagnostics

For changed or suspect jobs, inspect deeply:

stdout/stderr tails and targeted searches (Traceback, ERROR, Exception, OOM, timeout, NaN, corruption patterns),
output files, metrics summaries, manifests, checkpoints,
signs of wrapper-level vs root-cause mismatch,
sync integrity between remote and local artifacts,
whether the produced artifacts look complete, non-empty, current, and internally consistent enough to support the intended outcome.
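
A small sketch of the targeted search over a log tail. The regular expressions are assumptions that cover only the signature names listed above; real projects will usually need extra, project-specific patterns.

```python
import re
from typing import Dict, List

# Illustrative patterns for the signatures named above.
PATTERNS = {
    "Traceback": re.compile(r"Traceback"),
    "ERROR": re.compile(r"\bERROR\b"),
    "Exception": re.compile(r"Exception"),
    "OOM": re.compile(r"\bOOM\b"),
    "timeout": re.compile(r"timeout", re.IGNORECASE),
    "NaN": re.compile(r"\bnan\b", re.IGNORECASE),
}


def scan_log_tail(text: str, tail_lines: int = 200) -> Dict[str, List[str]]:
    """Return signature name -> matching lines from the last `tail_lines` lines."""
    hits: Dict[str, List[str]] = {}
    for line in text.splitlines()[-tail_lines:]:
        for name, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(name, []).append(line.strip())
    return hits
```

Comparing the hits between two polls gives the new error signatures without re-reading the whole log.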

Do this while jobs are pending or running whenever the evidence is available; do not wait until after completion to start checking logs, outputs, or produced files.

Classify health as:

healthy, degraded, or systemic.

5) Intervention decision

Do not intervene for:

normal scheduler waiting,
isolated corner-case failures with intact learning value,
mixed evidence where valid coverage remains likely.

Intervene only when evidence indicates invalid outputs or expensive reruns are likely without action.

6) Intervention workflow (mandatory sequence)

When intervention is justified, execute this order:

1. Diagnose and define the exact fix plan.
2. Cancel only affected scoped jobs (project/user/monitored set).
3. Clean up canceled-job artifacts that could pollute future runs:
   - logs,
   - partial outputs/results,
   - stale caches/scratch,
   - orphaned temporary files.
4. Check and reduce disk pressure caused by canceled/failed artifacts.
5. Implement high-confidence code/config fixes.
6. Validate fixes with relevant local checks/smoke tests before resubmission.
7. Resubmit jobs with clear hypothesis notes and record new job IDs.
8. Return to the deep-monitor loop and verify the fix in real run behavior.

Never cancel/cleanup outside scoped project/user ownership.

7) Completion workflow

Start this section only after the monitored jobs reach terminal states. Until then, stay in the monitoring loop.

When monitored jobs finish:

sync all relevant outputs,
run immediate post-run analysis of metrics, anomalies, and failures,
verify the finished job set matches the intended scope and that no expected job is still queued, running, missing, or silently dropped,
verify the final outputs/results/logs are present and sane enough for the intended use instead of assuming scheduler COMPLETED means success,
separate informative failures from wasted failures,
capture durable learnings and rationale in docs/code/tests,
propose next experiments only when asked or clearly warranted.
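
The scope check in the completion list can be sketched as a set comparison. The `check_scope` helper is hypothetical, and the terminal state names are the common Slurm ones, listed here as an assumption; adjust them to what the scheduler actually reports.

```python
from typing import Dict, Iterable, List, Mapping

# Assumed terminal state names (common Slurm values).
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT",
    "OUT_OF_MEMORY", "NODE_FAIL", "PREEMPTED", "BOOT_FAIL", "DEADLINE",
}


def check_scope(
    expected_ids: Iterable[str],
    final_states: Mapping[str, str],
) -> Dict[str, List[str]]:
    """Report expected jobs that are missing or not yet in a terminal state."""
    expected = list(expected_ids)
    missing = sorted(j for j in expected if j not in final_states)
    not_terminal = []
    for job_id in expected:
        state = final_states.get(job_id, "").strip()
        if not state:
            continue
        # Keep only the first word so "CANCELLED by 1001" counts as CANCELLED.
        if state.split()[0] not in TERMINAL_STATES:
            not_terminal.append(job_id)
    return {"missing": missing, "not_terminal": sorted(not_terminal)}
```

Both lists being empty only shows that the job set is complete; the outputs, results, and logs still need their own sanity check.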

Reporting format (required)

Keep the report short by default.

In most updates include only:

Findings summary.
Scope and identity (project/user/prefix/host).
Standardized progress table (submitted/running/pending/completed/failed/canceled/intervened).
Intervention decision and rationale against the high-bar rule.
Next step, blocker, or completion state.

Add timeline detail, microscope diagnostics, sync status, learnings, or the evidence runbook only when they changed, when they are needed to justify the decision, or when the user asks.

Skill composition

When this skill is triggered, compose other skills as needed:

Use investigate when a suspicious job state, log pattern, or result artifact needs deeper root-cause work before intervening.
Use verify after any scoped fix or resubmission to prove the fix changed the real failing behavior.
Use battletest when a cluster-found issue points to a broader workflow risk that should be checked outside the single monitored run.
Use organise-docs when monitoring or intervention establishes durable operating limits, failure modes, or recovery rules worth keeping.
Use git-commit or checkpoint once a real fix is verified and the repo state is commit-eligible.

If there is a conflict, live monitoring correctness wins: companion skills should help explain, fix, verify, and preserve learning without breaking attachment to the scoped run.

Repeat invocations

Resume from prior state and focus on deltas.
If no material change occurred, report no material change and continue waiting.
Keep monitoring until completion or until intervention criteria are clearly met.

cluster-monitor

Multi-agent collaboration

Default to using subagents when they are likely to improve speed, quality, confidence, or keep the main context clean.
Use subagents to widen coverage, dig deeper on one thread, get a fresh second opinion, or keep the main thread clean while side work runs.
Split work into clear packets with owners, inputs, acceptance checks, and a synthesis step when parallelizing.
Keep the main agent focused on synthesis, unblockers, and the next critical-path step; let subagents handle bounded side work that can run in parallel.
Use single-agent execution only when scope is small or coordination overhead outweighs gains.

Cluster-monitor-specific subagent split

Default to subagents for long waits or when scheduler state and output health both need steady checking.
Suggested split: queue-watcher owns scheduler deltas, timing changes, and stuck-state signals, and updates plan/current/cluster-monitor.md.
Suggested split: artifact-watcher owns logs, outputs, result files, and sanity checks, and updates the same note with evidence paths and latest health.
Suggested split: fresh-checker does a clean second pass before intervention, cancel/resubmit, or final closeout when the state is messy or surprising.
The main agent owns intervention thresholds, scoped cancel or resubmit decisions, and the final merged story.
Use subagents only when the work splits cleanly; otherwise stay single-agent.

Overview

cluster-monitor is the primary cluster skill and subsumes former cluster-check behavior.

Use it for:

quick operational status (squeue/sacct/sinfo/QoS),
long-running queue-to-completion monitoring for hours or days,
deep triage and microscope-level analysis of logs, outputs, and results,
intervention workflows when continuing would likely generate invalid results or expensive reruns.

Default objective: maximize correct completion and valid learning throughput while minimizing wasted wall-clock and duplicate reruns.

Proactive autonomy and knowledge compounding

Be proactive: immediately take the next highest-value in-scope action when it is clear.
Default to autonomous execution: do not pause for confirmation between normal in-scope steps.
Request user input only when absolutely necessary: ambiguous requirements, material risk tradeoffs, missing required data/access, or destructive/irreversible actions outside policy.
If blocked by command/tool/env failures, attempt high-confidence fallbacks autonomously before escalating (for example rg -> find/grep, python -> python3, alternate repo-native scripts).
When the workflow uses plan/, ensure required plan directories exist before reading/writing them (create when edits are allowed; otherwise use an in-memory fallback and call it out).
Treat transient external failures (network/SSH/remote APIs/timeouts) as retryable by default: run bounded retries with backoff and capture failure evidence before concluding blocked.
On repeated invocations for the same objective, resume from prior findings/artifacts and prioritize net-new progress over rerunning identical work unless verification requires reruns.
Drive work to complete outcomes with verification, not partial handoffs.
Treat iterative execution as the default for non-trivial work; run adaptive loop passes. Example loops (adapt as needed, not rigid): issue-resolution investigate -> plan -> fix -> verify -> battletest -> organise-docs -> git-commit -> re-review; cleanup scan -> prioritize -> clean -> verify -> re-scan; docs audit -> update -> verify -> re-audit.
Keep looping until actual completion criteria are met: no actionable in-scope items remain, verification is green, and confidence is high.
Run organise-docs frequently during execution to capture durable decisions and learnings, not only at the end.
Create small checkpoint commits frequently with git-commit when changes are commit-eligible, checks are green, and repo policy permits commits.
Never squash commits; always use merge commits when integrating branches.
Prefer simplification over added complexity: aggressively remove bloat, redundancy, and over-engineering while preserving correctness.
When you touch code, leave the touched area in a better state than you found it: clearer, simpler, tidier, and at least as performant unless the task requires an explicit trade-off.
Use simple, plain English in user messages, docs, notes, reports, code comments, and other explanatory writing. Avoid jargon, fancy wording, and complex phrasing. When a technical term is needed for correctness, explain it in simple words the first time. Default to short user-facing responses. Think about what the user most wants to know, and lead with that. Do not dump every detail by default. Always include important changes, blockers, verification gaps, and any important assumptions, nuances, principles, or decisions that shaped the work. Add more detail only when the user asks for it or when uncertainty or risk makes it necessary.
Compound knowledge continuously: keep docs/ accurate and up to date, and promote durable learnings and decisions from work into docs.

Long-task checkpoint cadence

For any non-trivial task (including long efforts), run recurring checkpoint cycles instead of waiting for a single end-of-task wrap-up.
At each meaningful milestone with commit-eligible changes, and at least once per major phase, invoke git-commit to create a small logical checkpoint commit once relevant checks are green and repo policy permits commits.
At the same cadence, invoke organise-docs whenever durable learnings/decisions appear, and prune stale plan/ scratch artifacts.
If either checkpoint is blocked (for example failing checks or low-confidence documentation), resolve or record the blocker immediately and retry before expanding scope.

Interruption handoff (required for long runs)

Before stopping, yielding to a fresh session, or ending due to external interruption, write a concise handoff in plan/handoffs/ or plan/current/notes.md.
The handoff must include:
- monitored scope (project, cluster_user, job IDs/batches),
- latest state counts and most recent progress table,
- last evidence checked (logs, outputs, metrics, queue snapshot time),
- current blocker or reason for pause,
- exact next step and next recommended poll window.
On repeat invocations, read the latest handoff first and continue from deltas rather than rebuilding context from scratch.

Terminal state contract (must follow)

The skill is complete only when all of the following are true:

Objective completion: the user-requested outcome is achieved, or explicitly marked blocked with concrete blocker evidence.
Workflow completion: every required workflow step is resolved as done, blocked, or not-applicable, with brief evidence or rationale.
Step-level terminal completion: each numbered subtask must have explicit completion evidence (artifact, command output, or written rationale) before advancing.
Verification completion: required checks/validations for this skill are executed, or any unavailable checks are explicitly called out with impact.
Findings completion (where applicable): report only evidence-backed findings; if no high-confidence critical findings are present, explicitly state that.
Loop completion: no actionable in-scope next step remains under the current objective.

Stop only after this terminal contract is satisfied; otherwise continue iterating.

Terminal state examples (adapt to skill)

done: monitored job set reaches the requested end condition, every scoped job was watched through queue/pending and running to terminal state, and the best-available outputs/results/logs were checked and found consistent with the intended run.
blocked: scheduler/cluster access or required project wiring is unavailable after bounded retries; blocker evidence and exact unblock command are reported.
not-applicable: intervention steps are skipped with rationale when high-bar intervention criteria are not met.

Scope and identity (must establish first)

Determine and record:

project_root: current repo root.
project_name: inferred from repo basename unless overridden by explicit user instruction.
cluster_user: from env/config/project scripts, falling back only when high confidence.
job_prefix or batch identifiers: from project scripts, submitted job IDs, or naming conventions.
cluster_host: from project cluster wrappers/env/ssh config.

Build the monitored set from both scopes:

current conversation job IDs (if already referenced in the thread/artifacts), and
current project jobs for the current cluster user (derived from prefix/project filters).

Always scope cancellation and cleanup to the current project + cluster user + selected monitored set.

Modes

Quick-status mode (must support)

Use for fast operational answers such as queue usage, node capacity, QoS, and completion checks.

Requirements:

Keep it read-only and fast.
Prefer live scheduler evidence (squeue, sinfo, sacct, scontrol show qos).
Return concrete timestamps and units.
If access is blocked, provide blocker evidence and exact unblock command.
Stop after requested status unless user asks to escalate.

Deep-monitor mode (default for long runs)

Use for long-running monitoring, deep diagnosis, and intervention decisions.

Requirements:

Stay attached from PENDING/queued through RUNNING to terminal completion; queue time is still active monitoring time, not a reason to stop waiting.
Monitor over time with low-noise polling and state deltas.
Inspect scheduler state plus any accessible logs/outputs/results/files under a microscope, not only scheduler states.
Do not jump straight to final result digging while jobs are still queued or running; keep monitoring until they actually finish, and intervene if needed.
Do not stop at scheduler terminal state alone; keep going until the finished outputs/results/logs have been sanity-checked against the intended job outcome.
Intervene only when high-bar criteria are met.

Trigger phrases

Use quick-status mode for prompts like:

check current cluster usage
per-node cpu/gpu usage
what is the qos
are jobs finished
cluster configuration right now

Use deep-monitor mode for prompts like:

monitor these jobs until done
watch this slurm batch and intervene only if needed
diagnose cluster failures and resubmit if required
microscope-check logs, outputs, results, and files while jobs wait or run

Prompt templates

Use these copy-paste templates:

[$cluster-monitor] quick-status: per-node cpu/gpu usage, queue counts by state, and qos values with timestamp.
[$cluster-monitor] quick-status: are all conversation/project jobs finished? include a submitted/running/pending/completed/failed/canceled summary table.
[$cluster-monitor] deep-monitor: monitor current conversation jobs + current project jobs from queue/pending through running to terminal completion, inspect scheduler/log/output/result/file evidence under a microscope, do not stop at scheduler completion alone, include a submitted/running/pending/completed/failed/canceled/intervened summary table at each material update, and intervene only if invalid-output or costly-rerun risk is high.
[$cluster-monitor] deep-monitor: if intervention is warranted, cancel scoped jobs, clean logs/outputs/cache/temp + disk pressure, apply verified fixes, resubmit, and continue monitoring until the finished outputs, logs, and results have been checked and still make sense.

Monitoring posture and intervention bar

Be patient by default; queue waiting alone is not intervention-worthy.
Accept isolated exploratory failures when resulting outputs remain usable.
Maintain a high intervention bar: intervene only when continuing is likely to:
- produce invalid/corrupt/unusable outputs, or
- force costly reruns that duplicate substantial work/time.

Default policy bands (batch-level, same failure pattern):

watch band: up to 10% similar failures,
escalation band: >10% similar failures,
intervention band: >=15% similar failures plus evidence of invalid output risk or rerun inevitability.

Hard-stop override (intervene earlier):

deterministic configuration bug affecting most jobs,
widespread invalid artifacts (NaNs/empty/corrupt metrics),
deterministic crash loop showing current configuration cannot produce valid results,
disk pressure from failed/canceled artifacts that risks breaking active/future runs.

Workflow

1) Preflight and wiring

Confirm pwd, repo root, branch, and required tools.
Prefer project-native cluster wrappers/scripts; fall back to ssh + squeue/sacct/scontrol only when needed.
Validate connectivity; retry transient failures with bounded backoff.

2) Build monitored inventory

Collect live queue for scoped jobs (RUNNING, PENDING, reasons, nodes).
Collect recent states/history (sacct) for monitored batches.
Resolve batch grouping using project manifests when available; otherwise use prefix/time windows.
Record which jobs are conversation-linked vs project-derived.

3) Low-noise monitoring loop

Cadence defaults:

pending only with legitimate reasons: every 600-900s,
active running jobs: every 180-300s,
degraded/suspect state: every 60-120s.

At each poll, gather and compare deltas:

scheduler state counts and transitions,
queue reasons and queue-time changes for pending jobs,
new stderr/stdout error signatures,
new output/result artifacts and integrity signals,
progress and completion markers,
signals that the produced work still matches the intended run shape (expected files, expected metric ranges, expected artifact freshness, or other project-specific sanity clues when available).

Even when every monitored job is still pending, remain in the loop and keep checking the accessible scheduler/log/output/result/file evidence until jobs finish or intervention is required.

Prefer compact snapshots over repeated full table dumps. Every material update must include a standardized progress table with at least:

submitted
running
pending
completed
failed
canceled
intervened

If a field is unknown, say unknown explicitly rather than omitting it.
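
A minimal sketch of rendering the standardized progress table, with missing values printed as unknown. The function name `render_progress` and the column width are arbitrary choices for illustration.

```python
from typing import Mapping, Optional

FIELDS = (
    "submitted", "running", "pending", "completed",
    "failed", "canceled", "intervened",
)


def render_progress(counts: Mapping[str, Optional[int]]) -> str:
    """Render one line per required field; absent or None values become 'unknown'."""
    lines = []
    for field in FIELDS:
        value = counts.get(field)
        shown = "unknown" if value is None else str(value)
        lines.append(f"{field:<11}{shown}")
    return "\n".join(lines)
```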

4) Microscope-level diagnostics

For changed or suspect jobs, inspect deeply:

stdout/stderr tails and targeted searches (Traceback, ERROR, Exception, OOM, timeout, NaN, corruption patterns),
output files, metrics summaries, manifests, checkpoints,
signs that the wrapper-level status or error disagrees with the underlying root cause,
sync integrity between remote and local artifacts,
whether the produced artifacts look complete, non-empty, current, and internally consistent enough to support the intended outcome.
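
The targeted searches over stdout/stderr tails could look like the sketch below. The regular expressions are illustrative guesses at the listed signatures, not a vetted set, and `scan_tail` is a hypothetical helper.

```python
import re
from typing import Dict, Iterable

# Illustrative patterns for the signatures named above; tune per project.
SIGNATURES = {
    "traceback": re.compile(r"Traceback"),
    "error": re.compile(r"\bERROR\b"),
    "exception": re.compile(r"Exception"),
    "oom": re.compile(r"\bOOM\b|out of memory", re.IGNORECASE),
    "timeout": re.compile(r"time ?out|timed out", re.IGNORECASE),
    "nan": re.compile(r"\bnan\b", re.IGNORECASE),
}


def scan_tail(lines: Iterable[str], tail: int = 200) -> Dict[str, int]:
    """Count matching lines per signature in the last `tail` lines."""
    recent = list(lines)[-tail:]
    hits: Dict[str, int] = {}
    for line in recent:
        for name, pattern in SIGNATURES.items():
            if pattern.search(line):
                hits[name] = hits.get(name, 0) + 1
    return hits
```

Comparing the returned counts between polls shows whether a signature is new or already known.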

Do this while jobs are pending or running whenever the evidence is available; do not wait until after completion to start checking logs, outputs, or produced files.

Classify health as:

healthy, degraded, or systemic.

5) Intervention decision

Do not intervene for:

normal scheduler waiting,
isolated corner-case failures with intact learning value,
mixed evidence where valid coverage remains likely.

Intervene only when evidence indicates invalid outputs or expensive reruns are likely without action.

6) Intervention workflow (mandatory sequence)

When intervention is justified, execute this order:

1. Diagnose and define exact fix plan.
2. Cancel only affected scoped jobs (project/user/monitored set).
3. Clean up canceled-job artifacts that could pollute future runs:
   - logs,
   - partial outputs/results,
   - stale caches/scratch,
   - orphaned temporary files.
4. Check and reduce disk pressure caused by canceled/failed artifacts.
5. Implement high-confidence code/config fixes.
6. Validate fixes with relevant local checks/smoke tests before resubmission.
7. Resubmit jobs with clear hypothesis notes and record new job IDs.
8. Return to deep-monitor loop and verify the fix in real run behavior.

Never cancel/cleanup outside scoped project/user ownership.
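
The scoping rule can be enforced before any cancel command is issued. In this sketch `plan_scoped_cancel` is a hypothetical helper that only builds `scancel <job_id>` argument lists and refuses any job outside the monitored set; it does not execute anything.

```python
from typing import Iterable, List, Set


def plan_scoped_cancel(affected: Iterable[str], monitored: Set[str]) -> List[List[str]]:
    """Build scancel argv lists for affected jobs, failing loudly on out-of-scope IDs."""
    commands = []
    for job_id in sorted(set(affected)):
        if job_id not in monitored:
            # Refuse the whole plan instead of silently skipping the job.
            raise ValueError(f"job {job_id} is outside the monitored scope")
        commands.append(["scancel", job_id])
    return commands
```

Raising on the first out-of-scope ID is a deliberate choice here: a plan that touches another user's or project's job indicates a scoping error that should be resolved before anything is canceled.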

7) Completion workflow

Start this section only after the monitored jobs reach terminal states. Until then, stay in the monitoring loop.

When monitored jobs finish:

sync all relevant outputs,
run immediate post-run analysis of metrics, anomalies, and failures,
verify the finished job set matches the intended scope and that no expected job is still queued, running, missing, or silently dropped,
verify the final outputs/results/logs are present and sane enough for the intended use instead of assuming scheduler COMPLETED means success,
separate informative failures from wasted failures,
capture durable learnings and rationale in docs/code/tests,
propose next experiments only when asked or clearly warranted.
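
Two of the completion checks, scope matching and output presence, can be sketched as follows. The terminal-state set is a subset of Slurm job states and assumes state strings have been normalized (for example, any suffix after `CANCELLED` removed); the function names are hypothetical.

```python
from pathlib import Path
from typing import Dict, Iterable, List, Set

# Assumed subset of Slurm terminal states, already normalized.
TERMINAL = {"COMPLETED", "FAILED", "CANCELLED", "TIMEOUT", "OUT_OF_MEMORY", "NODE_FAIL"}


def scope_gaps(expected: Set[str], final_states: Dict[str, str]) -> Dict[str, List[str]]:
    """Report expected jobs that are missing from history or not yet terminal."""
    missing = sorted(expected - set(final_states))
    not_terminal = sorted(
        job_id for job_id in expected
        if job_id in final_states and final_states[job_id] not in TERMINAL
    )
    return {"missing": missing, "not_terminal": not_terminal}


def empty_or_absent(paths: Iterable[str]) -> List[str]:
    """Return the paths that do not exist as files or have zero size."""
    bad = []
    for p in paths:
        path = Path(p)
        if not path.is_file() or path.stat().st_size == 0:
            bad.append(str(p))
    return bad
```

A `COMPLETED` job whose result file shows up in `empty_or_absent` is the case the workflow warns about: scheduler success without usable output.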

Reporting format (required)

Keep the report short by default.

In most updates include only:

Findings summary.
Scope and identity (project/user/prefix/host).
Standardized progress table (submitted/running/pending/completed/failed/canceled/intervened).
Intervention decision and rationale against the high-bar rule.
Next step, blocker, or completion state.

Add timeline detail, microscope diagnostics, sync status, learnings, or the evidence runbook only when they changed, when they are needed to justify the decision, or when the user asks.

Skill composition

When this skill is triggered, compose other skills as needed:

Use investigate when a suspicious job state, log pattern, or result artifact needs deeper root-cause work before intervening.
Use verify after any scoped fix or resubmission to prove the fix changed the real failing behavior.
Use battletest when a cluster-found issue points to a broader workflow risk that should be checked outside the single monitored run.
Use organise-docs when monitoring or intervention establishes durable operating limits, failure modes, or recovery rules worth keeping.
Use git-commit or checkpoint once a real fix is verified and the repo state is commit-eligible.

If there is a conflict, live monitoring correctness wins: companion skills should help explain, fix, verify, and preserve learning without breaking attachment to the scoped run.

Repeat invocations

Resume from prior state and focus on deltas.
If no material change occurred, report no material change and continue waiting.
Keep monitoring until completion or until intervention criteria are clearly met.
