skills/cluster-monitor/SKILL.md
Primary Slurm cluster skill for this workspace. Monitor current conversation jobs and current project jobs over long horizons with low-noise polling, microscope-level log/output/result inspection, and high-bar interventions only when not intervening would likely produce invalid results or force costly reruns. Stay attached from queue through running to terminal completion, keep checking scheduler state plus accessible logs/outputs/results/files throughout, and do not call the work done until the finished outputs still make sense. When intervention is warranted, cancel scoped jobs, clean up artifacts/cache/logs, implement and verify fixes, resubmit, and continue monitoring until validated completion.
- `queue-watcher` owns scheduler deltas, timing changes, and stuck-state signals, and updates `plan/current/cluster-monitor.md`.
- `artifact-watcher` owns logs, outputs, result files, and sanity checks, and updates the same note with evidence paths and latest health.
- `fresh-checker` does a clean second pass before intervention, cancel/resubmit, or final closeout when the state is messy or surprising.

`cluster-monitor` is the primary cluster skill and subsumes former `cluster-check` behavior.
Use it for:
- `squeue`/`sacct`/`sinfo`/QoS),

Default objective: maximize correct completion and valid learning throughput while minimizing wasted wall-clock and duplicate reruns.
- `rg` -> `find`/`grep`, `python` -> `python3`, alternate repo-native scripts).
- `plan/`, ensure required plan directories exist before reading/writing them (create when edits are allowed; otherwise use an in-memory fallback and call it out).
- `investigate` -> `plan` -> `fix` -> `verify` -> `battletest` -> `organise-docs` -> `git-commit` -> `re-review`; cleanup scan -> prioritize -> clean -> verify -> re-scan; docs audit -> update -> verify -> re-audit.
- `organise-docs` frequently during execution to capture durable decisions and learnings, not only at the end.
- `git-commit` when changes are commit-eligible, checks are green, and repo policy permits commits.
- `docs/` accurate and up to date, and promote durable learnings and decisions from work into docs.
- `git-commit` to create a small logical checkpoint commit once relevant checks are green and repo policy permits commits.
- `organise-docs` whenever durable learnings/decisions appear, and prune stale `plan/` scratch artifacts.
- `plan/handoffs/` or `plan/current/notes.md`.
- `project`, `cluster_user`, job IDs/batches),

The skill is complete only when all of the following are true:
- `blocked` with concrete blocker evidence.
- `done`, `blocked`, or `not-applicable`, with brief evidence or rationale.

Stop only after this terminal contract is satisfied; otherwise continue iterating.
- `done`: monitored job set reaches the requested end condition, every scoped job was watched through queue/pending and running to terminal state, and the best-available outputs/results/logs were checked and found consistent with the intended run.
- `blocked`: scheduler/cluster access or required project wiring is unavailable after bounded retries; blocker evidence and exact unblock command are reported.
- `not-applicable`: intervention steps are skipped with rationale when high-bar intervention criteria are not met.

Determine and record:
- `project_root`: current repo root.
- `project_name`: inferred from repo basename unless overridden by explicit user instruction.
- `cluster_user`: from env/config/project scripts, falling back only when high confidence.
- `job_prefix` or batch identifiers: from project scripts, submitted job IDs, or naming conventions.
- `cluster_host`: from project cluster wrappers/env/ssh config.

Build the monitored set from both scopes:
Always scope cancellation and cleanup to the current project + cluster user + selected monitored set.
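The scoping rule above can be sketched as a filter that runs before any cancel or cleanup command is issued. This is a minimal illustration, not the skill's actual implementation: the `Job` record, the `scoped_cancel_targets` helper, and the sample users and job names are all hypothetical, and it assumes project ownership can be recognized from a job-name prefix.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Job:
    """Hypothetical job record; field names are illustrative only."""
    job_id: str
    user: str
    name: str


def scoped_cancel_targets(jobs, cluster_user, job_prefix, monitored_ids):
    """Return job IDs that satisfy all three scope conditions at once:
    owned by cluster_user, named with the project's job_prefix,
    and already part of the monitored set."""
    return sorted(
        job.job_id
        for job in jobs
        if job.user == cluster_user
        and job.name.startswith(job_prefix)
        and job.job_id in monitored_ids
    )


# Hypothetical queue contents for illustration.
jobs = [
    Job("101", "alice", "myproj-train-0"),
    Job("102", "alice", "otherproj-eval"),
    Job("103", "bob", "myproj-train-1"),
    Job("104", "alice", "myproj-train-2"),
]
targets = scoped_cancel_targets(jobs, "alice", "myproj-", {"101", "103"})
print(targets)  # ['101']
```

Only job `101` passes: `102` belongs to another project, `103` to another user, and `104` is outside the monitored set. Requiring all three conditions together is what keeps a cancel from touching anything the current run does not own.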
Use for fast operational answers such as queue usage, node capacity, QoS, and completion checks.
Requirements:
- `squeue`, `sinfo`, `sacct`, `scontrol show qos`).

Use for long-running monitoring, deep diagnosis, and intervention decisions.
Requirements:
- `PENDING`/queued through `RUNNING` to terminal completion; queue time is still active monitoring time, not a reason to stop waiting.

Use quick-status mode for prompts like:
- `check current cluster usage`
- `per-node cpu/gpu usage`
- `what is the qos`
- `are jobs finished`
- `cluster configuration right now`

Use deep-monitor mode for prompts like:
- `monitor these jobs until done`
- `watch this slurm batch and intervene only if needed`
- `diagnose cluster failures and resubmit if required`
- `microscope-check logs, outputs, results, and files while jobs wait or run`

Use these copy-paste templates:
- `[$cluster-monitor] quick-status: per-node cpu/gpu usage, queue counts by state, and qos values with timestamp.`
- `[$cluster-monitor] quick-status: are all conversation/project jobs finished? include a submitted/running/pending/completed/failed/canceled summary table.`
- `[$cluster-monitor] deep-monitor: monitor current conversation jobs + current project jobs from queue/pending through running to terminal completion, inspect scheduler/log/output/result/file evidence under a microscope, do not stop at scheduler completion alone, include a submitted/running/pending/completed/failed/canceled/intervened summary table at each material update, and intervene only if invalid-output or costly-rerun risk is high.`
- `[$cluster-monitor] deep-monitor: if intervention is warranted, cancel scoped jobs, clean logs/outputs/cache/temp + disk pressure, apply verified fixes, resubmit, and continue monitoring until the finished outputs, logs, and results have been checked and still make sense.`

Default policy bands (batch-level, same failure pattern):
- `10%` similar failures,
- `>10%` similar failures,
- `>=15%` similar failures plus evidence of invalid output risk or rerun inevitability.

Hard-stop override (intervene earlier):
- `pwd`, repo root, branch, and required tools.
- `ssh` + `squeue`/`sacct`/`scontrol` only when needed.
- `RUNNING`, `PENDING`, reasons, nodes).
- `sacct`) for monitored batches.

Cadence defaults:
- `600-900s`,
- `180-300s`,
- `60-120s`.

At each poll, gather and compare deltas:
Even when every monitored job is still pending, remain in the loop and keep checking the accessible scheduler/log/output/result/file evidence until jobs finish or intervention is required.
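One way to compare deltas between polls is to diff two scheduler snapshots and report only what changed. The sketch below is an assumption-laden illustration: it assumes each snapshot is text in `job_id|state|reason` form (for example, as could be produced by `squeue -h -o "%i|%T|%r"`), and the helper names `parse_snapshot` and `snapshot_deltas` are hypothetical.

```python
def parse_snapshot(text):
    """Parse 'job_id|state|reason' lines into {job_id: (state, reason)}."""
    snapshot = {}
    for line in text.splitlines():
        line = line.strip()
        if not line:
            continue
        job_id, state, reason = line.split("|", 2)
        snapshot[job_id] = (state, reason)
    return snapshot


def snapshot_deltas(previous, current):
    """Return one short message per job whose scheduler view changed."""
    deltas = []
    for job_id in sorted(set(previous) | set(current)):
        before, after = previous.get(job_id), current.get(job_id)
        if before == after:
            continue  # unchanged jobs produce no output
        if before is None:
            deltas.append(f"{job_id}: appeared as {after[0]}")
        elif after is None:
            deltas.append(f"{job_id}: left the queue (was {before[0]})")
        else:
            deltas.append(
                f"{job_id}: {before[0]} ({before[1]}) -> {after[0]} ({after[1]})"
            )
    return deltas


# Hypothetical snapshots from two consecutive polls.
previous = parse_snapshot("101|PENDING|Priority\n102|RUNNING|None")
current = parse_snapshot("101|RUNNING|None\n103|PENDING|Resources")
for message in snapshot_deltas(previous, current):
    print(message)
```

An empty delta list means the scheduler view did not change since the last poll. A job that leaves the queue is not proof of success; its final state still has to be confirmed from accounting data and its outputs.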
Prefer compact snapshots over repeated full table dumps. Every material update must include a standardized progress table with at least:
- `submitted`
- `running`
- `pending`
- `completed`
- `failed`
- `canceled`
- `intervened`

If a field is unknown, say `unknown` explicitly rather than omitting it.
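A standardized progress table with explicit `unknown` values could be rendered as in the following sketch. The function name `progress_table` and the choice of a markdown table are assumptions for illustration; only the seven field names come from the list above.

```python
FIELDS = [
    "submitted", "running", "pending", "completed",
    "failed", "canceled", "intervened",
]


def progress_table(counts):
    """Render a one-row markdown table; missing or None fields become 'unknown'."""
    header = "| " + " | ".join(FIELDS) + " |"
    divider = "|" + "|".join(" --- " for _ in FIELDS) + "|"
    values = [
        str(counts[field]) if counts.get(field) is not None else "unknown"
        for field in FIELDS
    ]
    row = "| " + " | ".join(values) + " |"
    return "\n".join([header, divider, row])


# Hypothetical counts; 'canceled' and 'intervened' were not measured.
table = progress_table(
    {"submitted": 8, "running": 3, "pending": 2, "completed": 3, "failed": 0}
)
print(table)
```

A measured zero is printed as `0`, while a field that was never collected is printed as `unknown`, so the two cases cannot be confused in a report.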
For changed or suspect jobs, inspect deeply:
- `Traceback`, `ERROR`, `Exception`, `OOM`, `timeout`, `NaN`, corruption patterns),

Do this while jobs are pending or running whenever the evidence is available; do not wait until after completion to start checking logs, outputs, or produced files.
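A scan for the failure markers named above might look like the sketch below. The regular expressions are illustrative guesses at how each marker appears in a log, not a definitive pattern set, and corruption checks are left out because they depend on the project's output format. The `scan_log` helper and the sample log are hypothetical.

```python
import re

# Illustrative patterns only; tune them to the project's real log format.
PATTERNS = {
    "traceback": re.compile(r"Traceback \(most recent call last\)"),
    "error": re.compile(r"\bERROR\b"),
    "exception": re.compile(r"\w*Exception\b"),
    "oom": re.compile(r"\bOOM\b|out of memory|oom-kill", re.IGNORECASE),
    "timeout": re.compile(r"\btimed?[ -]?out\b", re.IGNORECASE),
    "nan": re.compile(r"\bnan\b", re.IGNORECASE),
}


def scan_log(text):
    """Return {label: [line numbers]} for every pattern that matched."""
    hits = {}
    for lineno, line in enumerate(text.splitlines(), start=1):
        for label, pattern in PATTERNS.items():
            if pattern.search(line):
                hits.setdefault(label, []).append(lineno)
    return hits


# Hypothetical log excerpt for illustration.
sample_log = "\n".join([
    "step 100 loss=0.52",
    "step 200 loss=nan",
    "Traceback (most recent call last):",
    "RuntimeError: CUDA out of memory",
    "slurmstepd: error: Detected 1 oom-kill event(s)",
])
hits = scan_log(sample_log)
print(hits)
```

Returning line numbers rather than a yes/no flag makes it easy to cite evidence paths and positions in an update. A match is only a prompt for closer inspection, since words like `ERROR` also appear in harmless log lines.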
Classify health as:
- `healthy`, `degraded`, or `systemic`.

Do not intervene for:
Intervene only when evidence indicates invalid outputs or expensive reruns are likely without action.
When intervention is justified, execute this order:
Never cancel/cleanup outside scoped project/user ownership.
Start this section only after the monitored jobs reach terminal states. Until then, stay in the monitoring loop.
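The gate described above can be expressed as a check that every monitored job is in a terminal state. This is a sketch under stated assumptions: the `TERMINAL_STATES` set is one plausible selection of Slurm job states and should be adapted to the cluster (for example, `PREEMPTED` is left out here because a preempted job may be requeued), and the helper names are hypothetical.

```python
# Assumed terminal set; adjust for the cluster's requeue and preemption policy.
TERMINAL_STATES = {
    "COMPLETED", "FAILED", "CANCELLED", "TIMEOUT",
    "OUT_OF_MEMORY", "NODE_FAIL", "BOOT_FAIL", "DEADLINE",
}


def base_state(raw_state):
    """Reduce strings such as 'CANCELLED by 1000' to the bare state name."""
    parts = raw_state.split()
    return parts[0] if parts else ""


def all_terminal(states_by_job):
    """True only when the set is non-empty and every job is terminal."""
    return bool(states_by_job) and all(
        base_state(state) in TERMINAL_STATES for state in states_by_job.values()
    )


# Hypothetical accounting states for illustration.
print(all_terminal({"101": "COMPLETED", "102": "CANCELLED by 1000"}))  # True
print(all_terminal({"101": "COMPLETED", "102": "RUNNING"}))            # False
print(all_terminal({}))                                                # False
```

An empty monitored set returns `False` on purpose, so a failed scheduler query cannot be mistaken for completion. Passing this gate only permits closeout checks to begin; it says nothing about whether the outputs are valid.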
When monitored jobs finish:
- `COMPLETED` means success,

Keep the report short by default.
In most updates include only:
- `submitted/running/pending/completed/failed/canceled/intervened`).

Add timeline detail, microscope diagnostics, sync status, learnings, or the evidence runbook only when they changed, when they are needed to justify the decision, or when the user asks.
When this skill is triggered, compose other skills as needed:
- `investigate` when a suspicious job state, log pattern, or result artifact needs deeper root-cause work before intervening.
- `verify` after any scoped fix or resubmission to prove the fix changed the real failing behavior.
- `battletest` when a cluster-found issue points to a broader workflow risk that should be checked outside the single monitored run.
- `organise-docs` when monitoring or intervention establishes durable operating limits, failure modes, or recovery rules worth keeping.
- `git-commit` or checkpoint once a real fix is verified and the repo state is commit-eligible.

If there is a conflict, live monitoring correctness wins: companion skills should help explain, fix, verify, and preserve learning without breaking attachment to the scoped run.
- `no material change` and continue waiting.