skills/run-status-monitor/SKILL.md
Use when probing the status of an existing job — queued, stuck, running, or finished — across local, SLURM, RunAI, or SSH. Not for launching new jobs (use run-experiment). Not for debugging NaN/OOM/engineering failures (use experiment-debugger). Not for interpreting valid but surprising results (use result-diagnosis).
npx skillsauth add a-green-hand-jack/ml-research-skills run-status-monitor
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Answer lightweight operational questions about active runs while keeping raw logs out of the main conversation.
Use this skill for:
Do not paste long scheduler output or training logs into chat. Probe, compress, write a short status artifact, and report only the summary.
<installed-skill-dir>/
├── SKILL.md
├── scripts/
│   └── run_status_probe.py
├── references/
│   └── backends.md
└── templates/
    ├── runs.yaml
    └── status.md
- .agent/ run artifacts.
- sleep/poll/log-watch loops in the main agent transcript. A single bounded probe is acceptable; repeated checks must be done through a status artifact, a project wrapper, or a sidecar/background monitor that writes a short artifact.
- remote-cmd for simple commands and remote-bash for project scripts or any command containing loops, $variables, command substitution, pipes, globs, find, or awk.
- sidecar-task-runner or a project-local monitor artifact when status tracking needs more than one check, noisy log interpretation, multiple jobs, or delayed follow-up. The sidecar/monitor should own polling and log compression; the main agent should read only its final short artifact.
- invalid_grant, stop retrying API probes in this turn, mark API monitoring blocked, and switch to filesystem/project-wrapper fallback when available.
- result-diagnosis after creating the status artifact.
- ContainerCreating, environment startup, or unknown, and recommend the smallest compatible next action.
- underutilized rather than simply healthy, and recommend the next launch shape: fewer GPUs, scheduler array, per-GPU worker pool, or native multi-GPU launcher.
.agent/run-status/
├── runs.yaml # project-local run monitor config, private if it contains hosts/paths
└── raw/ # optional raw probe captures, ignored/private
docs/ops/runs/
└── <run-id>-status.md # short status artifact safe for main-agent reading
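The layout above separates private raw captures from the short artifact the main agent reads. A minimal Python sketch of that split follows, assuming the caller has already produced a compressed summary. The `store_probe` helper and the `<run-id>.log` raw file name are hypothetical and are not part of the skill's own scripts; only the two directories and the `<run-id>-status.md` name come from the layout.

```python
from pathlib import Path

# Directories taken from the layout above.
RAW_DIR = Path(".agent/run-status/raw")   # optional raw probe captures, private
STATUS_DIR = Path("docs/ops/runs")        # short artifacts, safe to read


def store_probe(project_root: Path, run_id: str, raw_output: str, summary: str) -> Path:
    """Write the raw capture and the short status artifact; return the artifact path.

    The raw file name `<run-id>.log` is an assumption made for this sketch.
    """
    raw_path = project_root / RAW_DIR / f"{run_id}.log"
    status_path = project_root / STATUS_DIR / f"{run_id}-status.md"

    # Create both directories on first use.
    raw_path.parent.mkdir(parents=True, exist_ok=True)
    status_path.parent.mkdir(parents=True, exist_ok=True)

    # Raw output stays out of the conversation; only the summary is surfaced.
    raw_path.write_text(raw_output, encoding="utf-8")
    status_path.write_text(summary.rstrip() + "\n", encoding="utf-8")
    return status_path
```

Returning only the artifact path keeps the caller from touching the raw capture by accident.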
- references/backends.md.
- --config
- .agent/run-status/runs.yaml
- docs/ops/runs.yaml
- infra/remote-projects.yaml plus project-specific notes
python3 <installed-skill-dir>/scripts/run_status_probe.py \
--config .agent/run-status/runs.yaml \
--run <run-id>
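The invocation above can be assembled programmatically. The sketch below builds the argument list and looks for a config file in the three locations listed earlier. Treating that list as a first-match search order is an assumption, and `find_config` and `probe_command` are hypothetical helper names; only the script path and the `--config` and `--run` flags come from the command shown.

```python
from pathlib import Path
from typing import List, Optional

# Config locations named in the text; the search order is an assumption.
CONFIG_CANDIDATES = [
    ".agent/run-status/runs.yaml",
    "docs/ops/runs.yaml",
    "infra/remote-projects.yaml",
]


def find_config(project_root: Path) -> Optional[Path]:
    """Return the first candidate config that exists, or None if none do."""
    for relative in CONFIG_CANDIDATES:
        candidate = project_root / relative
        if candidate.is_file():
            return candidate
    return None


def probe_command(skill_dir: Path, config: Path, run_id: str) -> List[str]:
    """Build the probe invocation as an argument list (no shell quoting needed)."""
    script = skill_dir / "scripts" / "run_status_probe.py"
    return ["python3", str(script), "--config", str(config), "--run", run_id]
```

The list can be passed to `subprocess.run(..., capture_output=True, text=True)` so the probe output is captured by the caller and not echoed into the transcript.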
- templates/runs.yaml with the minimum backend fields and ask only for missing run identity fields.
- status_artifact, not raw logs, before answering the user.
- docs/ops/current-status.md or project memory only when the status changes durable project state. Durable state includes: a run that completed with surprising or paper-facing metrics, a confirmed failure with an identified cause, a resource occupancy pattern that should inform the next launch policy, or a new run ID that becomes the canonical reference for a claim or experiment.
For ongoing monitoring:
- .agent/run-status/ or use a project wrapper that writes docs/ops/runs/<run-id>-status.md.
Every user-facing answer should fit this shape:
Run: <id>
State: running | pending | succeeded | failed | stale | unknown
Progress: <short>
Resources: <allocated vs active GPUs, utilization, memory, or unknown>
Latest metrics: <short>
Last update: <time or unknown>
ETA: <estimate or unknown>
Risk: <short>
Artifact: <status artifact path>
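One way to keep answers in this shape is to render them from a small record type. The following is a sketch, not the skill's own implementation: the `RunStatus` class and its field names are hypothetical, and defaulting every unreported field to `unknown` is an assumption that extends the template's explicit `unknown` options.

```python
from dataclasses import dataclass

# Allowed values for the State line, copied from the template above.
STATES = ("running", "pending", "succeeded", "failed", "stale", "unknown")


@dataclass
class RunStatus:
    run_id: str
    artifact: str
    state: str = "unknown"
    progress: str = "unknown"
    resources: str = "unknown"
    latest_metrics: str = "unknown"
    last_update: str = "unknown"
    eta: str = "unknown"
    risk: str = "unknown"

    def render(self) -> str:
        """Return the nine-line summary in the template's field order."""
        if self.state not in STATES:
            raise ValueError(f"unsupported state: {self.state!r}")
        fields = [
            ("Run", self.run_id),
            ("State", self.state),
            ("Progress", self.progress),
            ("Resources", self.resources),
            ("Latest metrics", self.latest_metrics),
            ("Last update", self.last_update),
            ("ETA", self.eta),
            ("Risk", self.risk),
            ("Artifact", self.artifact),
        ]
        return "\n".join(f"{label}: {value}" for label, value in fields)
```

For example, `RunStatus(run_id="run-42", artifact="docs/ops/runs/run-42-status.md", state="running").render()` produces the nine lines with every unreported field set to `unknown`. Rejecting states outside the listed set keeps free-form scheduler vocabulary out of the summary.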
Escalate when:
- ContainerCreating long enough to consume the smoke/debug budget