a-green-hand-jack/run-status-monitor

Name: run-status-monitor
Author: a-green-hand-jack

skills/run-status-monitor/SKILL.md

npx skillsauth add a-green-hand-jack/ml-research-skills run-status-monitor

Run Status Monitor

Answer lightweight operational questions about active runs while keeping raw logs out of the main conversation.

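Use this skill for:

"How far along is the experiment right now?"
"Are there any intermediate results?"
"How much longer until it finishes?"
"Is this job stuck, or is it still running?"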

Do not paste long scheduler output or training logs into chat. Probe, compress, write a short status artifact, and report only the summary.

Skill Directory Layout

<installed-skill-dir>/
├── SKILL.md
├── scripts/
│   └── run_status_probe.py
├── references/
│   └── backends.md
└── templates/
    ├── runs.yaml
    └── status.md

Core Contract

Raw logs stay in their original location or private .agent/ run artifacts.
The main agent reads only a short generated status artifact.
Do not run open-ended sleep/poll/log-watch loops in the main agent transcript. A single bounded probe is acceptable; repeated checks must be done through a status artifact, a project wrapper, or a sidecar/background monitor that writes a short artifact.
Prefer project/private wrapper commands for server-specific probes. For SSH-backed status checks, prefer remote-cmd for simple commands and remote-bash for project scripts or any command containing loops, $variables, command substitution, pipes, globs, find, or awk.
Use sidecar-task-runner or a project-local monitor artifact when status tracking needs more than one check, noisy log interpretation, multiple jobs, or delayed follow-up. The sidecar/monitor should own polling and log compression; the main agent should read only its final short artifact.
Use an authentication circuit breaker for scheduler/API probes. If a RunAI/Kubernetes/cluster API command reports OAuth/session refresh failure such as invalid_grant, stop retrying API probes in this turn, mark API monitoring blocked, and switch to filesystem/project-wrapper fallback when available.
If a run appears failed, stale, or scientifically surprising, route to result-diagnosis after creating the status artifact.
If a run is pending, distinguish scheduler/resource causes from code causes. Summarize whether the blocker appears to be pool/partition capacity, quota/fair-share, CPU/memory request, image pull, ContainerCreating, environment startup, or unknown, and recommend the smallest compatible next action.
If a run is spending time creating or syncing a new uv environment, report that as environment setup overhead. Check whether the job used an existing project/stage env or created a job-specific env, and flag avoidable env proliferation.
Report resource occupancy when available: allocated GPU count, active GPU count, per-GPU utilization/memory, node/process ownership, and whether the observed usage matches the workload shape.
If a job is running normally but leaves allocated GPUs idle, mark it underutilized rather than simply healthy, and recommend the next launch shape: fewer GPUs, scheduler array, per-GPU worker pool, or native multi-GPU launcher.
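
The authentication circuit breaker in the contract above could look roughly like the sketch below. It is an illustration under stated assumptions, not the logic of run_status_probe.py: only the invalid_grant marker comes from the text, and the other marker strings, the AuthCircuitBreaker class, and probe_with_fallback are hypothetical names.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional, Sequence

# Only "invalid_grant" is named in the skill text; the rest are illustrative guesses.
AUTH_FAILURE_MARKERS = ("invalid_grant", "token expired", "unauthorized")


@dataclass
class AuthCircuitBreaker:
    """Trips once an API probe reports an auth/session failure."""

    blocked: bool = False
    reason: Optional[str] = None

    def record(self, output: str) -> None:
        """Inspect probe output and trip the breaker on a known auth marker."""
        lowered = output.lower()
        for marker in AUTH_FAILURE_MARKERS:
            if marker in lowered:
                self.blocked = True
                self.reason = marker
                return


def probe_with_fallback(
    api_probes: Sequence[Callable[[], str]],
    fallback: Optional[Callable[[], str]],
    breaker: AuthCircuitBreaker,
) -> List[str]:
    """Run API probes until the breaker trips, then use the fallback once."""
    results: List[str] = []
    for probe in api_probes:
        if breaker.blocked:
            break  # stop retrying API probes for the rest of this turn
        output = probe()
        breaker.record(output)
        if not breaker.blocked:
            results.append(output)
    if breaker.blocked and fallback is not None:
        # Hypothetical filesystem or project-wrapper probe.
        results.append(fallback())
    return results
```

After the first failing probe, no further API probe is called, and breaker.reason can be copied into the status artifact to mark API monitoring as blocked.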

Expected Project Layout

.agent/run-status/
├── runs.yaml              # project-local run monitor config, private if it contains hosts/paths
└── raw/                   # optional raw probe captures, ignored/private

docs/ops/runs/
└── <run-id>-status.md     # short status artifact safe for main-agent reading

Workflow

1. Locate project root.
2. Read references/backends.md.
3. Locate run config in this order:
   - user-provided --config
   - .agent/run-status/runs.yaml
   - docs/ops/runs.yaml
   - infra/remote-projects.yaml plus project-specific notes
4. Run the probe script when a config exists:

python3 <installed-skill-dir>/scripts/run_status_probe.py \
  --config .agent/run-status/runs.yaml \
  --run <run-id>

5. If no config exists, create one from templates/runs.yaml with the minimum backend fields and ask only for missing run identity fields.
6. Read only the generated status_artifact, not raw logs, before answering the user.
7. Update docs/ops/current-status.md or project memory only when the status changes durable project state. Durable state includes: a run completed with surprising or paper-facing metrics, a confirmed failure with an identified cause, a resource occupancy pattern that should inform the next launch policy, or a new run ID that becomes the canonical reference for a claim or experiment.
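
The config lookup order in the workflow can be sketched as a small resolver. This is a minimal illustration, not the probe script's actual behavior: locate_run_config is a hypothetical helper, and falling through to the default locations when a user-provided path does not exist is an assumption.

```python
from pathlib import Path
from typing import Optional

# Default locations in the order listed by the workflow.
# infra/remote-projects.yaml comes last because it also needs project-specific notes.
DEFAULT_CANDIDATES = (
    ".agent/run-status/runs.yaml",
    "docs/ops/runs.yaml",
    "infra/remote-projects.yaml",
)


def locate_run_config(project_root: Path, user_config: Optional[str] = None) -> Optional[Path]:
    """Return the first existing config file, or None if none is found."""
    candidates = []
    if user_config is not None:
        candidates.append(user_config)  # a user-provided --config is checked first
    candidates.extend(DEFAULT_CANDIDATES)
    for relative in candidates:
        path = Path(relative)
        if not path.is_absolute():
            path = project_root / path
        if path.is_file():
            return path
    return None
```

A None result corresponds to the "no config exists" branch, where a new file is created from templates/runs.yaml.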

For ongoing monitoring:

Write a small monitor plan under .agent/run-status/ or use a project wrapper that writes docs/ops/runs/<run-id>-status.md.
If waiting is needed, keep it outside the main transcript as a bounded background/session task and have it overwrite the short status artifact.
Return to the main agent with only the status artifact path and its compressed fields.
Stop monitoring when the run reaches a terminal state, auth is blocked, or the next useful check is far enough away that a human reminder is cheaper than agent polling.
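
One way to keep ongoing monitoring bounded is a check budget, as in the sketch below. It is meant to run as a background or sidecar task, never in the main transcript. The monitor_run function, the probe callable, the auth_blocked key, and the choice of succeeded and failed as terminal states are assumptions made for illustration.

```python
import time
from pathlib import Path
from typing import Callable, Dict

# Assumption: only these two states end monitoring; "stale" is left to escalation.
TERMINAL_STATES = {"succeeded", "failed"}


def monitor_run(
    probe: Callable[[], Dict[str, object]],
    artifact: Path,
    max_checks: int = 5,
    interval_s: float = 300.0,
    sleep: Callable[[float], None] = time.sleep,
) -> str:
    """Poll at most max_checks times, overwriting the short artifact each time.

    Returns the reason monitoring stopped.
    """
    for attempt in range(max_checks):
        status = probe()
        lines = [f"{key}: {value}" for key, value in status.items()]
        artifact.parent.mkdir(parents=True, exist_ok=True)
        artifact.write_text("\n".join(lines) + "\n")  # overwrite, never append
        if status.get("State") in TERMINAL_STATES:
            return "terminal state"
        if status.get("auth_blocked"):
            return "auth blocked"
        if attempt < max_checks - 1:
            sleep(interval_s)
    return "check budget exhausted"
```

Because the artifact is overwritten on every check, the main agent always reads a single short file regardless of how many probes ran.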

Output Rules

Every user-facing answer should fit this shape:

Run: <id>
State: running | pending | succeeded | failed | stale | unknown
Progress: <short>
Resources: <allocated vs active GPUs, utilization, memory, or unknown>
Latest metrics: <short>
Last update: <time or unknown>
ETA: <estimate or unknown>
Risk: <short>
Artifact: <status artifact path>
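
The answer shape above maps naturally onto a small record type. The sketch below is one possible rendering helper: RunStatus and render_status are hypothetical names, and defaulting every unset field to "unknown" is an assumption.

```python
from dataclasses import dataclass

# The six states listed in the output shape.
ALLOWED_STATES = ("running", "pending", "succeeded", "failed", "stale", "unknown")


@dataclass
class RunStatus:
    run: str
    state: str = "unknown"
    progress: str = "unknown"
    resources: str = "unknown"
    latest_metrics: str = "unknown"
    last_update: str = "unknown"
    eta: str = "unknown"
    risk: str = "unknown"
    artifact: str = "unknown"


def render_status(status: RunStatus) -> str:
    """Render the nine-line summary in the order given by the output rules."""
    if status.state not in ALLOWED_STATES:
        raise ValueError(f"unsupported state: {status.state!r}")
    rows = [
        ("Run", status.run),
        ("State", status.state),
        ("Progress", status.progress),
        ("Resources", status.resources),
        ("Latest metrics", status.latest_metrics),
        ("Last update", status.last_update),
        ("ETA", status.eta),
        ("Risk", status.risk),
        ("Artifact", status.artifact),
    ]
    return "\n".join(f"{label}: {value}" for label, value in rows)
```

Rejecting states outside the listed six keeps free-form scheduler phrases out of the State line; those belong in Progress or Risk.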

Escalate when:

the run is failed/stale
the run is pending and a cheaper or lower-wait compatible resource could unblock a smoke/debug check
the run is stuck in image pull or ContainerCreating long enough to consume the smoke/debug budget
metrics are surprising
logs show repeated exceptions, OOM, NaN, or checkpoint failures
scheduler/API auth is blocked and needs a single explicit login refresh action
ETA cannot be estimated because progress markers are absent
allocated resources are idle or only partly used and the workload could be packed or sharded more effectively
the probe command needs network/server approval
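
The idle-resource escalation condition can be checked with a few lines once per-GPU utilization is available. The sketch below is illustrative only: the 5% idle threshold, the function names, and the rule that a mismatched sample count yields "unknown" are all assumptions, not values from the skill.

```python
from typing import Sequence


def count_active_gpus(utilization_pct: Sequence[float], idle_threshold_pct: float = 5.0) -> int:
    """Count GPUs whose utilization is above the (assumed) idle threshold."""
    return sum(1 for value in utilization_pct if value > idle_threshold_pct)


def resource_verdict(
    allocated: int,
    utilization_pct: Sequence[float],
    idle_threshold_pct: float = 5.0,
) -> str:
    """Classify occupancy as healthy, underutilized, or unknown."""
    # Without one sample per allocated GPU the comparison is not meaningful.
    if allocated <= 0 or len(utilization_pct) != allocated:
        return "unknown"
    active = count_active_gpus(utilization_pct, idle_threshold_pct)
    if active < allocated:
        return f"underutilized ({active}/{allocated} GPUs active)"
    return f"healthy ({active}/{allocated} GPUs active)"
```

The returned string can go straight into the Resources line; an underutilized verdict is a cue to recommend a different launch shape for the next run.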


Use when probing the status of an existing job — queued, stuck, running, or finished — across local, SLURM, RunAI, or SSH. Not for launching new jobs (use run-experiment). Not for debugging NaN/OOM/engineering failures (use experiment-debugger). Not for interpreting valid but surprising results (use result-diagnosis).

4 stars

development

Updated May 19, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add a-green-hand-jack/ml-research-skills run-status-monitor

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status  | Scanner         | Description                                        | Score |
|---------|-----------------|----------------------------------------------------|-------|
| Clean   | Trivy           | Container and dependency vulnerability scanner     | 95%   |
| Clean   | Semgrep         | Static code analysis for vulnerabilities           | 95%   |
| Clean   | mcp-scan (Snyk) | Model Context Protocol security validation         | 95%   |
| Skipped | Snyk (dep)      | Open source security scanning                      | 50%   |
| Skipped | Socket.dev      | Supply chain security analysis                     | 50%   |
| Skipped | VirusTotal      | Multi-engine malware detection                     | 50%   |
| Skipped | CrowdStrike     | Advanced threat intelligence                       | 50%   |
| Skipped | OSV-Scanner     | Open Source Vulnerability database check           | 50%   |
| Skipped | OWASP Dep-Check |                                                    | 50%   |

Last scanned: May 19, 2026, 5:21 AM (173.1 s, 6 files scanned)

SKILL.md

name: run-status-monitor
description: Use when probing the status of an existing job (queued, stuck, running, or finished) across local, SLURM, RunAI, or SSH. Not for launching new jobs (use run-experiment). Not for debugging NaN/OOM/engineering failures (use experiment-debugger). Not for interpreting valid but surprising results (use result-diagnosis).
argument-hint: [run-id] [--config .agent/run-status/runs.yaml] [--output docs/ops/runs/<run-id>-status.md]
allowed-tools: Read, Write, Edit, Bash, Glob

Related Skills

a-green-hand-jack/ml-research-bootstrap

testing

Verified · Trusted · Community

Bootstrap project-local ml-research-skills. Use from global installs when creating a new ML research project, enabling this collection in an existing ML research repo, or deciding whether to install the full bundle locally. Route to project-init for new projects; do not handle paper or experiment work directly.

4 stars · SKILL.md · Updated May 26, 2026

a-green-hand-jack/project-ops-router

development

Verified · Trusted · Community

Route project operations tasks (git, memory, bootstrap, remote, workspace, code review, timeline, ops) to the correct skill. Use when the task involves commits, pushes, worktrees, project memory, enabling project-local skills, SSH/server coordination, sidecar runners, or audits. Do not solve the ops task directly.

4 stars · SKILL.md · Updated May 19, 2026

a-green-hand-jack/paper-writing-router

testing

Verified · Trusted · Community

Route ML/AI paper writing tasks to the correct skill: contract planning, prose drafting, section writing, consistency editing, review simulation, rebuttal, submission, or citation work. Use when the task involves writing, revising, reviewing, or submitting a paper instead of guessing between paper-writing-assistant, paper-writing-contract-planner, paper-reviewer-simulator, auto-paper-improvement-loop, or citation skills. Do not draft prose directly.

4 stars · SKILL.md · Updated May 19, 2026

a-green-hand-jack/ml-research-router

data-ai

Verified · Trusted · Community

Project-local router for ML research skill selection. Use inside an initialized ML research project, or while maintaining this skill repo, when the user describes an ML research/paper/experiment/discovery/ops/release workflow and may not know the skill; route to a domain router or high-signal leaf. Do not use for generic non-ML projects.

4 stars · SKILL.md · Updated May 19, 2026

Install manually

# Clone the repo
git clone https://github.com/a-green-hand-jack/ml-research-skills.git

# Copy into Claude Code skills folder (global)
cp -r ml-research-skills/skills/run-status-monitor ~/.claude/skills/

Repository

a-green-hand-jack/ml-research-skills

4 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT