skills/skills-codex/monitor-experiment/SKILL.md
Monitor running experiments, check progress, collect results. Use when user says "check results", "is it done", "monitor", or wants experiment output.
npx skillsauth add wanshuiyin/Auto-claude-code-research-in-sleep monitor-experiment
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Monitor: $ARGUMENTS
First identify the backend from AGENTS.md, run notes, or launch summary: local, SSH, Vast.ai, or Modal. Monitor the backend that was actually used; do not assume a plain SSH screen session when the run was launched through Vast.ai or Modal.
ssh <server> "screen -ls"
For Vast.ai, also check instance state, SSH reachability, hourly cost, and whether auto_destroy is pending. For Modal, check the Modal run/app logs, function status, timeout, volume outputs, and cloud cost exposure.
For each screen session, capture the last N lines:
ssh <server> "screen -S <name> -X hardcopy /tmp/screen_<name>.txt && tail -50 /tmp/screen_<name>.txt"
If hardcopy fails, check for log files or tee output.
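The hardcopy-then-fallback flow above can be sketched in Python as follows. This is only an illustration under assumptions: the helper names (`screen_tail_command`, `log_tail_command`, `capture_tail`) and the `log_path` argument are hypothetical, and the log location depends on how the run was launched.

```python
import shlex
import subprocess

def screen_tail_command(server, session, lines=50):
    """Build the ssh argv that dumps a screen session and tails the dump."""
    dump = f"/tmp/screen_{session}.txt"
    remote = (
        f"screen -S {shlex.quote(session)} -X hardcopy {shlex.quote(dump)}"
        f" && tail -{int(lines)} {shlex.quote(dump)}"
    )
    return ["ssh", server, remote]

def log_tail_command(server, log_path, lines=50):
    """Fallback argv: tail a log file directly (log_path is an assumed location)."""
    return ["ssh", server, f"tail -{int(lines)} {shlex.quote(log_path)}"]

def capture_tail(server, session, log_path, run=subprocess.run):
    """Try the screen hardcopy first; if it fails or is empty, tail the log file."""
    first = run(screen_tail_command(server, session), capture_output=True, text=True)
    if first.returncode == 0 and first.stdout.strip():
        return first.stdout
    second = run(log_tail_command(server, log_path), capture_output=True, text=True)
    return second.stdout
```

The `run` parameter exists only so the ssh call can be swapped for a stub when trying the logic without a server.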
ssh <server> "ls -lt <results_dir>/*.json 2>/dev/null | head -20"
If JSON results exist, fetch and parse them:
ssh <server> "cat <results_dir>/<latest>.json"
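As a rough sketch of the parse step, the snippet below takes the raw text returned by the `cat` command and keeps the numeric top-level fields. The flat `{"metric": number}` layout and the helper name `extract_metrics` are assumptions for illustration; the real result schema depends on the experiment.

```python
import json

def extract_metrics(raw_text):
    """Return numeric top-level fields from a JSON results blob (assumed flat schema)."""
    try:
        data = json.loads(raw_text)
    except json.JSONDecodeError:
        # A partially written file is plausible while a run is still going.
        return {}
    if not isinstance(data, dict):
        return {}
    # bool is a subclass of int, so exclude it explicitly.
    return {
        key: float(value)
        for key, value in data.items()
        if isinstance(value, (int, float)) and not isinstance(value, bool)
    }

sample = '{"accuracy": 0.91, "loss": 0.27, "model": "baseline", "done": true}'
print(extract_metrics(sample))
```

Returning an empty dict on malformed input keeps a half-written results file from being mistaken for a crash; logs should still be checked separately.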
W&B metrics (when wandb: true in AGENTS.md)
If the project enables W&B, pull metrics before interpreting results. Prefer W&B as the source of training curves and recent eval state, while still checking logs for crashes.
List recent runs:
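python3 - <<'PY'
from itertools import islice

import wandb

api = wandb.Api()
# Newest first; islice caps the listing at 20 runs instead of paging through all of them.
runs = api.runs("<entity>/<project>", order="-created_at", per_page=20)
for run in islice(runs, 20):
    print(run.name, run.state, run.url)
PY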
Pull recent history for a specific run:
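python3 - <<'PY'
import wandb

api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
keys = ["train/loss", "eval/loss", "eval/accuracy", "train/lr"]
# pandas=False returns a list of dicts; iterating the default DataFrame would yield column names only.
for row in run.history(samples=50, keys=keys, pandas=False):
    print(row)
print("summary:", dict(run.summary))
PY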
If W&B is configured but unavailable, report the connectivity problem and fall back to screen/log/json evidence. Do not interpret missing W&B data as experiment failure by itself.
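One way to express this fallback rule is a small wrapper that tries W&B first and, on failure, records the connectivity problem instead of treating it as an experiment failure. The function name `collect_evidence` and the two callables it takes are hypothetical; this is a sketch of the policy, not part of the skill.

```python
def collect_evidence(fetch_wandb, fetch_local):
    """Try W&B first; on any error, note the problem and use local evidence instead."""
    try:
        return {"source": "wandb", "data": fetch_wandb(), "note": None}
    except Exception as exc:  # network, auth, or missing-package errors
        note = (
            f"W&B unavailable ({type(exc).__name__}: {exc}); "
            "using screen/log/JSON evidence"
        )
        return {"source": "local", "data": fetch_local(), "note": note}
```

The `note` field carries the connectivity report so it can be shown alongside the results rather than silently dropped.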
Always include W&B dashboard links (run.url) when available so later review and paper-writing agents can inspect the exact training curves.
Present results in a comparison table:
| Experiment | Metric | Delta vs Baseline | Status |
|-----------|--------|-------------------|--------|
| Baseline | X.XX | — | done |
| Method A | X.XX | +Y.Y | done |
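A minimal sketch of how such a table could be generated, assuming results are available as a mapping from experiment name to a `(metric, status)` pair. The function name `comparison_table`, the input shape, and the use of `n/a` for missing values are illustrative choices, not something the skill prescribes.

```python
def comparison_table(results, baseline="Baseline"):
    """Render a markdown table; results maps name -> (metric or None, status)."""
    base_metric = results[baseline][0]
    rows = [
        "| Experiment | Metric | Delta vs Baseline | Status |",
        "|-----------|--------|-------------------|--------|",
    ]
    for name, (metric, status) in results.items():
        if metric is None:
            # Run has not produced a metric yet.
            metric_text, delta_text = "n/a", "n/a"
        elif name == baseline or base_metric is None:
            metric_text, delta_text = f"{metric:.2f}", "n/a"
        else:
            metric_text = f"{metric:.2f}"
            delta_text = f"{metric - base_metric:+.2f}"
        rows.append(f"| {name} | {metric_text} | {delta_text} | {status} |")
    return "\n".join(rows)

print(comparison_table({
    "Baseline": (0.50, "done"),
    "Method A": (0.75, "done"),
    "Method B": (None, "running"),
}))
```

Runs that are still in progress appear with `n/a` rather than being dropped, so the table doubles as a status overview.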
After results are collected, check ~/.codex/feishu.json:
Send an experiment_done notification: results summary table, delta vs baseline.
If set to "off": skip entirely (no-op).
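The config check might look like the sketch below. The `"mode"` key name and the choice to treat a missing or unreadable file as `"off"` are assumptions; the source only states that `"off"` means the notification is skipped.

```python
import json
from pathlib import Path

def notification_mode(config_path=None):
    """Return the configured mode, or "off" if the config is missing or unreadable."""
    if config_path is None:
        config_path = Path.home() / ".codex" / "feishu.json"
    try:
        config = json.loads(Path(config_path).read_text())
    except (OSError, json.JSONDecodeError):
        return "off"
    if not isinstance(config, dict):
        return "off"
    # "mode" is an assumed key name, used here for illustration only.
    return str(config.get("mode", "off"))

def should_notify(config_path=None):
    """True unless the mode resolves to "off"."""
    return notification_mode(config_path) != "off"
```

Defaulting to `"off"` keeps the notification step a no-op on machines where Feishu was never configured.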