skills/training-check/SKILL.md
Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add wanshuiyin/Auto-claude-code-research-in-sleep training-check
Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.
⏱ This skill is correctly cron-wired (see below): it polls machine-checkable training health (NaN / divergence / idle GPU) — the additive external-wait shape in
shared-references/external-cadence.md. The occasional Codex call for an ambiguous metric is a one-shot check per tick, not a multi-round verdict loop, so it stays additive — it never grows into a wrapped verdict skill.
WandB run: entity/project/run_id
gpt-5.5: used via Codex MCP for ambiguous cases only
import wandb
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:
ssh server "tail -100 /path/to/training.log"
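The WandB-first, log-second order described above can be sketched as a small wrapper. This is only an illustration under assumptions: `read_training_signal`, `fetch_wandb_history`, and `tail_remote_log` are hypothetical helper names, and `server` and `/path/to/training.log` are the same placeholders used in the commands above.

```python
import subprocess


def read_training_signal(fetch_wandb, fetch_log):
    """Try WandB first; fall back to the raw log if the API call fails.

    Both arguments are zero-argument callables (hypothetical helpers).
    Returns a (source, data) tuple so the caller knows which path was used.
    """
    try:
        return "wandb", fetch_wandb()
    except Exception as exc:  # API error, network issue, auth failure, ...
        print(f"WandB unreachable ({exc!r}); falling back to the log file")
        return "log", fetch_log()


def fetch_wandb_history(run_path):
    """Fetch the run history, mirroring the snippet above."""
    import wandb  # imported lazily so the fallback works without wandb installed
    return wandb.Api().run(run_path).history()


def tail_remote_log(host="server", path="/path/to/training.log", lines=100):
    """Read the last lines of the training log over SSH (placeholder host and path)."""
    result = subprocess.run(
        ["ssh", host, f"tail -n {lines} {path}"],
        capture_output=True, text=True, check=True, timeout=60,
    )
    return result.stdout
```

A call might look like `read_training_signal(lambda: fetch_wandb_history("<entity>/<project>/<run_id>"), tail_remote_log)`. Returning the source alongside the data lets later steps decide how to parse it, since log text and a WandB history table need different handling.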
Check these signals:
| Signal | Judgment | Action |
|--------|----------|--------|
| NaN/Inf in loss | Clearly bad | Stop training, investigate |
| Loss diverging (increasing for >N steps) | Clearly bad | Stop training, investigate |
| Eval metrics significantly worse than baseline | Clearly bad | Stop training, investigate |
| Loss decreasing, metrics improving | Clearly fine | Continue, increase check interval |
| Loss flat but not diverging | Unsure | → Step 3 (Codex judgment) |
| Metrics noisy, can't tell trend | Unsure | → Step 3 (Codex judgment) |
| Slightly worse than baseline but still early | Unsure | → Step 3 (Codex judgment) |
Only escalate to Codex when the signal is ambiguous. For clearly good or clearly bad signals, act directly.
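As a rough illustration of the loss-related rows, the triage can be written as a pure function over the most recent loss values. This is a minimal sketch, not the skill's actual logic: `classify_losses` is a hypothetical name, `diverge_steps` stands in for the unspecified N and its default of 5 is an assumption, and the eval-versus-baseline rows are left out.

```python
import math


def classify_losses(losses, diverge_steps=5):
    """Return 'bad', 'fine', or 'unsure' for a list of recent loss values.

    diverge_steps plays the role of N; the default is an assumed value.
    """
    if not losses:
        return "unsure"
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "bad"  # NaN/Inf in loss
    if len(losses) <= diverge_steps:
        return "unsure"  # too few points to judge a trend
    recent = losses[-(diverge_steps + 1):]
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    if all(d > 0 for d in deltas):
        return "bad"  # loss rose on every one of the last diverge_steps steps
    if all(d < 0 for d in deltas):
        return "fine"  # loss fell on every one of the last diverge_steps steps
    return "unsure"  # flat or noisy: escalate for judgment
```

Only the `"unsure"` result would go on to the Codex call; `"bad"` and `"fine"` are acted on directly. Requiring strictly monotonic deltas is deliberately strict, so noisy curves land in `"unsure"` rather than being misjudged.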
mcp__codex__codex:
  config: {"model_reasoning_effort": "high"}
  prompt: |
    TRAINING HEALTH CHECK: need your judgment on ambiguous metrics.
    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]
    What I'm unsure about: [specific concern]
    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner
| Decision | Action |
|----------|--------|
| Stop | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| Continue | Do nothing. Will be invoked again at next interval (increase interval if consistently healthy). |
| Wait | Do nothing but keep the current short interval (don't increase). |
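Because the reply is free text, it helps to normalise it to one of the three decisions before acting. The sketch below assumes the reviewer puts STOP, CONTINUE, or WAIT at the start of a line, optionally after a bullet marker. `parse_verdict` is a hypothetical helper, and treating an unparseable reply as WAIT is an assumption chosen to be conservative.

```python
import re

# A verdict keyword at the start of a line, optionally preceded by "-" or "*".
_VERDICT_RE = re.compile(r"\s*[-*]?\s*(STOP|CONTINUE|WAIT)\b")


def parse_verdict(reply):
    """Return 'STOP', 'CONTINUE', or 'WAIT' from a free-text reviewer reply.

    Falls back to 'WAIT' (check again soon) when no verdict line is found.
    """
    for line in reply.splitlines():
        match = _VERDICT_RE.match(line)
        if match:
            return match.group(1)
    return "WAIT"
```

Matching only at the start of a line avoids misreading a sentence such as "no need to STOP yet" as a stop decision.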
Training-check and watchdog.py operate at different levels:
| Layer | Tool | What it checks | Frequency |
|-------|------|----------------|-----------|
| Process health | watchdog.py | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |
Use both together:
After training is confirmed stable:
CronCreate (recurring, every 10 minutes initially):
"Run /training-check for wandb run <entity>/<project>/<run_id>"
As the check interval increases, delete the old CronCreate job and create a new one with the longer interval.
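One possible way to pick the next interval is sketched below. The doubling rule and the `next_interval` name are assumptions for illustration; only the 10 to 60 minute range and the STOP / CONTINUE / WAIT semantics come from this document.

```python
def next_interval(current_min, verdict, ceiling_min=60):
    """Return the next check interval in minutes, or None when checks should end.

    Assumed policy: double the interval on CONTINUE, capped at ceiling_min.
    """
    if verdict == "CONTINUE":
        return min(current_min * 2, ceiling_min)
    if verdict == "WAIT":
        return current_min  # keep the current short interval
    if verdict == "STOP":
        return None  # training is being killed; no further checks
    raise ValueError(f"unknown verdict: {verdict!r}")
```

Starting from the initial 10 minutes, consecutive healthy checks would give 20, 40, then 60 minutes. Whenever the returned value differs from the current one, that is the point at which the old cron job would be deleted and a new one created.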