skills/42-wanshuiyin-ARIS/skills/skills-codex/training-check/SKILL.md
Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.
Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.
Run identifier format: entity/project/run_id.
REVIEWER_MODEL = gpt-5.4 - Used via a secondary Codex agent for ambiguous cases only.

import wandb
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
history = run.history()
If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:
ssh server "tail -100 /path/to/training.log"
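The fallback described above can be sketched as a small wrapper that tries WandB first and only then reads the log over SSH. This is a minimal sketch under stated assumptions: the helper names (`fetch_wandb_history`, `tail_log_over_ssh`, `read_metrics`) are hypothetical, the host and log path are placeholders, and any exception from the WandB call is treated as "unreachable".

```python
import shlex
import subprocess


def fetch_wandb_history(run_path):
    """Fetch the sampled metric history for a run given as entity/project/run_id."""
    import wandb  # imported lazily so a missing install also triggers the fallback

    return wandb.Api().run(run_path).history()


def tail_log_over_ssh(host, log_path, lines=100):
    """Return the last `lines` lines of a remote training log (hypothetical helper)."""
    remote_cmd = f"tail -n {int(lines)} {shlex.quote(log_path)}"
    result = subprocess.run(
        ["ssh", host, remote_cmd],
        capture_output=True, text=True, timeout=30, check=True,
    )
    return result.stdout


def read_metrics(fetch_wandb, fetch_log):
    """Try WandB first; on any error fall back to the raw log.

    Returns a (source, payload) pair so the caller knows which path was used.
    """
    try:
        return "wandb", fetch_wandb()
    except Exception as exc:  # assumption: any failure counts as "unreachable"
        print(f"WandB unavailable ({exc!r}); falling back to log file")
        return "log", fetch_log()
```

A call might look like `read_metrics(lambda: fetch_wandb_history("<entity>/<project>/<run_id>"), lambda: tail_log_over_ssh("server", "/path/to/training.log"))`. Passing the two fetchers as callables keeps the fallback logic testable without network access.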
Check these signals:
| Signal | Judgment | Action |
|--------|----------|--------|
| NaN/Inf in loss | Clearly bad | Stop training, investigate |
| Loss diverging (increasing for >N steps) | Clearly bad | Stop training, investigate |
| Eval metrics significantly worse than baseline | Clearly bad | Stop training, investigate |
| Loss decreasing, metrics improving | Clearly fine | Continue, increase check interval |
| Loss flat but not diverging | Unsure | -> Step 3 (secondary review) |
| Metrics noisy, can't tell trend | Unsure | -> Step 3 (secondary review) |
| Slightly worse than baseline but still early | Unsure | -> Step 3 (secondary review) |
Only escalate when the signal is ambiguous. For clearly good or clearly bad signals, act directly.
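The loss rows of the table above could be approximated with a small classifier like the one below. It is a rough sketch, not the skill's actual logic: the function name `classify_loss` and the `diverge_steps` threshold (standing in for the unspecified N) are assumptions, and it looks only at training loss, not at eval metrics or baselines.

```python
import math


def classify_loss(losses, diverge_steps=5):
    """Classify recent training losses as "bad", "fine", or "unsure".

    `diverge_steps` is a hypothetical stand-in for N in "increasing for >N steps".
    """
    # NaN or Inf anywhere in the window is a hard failure.
    if any(math.isnan(x) or math.isinf(x) for x in losses):
        return "bad"

    # Need diverge_steps + 2 points to observe more than diverge_steps changes.
    if len(losses) < diverge_steps + 2:
        return "unsure"

    recent = losses[-(diverge_steps + 2):]
    diffs = [b - a for a, b in zip(recent, recent[1:])]

    if all(d > 0 for d in diffs):
        return "bad"    # loss rose on every one of the last diverge_steps + 1 steps
    if all(d < 0 for d in diffs):
        return "fine"   # loss fell on every one of those steps
    return "unsure"     # flat or noisy: escalate to secondary review
```

The strict "every step" conditions are deliberately conservative: anything flat or noisy lands in "unsure", which matches the rule of escalating only ambiguous cases.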
spawn_agent:
  model: REVIEWER_MODEL
  reasoning_effort: high
  message: |
    TRAINING HEALTH CHECK - need your judgment on ambiguous metrics.

    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]

    What I'm unsure about: [specific concern]

    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner
If delegation is unavailable, make a local judgment using the same rubric and mark the decision [pending external review]. In ambiguous cases with no hard failure, prefer WAIT over STOP.
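One way to turn the reviewer's reply into a decision, including the no-delegation case, is sketched below. The function name `resolve_ambiguous` is hypothetical, and treating an unparseable reply the same as an unavailable reviewer (WAIT, marked pending) is an assumption rather than something the skill specifies.

```python
VERDICTS = ("STOP", "CONTINUE", "WAIT")
PENDING = "[pending external review]"


def resolve_ambiguous(reply):
    """Map a reviewer reply to (verdict, note).

    `reply` is the reviewer's text, or None when delegation is unavailable.
    With no usable verdict and no hard failure, default to WAIT.
    """
    if reply is None:
        return "WAIT", PENDING

    # Drop leading whitespace and bullet characters, then take the first token.
    tokens = reply.strip().lstrip("-* ").split(maxsplit=1)
    word = tokens[0].strip(":.-*").upper() if tokens else ""

    if word in VERDICTS:
        return word, ""
    # Assumption: an unparseable reply is handled like a missing reviewer.
    return "WAIT", PENDING
```

For example, `resolve_ambiguous("STOP: loss exploded")` yields a STOP verdict with an empty note, while `resolve_ambiguous(None)` yields WAIT with the pending-review marker.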
| Decision | Action |
|----------|--------|
| Stop | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| Continue | Do nothing. Re-run at the next interval (increase interval if consistently healthy). |
| Wait | Do nothing but keep the current short interval (do not increase). |
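The interval handling in the decision table could look roughly like this. It is an illustrative sketch: the name `next_interval`, the doubling rule on every CONTINUE, and the 10 and 60 minute bounds (taken from the 10-60 minute range this skill quotes for periodic checks) are assumptions, not a prescribed schedule.

```python
MIN_INTERVAL_MIN = 10   # assumed starting interval, in minutes
MAX_INTERVAL_MIN = 60   # assumed upper bound, in minutes


def next_interval(decision, current_min):
    """Return the next check interval in minutes, or None when training stops."""
    decision = decision.upper()
    if decision == "STOP":
        return None  # no further checks; the run is being killed
    if decision == "CONTINUE":
        # Assumption: double the interval on each healthy check, capped at the maximum.
        return min(current_min * 2, MAX_INTERVAL_MIN)
    if decision == "WAIT":
        return current_min  # keep the current short interval
    raise ValueError(f"unknown decision: {decision!r}")
```

Starting from `MIN_INTERVAL_MIN`, repeated CONTINUE decisions would step through 20 and 40 minutes before settling at 60.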
training-check and watchdog-style monitoring operate at different levels:
| Layer | Tool | What it checks | Frequency |
|-------|------|----------------|-----------|
| Process health | watchdog | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |
Use both together:
- training-check catches subtle quality issues (loss plateau, metric degradation)

After training is confirmed stable:
Create a recurring job (cron, task scheduler, tmux loop, etc.)
that runs `/training-check <entity>/<project>/<run_id>` every 10 minutes.
As the check interval increases, update the old recurring job to match the new interval.