brycewang-stanford/training-check

Name: training-check
Author: brycewang-stanford

skills/42-wanshuiyin-ARIS/skills/skills-codex/training-check/SKILL.md

npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research training-check

Training Check

Periodically read WandB metrics during training to catch problems early. Do not wait until training finishes to discover it was a waste of GPU time.

Context: $ARGUMENTS

Constants

WANDB_RUN - Read from project notes or pass as entity/project/run_id.
CHECK_INTERVAL - Starts at 10 minutes, then gradually increases if consistently healthy: 10 min -> 20 min -> 30 min -> 60 min (cap).
REVIEWER_MODEL = gpt-5.4 - Used via a secondary Codex agent for ambiguous cases only.

When to Use

After training is confirmed running (session alive, loss decreasing for the first few steps)
When the user wants recurring health checks during training
This skill checks training QUALITY, not process HEALTH. Process health (session alive, GPU utilization) belongs to watchdog-style monitoring.

Workflow

Step 1: Read WandB Metrics

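import wandb

# Uses the WandB public API; assumes credentials are already configured (e.g. WANDB_API_KEY).
api = wandb.Api()
run = api.run("<entity>/<project>/<run_id>")
# history() returns a sampled pandas DataFrame by default; use run.scan_history() to iterate over every logged row.
history = run.history()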

If WandB is unreachable (API error, network issue), fall back to reading the log file directly via SSH:

ssh server "tail -100 /path/to/training.log"
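
When falling back to the log file, the loss values have to be recovered from plain text. The following is a minimal sketch, assuming the log prints entries such as `loss=0.52` or `loss: 0.48`; the real log format is not specified here, and `parse_losses` is a hypothetical helper rather than part of the skill.

```python
import re

# Assumed log format: "loss=<number>" or "loss: <number>". Adjust to the real log.
# The leading \b keeps keys such as "eval_loss" from matching.
_LOSS_RE = re.compile(
    r"\bloss\s*[=:]\s*([-+]?(?:nan|inf|\d+(?:\.\d*)?(?:[eE][-+]?\d+)?))",
    re.IGNORECASE,
)


def parse_losses(log_text):
    """Return loss values in the order they appear; NaN/Inf are kept as floats."""
    return [float(m.group(1)) for m in _LOSS_RE.finditer(log_text)]


sample = "step 10 loss=0.52 lr=1e-4\nstep 20 loss: 0.48\nstep 30 loss=nan\n"
losses = parse_losses(sample)
```

Keeping NaN and Inf in the returned list, rather than dropping them, lets the same downstream checks run on log-derived values as on WandB history.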

Check these signals:

Loss trend - Is training loss decreasing over the last N steps?
Eval metrics - Are evaluation metrics improving (or at least not degrading)?
NaN / Inf - Any NaN or Inf values in loss or gradients?
Spikes - Sudden large jumps in loss (>10x normal variance)?
Learning rate - Is the schedule behaving as expected?
Gradient norm - Exploding or vanishing?
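
Three of the signals above (NaN / Inf, loss trend, spikes) can be sketched over a plain list of loss values. This is one possible reading, not the skill's own implementation: `check_signals`, `window`, `spike_factor`, and `tolerance` are hypothetical names and defaults, and "10x normal variance" is interpreted here as an upward jump larger than 10 times the median step-to-step change.

```python
import math
from statistics import mean, median


def check_signals(losses, window=20, spike_factor=10.0, tolerance=0.01):
    """Summarize NaN/Inf, trend, and spikes for a list of training-loss values.

    window, spike_factor, and tolerance are illustrative defaults.
    """
    nan_or_inf = any(not math.isfinite(x) for x in losses)
    # Only the most recent finite values are used for trend and spike checks.
    recent = [x for x in losses if math.isfinite(x)][-window:]
    if len(recent) < 4:
        return {"nan_or_inf": nan_or_inf, "trend": "unknown", "spikes": []}

    # Trend: compare the mean of the older half of the window with the newer half.
    half = len(recent) // 2
    older, newer = mean(recent[:half]), mean(recent[half:])
    margin = tolerance * abs(older)
    if newer < older - margin:
        trend = "decreasing"
    elif newer > older + margin:
        trend = "increasing"
    else:
        trend = "flat"

    # Spikes: upward jumps much larger than the typical step-to-step change.
    deltas = [b - a for a, b in zip(recent, recent[1:])]
    typical = median([abs(d) for d in deltas])
    spikes = [
        i + 1
        for i, d in enumerate(deltas)
        if typical > 0 and d > spike_factor * typical
    ]
    return {"nan_or_inf": nan_or_inf, "trend": trend, "spikes": spikes}


healthy = check_signals([1.0, 0.9, 0.8, 0.7, 0.6, 0.5])
spiky = check_signals([1.0, 0.9, 0.8, 5.0, 0.7, 0.6])
```

Spike positions are indices into the recent window of finite values. Learning rate and gradient norm would need their own logged series and are left out of this sketch.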

Step 2: Judgment

| Signal | Judgment | Action |
|--------|----------|--------|
| NaN/Inf in loss | Clearly bad | Stop training, investigate |
| Loss diverging (increasing for >N steps) | Clearly bad | Stop training, investigate |
| Eval metrics significantly worse than baseline | Clearly bad | Stop training, investigate |
| Loss decreasing, metrics improving | Clearly fine | Continue, increase check interval |
| Loss flat but not diverging | Unsure | -> Step 3 (secondary review) |
| Metrics noisy, can't tell trend | Unsure | -> Step 3 (secondary review) |
| Slightly worse than baseline but still early | Unsure | -> Step 3 (secondary review) |
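
The judgment rules can be read as a small decision function. The sketch below is an illustration under stated assumptions: `judge` and its string labels are hypothetical, the caller decides what counts as "increasing for >N steps" or "much worse than baseline", and anything the rules do not cover defaults to "unsure".

```python
def judge(nan_or_inf, loss_trend, eval_trend, vs_baseline):
    """Map observed signals to 'bad', 'fine', or 'unsure'.

    loss_trend:  'decreasing' | 'flat' | 'increasing' | 'noisy'
    eval_trend:  'improving' | 'flat' | 'degrading' | 'noisy'
    vs_baseline: 'ok' | 'slightly_worse' | 'much_worse'
    (labels are illustrative, not defined by the skill)
    """
    # Hard failures: act directly, no secondary review.
    if nan_or_inf or loss_trend == "increasing" or vs_baseline == "much_worse":
        return "bad"
    # Clearly healthy: loss going down, evals going up, no baseline concern.
    if (
        loss_trend == "decreasing"
        and eval_trend == "improving"
        and vs_baseline == "ok"
    ):
        return "fine"
    # Flat loss, noisy metrics, slightly worse than baseline, or anything else.
    return "unsure"


verdict = judge(False, "flat", "improving", "ok")
```

Defaulting to "unsure" sends uncovered combinations to Step 3 instead of stopping or continuing on a guess.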

Step 3: Secondary Codex Judgment (only when unsure)

Only escalate when the signal is ambiguous. For clearly good or clearly bad signals, act directly.

spawn_agent:
  model: REVIEWER_MODEL
  reasoning_effort: high
  message: |
    TRAINING HEALTH CHECK - need your judgment on ambiguous metrics.

    Run: <entity>/<project>/<run_id>
    Current epoch/step: X / Y total
    Training loss (last 10 checkpoints): [values]
    Eval metrics (last 3 evals): [values]
    Baseline reference: [numbers from paper/reproduction]

    What I'm unsure about: [specific concern]

    Please respond with exactly one of:
    - STOP: clearly problematic, should kill training
    - CONTINUE: looks fine, check again next interval
    - WAIT: not enough data to judge, check again sooner

If delegation is unavailable, make a local judgment using the same rubric and mark the decision [pending external review]. In ambiguous cases with no hard failure, prefer WAIT over STOP.
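
Handling the reviewer's reply can be sketched as follows. `resolve_verdict` is a hypothetical helper; treating a reply that does not contain exactly one of the three verdicts the same way as unavailable delegation is an assumption, not something the skill specifies.

```python
import re

PENDING = "[pending external review]"


def resolve_verdict(reply, local_judgment="WAIT"):
    """Return (verdict, tag) from a reviewer reply.

    reply is the reviewer's text, or None when delegation is unavailable.
    local_judgment is the fallback decision; WAIT is the default because
    ambiguous cases with no hard failure should prefer WAIT over STOP.
    """
    if reply is not None:
        # Case-sensitive match so ordinary words like "wait" are ignored.
        found = set(re.findall(r"\b(STOP|CONTINUE|WAIT)\b", reply))
        if len(found) == 1:
            return found.pop(), ""
    # No usable external verdict: fall back and mark the decision as pending.
    return local_judgment, PENDING


decision, tag = resolve_verdict("CONTINUE: looks fine, check again next interval")
```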

Step 4: Act

| Decision | Action |
|----------|--------|
| Stop | Kill the training session. Save the WandB run URL, key metrics, and reason for stopping. Log to project notes for debugging. |
| Continue | Do nothing. Re-run at the next interval (increase interval if consistently healthy). |
| Wait | Do nothing but keep the current short interval (do not increase). |
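
For the Stop case, the evidence to save can be gathered into one record before it is written to project notes. This is a sketch only: the field names, the note format, and the helpers `build_stop_record` and `format_note` are hypothetical, and where the notes live is left to the project.

```python
from datetime import datetime, timezone


def build_stop_record(run_url, metrics, reason, now=None):
    """Collect the evidence to save when a run is stopped (illustrative fields)."""
    now = now or datetime.now(timezone.utc)
    return {
        "time": now.isoformat(timespec="seconds"),
        "run_url": run_url,
        "reason": reason,
        "metrics": dict(metrics),
    }


def format_note(record):
    """Render the record as one line suitable for appending to project notes."""
    metrics = ", ".join(f"{k}={v}" for k, v in sorted(record["metrics"].items()))
    return (
        f"[{record['time']}] STOPPED {record['run_url']} "
        f"| {record['reason']} | {metrics}"
    )


record = build_stop_record(
    "https://wandb.ai/<entity>/<project>/runs/<run_id>",
    {"loss": float("nan"), "step": 1200},
    "NaN in loss",
    now=datetime(2026, 1, 1, tzinfo=timezone.utc),
)
note = format_note(record)
```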

Integration with Watchdog

training-check and watchdog-style monitoring operate at different levels:

| Layer | Tool | What it checks | Frequency |
|-------|------|----------------|-----------|
| Process health | watchdog | Session alive? GPU active? | Every 60s (continuous) |
| Training quality | training-check | Loss trend? Metrics improving? | Every 10-60 min (periodic) |

Use both together:

Watchdog catches crashes and idle GPUs immediately
training-check catches subtle quality issues (loss plateau, metric degradation)

Rules

Do not stop training on the first sign of noise - some loss spikes are normal. Look at trends over multiple checkpoints.
When stopping training, always save the WandB run URL and key metrics as evidence.
If both WandB and log files are unreachable, report the connectivity issue and try again next interval. Do not assume training is broken.
Gradually increase check interval when healthy (10 -> 20 -> 30 -> 60 min). Reset to 10 min after any anomaly.
This skill is meant to be automated via a recurring scheduler. If the user wants ongoing monitoring, set up the best local mechanism available instead of waiting for manual reruns.
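
The interval rule (10 -> 20 -> 30 -> 60 min, reset to 10 after any anomaly, no increase on Wait) can be sketched as a small function. `next_interval` and the outcome labels are hypothetical. The skill does not say how many healthy checks count as "consistently healthy", so this sketch steps up after every healthy check, which is an assumption.

```python
SCHEDULE_MIN = (10, 20, 30, 60)  # minutes; the last value is the cap


def next_interval(current, outcome):
    """Return the next check interval in minutes.

    outcome: 'healthy' (step up), 'wait' (keep current), 'anomaly' (reset).
    """
    if outcome == "anomaly":
        return SCHEDULE_MIN[0]
    if outcome == "wait":
        return current
    if outcome == "healthy":
        # Move to the first scheduled value above the current one, if any.
        for minutes in SCHEDULE_MIN:
            if minutes > current:
                return minutes
        return SCHEDULE_MIN[-1]
    raise ValueError(f"unknown outcome: {outcome!r}")


# Four healthy checks in a row, starting from the initial interval.
interval = 10
progression = []
for _ in range(4):
    interval = next_interval(interval, "healthy")
    progression.append(interval)
```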

Recurring Setup Example

After training is confirmed stable:
  Create a recurring job (cron, task scheduler, tmux loop, etc.)
  that runs `/training-check <entity>/<project>/<run_id>` every 10 minutes.

As the check interval increases, update the old recurring job to match the new interval.
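
If cron is the chosen mechanism, the schedule field for each interval can be derived as below. This sketch covers only the five-field cron schedule expression. How the scheduled job actually invokes `/training-check` depends on the agent environment and is not shown, and `cron_schedule` is a hypothetical helper.

```python
def cron_schedule(interval_min):
    """Return a five-field cron schedule for a check every interval_min minutes.

    Only intervals that divide an hour evenly are supported, which covers
    the 10, 20, 30, and 60 minute steps used by this skill.
    """
    if interval_min == 60:
        return "0 * * * *"  # at minute 0 of every hour
    if 0 < interval_min < 60 and 60 % interval_min == 0:
        return f"*/{interval_min} * * * *"
    raise ValueError(f"unsupported interval: {interval_min}")


schedules = [cron_schedule(m) for m in (10, 20, 30, 60)]
```

When the interval changes, the existing crontab entry would be replaced with the new schedule rather than adding a second job.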

Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.

Stars: 1,232
Category: testing
Updated: May 26, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research training-check

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

Trivy - Container and dependency vulnerability scanner: Clean (95%)
Semgrep - Static code analysis for vulnerabilities: Clean (95%)
mcp-scan (Snyk) - Model Context Protocol security validation: Clean (95%)
Snyk (dep) - Open source security scanning: Skipped (50%)
Socket.dev - Supply chain security analysis: Skipped (50%)
VirusTotal - Multi-engine malware detection: Skipped (50%)
CrowdStrike - Advanced threat intelligence: Skipped (50%)
OSV-Scanner - Open Source Vulnerability database check: Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Mar 28, 2026, 7:25 AM (36.5s, 1 file scanned)

SKILL.md

name: training-check
description: Periodically check WandB metrics during training to catch problems early (NaN, loss divergence, idle GPUs). Avoids wasting GPU hours on broken runs. Use when training is running and you want automated health checks.
allowed-tools: Bash(*), Read, Grep, Glob, Write, Edit, Agent

Install manually

# Clone the repo
git clone https://github.com/brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research.git

# Copy into Claude Code skills folder (global)
cp -r Awesome-Agent-Skills-for-Empirical-Research/skills/42-wanshuiyin-ARIS/skills/skills-codex/training-check ~/.claude/skills/

Repository: brycewang-stanford/Awesome-Agent-Skills-for-Empirical-Research (1,232 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT