skills/experiment-debugger/SKILL.md
Use when training has engineering failures — NaN/gradient issues, GPU OOM, slow data loading, wrong metrics, reproducibility failures. Not for checking job queue/status (use run-status-monitor). Not for valid-but-surprising scientific results (use result-diagnosis). Not for confound or claim audit before writing (use research-results-auditor).
Fix engineering failures that prevent valid experiment results from being produced. This skill separates bugs from science: if the code is broken, wrong, or non-reproducible, fix the code first before interpreting results.
Use this skill when:
Do not use this skill for interpreting surprising but valid results — use result-diagnosis. Do not use this skill for choosing baselines or experiment designs — use baseline-selection-audit and experiment-design-planner.
Pair this skill with:
- result-diagnosis after engineering issues are fixed and results are valid but still surprising
- run-experiment to resubmit a corrected job after the bug is resolved
- run-status-monitor to probe log artifacts and confirm whether the fix took effect
- data-pipeline-manager when the bug traces to a data loading, preprocessing, or split issue
- research-project-memory to record debugging decisions that affect experimental validity claims

<installed-skill-dir>/
├── SKILL.md
└── references/
    └── failure-taxonomy.md
- references/failure-taxonomy.md.
- memory/risk-board.md when the bug may recur or affect other runs.

Read:
- logs/, wandb/, runs/, output/, or docs/ops/runs/<run-id>-status.md
- configs/, experiments/, jobs/<run-id>.yaml, or equivalent

Record:
Read references/failure-taxonomy.md.
Choose one primary mode:
- nan-gradient: NaN or inf in loss or gradients
- oom: GPU out-of-memory
- slow-training: training throughput is below expected
- slow-data: data loading is the bottleneck
- metric-error: metric value is impossible, negative, > 1 for accuracy, wrong shape, or suspiciously round
- repro-failure: different results despite fixed seed
- crash: process exits with non-zero code, SIGKILL, or timeout without producing output
- silent-error: the run completes but results are clearly wrong or absent

If multiple modes apply, list them in priority order.
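As a rough illustration of listing modes in priority order, the sketch below scans a log excerpt for tell-tale strings and returns the matching modes. The function name, the patterns, and the priority order are all assumptions made for this example and are not part of the skill; references/failure-taxonomy.md remains the authority. Only three modes are covered, because modes such as silent-error or repro-failure cannot be recognized from a single log line.

```python
import re

# Hypothetical patterns for illustration; tune them to your own logs.
# List order encodes an assumed priority: earlier entries are reported first.
MODE_PATTERNS = [
    ("crash", re.compile(r"SIGKILL|Segmentation fault|exit(ed with)? code [1-9]", re.I)),
    ("oom", re.compile(r"CUDA out of memory|OutOfMemoryError", re.I)),
    ("nan-gradient", re.compile(r"\b(nan|inf)\b", re.I)),
]


def candidate_modes(log_text: str) -> list[str]:
    """Return every mode whose pattern appears in the log, in priority order."""
    return [mode for mode, pattern in MODE_PATTERNS if pattern.search(log_text)]


log = "step 10 loss=nan\nRuntimeError: CUDA out of memory"
print(candidate_modes(log))  # ['oom', 'nan-gradient']
```

A pattern match only suggests a mode; confirm it against the logs and configs before treating it as the primary one.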
When nan-gradient is the mode:
- Run under torch.autocast with dtype=torch.float32 to isolate mixed-precision issues.
- Gradient clipping via torch.nn.utils.clip_grad_norm_ with max_norm=1.0 can stabilize and surface the layer.
- Enable torch.autograd.set_detect_anomaly(True) to find the exact operation.

Record which layer/operation produces NaN first.
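One way to record which layer produces a non-finite value first is to attach forward hooks and check each output. The sketch below assumes a standard PyTorch nn.Module; the helper name first_nonfinite_module and the toy NegSqrt layer are hypothetical and exist only for this example. It inspects the forward pass only, so it complements torch.autograd.set_detect_anomaly(True), which targets failures in the backward pass.

```python
import torch
from torch import nn


def first_nonfinite_module(model: nn.Module, batch: torch.Tensor):
    """Run one forward pass and return the name of the first submodule whose
    output contains NaN or inf, or None if every tensor output is finite."""
    offenders = []
    handles = []

    def make_hook(name):
        def hook(module, inputs, output):
            # Only plain tensor outputs are checked in this sketch.
            if isinstance(output, torch.Tensor) and not torch.isfinite(output).all():
                offenders.append(name)
        return hook

    for name, module in model.named_modules():
        if name:  # skip the root module, whose name is ""
            handles.append(module.register_forward_hook(make_hook(name)))
    try:
        with torch.no_grad():
            model(batch)
    finally:
        for handle in handles:
            handle.remove()  # always detach the hooks again
    return offenders[0] if offenders else None


class NegSqrt(nn.Module):
    """Toy layer for the demo: the square root of a negative number is NaN."""

    def forward(self, x):
        return torch.sqrt(x - x.abs() - 1.0)


model = nn.Sequential(nn.Linear(4, 4), nn.ReLU(), NegSqrt())
print(first_nonfinite_module(model, torch.randn(2, 4)))  # 2 (index of NegSqrt)
```

Hooks fire in execution order, so the first recorded name is the earliest submodule to emit a bad value; container modules report only after their children.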
When oom is the mode:
- Print torch.cuda.memory_summary() at the OOM step.
- Check the --mem-per-gpu or RunAI resource spec.

When slow-training or slow-data is the mode:
- Check nvidia-smi or run-status-monitor for GPU occupancy. If it is below 80%, training is likely data-starved or CPU-bound.
- num_workers=0 or a low num_workers is a common bottleneck.
- Run torch.profiler for a few steps to find the dominant overhead.

When metric-error is the mode:
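The metric-error symptoms named in the mode list (impossible, negative, above 1 for accuracy, suspiciously round) can be turned into a cheap guard. The sketch below is one hypothetical way to do that for a scalar accuracy. The function name and the choice to treat exactly 0.0, 0.5, and 1.0 as suspiciously round are assumptions for illustration, not rules from this skill.

```python
import math


def accuracy_problems(value: float) -> list[str]:
    """Return human-readable reasons why an accuracy value looks wrong."""
    if math.isnan(value) or math.isinf(value):
        return ["non-finite"]
    problems = []
    if value < 0.0:
        problems.append("negative")
    if value > 1.0:
        problems.append("greater than 1")
    # Assumed heuristic: exact 0, 0.5, or 1 often points to a labeling,
    # masking, or averaging bug rather than a real result.
    if value in (0.0, 0.5, 1.0):
        problems.append("suspiciously round")
    return problems


print(accuracy_problems(0.873))  # []
print(accuracy_problems(1.2))    # ['greater than 1']
print(accuracy_problems(1.0))    # ['suspiciously round']
```

An empty list does not prove the metric is right; it only rules out the obvious failure signatures.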
When repro-failure is the mode:
- Set torch.manual_seed, numpy.random.seed, random.seed, and torch.cuda.manual_seed_all.
- Set torch.backends.cudnn.deterministic = True and torch.backends.cudnn.benchmark = False.
- The DataLoader worker_init_fn is a common source of order non-determinism.

If determinism cannot be achieved, document the expected variance and report mean ± std across seeds instead.
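The seeding steps above can be collected in one place. This is a minimal sketch following the pattern in PyTorch's reproducibility notes; the names seed_everything, seed_worker, and shuffled_order are hypothetical, and the eight-item range dataset is a stand-in for a real one.

```python
import random

import numpy as np
import torch
from torch.utils.data import DataLoader


def seed_everything(seed: int) -> None:
    """Seed Python, NumPy, and PyTorch, and request deterministic cuDNN."""
    random.seed(seed)
    np.random.seed(seed)
    torch.manual_seed(seed)
    torch.cuda.manual_seed_all(seed)  # silently ignored when CUDA is unavailable
    torch.backends.cudnn.deterministic = True
    torch.backends.cudnn.benchmark = False


def seed_worker(worker_id: int) -> None:
    """worker_init_fn: derive each worker's NumPy/random seed from torch's seed."""
    worker_seed = torch.initial_seed() % 2**32
    np.random.seed(worker_seed)
    random.seed(worker_seed)


def shuffled_order(seed: int) -> list[int]:
    """Return the sample order a seeded, shuffling DataLoader produces."""
    seed_everything(seed)
    generator = torch.Generator()
    generator.manual_seed(seed)  # controls the shuffle order
    loader = DataLoader(range(8), batch_size=4, shuffle=True,
                        worker_init_fn=seed_worker, generator=generator)
    return [int(i) for batch in loader for i in batch]


print(shuffled_order(0) == shuffled_order(0))  # True
```

Passing both worker_init_fn and an explicitly seeded generator matters once num_workers is above zero, since each worker process otherwise starts from its own random state.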
After identifying the root cause:
- Document the root cause and the fix in docs/ops/runs/<run-id>-debug.md.
- Update memory/risk-board.md if the bug reveals a recurring risk to other runs.
- Use run-experiment to resubmit the corrected job.

Before declaring the bug fixed: