skills/experiment-queue/SKILL.md
SSH job queue for multi-seed/multi-config ML experiments with OOM-aware retry, stale-screen cleanup, and wave-transition race prevention. Use when user says "batch experiments", "队列实验", "run grid", "multi-seed sweep", "auto-chain experiments", or when /run-experiment is insufficient for 10+ jobs that need orchestration.
Orchestrate large batches of ML experiments on SSH remote GPU servers with proper state tracking, OOM retry, stale cleanup, and wave transitions.
Use when /run-experiment is insufficient:
Do NOT use for:
- … /run-experiment)

Based on session audit (2026-04-16), the major wall-clock sinks in multi-seed grid experiments are:
All of these are pure engineering friction that can be orchestrated.
A manifest lists jobs with explicit state:
project: dllm_distill
cwd: /home/rfyang/rfyang_code/dllm_experiments_torch
conda: dllm
# Optional: override conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full `eval "$(... shell.bash hook)"` string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: SJTUServer5
default_cmd: >
  python run_pc_distill_exp.py --backbone softmax --lam 0.5
  --K 500 --L 96 --W 16 --n_steps 30000 --batch_size 128 --lr 1e-4
preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/pcc_softmax_L96_K500_N{N}_wikitext103.pt
gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500  # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3
jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
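
The `gpu_free_threshold_mib` setting can be read as a filter over per-GPU memory usage. The following is a minimal sketch, assuming usage is obtained from `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits`. The function name `free_gpus` and the sample output are hypothetical and are not taken from `queue_manager.py`.

```python
from __future__ import annotations


def free_gpus(nvidia_smi_csv: str, allowed: list[int], threshold_mib: int = 500) -> list[int]:
    """Return the allowed GPU indices whose memory.used is below threshold_mib.

    Expects one "index, memory.used" pair per line, as printed by:
      nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits
    """
    free = []
    for line in nvidia_smi_csv.strip().splitlines():
        index_str, used_str = (field.strip() for field in line.split(","))
        index, used_mib = int(index_str), int(used_str)
        if index in allowed and used_mib < threshold_mib:
            free.append(index)
    return free


# Hypothetical nvidia-smi output for a 4-GPU host (values in MiB).
sample = "0, 12\n1, 40213\n2, 3\n3, 499\n"
print(free_gpus(sample, allowed=[0, 1, 2, 3]))  # [0, 2, 3]
```

Raising `threshold_mib` makes the check more tolerant of background usage on shared servers; lowering it makes the check stricter.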
pending → running → completed
                  ↘ failed_oom → pending (after delay) [retry up to N]
                  ↘ failed_other → stuck (needs manual inspection)

stale_screen_detected → cleaned → pending
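
The diagram above can be encoded as a transition table so that illegal status changes fail loudly. This is a sketch only: the `ALLOWED` table and `transition` helper are hypothetical names, and the `running → stale_screen_detected` edge is an assumption, since the diagram does not say which state a stale screen is detected from.

```python
# Allowed status transitions, transcribed from the state diagram.
ALLOWED = {
    "pending": {"running"},
    "running": {"completed", "failed_oom", "failed_other",
                "stale_screen_detected"},  # last edge is an assumption
    "failed_oom": {"pending", "stuck"},    # stuck once retries are exhausted
    "failed_other": {"stuck"},
    "stale_screen_detected": {"cleaned"},
    "cleaned": {"pending"},
    "completed": set(),
    "stuck": set(),
}


def transition(job: dict, new_status: str) -> dict:
    """Return a copy of job with new_status, or raise if the move is illegal."""
    current = job["status"]
    if new_status not in ALLOWED[current]:
        raise ValueError(f"{job['id']}: illegal transition {current} -> {new_status}")
    return {**job, "status": new_status}


job = {"id": "s200_N64_n50K", "status": "pending"}
job = transition(job, "running")
job = transition(job, "failed_oom")
job = transition(job, "pending")
print(job["status"])  # pending
```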
A "wave" is a batch of jobs that fit available GPUs. Next wave only starts when:
Input can be:
- … N=[64,128,256] × n=[50K,150K,500K,652K])

Save the built manifest to <project>/experiment_queue/<timestamp>/manifest.json for reproducibility.

- … cwd exists on remote
- … max_parallel free GPUs)

If any precondition fails, show the user which jobs are blocked and why.
Run tools/queue_manager.py (bundled with this skill) as a detached nohup process on the SSH host:
ssh <server> 'nohup python3 ~/.aris_queue/queue_manager.py \
  --manifest /tmp/manifest.json \
  --state /tmp/queue_state.json \
  --log /tmp/queue.log \
  > /tmp/queue_mgr.log 2>&1 &'
The scheduler:
- … screen
- … queue_state.json continuously

User can check state anytime:
ssh <server> cat /tmp/queue_state.json | jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'
Or invoke /monitor-experiment which reads the state file.
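
For environments without `jq`, the same per-status count can be computed in Python. This sketch assumes the state file has the shape implied by the `jq` expression above (a top-level `jobs` array whose entries carry a `status` field). The function name `status_counts` and the sample state are hypothetical.

```python
import json
from collections import Counter


def status_counts(state_json: str) -> dict:
    """Count jobs per status in the text of a queue_state.json file."""
    state = json.loads(state_json)
    return dict(Counter(job["status"] for job in state["jobs"]))


# Hypothetical state file content, for illustration only.
sample_state = json.dumps({
    "jobs": [
        {"id": "s200_N64_n50K", "status": "completed"},
        {"id": "s200_N128_n50K", "status": "running"},
        {"id": "s201_N64_n50K", "status": "running"},
    ]
})
print(status_counts(sample_state))  # {'completed': 1, 'running': 2}
```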
When all jobs in manifest.json are completed or stuck:
- … <project>/experiment_queue/<timestamp>/summary.md
- … /analyze-results if analyze_on_complete: true

Instead of writing 36 job entries manually:
grid:
  N: [64, 128, 256]
  n: [50000, 150000, 500000, 652000]
  seed: [42, 200, 201]
template:
  id: "s${seed}_N${N}_n${n}"
  args: {seed: ${seed}, n_hidden: ${N}, n_train_subset: ${n}}
Expands to 36 jobs automatically.
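
The expansion is a Cartesian product over the grid axes (3 × 4 × 3 = 36). The following is a minimal sketch of that idea, not the bundled `build_manifest.py`: `expand_grid` is a hypothetical name, and for simplicity the `args` template here maps each argument name to a grid variable name instead of using `${...}` placeholders.

```python
from __future__ import annotations

from itertools import product
from string import Template


def expand_grid(grid: dict, template: dict) -> list[dict]:
    """Expand a grid spec into one job dict per combination of axis values."""
    names = list(grid)
    jobs = []
    for values in product(*(grid[name] for name in names)):
        binding = dict(zip(names, values))
        # Fill ${seed}, ${N}, ${n} placeholders in the job id.
        job_id = Template(template["id"]).substitute(binding)
        # Simplification: args maps argument name -> grid variable name.
        args = {arg: binding[var] for arg, var in template["args"].items()}
        jobs.append({"id": job_id, "args": args})
    return jobs


grid = {
    "N": [64, 128, 256],
    "n": [50000, 150000, 500000, 652000],
    "seed": [42, 200, 201],
}
template = {
    "id": "s${seed}_N${N}_n${n}",
    "args": {"seed": "seed", "n_hidden": "N", "n_train_subset": "n"},
}
jobs = expand_grid(grid, template)
print(len(jobs))      # 36
print(jobs[0]["id"])  # s42_N64_n50000
```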
For sequential phases (teacher → student):
phases:
  - name: train_teachers
    grid:
      N: [384, 512]
    template:
      cmd: python run_pc_exp.py --direction c --backbone softmax --n_hidden ${N} ...
      output_check: checkpoints/transformer/pcc_softmax_L96_K500_N${N}_wikitext103.pt
  - name: distill_students
    depends_on: train_teachers
    grid:
      N: [384, 512]
      seed: [42, 200, 201]
    template:
      cmd: python run_pc_distill_exp.py --n_hidden ${N} --seed ${seed} ...
      output_check: figures/pcdistill_sw_N${N}_*_seed${seed}.json
Scheduler enforces depends_on: distill_students jobs stay pending until all
train_teachers jobs are completed.
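
The gating rule can be sketched as a check over job statuses grouped by phase. This is an illustration under assumptions: each job record is assumed to carry a `phase` field, and `runnable_phases` is a hypothetical helper, not a function from `queue_manager.py`.

```python
from __future__ import annotations


def runnable_phases(phases: list[dict], jobs: list[dict]) -> list[str]:
    """Return names of phases whose dependency, if any, is fully completed."""

    def phase_done(name: str) -> bool:
        phase_jobs = [job for job in jobs if job["phase"] == name]
        # An empty phase is treated as not done, to avoid starting dependents early.
        return bool(phase_jobs) and all(job["status"] == "completed" for job in phase_jobs)

    return [
        phase["name"]
        for phase in phases
        if phase.get("depends_on") is None or phase_done(phase["depends_on"])
    ]


phases = [
    {"name": "train_teachers"},
    {"name": "distill_students", "depends_on": "train_teachers"},
]
jobs = [
    {"id": "teacher_N384", "phase": "train_teachers", "status": "completed"},
    {"id": "teacher_N512", "phase": "train_teachers", "status": "running"},
]
print(runnable_phases(phases, jobs))  # ['train_teachers']
```

Once `teacher_N512` also reaches `completed`, `distill_students` appears in the result and its jobs can leave `pending`.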
Detect OOM from stdout:
torch\.OutOfMemoryError: CUDA out of memory
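
Applying the pattern above to a job log can be sketched as follows. Two points are assumptions, not part of the skill: that the job's stderr is captured into the same log as stdout (Python tracebacks are normally written to stderr), and that a broader pattern may be useful because some PyTorch releases report the error as `torch.cuda.OutOfMemoryError` or as a plain `RuntimeError`. The names `OOM_PATTERN`, `BROAD_OOM_PATTERN` and `is_oom` are hypothetical.

```python
import re

# The pattern given in this document.
OOM_PATTERN = re.compile(r"torch\.OutOfMemoryError: CUDA out of memory")

# Assumption: a broader variant that also covers other class names.
BROAD_OOM_PATTERN = re.compile(
    r"(torch\.(cuda\.)?OutOfMemoryError|RuntimeError): CUDA out of memory"
)


def is_oom(log_text: str, pattern: re.Pattern = OOM_PATTERN) -> bool:
    """Return True if the log text contains a CUDA out-of-memory error."""
    return pattern.search(log_text) is not None


log = "step 120 loss 3.21\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB"
print(is_oom(log))                    # True
print(is_oom("step 120 loss 3.21"))   # False
```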
On detection:
- Mark the job failed_oom
- Wait oom_retry.delay seconds
- Reset the job to pending
- Retry up to oom_retry.max_attempts before marking stuck

Every 60s, for each running screen:
- … screen -ls)
- … ps -p)
- … completed, kill stale screen
- … failed_other, kill screen

If scheduler crashes / is killed:
- … queue_state.json
- … running job: check screen; if still alive, keep; if not, re-evaluate state
- … pending: continue normally

# Experiment Queue Summary
**Project**: dllm_distill
**Started**: 2026-04-16 11:36:29
**Completed**: 2026-04-16 18:02:14
**Total wall-clock**: 6h 25m
**Jobs**: 40 completed, 2 OOM-retried then completed, 0 stuck
## Phases
| Phase | Jobs | Success | OOM retries | Duration |
| --- | --- | --- | --- | --- |
| train_teachers | 2 | 2 | 0 | 58m |
| distill_students | 24 | 24 | 2 | 4h 02m |
| multi_seed_validation | 16 | 16 | 0 | 1h 25m |
## Results Files
- 42 JSON files in `figures/pcdistill_sw_*.json`
## Next Steps
- Run `/analyze-results` on output JSONs
- Figures auto-regen via `artifact-sync` (if configured)
… /run-experiment

| Feature | /run-experiment | experiment-queue |
| --- | --- | --- |
| Single-shot experiment | ✅ | ✅ (overkill) |
| Multi-GPU parallel | Basic | Proper scheduling |
| Wave transitions | Manual | Automatic |
| OOM retry | Manual | Automatic |
| Stale screen cleanup | Manual | Automatic |
| Teacher→student chain | Manual | Built-in |
| State persistence | No | Yes (JSON) |
| Resume on crash | No | Yes |
| Grid expansion | Manual | Declarative |
Rule: Use /run-experiment for ≤5 jobs. Use experiment-queue for ≥10 jobs or anything with phases.
- … memory.used < 500 MiB before launching new job
- … queue_state.json
- … stuck and alert
- … stuck, alerts

User: "Run all T5+T6 experiments: T5 = N∈{80,192} × n 4 values × seed {200,201}, T6 = N∈{384,512} × n 4 values × seed {42,200,201}; T6 needs to train the teacher first"
Claude invokes /experiment-queue:
Then user can check anytime or wait for summary report.
- /run-experiment: single experiment deployment
- /monitor-experiment: check progress (now reads from queue_state.json)
- /analyze-results: post-hoc analysis
- tools/queue_manager.py (bundled): the scheduler implementation
- tools/build_manifest.py (bundled): build manifest from grid spec

Identified via 2026-04-16 post-mortem analysis (Codex GPT-5.4 xhigh) of a 1.5-day multi-seed paper experiment session:
This skill targets the wall-clock sink specifically; see artifact-sync and
paper-fix-auto-apply for the other two.