skills/skills-codex/experiment-queue/SKILL.md
SSH job queue for multi-seed/multi-config ML experiments with OOM-aware retry, stale-screen cleanup, and wave-transition race prevention. Use when user says "batch experiments", "队列实验", "run grid", "multi-seed sweep", "auto-chain experiments", or when /run-experiment is insufficient for 10+ jobs that need orchestration.
Orchestrate large batches of ML experiments on SSH remote GPU servers with proper state tracking, OOM retry, stale cleanup, and wave transitions.
Use when /run-experiment is insufficient:
Do NOT use for:
- Small runs that /run-experiment already covers (single-shot experiments, ≤5 jobs)

Based on session audit (2026-04-16), the major wall-clock sinks in multi-seed grid experiments are:
All of these are pure engineering friction that can be orchestrated.
A manifest lists jobs with explicit state:
project: my_grid_experiment
cwd: /home/user/your_project
conda: my_env
# Optional: override conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full `eval "$(... shell.bash hook)"` string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: gpu-server
default_cmd: >
  python run_distill.py --backbone softmax --lam 0.5
  --K 500 --L 96 --W 16 --n_steps 30000 --batch_size 128 --lr 1e-4
preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/teacher_L96_K500_N{N}.pt
gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500  # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3
jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
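To illustrate how `gpus`, `max_parallel`, and `gpu_free_threshold_mib` could interact, here is a minimal sketch of a free-GPU filter. It is not taken from queue_manager.py. The function name `pick_free_gpus` is hypothetical, and the sketch assumes the input is the text printed by `nvidia-smi --query-gpu=index,memory.used --format=csv,noheader,nounits` (one `index, used_mib` pair per line).

```python
def pick_free_gpus(smi_csv, allowed, threshold_mib=500, max_parallel=8):
    """Return up to max_parallel allowed GPU indices whose memory.used is below the threshold.

    smi_csv is assumed to hold one "index, used_mib" pair per line.
    """
    free = []
    for line in smi_csv.strip().splitlines():
        index_str, used_str = (part.strip() for part in line.split(","))
        index, used_mib = int(index_str), int(used_str)
        # A GPU counts as free only if the manifest allows it and it is nearly idle.
        if index in allowed and used_mib < threshold_mib:
            free.append(index)
    return free[:max_parallel]


sample = "0, 12\n1, 40213\n2, 3\n3, 498\n"
print(pick_free_gpus(sample, allowed=[0, 1, 2, 3]))  # [0, 2, 3]
```

Raising `threshold_mib` on a shared server tolerates small background allocations from other users. Lowering it makes the filter stricter.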
pending → running → completed
                  ↘ failed_oom → pending (after delay) [retry up to N]
                  ↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → cleaned → pending
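The transitions above can be sketched as a small state machine. This is an illustrative model, not the scheduler's code. The `Job` class and the `finish` and `requeue` helpers are hypothetical, and the sketch assumes "retry up to N" means a job becomes stuck once it has failed with OOM `max_attempts` times.

```python
from dataclasses import dataclass, field


@dataclass
class Job:
    id: str
    status: str = "pending"
    oom_failures: int = 0
    history: list = field(default_factory=list)

    def move(self, new_status):
        # Record every transition so the path through the diagram can be inspected.
        self.history.append((self.status, new_status))
        self.status = new_status


def finish(job, exit_ok, saw_oom, max_attempts=3):
    """Apply the transition for a running job whose process has ended."""
    if exit_ok:
        job.move("completed")
    elif saw_oom:
        job.oom_failures += 1
        job.move("failed_oom")
        if job.oom_failures >= max_attempts:
            job.move("stuck")  # assumption: retries are exhausted at max_attempts failures
    else:
        job.move("failed_other")
        job.move("stuck")  # needs manual inspection
    return job.status


def requeue(job):
    """After the retry delay, an OOM-failed job goes back to pending."""
    if job.status == "failed_oom":
        job.move("pending")


job = Job("s200_N64_n50K")
job.move("running")
finish(job, exit_ok=False, saw_oom=True)
requeue(job)
print(job.status, job.history)
```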
A "wave" is a batch of jobs that fit available GPUs. Next wave only starts when:
Input can be:
- A grid specification (e.g. N=[64,128,256] × n=[50K,150K,500K,652K])

Bind run identifiers once so every later step refers to the same paths:
# REPLACE the placeholder path before running, or pre-export PROJECT_DIR:
PROJECT_DIR="${PROJECT_DIR:?set PROJECT_DIR to the local project root}"
RUN_TS=$(date -u +%Y%m%dT%H%M%SZ)
LOCAL_RUN_DIR="$PROJECT_DIR/experiment_queue/$RUN_TS"
mkdir -p "$LOCAL_RUN_DIR"
Save the built manifest to $LOCAL_RUN_DIR/manifest.json for reproducibility.
Preconditions checked before launch:
- cwd exists on remote
- Enough GPUs are available (max_parallel free GPUs)

If any precondition fails, show the user which jobs are blocked and why.
Resolve the bundled helper directory ($PROJECT_DIR / $RUN_TS / $LOCAL_RUN_DIR already set in Step 1). Phase 3.3 (Arch C) moved the canonical scripts to skills/experiment-queue/scripts/; tools/experiment_queue/ retains os.execv shims for legacy resolver layers:
if [ -z "${ARIS_REPO:-}" ] && [ -f .aris/installed-skills-codex.txt ]; then
ARIS_REPO=$(awk -F'\t' '$1=="repo_root"{print $2; exit}' .aris/installed-skills-codex.txt 2>/dev/null) || true
fi
[ -n "${ARIS_REPO:-}" ] || { echo "ERROR: ARIS_REPO not set. Use install_aris_codex.sh managed install or export ARIS_REPO=/path/to/ARIS."; exit 1; }
# Prefer the new canonical location; fall back to legacy tools/ shim path.
QUEUE_TOOLS="$ARIS_REPO/skills/experiment-queue/scripts"
[ -f "$QUEUE_TOOLS/queue_manager.py" ] || QUEUE_TOOLS="$ARIS_REPO/tools/experiment_queue"
[ -f "$QUEUE_TOOLS/queue_manager.py" ] || { echo "ERROR: queue_manager.py not found at $ARIS_REPO/skills/experiment-queue/scripts/ or $ARIS_REPO/tools/experiment_queue/"; exit 1; }
Compute remote paths (note: modern scp runs in SFTP mode and does NOT reliably expand $HOME in destination paths — use remote-relative for scp, $HOME-prefixed for ssh command strings):
REMOTE_RUN_REL=".aris_queue/runs/$RUN_TS"
REMOTE_RUN_DIR="\$HOME/$REMOTE_RUN_REL"
Bootstrap remote run dir + copy helpers + copy manifest. Per-invocation, idempotent:
ssh <server> "mkdir -p \"$REMOTE_RUN_DIR/logs\" \"\$HOME/.aris_queue\""
scp "$QUEUE_TOOLS/queue_manager.py" "$QUEUE_TOOLS/build_manifest.py" <server>:.aris_queue/
scp "$LOCAL_RUN_DIR/manifest.json" <server>:"$REMOTE_RUN_REL/manifest.json"
Launch the scheduler as a detached nohup process:
ssh <server> "nohup python3 \"\$HOME/.aris_queue/queue_manager.py\" \\
--manifest \"$REMOTE_RUN_DIR/manifest.json\" \\
--state \"$REMOTE_RUN_DIR/queue_state.json\" \\
--log-dir \"$REMOTE_RUN_DIR/logs\" \\
> \"$REMOTE_RUN_DIR/queue_mgr.log\" 2>&1 &"
Notes: --log-dir is what queue_manager.py actually consumes (per-job log files for OOM detection). Do NOT pass --log <path> — that flag is declared but unused.
Persist run identifiers for monitoring + resume (sourceable later):
{
printf 'PROJECT_DIR=%q\n' "$PROJECT_DIR"
printf 'RUN_TS=%q\n' "$RUN_TS"
printf 'LOCAL_RUN_DIR=%q\n' "$LOCAL_RUN_DIR"
printf 'REMOTE_RUN_REL=%q\n' "$REMOTE_RUN_REL"
printf 'REMOTE_RUN_DIR=%q\n' "$REMOTE_RUN_DIR"
} > "$LOCAL_RUN_DIR/run_meta.txt"
%q shell-escapes values; REMOTE_RUN_DIR keeps a literal $HOME (correct for later reuse inside ssh "...").
Resume an existing queue. Do NOT regenerate RUN_TS. Reload from run_meta.txt and re-run only the launch command above (not the bootstrap):
LOCAL_RUN_DIR="/abs/path/to/project/experiment_queue/<existing-run-ts>"
. "$LOCAL_RUN_DIR/run_meta.txt"
# Then re-run the launch command verbatim; do NOT re-run mkdir/scp.
The scheduler:
- Runs each job in a screen session
- Updates queue_state.json continuously

User can check state anytime, using $REMOTE_RUN_DIR from Step 3 (or reload it from $LOCAL_RUN_DIR/run_meta.txt):
ssh <server> "cat \"$REMOTE_RUN_DIR/queue_state.json\"" \
| jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'
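Where `jq` is not installed locally, the same per-status count can be computed in Python. This is a minimal sketch that assumes only what the jq filter already implies: `queue_state.json` has a top-level `jobs` array whose entries carry a `status` field. The function name `status_counts` and the sample job ids are hypothetical.

```python
import json
from collections import Counter


def status_counts(state_text):
    """Count jobs per status from the text of queue_state.json."""
    state = json.loads(state_text)
    return dict(Counter(job["status"] for job in state["jobs"]))


# Hypothetical sample; real state files may carry more fields per job.
sample = json.dumps({"jobs": [
    {"id": "a", "status": "completed"},
    {"id": "b", "status": "pending"},
    {"id": "c", "status": "completed"},
]})
print(status_counts(sample))
```

In practice the input text would be the captured stdout of the `ssh <server> "cat ..."` command above.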
Note: /monitor-experiment is currently focused on screen sessions, result JSONs, and W&B; it does not yet read queue_state.json directly. For queue-state monitoring, use the literal command above.
When all jobs in manifest.json are completed or stuck:
- The scheduler (queue_manager.py) exits cleanly with `All jobs done` to its own stdout (captured in $REMOTE_RUN_DIR/queue_mgr.log). It does NOT write the local summary.
- Write $LOCAL_RUN_DIR/summary.md locally (read $REMOTE_RUN_DIR/queue_state.json, group by status, optionally pull per-job logs).
- Invoke /analyze-results if analyze_on_complete: true.

Instead of writing 36 job entries manually:
grid:
  N: [64, 128, 256]
  n: [50000, 150000, 500000, 652000]
  seed: [42, 200, 201]
template:
  id: "s${seed}_N${N}_n${n}"
  args: {seed: ${seed}, n_hidden: ${N}, n_train_subset: ${n}}
Expands to 36 jobs automatically.
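The count follows from the Cartesian product of the grid axes: 3 values of N × 4 values of n × 3 seeds = 36. A minimal sketch of such an expansion is shown below. It is not the bundled build_manifest.py; the `expand_grid` function and its `arg_names` mapping are hypothetical, and `string.Template` stands in for whatever substitution the real tool performs.

```python
from itertools import product
from string import Template


def expand_grid(grid, id_template, arg_names):
    """Expand a grid into one job dict per combination of axis values.

    arg_names maps each job argument name to the grid axis that supplies it.
    """
    keys = list(grid)
    jobs = []
    for values in product(*(grid[key] for key in keys)):
        point = dict(zip(keys, values))
        jobs.append({
            "id": Template(id_template).substitute(point),
            "args": {arg: point[axis] for arg, axis in arg_names.items()},
        })
    return jobs


grid = {
    "N": [64, 128, 256],
    "n": [50000, 150000, 500000, 652000],
    "seed": [42, 200, 201],
}
jobs = expand_grid(
    grid,
    "s${seed}_N${N}_n${n}",
    {"seed": "seed", "n_hidden": "N", "n_train_subset": "n"},
)
print(len(jobs))      # 36
print(jobs[0]["id"])  # s42_N64_n50000
```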
For sequential phases (teacher → student):
phases:
  - name: train_teachers
    grid:
      N: [384, 512]
    template:
      cmd: python run_train.py --direction c --backbone softmax --n_hidden ${N} ...
      output_check: checkpoints/transformer/teacher_L96_K500_N${N}.pt
  - name: distill_students
    depends_on: train_teachers
    grid:
      N: [384, 512]
      seed: [42, 200, 201]
    template:
      cmd: python run_distill.py --n_hidden ${N} --seed ${seed} ...
      output_check: figures/distill_sw_N${N}_*_seed${seed}.json
Scheduler enforces depends_on: distill_students jobs stay pending until all
train_teachers jobs are completed.
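The gating rule can be sketched as follows: a phase may launch jobs only when the phase it depends on has every job completed. This is an illustrative model under assumed data shapes, not the scheduler's implementation. The function `runnable_phases`, the `jobs` list per phase, and the job ids are hypothetical.

```python
def runnable_phases(phases, job_status):
    """Return names of unfinished phases whose dependency (if any) is fully completed.

    phases: list of dicts with "name", optional "depends_on", and "jobs" (list of job ids).
    job_status: dict mapping job id to its current status string.
    """
    done = {
        phase["name"]
        for phase in phases
        if all(job_status[job_id] == "completed" for job_id in phase["jobs"])
    }
    ready = []
    for phase in phases:
        dependency = phase.get("depends_on")
        if phase["name"] not in done and (dependency is None or dependency in done):
            ready.append(phase["name"])
    return ready


phases = [
    {"name": "train_teachers", "jobs": ["teacher_N384", "teacher_N512"]},
    {"name": "distill_students", "depends_on": "train_teachers", "jobs": ["student_N384_s42"]},
]
status = {"teacher_N384": "completed", "teacher_N512": "running", "student_N384_s42": "pending"}
print(runnable_phases(phases, status))  # ['train_teachers']

status["teacher_N512"] = "completed"
print(runnable_phases(phases, status))  # ['distill_students']
```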
Detect OOM from stdout:
torch\.OutOfMemoryError: CUDA out of memory
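A minimal sketch of applying this pattern to a job's log text is shown below. The helper name `log_has_oom` and the sample log lines are hypothetical; only the regular expression comes from the text above.

```python
import re

# The pattern quoted above, compiled once.
OOM_PATTERN = re.compile(r"torch\.OutOfMemoryError: CUDA out of memory")


def log_has_oom(log_text):
    """Return True if the log text contains the OOM signature."""
    return OOM_PATTERN.search(log_text) is not None


sample_log = (
    "step 1200 loss 0.4312\n"
    "torch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB\n"
)
print(log_has_oom(sample_log))                     # True
print(log_has_oom("step 1200 loss 0.4312\n"))      # False
```

Some PyTorch releases report the error as `torch.cuda.OutOfMemoryError`, which this exact pattern does not match, so the pattern may need widening depending on the installed version.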
On detection:
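- Mark the job failed_oom
- Wait oom_retry.delay seconds
- Reset the job to pending
- Retry up to oom_retry.max_attempts before marking stuck

Every 60s, for each running screen:
- Check that the screen session exists (screen -ls)
- Check that the job process is alive (ps -p)
- Job finished successfully but the screen remains: mark completed, kill stale screen
- Job died without succeeding: mark failed_other, kill screen

If scheduler crashes / is killed:
- Reload queue_state.json
- For each running job: check screen; if still alive, keep; if not, re-evaluate state
- For pending: continue normally

# Experiment Queue Summary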
**Project**: my_grid_experiment
**Started**: 2026-04-16 11:36:29
**Completed**: 2026-04-16 18:02:14
**Total wall-clock**: 6h 25m
**Jobs**: 40 completed, 2 OOM-retried then completed, 0 stuck
## Phases
| Phase | Jobs | Success | OOM retries | Duration |
| --- | --- | --- | --- | --- |
| train_teachers | 2 | 2 | 0 | 58m |
| distill_students | 24 | 24 | 2 | 4h 02m |
| multi_seed_validation | 16 | 16 | 0 | 1h 25m |
## Results Files
- 42 JSON files in `figures/distill_sw_*.json`
## Next Steps
- Run `/analyze-results` on output JSONs
- Figures auto-regen via `artifact-sync` (if configured)
Comparison with /run-experiment:

| Feature | /run-experiment | experiment-queue |
| --- | --- | --- |
| Single-shot experiment | ✅ | ✅ (overkill) |
| Multi-GPU parallel | Basic | Proper scheduling |
| Wave transitions | Manual | Automatic |
| OOM retry | Manual | Automatic |
| Stale screen cleanup | Manual | Automatic |
| Teacher→student chain | Manual | Built-in |
| State persistence | No | Yes (JSON) |
| Resume on crash | No | Yes |
| Grid expansion | Manual | Declarative |
Rule: Use /run-experiment for ≤5 jobs. Use experiment-queue for ≥10 jobs or anything with phases.
- Require memory.used < 500 MiB before launching new job
- … queue_state.json
- … stuck and alert
- … stuck, alerts

User: "Run all T5+T6 experiments: T5 = N∈{80,192} × 4 values of n × seed {200,201}, T6 = N∈{384,512} × 4 values of n × seed {42,200,201}; T6 needs the teachers trained first"
Claude invokes /experiment-queue:
Then user can check anytime or wait for summary report.
- /run-experiment: single experiment deployment
- /monitor-experiment: check progress (screen sessions, result JSONs, and W&B; it does not yet read queue_state.json directly)
- /analyze-results: post-hoc analysis
- skills/experiment-queue/scripts/queue_manager.py (canonical, Phase 3.3 move): the scheduler implementation. Legacy entry at tools/experiment_queue/queue_manager.py is an os.execv shim.
- skills/experiment-queue/scripts/build_manifest.py (canonical, Phase 3.3 move): build manifest from grid spec. Legacy entry at tools/experiment_queue/build_manifest.py is an os.execv shim.

Identified via 2026-04-16 post-mortem analysis (Codex GPT-5.5 xhigh) of a 1.5-day multi-seed paper experiment session:
This skill targets the wall-clock sink specifically; see artifact-sync and
paper-fix-auto-apply for the other two.