Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

wanshuiyin/experiment-queue

Name: experiment-queue
Author: wanshuiyin

skills/skills-codex/experiment-queue/SKILL.md

npx skillsauth add wanshuiyin/Auto-claude-code-research-in-sleep experiment-queue

Experiment Queue

Orchestrate large batches of ML experiments on SSH remote GPU servers with proper state tracking, OOM retry, stale cleanup, and wave transitions.

When to Use This Skill

Use when /run-experiment is insufficient:

≥10 jobs that need batching across GPUs
Multi-seed sweeps (e.g., 21 seeds × 12 cells)
Wave transitions (run wave 1, wait, run wave 2, wait, run wave 3...)
Teacher+student chains (train teacher then distill; auto-trigger student after teacher done)
OOM-prone configs where you need to retry with different GPU or wait
Mixed seed grids where failed cells need re-running

Do NOT use for:

Single ad-hoc experiment (use /run-experiment)
Modal/Vast.ai deployments (those have their own orchestration)
Experiments that need manual inspection between runs

Why This Exists

Based on a session audit (2026-04-16), the major wall-clock sinks in multi-seed grid experiments are:

Stale screens — python finishes, wandb uploads, screen hangs, next wave blocked
OOM on shared GPU — previous job's memory not yet released
Wave race — new wave launches before previous wave fully settles
Missing checkpoints — student launches before teacher saved
Parser duplication — rewriting multi-seed analysis python every batch

All of these are pure engineering friction that a scheduler can handle automatically.

Core Concepts

Environment contract: queue jobs assume the target env is already built and validated per ../shared-references/compute-env-contract.md (spec-hash ledger + kernel witness). A wave of jobs dying at import time means the env contract was skipped; it is not a queue bug. Check the provider's .aris/compute/<provider>.md ledger before re-queueing.

Job Manifest

A manifest lists jobs with explicit state:

project: my_grid_experiment
cwd: /home/user/your_project
conda: my_env
# Optional: override conda hook path if conda is not at a standard location.
# Can be a bare path (wrapped automatically) or a full `eval "$(... shell.bash hook)"` string.
# Falls back to auto-detect of ~/anaconda3, ~/miniconda3, /opt/anaconda3, etc.,
# or the ARIS_CONDA_HOOK environment variable.
# conda_hook: /custom/path/to/conda
ssh: gpu-server
default_cmd: >
  python run_distill.py --backbone softmax --lam 0.5
  --K 500 --L 96 --W 16 --n_steps 30000 --batch_size 128 --lr 1e-4

preconditions:
  - type: checkpoint_exists
    path: checkpoints/transformer/teacher_L96_K500_N{N}.pt

gpus: [0, 1, 2, 3, 4, 5, 6, 7]
max_parallel: 8
gpu_free_threshold_mib: 500  # optional, default 500; raise for shared servers, lower for tight packing
oom_retry:
  delay: 120
  max_attempts: 3

jobs:
  - id: s200_N64_n50K
    args: {seed: 200, n_hidden: 64, n_train_subset: 50000, subset_seed: 2024}
  - id: s200_N128_n50K
    args: {seed: 200, n_hidden: 128, n_train_subset: 50000, subset_seed: 2024}
  # ... 14 more
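How `default_cmd` and a job's `args` combine into the final command line is not spelled out above. A minimal sketch, assuming each `args` entry is appended as a `--key value` flag; the `build_command` helper is hypothetical and not taken from `queue_manager.py`:

```python
import shlex


def build_command(default_cmd: str, args: dict) -> str:
    """Append each args entry to default_cmd as a `--key value` flag (assumed mapping)."""
    parts = shlex.split(default_cmd)  # the folded YAML scalar arrives as a single line
    for key, value in args.items():
        parts += [f"--{key}", str(value)]
    return shlex.join(parts)  # re-quote so the result is safe to pass to a shell


cmd = build_command(
    "python run_distill.py --backbone softmax --lam 0.5",
    {"seed": 200, "n_hidden": 64},
)
print(cmd)
```

If the real scheduler maps `args` differently (for example positional arguments or environment variables), only the loop body would change.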

Job State Machine

pending → running → completed
                 ↘ failed_oom → pending (after delay) [retry up to N]
                 ↘ failed_other → stuck (needs manual inspection)
stale_screen_detected → cleaned → pending
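The diagram can be read as a transition table. The sketch below is illustrative only, not the scheduler's actual code: the `JobState` and `advance` names are hypothetical, the `running → stale_screen_detected` edge is an assumption (the diagram does not say where a stale screen is detected from), and `failed_oom → stuck` reflects the bounded retry described under OOM Handling.

```python
from enum import Enum


class JobState(str, Enum):
    PENDING = "pending"
    RUNNING = "running"
    COMPLETED = "completed"
    FAILED_OOM = "failed_oom"
    FAILED_OTHER = "failed_other"
    STUCK = "stuck"
    STALE = "stale_screen_detected"
    CLEANED = "cleaned"


# Allowed next states for each state. RUNNING -> STALE is an assumed edge.
TRANSITIONS = {
    JobState.PENDING: {JobState.RUNNING},
    JobState.RUNNING: {
        JobState.COMPLETED,
        JobState.FAILED_OOM,
        JobState.FAILED_OTHER,
        JobState.STALE,
    },
    JobState.FAILED_OOM: {JobState.PENDING, JobState.STUCK},
    JobState.FAILED_OTHER: {JobState.STUCK},
    JobState.STALE: {JobState.CLEANED},
    JobState.CLEANED: {JobState.PENDING},
    JobState.COMPLETED: set(),  # terminal
    JobState.STUCK: set(),      # terminal until a human or agent intervenes
}


def advance(current: JobState, target: JobState) -> JobState:
    """Return target if the move is allowed, otherwise raise ValueError."""
    if target not in TRANSITIONS[current]:
        raise ValueError(f"illegal transition: {current.value} -> {target.value}")
    return target
```

Rejecting illegal moves at one choke point makes it harder for a buggy code path to resurrect a completed job.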

Operator note on stuck (the agent's move, not the queue's): the queue deterministically parks failed_other jobs as stuck; that part is code and unchanged. Before handing a stuck batch to the human, the OPERATING AGENT should check whether the same failure repeats across jobs. If it does, try ONE clean reimplementation of the agent-generated wrapper/attempt script only, and never touch user/project source, the manifest, queue state, logs, or results (per external-cadence.md, "Let a broken attempt restart, not just patch"). Reserve the human handoff for contract/environment doubts, not merely broken attempt code.

Wave Orchestration

A "wave" is a batch of jobs that fit available GPUs. Next wave only starts when:

All current-wave python processes have exited
No stale screens remain for current-wave tags
GPU memory in use has dropped to the threshold or below (gpu_free_threshold_mib, default 500 MiB)
Precondition checks pass for next-wave jobs
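The four conditions above amount to a single predicate. A minimal sketch under stated assumptions: the `WaveSnapshot` container and its field names are hypothetical, and how each field gets populated (for example from `ps`, `screen -ls`, or `nvidia-smi`) is left out.

```python
from dataclasses import dataclass


@dataclass
class WaveSnapshot:
    live_python_pids: list     # current-wave python processes still running
    stale_screens: list        # current-wave screen sessions with no python inside
    gpu_used_mib: dict         # GPU index -> memory currently in use (MiB)
    unmet_preconditions: list  # precondition paths still missing for next-wave jobs


def next_wave_may_start(snap: WaveSnapshot, threshold_mib: int = 500) -> bool:
    """True only when every settle condition holds at once."""
    return (
        not snap.live_python_pids
        and not snap.stale_screens
        and all(used <= threshold_mib for used in snap.gpu_used_mib.values())
        and not snap.unmet_preconditions
    )


print(next_wave_may_start(WaveSnapshot([], [], {0: 120, 1: 480}, [])))
```

Evaluating all conditions on one snapshot avoids the wave race where memory looks free for a moment while a process is still exiting.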

Workflow

Step 1: Parse Manifest / Build from Grid

Input can be:

YAML manifest (explicit job list, recommended for complex cases)
Grid spec (Cartesian product of param values, e.g., N=[64,128,256] × n=[50K,150K,500K,652K])
Natural language description (Claude parses into manifest)

Bind run identifiers once so every later step refers to the same paths:

# PROJECT_DIR must be exported before running; the :? expansion below aborts with this message if it is unset or empty:
PROJECT_DIR="${PROJECT_DIR:?set PROJECT_DIR to the local project root}"
RUN_TS=$(date -u +%Y%m%dT%H%M%SZ)
LOCAL_RUN_DIR="$PROJECT_DIR/experiment_queue/$RUN_TS"
mkdir -p "$LOCAL_RUN_DIR"

Save the built manifest to $LOCAL_RUN_DIR/manifest.json for reproducibility.

Step 2: Pre-flight

Check SSH connection works
Check conda env exists on remote
Check cwd exists on remote
Check all preconditions (checkpoints, input files)
Check GPU availability (at least max_parallel free GPUs)

If any precondition fails, show user which jobs are blocked and why.

Step 3: Launch Scheduler

Resolve the bundled helper directory ($PROJECT_DIR / $RUN_TS / $LOCAL_RUN_DIR already set in Step 1). Phase 3.3 (Arch C) moved the canonical scripts to skills/experiment-queue/scripts/; tools/experiment_queue/ retains os.execv shims for legacy resolver layers:

if [ -z "${ARIS_REPO:-}" ] && [ -f .aris/installed-skills-codex.txt ]; then
    ARIS_REPO=$(awk -F'\t' '$1=="repo_root"{print $2; exit}' .aris/installed-skills-codex.txt 2>/dev/null) || true
fi
[ -n "${ARIS_REPO:-}" ] || { echo "ERROR: ARIS_REPO not set. Use install_aris_codex.sh managed install or export ARIS_REPO=/path/to/ARIS."; exit 1; }
# Prefer the new canonical location; fall back to legacy tools/ shim path.
QUEUE_TOOLS="$ARIS_REPO/skills/experiment-queue/scripts"
[ -f "$QUEUE_TOOLS/queue_manager.py" ] || QUEUE_TOOLS="$ARIS_REPO/tools/experiment_queue"
[ -f "$QUEUE_TOOLS/queue_manager.py" ] || { echo "ERROR: queue_manager.py not found at $ARIS_REPO/skills/experiment-queue/scripts/ or $ARIS_REPO/tools/experiment_queue/"; exit 1; }

Compute remote paths (note: modern scp runs in SFTP mode and does NOT reliably expand $HOME in destination paths — use remote-relative for scp, $HOME-prefixed for ssh command strings):

REMOTE_RUN_REL=".aris_queue/runs/$RUN_TS"
REMOTE_RUN_DIR="\$HOME/$REMOTE_RUN_REL"

Bootstrap remote run dir + copy helpers + copy manifest. Per-invocation, idempotent:

ssh <server> "mkdir -p \"$REMOTE_RUN_DIR/logs\" \"\$HOME/.aris_queue\""
scp "$QUEUE_TOOLS/queue_manager.py" "$QUEUE_TOOLS/build_manifest.py" <server>:.aris_queue/
scp "$LOCAL_RUN_DIR/manifest.json" <server>:"$REMOTE_RUN_REL/manifest.json"

Launch the scheduler as a detached nohup process:

ssh <server> "nohup python3 \"\$HOME/.aris_queue/queue_manager.py\" \\
  --manifest \"$REMOTE_RUN_DIR/manifest.json\" \\
  --state    \"$REMOTE_RUN_DIR/queue_state.json\" \\
  --log-dir  \"$REMOTE_RUN_DIR/logs\" \\
  > \"$REMOTE_RUN_DIR/queue_mgr.log\" 2>&1 &"

Notes: --log-dir is what queue_manager.py actually consumes (per-job log files for OOM detection). Do NOT pass --log <path> — that flag is declared but unused.

Persist run identifiers for monitoring + resume (sourceable later):

{
  printf 'PROJECT_DIR=%q\n'    "$PROJECT_DIR"
  printf 'RUN_TS=%q\n'         "$RUN_TS"
  printf 'LOCAL_RUN_DIR=%q\n'  "$LOCAL_RUN_DIR"
  printf 'REMOTE_RUN_REL=%q\n' "$REMOTE_RUN_REL"
  printf 'REMOTE_RUN_DIR=%q\n' "$REMOTE_RUN_DIR"
} > "$LOCAL_RUN_DIR/run_meta.txt"

%q shell-escapes values; REMOTE_RUN_DIR keeps a literal $HOME (correct for later reuse inside ssh "...").
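If a Python tool needs the same identifiers, run_meta.txt can be read without sourcing it in a shell. A minimal sketch, assuming every line is a single `KEY=value` assignment escaped with backslashes as in the example values here; `parse_run_meta` is a hypothetical helper, and values that bash renders in `$'...'` form (for example ones containing control characters) are not handled.

```python
import shlex


def parse_run_meta(text: str) -> dict:
    """Parse KEY=value lines written with bash `printf %q` into a dict."""
    meta = {}
    for line in text.splitlines():
        if not line.strip():
            continue
        # shlex.split removes backslash and quote escaping like a POSIX shell,
        # but never expands variables, so an escaped \$HOME stays a literal $HOME.
        (assignment,) = shlex.split(line)
        key, _, value = assignment.partition("=")
        meta[key] = value
    return meta


sample = (
    "RUN_TS=20260416T113629Z\n"
    "REMOTE_RUN_DIR=\\$HOME/.aris_queue/runs/20260416T113629Z\n"
)
meta = parse_run_meta(sample)
print(meta["REMOTE_RUN_DIR"])
```

The literal `$HOME` is preserved on purpose: it is only expanded later, on the remote side, when the value is placed inside an ssh command string.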

Resume an existing queue. Do NOT regenerate RUN_TS. Reload from run_meta.txt and re-run only the launch command above (not the bootstrap):

LOCAL_RUN_DIR="/abs/path/to/project/experiment_queue/<existing-run-ts>"
. "$LOCAL_RUN_DIR/run_meta.txt"
# Then re-run the launch command verbatim; do NOT re-run mkdir/scp.

The scheduler:

Reads manifest
Loops: for each pending job, assign to free GPU, launch via screen
Polls job status (every 60s)
Detects stale screens (python exited but the screen session is still alive → kill)
Detects OOM (CUDA OOM in log → mark failed_oom → retry after delay)
Detects completion (expected output JSON/file exists) → mark completed
Launches next wave when current wave settles
Writes state to queue_state.json continuously

Step 4: Monitoring

User can check state anytime, using $REMOTE_RUN_DIR from Step 3 (or reload it from $LOCAL_RUN_DIR/run_meta.txt):

ssh <server> "cat \"$REMOTE_RUN_DIR/queue_state.json\"" \
  | jq '.jobs | group_by(.status) | map({(.[0].status): length}) | add'

Note: /monitor-experiment is currently focused on screen sessions, result JSONs, and W&B; it does not yet read queue_state.json directly. For queue-state monitoring, use the literal command above.
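Where jq is not installed, the same per-status count can be done in Python. A minimal sketch, assuming (as the jq expression implies) that queue_state.json holds a top-level `jobs` array whose entries each carry a `status` field; `count_by_status` is a hypothetical helper.

```python
import json
from collections import Counter


def count_by_status(state_json: str) -> dict:
    """Return {status: number_of_jobs} from the text of queue_state.json."""
    state = json.loads(state_json)
    return dict(Counter(job["status"] for job in state["jobs"]))


example = json.dumps({
    "jobs": [
        {"id": "s200_N64_n50K", "status": "completed"},
        {"id": "s200_N128_n50K", "status": "running"},
        {"id": "s201_N64_n50K", "status": "completed"},
    ]
})
print(count_by_status(example))
```

The JSON text could come from the same `ssh <server> "cat ..."` call shown above, piped into a script that reads stdin.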

Step 5: Post-completion

When all jobs in manifest.json are completed or stuck:

The remote scheduler (queue_manager.py) exits cleanly after printing "All jobs done" to its own stdout (captured in $REMOTE_RUN_DIR/queue_mgr.log). It does NOT write the local summary.
The local skill agent then aggregates state into $LOCAL_RUN_DIR/summary.md (read $REMOTE_RUN_DIR/queue_state.json, group by status, optionally pull per-job logs).
Local skill agent invokes /analyze-results if analyze_on_complete: true.

Grid Spec Syntax

Instead of writing 36 job entries manually:

grid:
  N: [64, 128, 256]
  n: [50000, 150000, 500000, 652000]
  seed: [42, 200, 201]
template:
  id: "s${seed}_N${N}_n${n}"
  args: {seed: ${seed}, n_hidden: ${N}, n_train_subset: ${n}}

Expands to 36 jobs automatically.
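The count follows from the Cartesian product: 3 values of N × 4 values of n × 3 seeds = 36. A minimal sketch of such an expansion, not the actual `build_manifest.py` logic: `expand_grid` is a hypothetical helper, and to keep integer types the template's `args` here map each argument name to a grid key instead of a `${...}` string.

```python
from itertools import product
from string import Template


def expand_grid(grid: dict, template: dict) -> list:
    """Expand a grid into one job dict per combination of parameter values."""
    keys = list(grid)
    jobs = []
    for values in product(*(grid[k] for k in keys)):
        params = dict(zip(keys, values))
        jobs.append({
            # string.Template understands the same ${name} placeholders
            "id": Template(template["id"]).substitute(params),
            "args": {arg: params[key] for arg, key in template["args"].items()},
        })
    return jobs


grid = {
    "N": [64, 128, 256],
    "n": [50000, 150000, 500000, 652000],
    "seed": [42, 200, 201],
}
template = {
    "id": "s${seed}_N${N}_n${n}",
    "args": {"seed": "seed", "n_hidden": "N", "n_train_subset": "n"},
}
jobs = expand_grid(grid, template)
print(len(jobs), jobs[0]["id"])
```

`Template.substitute` raises `KeyError` on an unknown placeholder, which surfaces typos in the template before any job is queued.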

Wave Chaining

For sequential phases (teacher → student):

phases:
  - name: train_teachers
    grid:
      N: [384, 512]
    template:
      cmd: python run_train.py --direction c --backbone softmax --n_hidden ${N} ...
      output_check: checkpoints/transformer/teacher_L96_K500_N${N}.pt
  
  - name: distill_students
    depends_on: train_teachers
    grid:
      N: [384, 512]
      seed: [42, 200, 201]
    template:
      cmd: python run_distill.py --n_hidden ${N} --seed ${seed} ...
      output_check: figures/distill_sw_N${N}_*_seed${seed}.json

Scheduler enforces depends_on: distill_students jobs stay pending until all train_teachers jobs are completed.
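The `depends_on` gate can be sketched as a check over job statuses. This is an illustration under assumptions, not the scheduler's code: `launchable_phases` is hypothetical, each phase is assumed to carry a `jobs` list of already-expanded job ids, and `depends_on` is assumed to name a single phase.

```python
def launchable_phases(phases: list, job_status: dict) -> list:
    """Return names of phases whose dependency (if any) has every job completed."""
    by_name = {phase["name"]: phase for phase in phases}
    ready = []
    for phase in phases:
        dep = phase.get("depends_on")
        if dep is None or all(
            job_status[job_id] == "completed" for job_id in by_name[dep]["jobs"]
        ):
            ready.append(phase["name"])
    return ready


phases = [
    {"name": "train_teachers", "jobs": ["teacher_N384", "teacher_N512"]},
    {"name": "distill_students", "depends_on": "train_teachers",
     "jobs": ["s42_N384", "s42_N512"]},
]
status = {"teacher_N384": "completed", "teacher_N512": "running",
          "s42_N384": "pending", "s42_N512": "pending"}
print(launchable_phases(phases, status))
```

With one teacher still running, only `train_teachers` is launchable; the student phase opens once both teachers report `completed`.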

OOM Handling

Detect OOM from stdout:

torch\.OutOfMemoryError: CUDA out of memory

On detection:

Mark job failed_oom
Kill screen
Wait oom_retry.delay seconds
Check if current GPU is free; if not, try another free GPU
Requeue as pending
Max oom_retry.max_attempts before marking stuck
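Detection and the bounded retry can be sketched as two small functions. The pattern is the one quoted above; the function names are hypothetical, and the exact counting rule (a job is parked as stuck once its number of OOM failures reaches `max_attempts`) is an assumption about how the limit is applied.

```python
import re

# Pattern as quoted in this section; matched against the per-job log text.
OOM_PATTERN = re.compile(r"torch\.OutOfMemoryError: CUDA out of memory")


def is_oom(log_text: str) -> bool:
    """True if the log contains the CUDA out-of-memory marker."""
    return OOM_PATTERN.search(log_text) is not None


def state_after_oom(oom_failures: int, max_attempts: int = 3) -> str:
    """Decide where a job goes after an OOM; oom_failures includes the current one."""
    return "pending" if oom_failures < max_attempts else "stuck"


log = "Traceback ...\ntorch.OutOfMemoryError: CUDA out of memory. Tried to allocate 2.00 GiB"
print(is_oom(log), state_after_oom(1), state_after_oom(3))
```

Older PyTorch releases report this error under a different exception name, so the pattern may need widening for environments pinned to an earlier version.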

Stale Screen Detection

Every 60s, for each running screen:

Check screen exists (screen -ls)
Check python PID still running (ps -p)
If screen exists but python exited:
- If expected output file exists → mark completed, kill stale screen
- If no output file → mark failed_other, kill screen
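The decision rule above fits in one function. A minimal sketch with a hypothetical `check_job` helper; the case where both the screen and python are already gone is not covered by the list above, so applying the same output-file rule to it is an assumption.

```python
def check_job(screen_alive: bool, python_alive: bool, output_exists: bool) -> tuple:
    """Return (new_status, kill_screen) for a job currently marked running."""
    if python_alive:
        return ("running", False)
    # python has exited: the expected output file decides the outcome
    status = "completed" if output_exists else "failed_other"
    # only a screen that outlived its python process needs killing
    return (status, screen_alive)


print(check_job(screen_alive=True, python_alive=False, output_exists=True))
```

Keying completion on the output file, not on the screen, is what prevents a hung wandb upload from blocking the next wave.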

Resume-on-restart

If scheduler crashes / is killed:

Read queue_state.json
For each running job: check screen; if still alive, keep; if not, re-evaluate state
For each pending: continue normally
Idempotent: safe to restart scheduler without losing state
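Restart safety depends on queue_state.json never being left half-written. One common way to get that is to write a temporary file and rename it into place; whether `queue_manager.py` does this is not stated here, so the sketch below (with hypothetical `save_state` and `load_state` helpers) shows the technique, not the actual implementation.

```python
import json
import os
import tempfile


def save_state(path: str, state: dict) -> None:
    """Write state as JSON so readers see either the old file or the new one."""
    directory = os.path.dirname(os.path.abspath(path))
    fd, tmp_path = tempfile.mkstemp(dir=directory, suffix=".tmp")
    try:
        with os.fdopen(fd, "w") as handle:
            json.dump(state, handle, indent=2)
            handle.flush()
            os.fsync(handle.fileno())  # push the bytes to disk before the rename
        os.replace(tmp_path, path)     # atomic on POSIX within one filesystem
    except BaseException:
        os.unlink(tmp_path)            # do not leave partial temp files behind
        raise


def load_state(path: str) -> dict:
    """Read the last fully written state back after a restart."""
    with open(path) as handle:
        return json.load(handle)
```

A scheduler killed between the write and the rename leaves the previous state intact, so a restart resumes from the last consistent snapshot.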

Output: Summary Report

# Experiment Queue Summary

**Project**: my_grid_experiment
**Started**: 2026-04-16 11:36:29
**Completed**: 2026-04-16 18:02:14
**Total wall-clock**: 6h 25m
**Jobs**: 40 completed, 2 OOM-retried then completed, 0 stuck

## Phases
| Phase | Jobs | Success | OOM retries | Duration |
| --- | --- | --- | --- | --- |
| train_teachers | 2 | 2 | 0 | 58m |
| distill_students | 24 | 24 | 2 | 4h 02m |
| multi_seed_validation | 16 | 16 | 0 | 1h 25m |

## Results Files
- 42 JSON files in `figures/distill_sw_*.json`

## Next Steps
- Run `/analyze-results` on output JSONs
- Figures auto-regen via `artifact-sync` (if configured)

Comparison with `/run-experiment`

| Feature | /run-experiment | experiment-queue |
| --- | --- | --- |
| Single-shot experiment | ✅ | ✅ (overkill) |
| Multi-GPU parallel | Basic | Proper scheduling |
| Wave transitions | Manual | Automatic |
| OOM retry | Manual | Automatic |
| Stale screen cleanup | Manual | Automatic |
| Teacher→student chain | Manual | Built-in |
| State persistence | No | Yes (JSON) |
| Resume on crash | No | Yes |
| Grid expansion | Manual | Declarative |

Rule: Use /run-experiment for ≤5 jobs. Use experiment-queue for ≥10 jobs or anything with phases.

Key Rules

Never overlap screens on the same GPU — always wait for memory.used < 500 MiB before launching new job
Always write state to disk — every state change flushed to queue_state.json
Idempotent scheduler — safe to restart; picks up from state file
Expected-output-based completion — don't trust screen state alone; verify output file exists
Bounded retry — max N OOM retries, then mark stuck and alert
Dependencies enforced at launch — never launch student before teacher checkpoint exists

Known Failure Modes

SSH connection drop during scheduling: scheduler keeps running on remote (nohup), just reconnect and check
GPU reservation by another user: scheduler waits, does not pre-empt
Disk full on remote: scheduler detects write failure, marks all pending stuck, alerts

Example Session

User: "Run all T5+T6 experiments: T5 = N∈{80,192} × 4 values of n × seeds {200,201}; T6 = N∈{384,512} × 4 values of n × seeds {42,200,201}; T6 needs its teachers trained first"

Claude invokes /experiment-queue:

Parses description into 2-phase manifest
Phase 1: T5 (16 jobs, no teacher dependency) + T6 teacher training (2 jobs)
Phase 2: T6 distillation (24 jobs, depends on teachers)
Deploys scheduler via nohup
Reports: "Scheduler PID 93534, total 42 jobs, estimated 6-7h wall-clock"

Then user can check anytime or wait for summary report.

Rationale / Source

Identified via 2026-04-16 post-mortem analysis (Codex GPT-5.5 xhigh) of a 1.5-day multi-seed paper experiment session:

Wall-clock sink: stale screens, OOM, wave transitions, manual parser
Token sink: re-writing orchestration code each session
Cognitive sink: tracking which cells succeeded, which failed, which to retry

This skill targets the wall-clock sink specifically; see artifact-sync and paper-fix-auto-apply for the other two.

SSH job queue for multi-seed/multi-config ML experiments with OOM-aware retry, stale-screen cleanup, and wave-transition race prevention. Use when user says "batch experiments", "队列实验", "run grid", "multi-seed sweep", "auto-chain experiments", or when /run-experiment is insufficient for 10+ jobs that need orchestration.

13,323 stars

testing

Updated Jul 13, 2026

npx skillsauth add wanshuiyin/Auto-claude-code-research-in-sleep experiment-queue

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
| --- | --- | --- | --- |
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Jul 13, 2026, 4:46 AM · 45.2s · 1 file scanned

SKILL.md

name: experiment-queue
description: SSH job queue for multi-seed/multi-config ML experiments with OOM-aware retry, stale-screen cleanup, and wave-transition race prevention. Use when user says "batch experiments", "队列实验", "run grid", "multi-seed sweep", "auto-chain experiments", or when /run-experiment is insufficient for 10+ jobs that need orchestration.
argument-hint: [manifest-or-grid-spec]
allowed-tools: Bash(*), Read, Grep, Glob, Edit, Write, Skill(run-experiment), Skill(monitor-experiment)

Related Skills

wanshuiyin/web-debug-search

development · Verified, Trusted, Community · 13,732 · SKILL.md · Updated Jul 23, 2026

Search GitHub Issues and Discussions for software errors, version compatibility problems, and exact error-string matches. Use for debugging and discovery only; results are not paper-citation evidence.

wanshuiyin/integrity-forensics

testing · Verified, Trusted, Community · 13,401 · SKILL.md · Updated Jul 13, 2026

Run the Anti-Autoresearch integrity-forensics sweep (span-anchored evidence ledger → GPT auditors propose findings → deterministic rules-only adjudicator) against a paper via a SHA-pinned thin launcher, then convert the verdict into a typed policy gate (BLOCK/WARN/NO_NEW_BLOCKER) and an append-only obligations ledger. Use when user says "integrity forensics", "forensic audit this paper", "投稿前自查诚信", "审这篇论文的诚信", or says "anti-autoresearch" when the upstream repo's own skills are not installed. Also invoked by /paper-writing (submission self-forensics, default ON), /peer-review (forensic appendix), /resubmit-pipeline.

wanshuiyin/meta-apply

testing · Verified, Trusted, Community · 13,401 · SKILL.md · Updated May 31, 2026

Privileged applier that LANDS meta-optimize / corpus-audit patches the user approved. It is the ONLY skill permitted to mutate the skill corpus from a self-modification proposal, with cross-model jury and human approval at landing. Use when the user says "meta apply", "/meta-apply", "land the staged patches", "应用优化", after a /meta-optimize run.

Install manually

# Clone the repo
git clone https://github.com/wanshuiyin/Auto-claude-code-research-in-sleep.git

# Copy into Claude Code skills folder (global); create the folder first in case it does not exist yet
mkdir -p ~/.claude/skills
cp -r Auto-claude-code-research-in-sleep/skills/skills-codex/experiment-queue ~/.claude/skills/

Repository

wanshuiyin/Auto-claude-code-research-in-sleep

13,323 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
