skills/run-experiment/SKILL.md
Use when launching or preparing a new ML experiment job — local, SLURM, or RunAI. Not for checking existing job status (use run-status-monitor). Not for NaN/OOM/crash debugging (use experiment-debugger). Not for computing costs before deciding to run (use compute-budget-planner).
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf): `npx skillsauth add a-green-hand-jack/ml-research-skills run-experiment`
- Run `git rev-parse --git-common-dir` and `git rev-parse --show-toplevel`. If the common directory is not the `.git` directory directly under the top level, you are inside a worktree; adjust write paths accordingly.
- If `memory/BRIEFING.md` exists in the project root, read:
  - `memory/BRIEFING.md`: compact project state snapshot
  - `memory/project-conventions.md`: active conventions, wrappers, and forbidden commands for this project
- In a worktree, read `.agent/worktree-status.md` before writing anything. Write in-progress results there; do not write to root `memory/` until a result is confirmed.

Skipping this step risks violating project-specific rules (compute wrappers, environment policies, scope constraints) that override the defaults below.
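The worktree check can be sketched as a small shell function. This is a minimal sketch, assuming git is installed and the checkout is not a submodule; the function name `in_linked_worktree` is hypothetical and not part of the skill.

```shell
# Exit status 0 inside a linked worktree, 1 in the main checkout,
# 2 when git is missing or the directory is not a work tree.
in_linked_worktree() {
  top=$(git rev-parse --show-toplevel 2>/dev/null) || return 2
  common=$(git rev-parse --git-common-dir 2>/dev/null) || return 2
  # Resolve both to physical absolute paths so relative output
  # (such as ".git") and symlinks do not confuse the comparison.
  top=$(cd "$top" && pwd -P) || return 2
  common=$(cd "$common" && pwd -P) || return 2
  # In the main checkout the common dir is <toplevel>/.git;
  # in a linked worktree it lives under the main checkout instead.
  [ "$common" != "$top/.git" ]
}
```

A caller could branch on it with `if in_linked_worktree; then ...; fi` before deciding where to write in-progress notes.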
Submit an ML experiment to a compute environment — local machine, SLURM HPC (Ibex, UW, etc.), or RunAI/Kubernetes (EPFL).
Generates a reproducible job script in jobs/ that is committed alongside the code, then provides the exact submit command to run.
Pair this skill with research-project-memory when a launched job should be linked to a planned experiment, evidence item, worktree, or project action.
Pair this skill with project toolchain gates when generated job scripts should be checked before launch. Use shellcheck and shfmt when available, but do not require the user to install them just to generate a draft script.
Pair this skill with remote-project-control and run-status-monitor when server resources, queues, or pending jobs should influence the launch choice.
Terminology:
- local: the machine where the agent is running, usually the user's Mac
- Git remote: GitHub/GitLab remote used for code sync, such as `origin`
- server: SSH/HPC/RunAI execution environment such as `quest`, `ibex-vscode`, or `epfl-haas`

When launching on SLURM or RunAI, call it a server run rather than a remote run unless referring to the Git remote.
- … `ContainerCreating`.
- Use `run-status-monitor` or a project wrapper to check actual resource occupancy. If a job is running but underutilizes allocated GPUs, record that as an operational feedback item and update project memory/status with the next launch policy.

<installed-skill-dir>/
├── SKILL.md
├── environments.yaml # Environment profiles (extend for new clusters)
└── templates/
    ├── slurm_job.sh # SLURM template (Ibex, UW, any SLURM cluster)
    ├── runai_job.sh # RunAI/Kubernetes template (EPFL)
    └── local_run.sh # Local tmux/nohup template
Resolve <installed-skill-dir> as the directory containing this SKILL.md, then read <installed-skill-dir>/environments.yaml.
List the available environments to the user with a one-line description each.
Ask the user in a single message:
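- Command to run (e.g., `uv run python train.py --lr 1e-3`)
- Job name (e.g., `baseline-cifar10`, `ablation-no-attn`). Default: script basename + date.
- Conda environment name or `.venv` path (if applicable)
- Output directory (default: `outputs/<job-name>/`)

If `--env`, `--script`, `--name`, or `--gpus` were passed as arguments, pre-fill those answers.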
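The default job name (script basename plus date) could be derived as below. This is a sketch only: the helper name `default_job_name`, the `-` separator, and the `YYYY-MM-DD` date format are assumptions, since the skill does not spell out the exact format.

```shell
# Build a default job name from a script path, e.g. src/train.py -> train-<today>.
default_job_name() {
  base=$(basename "$1")
  base=${base%.*}                      # drop the last extension, if any
  printf '%s-%s\n' "$base" "$(date +%Y-%m-%d)"
}
```

For example, `default_job_name src/train.py` prints `train-` followed by today's date.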
Before generating a job that uses uv or UV_PROJECT_ENVIRONMENT, choose the environment strategy deliberately:
- Reuse an existing `.venv`, shared RunAI env, or stage-level env when dependencies are unchanged.
- Avoid `uv sync` races.
- Skip `uv sync` when the chosen env is already known to be current; use `uv run --frozen` or the project's documented no-sync command when available.

Before generating or submitting a server job, classify the task:
- `smoke`: codepath validation, 1-4 step run, shape check, environment check, or tiny subset
- `debug`: interactive diagnosis, short profiler run, OOM reproduction, or log inspection
- `formal`: paper-facing training/eval, benchmark, ablation, seed sweep, or final metric run

Then choose resources by matching need to availability:
- Check current availability with `run-status-monitor`.
- Use available resources for `smoke` and `debug` jobs, even if they are slower, but only after checking image availability, CUDA/software compatibility, and expected startup overhead.
- Keep resources fixed for `formal` jobs; do not change GPU class, precision assumptions, batch size, or distributed setup without recording the reason.
- If a job stays in `ContainerCreating`, `ImagePullBackOff`, or an image-pull phase beyond the short smoke budget, classify it as node/image startup overhead rather than code failure. For smoke/debug jobs, prefer deleting or abandoning that attempt and rerouting to a compatible pool or node family with a warmer image/cache history.

Before choosing GPU count or node shape, identify the workload shape:
- `single-device`: one process uses one GPU, and more GPUs will not help without code changes
- `multi-gpu-native`: the command uses DDP, FSDP, tensor parallelism, data parallelism, or a framework launcher that binds all requested GPUs
- `independent-targets`: targets, seeds, prompts, molecules, structures, folds, eval shards, or checkpoints can run independently
- `pipeline`: stages have different resource needs and should not necessarily share one allocation shape

Then align the launch shape:
- For `single-device`, request one GPU unless using a node-local worker pool for multiple independent commands.
- For `multi-gpu-native`, verify the launcher and binding mechanism, such as `torchrun --nproc_per_node`, `accelerate launch`, `srun`, `CUDA_VISIBLE_DEVICES`, or project-specific GPU assignment.
- For `independent-targets`, prefer a scheduler array or per-GPU worker pool when target count and runtime justify parallelism. Ensure each worker has isolated output paths or lock-safe resume behavior.

The generated run notes or job script comments should state:
Detect the project root, falling back to the current directory outside a git repository:
git rev-parse --show-toplevel 2>/dev/null || pwd
Also capture the short git commit hash:
git rev-parse --short HEAD 2>/dev/null || echo "no-git"
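The two commands can be captured into variables for later placeholder filling. A minimal sketch: the variable names mirror the `{PROJECT_ROOT}`, `{COMMIT}`, and `{PROJECT}` placeholders but are otherwise an assumption about how a wrapper might hold them.

```shell
# Project root: git top level, or the current directory outside a repository.
PROJECT_ROOT=$(git rev-parse --show-toplevel 2>/dev/null || pwd)
# Short commit hash, or the literal "no-git" when no commit can be resolved.
GIT_COMMIT=$(git rev-parse --short HEAD 2>/dev/null || echo "no-git")
# Project directory name, as used for the {PROJECT} placeholder.
PROJECT=$(basename "$PROJECT_ROOT")

echo "project=$PROJECT root=$PROJECT_ROOT commit=$GIT_COMMIT"
```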
Based on the environment type:
Read the SLURM template from <installed-skill-dir>/templates/slurm_job.sh.
Fill in all {PLACEHOLDER} variables:
| Placeholder | Value |
|---|---|
| {PROJECT} | project directory name |
| {ENV_NAME} | environment key (e.g., ibex) |
| {ENV_DISPLAY} | display name from profile |
| {DATE} | today's date YYYY-MM-DD |
| {COMMIT} | short git SHA |
| {JOB_NAME} | user-provided job name |
| {SCRIPT_NAME} | filename of the generated script |
| {PARTITION} | from env profile defaults (or user override) |
| {CPUS} | cpus_per_task from profile (or user override) |
| {GPUS} | user-provided GPU count |
| {MEM} | from profile defaults (or user override) |
| {WALLTIME} | user-provided or profile default |
| {LOG_DIR} | outputs/logs/<job-name> |
| {OUTPUT_DIR} | outputs/<job-name> |
| {PROJECT_ROOT} | absolute project root path |
| {CONDA_ENV} | user-provided env name |
| {RUN_COMMAND} | user-provided command |
| {SCRATCH} | scratch path from env profile |
Uncomment the relevant module load lines based on the env profile's common_modules.
Uncomment the `conda activate` or `source .venv/bin/activate` line based on the user's answer.
If scratch path is in the env profile, uncomment the TMPDIR block.
Output path: jobs/<job-name>.sh
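Filling `{PLACEHOLDER}` variables can be done with plain `sed`. The sketch below assumes single-line values; the helper names `fill_placeholder` and `unfilled_placeholders` are hypothetical and not part of the skill's templates.

```shell
# Replace every {KEY} on stdin with VALUE and write the result to stdout.
fill_placeholder() {
  key=$1
  # Escape backslash, ampersand, and the | delimiter so the value is
  # inserted literally by the sed replacement below.
  value=$(printf '%s' "$2" | sed -e 's/[\\&|]/\\&/g')
  sed -e "s|{$key}|$value|g"
}

# List {UPPER_CASE} tokens still present in a file. This may also flag
# ${VAR}-style shell expansions, so treat the output as a hint to review.
unfilled_placeholders() {
  grep -o '{[A-Z_][A-Z_]*}' "$1" | sort -u
}
```

Calls can be chained, for example `fill_placeholder JOB_NAME "$job" < template.sh | fill_placeholder GPUS 2 > jobs/"$job".sh`, followed by `unfilled_placeholders jobs/"$job".sh` to catch anything that was missed.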
Read the RunAI template from <installed-skill-dir>/templates/runai_job.sh.
Fill in placeholders:
| Placeholder | Value |
|---|---|
| {PROJECT} | project directory name |
| {DATE} | today's date |
| {COMMIT} | short git SHA |
| {JOB_NAME} | user-provided job name |
| {SCRIPT_NAME} | filename of generated script |
| {IMAGE} | from env profile default_image (ask user to confirm) |
| {RUNAI_PROJECT} | from env profile project |
| {GPUS} | GPU count |
| {CPUS} | CPU count from profile defaults |
| {MEM} | memory from profile defaults |
| {PVC_FLAGS} | generated from pvc_mounts in profile: --pvc claim:path \ per mount |
| {RUN_COMMAND} | user-provided command |
Output path: jobs/<job-name>-runai.sh
For RunAI/uv jobs, include existing UV_PROJECT_ENVIRONMENT, UV_PYTHON_INSTALL_DIR, and cache settings only when they are part of the project policy or user-provided command. Do not invent a job-specific UV_PROJECT_ENVIRONMENT from the job name. If the command needs a new env, explain the reason in the script comments or run pointer.
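Generating `{PVC_FLAGS}` from the profile's `pvc_mounts` amounts to emitting one `--pvc claim:path \` line per mount. A minimal sketch, assuming the mounts have already been read from `environments.yaml` into `claim:path` strings; the function name and the example claims are hypothetical.

```shell
# Print one "--pvc claim:path \" line per argument.
pvc_flags() {
  for mount in "$@"; do
    # The flag is passed as an argument so printf does not parse it as an option.
    printf '%s %s \\\n' '--pvc' "$mount"
  done
}

# Example with made-up claims:
#   pvc_flags home-claim:/home scratch-claim:/scratch
```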
Read the local template from <installed-skill-dir>/templates/local_run.sh.
Fill in placeholders similarly. Uncomment conda/venv activation as appropriate.
Output path: jobs/<job-name>-local.sh
If the user specifies an environment not in environments.yaml:
- Ask: "Add this environment to `environments.yaml` for future use?"

Create the job script directory, log directory, and output directory before previewing or submitting:
mkdir -p <project-root>/jobs
mkdir -p <project-root>/outputs/logs/<job-name>
mkdir -p <output-dir>
Write the filled-in script to jobs/<job-name>.sh (or -runai.sh / -local.sh).
Show the user the full generated script for review.
After writing the job script and before offering to launch it, run non-mutating shell gates when the tools are available:
command -v shellcheck >/dev/null && shellcheck jobs/<job-name>.sh || true
command -v shfmt >/dev/null && shfmt -d jobs/<job-name>.sh || true
bash -n jobs/<job-name>.sh
Report whether each gate passed, failed, or was skipped because the tool is not installed. Treat bash -n failures as blockers before launch. Treat shellcheck and shfmt -d failures as warnings unless the project policy marks them required.
Do not run shfmt -w silently. If formatting is requested or required by policy, run shfmt -w jobs/<job-name>.sh and review the diff before submitting.
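Because the one-line gates exit 0 whether a tool passed, failed, or was missing, reporting the three outcomes needs a little more logic. The wrapper below is one possible sketch; `run_gate` and `check_job_script` are hypothetical names, and the warning-versus-blocker split follows the policy described above.

```shell
# Run one gate and report passed / failed / skipped.
# $1 is a label; the remaining arguments are the command to run.
run_gate() {
  label=$1
  shift
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "$label: skipped (not installed)"
    return 0
  fi
  if "$@"; then
    echo "$label: passed"
  else
    echo "$label: failed"
    return 1
  fi
}

# Run all gates on a job script. Only a bash -n failure yields a
# non-zero status; shellcheck and shfmt failures are reported as warnings.
check_job_script() {
  script=$1
  run_gate shellcheck shellcheck "$script" || true
  run_gate shfmt shfmt -d "$script" || true
  run_gate "bash -n" bash -n "$script"
}
```

Usage: `check_job_script jobs/<job-name>.sh || echo "fix syntax errors before launching"`.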
Print the exact command(s) to submit, tailored to the environment:
SLURM:
# If you're already on the login node:
sbatch jobs/<job-name>.sh
# If submitting from your local machine to a server (requires ssh access):
scp jobs/<job-name>.sh <ssh-alias>:<project-root>/jobs/
ssh <ssh-alias> "cd <project-root> && mkdir -p outputs/logs/<job-name> <output-dir> jobs && sbatch jobs/<job-name>.sh"
# Monitor:
squeue -u $USER
sacct -j <jobid> --format=JobID,State,Elapsed,AllocTRES
tail -f outputs/logs/<job-name>/slurm-<jobid>.out
RunAI:
bash jobs/<job-name>-runai.sh
# Monitor:
runai list
runai logs <job-name> -f
Local:
# Attached (output in terminal):
bash jobs/<job-name>-local.sh
# Detached in tmux:
tmux new-session -d -s <job-name> "bash jobs/<job-name>-local.sh"
tmux attach -t <job-name>
# Background with nohup:
nohup bash jobs/<job-name>-local.sh &
Ask: "Want me to run the submit command now?"
- If yes on a server: run the `scp` + `ssh sbatch` command (requires SSH key auth to be set up).
- If no: the script is saved in `jobs/` and ready to submit.

If a `jobs/README.md` or `jobs/index.md` exists, offer to append a one-line entry:
| {DATE} | {JOB_NAME} | {ENV_NAME} | {COMMIT} | {RUN_COMMAND_BRIEF} |
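Appending that row could look like the sketch below. The helper name `append_index_row` and the 60-character cut used for the brief command are assumptions; a command containing `|` would also need escaping to keep the table intact.

```shell
# Append "| date | job | env | commit | command |" to an existing index file.
# Arguments: index-file date job-name env-name commit run-command
append_index_row() {
  index=$1
  [ -f "$index" ] || return 0          # only append when the index already exists
  brief=$(printf '%s' "$6" | cut -c1-60)
  printf '| %s | %s | %s | %s | %s |\n' "$2" "$3" "$4" "$5" "$brief" >> "$index"
}
```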
If the repo follows the code evidence layout from init-python-project, also offer to create or update a short run pointer under:
docs/runs/<DATE>-<job-name>.md
This file should contain the command, config, commit, output path, expected metric, and monitor command. It should not contain raw logs.
If the repo has memory/ or a worktree .agent/worktree-status.md, update only verified run pointers:
- `memory/evidence-board.md`: add or update the linked `EXP-###` with job script path, commit, command, output directory, and status `planned`, `submitted`, or `running` only if verified
- `memory/provenance-board.md`: add only `planned` or `available` provenance for run pointers; do not mark final metrics as verified until outputs are checked
- `docs/runs/`: write a small run pointer when the code repo uses that convention
- `memory/action-board.md`: mark the launch action as `doing` or create a monitor action
- `memory/handoff-board.md`: create a monitor/fetch/report handoff only when another module is expected to consume the run output
- `memory/current-status.md`: record the latest known job and what must be checked next
- `<worktree>/.agent/worktree-status.md`: link the run to the worktree purpose and exit condition
- `memory/project-conventions.md`: if this session established a new stable compute convention (resource pool, partition name, job script location, smoke-test size limit, environment reuse policy), add it under the `compute` category; if a convention is now obsolete (server decommissioned, pool renamed, policy changed), expire the row

Do not store queue state, job success, or final metric values as durable facts unless they were verified in this session. Use `needs-verification` for monitor tasks.
All environments are defined in environments.yaml. The current known environments:
| Key | Type | Cluster | Notes |
|---|---|---|---|
| local | local | — | Current machine, tmux/nohup |
| ibex | slurm | KAUST Ibex | ilogin.ibex.kaust.edu.sa; gpu/batch/himem partitions |
| uw | slurm | UW HPC | Placeholder — update environments.yaml with actual details |
| runai | runai | EPFL RunAI | Kubernetes; update project/image in environments.yaml |
Edit <installed-skill-dir>/environments.yaml and add a block:
my-cluster:
  type: slurm # or runai / local
  display_name: "My University HPC"
  login_node: "login.cluster.edu"
  ssh_alias: mycluster
  scheduler: slurm
  partitions:
    gpu:
      name: gpu
      flag: "--partition=gpu"
      gpu_flag: "--gres=gpu:{count}"
      max_gpus_per_job: 4
  defaults:
    partition: gpu
    gpus: 1
    cpus_per_task: 4
    mem: "32G"
    walltime: "12:00:00"
    max_walltime: "48:00:00"
  storage:
    home: "/home/{user}"
    scratch: "/scratch/{user}"
  module_system: lmod
  common_modules:
    - "cuda/12.1"
    - "python/3.11"
  notes: "..."
Every generated job script includes:
- The git commit hash (`GIT_COMMIT`)
- `outputs/<job-name>/` for checkpoints, `outputs/logs/<job-name>/` for logs

The `jobs/` directory should be committed to git (the scripts are small text files). Actual outputs go to `outputs/`, which is typically listed in `.gitignore`.
/run-experiment # interactive wizard
/run-experiment --env ibex --script train.py --gpus 2
/run-experiment --env local --script eval.py --name eval-baseline
/run-experiment --env runai --gpus 4 --name big-run
/run-experiment --env ibex --script sweep.py --name sweep --gpus 1
When the user says "I want to sweep over N configs":
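- Identify the sweep config (e.g., `configs/sweep.yaml` with N entries).
- Add `#SBATCH --array=0-{N-1}%{max_concurrent}` to the script.
- Pass the task index to the command: `--config configs/sweep.yaml --config-idx $SLURM_ARRAY_TASK_ID`
- Write each task's outputs to `outputs/<job-name>/$SLURM_ARRAY_TASK_ID/`

When GPUs > 1 and the env is SLURM: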
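- Add the `--ntasks-per-node={GPUS}` directive
- Launch with `torchrun --nproc_per_node={GPUS}` or `srun python -m torch.distributed.launch` (recent PyTorch releases deprecate `torch.distributed.launch` in favor of `torchrun`)

When the user wants to debug interactively (not submit a batch job):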
Ibex:
srun --partition=gpu --gres=gpu:1 --cpus-per-task=4 --mem=32G --time=2:00:00 --pty bash
RunAI:
runai submit <name> --image <image> --gpu 1 --interactive --stdin -- bash
runai bash <name>
Generate this command directly without creating a script file.
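For the config sweep case described earlier, the array directive can be computed from the number of configs. This is a sketch under the stated `0-{N-1}%{max_concurrent}` convention; the function name `sbatch_array_directive` is hypothetical.

```shell
# Print the SLURM array directive for N configs with a concurrency cap.
# Arguments: N max_concurrent
sbatch_array_directive() {
  n=$1
  max_concurrent=$2
  [ "$n" -ge 1 ] || return 1           # need at least one config
  # %% in the format string prints a literal percent sign.
  printf '#SBATCH --array=0-%d%%%d\n' "$((n - 1))" "$max_concurrent"
}
```

With 12 configs and at most 4 running at once, `sbatch_array_directive 12 4` prints `#SBATCH --array=0-11%4`.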