marlowe-slurm-operator/SKILL.md
Operates Stanford's Marlowe HPC cluster with Slurm. Use when the task involves Marlowe-specific partitions, account suffixes, GPU-hour tracking, job submission, queue diagnosis, or cancellation, always verifying live cluster state first.
This skill adapts the generic Slurm workflow to Stanford's Marlowe Computing Platform. Use Marlowe-specific documented defaults as hints, but verify current cluster state with live commands before acting.
- references/marlowe-reference.md for Stanford-specific details, policies, and examples.
- templates/preempt-interactive.sh for an interactive preempt shell.
- templates/batch-gpu.sbatch for a medium-project GPU batch job.
- templates/hero-gpu.sbatch for a large-project GPU batch job.

Account forms:
- marlowe-<project-id> (basic access)
- marlowe-<project-id>-pmXX (medium project)
- marlowe-<project-id>-plXX (large project)

- For preempt with a medium or large suffix account, assume it consumes GPU hours unless current site docs say otherwise.
- Check sinfo and scontrol show partition before submission.
- Load the modules slurm, nvhpc, and cudnn/cuda12/9.3.0.75.gcc/64, with cudnn after nvhpc.
- Use srun, salloc, or sbatch for GPU work.
- Use sbatch --test-only or srun --test-only when the user wants validation rather than immediate execution.

These are documented defaults, not permanent guarantees:
- preempt: basic access, up to 8 nodes, up to 12 hours, preemptible with about 15 minutes' notice
- batch: medium projects, up to 16 nodes, up to 2 days, requires a medium suffix account
- hero: large projects, up to 25 nodes, up to 24 hours, requires a large suffix account

Verify these with:
scontrol version
scontrol show config
sinfo --summarize
sinfo --Node --long
sinfo -o "%#P %.5a %.10l %.10L %.6D %G"
scontrol show partition
Use scontrol show node <node_name> before making assumptions about per-node GPUs, memory, features, or constraints.
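The documented defaults can also be encoded as a quick pre-flight check. The sketch below is illustrative only: the function name marlowe_check_request is hypothetical, it accepts whole hours only, and the limits are the documented defaults quoted in this skill, which may be out of date. Treat a pass as a hint and still confirm with scontrol show partition.

```shell
#!/usr/bin/env bash
# Hypothetical helper: compare a request against the documented Marlowe
# defaults quoted in this skill. These numbers are hints, not live limits.
# Usage: marlowe_check_request <partition> <nodes> <hours>
marlowe_check_request() {
  local partition="$1" nodes="$2" hours="$3"
  local max_nodes max_hours
  case "$partition" in
    preempt) max_nodes=8;  max_hours=12 ;;
    batch)   max_nodes=16; max_hours=48 ;;  # documented as "up to 2 days"
    hero)    max_nodes=25; max_hours=24 ;;
    *) echo "unknown partition: $partition" >&2; return 2 ;;
  esac
  if [ "$nodes" -gt "$max_nodes" ] || [ "$hours" -gt "$max_hours" ]; then
    echo "exceeds documented default for $partition (${max_nodes} nodes, ${max_hours} h); verify live limits"
    return 1
  fi
  echo "within documented default for $partition; still verify with scontrol show partition"
}
```

A non-zero return is a prompt to recheck the request against live output, not proof that the scheduler will reject it.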
Before submitting, confirm which account form applies:
sacctmgr show assoc where user=$USER format=cluster,account,user,partition,defaultqos,qos,maxjobs,maxsubmitjobs,maxtresperjob,maxtrespernode,maxwall
Use the result plus current site docs to decide:
- preempt without GPU-hour charging intent: prefer the basic marlowe-<project-id> account
- batch: require the correct -pmXX suffix
- hero: require the correct -plXX suffix
- preempt with a suffix account: warn that GPU hours may be charged

If the user forgets a valid account, expect an error similar to "ACCOUNT ERROR: Did you remember to set your account?".
If sacctmgr is restricted, say so clearly and fall back to verified site documentation plus known working account examples.
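The account rules above can be sketched as a small helper. This is a minimal illustration, not part of the skill's templates: the function name marlowe_account is hypothetical, and it only checks the shape of the suffix (pm... or pl...), not whether the association actually exists in sacctmgr.

```shell
#!/usr/bin/env bash
# Hypothetical helper: build the account string for a partition.
# Usage: marlowe_account <partition> <project-id> [suffix]
#   suffix is the full suffix, such as pmXX or plXX, as listed by sacctmgr.
marlowe_account() {
  local partition="$1" project="$2" suffix="${3:-}"
  case "$partition" in
    preempt)
      if [ -n "$suffix" ]; then
        # A suffix account on preempt may be charged GPU hours.
        echo "warning: preempt with a suffix account may be charged GPU hours" >&2
        echo "marlowe-${project}-${suffix}"
      else
        echo "marlowe-${project}"
      fi ;;
    batch)
      case "$suffix" in
        pm?*) echo "marlowe-${project}-${suffix}" ;;
        *) echo "error: batch requires a -pmXX suffix account" >&2; return 1 ;;
      esac ;;
    hero)
      case "$suffix" in
        pl?*) echo "marlowe-${project}-${suffix}" ;;
        *) echo "error: hero requires a -plXX suffix account" >&2; return 1 ;;
      esac ;;
    *) echo "error: unknown partition: $partition" >&2; return 2 ;;
  esac
}
```

The warning goes to stderr so the account string on stdout can be passed straight to -A.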
For an interactive shell, for example on preempt, use a command like:
srun -N <nodes> -G <gpus> -A marlowe-<project-id>[-<suffix>] -p <partition> --pty bash
Before running it:
- check whether the preempt job will consume GPU hours

To obtain an allocation first, use salloc:
salloc -N <nodes> -A marlowe-<project-id>[-<suffix>] -p <partition> -t <time>
Use sbatch with explicit directives for account, partition, nodes, GPUs, time, and output paths.
Minimum checklist:
sbatch --test-only job.sh
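As an illustration of explicit directives, the sketch below prints a batch script in which account, partition, nodes, GPUs, time, and output paths are all stated. The function name make_job_script and the output file names are assumptions for this example, not taken from the bundled templates, and every argument value is a placeholder supplied by the caller.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print a batch script with every resource stated
# explicitly. Nothing is submitted; the script is written to stdout.
# Usage: make_job_script <account> <partition> <nodes> <gpus> <time> <command>
make_job_script() {
  local account="$1" partition="$2" nodes="$3" gpus="$4" time="$5" cmd="$6"
  cat <<EOF
#!/bin/bash
#SBATCH --account=${account}
#SBATCH --partition=${partition}
#SBATCH --nodes=${nodes}
#SBATCH --gpus=${gpus}
#SBATCH --time=${time}
#SBATCH --output=slurm-%j.out
#SBATCH --error=slurm-%j.err

set -euo pipefail
srun ${cmd}
EOF
}

# Example (placeholders must be replaced with verified values):
#   make_job_script "marlowe-<project-id>-pmXX" batch 1 8 02:00:00 "nvidia-smi" > job.sh
#   sbatch --test-only job.sh
```

Generating the script and validating it with --test-only keeps a wrong account or partition from reaching the queue.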
Use standard Slurm commands for job state:
squeue -j <job_id> --start
scontrol show jobid=<job_id>
sstat -j <job_id>
sacct -j <job_id>
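Which of these commands is most informative depends on the job's state. The mapping below is a suggestion rather than something the Marlowe documentation prescribes: squeue --start is mainly useful for pending jobs, sstat reports on running jobs, and sacct covers jobs that have ended. The function name suggest_job_command is hypothetical.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the command from the list above that is most
# likely to be informative for a given Slurm job state.
# Usage: suggest_job_command <job_id> <state>
suggest_job_command() {
  local job_id="$1" state="$2"
  case "$state" in
    PENDING)
      echo "squeue -j ${job_id} --start" ;;
    RUNNING)
      echo "sstat -j ${job_id}" ;;
    COMPLETED|FAILED|TIMEOUT|NODE_FAIL|PREEMPTED|OUT_OF_MEMORY|CANCELLED*)
      # sacct can report states such as "CANCELLED by <uid>", hence the glob.
      echo "sacct -j ${job_id}" ;;
    *)
      # Anything else: fall back to the full job record.
      echo "scontrol show jobid=${job_id}" ;;
  esac
}
```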
For medium and large projects, track GPU-hour usage with sreport:
sreport cluster UserUtilizationByAccount -T gres/gpu Start=<YYYY-MM-DD> End=<YYYY-MM-DD> account=marlowe-<project-id>-<suffix> -t hours
Always verify the suffix in the usage query. If the task involves batch, hero, or suffix-coded preempt jobs, mention whether GPU hours are expected to be consumed.
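To make the suffix check harder to forget, the usage query can be built by a helper that refuses a basic account. This is a sketch under assumptions: the function name gpu_hours_cmd is hypothetical, the default date range (first of the current month through today) is a choice made for this example, and the helper only prints the command rather than running it.

```shell
#!/usr/bin/env bash
# Hypothetical helper: print the sreport GPU-hour query for a suffix account.
# Usage: gpu_hours_cmd <account> [start YYYY-MM-DD] [end YYYY-MM-DD]
gpu_hours_cmd() {
  local account="$1"
  local start="${2:-$(date +%Y-%m-01)}"   # assumed default: first of this month
  local end="${3:-$(date +%Y-%m-%d)}"     # assumed default: today
  case "$account" in
    marlowe-*-pm?*|marlowe-*-pl?*) ;;      # medium or large suffix present
    *)
      echo "error: expected a suffix account (-pmXX or -plXX): $account" >&2
      return 1 ;;
  esac
  echo "sreport cluster UserUtilizationByAccount -T gres/gpu Start=${start} End=${end} account=${account} -t hours"
}
```

Printing the command first lets the user confirm the account and date range before running it.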
If sacctmgr, sshare, or sreport output is restricted:
- fall back to sinfo, scontrol, squeue, and documented Marlowe conventions

When helping on Marlowe:
- prefer unknown over a guess