slurm-hpc-operator/SKILL.md
Inspects and operates Slurm-managed HPC clusters safely and portably. Use when the task involves discovering partitions, accounts, QoS, hardware, GRES, memory or time limits, validating or submitting jobs, monitoring queue state, diagnosing pending jobs, reading accounting data, or canceling jobs without assuming site-specific defaults.
npx skillsauth add tianyudu/slurm-hpc-agent-skill slurm-hpc-operator
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Treat every cluster as site-specific until verified. Never invent partition names, account names, QoS names, GPU types, memory defaults, submission caps, or time limits.
- Read references/command-map.md for quick command selection.
- Read references/pending-reasons.md when a job is pending and the reason needs interpretation.
- Use references/site-customization.md to record local cluster conventions.
- Use templates/cpu-job.sbatch, templates/gpu-job.sbatch, and templates/array-job.sbatch as starting points after discovery.

- Do not run scontrol update or sacctmgr add/modify/delete unless the user explicitly asks and clearly has admin authority.
- sbatch submits a batch script.
- salloc obtains an interactive allocation.
- srun launches tasks, including job steps inside an allocation.
- Use --test-only when the user wants validation rather than immediate execution.
- For completed jobs, prefer sacct. For running jobs, prefer squeue, scontrol show job, and sstat.

Run these before proposing site-specific sbatch, salloc, or srun commands unless the user already provided verified local details.
scontrol version
scontrol show config
Use this to identify the installed Slurm version and inspect high-level site configuration.
sinfo --summarize
sinfo --Node --long
sinfo -o "%#P %.5a %.10l %.10L %.6D %G"
scontrol show partition
Look for:
- the default partition, which sinfo marks with a trailing *
- DefaultTime, MaxTime, DefMemPerNode, MaxMemPerNode, OverSubscribe, and TRES

If the first view looks incomplete, also try:
sinfo --all
scontrol show node <node_name>
Use this before choosing constraints, features, specific GPU types, or node-local assumptions.
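As a rough illustration of checking GPU types before requesting one, the sketch below lists the distinct GPU type names found in sinfo GRES output. The helper name gpu_types is hypothetical, and it assumes GRES entries look like gpu:<type>:<count>, optionally followed by a parenthesized annotation. Sites that format GRES differently will need a different parser.

```shell
# List distinct GPU type names from "partition gres" lines on stdin,
# for example the output of: sinfo -h -o "%P %G"
# Assumption: entries look like gpu:<type>:<count>; gpu:<count> is reported as "untyped".
gpu_types() {
  awk '{
    g = $2
    gsub(/\([^)]*\)/, "", g)      # drop annotations such as (S:0-1)
    n = split(g, entries, ",")
    for (i = 1; i <= n; i++) {
      m = split(entries[i], f, ":")
      if (f[1] != "gpu") continue
      if (m >= 3) print f[2]; else print "untyped"
    }
  }' | sort -u
}
```

On a live cluster this would be used as `sinfo -h -o "%P %G" | gpu_types`. Treat the result as a starting point and confirm it with scontrol show node <node_name>.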
Start with:
sacctmgr show tres
sacctmgr show assoc where user=$USER format=cluster,account,user,partition,defaultqos,qos,maxjobs,maxsubmitjobs,maxtresperjob,maxtrespernode,maxwall
sacctmgr show qos format=name,grpjobs,grpsubmitjobs,grptres,grpwall,flags
sshare -l
Use these to learn:
- per-association job limits such as MaxJobs and MaxSubmitJobs
- whether GPUs are tracked as a TRES such as gres/gpu
- QoS-level group limits such as GrpJobs, GrpSubmitJobs, GrpTRES, and GrpWall

If these commands fail, state that accounting visibility is restricted or unavailable. Then fall back to sinfo, scontrol, local site docs, and previously working job scripts.
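The fallback behavior can be sketched as a small wrapper that reports restricted visibility instead of guessing. The function name accounting_or_note is hypothetical. It relies only on the command's exit status, so a site where sacctmgr exits successfully with empty output would need an extra check.

```shell
# Run an accounting command; if it is missing or fails, say so instead of guessing values.
# Usage: accounting_or_note sacctmgr show qos format=name,grpjobs,flags
accounting_or_note() {
  if ! command -v "$1" >/dev/null 2>&1; then
    echo "NOTE: $1 is not available here; accounting visibility is unavailable." >&2
    return 1
  fi
  if ! "$@"; then
    echo "NOTE: '$*' failed; accounting visibility may be restricted." >&2
    return 1
  fi
}
```

When the wrapper returns nonzero, continue with sinfo and scontrol rather than inventing limits.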
Inspect MaxArraySize in:
scontrol show config
Do not assume job arrays are large enough for the requested fan-out.
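A minimal sketch of that check follows. It assumes scontrol show config prints a line of the form "MaxArraySize = <n>" and that the highest usable array index is one less than MaxArraySize, as the slurm.conf documentation describes. The helper names max_array_size and array_fits are hypothetical.

```shell
# Read `scontrol show config` output on stdin and print the MaxArraySize value.
max_array_size() {
  awk -F'=' '$1 ~ /^MaxArraySize[ \t]*$/ { gsub(/[ \t]/, "", $2); print $2 }'
}

# Succeed only if a 0-based array of <task_count> tasks fits under <max_array_size>.
# The highest index is task_count - 1, which must stay below MaxArraySize.
array_fits() {  # usage: array_fits <task_count> <max_array_size>
  [ "$1" -ge 1 ] && [ "$1" -le "$2" ]
}
```

For example, `array_fits 5000 "$(scontrol show config | max_array_size)"` fails on a site that keeps a limit of 1001, which signals that the fan-out should be split into several arrays or batched inside each task.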
squeue -u $USER
squeue --start -u $USER
Use --start only as an estimate. It depends on site scheduling configuration.
sinfo --summarize
scontrol show partition
sinfo --Node --long
scontrol show node <node_name>
sinfo -o "%#P %.6D %G"
scontrol show node <node_name>
sacctmgr show tres
scontrol show partition
sacctmgr show assoc where user=$USER ...
sacctmgr show qos ...
sacctmgr show assoc where user=$USER format=cluster,account,user,partition,maxjobs,maxsubmitjobs
sacctmgr show qos format=name,grpjobs,grpsubmitjobs,flags
squeue -u $USER
squeue -l -j <jobid>
scontrol show job <jobid>
sstat -j <jobid>
sacct -j <jobid>

Use sbatch for noninteractive work.
Specify cpus-per-task, memory policy, GRES, and output paths explicitly.
sbatch --test-only job.sh
sbatch job.sh
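The validate-then-submit pattern can be wrapped in one function, sketched below. The name validate_then_submit is hypothetical, and the sketch assumes sbatch exits with a nonzero status when --test-only rejects the request; confirm that on the local Slurm version before relying on it.

```shell
# Validate a batch script with --test-only and submit only if validation succeeds.
# Assumption: sbatch returns nonzero when --test-only rejects the request.
validate_then_submit() {  # usage: validate_then_submit job.sh [extra sbatch options...]
  script=$1; shift
  if sbatch --test-only "$@" "$script"; then
    sbatch "$@" "$script"
  else
    echo "Validation failed; not submitting $script" >&2
    return 1
  fi
}
```

Extra options are passed to both calls so the validated request matches the submitted one, for example `validate_then_submit job.sh -p <partition> -A <account>`.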
Use salloc to obtain an allocation and then run srun inside it, or use srun directly when that matches the workflow.
salloc -p <partition> -A <account> -t <time> ...
srun <program> [args]
or
srun --test-only -p <partition> -A <account> -t <time> ...
srun -p <partition> -A <account> -t <time> ...
Use srun for the launched application unless the local site has a documented alternative pattern.
Start with:
squeue -j <jobid>
squeue -l -j <jobid>
scontrol show job <jobid>
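For pending jobs, check the reason reported by squeue -l or scontrol show job, and use references/pending-reasons.md to interpret it.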
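To pull the pending reason out of scontrol output, something like the following may help. The helper name job_reason is hypothetical, and it assumes the output contains a space-separated Reason=<value> field with no spaces inside the value.

```shell
# Print the Reason field from `scontrol show job <jobid>` output read on stdin.
# Assumption: fields are space-separated key=value pairs and Reason has no embedded spaces.
job_reason() {
  tr ' ' '\n' | sed -n 's/^Reason=//p' | head -n 1
}
```

Typical use would be `scontrol show job <jobid> | job_reason`.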
For running jobs:
sstat -j <jobid>
Use this only for running jobs and expect metric availability to vary by site configuration.
For completed jobs:
sacct -j <jobid>
Use this to inspect elapsed time, state, exit code, node list, requested versus allocated resources, and any recorded blocking reasons.
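When a job has many steps, a pipe-delimited view is easier to filter. The sketch below flags records whose exit code is not 0:0. The helper name failed_records is hypothetical, and the field order is fixed by the --format list shown in the comment.

```shell
# Print JobID and State for every sacct record whose exit code is not 0:0.
# Expects stdin from:
#   sacct -j <jobid> --parsable2 --noheader --format=JobID,State,ExitCode,Elapsed
failed_records() {
  awk -F'|' '$3 != "0:0" { print $1, $2 }'
}
```

An empty result means every recorded step reported exit code 0:0. It does not by itself prove that the application produced correct output.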
Use:
scancel <jobid>
Cancel only when the user explicitly asks or when cleanup is clearly required by the workflow.
Do not assume there is one global limit knob. Check multiple layers.
Inspect:
- scontrol show partition
- sacctmgr show qos
- sacctmgr show assoc where user=$USER
- squeue or scontrol show job
- sshare -l

When a request is blocked, explain which layer appears responsible and which command established that conclusion.
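One heuristic for attributing a block to a layer is to look at the prefix of the job's pending reason. The sketch below maps common prefixes (QOS, Assoc, Partition) to the command that would confirm the limit. The function name limit_layer_for_reason is hypothetical, and the prefix mapping is an assumption based on common Slurm reason names, not an exhaustive or site-verified table.

```shell
# Map a pending Reason string to the limit layer that most likely set it,
# plus the command to run to confirm. Prefix matching is a heuristic only.
limit_layer_for_reason() {  # usage: limit_layer_for_reason <Reason>
  case "$1" in
    QOS*)       echo "qos: sacctmgr show qos" ;;
    Assoc*)     echo "association: sacctmgr show assoc where user=\$USER" ;;
    Partition*) echo "partition: scontrol show partition" ;;
    Priority)   echo "priority: sshare -l" ;;
    *)          echo "unclear: scontrol show job <jobid>" ;;
  esac
}
```

The printed command is a suggestion for where to look next; the conclusion should still come from that command's actual output.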
If a command is restricted or unavailable:
Examples:
sacctmgr output or local site docs."

When helping interactively:
sbatch, salloc, or srun command or script