skills/hpc/SKILL.md
Yale YCRC HPC cluster reference for the Musser Lab. Use when writing SLURM batch scripts, configuring job resources, managing cluster storage, running bioinformatics tools on HPC, setting up Snakemake pipelines, or connecting to the cluster remotely (SSH setup, Positron/VS Code Remote SSH, interactive sessions). Covers McCleary, Bouchet, and Misha clusters with lab-specific storage paths, partition tables, and tool resource templates.
Yale Center for Research Computing (YCRC) cluster conventions for the Musser Lab. Full YCRC documentation: https://docs.ycrc.yale.edu/
Request an account at https://research.computing.yale.edu/account-request. You need a Yale NetID and PI approval (Jacob Musser).
ssh <netid>@mccleary.ycrc.yale.edu
ssh <netid>@bouchet.ycrc.yale.edu
ssh <netid>@misha.ycrc.yale.edu
Use SSH keys for passwordless access. Add to ~/.ssh/config:
Host mccleary
HostName mccleary.ycrc.yale.edu
User <netid>
Host bouchet
HostName bouchet.ycrc.yale.edu
User <netid>
Host misha
HostName misha.ycrc.yale.edu
User <netid>
After logging in, test with a minimal job:
sbatch <<'EOF'
#!/bin/bash
#SBATCH --job-name=test
#SBATCH --partition=devel
#SBATCH --time=0:05:00
#SBATCH --cpus-per-task=1
#SBATCH --mem=1G
echo "Hello from $(hostname) at $(date)"
EOF
Check status with squeue --me, view output in slurm-<jobid>.out.
| Cluster | Primary use | Status |
|---------|------------|--------|
| McCleary | Life sciences, YCGA data analysis | Decommissioning 2026, but has YCGA partition (free compute for YCGA data) |
| Bouchet | General HPC, GPU workloads | Active; primary cluster going forward |
| Misha | Wu Tsai Institute | Active |
McCleary's ycga partition is exempt from compute charges.

Each cluster has a web portal for Jupyter, RStudio, VSCode, and Remote Desktop:

- https://ood-mccleary.ycrc.yale.edu
- https://ood-bouchet.ycrc.yale.edu
- https://ood-misha.ycrc.yale.edu

Max 4 interactive app instances per user simultaneously. Yale VPN required off-campus.
Never run heavy computation on login nodes. Acceptable login-node activities include quick file inspection (wc, head, ls, etc.). Everything else must be submitted as a SLURM job.
Transfer nodes (e.g., transfer1.bouchet) are for data transfer only (rsync, Globus,
scp). Do not use them to run tools — they often have older CPUs that cause
"Illegal instruction" errors with compiled binaries. If you need to run module load,
prefetch, fasterq-dump, or any analysis tool, use an interactive compute node instead
(see Interactive jobs below).
| Cluster | PI storage (canonical) | Home symlink (lab convention) | Scratch |
|---------|-----------|-------------------------------|---------|
| McCleary | /vast/palmer/pi/musser | — | /vast/palmer/scratch/musser/ |
| Bouchet | /nfs/roberts/project/pi_jm284/ | ~/project_pi_jm284/ (= /home/<netid>/project_pi_jm284/) | /nfs/roberts/scratch/pi_jm284 |
| Misha | /gpfs/radev/project/musser | — | /gpfs/radev/scratch |
On Bouchet, lab members typically work via the home symlink ~/project_pi_jm284/ and
keep their projects under a per-user subdirectory:
~/project_pi_jm284/<netid>/projects/<project_name>/. This resolves to
/nfs/roberts/project/pi_jm284/<netid>/projects/<project_name>/ underneath. Either
form works; the home-symlink form is more discoverable and matches cd history.
Reference databases that are reusable across projects (BUSCO lineages, eggNOG DB, DIAMOND-formatted UniProt, sponge mitochondrial reference sets, etc.) live at:
~/project_pi_jm284/shared/databases/
busco_lineages/ # BUSCO odb10 datasets (metazoa, eukaryota, etc.)
... # other shared DBs as they're added
Resolves to /nfs/roberts/project/pi_jm284/shared/databases/ under the symlink.
When writing batch or setup scripts that reference these shared DBs, use a parameter that defaults to this location and can be overridden via env var:
SHARED_DBS="${SHARED_DBS:-$HOME/project_pi_jm284/shared/databases}"
This way the script picks up the lab default automatically but can be redirected on a non-Bouchet machine, or to a different location for testing, without editing.
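As a minimal sketch of how the override behaves, the snippet below wraps the same default in a small function. The function name `shared_db_path` and the `/tmp/test_dbs` location are hypothetical, chosen only for illustration; the default is the lab path given above.

```bash
#!/bin/bash
# Hypothetical helper (not part of the lab tooling): build the path to one
# shared database, honoring an optional SHARED_DBS override.
shared_db_path() {
    local db_name="$1"
    # Same default-with-override idiom as the line above.
    local shared_dbs="${SHARED_DBS:-$HOME/project_pi_jm284/shared/databases}"
    printf '%s/%s\n' "$shared_dbs" "$db_name"
}

# Usage:
#   shared_db_path busco_lineages                           # lab default under $HOME
#   SHARED_DBS=/tmp/test_dbs shared_db_path busco_lineages  # redirected for testing
```

Setting the variable only for a single command, as in the second usage line, leaves the lab default in place for the rest of the session.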
| Type | Backed up? | Purge policy | Use for |
|------|-----------|-------------|---------|
| Home (~/) | Yes (snapshots) | None | Scripts, configs, small files. 125 GiB quota. |
| PI storage | Yes (snapshots) | None | Raw data, important results, conda environments |
| Project | Yes (snapshots) | None | Active project directories |
| Scratch | No | 60-day purge | Temporary/intermediate files, large job outputs |
Important:
- Check storage usage with getquota (McCleary); list your storage paths with mydirectories.

Mirror the local project structure in PI storage or project space:
/nfs/roberts/project/pi_jm284/<project_name>/
.git/ # Same repo as local — sync via git push/pull
.claude/ # Project docs, plans, findings
batch/ # SLURM batch scripts (tracked in git)
logs/ # SLURM output files (tracked in git)
scripts/ # Analysis scripts
data/ # External/immutable inputs (gitignored, sync via Globus)
outs/ # Script-produced outputs (gitignored)
environment.yml # Conda environment specification
Use scratch only for large temporary files (sort buffers, intermediate alignments) that can be regenerated. Never store the project itself on scratch — use PI storage.
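Because scratch is purged on a 60-day cycle, it can help to see which files are getting old before they disappear. The sketch below lists files by modification time; it assumes modification time is a reasonable proxy for what the purge looks at (check the YCRC docs for the actual criterion), and the function name and the 50-day warning threshold are hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: list files under a directory whose modification time is
# more than DAYS days in the past.
list_purge_candidates() {
    local dir="$1"
    local days="${2:-50}"   # assumed warning threshold, below the 60-day purge
    find "$dir" -type f -mtime +"$days" -print
}

# Usage (Bouchet scratch path from the storage table):
#   list_purge_candidates /nfs/roberts/scratch/pi_jm284 50
```

Anything it prints that still matters should be regenerated or copied to PI storage.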
When a project is worked on both locally and on the cluster, the same git repo is cloned in both places. Conventions:
- Run git pull before starting work in either environment.
- Large files (data/, tool outputs) are gitignored and transferred manually. Not all data exists in both places.
- Scripts locate the repo root with BASEDIR=$(git rev-parse --show-toplevel), with no hardcoded paths. This makes scripts work regardless of where the repo is cloned.
- Cluster-only large directories (e.g., cellranger_refs/, raw FASTQ staging directories) are added to .gitignore on a per-project basis.

Full docs: https://docs.ycrc.yale.edu/clusters-at-yale/job-scheduling/
Every batch script starts from these defaults. Override per-job as needed.
#!/bin/bash
#SBATCH --job-name=<tool>_<brief_description>
#SBATCH --partition=day
#SBATCH --time=4:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=5G
#SBATCH --output=logs/slurm-%j.out
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=<your email>
# ── Provenance ────────────────────────────────────────
BASEDIR=$(git rev-parse --show-toplevel)
cd "$BASEDIR"
echo "=== PROVENANCE ==="
echo "Job ID: $SLURM_JOB_ID"
echo "Script: $0"
echo "Git hash: $(git rev-parse HEAD)"
echo "Git dirty: $(git status --porcelain | head -5)"
echo "Date: $(date -Iseconds)"
echo "Node: $(hostname)"
# Modules (cluster-only tools — always pin versions)
module purge
module load Tool/1.2.3
module list 2>&1
# Conda (project environment)
module load miniconda
conda activate myenv
echo "Conda env: $CONDA_DEFAULT_ENV"
# Log versions of key tools used in this script
echo "tool: $(tool --version | head -1)"
echo "=== END PROVENANCE ==="
# ──────────────────────────────────────────────────────
# ── Main ──────────────────────────────────────────────
# Your commands here
The provenance block prints to stdout, which SLURM captures in the log file. Since logs are tracked in git, the full record (script version, tool versions, environment state) is version-controlled alongside the code that produced the results.
Provenance block requirements:
- BASEDIR + cd: ensures git commands and relative paths work
- git rev-parse HEAD: which version of the code ran
- git status --porcelain: catches uncommitted changes (scripts called by the batch job are covered by the git hash only if the tree is clean)
- module list: all loaded modules and their exact versions
- conda activate + env name: which conda environment was used

File organization:
- Batch scripts go in a batch/ subdirectory
- SLURM output goes to a logs/ subdirectory (create with mkdir -p logs before first submit)
- batch/ is tracked in git (scripts are code)
- logs/ is tracked in git (logs are the reproducibility record; commit after jobs complete so provenance is preserved)

Critical: No space between # and SBATCH. Otherwise the directive is ignored.
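The spacing rule is easy to break by accident, so a quick check before submitting can save a wasted job. This is a minimal sketch; the function name `check_sbatch_spacing` is hypothetical and not an existing lab or SLURM tool.

```bash
#!/bin/bash
# Hypothetical lint helper: report lines where whitespace separates '#' from
# 'SBATCH'. SLURM treats such lines as ordinary comments.
check_sbatch_spacing() {
    local script="$1"
    # -n prints line numbers; the pattern matches "# SBATCH" but not "#SBATCH".
    if grep -nE '^#[[:space:]]+SBATCH' "$script"; then
        echo "ERROR: directive(s) above will be ignored; remove the space after '#'" >&2
        return 1
    fi
    return 0
}

# Usage:
#   check_sbatch_spacing batch/my_job.sh && sbatch batch/my_job.sh
```

A directive that was disabled on purpose by writing it as `# SBATCH` will also be flagged; `##SBATCH` is one way to disable a line without tripping this check.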
| Directive | Short | Lab default | Purpose |
|-----------|-------|------------|---------|
| --job-name | -J | <tool>_<desc> | Job identification (shows in squeue) |
| --time | -t | varies by tool | Walltime (D-HH:MM:SS) |
| --partition | -p | day | Target partition |
| --cpus-per-task | -c | varies by tool | Cores per task (for threading) |
| --mem-per-cpu | — | 5G | RAM per CPU (use for most jobs) |
| --mem | — | — | Total RAM (use instead of --mem-per-cpu for memory-hungry tools like PROST, Cell Ranger) |
| --gpus | -G | 0 | GPU count (must be explicit) |
| --output | -o | logs/slurm-%j.out | Combined stdout+stderr |
| --error | -e | (not set) | Separate stderr file (use when debugging with split output) |
| --mail-type | — | BEGIN,END,FAIL | Notifications (see below) |
| --mail-user | — | <your email> | Notification email |
| --nodes | -N | 1 | Compute nodes (rarely >1 except MPI) |
| --ntasks | -n | 1 | MPI task count |
| --mail-type value | When to use |
|---------------------|-------------|
| BEGIN,END,FAIL | Lab default for single jobs |
| BEGIN,END,FAIL,ARRAY_TASKS | Lab default for array jobs — sends per-task emails so individual failures are visible |
| FAIL | Very high-volume array jobs (hundreds of tasks) to reduce email flood |
For array jobs, always include ARRAY_TASKS. Without it, individual task failures within
a running array don't trigger emails — you only find out when the whole array finishes.
GPUs must be explicitly requested with --gpus. Key GPU partitions:
| Cluster | Partition | GPU | VRAM |
|---------|-----------|-----|------|
| Bouchet | gpu | RTX 5000 Ada | 32 GB |
| Bouchet | gpu_rtx6000 | RTX Pro 6000 Blackwell | 96 GB |
| Bouchet | gpu_h200 | H200 | 141 GB |
| McCleary | gpu | A5000/A100/RTX 3090 | 24-80 GB |
| Misha | gpu | H100/H200/A100/A40/L40S | 48-141 GB |
For GPU jobs, also module load CUDA before conda activation.
# Quick interactive session (testing, short tasks)
salloc -p devel -c 4 --mem=16G -t 2:00:00
# Claude Code working session (writing scripts, running them, submitting batch jobs)
salloc -p day -c 8 --mem=32G -t 6:00:00 --job-name=positron
Recommended for Claude Code sessions: 8 CPUs, 32 GB RAM. Most interactive work (writing scripts, parsing TSVs/GFFs, pandas, matplotlib) needs <4 GB, but 32 GB provides headroom for occasional spikes (loading large SQLite DBs, sorting big BLAST outputs, multi-panel figures). The 8 CPUs help with parallel grep/ripgrep and pigz.
Heavy compute (DIAMOND, Cell Ranger, STAR, BRAKER, eggNOG-mapper, PROST) should always be submitted as batch jobs, never run on the interactive node. Rule of thumb: if it takes
more than 5 minutes or needs more than 8 CPUs, write a batch script.
Use devel for quick tests (6-hour limit, fast queue). Use day for full working
sessions (up to 24 hours). Add --x11 for graphical forwarding (requires X11 setup).
For setting up Positron (VS Code) to connect to an interactive session via SSH, see
references/positron-ssh-setup.md.
| Command | Purpose |
|---------|---------|
| squeue --me | List your running/pending jobs |
| sacct -j <id> | Job status and resource usage |
| jobstats <id> | Efficiency metrics (CPU/memory utilization) |
| scancel <id> | Cancel a job |
| sbatch --test-only script.sh | Estimate queue start time without submitting |
Always check resource usage with jobstats after jobs complete. Request only what you
need — wasteful allocations slow scheduling for everyone.
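To make "efficiency" concrete: CPU efficiency is the CPU time actually used divided by elapsed time multiplied by the number of allocated cores. The sketch below computes that from strings formatted like the `TotalCPU` and `Elapsed` fields that `sacct` reports. The function names are hypothetical; `jobstats` already reports this figure, so treat this as an illustration of the arithmetic.

```bash
#!/bin/bash
# Convert a SLURM-style duration ([D-]HH:MM:SS, MM:SS, optional .fraction)
# to whole seconds.
to_seconds() {
    local t="${1%%.*}"            # drop fractional seconds, if any
    local days=0 h=0 m=0 s=0
    local -a parts
    if [[ "$t" == *-* ]]; then
        days="${t%%-*}"
        t="${t#*-}"
    fi
    IFS=: read -r -a parts <<< "$t"
    case "${#parts[@]}" in
        3) h="${parts[0]}"; m="${parts[1]}"; s="${parts[2]}" ;;
        2) m="${parts[0]}"; s="${parts[1]}" ;;
        1) s="${parts[0]}" ;;
    esac
    # 10# forces base 10 so values like "08" are not parsed as octal.
    echo $(( 10#$days * 86400 + 10#$h * 3600 + 10#$m * 60 + 10#$s ))
}

# Integer CPU efficiency in percent: used CPU time / (elapsed * cores).
cpu_efficiency_pct() {
    local used elapsed ncpus="$3"
    used=$(to_seconds "$1")
    elapsed=$(to_seconds "$2")
    if (( elapsed == 0 || ncpus == 0 )); then
        echo 0
        return 0
    fi
    echo $(( 100 * used / (elapsed * ncpus) ))
}

# Usage: a job that used 4 CPU-hours over 1 hour on 8 cores
#   cpu_efficiency_pct 04:00:00 01:00:00 8    # prints 50
```

A result well under 100 suggests requesting fewer cores next time.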
For many similar jobs (e.g., processing multiple samples), use job arrays or Dead Simple Queue (dsq) rather than submitting hundreds of individual jobs. Rate limit: 200 job submissions per hour.
#!/bin/bash
#SBATCH --array=1-50
#SBATCH --partition=day
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=4
#SBATCH --mem-per-cpu=4G
#SBATCH --output=logs/slurm-%A_%a.out
#SBATCH --mail-type=BEGIN,END,FAIL,ARRAY_TASKS
#SBATCH --mail-user=<your email>
SAMPLE=$(sed -n "${SLURM_ARRAY_TASK_ID}p" samples.txt)
# Process $SAMPLE
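The mapping from task ID to sample can be tried locally before submitting, and the array range can be derived from the sample list instead of hardcoded. This is a sketch under the assumption that samples.txt has one sample per line with a trailing newline; the function names and the script name in the usage comment are hypothetical.

```bash
#!/bin/bash
# Hypothetical helpers for array jobs driven by a one-sample-per-line file.

# Print the sample on line TASK_ID of the list (same sed idiom as above).
sample_for_task() {
    local task_id="$1" sample_list="$2"
    sed -n "${task_id}p" "$sample_list"
}

# Print an --array range covering every line of the list, e.g. "1-50".
array_range() {
    local sample_list="$1"
    local n
    n=$(wc -l < "$sample_list")
    # Arithmetic expansion strips any padding some wc implementations add.
    echo "1-$(( n ))"
}

# Usage on the cluster (script name is a placeholder):
#   sbatch --array="$(array_range samples.txt)" batch/my_array_job.sh
```

An `--array` value given on the `sbatch` command line takes precedence over the `#SBATCH --array` line in the script, so the range follows the sample list automatically.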
See references/partitions.md for full partition tables (McCleary, Bouchet, Misha) with
time limits, per-user resource limits, GPU types, and node counts.
Quick summary for partition selection:
- day (1-day limit, generous CPU/memory)
- week (7-day) or long (28-day, McCleary only)
- gpu (RTX 5000 Ada on Bouchet), gpu_h200 (H200), gpu_rtx6000 (RTX Pro 6000 Blackwell)
- devel (6-hour limit, strict per-user caps)
- ycga on McCleary (exempt from compute charges)
- bigmem (up to 4 TiB/node on Bouchet)
- scavenge (free idle resources, may be killed)

See references/tool_profiles.md for SLURM resource recommendations per bioinformatics
tool (CPUs, memory, time, partition, GPU). Always check jobstats after initial runs
and adjust.
For any given tool, use one or the other — never both. If a tool is available via both module and conda, choose one and stick with it. Having the same tool in both creates PATH conflicts where the activation order silently determines which version runs.
| Use conda (project env) for | Use modules for |
|--------------------------------|---------------------|
| Tools used in both local and cluster environments | Cluster-only heavy tools unlikely to run locally |
| Python/R packages | Tools with complex cluster-specific dependencies |
| Lightweight bioinformatics (samtools, MAFFT, DIAMOND) | Cell Ranger, STAR, PROST, EggNOG-mapper |
| Anything tracked in environment.yml | GPU-dependent tools requiring CUDA |
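When the same tool may be present from both a module and a conda env, listing every copy on `PATH` shows which one would actually run. This is a minimal sketch using bash's built-in `type`; the function name `list_tool_copies` is hypothetical.

```bash
#!/bin/bash
# Hypothetical helper: list every executable named TOOL on PATH, in search
# order. The first line printed is the copy the shell would run.
list_tool_copies() {
    local tool="$1"
    # -a: show all matches; -P: search PATH only (skip aliases and functions).
    type -aP "$tool"
}

# Usage, after loading modules and activating the conda env:
#   list_tool_copies samtools
```

More than one line of output means the tool is installed twice; drop it from either the module list or the environment.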
Always pin module versions — use module load CellRanger/9.0.1, never bare
module load CellRanger. Unpinned modules silently change when the cluster updates
defaults. The provenance block in the batch template logs loaded versions, but pinning
prevents the drift in the first place.
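Unpinned loads can be caught mechanically before a script is committed. The sketch below prints any module named in a `module load` line without a `/version` suffix; the function name is hypothetical, and it assumes one `module load` command per line.

```bash
#!/bin/bash
# Hypothetical lint helper: print module names that are loaded without an
# explicit /version suffix. Exit status is non-zero when none are found.
find_unpinned_modules() {
    local script="$1"
    sed -E 's/#.*$//' "$script" \
        | grep -E '^[[:space:]]*module[[:space:]]+load[[:space:]]' \
        | sed -E 's/^[[:space:]]*module[[:space:]]+load[[:space:]]+//' \
        | tr -s '[:space:]' '\n' \
        | grep -v '/' \
        | grep -v '^$'
}

# Usage:
#   find_unpinned_modules batch/my_job.sh
```

Note that it also flags `module load miniconda`, which the templates here load without a version, so that hit is a judgement call.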
module purge
module load miniconda
# Create environment
conda create -n myenv -c conda-forge -c bioconda python=3.11 <packages>
# From file
conda env create -f environment.yml
# Activate
conda activate myenv
Use pip only as a fallback when a package is not available in conda-forge.

A shared tools conda env for general-purpose cluster utilities (not project-specific):
conda create -n tools -c conda-forge gh git
conda activate tools
gh auth login # one-time setup: authenticate with GitHub
Activate tools at the start of sessions that need gh, git, or other general CLI
tools. Project-specific envs (Python analysis, bioinformatics pipelines) remain separate.
Use rig to manage R versions and renv for project-level package management, matching the local development workflow.
# Check available R versions via module system
module avail R
# Or install rig in userspace if not available as a module
# (check YCRC docs for current guidance)
# For renv projects, set cache to PI storage to avoid filling home quota:
export RENV_PATHS_CACHE="/vast/palmer/pi/musser/renv_cache" # McCleary
# export RENV_PATHS_CACHE="/nfs/roberts/project/pi_jm284/renv_cache" # Bouchet
# Then restore packages as usual
Rscript -e 'renv::restore()'
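Rather than commenting one export line in and out per cluster, a script can pick the cache location from whichever PI storage directory exists. This is a sketch only; the function name `pick_renv_cache` is hypothetical, and the candidate paths in the usage comment are the PI storage paths listed above.

```bash
#!/bin/bash
# Hypothetical helper: print "<dir>/renv_cache" for the first candidate
# directory that exists, so one script works on more than one cluster.
pick_renv_cache() {
    local dir
    for dir in "$@"; do
        if [[ -d "$dir" ]]; then
            printf '%s/renv_cache\n' "$dir"
            return 0
        fi
    done
    return 1   # no candidate found; the caller decides what to do
}

# Usage:
#   if cache=$(pick_renv_cache /vast/palmer/pi/musser /nfs/roberts/project/pi_jm284); then
#       export RENV_PATHS_CACHE="$cache"
#   fi
```

If neither directory exists (for example on a laptop), the variable stays unset and renv falls back to its own default cache.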
HPC-specific R notes:
- Some R packages need system libraries loaded with module load (e.g., HDF5, GDAL, PROJ) before renv::restore() will succeed
- Set RENV_PATHS_CACHE to PI storage so the package cache persists and doesn't eat home quota

See references/snakemake.md for full Snakemake setup, SLURM executor configuration,
per-rule resource specification, profile setup, and tmux usage.
When to use Snakemake: Fan-out/fan-in workflows (many samples through same steps) or complex dependency graphs. For linear pipelines (A → B → C), a simple bash script with checkpointing is fine.
Checkpointing principle: All orchestrators must check for existing output before re-running a stage. The output file itself is the checkpoint — no sentinel files.
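For a linear bash pipeline, the checkpointing principle can be sketched as a small wrapper that skips a stage when its output already exists. The function name `run_stage` is hypothetical, and the sketch assumes each stage writes its result to stdout; tools that write their own output files would need the existence check around the call instead.

```bash
#!/bin/bash
# Hypothetical wrapper: run a stage only if its output file is missing or
# empty. Usage: run_stage OUTPUT_FILE COMMAND [ARGS...]
run_stage() {
    local output="$1"
    shift
    if [[ -s "$output" ]]; then
        echo "[skip] $output already exists" >&2
        return 0
    fi
    echo "[run]  $output" >&2
    # Write to a temporary name first so an interrupted job never leaves a
    # partial file that looks like a finished checkpoint.
    if "$@" > "${output}.partial"; then
        mv "${output}.partial" "$output"
    else
        rm -f "${output}.partial"
        return 1
    fi
}

# Usage (file names are placeholders):
#   run_stage outs/sorted.txt sort data/input.txt
#   run_stage outs/unique.txt uniq outs/sorted.txt
```

Re-running the pipeline after a failure then resumes at the first stage whose output is missing.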
# Local to cluster
rsync -avz --progress local_dir/ <netid>@mccleary.ycrc.yale.edu:/vast/palmer/pi/musser/project/
# Cluster to local
rsync -avz --progress <netid>@mccleary.ycrc.yale.edu:/vast/palmer/pi/musser/project/results/ local_results/
scp file.fasta <netid>@mccleary.ycrc.yale.edu:/vast/palmer/pi/musser/project/data/raw/
For multi-GB transfers, use Globus (https://docs.ycrc.yale.edu/data/transfer/globus/). YCRC maintains Globus endpoints for each cluster.
To move data between clusters, use Globus or direct transfer between cluster login nodes (they can reach each other).
Yale Center for Genome Analysis (YCGA) data is stored at /gpfs/ycga/sequencers on McCleary.
YCGA sends an email with a URL when data is ready. Use the ycgaFastq utility:
module load ycga-public
# From the URL in YCGA's notification email
ycgaFastq fcb.ycga.yale.edu:3010/randomstring/sample
# By netid and flowcell
ycgaFastq <netid> AHFH66DSXX
For 10x and PacBio data, use URLFetch:
module load ycga-public
URLFetch http://fcb.ycga.yale.edu:3010/randomstring/folder
Archives are in AWS Deep Glacier. Retrieval via web browser (http://archive.ycga.yale.edu)
or ycgaFastq. Normal retrieval: 48 hours. Expedited: 12 hours (8x more expensive).
Submit YCGA-related analysis jobs with -p ycga on McCleary to avoid compute charges.
Eligible users: Yale PIs using YCGA for sequencing and their authorized lab members.
When giving the user commands to run on the cluster (not batch scripts), follow these conventions:
Append & to any command that will take more than a few seconds:
pigz -p 8 *.fastq &
fasterq-dump --split-files --threads 8 SRR123456 &
This lets the user keep working in the same shell. Mention jobs and fg for checking
or re-attaching.
| Slow tool | Fast alternative | Notes |
|-----------|-----------------|-------|
| gzip | pigz -p N | module load pigz. N = number of cores |
| bzip2 | pbzip2 -pN | No space between -p and the core count (e.g., -p8) |
| samtools sort | samtools sort -@ N | Built-in threading |
| samtools index | samtools index -@ N | Built-in threading |
When operating on many independent files, use xargs -P:
printf '%s\0' *.fastq | xargs -0 -P 8 -n 1 gzip &
Commands should be unambiguous — include cd to the working directory or use absolute
paths so the user can copy-paste without guessing context.
When asked to create a SLURM job, Claude operates in one of two modes depending on where it is running.
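When Claude is running locally (not on the cluster):
- Write the batch script (.sh) in the project directory.
- Give the user the commands to transfer and submit it:
rsync -avz batch/my_job.sh <netid>@mccleary.ycrc.yale.edu:/vast/palmer/pi/musser/project/batch/
ssh mccleary "sbatch /vast/palmer/pi/musser/project/batch/my_job.sh"

When Claude is running on the cluster:
- Write the batch script in the batch/ directory.
- Submit it with sbatch batch/my_job.sh.
- Monitor the job (squeue --me, jobstats <id>).

The .py + .sh pattern:
In data science projects, batch scripts are thin SLURM wrappers that call .py analysis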
scripts. The analysis logic lives entirely in the .py file; the .sh file handles only
SLURM directives, environment activation, and the python invocation.
Batch script template (calling a .py script):
#!/bin/bash
#SBATCH --job-name=<brief_name>
#SBATCH --partition=day
#SBATCH --time=2:00:00
#SBATCH --cpus-per-task=8
#SBATCH --mem-per-cpu=5G
#SBATCH --mail-type=BEGIN,END,FAIL
#SBATCH --mail-user=<your email>
#SBATCH --output=logs/slurm-<brief_name>-%j.out
# <One-line description of what this job does>
set -euo pipefail
BASEDIR=$(git rev-parse --show-toplevel)
cd "$BASEDIR"
echo "=== Job info ==="
echo "Job ID: $SLURM_JOB_ID"
echo "Node: $(hostname)"
echo "Start: $(date)"
echo "Git hash: $(git rev-parse --short HEAD)"
echo ""
module load miniconda
source $(conda info --base)/etc/profile.d/conda.sh
conda activate <env-name>
echo "Python: $(which python)"
# echo "tool: $(tool --version)" # Log versions of tools used by the .py script
echo ""
SECONDS=0
python scripts/<section>/XX_script.py
echo ""
echo "=== Completed in ${SECONDS}s ($(date)) ==="
This pattern keeps analysis code portable (runnable locally or interactively) while
SLURM configuration stays separate. See the script-organization skill for the full
convention including numbering.
Always commit scripts before submitting batch jobs. The git hash in BUILD_INFO.txt and
the SLURM log must reflect the code that actually ran. If you've been editing a .py script,
commit it (and the .sh wrapper) before sbatch. The provenance block logs
git status --porcelain as a safety net, but the intent is a clean tree at submit time.
This is the cluster-specific case of a general rule — see the script-organization skill
for the broader "commit before execute" convention that also covers .qmd rendering.
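The commit-before-submit rule can be enforced with a small guard around `sbatch`. This is a sketch, not an existing lab tool: the function names `tree_is_clean` and `submit_clean` are hypothetical, and it treats untracked files as dirty because `git status --porcelain` reports them.

```bash
#!/bin/bash
# Hypothetical guard: succeed only when the current git work tree has no
# uncommitted or untracked changes.
tree_is_clean() {
    local status
    # Fail (treat as not clean) if this is not a git repository.
    status=$(git status --porcelain 2>/dev/null) || return 1
    [[ -z "$status" ]]
}

# Hypothetical wrapper: refuse to submit from a dirty tree.
submit_clean() {
    if ! tree_is_clean; then
        echo "Uncommitted changes found; commit before running sbatch." >&2
        return 1
    fi
    sbatch "$@"
}

# Usage:
#   submit_clean batch/my_job.sh
```

Since logs/ is tracked in git, output from earlier jobs that has not been committed yet will also block submission, which fits the convention of committing logs after jobs complete.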
- Always include --job-name, --partition, --time, --cpus-per-task, memory, and --output.
- Run module purge before loading modules (omit module purge when the only module is miniconda for conda-only jobs).
- Pin module versions (module load CellRanger/9.0.1, never bare module load CellRanger).
- Include --mail-type=BEGIN,END,FAIL and --mail-user=<your email>.
- Write SLURM output to the logs/ subdirectory.
- For YCGA data on McCleary, use --partition=ycga.
- Use module load for software. Run module avail to list available packages. Always module purge before loading to avoid conflicts.
- Jobs on the ycga partition (McCleary) analyzing YCGA sequencing data are exempt from compute charges.