skills/scientific-computing/snakemake-workflow-engine/SKILL.md
Python-based workflow manager for reproducible, scalable pipelines. Define rules with file-based dependencies; Snakemake resolves execution order and parallelism. Runs local, SLURM, LSF, AWS, GCP via profiles; per-rule conda/Singularity envs. For NGS pipelines, ML training, and multi-step file processing. Use Nextflow for Groovy dataflow or nf-core integration.
npx skillsauth add jaechang-hits/scicraft snakemake-workflow-engineInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Snakemake is a Python-based workflow management system that scales analyses from laptop to HPC and cloud. Workflows are defined as rules with explicit input/output file dependencies; Snakemake resolves the execution order automatically and runs independent steps in parallel. Rules can call shell commands, Python/R/Julia scripts, or inline Python. Per-rule conda or Singularity environments make workflows fully reproducible. Widely used in bioinformatics for NGS, genome assembly, and variant-calling pipelines.
Nextflow instead when you need Groovy DSL + dataflow channels, or when leveraging the nf-core community pipeline libraryPrefect or Airflow instead for data engineering workflows with dynamic task graphs or time-based schedulingsnakemake, graphviz (for DAG visualization)Check before installing: The tool may already be available in the current environment (e.g., inside a
pixi/condaenv). Runcommand -v snakemakefirst and skip the install commands below if it returns a path. When running inside a pixi project, invoke the tool viapixi run snakemakerather than baresnakemake.
# Install via conda (includes optional dependencies)
conda install -c conda-forge -c bioconda snakemake
# Minimal pip install
pip install snakemake
# Verify
snakemake --version
# 8.x.x
# Snakefile — minimal 2-rule pipeline
SAMPLES = ["sampleA", "sampleB"]
rule all: # Target rule: request final outputs
input:
expand("results/{sample}.sorted.bam", sample=SAMPLES)
rule align:
input:
fastq="data/{sample}.fastq",
ref="refs/genome.fa"
output:
bam="results/{sample}.sorted.bam"
threads: 4
shell:
"bwa mem -t {threads} {input.ref} {input.fastq} "
"| samtools sort -@ {threads} -o {output.bam}"
# Run: dry-run first, then execute
snakemake -n # dry-run: show what would run
snakemake --cores 8 # execute with 8 cores
Each rule defines one analysis step with inputs, outputs, and an execution method.
# Shell rule: run a command with {input} and {output} placeholders
rule fastqc:
input:
fastq="data/{sample}.fastq"
output:
html="qc/{sample}_fastqc.html",
zip="qc/{sample}_fastqc.zip"
log:
"logs/fastqc/{sample}.log"
shell:
"fastqc {input.fastq} -o qc/ 2> {log}"
# Run rule: inline Python for logic-heavy steps
rule parse_stats:
input:
txt="results/{sample}.flagstat.txt"
output:
csv="results/{sample}.stats.csv"
run:
import re, csv
lines = open(input.txt).readlines()
mapped = re.search(r"(\d+) mapped", "".join(lines)).group(1)
with open(output.csv, "w") as f:
csv.writer(f).writerow([wildcards.sample, mapped])
# Script rule: delegate to external R/Python/Julia script
rule plot_coverage:
input:
depth="results/{sample}.depth.txt"
output:
pdf="results/{sample}.coverage.pdf"
script:
"scripts/plot_coverage.R"
# In the R script, access via snakemake object:
# depth_file <- snakemake@input[["depth"]]
# pdf_path <- snakemake@output[["pdf"]]
Wildcards let one rule process any number of samples; expand() generates all required file paths.
# Define sample list (from config or glob)
SAMPLES = ["ctrl_rep1", "ctrl_rep2", "treat_rep1", "treat_rep2"]
rule all:
input:
# expand() generates: qc/ctrl_rep1_fastqc.html, qc/ctrl_rep2_fastqc.html, ...
expand("qc/{sample}_fastqc.html", sample=SAMPLES),
expand("results/{sample}.bam", sample=SAMPLES)
# Access wildcard values inside shell/run
rule align:
input:
"data/{sample}.fastq"
output:
"results/{sample}.bam"
shell:
"echo Processing {wildcards.sample}; "
"bwa mem refs/genome.fa {input} | samtools view -b > {output}"
# Wildcard constraints prevent ambiguous matches
rule process:
input:
"data/{sample}_{rep}.fastq"
output:
"results/{sample}_{rep}.txt"
wildcard_constraints:
sample="[A-Za-z]+", # letters only
rep="\d+" # digits only
# multiext: multiple outputs sharing a common path base
rule bwa_index:
input:
"refs/genome.fa"
output:
multiext("refs/genome.fa", ".amb", ".ann", ".bwt", ".pac", ".sa")
shell:
"bwa index {input}"
Config files externalize settings; params passes rule-level values without file dependencies.
# Snakefile: declare config file
configfile: "config/config.yaml"
# config/config.yaml:
# samples: [ctrl, treat]
# threads:
# align: 8
# sort: 4
# min_mapq: 20
SAMPLES = config["samples"]
rule filter_reads:
input:
"results/{sample}.bam"
output:
"results/{sample}.filtered.bam"
params:
mapq=config["min_mapq"] # from config, not a file
threads:
config["threads"]["sort"]
shell:
"samtools view -q {params.mapq} -b {input} > {output}"
# Dynamic params via lambda functions
rule trim:
input:
fastq="data/{sample}.fastq"
output:
trimmed="trimmed/{sample}.fastq"
params:
# Adapt quality threshold based on sample name
quality=lambda wildcards: 25 if "ctrl" in wildcards.sample else 20
shell:
"fastp -q {params.quality} -i {input.fastq} -o {output.trimmed}"
Declare computational resources for scheduler integration; use conda/Singularity for tool isolation.
# Resource declaration (used by SLURM/LSF profiles)
rule variant_calling:
input:
bam="results/{sample}.deduped.bam",
ref="refs/genome.fa"
output:
vcf="variants/{sample}.vcf.gz"
resources:
mem_mb=16000, # memory in MB
runtime=240, # max walltime in minutes
disk_mb=20000 # scratch disk space
threads: 8
shell:
"bcftools mpileup -f {input.ref} {input.bam} "
"| bcftools call -m -Oz -o {output.vcf}"
# Conda environment per rule (for reproducibility)
rule star_align:
input:
reads="data/{sample}.fastq",
genome_dir="refs/star_index/"
output:
bam="star_out/{sample}/Aligned.sortedByCoord.out.bam"
conda:
"envs/star.yaml"
# envs/star.yaml:
# channels:
# - bioconda
# dependencies:
# - star=2.7.10b
# - samtools=1.17
threads: 8
shell:
"STAR --runThreadN {threads} --genomeDir {input.genome_dir} "
"--readFilesIn {input.reads} --outSAMtype BAM SortedByCoordinate"
# Singularity/Apptainer container
rule gatk_haplotypecaller:
input:
bam="results/{sample}.bam",
ref="refs/genome.fa"
output:
gvcf="gvcfs/{sample}.g.vcf.gz"
container:
"docker://broadinstitute/gatk:4.4.0.0"
shell:
"gatk HaplotypeCaller -I {input.bam} -R {input.ref} "
"-O {output.gvcf} -ERC GVCF"
Execute locally, on clusters, or in cloud; profiles configure executors without changing the Snakefile.
# Local execution
snakemake --cores 8 # use 8 CPU cores
snakemake --cores all # use all available cores
# Dry run: show tasks without executing
snakemake -n --cores 8
# Output: 12 of 24 steps are complete. 12 jobs to run.
# Force rerun (ignore existing outputs)
snakemake --forceall --cores 8
# Visualize DAG as PDF
snakemake --dag | dot -Tpdf > workflow_dag.pdf
# SLURM cluster profile (profiles/slurm/config.yaml)
# executor: slurm
# jobs: 50
# default-resources:
# mem_mb: 2000
# runtime: 60
# use-conda: true
# Run with profile (cluster submit + monitor)
snakemake --profile profiles/slurm --cores 128
# Override resources at runtime
snakemake --profile profiles/slurm \
--set-resources variant_calling:mem_mb=32000 --cores 128
# Override threads
snakemake --set-threads align=16 --cores 64
Handle temporary files, protected outputs, checkpoints, and output validation.
# temp: auto-delete after downstream rules consume it
rule sort_bam:
input:
"results/{sample}.raw.bam"
output:
temp("results/{sample}.sorted_temp.bam") # deleted after indexing
shell:
"samtools sort {input} -o {output}"
# protected: write-protect final outputs (prevent overwrite)
rule final_report:
input:
"results/{sample}.vcf.gz"
output:
protected("reports/{sample}.final.vcf.gz")
shell:
"cp {input} {output}"
# directory: rule that outputs a directory
rule denovo_assembly:
input:
fastq="data/{sample}.fastq"
output:
directory("assemblies/{sample}/")
shell:
"spades.py -s {input.fastq} -o {output}"
# touch: create empty flag file (for ordering-only dependencies)
rule validate_bam:
input:
"results/{sample}.bam"
output:
touch("checkpoints/{sample}.validated")
shell:
"samtools quickcheck {input} && echo OK"
# ensure: validate output properties before considering rule complete
rule download_reference:
output:
ensure("refs/genome.fa", min_size=1_000_000)
shell:
"wget -O {output} https://example.com/genome.fa"
Snakemake works backward from targets: given a list of desired output files, it builds a DAG of rules needed to produce them. Rules not needed for the current targets are ignored.
# rule all: declare all final outputs here
# Without this, snakemake runs only the first rule
rule all:
input:
expand("results/{sample}.vcf.gz", sample=SAMPLES),
expand("qc/{sample}_fastqc.html", sample=SAMPLES)
{sample} in rule input/output = wildcard: filled by Snakemake at execution timeexpand("results/{sample}.bam", sample=SAMPLES) = Python: generates a list of strings NOW (used in rule all)Goal: FastQC → trim → align → sort → dedup → flagstat for multiple samples.
configfile: "config/config.yaml"
SAMPLES = config["samples"]
rule all:
input:
expand("qc/{sample}_fastqc.html", sample=SAMPLES),
expand("results/{sample}.flagstat.txt", sample=SAMPLES)
rule fastqc:
input: "data/{sample}.fastq"
output: "qc/{sample}_fastqc.html", "qc/{sample}_fastqc.zip"
shell: "fastqc {input} -o qc/"
rule trim:
input: "data/{sample}.fastq"
output: "trimmed/{sample}.fastq"
shell: "fastp -q 20 -i {input} -o {output}"
rule align:
input:
fastq="trimmed/{sample}.fastq",
ref="refs/genome.fa"
output: temp("results/{sample}.raw.bam")
threads: 8
shell:
"bwa mem -t {threads} {input.ref} {input.fastq} | samtools view -b > {output}"
rule sort_dedup:
input: "results/{sample}.raw.bam"
output:
bam="results/{sample}.bam",
bai="results/{sample}.bam.bai"
threads: 4
shell:
"samtools sort -@ {threads} {input} | samtools markdup -r - {output.bam} "
"&& samtools index {output.bam}"
rule flagstat:
input: "results/{sample}.bam"
output: "results/{sample}.flagstat.txt"
shell: "samtools flagstat {input} > {output}"
Goal: Deploy the same Snakefile to HPC with per-job resource allocation.
# 1. Create profiles/slurm/config.yaml
mkdir -p profiles/slurm
cat > profiles/slurm/config.yaml << 'EOF'
executor: slurm
jobs: 100
default-resources:
mem_mb: 4000
runtime: 60
use-conda: true
latency-wait: 30
rerun-incomplete: true
EOF
# 2. Add resources to compute-heavy rules in Snakefile
# resources:
# mem_mb=16000, runtime=120
# 3. Submit
snakemake --profile profiles/slurm --cores 256 -n # dry-run
snakemake --profile profiles/slurm --cores 256 # submit
# 4. Monitor
snakemake --profile profiles/slurm --report report.html # after completion
| Parameter | Context | Default | Range/Options | Effect |
|-----------|---------|---------|---------------|--------|
| --cores | CLI | 1 | 1–N or all | Max concurrent jobs/threads |
| threads: | Rule | 1 | 1–N | Threads per rule (scales to --cores) |
| mem_mb: | resources: | None | integer | Memory in MB (used by SLURM profile) |
| runtime: | resources: | None | integer (min) | Max walltime per job |
| --profile | CLI | None | path | YAML profile for executor config |
| --use-conda | CLI | False | flag | Activate per-rule conda environments |
| --use-apptainer | CLI | False | flag | Enable Singularity/Apptainer containers |
| -n | CLI | False | flag | Dry-run (show tasks, don't execute) |
| --forceall | CLI | False | flag | Rerun all rules regardless of status |
| --rerun-incomplete | CLI | False | flag | Rerun rules with partial outputs |
| configfile: | Snakefile | None | YAML path | Load config dictionary from YAML |
Always define rule all: Without it, only the first rule in the Snakefile runs. rule all collects all final outputs; Snakemake runs everything needed to produce them.
Use temp() for large intermediates: BAM files before deduplication, unsorted BAMs, and intermediate assemblies can be marked temp() to auto-delete after consumption — saves significant disk.
Separate config from code: Put sample lists, thread counts, file paths, and thresholds in config.yaml. Hard-coded values in Snakefiles make pipelines brittle and non-reusable.
Test with snakemake -n first: The dry-run shows exactly which rules will run and in what order. Run it before every production execution to confirm the DAG is correct.
Use log: for every shell rule: Redirect tool stderr/stdout to per-rule log files (2> {log}). Without logs, debugging cluster job failures is nearly impossible.
Benchmark rules in production: Add benchmark: "benchmarks/{rule}/{sample}.txt" to measure actual runtime and memory — essential data for tuning SLURM resource requests.
# Auto-discover samples from input directory (no hardcoded list)
from pathlib import Path
SAMPLES = [p.stem.replace(".fastq", "") for p in Path("data/").glob("*.fastq")]
print(f"Found {len(SAMPLES)} samples: {SAMPLES[:3]}...")
rule all:
input:
expand("results/{sample}.bam", sample=SAMPLES)
configfile: "config/config.yaml"
# Only run deduplication for WGS (not amplicon) data
rule dedup:
input:
"results/{sample}.sorted.bam"
output:
"results/{sample}.deduped.bam"
run:
if config.get("assay_type") == "WGS":
shell("samtools markdup -r {input} {output}")
else:
shell("cp {input} {output}")
# Collect all per-sample stats into one summary table
rule multiqc:
input:
expand("qc/{sample}_fastqc.zip", sample=SAMPLES),
expand("results/{sample}.flagstat.txt", sample=SAMPLES)
output:
"multiqc/multiqc_report.html"
shell:
"multiqc qc/ results/ -o multiqc/"
| Problem | Cause | Solution |
|---------|-------|----------|
| AmbiguousRuleException | Multiple rules match same output | Add wildcard_constraints:, use ruleorder rule_a > rule_b, or rename outputs |
| MissingOutputException | Rule completed but output file absent | Check working directory in shell; verify output path; check disk space |
| TargetFileException | rule all requests a file no rule can produce | Verify expand() args match wildcard names; use -n to trace resolution |
| Cluster jobs all fail | Resources too low for tool | Increase mem_mb or runtime; check cluster queue with squeue |
| Conda env build fails | Package conflict or wrong channel | Add conda-forge before bioconda; pin package versions |
| Rule reruns unexpectedly | Output file timestamp older than input | Touch output files with snakemake --touch; or delete and rerun |
| PermissionError on protected output | protected() wrapper applied | Remove protection with --force; or delete and regenerate without protected() |
tools
Fast short-read DNA aligner for WGS/WES/ChIP-seq. 2× faster BWA-MEM successor; outputs SAM/BAM with read group headers for GATK. Primary plus supplementary records for chimeric reads. Use STAR for RNA-seq splice-aware alignment; Bowtie2 is a comparable alternative.
tools
smina molecular docking CLI. AutoDock Vina fork with customizable scoring functions, native SDF/MOL2/PDB ligand input, autoboxing, local energy minimization, and per-atom score breakdowns. Pipeline: receptor PDBQT prep -> ligand prep (RDKit/OpenBabel) -> dock via autobox or explicit grid -> rescore/minimize with custom scoring -> rank poses by affinity. Choose smina over Vina when you need custom scoring terms (--custom_scoring), local optimization of an existing pose (--local_only), per-atom contributions (--atom_term_data), or SDF/MOL2 ligands without manual PDBQT conversion. For unknown binding sites use diffdock-blind-docking; for the Python-bindings/Vinardo workflow use autodock-vina-docking.
development
mdtraj molecular dynamics trajectory analysis (Python). Reads DCD/XTC/TRR/NetCDF/H5/PDB topologies and trajectories; computes RMSD vs time, radius of gyration, per-residue RMSF, residue-residue contact frequency maps, phi/psi torsions for Ramachandran plots (general + Gly/Pro), and 8-state DSSP secondary structure. Modules: trajectory I/O, geometry (distances/angles/dihedrals), structural analysis (RMSD/Rg/RMSF/SASA), contacts, hydrogen bonds, secondary structure (DSSP), NMR observables. For broader atom-selection grammar use mdanalysis-trajectory; for running MD simulations use OpenMM/GROMACS.
development
Programmatic PubMed access via NCBI E-utilities REST API. Covers Boolean/MeSH queries, field-tagged search, endpoints (ESearch, EFetch, ESummary, EPost, ELink), history server for batches, citation matching, systematic review strategies. Use for biomedical literature search or automated pipelines.