claude/skills/scatter-gather/SKILL.md
Decide whether and how to scatter genomics workloads across chromosomes or region tiles, then gather the per-shard outputs back together correctly. Use proactively whenever the user mentions parallelizing per-chromosome, sharding by chrom, tiling the genome, splitting a BAM/VCF/BED by region, merging per-chrom outputs, or has a workflow with obvious per-chromosome parallelism (variant calling, methylation pileup/DMR, coverage, liftover, peak calling, SV calling). Also triggers on /scatter-gather, "scatter X across chromosomes", "shard this", "chunked variant calling", "merge per-chrom VCFs", "gather these bedmethyl files", "concat these bigwigs", or any per-region parallelism question. **Trigger even when the user is also using Snakemake or Nextflow** — those skills handle DAG plumbing while this one defines *what* to scatter, *whether* it's even safe to scatter (some computations like DSS DMLtest pool globally and break under naive sharding), and *how* to gather each output format without silent corruption. Especially trigger on questions about merging per-chromosome BAM / VCF / BED / bedMethyl / bigwig outputs, or whether a scatter-gather is equivalent to running on the whole genome.
Genomics scatter-gather looks trivial until it's silently wrong. The lab has hit several variants of "scattered + concatenated, looked right, was actually corrupt" — DSS DMLtest is the canonical example. This skill is the decision and validation layer: figure out whether to scatter, how to scatter, how to gather correctly per format, and how to validate. Implementation goes to the workflow-engine skills.
Use it when the user wants to:
Skip when the parallelism is purely per-sample with no per-chromosome dimension — that's normal pipeline parallelism and the snakemake skill covers it.
This skill leans on the lab's existing infrastructure rather than restating it. Things to consult here vs there:
| Question | Where to look |
|---|---|
| Should I scatter at chromosome / region / 2D granularity? | references/scatter_strategies.md (this skill) |
| Per-shard memory/time scaling formulas | references/scatter_strategies.md |
| How do I gather format X correctly? | references/gather_methods.md (this skill) |
| Is tool X poolable / shardable? | references/poolability.md (this skill) — for DSS, see also rules/dss.md |
| How do I derive the contig list? | block-hardcoded-contigs.sh hook + CLAUDE.md §2 (lab rules) |
| Why does my Snakemake job fail with srun memory conflict? | rules/snakemake.md (lab rules) |
| Which partition / nodes? | rules/mskcc_partitions.md (lab rules) |
| Why does DSS silently corrupt under SLURM memory pressure? | rules/dss.md (lab rules — read this whenever DSS is involved) |
| #-header convention on BED-like outputs | CLAUDE.md §2 Genomic Output Conventions |
Answer in order. Stop at the first "no" or "needs work" and resolve it.
1. **Is this step poolable, or is it shardable?** Some computations pool across all data: DSS dispersion priors (see rules/dss.md), DESeq2 size factors, GATK BQSR table, joint genotyping, CNV segmentation, salmon index. Sharding+gathering ≠ whole-genome for these. See references/poolability.md for the per-tool catalogue and the "two-phase" pattern.
2. **What's the right scatter granularity?** Chromosome (~24 shards, simple, unbalanced), region tiles (balanced but boundary effects), per-sample × per-chromosome (2D). See references/scatter_strategies.md.
3. **What's the gather method for this output format?** `cat` is almost never correct. BAM, VCF, BED, bedMethyl, bigwig each have a format-aware merge tool with its own pitfalls. See references/gather_methods.md.
4. **Who runs the DAG?** Hand off: snakemake skill for Snakemake (expand, checkpoint, per-shard resource lambdas), nextflow-development / nfcore-module skills for Nextflow channels and pre-built nf-core scatter-gather modules.
5. **What proves it worked?** Run the sanity checks below.
These are the failure modes the lab has hit. Each one has cost real time at least once.
The chromosome list is not a constant. Hardcoding chr1..22,X,Y silently drops alts/decoys and breaks across builds (chrM ↔ MT). Always derive at runtime from <ref>.fa.fai or chrom.sizes (paths in profiles/databases/databases_config.yaml). The contig subset (autosomes only? +X/Y? +MT? +_random/_alt/decoys?) is a config decision the user must approve, not a default. The block-hardcoded-contigs.sh hook will catch the worst version of this.
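A minimal sketch of deriving the scatter list at runtime, assuming a samtools-style `.fai` whose first column is the contig name. The function name `derive_contigs` and its default regex are hypothetical and only illustrate the idea; the actual subset is still a decision the user has to approve.

```shell
#!/usr/bin/env bash
set -euo pipefail

# derive_contigs FAI [REGEX]
# Print contig names (column 1 of the .fai) that match REGEX.
# The default regex is an illustrative assumption, not a lab default.
derive_contigs() {
    local fai="$1"
    local default_re='^chr([0-9]+|X|Y)$'
    local keep="${2:-$default_re}"
    awk -v re="$keep" '$1 ~ re { print $1 }' "$fai"
}

# Toy index with made-up lengths and offsets
# (columns: name, length, offset, linebases, linewidth).
fai="$(mktemp)"
printf '%s\t%s\t%s\t%s\t%s\n' \
    chr1 1000 6 60 61 \
    chrY 200 1030 60 61 \
    chrM 50 1240 60 61 \
    chr1_KI270706v1_random 80 1300 60 61 > "$fai"

derive_contigs "$fai"           # prints chr1, then chrY
derive_contigs "$fai" '^chr'    # widened subset: all four contigs
rm -f "$fai"
```

Passing the regex as an argument keeps the subset choice visible at the call site instead of buried in a hardcoded list.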
Load imbalance is the silent killer. chr1 is ~5× chrY. One-job-per-chromosome looks parallel but is bounded by the longest shard, and identical resource asks waste cluster capacity. Region tiling solves balance; per-shard resource lambdas solve waste. Recipes and formulas in references/scatter_strategies.md. When benchmarking scatter speedup, look at the longest-shard wall time, not the mean.
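As an illustration of region tiling, the sketch below cuts each contig in a `.fai` into fixed-width BED windows (0-based, half-open), with the last window of each contig truncated at the contig end. The function name `make_tiles` and the toy lengths are assumptions for the example; it does not handle boundary effects, which still need the treatment described in references/scatter_strategies.md.

```shell
#!/usr/bin/env bash
set -euo pipefail

# make_tiles FAI TILE_BP
# Emit BED windows of at most TILE_BP bases covering every contig in FAI.
make_tiles() {
    local fai="$1" tile="$2"
    awk -v w="$tile" 'BEGIN { OFS = "\t" }
    {
        for (start = 0; start < $2; start += w) {
            end = start + w
            if (end > $2) end = $2      # truncate the final window
            print $1, start, end
        }
    }' "$fai"
}

# Toy index with made-up contig lengths (only columns 1 and 2 are used).
fai="$(mktemp)"
printf '%s\t%s\n' chrA 250 chrB 100 > "$fai"

make_tiles "$fai" 100
# chrA  0    100
# chrA  100  200
# chrA  200  250
# chrB  0    100
rm -f "$fai"
```

With tiles, the longest shard is bounded by the tile width rather than by the longest chromosome, which is the quantity that matters when benchmarking.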
Some steps are not shardable. Anything that computes a global parameter from the full dataset (empirical-Bayes prior, size factor, BQSR table, salmon index) cannot be reproduced by sharding + concatenating. The rules/dss.md incident is the canonical lab example: per-chrom DSS runs are not the same as the same-chr slice of a whole-genome run. The fix is the two-phase pattern: run the global step on all data first, then scatter the local step using the global parameter. See references/poolability.md for the catalogue.
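The difference between naive sharding and the two-phase pattern can be shown with a toy stand-in. Here a column mean plays the role of the pooled parameter (it is not DSS and not a dispersion prior), and "centering on the mean" plays the role of the local step. The helper names `col_mean` and `center` are hypothetical.

```shell
#!/usr/bin/env bash
set -euo pipefail

# Toy table: contig, position, value.
data="$(mktemp)"
printf '%s\t%s\t%s\n' chr1 10 2 chr1 20 4 chr2 10 12 > "$data"

# Global step: mean of column 3 over whatever rows arrive on stdin.
col_mean() { awk '{ s += $3 } END { if (NR) print s / NR }'; }
# Local step: subtract a supplied mean from column 3.
center() { awk -v m="$1" 'BEGIN { OFS = "\t" } { print $1, $2, $3 - m }'; }

contigs="$(cut -f1 "$data" | sort -u)"

# Reference: the whole dataset in one run.
whole="$(center "$(col_mean < "$data")" < "$data")"

# Naive scatter: each shard estimates its own "global" parameter.
naive="$(for c in $contigs; do
    shard="$(awk -v c="$c" '$1 == c' "$data")"
    printf '%s\n' "$shard" | center "$(printf '%s\n' "$shard" | col_mean)"
done)"

# Two-phase: phase 1 pools once, phase 2 scatters with the pooled value.
global_mean="$(col_mean < "$data")"
two_phase="$(for c in $contigs; do
    awk -v c="$c" '$1 == c' "$data" | center "$global_mean"
done)"

if [ "$naive" != "$whole" ]; then echo "naive scatter differs from whole-genome"; fi
if [ "$two_phase" = "$whole" ]; then echo "two-phase matches whole-genome"; fi
rm -f "$data"
```

Both messages print for this toy input: the per-shard means (3 and 12) differ from the pooled mean (6), so only the two-phase run reproduces the whole-dataset result.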
The gather is not a cat. Format-aware merge is required: samtools merge (preserve @SQ, dedupe @PG), bcftools concat -a (sorted, sample columns aligned), sort + bedtools merge (BED), bigWigCat, etc. Plain cat of compressed shards almost always produces something that looks fine, fails downstream tools (tabix can't index, IGV won't load), and forces a debug session. Full per-format table with commands and pitfalls in references/gather_methods.md. Always regenerate indexes (.bai/.tbi/.csi) on the gathered output.
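For uncompressed BED-like text shards, a format-aware gather can be sketched with coreutils alone: keep the `#` header once, concatenate the bodies, and sort bytewise. The function name `gather_bed` is hypothetical, and this is a sort-only gather that preserves every record; it does not collapse overlapping intervals the way `bedtools merge` would, and it does not apply to compressed or binary formats.

```shell
#!/usr/bin/env bash
set -euo pipefail

# gather_bed OUT SHARD [SHARD...]
# Write the '#' header of the first shard, then all records sorted by
# contig and start under the C locale.
gather_bed() {
    local out="$1"; shift
    {
        grep '^#' "$1" || true                       # header from first shard only
        { grep -hv '^#' "$@" || true; } \
            | LC_ALL=C sort -k1,1 -k2,2n             # bodies from every shard
    } > "$out"
}

# Two toy shards, deliberately listed out of contig order.
dir="$(mktemp -d)"
printf '#chrom\tstart\tend\nchr2\t5\t10\n'                > "$dir/chr2.bed"
printf '#chrom\tstart\tend\nchr1\t100\t200\nchr1\t20\t30\n' > "$dir/chr1.bed"

gather_bed "$dir/gathered.bed" "$dir/chr2.bed" "$dir/chr1.bed"
cat "$dir/gathered.bed"
# #chrom  start  end
# chr1    20     30
# chr1    100    200
# chr2    5      10

# Same sort check the validation section uses, applied to the records.
grep -v '^#' "$dir/gathered.bed" | LC_ALL=C sort -c -k1,1 -k2,2n && echo "sort OK"
```

A plain `cat` of the same two shards would repeat the header mid-file and leave chr2 ahead of chr1.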
Failure granularity needs a DAG engine. A scatter of 24 jobs where 1 fails should re-run 1, not 24. Use Snakemake or Nextflow — never an ad-hoc bash for-loop, which loses provenance and forces full reruns. Trigger the snakemake or nextflow-development skill for the actual mechanics. Cap shard count: tens to low hundreds is the sweet spot; >1000 shards is scheduler-hostile.
These are cheap and catch ~all silent gather bugs.
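```bash
# 1. Contig coverage matches the scatter list
zcat gathered.vcf.gz | grep -v '^#' | awk '{print $1}' | sort -u \
  | diff - <(sort -u scatter.contigs.txt)

# 2. Per-shard line counts sum to gathered line count (within boundary tolerance)
shard_lines=$(for s in shards/*.vcf.gz; do zcat "$s" | grep -vc '^#'; done | paste -sd+ | bc)
gather_lines=$(zcat gathered.vcf.gz | grep -vc '^#')
echo "shards=$shard_lines gather=$gather_lines"

# 3. File parses, and sort order is valid.
#    bcftools view only proves the file is readable; the awk pass checks that
#    each contig is contiguous and positions never decrease within a contig.
bcftools view gathered.vcf.gz > /dev/null && echo "VCF readable"
zcat gathered.vcf.gz | grep -v '^#' \
  | awk '$1 != c { if (seen[$1]++) exit 1; c = $1; p = 0 } $2 < p { exit 1 } { p = $2 }' \
  && echo "VCF sort OK"

# 4. Index regenerated and current
test gathered.vcf.gz.tbi -nt gathered.vcf.gz || echo "STALE INDEX"
```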
For BAM:

- `samtools quickcheck gathered.bam`: corruption check.
- `samtools idxstats gathered.bam`: every expected contig present.
- Summed shard flagstat totals ≈ gathered flagstat (small drift from @PG dedupe is normal).

For BED-like:

- `sort -c -k1,1 -k2,2n gathered.bed` returns success.
- `LC_ALL=C` is mandatory (RHEL UTF-8 default breaks tabix); preserve the `#` header per CLAUDE.md §2.

For bigwig:

- `bigWigInfo gathered.bw` reports expected number of contigs.
- bigWigAverageOverBed.

For DSS specifically: rules/dss.md lists the post-run verification triple (sacct state, recycling-warning grep, per-chrom row-count parity). Run all three.
This skill defines what and how to validate. Implementation goes elsewhere:
Snakemake: trigger the snakemake skill. Use expand() for static chromosome scatter; use checkpoint for region tiles computed at runtime. Resource scaling via mem_mb_per_cpu = lambda wc: f(wc.chrom). Mind the Snakemake 9 srun memory conflict (use mem_mb_per_cpu, not mem_mb) — see rules/snakemake.md.
Nextflow / nf-core: trigger the nextflow-development or nfcore-module skills. Channel idioms: .combine(), .groupTuple(). Many nf-core modules already implement scatter-gather correctly (gatk4/scatterintervalsbyns, bcftools/concat, samtools/merge) — prefer these over reimplementing.
- references/scatter_strategies.md: chromosome vs region tiling vs 2D grid; boundary handling; resource scaling formulas; shard-count caps.
- references/gather_methods.md: format-by-format gather recipes (BAM, CRAM, VCF, gVCF, BED, BEDGRAPH, bedMethyl, bigwig, FASTQ, HDF5/Zarr, TSV).
- references/poolability.md: per-tool catalogue (fully shardable / two-phase / not shardable), with examples for GATK BQSR, DESeq2, salmon, modkit, deeptools. For DSS specifics, see rules/dss.md.