genome-assembly/short-read-assembly/SKILL.md
Assembles a genome de novo from Illumina short reads with SPAdes (isolate/careful/sc/meta/plasmid/rna modes), MEGAHIT (low-memory, huge datasets), Unicycler (bacterial finishing/hybrid), MaSuRCA (large hybrid), ABySS (Bloom-filter), and Platanus (heterozygous diploids), using multi-k de Bruijn graphs. Covers the repeat-resolution limit, why N50 plateaus at the genome not the depth, GenomeScope2 k-mer profiling first, the heterozygosity/haplotig trap, error-correction erasing rare alleles, GC dropout, and NG50/auN/BUSCO reporting. Use when assembling a bacterial isolate, fungal, small-eukaryotic, single-cell, or metagenome genome from Illumina reads, or when deciding whether short reads can even produce the assembly being asked for.
npx skillsauth add GPTomics/bioSkills bio-genome-assembly-short-read-assemblyInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: SPAdes 4.0+, MEGAHIT 1.2+, Unicycler 0.5+, ABySS 2.3+, GenomeScope2 2.0+, KMC 3.2+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagsSPAdes 4.0 (June 2024) requires Python >=3.8, defaults to GFA v1.2 output with circular-path tags (TP:Z:circular), and is stated by the maintainers to be the last major feature release - pin the version in methods, because k auto-selection and the --isolate/--careful recommendation flipped across 3.x. MEGAHIT preset k-lists (meta-sensitive, meta-large) have changed across versions - confirm against megahit -h before hard-coding. If code throws an error, introspect the installed tool and adapt rather than retrying.
"Assemble a genome from Illumina reads" -> Build a multi-k de Bruijn graph from short reads and walk it into contigs, knowing the contiguity ceiling is set by the genome's repeat structure, not the depth.
spades.py --isolate -1 R1.fq.gz -2 R2.fq.gz -o out -t 16 (bacterial isolate, the genuine sweet spot)megahit -1 R1.fq.gz -2 R2.fq.gz -o out -t 16 (huge/low-memory, metagenome)genomescope2 -i kmer.hist -o gs -k 21 -p 2 (size/het/repeat before parameters)A short read (~150 bp) or paired insert (~300-600 bp) cannot resolve any repeat longer than itself: the de Bruijn graph enters a repeat from multiple unique flanks, traverses an identical internal path, and the assembler's only safe move is to break the contig at the repeat boundary or collapse the copies. Genomic repeats - rRNA operons (~5 kb), transposons, segmental duplications, satellites - are routinely kilobases to megabases. So a short-read assembly is structurally fragmented at every long repeat, and that is biology, not a software defect. Three load-bearing corollaries:
N50 plateaus at the repeat structure, not the sequencing depth. Past ~50-100x for an isolate (~50-60x for a eukaryote), more Illumina does NOT lengthen contigs - it only adds error-correction signal, and above ~150x it actively hurts by amplifying GC bias and duplicate-driven coverage distortion. The reflex "add more reads to get a better assembly" is wrong: the limit is read length.
The assembler is almost never the bottleneck - the input DNA and the genome's repeat/heterozygosity structure are. A clean, high-molecular-weight, PCR-free library at the right depth assembles beautifully with defaults; a degraded/contaminated library assembles badly with every setting and every k. The expert's first move on a bad assembly is to look at the k-mer spectrum, GC-vs-coverage, and duplication rate (GenomeScope2 profiling, see the profiling section below), not to swap assemblers or sweep k.
Short reads are no longer the frontier instrument for a finished genome. For any genome that must be complete or finished, the answer is long reads that span the repeats (-> long-read-assembly, -> hifi-assembly), with short reads relegated to polishing, k-mer-spectrum QC, and the cheap bacterial-isolate/surveillance workhorse. A skill that lets an agent attempt a de novo short-read assembly of a large, repeat-rich, heterozygous eukaryote and then blame parameters is doing harm.
| Tool | Citation | Role | When | |------|----------|------|------| | SPAdes | Bankevich 2012 J Comput Biol | multi-k de Bruijn assembler with mode dispatch | bacterial/fungal/small-euk isolate; the field default | | MEGAHIT | Li 2015 Bioinformatics | succinct (compressed) de Bruijn, ultra-low memory | huge short-read datasets, metagenomes, memory-constrained | | Unicycler | Wick 2017 PLoS Comput Biol | SPAdes wrapper + graph bridging + circularization | bacterial finishing; hybrid Illumina+long-read | | MaSuRCA | Zimin 2013 Bioinformatics | super-reads + mega-reads (hybrid OLC/DBG) | large eukaryotic genomes with mixed data | | ABySS 2.0 | Jackman 2017 Genome Res | Bloom-filter de Bruijn, MPI-parallel | large genomes on memory-constrained HPC | | Platanus / Platanus-allee | Kajitani 2014 Genome Res | bubble-aware DBG for high heterozygosity | highly heterozygous diploids (a real niche) | | GenomeScope2 / Smudgeplot | Ranallo-Benavidez 2020 Nat Commun | k-mer-histogram model: size/het/repeat/ploidy | run FIRST; sets the ceiling reference-free | | Velvet / SOAPdenovo2 | Zerbino 2008 Genome Res / Luo 2012 GigaScience | single-k DBG, superseded | reproduce old papers only; do not start new projects |
Every mainstream short-read assembler is a de Bruijn assembler because all-vs-all overlap (OLC) is infeasible for hundreds of millions of short reads. The single most consequential parameter is k: small k over-connects (collapses repeats, chimeras), large k is more unique (resolves repeats up to length k) but demands higher coverage and is error-fragile. No single k is optimal everywhere in a genome, which is exactly why SPAdes and MEGAHIT iterate over a k-series - do NOT hand-pick a single k.
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Reads not yet profiled | GenomeScope2 on a k-mer histogram (profiling section below) | the spectrum sets size/het/repeat ceiling before any parameter |
| Reads not yet QC'd | -> read-qc/quality-reports, read-qc/adapter-trimming | over-trimming creates coverage holes that fragment the graph |
| Bacterial/archaeal/viral isolate, Illumina only | SPAdes --isolate | multi-k DBG + paired-end repeat resolution; the genuine sweet spot |
| Small bacterial genome, want mismatch/indel cleanup | SPAdes --careful (small genomes ONLY) | runs MismatchCorrector; NOT for large eukaryotes; incompatible with --meta/--rna |
| Bacterial isolate, want a finished/circular assembly | Unicycler (short-only or hybrid) | graph bridging + circularization + rotation to dnaA |
| Single-cell / MDA-amplified | SPAdes --sc | designed for the wildly uneven MDA coverage |
| Highly heterozygous diploid, short reads | Platanus-allee, then purge | bubble-aware; but consider -> hifi-assembly instead |
| Metagenome (recover MAGs) | -> metagenome-assembly (metaSPAdes / MEGAHIT) | community co-assembly; MAG recovery, not single-genome N50 |
| Huge dataset, RAM-constrained | MEGAHIT | succinct DBG, tiny memory footprint |
| Large/repeat-rich/heterozygous eukaryote, finished genome wanted | -> hifi-assembly or -> long-read-assembly | short reads mathematically cannot span the repeats |
| Assembly built, judging quality | -> assembly-qc (NG50 + auN + BUSCO + Merqury QV) | never report N50 alone; contiguity is not correctness |
spades.py --isolate -1 R1.fq.gz -2 R2.fq.gz -o out -t 16 -m 64 # recommended default for isolates; does NOT run MismatchCorrector
spades.py --careful -1 R1.fq.gz -2 R2.fq.gz -o out # SMALL genomes only; runs MismatchCorrector; NOT eukaryotes; not with --meta/--rna
spades.py --sc -1 R1.fq.gz -2 R2.fq.gz -o out # single-cell/MDA (default k 21,33,55)
spades.py --plasmid -1 R1.fq.gz -2 R2.fq.gz -o out # plasmidSPAdes (coverage-based extraction)
spades.py -1 R1.fq.gz -2 R2.fq.gz -o out -k 21,33,55,77 # explicit multi-k LIST (odd, <= read length) only if overriding
spades.py --only-assembler -1 R1.fq.gz -2 R2.fq.gz -o out # skip BayesHammer (reads pre-corrected, or het/pooled data)
Picking the wrong mode is the most common SPAdes error. --isolate and --careful solve different problems: modern SPAdes pushes --isolate as the isolate default (tuned for high, even coverage), while --careful runs BWA-based MismatchCorrector to cut mismatches/short indels on small genomes only - it is explicitly slow, memory-heavy, and low-benefit on medium/large eukaryotes, and is incompatible with --meta and --rna. --isolate and --careful are mutually exclusive - SPAdes aborts (cannot specify --careful in isolate mode) if both are given; pick one. --meta, --rna, --bio dispatch to genuinely different pipelines (metaSPAdes, rnaSPAdes, biosyntheticSPAdes) with different graph models - they are NOT genome assembly. BayesHammer (the built-in Illumina error-corrector) runs by default; --only-assembler skips it. Outputs: contigs.fasta, scaffolds.fasta, assembly_graph.gfa. Always report contig metrics, not just scaffold metrics (scaffold gaps are estimated N-runs, not sequence).
megahit -1 R1.fq.gz -2 R2.fq.gz -o out -t 16 # default k 21,29,39,59,79,99,119,141; min-count 2 (drops singleton error k-mers)
megahit -1 R1.fq.gz -2 R2.fq.gz --presets meta-sensitive -o out # min-count 1; verify k-list against megahit -h
unicycler -1 R1.fq.gz -2 R2.fq.gz -o out -t 16 # short-read-only bacterial finishing
unicycler -1 R1.fq.gz -2 R2.fq.gz -l long.fq.gz -o out # hybrid (short for accuracy + long to bridge repeats)
abyss-pe name=asm k=96 B=2G in='R1.fq.gz R2.fq.gz' # B = Bloom-filter size (the 2.0 low-memory mode); single k
MEGAHIT's succinct DBG gives a tiny memory footprint at some cost in contiguity/accuracy versus metaSPAdes - the default for huge datasets and a metagenome alternative. Unicycler wraps SPAdes, cleans the graph, bridges repeats, and rotates circular replicons to dnaA; hybrid mode (short + ONT/PacBio) is the bacterial finishing standard but a short-only run still cannot beat the repeat limit. ABySS 2.0's Bloom-filter mode enables large-genome assembly on modest hardware but is single-k and niche today.
kmc -k21 -t16 -m64 -ci1 -cs10000 @reads.lst kmcdb tmp/ # k=21: long enough to be mostly unique, short enough for k-mer coverage
kmc_tools transform kmcdb histogram kmer.hist -cx10000
genomescope2 -i kmer.hist -o gs_out -k 21 -p 2 # -p ploidy; estimates size, heterozygosity, repeat content
GenomeScope2 fits a model to the k-mer frequency histogram and returns, reference-free and before assembly, the genome size, heterozygosity rate, repeat content, and ploidy. A single histogram peak ~ haploid/homozygous; two peaks (a het peak at ~half the homozygous-peak coverage) ~ a heterozygous diploid that will fragment and inflate a short-read assembly. This is the ceiling check: it predicts whether short reads can even produce the assembly being asked for, and it gives the expected size to report assembly total against (assembly >> estimate = un-purged haplotigs, not a big genome).
Trigger: re-sequencing deeper because contigs are short. Mechanism: contiguity is capped by read length vs repeat length, not depth. Symptom: N50 unchanged past ~50-100x; cost wasted. Fix: accept the repeat-determined ceiling, or switch to long reads (-> hifi-assembly).
Trigger: "we used k=127 because it gave the best N50". Mechanism: large k demands high effective k-mer coverage and is error-fragile; climbing k until N50 peaks is an N50-gaming search that rewards chimeric mega-contigs. Symptom: gappy assembly, or a suspiciously contiguous mis-joined one. Fix: let multi-k auto-select; if overriding, give a small->large list, never a point.
--careful on a eukaryoteTrigger: cargo-culting --careful from a bacterial tutorial. Mechanism: MismatchCorrector is small-genome-only; on a large genome it is slow, memory-hungry, low-benefit. Symptom: the run stalls/OOMs for little quality gain. Fix: --isolate (no --careful) for isolates; defer base accuracy to dedicated polishing (-> assembly-polishing).
Trigger: running BayesHammer/BFC/Lighter on a diploid, pooled, or community sample. Mechanism: k-mer-spectrum correctors assume rare k-mers are errors and edit them toward the consensus - but the minor allele / rare strain IS those rare k-mers. Symptom: real low-frequency variation silently flattened before assembly. Fix: --only-assembler for het/pooled/meta data; correct only clonal isolates.
Trigger: assembly total ~1.5-2x the GenomeScope estimate, high BUSCO-Duplicated. Mechanism: both haplotypes assembled separately (un-purged haplotigs), not a real duplication. Symptom: "genome size" wrong by up to 2x. Fix: purge_dups / redundans, or -> hifi-assembly; report size against the profiling estimate.
Trigger: "why isn't my isolate one contig?" Mechanism: ~7 rRNA operons and IS elements each exceed the insert and force a break - the contig count is roughly a count of long repeats. Symptom: 30-100 contigs from a perfect run. Fix: accept it for typing/AMR/phylogenetics; for a finished single contig, use Unicycler hybrid or long reads.
Trigger: re-assembling to close a gap whose flanks have normal coverage. Mechanism: PCR/chemistry under-represents GC-extreme regions; the bias, not the depth, is the problem. Symptom: an interior coverage hole at extreme GC that stays missing at 30x and 300x. Fix: upstream (PCR-free prep, alternative polymerase) or different chemistry, not a parameter.
| Threshold | Source | Rationale | |-----------|--------|-----------| | Bacterial isolate coverage ~50-100x | field convention | below ~30x error/true k-mers blur and gaps proliferate; 50-100x = clean correction + strong k-mer peak | | Above ~100-150x: diminishing/negative | field convention | does not fix repeats; amplifies GC bias; duplicates distort the coverage model and GenomeScope fit | | Eukaryote short-read coverage ~50-60x | field convention | higher wastes money on unresolvable repeats; lower starves the graph | | GenomeScope/profiling k = 21 | field convention | long enough to be mostly unique, short enough for adequate k-mer coverage | | k constraint: odd, <= read length | DBG property | even k can self-loop on palindromes; k > read is impossible | | Heterozygous assembly inflation ~1.5-2x haploid | het diploid norm | both haplotypes assembled separately; do not report as genome size | | MEGAHIT min-count default 2 | MEGAHIT default | filters singleton (error) k-mers; presets drop to 1 for low-abundance metagenome members | | Report NG50 + auN + BUSCO + a reference-free correctness check | reporting convention | N50 alone is gamed by dropping sequence and inflated by mis-joins (auN = sum(L_i^2)/G) |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Out of memory | large/complex dataset, high coverage | lower -m; use MEGAHIT (succinct DBG) or ABySS Bloom-filter mode |
| Assembly ~2x expected size, high BUSCO-Duplicated | un-purged haplotigs (heterozygosity) | purge_dups/redundans; Platanus-allee; or -> hifi-assembly |
| Many contigs from a clean bacterial isolate | rRNA operons / IS elements exceed the insert (biology) | accept for draft uses; Unicycler hybrid or long reads for a finished genome |
| Extra contigs at odd coverage | contamination/index hopping, not a novel replicon | coverage-vs-GC screen (-> contamination-detection) |
| Poor assembly after heavy trimming | over-trimming created coverage holes | trim adapters + bad tails only; keep reads long |
| N50 high but downstream results wrong | mis-join / N50-gaming; high N50 != correct | report NG50 + auN + BUSCO; reference-free correctness (-> assembly-qc) |
| --careful rejected / errors | used with --meta or --rna | drop --careful; it is small-genome isolate-only |
--careful reflexesdevelopment
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.