genome-assembly/metagenome-assembly/SKILL.md
Assembles microbial-community sequencing into metagenome-assembled genomes (MAGs) with metaFlye (ONT), metaSPAdes/MEGAHIT (Illumina), and hifiasm-meta/metaMDBG (PacBio HiFi), then recovers genomes via multi-binner consolidation (MetaBAT2, MaxBin2, CONCOCT, SemiBin2, VAMB -> DAS_Tool) and QCs them against MIMAG with CheckM2, GUNC, and GTDB-Tk. Covers why a metagenome is not a genome (uneven coverage, micro-diversity, strain collapse to consensus), differential-coverage binning, co-assembly vs per-sample, the rRNA-operon collapse that fails short-read MAGs, and strain resolution with inStrain. Use when reconstructing genomes from a microbiome, soil, ocean, or gut community, recovering MAGs, or resolving strain-level variation.
npx skillsauth add GPTomics/bioSkills bio-genome-assembly-metagenome-assemblyInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: Flye 2.9+, SPAdes 3.15+ (metaSPAdes), MEGAHIT 1.2+, hifiasm-meta 0.3+, metaMDBG 1.0+, MetaBAT2 2.15+, MaxBin 2.2.7+, CONCOCT 1.1+, SemiBin 2.0+, VAMB 4.1+, DAS_Tool 1.1.6+, CheckM2 1.0+, GUNC 1.0+, GTDB-Tk 2.4+, inStrain 1.7+, minimap2 2.26+, samtools 1.19+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturesGTDB-Tk results track the reference-package RELEASE (e.g. R214 vs R220); the DB release MUST match the GTDB-Tk binary or classification silently fails. CheckM2 and GUNC each download their own DIAMOND DB. SemiBin2's pretrained --environment models are versioned. If code throws an error, introspect the installed tool and adapt rather than retrying.
"Assemble genomes from my metagenome" -> Co-assemble a community at uneven, strain-mixed coverage, then bin the contigs into a set of consensus population genomes (MAGs) and QC each against MIMAG. The deliverable is MAGs, not a single assembly.
flye --meta --nano-hq reads.fq (ONT), spades.py --meta -1 R1.fq -2 R2.fq or megahit -1 R1.fq -2 R2.fq (Illumina), hifiasm_meta/metaMDBG (HiFi); then binners -> DAS_Tool -> checkm2 predict + gunc run + gtdbtk classify_wfEvery isolate assembler is built on the premise that the true sequence sits at roughly one depth, so a coverage drop or spike signals a repeat or an error. In a community that premise is false by construction: an abundant species at 500x and a rare one at 3x are both real. Running plain SPAdes/Unicycler or single-genome Flye on a community treats the abundance spread and strain bubbles as errors to "fix" and produces garbage. Use --meta modes. Three consequences cascade:
| Tool | Citation | Mechanism / role | When |
|------|----------|------------------|------|
| metaFlye | Kolmogorov 2020 Nat Methods | repeat-graph long-read meta-assembler (--meta mandatory) | ONT/CLR community de novo; polish after |
| metaSPAdes | Nurk 2017 Genome Res | strain-aware multi-k de Bruijn (spades.py --meta) | Illumina, contiguity priority; ONE paired library only |
| MEGAHIT | Li 2015 Bioinformatics | succinct de Bruijn, low-memory | huge/complex Illumina co-assemblies, soil; lower contiguity |
| hifiasm-meta | Feng 2022 Nat Methods | strain-resolved HiFi string graph | PacBio HiFi communities; keeps strains apart |
| metaMDBG | Benoit 2024 Nat Biotechnol | minimizer de Bruijn for HiFi | HiFi; often ~2x the HQ circular MAGs, low RAM |
| OPERA-MS | Bertrand 2019 Nat Biotechnol | short-read meta scaffolded by long reads | hybrid short + long |
| Binner | Citation | Signal | When |
|--------|----------|--------|------|
| MetaBAT2 | Kang 2019 PeerJ | tetranucleotide freq (TNF) + coverage, parameter-free | fast default workhorse (-m 1500) |
| MaxBin2 | Wu 2016 Bioinformatics | EM over TNF + marker genes + coverage | single/few samples |
| CONCOCT | Alneberg 2014 Nat Methods | GMM on composition+coverage of cut-up contigs | many samples; 4-step pipeline, not one command |
| SemiBin2 | Pan 2023 Bioinformatics | self-supervised contrastive deep learning | current SOTA; short + long; pretrained env models |
| VAMB | Nissen 2021 Nat Biotechnol | variational autoencoder of coabundance + k-mer | multi-sample; separates close strains |
| DAS_Tool | Sieber 2018 Nat Microbiol | dereplicate-aggregate-score across binners (consolidation) | ALWAYS run; non-redundant set beats any single binner |
| Scenario | Recommended | Why |
|----------|-------------|-----|
| Illumina, complex/huge/soil, low RAM | MEGAHIT --presets meta-sensitive | succinct dBG fits in memory; multi-library |
| Illumina, contiguity priority, tractable size | metaSPAdes (spades.py --meta) | strain-aware repeat resolution; merge libraries first (one paired lib only) |
| ONT-only community | metaFlye --meta --nano-hq -> polish | repeat-graph meta mode; ONT needs polishing -> assembly-polishing |
| PacBio HiFi community | hifiasm-meta or metaMDBG | complete circular strain-resolved MAGs; fixes rRNA collapse |
| Hybrid short + long | OPERA-MS | short-read meta scaffolded with long reads |
| Recover MAGs (any assembly) | >=2-3 binners -> DAS_Tool -> CheckM2 + GUNC -> GTDB-Tk | ensemble beats one binner; chimera + taxonomy gates |
| Only ONE sample | composition-only binning, expect weak bins | differential coverage needs multiple samples; add samples, not tuning |
| Multiple samples available | map ALL samples to each assembly for binning depth | differential-coverage is the strongest binning signal |
| Strain-level question | -> inStrain on reads mapped to MAGs | consensus MAGs blur strains; needs read-level microdiversity |
| Read-based taxonomy / rare biosphere | -> metagenomics/kraken-classification | assembly is blind below the abundance-detection limit |
| Reads not QC'd / host-contaminated | -> long-read-sequencing/long-read-qc | remove host reads vs a T2T reference before assembly |
| MAG contamination forensics | -> contamination-detection | detailed CheckM2/GUNC interpretation |
flye --meta --nano-hq ont.fastq.gz --out-dir flye_out -t 32
# --meta uneven-coverage metagenome mode (REQUIRED for communities)
# read-type flag (mutually exclusive): --nano-hq (Guppy5+/Q20) | --nano-raw (older) |
# --pacbio-hifi | --pacbio-raw (CLR)
# outputs: assembly.fasta, assembly_graph.gfa, assembly_info.txt (circularity flag in col 'circ.')
ONT contigs are contiguous but error-prone (indels in homopolymers); polish before downstream use (-> assembly-polishing; medaka needs the matching basecaller model). HiFi usually needs no polishing.
# metaSPAdes -- contiguity priority; exactly ONE paired library
spades.py --meta -1 R1.fastq.gz -2 R2.fastq.gz -o spades_out -t 32 -m 500
# -m memory cap in GB (SPAdes aborts if exceeded); -k auto by default
# outputs: contigs.fasta, scaffolds.fasta
# MEGAHIT -- huge/low-RAM; accepts comma-separated multiple libraries
megahit -1 a1.fq.gz,b1.fq.gz -2 a2.fq.gz,b2.fq.gz -o megahit_out -t 32 \
--presets meta-sensitive --min-contig-len 1000
# --presets meta-sensitive | meta-large (huge complex); raise --min-contig-len to ~1000 for binning
metaSPAdes --meta supports exactly ONE paired-end library -- a real constraint people miss; concatenate libraries first or use MEGAHIT for many. metaSPAdes is heavier on RAM/time and chokes on soil-scale co-assembly; MEGAHIT assembled a 252 Gbp soil set on one node at the cost of somewhat more fragmentation.
hifiasm_meta -t 32 -o asm hifi.fastq.gz
awk '/^S/{print ">"$2"\n"$3}' asm.p_ctg.gfa > asm.p_ctg.fa # GFA -> FASTA
metaMDBG asm --out-dir mdbg_out --in-hifi hifi.fastq.gz --threads 32 # often ~2x HQ circular MAGs
# Map reads back to the assembly -- per sample for differential coverage
minimap2 -ax map-ont -t 32 contigs.fa reads.fq.gz | samtools sort -@ 32 -o s1.sorted.bam -
samtools index s1.sorted.bam
jgi_summarize_bam_contig_depths --outputDepth depth.txt s1.sorted.bam s2.sorted.bam # all samples
metabat2 -i contigs.fa -a depth.txt -o metabat/bin -m 1500 -t 32 # -m 1500 = min contig (do not go below ~1000)
run_MaxBin.pl -contig contigs.fa -abund abund1.txt -out maxbin/bin -thread 32 -min_contig_length 1000
SemiBin2 single_easy_bin -i contigs.fa -b s1.sorted.bam -o semibin_out # add --environment human_gut for a pretrained model
CONCOCT is a 4-step pipeline (cut_up_fasta.py -> concoct_coverage_table.py -> concoct -> merge_cutup_clustering.py -> extract_fasta_bins.py), not one command. Differential-coverage binning needs MULTIPLE samples with abundance variation; with one sample binning collapses to weak composition-only signal -- more samples, not more tuning.
# Convert each binner's output to contig2bin tables, then aggregate-and-score
Fasta_to_Contig2Bin.sh -i metabat/ -e fa > metabat.tsv
Fasta_to_Contig2Bin.sh -i maxbin/ -e fasta > maxbin.tsv # MaxBin emits .fasta
gunzip -k semibin_out/output_bins/*.gz 2>/dev/null || true # SemiBin2 bins are gzipped
Fasta_to_Contig2Bin.sh -i semibin_out/output_bins/ -e fa > semibin.tsv
DAS_Tool -i metabat.tsv,maxbin.tsv,semibin.tsv -l metabat,maxbin,semibin \
-c contigs.fa -o dastool/DAS --write_bins -t 32
# DAS_Tool assumes the SAME contig set across binners -- feeding bins from different assemblies is a silent error
checkm2 predict --input dastool/DAS_DASTool_bins/ -x fa --output-directory checkm2_out -t 32
gunc run --input_dir dastool/DAS_DASTool_bins/ --file_suffix .fa --out_dir gunc_out --threads 32
gtdbtk classify_wf --genome_dir dastool/DAS_DASTool_bins/ -x fa --out_dir gtdbtk_out --cpus 32
Run CheckM2 AND GUNC: CheckM2 counts marker copy number (completeness/contamination), GUNC tests whether a genome's genes share one lineage (chimerism). A bin made of two half-genomes can score high completeness, low contamination, and still be a chimera -- GUNC is the orthogonal catch. Dereplicate MAGs across samples (dRep ~95% ANI species, ~99% strain) before reporting counts.
# Consensus MAGs blur strains; recover read-level microdiversity
inStrain profile sample.sorted.bam mags.fa -o instrain_out -p 16 -g genes.fna
inStrain compare -i instrain_A instrain_B -o instrain_compare # popANI: shared-strain detection across samples
Trigger: plain spades.py/flye (no --meta) on community reads. Mechanism: uniform-coverage assumption deletes rare-taxon contigs and mis-resolves strain bubbles as errors. Symptom: few short contigs, missing abundant taxa. Fix: --meta mode for every meta-assembler.
Trigger: "we used SemiBin2 because it's SOTA," one binner. Mechanism: each binner recovers a partially-different genome set. Symptom: real MAGs left on the table; lower count than peers. Fix: run >=2-3 binners -> DAS_Tool consolidation.
Trigger: MetaBAT2/CONCOCT on one sample, surprised bins are bad. Mechanism: one coverage value -> composition (TNF) only, which is weak (related genera share TNF). Symptom: poor, split bins. Fix: more samples with abundance variation, then map all back; do not retune.
Trigger: calling a 95%-complete/2%-contam short-read MAG "high-quality." Mechanism: conserved + multi-copy rRNA operon tangles the short-read graph and stays unbinned. Symptom: MAG fails MIMAG HQ for missing 16S/23S/5S despite great completeness. Fix: check the FULL MIMAG HQ definition (rRNA + tRNA); use HiFi/long reads to span the operon.
Trigger: trusting completeness/contamination alone. Mechanism: non-overlapping markers from two half-genomes score high completeness, low contamination. Symptom: "clean" MAG that is a chimera. Fix: always pair CheckM2 with GUNC -> contamination-detection.
Trigger: per-allele/per-strain interpretation of one MAG. Mechanism: the assembler collapsed strains to consensus. Symptom: strain findings that read-level data contradict. Fix: inStrain popANI or strain-aware (hifiasm-meta) assembly.
Trigger: "200 MAGs cover 60% of reads" treated as the whole community. Mechanism: the rare biosphere never crosses the assembly detection limit. Symptom: species richness massively undercounted. Fix: complement with read-based profiling -> metagenomics/kraken-classification.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| MIMAG high-quality: completeness >90%, contamination <5%, AND 16S/23S/5S rRNA + >=18 tRNAs | Bowers 2017 Nat Biotechnol | rRNA criterion is the silent killer; short-read MAGs fail it |
| MIMAG medium-quality: completeness >=50%, contamination <10% | Bowers 2017 | most short-read MAGs land here |
| Min contig length for binning ~1000-1500 bp | MetaBAT2 default 1500 | shorter contigs have unreliable TNF/coverage -> noise/chimeras |
| Differential-coverage binning needs N >= ~3-5 samples with abundance variation | binning practice | one coverage value cannot separate co-abundant genomes |
| Assembly detection limit ~3-5x coverage | de Bruijn requirement | below this the rare biosphere does not assemble at all |
| MAG dereplication 95% ANI (species), 99% (strain) | dRep/ANI convention | collapse redundant per-sample MAGs before counting |
| metaSPAdes input: exactly ONE paired library | --meta constraint | merge libraries or use MEGAHIT for many |
| N50 / contiguity | not a community metric | report MAG count + MIMAG tiers instead |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Few short contigs, missing abundant taxa | isolate mode on a community | add --meta |
| metaSPAdes rejects multiple libraries | --meta supports one paired library | concatenate, or use MEGAHIT |
| metaSPAdes aborts / out of memory | RAM cap hit on a complex co-assembly | raise -m, or switch to MEGAHIT |
| Poor bins from one sample | no differential-coverage signal | add samples; map all back for depth |
| "HQ MAG" lacks rRNA | rRNA-operon collapse in short reads | full MIMAG check; HiFi/long reads |
| Clean CheckM2 but suspect bin | chimera invisible to marker counts | run GUNC alongside CheckM2 |
| GTDB-Tk classify_wf errors | DB release != binary version | match the GTDB reference-package release |
| Strain finding not reproducible | consensus MAG blurs strains | inStrain popANI / strain-aware assembly |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.