metagenomics/metaphlan-profiling/SKILL.md
Profiles shotgun metagenomes to species/SGB relative abundance with MetaPhlAn 4's clade-specific marker genes (bowtie2 short reads, minimap2 long reads). Covers why a MetaPhlAn percentage is a cell fraction (genome-size-normalized taxonomic abundance) and must never be merged with Kraken/Bracken read fractions, kSGB vs uSGB units for quantifying database-absent taxa, the unknown-fraction rescaling and its version-default flip, --index pinning as a batch variable, and when mOTUs3 or sourmash gather beat marker profiling. Use when profiling who-is-there with high precision, needing HMP-comparable species abundances, quantifying novel taxa, or deciding marker-gene vs k-mer profiling. For k-mer classification see kraken-classification; for strains see strain-tracking; for 16S amplicon see the microbiome category.
npx skillsauth add GPTomics/bioSkills bio-metagenomics-metaphlanInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: MetaPhlAn 4.1+, Bowtie2 2.5.3+, minimap2 2.26+, pandas 2.2+.
Before using code patterns, verify installed versions match. If versions differ:
metaphlan --version then metaphlan --help to confirm flag names and defaultspip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
The marker DATABASE version is the experimental variable. Results track the index (e.g. mpa_vJun23_CHOCOPhlAnSGB_202403 vs the live vJan25 build); MetaPhlAn 3 and MetaPhlAn 4 databases are not interchangeable. Pin --index and report it like a reagent lot. Two flags were renamed in 4.2: --bowtie2out -> --mapout and --bowtie2db -> --db_dir (the --input_type value bowtie2out likewise becomes mapout); unknown-fraction estimation flipped from opt-in (--unclassified_estimation) to on-by-default (--skip_unclassified_estimation to disable). Confirm against metaphlan --help.
"Who is in my metagenome, by cell fraction?" -> Detect which clades' private marker genes are present, average their per-marker coverage, and normalize to a genome-size-aware relative abundance - so the percentage is a fraction of cells, not of reads.
metaphlan reads_1.fq.gz,reads_2.fq.gz --input_type fastq --index mpa_vJun23_CHOCOPhlAnSGB_202403 -o profile.txt --mapout sample.bz2Scope: marker-gene species/SGB profiling and its alternatives (mOTUs3, sourmash gather). K-mer read classification -> kraken-classification. Strain-resolved SNV haplotypes -> strain-tracking. Functional profiling -> functional-profiling. Compositional stats and plotting -> metagenome-visualization. 16S amplicon -> the microbiome category.
A MetaPhlAn percentage estimates what fraction of the CELLS in the community belong to a clade - a genome-size-normalized taxonomic abundance. A Kraken/Bracken percentage estimates what fraction of the READS came from a clade - a sequence abundance. There is no sample-independent conversion between them, because sequence abundance under-estimates small-genome microbes and over-estimates large-genome ones by a factor that depends on the whole community's genome-size distribution (Sun 2021 Nat Methods 18:618). Therefore:
Mnemonic: markers measure WHO is there (cells); k-mers measure HOW MUCH DNA is there (reads).
MetaPhlAn 4's atomic taxon is the SGB (species-level genome bin, a ~95% ANI cluster), not an NCBI species. A kSGB contains a cultured reference genome and gets a Latin name; a uSGB is defined only from MAGs (>=5 required) and is reported with a placeholder ID and no name. Quantifying uSGBs - taxa with no reference genome - is MetaPhlAn 4's headline advance over MetaPhlAn 3 and explains ~20% more gut reads, >40% more in under-characterized environments (Blanco-Miguez 2023 Nat Biotechnol 41:1633). Consequences: an unnamed t__SGB... row is a real quantified taxon - do not drop it; one named species can split into several SGBs; MetaPhlAn 3 species profiles and MetaPhlAn 4 SGB profiles are not row-compatible (use sgb_to_gtdb_profile.py for GTDB names). The t__ tier is the SGB, NOT a strain - strain resolution is StrainPhlAn (-> strain-tracking).
| Tool | Citation | Mechanism / role | When | |------|----------|------------------|------| | MetaPhlAn 4 | Blanco-Miguez 2023 Nat Biotechnol 41:1633 | ~189 clade-specific markers/SGB; robust coverage average | high-precision species/SGB %, HMP-comparable, characterized communities | | mOTUs3 | Ruscheweyh 2022 Microbiome 10:212 | 10 universal single-copy marker genes | higher recall of novel/divergent taxa; transparent marker-hit confidence | | sourmash gather | Pierce 2019 F1000Res 8:1006 | FracMinHash containment, minimum metagenome cover | genome-resolved hits vs all of GTDB + an honest unknown fraction | | Kraken2 + Bracken | Wood 2019 Genome Biol 20:257 | k-mer LCA + Bayesian reestimation | -> kraken-classification; max recall, willing to filter false positives |
| Scenario | Recommended | Why | |----------|-------------|-----| | Human gut species %, low false positives, HMP-comparable | MetaPhlAn 4 | curated SGB markers; high precision; huge corpus | | Quantify novel / database-absent taxa | MetaPhlAn 4 uSGBs OR mOTUs3 ext-mOTUs | reference-independent units | | Maximize recall in under-characterized environments | mOTUs3 or sourmash gather | universal markers / containment vs everything | | Genome-resolved + explicit unknown fraction | sourmash gather | minimum metagenome cover reports what is unexplained | | Max recall of every read, speed | -> kraken-classification | k-mer LCA; filter the false-positive tail | | Need cell fraction, not read fraction | MetaPhlAn / mOTUs | k-mer tools report read fraction | | Strain-level resolution | -> strain-tracking | per-SNV haplotypes, not species profiling | | Composition stats next | -> metagenome-visualization (CLR/ANCOM-BC) | output is closed; naive stats on percentages are invalid |
# Paired-end reads are passed as ONE comma-separated argument (MetaPhlAn treats them as two
# single-end files - it does not use insert/pairing info). Pin the index for reproducibility.
metaphlan reads_R1.fastq.gz,reads_R2.fastq.gz \
--input_type fastq \
--index mpa_vJun23_CHOCOPhlAnSGB_202403 \ # pin it; DB version is a batch variable
--nproc 8 \
--mapout sample.map.bz2 \ # cache the read->marker mapping (pre-4.2: --bowtie2out)
--output_file profile.txt
Goal: Try different analysis types, levels, or estimator settings without realigning.
Approach: Save the mapping once with --mapout, then re-run from it with --input_type mapout (pre-4.2: bowtie2out). Realignment is the expensive step; everything downstream is free.
metaphlan sample.map.bz2 --input_type mapout \
--tax_lev s \ # k,p,c,o,f,g,s,t (t = SGB tier)
--stat_q 0.2 \ # quantile-truncated robust mean of per-marker coverages: drop top/bottom 20%, average the middle 60%
--output_file profile_species.txt
--stat_q down-weights markers in HGT/mobile and conserved cross-clade regions; the default 0.2 is a sensible robust mean. Changing it changes the reported abundances - report it if it is changed. Long reads (4.1+) route to minimap2 with --long_reads.
Relative abundance sums to 100% only over DETECTED clades. With unknown estimation OFF (pre-4.2 default), known taxa absorb 100% and the database-absent community is invisible - overstating every known taxon. With it ON (4.2 default), an UNCLASSIFIED row appears and every known abundance shrinks proportionally. In soil/marine/rumen the unknown fraction can be the largest "taxon" in the sample.
# 4.2 default includes the UNCLASSIFIED row. To force it on pre-4.2: --unclassified_estimation
# For SAM input, pass --nreads <total> or the unknown fraction is wrong.
metaphlan reads.fastq.gz --input_type fastq -o profile.txt # 4.2: UNCLASSIFIED row present by default
Pre-4.2-default and 4.2-default outputs are not comparable abundances - mixing them is a hidden batch effect.
# All inputs MUST come from the SAME database index or rows mismatch silently.
merge_metaphlan_tables.py profiles/*_profile.txt > merged_abundance.txt
sgb_to_gtdb_profile.py -i merged_abundance.txt -o merged_gtdb.txt # recover GTDB names for SGBs
Trigger: putting MetaPhlAn and Bracken abundances in one matrix or correlating them. Mechanism: cell fraction vs read fraction - different quantities (Sun 2021). Symptom: "tools disagree," spurious scatter, broken ML/differential-abundance features. Fix: keep them separate; if harmonizing, convert via genome length (Bracken counts / genome length, renormalize) and accept it is approximate.
Trigger: profiles built with different MetaPhlAn versions or --unclassified_estimation settings. Mechanism: the UNCLASSIFIED row rescales all known abundances. Symptom: a batch effect aligned to processing date, not biology. Fix: pin one version and one unknown-estimation setting across the whole study; for environmental samples always include the unknown fraction.
Trigger: alarm at <1% of reads mapping. Mechanism: only clade-specific markers are targeted; low mapping is expected. Symptom: unnecessary re-runs. Fix: low mapping is normal; a large unknown fraction means database-absent community (consider mOTUs3/sourmash), and a very low rate plus low microbial yield suggests host contamination -> contamination-controls.
Trigger: profiling soil/marine and reporting only named taxa. Mechanism: a marker tool is structurally blind to clades whose markers are not in the database (high precision, low recall; CAMI2 Meyer 2022). Symptom: most of the community missing; lowering thresholds does not recover it. Fix: use mOTUs3 (universal markers) or sourmash gather (containment vs all of GTDB), or accept Kraken false positives and filter - do not just lower MetaPhlAn thresholds and call it sensitivity.
Trigger: merging profiles built on different --index builds. Mechanism: SGB IDs and marker sets differ between releases. Symptom: rows silently fail to align; abundances look implausible. Fix: rebuild all samples on one pinned index before merging.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| Presence gate ~20% of an SGB's markers | Blanco-Miguez 2023 Nat Biotechnol 41:1633 | enough markers covered to call a clade present (precision mechanism) |
| --stat_q 0.2 default | MetaPhlAn docs | truncated mean drops top/bottom 20% of marker coverages; robust to HGT/conserved outliers |
| uSGB requires >=5 MAGs | Blanco-Miguez 2023 Nat Biotechnol 41:1633 | false-positive control for unnamed taxa |
| Pin --index | MetaPhlAn docs | DB version changes profiles for identical reads; report like a reagent lot |
| --min_cu_len 2000 | MetaPhlAn docs | minimum cumulative marker length to report a clade (low-evidence filter) |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| "No database found" | DB not installed | metaphlan --install (optionally --index <ver> --db_dir DIR) |
| Output all zeros | wrong --input_type or empty/host-only input | match --input_type to the file; check microbial yield |
| --bowtie2out not recognized | running MetaPhlAn 4.2+ | use --mapout / --input_type mapout (4.2 rename) |
| Rows mismatch after merge | profiles from different indices | rebuild on one pinned --index |
| SAM input unknown fraction wrong | --nreads not supplied | pass total read count with --nreads |
| Viral calls look unreliable | --add_viruses calls are low-confidence | treat vSGB calls cautiously (CAMI2) |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.