microbiome/functional-prediction/SKILL.md
Predicts community functional POTENTIAL from 16S/ITS amplicon ASVs with PICRUSt2 (or q2-picrust2) by phylogenetic interpolation of reference-genome gene content - EPA-ng placement, gappa, castor hidden-state prediction of KO/EC/Pfam copy number, 16S copy-number normalization, and MinPath MetaCyc/KEGG pathways - gated by the NSTI quality index. Covers why predicted function is taxonomy re-encoded (never measured gene content and never activity), the mandatory NSTI report (--max_nsti 2 silently drops novel ASVs), why accuracy IS reference coverage (gut Spearman ~0.8, soil/marine collapse), the circularity trap, and Tax4Fun2/FAPROTAX/BugBase alternatives. Use when inferring KO/EC/MetaCyc potential from an ASV table, gating on NSTI, or choosing a prediction method. For MEASURED shotgun function see metagenomics/functional-profiling; for enrichment of KO lists see pathway-analysis/go-enrichment; for DA of predicted tables see differential-abundance.
npx skillsauth add GPTomics/bioSkills bio-microbiome-functional-predictionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: PICRUSt2 2.5+, pandas 2.2+.
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
The bundled REFERENCE is the version that matters. PICRUSt2 ships its own reference tree, alignment, and per-genome trait tables (16S/KO/EC/Pfam/COG/TIGRFAM) for ~20,000 genomes - there is no separate multi-GB download as in HUMAnN. That fixed reference is the entire ceiling on accuracy, and NSTI is computed against THAT reference. The reference is versioned with the PICRUSt2 release; record the release so a prediction is reproducible.
"Predict the functional pathways from my 16S data" -> Place ASVs on a reference genome tree and report the gene content of their nearest sequenced relatives - because PICRUSt2 never sees a functional gene from the sample, it infers POTENTIAL from who-is-there.
picrust2_pipeline.py -s asv_seqs.fna -i asv_table.biom -o picrust2_out -p 8 --max_nsti 2Scope: PREDICTED gene-content potential from amplicon ASVs. MEASURED shotgun function (HUMAnN) -> metagenomics/functional-profiling. ASV table + rep-seqs come from amplicon-processing + taxonomy-assignment. The QIIME2 q2-picrust2 path -> qiime2-workflow. DA of predicted tables -> differential-abundance (compositional, same theory as metagenomics/abundance-estimation). Reading KO/pathway lists -> pathway-analysis/go-enrichment.
PICRUSt2 never sequences a functional gene from the sample. It places each ASV on a tree of ~20,000 reference genomes, asks "what genes do this ASV's nearest sequenced relatives carry?", and reports that guess as the community's function. Every output number is the gene content of OTHER organisms, weighted by how much of the 16S resembles them. Three corollaries each common misuse violates:
Organize the analysis around defending these three (report NSTI, forbid activity verbs, do not double-count taxonomy as a second finding), not around listing flags.
State this explicitly when reporting. Each rung is a real measurement layer above the one below:
predicted potential (PICRUSt2, this skill) < measured gene carriage (HUMAnN / shotgun, metagenomics/functional-profiling) < measured transcription (metatranscriptomics) < measured flux (fluxomics/metabolomics).
PICRUSt2 is the bottom rung. A PICRUSt2 path_abun_unstrat table looks identical to a HUMAnN pathabundance table - same MetaCyc IDs, same shape, same MinPath logic - but HUMAnN counts reads that actually aligned to genes in the sample, while PICRUSt2 reports relatives' genomes. Never merge or compare the two as interchangeable.
Every stage is an inference layered on the previous, so error compounds (Douglas 2020 Nat Biotechnol 38:685; original Langille 2013 Nat Biotechnol 31:814). PICRUSt2's headline advance over PICRUSt1 is that it places arbitrary de-novo ASVs (not closed-reference Greengenes OTUs):
--placement_tool epa-ng (alternative sepp).--hsp_method mp (maximum parsimony; pic is faster but mp is the recommended default).--coverage, presence confidence) and pathway ABUNDANCE are different questions - do not conflate.| Tool | Citation | Mechanism / role | When | |------|----------|------------------|------| | PICRUSt2 | Douglas 2020 Nat Biotechnol 38:685 | EPA-ng placement + castor HSP into ~20k-genome tree -> KO/EC/Pfam, MetaCyc | de-novo ASVs, broad function; strongest in human gut; the default | | Tax4Fun2 | Wemheuer 2020 Environ Microbiome 15:11 | nearest-BLAST to reference 16S + habitat-specific reference + functional-redundancy index | habitat-specific reference available; want a functional-redundancy metric | | FAPROTAX | Louca 2016 Science 353:1272 | curated literature lookup: taxon -> biogeochemical group (nitrification, methanogenesis, sulfate reduction) | ENVIRONMENTAL/biogeochemistry, cultured-taxon-dominated marine/soil; NOT gene-content prediction | | BugBase | Ward 2017 bioRxiv 133462 | organism-level PHENOTYPE prediction (PICRUSt-style gene content) | coarse community phenotypes (aerobic/anaerobic, Gram, oxidative stress); PREPRINT-only, never journal-published | | PanFP | Jun 2015 BMC Res Notes 8:479 | per-lineage pangenome profile weighted by OTU abundance | lineage-level functional summary; no de-novo placement | | Piphillin | Iwai 2016 PLoS ONE 11:e0166104 | direct nearest-neighbor BLAST of ASVs to genome DB (no tree, no HSP) | avoids the phylogenetic-interpolation assumption |
FAPROTAX is conceptually different: a curated taxon-to-function lookup answering "is this community nitrifying?", saying nothing about uncharacterized taxa (they map to nothing). Use FAPROTAX for biogeochemical-cycle questions, PICRUSt2 for a predicted KO/pathway profile - different questions.
| Scenario | Recommended | Why | |----------|-------------|-----| | Human gut / well-referenced host, broad KO/pathway profile | PICRUSt2 + report NSTI | dense references -> low NSTI -> Spearman ~0.8 vs shotgun; credible broad strokes | | Soil / marine / sediment / novel environment, biogeochemical question | FAPROTAX (broad groups) | reference tree is gut/host-biased; high NSTI makes PICRUSt2 KOs mostly extrapolation | | Need REAL (measured) functional gene content | -> metagenomics/functional-profiling | only shotgun reads measure genes; PICRUSt2 predicts, HUMAnN measures | | Coarse community phenotype (aerobe/anaerobe, Gram, pathogenic potential) | BugBase (flag: preprint-only) | organism-level phenotype, not pathways; cite its unpublished status | | Already in QIIME2 with .qza artifacts | -> qiime2-workflow (q2-picrust2 plugin) | same engine; FeatureTable in, FeatureTable out, into qiime diversity/composition | | DA between groups on predicted table | -> differential-abundance, run >=2 CoDA tools | compositional + prediction error; frame as hypothesis-generating (circularity) | | Reading the resulting KO/pathway list for biology | -> pathway-analysis/go-enrichment | predicted abundance is summed gene content, not an over-representation test |
# Full pipeline: place -> HSP -> 16S-normalize -> metagenome -> MetaCyc pathways
picrust2_pipeline.py \
-s asv_seqs.fna \ # representative ASV sequences (FASTA)
-i asv_table.biom \ # ASV abundance table (BIOM or TSV; samples as columns)
-o picrust2_out \
-p 8 \
--hsp_method mp \ # maximum parsimony (recommended default); pic is faster but not recommended
--max_nsti 2 \ # ASVs ABOVE 2 are DROPPED before inference; report how many (see below)
--verbose
# Key outputs (gzipped):
# KO_metagenome_out/pred_metagenome_unstrat.tsv.gz KEGG ortholog abundances
# EC_metagenome_out/pred_metagenome_unstrat.tsv.gz EC-number abundances
# pathways_out/path_abun_unstrat.tsv.gz MetaCyc pathway abundances
# marker_predicted_and_nsti.tsv.gz per-ASV 16S copies + metadata_NSTI (the quality file)
--stratified additionally emits per-ASV contribution tables (large, much slower); --per_sequence_contrib is only meaningful with it. --coverage adds pathway coverage (a different question from abundance). The unrolled per-step scripts are place_seqs.py -> hsp.py -i {16S,KO,EC} -m mp [-n] -> metagenome_pipeline.py --max_nsti 2 -> pathway_pipeline.py; --max_nsti filtering and 16S normalization happen in metagenome_pipeline.py. add_descriptions.py -m METACYC attaches human-readable names.
Goal: Quantify how much of the prediction is extrapolation and how much of the sampled community the NSTI gate discarded, so the result is interpretable.
Approach: Read marker_predicted_and_nsti.tsv.gz, summarize the metadata_NSTI distribution, and report the number of ASVs AND the fraction of READS dropped at --max_nsti 2 (a study that loses 40% of reads predicted function for a different community than it sampled).
import pandas as pd
nsti = pd.read_csv('picrust2_out/marker_predicted_and_nsti.tsv.gz', sep='\t') # cols: sequence, metadata_NSTI
asv_counts = pd.read_csv('asv_table.tsv', sep='\t', index_col=0) # ASVs x samples
nsti = nsti.set_index('sequence')
reads_per_asv = asv_counts.sum(axis=1)
max_nsti = 2.0 # PICRUSt2 default; ASVs above this are dropped before metagenome inference
dropped = nsti.index[nsti['metadata_NSTI'] > max_nsti]
reads_dropped_frac = reads_per_asv.reindex(dropped).sum() / reads_per_asv.sum()
print(f'mean NSTI {nsti.metadata_NSTI.mean():.3f} median {nsti.metadata_NSTI.median():.3f}')
print(f'ASVs dropped at NSTI>{max_nsti}: {len(dropped)}/{len(nsti)} reads dropped: {reads_dropped_frac:.1%}')
Trigger: reporting "increased butyrate production" / "upregulated" / "more metabolically active." Mechanism: PICRUSt2 measured no genes and no transcripts - only inferred gene presence from relatives' genomes. Symptom: a results sentence with an activity verb on a predicted pathway. Fix: restrict every claim to "potential" / "predicted capacity"; for activity, cite metatranscriptomics, not this skill.
Trigger: accepting the default --max_nsti 2 filter without reporting the distribution or the dropped fraction. Mechanism: the filter silently deletes the most novel/under-referenced ASVs - exactly the organisms an environmental study cares about. Symptom: a predicted-function result with no NSTI numbers in the methods. Fix: report mean/median NSTI, the distribution, and the ASV AND read fraction dropped (the helper above); treat high mean NSTI as a red flag that the result is mostly extrapolation.
Trigger: running PICRUSt2 on soil/marine/sediment/plant/novel hosts and reporting fine-grained KO differences. Mechanism: the ~20k-genome reference tree is gut/host-biased; sparse references mean high NSTI and predictions interpolated from distant relatives. Symptom: high mean NSTI yet confident KO/pathway tables. Fix: report the environment and NSTI; prefer FAPROTAX for the broad biogeochemical question, or do real shotgun. "Relatively better than other predictors" (per the paper) is not "trustworthy in absolute terms."
Trigger: describing path_abun_unstrat as "the pathways present in the community" or merging it with a HUMAnN table. Mechanism: the two tables share IDs and shape but PICRUSt2 reports relatives' genomes, HUMAnN reports reads that aligned to genes in the sample. Symptom: predicted and measured tables combined or compared as one. Fix: label predictions as potential; never merge with shotgun; keep coverage and abundance as separate questions.
Trigger: reporting "groups differed taxonomically AND functionally" as two lines of evidence. Mechanism: predicted function is a deterministic function of the ASV table, so the functional difference IS the taxonomic difference re-encoded. Symptom: a predicted-function DA result presented as orthogonal corroboration of a taxonomic result. Fix: present predicted function as a hypothesis-generating summary of the taxonomic signal; for orthogonal functional evidence use shotgun/metatranscriptomics.
Trigger: uncorrected Wilcoxon/t-test on relative abundances of the predicted table. Mechanism: the table is compositional, depth-confounded, and zero-inflated, on top of prediction error. Symptom: a long list of "significant" pathways that do not replicate across methods. Fix: use >=2 CoDA tools (ALDEx2, ANCOM-BC2, MaAsLin2, LinDA) and report the intersection (Nearing 2022 Nat Commun 13:342); ALDEx2 wants count-like features-as-rows, NOT relab-normalized output - see differential-abundance.
Trigger: inferring strain-specific function (toxin, resistance, pathogenicity island) from a 16S-based prediction. Mechanism: 16S resolves to roughly genus/species; accessory genome, HGT, plasmids, and prophage-borne genes vary within a species and are assigned the reference neighbors' core content. Symptom: a strain-level functional claim from amplicon data. Fix: state that the species core is the ceiling regardless of NSTI; strain function needs isolate genomes or shotgun.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| --max_nsti 2.0 (default) | Douglas 2020 Nat Biotechnol 38:685 | ASVs whose nearest sequenced genome is >2 substitutions/site away are too extrapolated to trust; dropped before inference - always report the dropped read fraction |
| --hsp_method mp (default) | PICRUSt2 manual | maximum parsimony is the recommended HSP method; pic is faster but not recommended (the legacy skill wrongly defaulted to pic) |
| --placement_tool epa-ng (default) | Barbera 2019 Syst Biol 68:365 | EPA-ng + gappa is the default placement path; sepp is the alternative |
| --min_align 0.8 (default) | PICRUSt2 manual | an ASV must align over >=80% of its length to be placed; poorly aligning ASVs are excluded |
| Predicted-vs-shotgun Spearman ~0.79-0.88 (gut) | Douglas 2020 Nat Biotechnol 38:685 | the empirical ceiling in the BEST case (dense human-gut references); lower elsewhere - the anchor for every accuracy caveat |
| NSTI ~0.5 / ~0.5-1 / ->2 (heuristic bands) | community practice | rough well-characterized / moderate / weak guide; no NSTI value converts predicted potential into measured function |
| DA: >=2 CoDA tools, report intersection | Nearing 2022 Nat Commun 13:342 | tool choice changes the predicted-pathway hit list; consensus beats any single tool |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| metadata_NSTI / NSTI file not found | reading marker_nsti_predicted.tsv (does not exist) | the real file is marker_predicted_and_nsti.tsv.gz; the column is metadata_NSTI |
| Near-empty output / most ASVs dropped | high NSTI (wrong environment) | check the NSTI distribution; consider FAPROTAX or shotgun; do not just lower --max_nsti to keep them |
| ALDEx2 gives implausible results on predicted table | fed relab-normalized output, features-as-columns | pass raw (count-like) predicted abundances with features as rows - see differential-abundance |
| Predicted and shotgun pathway tables disagree | comparing PICRUSt2 to HUMAnN as if interchangeable | they are different objects (predicted vs measured); do not merge |
| --per_sequence_contrib produces nothing | used without --stratified | it is only meaningful with --stratified |
| Pathway "presence" and "abundance" conflated | reading path_abun as coverage | abundance and --coverage are different questions |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.