phylogenetics/species-trees/SKILL.md
Estimates species trees under the multispecies coalescent from per-locus gene trees with the modern ASTER astral binary (ASTRAL-III/wASTRAL/ASTRAL-Pro), plus SVDQuartets, BPP, and StarBEAST2. Covers why a species tree is not a gene tree, why each locus has its own genealogy that disagrees by incomplete lineage sorting (ILS) even with zero error, why concatenation is statistically inconsistent and positively misleading in the anomaly zone where more loci converge on the wrong tree with full support, why gene-tree estimation error biases summary methods, that localPP is not bootstrap and ASTRAL branch lengths are coalescent units, and how minority-quartet symmetry separates ILS from introgression. Use when multi-locus discordance, rapid radiations, anomaly-zone risk, concordance-factor interpretation, or concatenation-vs-coalescent choice arise. Routes per-locus gene-tree inference and gCF/sCF to modern-tree-inference, dating to divergence-dating, orthology to comparative-genomics/ortholog-inference.
npx skillsauth add GPTomics/bioSkills bio-phylo-species-treesInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: ASTER 1.15+ (provides astral/ASTRAL-III, wastral, astral-pro), IQ-TREE 2.2+ (concordance factors), PAUP* 4.0a168+ (SVDQuartets), BPP 4+.
Before using code patterns, verify installed versions match. If versions differ:
astral --version then astral --help to confirm flagsiqtree2 --version then iqtree2 --help to confirm flagsIf code throws an error, introspect the installed version and adapt flags rather than retrying.
The modern binary is astral from the ASTER package; input is a file of gene-tree Newick strings, one tree per line. The classic-Java ASTRAL uses -t for ANNOTATION level, but the ASTER astral uses -t for THREADS and -u for annotation -- copying a Java -t 8 (quartet support) into ASTER silently requests 8 threads with no annotation. Confirm which interface is installed before scripting.
"Estimate the species tree from my multi-locus phylogenomic data" -> Treat the dataset as a distribution of genealogies, not one tree, and estimate the species tree most consistent with that distribution under the multispecies coalescent.
astral -i gene_trees.nwk -o species.tre (ASTER)iqtree2 ... --gcf ... --scfl (shared with modern-tree-inference)Scope: choosing and running a coalescent species-tree method, contracting noisy gene trees, and reading discordance honestly. Per-locus gene-tree inference and model selection -> modern-tree-inference. Computing and plotting gCF/sCF -> modern-tree-inference, tree-visualization. Single-copy vs multi-copy orthology upstream -> comparative-genomics/ortholog-inference. Dated species trees -> divergence-dating. Rooting -> tree-manipulation.
A species tree is not a gene tree. Each locus has its own genealogy, and under the multispecies coalescent (MSC) those genealogies disagree with the species tree -- and with each other -- by incomplete lineage sorting (ILS) alone, with zero estimation error. The disagreement is not noise to average away; it is signal generated by the coalescent running inside the species tree. Three load-bearing facts:
Concatenation is consistent and efficient only when ILS is low. It is actively wrong in three coupled regimes, all driven by short internodes measured in coalescent units (generations / 2Ne):
| Method | Citation | Type / role | When |
|--------|----------|-------------|------|
| ASTRAL-III (astral) | Zhang 2018 | Summary, quartet-based; consistent under MSC; localPP, polytomy test, q1/q2/q3 | Default genome-scale species tree from single-copy gene trees |
| wASTRAL (wastral/astral-hybrid) | Zhang and Mirarab 2022 | Summary, weighted quartets; down-weights weak gene-tree branches | Current best-accuracy summary method for real (noisy) gene trees |
| ASTRAL-Pro (astral-pro) | Zhang 2020 | Summary, quartet-based on MULTI-COPY gene-family trees | Paralog-rich families; uses families without pre-orthology |
| SVDQuartets | Chifman and Kubatko 2014 | Site-based quartets from site patterns; NO gene trees | Short loci (RADseq/UCE/SNP); when gene trees are unreliable |
| BPP | Flouri 2018 | Full-likelihood Bayesian MSC; integrates gene-tree uncertainty | Small high-stakes datasets; species delimitation; introgression (MSci) |
| StarBEAST2 | Ogilvie 2017 | Full Bayesian co-estimation of species tree, gene trees, dates | Dated species tree with full posterior uncertainty |
| Concatenation (IQ-TREE/RAxML) | -- | Supermatrix ML; assumes ONE tree | Low ILS / high gCF only; never in or near the anomaly zone |
Summary methods take estimated gene trees as input and assume they are true -- the central caveat (see failure modes). Site-based (SVDQuartets) and full-likelihood (BPP, StarBEAST2) skip that assumption but pay in scale: roughly tens of taxa for the full-likelihood methods. The speed/scale order (most scalable first) is concatenation < ASTRAL/wASTRAL < SVDQuartets < StarBEAST2 < BPP; model completeness runs the opposite way. Pick the most scalable method whose assumptions the data do not badly violate. Introgression detection (HyDe, Blischak 2018, Syst Biol; QuIBL, Edelman 2019, Science; the D-statistic) lives downstream of the symmetry diagnostic below.
| Situation / diagnostic | Recommended | |------------------------|-------------| | Low ILS: long internodes (>>1 coalescent unit), gCF high (>~70), few discordant gene trees | Concatenation fine; coalescent agrees and adds no advantage | | Moderate ILS: gCF ~50-70, some short branches | Run wASTRAL; check gCF/sCF and the polytomy test on short branches | | High ILS / rapid radiation: short successive internodes, gCF <50 (esp. <33), q2 ~ q3 | Coalescent REQUIRED; concatenation is anomaly-zone wrong; quartet consistency holds | | Low per-locus signal / unreliable gene trees | wASTRAL or contract gene-tree branches <10% support; or SVDQuartets (no gene trees); BPP integrates uncertainty | | Paralog-rich gene families | ASTRAL-Pro (no pre-orthology); concatenation cannot use them | | Need dated tree or delimitation | StarBEAST2 (dating) or BPP (delimitation); coalescent-unit lengths are not time | | Suspected introgression (q2 != q3, D significant) | Model the reticulation (HyDe/QuIBL/BPP-MSci/PhyloNet); a tree masks it |
Do not reflexively coalescent-everything: where ILS is low the two trees agree and concatenation pools weak per-locus signal robustly. The deliverable is concordance, not a single best method -- run both, compute gCF/sCF, and trust the coalescent tree exactly where the two disagree on low-gCF branches.
Goal: Turn per-locus gene trees into a species tree that is consistent under the MSC, after removing gene-tree noise that would bias the quartet counts.
Approach: Infer one gene tree per locus with branch support (modern-tree-inference), contract branches below ~10% support to polytomies (which ASTRAL handles correctly), then run wASTRAL as the primary estimate and astral for the classic localPP / polytomy-test workflow.
# Per-locus gene trees with support live upstream in modern-tree-inference:
# iqtree2 -S loci_dir -m MFP -B 1000 -T AUTO --prefix loci # one tree per locus
# Contract gene-tree branches below 10% support (Newick tools / nw_ed) -> polytomies,
# then concatenate into one file, one Newick per line:
# cat loci/*.contracted.treefile > gene_trees.nwk
# Primary estimate: wASTRAL weights quartets by gene-tree support+length (noisy gene trees)
wastral -i gene_trees.nwk -o species_wastral.tre 2> species_wastral.log
# Classic ASTRAL-III for localPP + polytomy test. In ASTER, -t is THREADS, -u is annotation:
astral -t 8 -i gene_trees.nwk -o species.tre 2> species.log # -t 8 = 8 threads here
astral -u 2 -i gene_trees.nwk -o species_annot.tre # full localPP + q1/q2/q3 for all 3 resolutions
astral --root OUTGROUP -i gene_trees.nwk -o species_rooted.tre # rooting improves branch-length estimation
# Multi-copy gene families (no pre-orthology) -> ASTRAL-Pro:
astral-pro -i family_trees.nwk -o species_pro.tre
ASTRAL returns an UNROOTED tree; root with an outgroup afterward. Branch lengths are in coalescent units (not time, not substitutions), and tip lengths are undefined in that mode. The classic Java astral.5.7.8.jar inverts the flags: there -t is the annotation level (-t 2 full, -t 8 quartet support, -t 10 polytomy test) and there is no -u.
Goal: Replace a single bootstrap/localPP per branch with how much of the actual data agrees, and read ILS-vs-introgression off the quartet symmetry.
Approach: Compute gene and site concordance factors against the species tree, then inspect the two minority quartet frequencies (q2, q3) at each contested branch.
# gCF = % of decisive gene trees containing a branch; sCF = % of decisive sites supporting it
iqtree2 -te species.tre --gcf gene_trees.nwk -s concat.fasta --scfl 100 --prefix cf # -te fixes the topology for likelihood sCF
# cf.cf.tree carries gCF/sCF on node labels; cf.cf.stat has per-branch q1/q2/q3 counts
Then read each contested branch: under pure ILS the two minority topologies are equally probable (q2 ~ q3); introgression breaks that symmetry by inflating the one matching the direction of gene flow. A high-bootstrap branch with gCF ~ 25 is a branch where the data are screaming disagreement that the bootstrap hides.
Trigger: Short, low-information loci (e.g. <500 bp exons) fed to ASTRAL/wASTRAL as if their gene trees were truth. Mechanism: MSC consistency assumes the input gene trees are correct; estimation error is a non-coalescent noise source that biases quartet counts in finite samples and can make the coalescent tree worse than concatenation. Symptom: Coalescent tree noisier or worse than concatenation; unstable across locus subsets; low localPP everywhere; gCF far below sCF. Fix: Contract gene-tree branches below ~10% bootstrap before ASTRAL (they become polytomies that contribute no spurious quartet similarity); better, use wASTRAL (continuous weighting); or drop to SVDQuartets (no gene trees) or BPP (integrates gene-tree uncertainty).
Trigger: Reporting ASTRAL branch labels as bootstrap proportions, or demanding ">95% bootstrap" of an ASTRAL tree. Mechanism: localPP is a coalescent posterior from quartet frequencies, on a different scale; it saturates at 1.0 on resolved branches. Symptom: Over-confident reading of localPP = 1.0 as "100/100 BS"; ignoring that localPP ~ 0.33 is a near-tie. Fix: Interpret localPP on its own scale; report gCF/sCF for honest conflict; use the polytomy test to decide whether to collapse a branch.
Trigger: Rapid radiation with several short successive internodes, analyzed by supermatrix ML with "100% bootstrap" reported as resolution. Mechanism: Concatenation is inconsistent and positively misleading under high ILS; the plurality genealogy is anomalous, so adding loci converges on the wrong tree with rising support. Symptom: Rock-solid bootstrap but gCF < 33 on focal branches; concatenated and coalescent trees disagree exactly there. Fix: Use wASTRAL/ASTRAL (quartet-level consistency holds in the anomaly zone); report gCF/sCF; trust the coalescent topology on low-gCF branches; consider a true near-polytomy via the polytomy test.
Trigger: Discarding all multi-copy families to feed single-copy ASTRAL, or treating paralogs as orthologs. Mechanism: Standard ASTRAL assumes single-copy gene trees; mis-assigned paralogs inject quartets reflecting the duplication history, not speciation. Symptom: Massive data loss (most families dropped), or spurious clades / inflated discordance from hidden paralogy. Fix: Use ASTRAL-Pro on the multi-copy gene-family trees (it models orthology/paralogy without pre-orthology); audit orthology upstream (comparative-genomics/ortholog-inference).
Trigger: Assuming all discordance is ILS and fitting a tree, OR over-calling gene flow from any discordance. Mechanism: ILS produces symmetric minority topologies (q2 ~ q3); introgression breaks the symmetry. A bifurcating tree cannot represent reticulate history. Symptom: Asymmetric q2/q3 on a discordant branch; significant D-statistic / HyDe gamma; QuIBL favoring the introgression component -- but a tree-only analysis hides it. Fix: Test minority-quartet symmetry (D-statistic / ABBA-BABA, HyDe, QuIBL, ASTRAL q1/q2/q3) before interpreting; if asymmetric, model the reticulation rather than forcing a tree.
| Quantity | Threshold | Meaning / action | Source | |----------|-----------|------------------|--------| | Internode (coalescent units) | < ~1.0 | >25% of gene trees discordant; ILS material | P(concordant)=1-(2/3)e^(-tau) | | Internode (coalescent units) | < ~0.3 | ~50% discordant; coalescent method required | same | | Two+ successive internodes | both < ~0.1-0.2 | Anomaly-zone risk; concatenation positively misleading | Degnan and Rosenberg 2006 | | Gene-tree branch support to contract before ASTRAL | < ~10% bootstrap (or use wASTRAL) | Collapse to polytomy to cut gene-tree-error bias | ASTRAL practice / Zhang and Mirarab 2022 | | gCF | < 50 | Most loci disagree; distrust concatenation here | Minh 2020 | | gCF | < 33 | Near the ILS coin-flip floor; effectively unresolved | Minh 2020 | | sCF | ~33 | Random-resolution floor (three quartet resolutions); no site signal | Minh 2020 | | localPP | ~0.95+ to call "well-supported" | Strong coalescent support, NOT a bootstrap | Sayyari and Mirarab 2016 | | localPP | ~0.33 | Three quartet resolutions tied; candidate polytomy | Sayyari and Mirarab 2016 | | Polytomy test p | fail to reject (e.g. p > 0.05) | Cannot claim a resolved bifurcation; collapse | Sayyari and Mirarab 2018 |
The gCF/sCF cutoffs of 50/33 are practical heuristics relative to the dataset (a branch at gCF 45 amid branches at 90 is the weak one), not significance tests. The 10% contraction is a widely-used default, not a derived optimum; wASTRAL removes the need to pick it.
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| ASTER astral ignores the annotation flag | Used Java -t 8 (annotation); ASTER -t is threads | Use -u 2 in ASTER; -t 8 only in classic Java |
| Coalescent tree worse than concatenation | Noisy gene trees fed in as truth | Contract <10% branches; use wASTRAL; or SVDQuartets |
| localPP = 1.0 reported as bootstrap | localPP is a coalescent posterior on its own scale | Report gCF/sCF; do not apply a BS cutoff |
| Coalescent-unit branch read as time | Branch lengths are coalescent units, tips undefined | State units; for dates use StarBEAST2 / divergence-dating |
| High bootstrap, low gCF, called resolved | Concatenation in the anomaly zone | Run coalescent; trust it on low-gCF branches; polytomy test |
| Most gene families dropped | Forced single-copy orthology | Use ASTRAL-Pro on multi-copy family trees |
| Network claimed from any discordance | ILS also produces discordance | Confirm q2 != q3 (D / HyDe / QuIBL) before invoking gene flow |
Degnan JH, Rosenberg NA. 2006. Discordance of species trees with their most likely gene trees. PLoS Genetics 2(5):e68. Degnan JH, Rosenberg NA. 2009. Gene tree discordance, phylogenetic inference and the multispecies coalescent. Trends in Ecology & Evolution 24(6):332-340. Kubatko LS, Degnan JH. 2007. Inconsistency of phylogenetic estimates from concatenated data under coalescence. Systematic Biology 56(1):17-24. Roch S, Steel M. 2015. Likelihood-based tree reconstruction on a concatenation of aligned sequence data sets can be statistically inconsistent. Theoretical Population Biology 100:56-62. Mirarab S, Reaz R, Bayzid MS, Zimmermann T, Swenson MS, Warnow T. 2014. ASTRAL: genome-scale coalescent-based species tree estimation. Bioinformatics 30(17):i541-i548. Zhang C, Rabiee M, Sayyari E, Mirarab S. 2018. ASTRAL-III: polynomial time species tree reconstruction from partially resolved gene trees. BMC Bioinformatics 19(Suppl 6):153. Zhang C, Mirarab S. 2022. Weighting by gene tree uncertainty improves accuracy of quartet-based species trees. Molecular Biology and Evolution 39(12):msac215. Zhang C, Scornavacca C, Molloy EK, Mirarab S. 2020. ASTRAL-Pro: quartet-based species-tree inference despite paralogy. Molecular Biology and Evolution 37(11):3292-3307. Zhang C, Nielsen R, Mirarab S. 2025. ASTER: a package for large-scale phylogenomic reconstructions. Molecular Biology and Evolution 42(8):msaf172. Sayyari E, Mirarab S. 2016. Fast coalescent-based computation of local branch support from quartet frequencies. Molecular Biology and Evolution 33(7):1654-1668. Sayyari E, Mirarab S. 2018. Testing for polytomies in phylogenetic species trees using quartet frequencies. Genes 9(3):132. Chifman J, Kubatko L. 2014. Quartet inference from SNP data under the coalescent model. Bioinformatics 30(23):3317-3324. Flouri T, Jiao X, Rannala B, Yang Z. 2018. Species tree inference with BPP using genomic sequences and the multispecies coalescent. Molecular Biology and Evolution 35(10):2585-2593. Ogilvie HA, Bouckaert RR, Drummond AJ. 2017. StarBEAST2 brings faster species tree inference and accurate estimates of substitution rates. Molecular Biology and Evolution 34(8):2101-2114. Minh BQ, Hahn MW, Lanfear R. 2020. New methods to calculate concordance factors for phylogenomic datasets. Molecular Biology and Evolution 37(9):2727-2733. Blischak PD, Chifman J, Wolfe AD, Kubatko LS. 2018. HyDe: a Python package for genome-scale hybridization detection. Systematic Biology 67(5):821-829. Edelman NB, Frandsen PB, Miyagi M, et al. 2019. Genomic architecture and introgression shape a butterfly radiation. Science 366(6465):594-599.
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.