ecological-genomics/species-delimitation/SKILL.md
Delimits putative species boundaries from molecular data within the de Queiroz 2007 unified-lineage framework using ASAP (Puillandre 2021 successor to ABGD), mPTP C++ (Kapli 2017 successor to bPTP; bPTP is Python NOT R), GMYC single/multi-threshold (Pons 2006; Fujisawa 2013), multilocus BPP v4 with prior calibration from data (NOT defaults; Yang 2015), SNAPP + BFD* for SNP delimitation, DELINEATE (Sukumaran 2021) speciation-process modeling to address Sukumaran & Knowles 2017 PNAS critique that MSC delimits structure not species, integrative-taxonomy congruence (Padial 2010; Carstens 2013), Dsuite for introgression testing before sister claims (Malinsky 2021), and Meyer & Paulay 2005 barcoding-gap-absence caveat. Use when delineating species from DNA barcoding data, resolving cryptic complexes, choosing among ASAP/mPTP/BPP/DELINEATE, calibrating BPP priors, distinguishing introgression from ILS, or applying the Sukumaran-Knowles oversplitting correction.
npx skillsauth add GPTomics/bioSkills bio-ecological-genomics-species-delimitationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: ASAP (current CLI), mPTP (current C++), bPTP (Python from github.com/zhangjiajie/PTP), splits 1.0+ for GMYC, BPP 4.7+, SNAPP/BEAST 2.7+, DELINEATE (current Python), Dsuite 0.5+, ape 5.7+, fossil 0.4+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_name to verify parameters<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Delineate species boundaries from my DNA barcoding or genomic data" -> Apply distance-based (ASAP) and tree-based (mPTP) methods for primary delimitation, multilocus coalescent (BPP) for genomic confirmation, DELINEATE for speciation-process modeling to address MSC oversplitting, integrative-taxonomy validation across multiple lines of evidence, and Dsuite to test introgression before claiming sister relationships.
asap (downloadable C binary) OR web (https://bioinfo.mnhn.fr/abi/public/asap/) for single-locus primary delimitationmptp (https://github.com/Pas-Kapli/mptp) for tree-based multi-rate PTP (faster than bPTP)bPTP (https://github.com/zhangjiajie/PTP) for legacy tree-based delimitation — NOTE: Python not Rsplits::gmyc() for GMYC on ultrametric treesbpp (https://bpp.github.io) for multilocus MSC delimitation; control-file drivenDsuite for D-statistics / f4-ratio / f-branch introgression testingSukumaran & Knowles 2017 PNAS 114(7):1607-1612 established that BPP, BFD*, and other multispecies-coalescent methods delimit genetic structure, not species. The mathematical reason: the MSC model treats every panmictic population as a "species" in the parameterization. Applied to data with substructure (isolated demes within a species), MSC methods partition that structure into "species" — leading to systematic oversplitting in published literature, especially in lizards, frogs, geckos, and insects.
Modern best practice (post-2021):
A second cornerstone: the barcoding gap is OFTEN absent in real data (Meyer & Paulay 2005 PLoS Biol 3:e422). ABGD and ASAP are theoretically grounded in the gap; when the gap is absent, both fail gracefully but their output is unreliable. Always inspect the pairwise distance histogram.
A third: Dsuite-checked introgression must precede sister-species claims. D-statistics, f4-ratio, and f-branch (Malinsky 2021) distinguish admixture from incomplete lineage sorting; without this check, ILS-driven discordance is misinterpreted as gene flow.
| Method | Input | Approach | Strength | Fails when | |--------|-------|----------|----------|------------| | ASAP (Puillandre 2021) | Aligned sequences | Hierarchical-clustering distance partitioning with new scoring | Fast; modern; web + CLI | Single locus; works only if barcoding gap exists | | ABGD (Puillandre 2012) | Aligned sequences | Distance gap-detection across priors | Legacy; superseded by ASAP | Less robust when gap is weak | | GMYC (Pons 2006) | Ultrametric tree | Yule-coalescent transition threshold | Theoretical foundation | Requires time-calibrated tree | | Multi-threshold GMYC (Fujisawa & Barraclough 2013) | Ultrametric tree | Per-lineage transition thresholds | Heterogeneous rates | Same; not always more powerful | | bPTP (Zhang 2013) | Rooted phylogeny | Bayesian PTP MCMC | Posterior support per partition | Single intraspecific rate assumption; slower than mPTP | | mPTP (Kapli 2017) | Rooted phylogeny | Multi-rate PTP, C++ | Faster (5+ orders); per-species intraspecific rates | Heterogeneity-friendly; for shallow phylogenies bPTP may be more powerful | | BPP A10/A11 (Yang & Rannala) | Multi-locus alignments | Bayesian multispecies coalescent | Rigorous, multilocus | Computationally heavy; oversplits per Sukumaran-Knowles | | SNAPP (Bryant 2012) | Biallelic SNP data | Coalescent species-tree bypassing gene trees | High-power genomic | Cubic complexity; not for very large datasets | | BFD* (Leache 2014) | SNP data via SNAPP | Bayes-factor delimitation | Genomic-scale species hypothesis testing | Same oversplitting risk | | DELINEATE (Sukumaran 2021) | BPP output + guide tree | Speciation-process modeling | Distinguishes structure from species | Requires preliminary delimitation as input | | Dsuite (Malinsky 2021) | VCF + populations | D-statistic, f4-ratio, f-branch | Tests introgression vs ILS | Tree must accurately reflect history |
| Scenario | Recommended approach | Why | |----------|---------------------|-----| | Primary species hypothesis from single locus | ASAP first; cross-check with mPTP | ASAP is the modern successor to ABGD; faster and better-scoring | | Tree-based delimitation, heterogeneous intraspecific rates | mPTP | Per-species rate model; 5+ orders faster than bPTP | | Tree-based delimitation, shallow phylogeny | bPTP (single-rate) or mPTP | When rates are similar, single-rate model may be more powerful | | Time-calibrated tree from BEAST / chronos | GMYC (single or multi-threshold) | Requires ultrametric; both threshold variants worth comparing | | Multilocus genomic delimitation | BPP A10/A11 + DELINEATE | BPP alone oversplits per Sukumaran-Knowles 2017 | | SNP-based species hypothesis test | SNAPP + BFD* | Coalescent species-tree from biallelic SNPs | | Confirm BPP result is not oversplitting | DELINEATE (Sukumaran 2021) | Speciation-process modeling distinguishes structure from species | | Test introgression before sister-species claim | Dsuite (D, f4-ratio, f-branch) | Distinguishes admixture from ILS | | Cryptic species complex | Integrative taxonomy: genetic + morphological + ecological congruence | Single-method conclusion unreliable | | ABGD/ASAP returns no clear partition | Inspect pairwise distance histogram; do not force partition | Barcoding gap may be absent (Meyer & Paulay 2005) | | Recently-diverged populations (Ne*t << 1) | Caution; MSC methods very prone to oversplit | Incomplete lineage sorting indistinguishable from shallow structure | | Conservation management unit definition | Moritz 1994 ESU/MU framework, NOT MSC delimitation | Sukumaran-Knowles caveat applies |
Goal: Primary species hypothesis from a single-locus alignment via hierarchical clustering of pairwise distances, scored by a new asap-score (Puillandre 2021).
Approach: Run ASAP via the web interface (https://bioinfo.mnhn.fr/abi/public/asap/) OR the downloadable C binary. ASAP ranks candidate partitions by asap-score (lower = better); always inspect the top 5-10 partitions, not just the best, and check for large score gaps that signal robust partitioning. Output: ranked species partitions with asap-score, p-value, and threshold distance per partition.
# ASAP CLI (downloadable C binary)
# Substitution model: K2P standard for COI; p-distance for very closely related; JC for general
./asap -d K2P -o asap_results/ aligned_sequences.fasta
# Inspect output
# - asap_output.csv: ranked partitions by asap-score (lower = better)
# - Always look at top 5-10 partitions; large score gaps = robust partitioning
# - The "best" partition is the top-ranked; report secondary partitions too
# Python ASAP-style analysis for inspection (NOT a substitute for ASAP)
# Useful for understanding the distance landscape
from Bio import AlignIO
from Bio.Phylo.TreeConstruction import DistanceCalculator
from scipy.cluster.hierarchy import fcluster, linkage
from scipy.spatial.distance import squareform
import numpy as np
import matplotlib.pyplot as plt
alignment = AlignIO.read('aligned_sequences.fasta', 'fasta')
calc = DistanceCalculator('identity') # use K2P-friendly model in practice
dm = calc.get_distance(alignment)
names = [r.id for r in alignment]
n = len(names)
dist_array = np.zeros((n, n))
for i in range(n):
for j in range(n):
dist_array[i][j] = dm[names[i], names[j]]
# Inspect distance histogram for barcoding gap (often ABSENT per Meyer & Paulay 2005)
upper = dist_array[np.triu_indices(n, k=1)]
plt.hist(upper, bins=50)
plt.xlabel('Pairwise distance')
plt.ylabel('Frequency')
plt.title('Distance histogram — look for bimodality (gap)')
plt.savefig('barcoding_gap_histogram.pdf')
# Hierarchical clustering at threshold scan
condensed = squareform(dist_array)
Z = linkage(condensed, method='average')
thresholds = np.arange(0.01, 0.10, 0.005)
for t in thresholds:
clusters = fcluster(Z, t=t, criterion='distance')
print(f'Threshold {t:.3f}: {len(set(clusters))} groups')
Goal: Delimit species from a phylogeny by detecting the transition from speciation to coalescent branching rates, allowing per-species intraspecific rates (Kapli 2017).
Approach: Run mPTP on a rooted ML tree (e.g., from RAxML or IQ-TREE). mPTP's multi-rate model is more flexible than bPTP's single-rate; for shallow phylogenies with similar rates, bPTP may be competitive. mPTP is C++ (5+ orders of magnitude faster than bPTP), available from https://github.com/Pas-Kapli/mptp.
# mPTP installation: build from C source
# git clone https://github.com/Pas-Kapli/mptp
# cd mptp && ./autogen.sh && ./configure && make
# Run mPTP ML inference
mptp --ml --tree_file rooted_tree.nwk --output_file mptp_ml
# MCMC for posterior support
mptp --mcmc 1000000 --mcmc_sample 1000 --mcmc_burnin 100000 \
--tree_file rooted_tree.nwk --output_file mptp_mcmc
# Output: species partition with branch annotations
# > 0.95 posterior support: strong; 0.80-0.95: moderate; < 0.80: uncertain
Goal: Bayesian Poisson Tree Processes delimitation with single intraspecific rate (Zhang 2013).
Approach: Run bPTP via the Python package from https://github.com/zhangjiajie/PTP (NOTE: this is a Python tool, NOT an R package; install.packages('PTP') does not exist). For heterogeneous-rate data, prefer mPTP.
# bPTP Python install (NOT R)
pip install git+https://github.com/iTaxoTools/PTP-pyqt5
# Run bPTP MCMC
python -m PTP.PTP -t rooted_tree.nwk -o bptp_results \
--type bayesian --ngen 100000 --burnin 0.1 --seed 42
# Output: posterior support per species partition (> 0.95 strong)
# bPTP over-splits when populations have strong geographic substructure
# Cross-check with ASAP and/or mPTP for consensus
Goal: Detect the threshold on an ultrametric phylogeny where branching transitions from interspecific (Yule) to intraspecific (coalescent).
Approach: Convert ML tree to ultrametric with ape::chronos (penalized likelihood) or BEAST (Bayesian time-calibration), then run GMYC. Single-threshold vs multi-threshold; multi-threshold handles heterogeneous rates across the tree.
library(splits)
library(ape)
tree <- read.tree('rooted_tree.nwk')
# Convert to ultrametric (REQUIRED for GMYC)
# chronos: penalized likelihood; lambda=1 mid-clock
# For rigorous time-calibration, use BEAST2 instead
ultrametric_tree <- chronos(tree, lambda = 1)
class(ultrametric_tree) <- 'phylo'
stopifnot(is.ultrametric(ultrametric_tree))
# Single-threshold GMYC
gmyc_single <- gmyc(ultrametric_tree, method = 'single')
summary(gmyc_single)
cat('GMYC species (single-threshold):', gmyc_single$entity[1], '\n')
cat('LR test p-value:', gmyc_single$p.value[1], '\n')
# Multi-threshold for heterogeneous rates
gmyc_multi <- gmyc(ultrametric_tree, method = 'multiple')
summary(gmyc_multi)
# Compare single vs multi-threshold; report both
species_single <- spec.list(gmyc_single)
species_multi <- spec.list(gmyc_multi)
Goal: Bayesian multilocus species delimitation under the multispecies coalescent (Yang & Rannala 2010; current Flouri 2018 BPP v4).
Approach: Configure BPP A10 (joint delimitation + species tree estimation) with priors on theta and tau ESTIMATED FROM DATA — NOT defaults. Yang 2015 Curr Zool 61:854-865 gives explicit guidance: thetaprior = G(2, 2000) for small/moderate data; tighter for large genomic data. Wrong priors dominate the analysis.
# BPP control file for A10 analysis
# CRITICAL: priors below are CALIBRATED to data, not defaults
# Defaults are tuned for primate-mammal data; insects/fish/plants need adjustment
seed = 12345
seqfile = alignment.phy
Imapfile = imap.txt # map individuals -> putative species
outfile = bpp_results.txt
mcmcfile = bpp_mcmc.txt
speciesdelimitation = 1 0 2 0.5 # 1 = delimitation mode; rjMCMC parameters
speciestree = 1 # 1 = estimate species tree
speciesmodelprior = 1 # 1 = uniform prior on rooted trees
species&tree = 4 speciesA speciesB speciesC speciesD
10 8 12 6
((speciesA, speciesB), (speciesC, speciesD));
# CALIBRATED theta prior (per-locus nucleotide diversity)
# G(2, 2000) -> mean = 0.001 (calibrate from observed per-locus pi)
# Yang 2015 Curr Zool 61:854 has taxon-specific recommendations
thetaprior = 2 2000
# CALIBRATED tau prior (divergence times)
# Calibrate from observed maximum genetic distance / mutation rate
tauprior = 2 2000
# MCMC
nsample = 100000
sampfreq = 2
burnin = 50000
# Run multiple independent chains; check convergence
# Run BPP with multiple independent chains
for i in 1 2 3 4; do
bpp --cfile bpp_control.bpp --seed $((1000 * i)) --out bpp_run_$i.txt &
done
wait
# Compare chains; posterior probabilities must agree across chains
# PP > 0.95 for a delimitation: strong support
# PP 0.50-0.95: moderate; integrate other evidence
Goal: Test whether candidate lineages from BPP represent fully-formed species OR incomplete-speciation structure (Sukumaran 2021).
Approach: Provide DELINEATE with a guide tree of population lineages and a hypothesis about which are species. The output: probability that each population lineage is a true species vs an incipient lineage. THE post-2021 best practice complement to BPP.
# DELINEATE (https://github.com/jsukumaran/delineate)
# Python tool
pip install delineate
# Configure delineate input
# - guide tree: rooted Newick with population lineages as tips
# - constraint file: which tips are confirmed species, which to test
delineate-estimate partitions \
--tree-file population_lineage_tree.nwk \
--config-file delineate_config.json \
--output-prefix delineate_run
# Output: probability that each lineage is a species
# Use ALONGSIDE BPP; do not claim species from BPP alone post-2021
Goal: Test for introgression before claiming sister-species relationships, distinguishing admixture from ILS.
Approach: Compute D-statistic (ABBA-BABA), f4-ratio (admixture fraction), and f-branch (Malinsky 2021 Mol Ecol Resour 21:584-595) on a VCF with population assignments. Non-zero D indicates introgression between H3 and one of H1/H2.
# Dsuite (https://github.com/millanek/Dsuite)
# Sets file: 3 columns (sample_id, population/species, optional outgroup tag)
# Tree file: Newick with population names matching the sets file
# All-trio D-statistic
Dsuite Dtrios -t species_tree.nwk sets.txt input.vcf.gz
# F-branch decomposition (more interpretable than raw D)
Dsuite Fbranch species_tree.nwk Dtrios_output_tree.txt > fbranch.txt
# Plot f-branch matrix
dtools.py fbranch.txt species_tree.nwk
# D > 0 with |Z| > 3: significant ABBA-BABA imbalance -> introgression
# f4-ratio gives the admixture fraction estimate
Trigger: Running BPP, BFD*, or other multispecies-coalescent methods on data with within-species substructure (isolated demes), then claiming the delimited lineages are species.
Mechanism: The MSC model treats every panmictic population as a "species" in its parameterization (Sukumaran & Knowles 2017). When applied to data with population substructure, MSC partitions that structure into "species."
Symptom: Many "species" delimited; each with low sample size; lineages defined by geography rather than morphology; BPP posterior support is high but biological reality is unclear.
Fix: Run DELINEATE (Sukumaran 2021) on the BPP output to test which lineages have completed speciation. Validate via integrative taxonomy (Padial 2010): require congruence across genetic + morphological + ecological evidence before publishing species.
Trigger: Reporting "best partition" from ABGD or ASAP when the pairwise distance histogram is unimodal (no gap).
Mechanism: ABGD and ASAP detect a gap in the pairwise-distance distribution that separates intra- from inter-specific divergences. Meyer & Paulay 2005 PLoS Biol 3:e422 demonstrated the gap is frequently absent.
Symptom: Distance histogram is unimodal (no clear bimodality); ASAP partitions have similar scores across many K values; ABGD priors disagree.
Fix: Inspect the distance histogram first; if no clear gap, do NOT report ASAP/ABGD results as definitive; integrate other lines of evidence.
Trigger: Running bPTP on a tree where the same biological species has multiple isolated demes contributing genetic structure.
Mechanism: bPTP detects branching-rate shifts as species boundaries; isolated demes within a species show coalescent-rate branching that bPTP partitions as separate species.
Symptom: More bPTP species than morphologically recognized; species correspond to geographic regions.
Fix: Cross-check with ASAP (more conservative) and with mPTP at multi-rate option; treat bPTP/mPTP as preliminary; require congruence with other lines of evidence.
Trigger: Running BPP without calibrating priors to the data; using priors copied from a primate/mammal example file.
Mechanism: BPP results depend critically on theta (per-locus nucleotide diversity) and tau (divergence time) priors. Default priors are typically tuned for mammal data; for insects, fish, plants with different mutation-rate and Ne contexts, defaults can dominate the posterior.
Symptom: Posterior support clusters at 1.0 OR 0.0; results inconsistent with morphological / ecological data; sensitivity to alternative priors is very high.
Fix: Calibrate priors from data per Yang 2015 Curr Zool 61:854: estimate per-locus heterozygosity for theta mean; estimate divergence times from observed maximum genetic distances and mutation rate.
Trigger: Reporting raw D-statistics for many taxa without decomposing introgression patterns.
Mechanism: D-statistic measures asymmetric site-pattern counts; with taxon-rich datasets, D > 0 can be confounded by ancient introgression in the outgroup or by ghost lineages. f-branch (Malinsky 2018) decomposes admixture across the tree more interpretably.
Symptom: Many trios with D > 0; difficult to identify which specific introgression event drives the pattern.
Fix: Report f-branch decomposition alongside D; check for consistent patterns across phylogenetic levels.
| Threshold | Value | Source / rationale | |-----------|-------|-------------------| | COI K2P "species" cutoff | typically 2-3% pairwise distance (variable) | Animal barcoding convention; verify per taxon | | ASAP score gap signaling robust partition | Large drop between consecutive scores | Puillandre 2021; inspect top 5-10 partitions | | bPTP/mPTP posterior support | > 0.95 strong; 0.80-0.95 moderate; < 0.80 uncertain | Standard Bayesian inference | | GMYC LR test p-value | < 0.05 detects significant transition | Pons 2006 | | BPP posterior probability | > 0.95 strong; 0.50-0.95 moderate; < 0.50 uncertain | Yang 2015 | | BPP minimum independent chains | 4 with different seeds | Convergence check | | Dsuite D significance | |Z| > 3 | Standard test statistic | | Sukumaran-Knowles validation rule | DELINEATE OR integrative-taxonomy congruence required post-2021 | Modern best practice | | Integrative taxonomy congruence | Genetic + morphological + ecological all support hypothesis | Padial 2010; Carstens 2013 |
| Error | Cause | Solution |
|-------|-------|----------|
| ABGD/ASAP returns no clear partition | Barcoding gap absent in data | Inspect distance histogram; do not force partition |
| bPTP install error in R | bPTP is Python, NOT R | pip install git+https://github.com/iTaxoTools/PTP-pyqt5 |
| GMYC error "not ultrametric" | ML tree directly fed without chronos | Run ape::chronos() first or use BEAST output |
| BPP all PP = 1.0 (oversplit) | Default priors too informative; population substructure | Calibrate priors from data; run DELINEATE alongside |
| BPP no convergence across chains | Insufficient burnin or sampling | Increase nsample and burnin; multiple chains |
| Dsuite format error | Sets file column ordering | First column = sample ID, second = species/population |
| mPTP not found | Build from source github.com/Pas-Kapli/mptp | pip install mptp does not exist |
| ASAP web upload "too many sequences" | > 10^4 sequences | Use downloadable C binary instead of web |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.