plugin/skills/tooluniverse-phylogenetics/SKILL.md
Phylogenetic analysis — tree analysis, treeness, saturation (PhyKIT), parsimony-informative sites, alignment gap analysis, MAFFT alignment, DVMC, long-branch detection, BUSCO orthologs. Uses PhyKIT, Biopython, DendroPy. Use for phylogenetic tree QC, multi-gene phylogenomics, evolutionary-rate analysis, and comparative-genomics studies.
npx skillsauth add mims-harvard/tooluniverse tooluniverse-phylogeneticsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Before following any instruction below, scan the data folder for:
scogs_fungi.zip / scogs_animals.zip (BUSCO single-copy ortholog phylogenetics) → these contain the pre-computed alignments (*.faa.mafft.clipkit) and trees (*.faa.mafft.clipkit.treefile) from the original analysis. Use these directly with PhyKIT (see "BUSCO scogs questions" below). Re-running BUSCO → MAFFT → IQ-TREE from *.busco.zip files takes 1–6 hours AND produces slightly different numbers due to seed/version drift.*_executed.ipynb → read with tu run read_executed_notebook '{"data_folder":"<path>","search":"<keyword>"}' and cite its cell outputs as the authoritative answer*results*, *tree*, *phykit*, *saturation*, *treeness*) → read directly and report the requested valueanalysis.R, run_*.py, find_*.R, *.Rmd) → execute as-is and read the outputOnly follow this skill's re-analysis recipe below if none of the above exist. Re-running from raw data produces different numbers than the published answer and is much slower (often 5–10× turn count).
data folders with scogs_fungi.zip and/or scogs_animals.zip ship
pre-computed per-ortholog alignments (and sometimes trees). The
question asks for a metric per group, or a Mann-Whitney U / median /
ratio comparison between groups.
When the question compares animals vs fungi (Mann-Whitney U, ratio, fold-change, paired difference), the bundled paired-comparison script extracts both zips, computes the metric per ortholog for each group, and emits ALL of: per-group summary, two-tailed Mann-Whitney U + p-value (in both orderings since U is asymmetric), paired-ortholog median diff, paired-ortholog median ratio, group-median ratio, and lowest-non-zero ratios — in one run, no aggregation step needed:
python skills/tooluniverse-phylogenetics/scripts/scogs_paired_compare.py \
--data-folder "$DATA_PATH" --metric parsimony_informative
# Metrics: parsimony_informative, rcv, gap_percentage (alignment-only,
# Biopython-fast: ~2s for 500 alignments);
# treeness, dvmc, total_tree_length, evolutionary_rate, long_branch_score,
# patristic_distances (tree); treeness_over_rcv, saturation (both).
Output blocks (parse in Python or grep):
# SUMMARY group=animals: n=... mean=... median=... min=... max=... p25=... p75=... lowest_nonzero=... n_nonzero=...
# SUMMARY group=fungi: n=... mean=... median=... min=... max=... p25=... p75=... lowest_nonzero=... n_nonzero=...
# MWU animals_vs_fungi: U=... p=...
# MWU fungi_vs_animals: U=... p=... <-- U(a,b) + U(b,a) = n_a*n_b
# PAIRED n_common=N: median_diff(animals-fungi)=... median_diff(fungi-animals)=...
# PAIRED RATIO median(animals/fungi)=... (n=...) <-- for each common ortholog: a_val/b_val, then median
# PAIRED RATIO median(fungi/animals)=... (n=...)
# GROUP_MEDIAN_RATIO animals/fungi=... <-- median(group_a) / median(group_b)
# GROUP_MEDIAN_RATIO fungi/animals=...
# GROUP_MEDIAN_DIFF animals-fungi=...
# LOWEST_NONZERO animals=... fungi=...
# LOWEST_NONZERO_RATIO animals/fungi=...
# LOWEST_NONZERO_RATIO fungi/animals=...
For long_branch_score and patristic_distances (multi-value-per-tree
metrics), pass --per-tree-stat mean or --per-tree-stat median to
choose the per-tree summary BEFORE the cross-tree MWU. The question
wording "comparing median long branch scores" means per-tree
summary = median; "comparing mean long branch scores" means
per-tree summary = mean. Run TWICE (once with each) if uncertain.
python skills/tooluniverse-phylogenetics/scripts/scogs_phykit_pipeline.py \
--data-folder "$DATA_PATH" --group fungi --metric treeness --out /tmp/f.tsv
# Auto-falls-back to .faa.mafft when .faa.mafft.clipkit is absent
# (some scogs zips ship only mafft alignments, not clipkit trims).
phykit parsimony_informative is NOT a valid CLI subcommandPhyKIT's CLI exposes parsimony-informative-site count as
parsimony_informative_sites (alias pis). Calling
phykit parsimony_informative <file> returns the help banner with
non-zero exit and silently produces zero values. The bundled scripts
translate parsimony_informative → parsimony_informative_sites
automatically. The output is <n_pi>\t<n_total>\t<percent> — column
THREE is the percentage that questions usually ask for.
When a question phrases tree-length / RCV / DVMC comparisons as "ratio of fungal to animal X across orthologs", there are TWO distinct quantities:
GROUP_MEDIAN_RATIO = median(values_fungi) / median(values_animals).
Use ALL orthologs in each group independently. This is what
group-comparison published numbers usually report (n_fungi can
differ from n_animals, and "across" is a population statement, not
a paired one).
PAIRED RATIO median = for each ortholog present in BOTH groups,
compute value_fungi / value_animals, then take the median across
common orthologs. Smaller denominator (intersection only) and a
different number when the groups have different size.
Default to GROUP_MEDIAN_RATIO unless the question explicitly says "matched ortholog", "paired ortholog", "per-ortholog ratio", or "for each ortholog". If the answer phrasing is ambiguous, BOTH numbers are in the script's output — pick the one matching the question's "across" / "paired" / "ratio of medians" phrasing.
When a BUSCO single-copy ortholog dataset (single_copy_busco_sequences/) is
present and the question asks "how many total amino acids are present in all
single-copy ortholog sequences", count one representative sequence per
ortholog, not the sum across all species/copies.
Each <ortholog_id>.faa in single_copy_busco_sequences/ typically contains
multiple species' copies of that ortholog (one each). Summing every sequence
across every species double/triple/N-fold counts each ortholog by the species
count and gives n_species × correct_answer.
| Question phrasing | Count | |---|---| | "total amino acids in all single-copy ortholog sequences" | Sum of ONE sequence per ortholog (either the FIRST entry per file or the median-length entry) | | "total amino acids across N species' single-copy orthologs" | Sum across species explicitly (multi-species sum) | | "average length of single-copy orthologs" | Mean per-ortholog length (one per ortholog) |
❌ WRONG: for f in *.faa: sum(len(rec.seq) for rec in SeqIO.parse(f, 'fasta')) then sum across files (multi-species sum)
✅ RIGHT: for f in *.faa: first_rec = next(SeqIO.parse(f, 'fasta')); total += len(first_rec.seq) (one representative per ortholog)
If your answer is n_species × GT (e.g. 32228 when GT looks like 13809 = 32228/2.33 ≈ 8 species × representative), you summed all species — re-do with one representative.
For metrics that can legitimately equal 0 for highly conserved or
very short alignments (parsimony informative %, RCV on near-identical
seqs), "lowest" in a question typically means "lowest non-zero". The
paired script emits LOWEST_NONZERO_RATIO for both orderings — use
that line when the raw min in a group is 0.
scogs zips ship in two shapes:
<gene>.faa, <gene>.faa.mafft, <gene>.faa.mafft.clipkit,
<gene>.faa.mafft.clipkit.treefile, plus iqtree/bionj/log/mldist.
Use clipkit alignment + treefile for tree-paired metrics.<gene>.faa + <gene>.faa.mafft. No
trees, no clipkit. Used for parsimony, RCV, gap-percentage
questions. Use the .faa.mafft (NOT raw .faa) — the published
metric was computed on the MAFFT-aligned file.Both bundled scripts auto-detect the layout and use the best available alignment per ortholog. Do NOT re-run MAFFT or ClipKit yourself; the shipped files are canonical.
Anti-pattern: running phykit on the raw *.busco.zip extracted
ortholog FASTAs and aligning/tree-building yourself. The pre-computed
files in scogs_*.zip are the canonical inputs.
PhyKIT, Biopython, and DendroPy for alignment/tree analysis, evolutionary metrics, and comparative genomics.
When uncertain about any scientific fact, SEARCH databases first.
FASTA/PHYLIP/Nexus/Newick files; treeness, RCV, DVMC, evolutionary rate, parsimony sites, tree length, bootstrap; group comparisons (Mann-Whitney U); tree construction (NJ/UPGMA/parsimony); Robinson-Foulds distance.
NOT for: MSA generation (MUSCLE/MAFFT), ML trees (IQ-TREE/RAxML), Bayesian (MrBayes/BEAST).
import numpy as np, pandas as pd
from scipy import stats
from Bio import AlignIO, Phylo, SeqIO
from phykit.services.tree.treeness import Treeness
from phykit.services.tree.total_tree_length import TotalTreeLength
from phykit.services.tree.evolutionary_rate import EvolutionaryRate
from phykit.services.tree.dvmc import DVMC
from phykit.services.tree.treeness_over_rcv import TreenessOverRCV
from phykit.services.alignment.parsimony_informative_sites import ParsimonyInformative
from phykit.services.alignment.rcv import RelativeCompositionVariability
import dendropy
ALIGNMENT ANALYSIS (FASTA/PHYLIP):
Parsimony sites → phykit_parsimony_informative()
RCV → phykit_rcv()
Gap % → alignment_gap_percentage()
TREE ANALYSIS (Newick):
Treeness → phykit_treeness()
Tree length → phykit_tree_length()
Evolutionary rate → phykit_evolutionary_rate()
DVMC → phykit_dvmc()
Bootstrap → extract_bootstrap_support()
COMBINED: Treeness/RCV → phykit_treeness_over_rcv(tree, aln)
TREE CONSTRUCTION: NJ → build_nj_tree(); UPGMA → build_upgma_tree(); Parsimony → build_parsimony_tree()
GROUP COMPARISON: batch metrics → Mann-Whitney U → summary stats
TREE COMPARISON: Robinson-Foulds → robinson_foulds_distance()
| Metric | Input | Description | |--------|-------|-------------| | Treeness | Newick | Internal / total branch length | | RCV | FASTA/PHYLIP | Relative Composition Variability | | Treeness/RCV | Both | Signal quality ratio | | Tree Length | Newick | Sum of all branch lengths | | Evolutionary Rate | Newick | Total length / num terminals | | DVMC | Newick | Degree of Violation of Molecular Clock | | Parsimony Sites | FASTA/PHYLIP | Sites with >=2 chars appearing >=2 times |
fungi_dvmc = batch_dvmc(discover_gene_files("data/fungi"))
animal_dvmc = batch_dvmc(discover_gene_files("data/animals"))
print(f"Fungi median: {np.median(list(fungi_dvmc.values())):.4f}")
u_stat, p_value = stats.mannwhitneyu(list(g1.values()), list(g2.values()), alternative='two-sided')
Filter by gap percentage < 5%, then compute treeness/RCV on filtered set.
gene_files = discover_gene_files("data/") # → [{gene_id, aln_file, tree_file}]
treeness_results = batch_treeness(gene_files) # → {gene_id: value}
| Pattern | Method |
|---------|--------|
| "median X" | np.median(values) |
| "maximum X" | np.max(values) |
| "difference in median" | abs(np.median(a) - np.median(b)) |
| "Mann-Whitney U" | stats.mannwhitneyu(a, b)[0] |
| "fold-change" | np.median(a) / np.median(b) |
Rounding: PhyKIT default 4 decimals. U stats = integer. Question wording overrides.
| Metric | Good | Acceptable | Poor | |--------|------|-----------|------| | Treeness | >0.8 | 0.5-0.8 | <0.5 | | RCV | <0.2 | 0.2-0.5 | >0.5 | | Treeness/RCV | >2.0 | 1.0-2.0 | <1.0 | | Bootstrap | >95% | 70-95% | <70% | | Parsimony sites | >30% | 10-30% | <10% |
All files identified; group structure detected; correct PhyKIT function; ALL genes processed (not sample); correct test; 4-decimal rounding; specific statistic (median/max/U/p); Mann-Whitney alternative='two-sided'.
phykit_batch_analysis tool for batch computationsFor ANY question asking for statistics across multiple trees/alignments (median treeness, mean saturation, DVMC percentage, gap percentage, long branch scores), use the ToolUniverse tool:
tu run phykit_batch_analysis '{"operation":"batch","function":"treeness","directory":"./trees","extension":".treefile"}'
tu run phykit_batch_analysis '{"operation":"batch","function":"saturation","directory":"./alignments","extension":".fa","tree_directory":"./trees","tree_extension":".treefile"}'
tu run phykit_batch_analysis '{"operation":"gap_percentage","directory":"./alignments","extension":".fa"}'
Do NOT run phykit manually in a loop — the tool handles all files and returns correct summary statistics.
When parsing PhyKIT stdout for batch metrics, the column you want depends on the metric:
| Command | Output columns | Column to take |
|---------|---------------|----------------|
| phykit saturation | slope <TAB> 1-slope | col 2 (1-slope) — proportion of evolutionary signal retained |
| phykit toverr (a.k.a. treeness_over_rcv) | treeness/RCV <TAB> treeness <TAB> RCV | col 1 (treeness/RCV ratio) |
| phykit long_branch_score -v (verbose) | taxon <TAB> score per line | aggregate scores per tree (mean) |
| phykit long_branch_score (no -v) | mean <TAB> median <TAB> 25%ile <TAB> 75%ile <TAB> min <TAB> max <TAB> std <TAB> var <TAB> n | col 1 (mean) for "mean LB score" |
| phykit patristic_distances (no -v) | summary stats line (same shape as LB) | col 1 (mean) for "mean patristic distance" |
Rule of thumb: phykit toverr and saturation produce multi-column lines per alignment. Don't grep the value that "looks like the answer" — count columns from the header in phykit <metric> --help. If your batch median is wildly off the published number (e.g., median treeness/RCV ≈ 0.20 when expected ≈ 0.26), you almost certainly picked the wrong column.
Preferred: don't parse phykit output by hand — call the phykit_batch_analysis tool, which already returns the correct column for each metric:
tu run phykit_batch_analysis '{"operation":"batch","function":"saturation","directory":"./alignments","extension":".fa","tree_directory":"./trees","tree_extension":".treefile"}'
tu run phykit_batch_analysis '{"operation":"batch","function":"treeness_over_rcv","directory":"./alignments","extension":".fa","tree_directory":"./trees","tree_extension":".treefile"}'
Sanity targets for biological scogs trees: median saturation ~0.4–0.7, median treeness/RCV ~0.2–0.4, median treeness ~0.05–0.15. Values an order of magnitude off these mean wrong column.
When the data folder has *.busco.zip files + target_orthologs.txt, use the bundled script — do NOT enumerate single_copy_busco_sequences/*.faa across all zips manually:
python skills/tooluniverse-phylogenetics/scripts/busco_target_orthologs.py \
--data-folder /path/to/data
The default run prints FIVE summary lines covering every common interpretation of "total amino acids":
# SUMMARY: n_targets=K, n_intersected=N (single-copy in ALL S species), intersected_total_aa=A, sum_all_aa=B
# SUMMARY group=all: intersected n=N total_aa=A, sum_all total_aa=B
# SUMMARY group=animals: sum_all total_aa=X <-- per-group sum (animal species only)
# SUMMARY group=fungi: sum_all total_aa=Y <-- per-group sum (fungal species only)
Match the question phrasing to the summary line:
| Question phrasing | Pick this line | Why |
|---|---|---|
| "total AA in all single-copy ortholog sequences" with only animal species in the data folder OR question mentions only one organism group | # SUMMARY group=animals: sum_all total_aa=... (or group=fungi) | scogs phylogenomics analyses are run PER GROUP; "all" refers to all orthologs WITHIN that group, not the union across groups |
| "total AA across orthologs single-copy in every / all species" | intersected_total_aa | strict intersection rule |
| "total AA across all per-species copies" | sum_all_aa (group=all) | only when the question says "all species" or the data folder has just one organism group |
Default rule when the data folder contains BOTH animal AND fungal busco
zips: published "total amino acids" answers almost always refer to
ONE group (the analysis group), NOT the cross-group union. Use
group=animals: sum_all or group=fungi: sum_all. Do NOT pick the
union number (sum_all_aa) unless the question explicitly says
"across all 8 species" or "fungi and animals combined".
The script emits the per-group sums BEFORE the union sum on stdout for
this exact reason — read the output line by line and stop at the
group=animals / group=fungi line that matches the analysis group
implied by the question.
Two-step rule when counting across BUSCO single_copy_busco_sequences/ data:
Find the comparison set first. If a target_orthologs.txt (or similar named subset list) exists in the data folder, that file IS the comparison set — restrict to those ortholog IDs only. Do not enumerate every BUSCO single-copy file across species. Do not assume "all" means the whole BUSCO output when a target list is provided.
Then apply the intersection rule. "Single-copy ortholog" across species means single-copy in EVERY species in the comparison set. If an ortholog is missing from one species' single_copy_busco_sequences/, exclude it from the count entirely — do not partially count the species that do have it.
Sanity check: if any species shows a much smaller per-ortholog count than others (e.g., one species at ~600 aa while others are 4000+ aa for the same ortholog set), the missing-from-some orthologs are inflating the per-ortholog average — drop them first.
Worked example. data folder has 8 species (4 animal, 4 fungal) *.busco.zip + target_orthologs.txt listing 10 ortholog IDs:
single_copy_busco_sequences/*.faa across all 8 species → ≈80 files → sum AA → answer 32228 (treats every per-species copy independently).single_copy in all 8 species → keep only intersected IDs (often 5/10 — some target IDs are multi-copy/missing in one species) → for kept IDs, sum AA across the 8 species → 13809.When a question asks for a median/percentile/mean across orthologs, your batch must include EVERY ortholog in the relevant comparison set:
scogs_fungi.zip ships ~255 fungal alignments+trees; scogs_animals.zip ships ~241. Median computed from a 10-file sample is NOT the published answer.phykit_batch_analysis, always point at the extracted scogs directory containing all per-ortholog files, not a hand-picked subset.Questions of the form "max X in genes with >70% gaps" require the filter to be applied before the max:
# 1. Compute gap% per alignment
# 2. Keep only alignments with gap% > 70
# 3. Compute treeness/RCV ON THE FILTERED SET
# 4. Take max
Computing the metric across all genes and then taking max returns the global max, which is wrong.
PhyKIT's long_branch_score -v outputs per-taxon LB scores (one row
per leaf in the tree). For per-tree summaries:
phykit long_branch_score -v <tree> → list of
per-taxon scores.Match the per-tree summary to the question phrasing:
| Question says... | Use --per-tree-stat ... |
|---|---|
| "mean long branch scores" | mean |
| "median long branch scores" | median |
| "average long branch score" (ambiguous) | run BOTH and pick the one matching numbers/units |
The bundled scogs_paired_compare.py --metric long_branch_score --per-tree-stat {mean,median} does steps 1+2 for both groups in one
pass and emits the cross-group MWU U + p-value directly.
Common error: averaging the four animal species and four fungal species directly without going through the per-tree step — this conflates species LB and ortholog LB and yields the wrong delta.
phykit toverr (a.k.a. treeness_over_rcv) takes BOTH alignment and tree. Use the trimmed alignment (*.faa.mafft.clipkit) paired with its treefile (*.faa.mafft.clipkit.treefile), not the raw .faa.mafft. ClipKit-trimmed alignments are what produced the canonical tree, so the RCV must be computed on the same trimmed alignment for the ratio to match published numbers.
AlignIO or the AMAS tool to iterate columns and count.Treeness = sum of internal branch lengths / total tree length. Internal branches are those that do not lead to a leaf (tip).
PhyKIT (pip install phykit) provides command-line functions for tree and alignment statistics. Common functions:
phykit treeness <tree_file> — outputs treeness (RCV) valuephykit saturation <alignment_file> -t <tree_file> — outputs saturation valuephykit dvmc <tree_file> — degree of violation of the molecular clockphykit long_branch_score <tree_file> — long-branch score (LBS)phykit alignment_length <alignment_file> — alignment lengthphykit parsimony_informative <alignment_file> — count parsimony informative sitesWhen running PhyKIT on multiple gene trees/alignments, use the bundled batch script:
# Treeness across all trees
python skills/tooluniverse-phylogenetics/scripts/phykit_batch.py \
--dir scogs_fungi --function treeness --ext .treefile --stat median
# Saturation with paired alignment+tree
python skills/tooluniverse-phylogenetics/scripts/phykit_batch.py \
--dir alignments --function saturation --tree-dir trees \
--ext .fa --tree-ext .treefile --stat median
# Long branch score (mean per tree, then median across trees)
python skills/tooluniverse-phylogenetics/scripts/phykit_batch.py \
--dir trees --function long_branch_score --ext .treefile \
--per-tree-stat mean --stat median
# DVMC
python skills/tooluniverse-phylogenetics/scripts/phykit_batch.py \
--dir trees --function dvmc --ext .treefile --stat all
# Gap percentage across all alignments
python skills/tooluniverse-phylogenetics/scripts/phykit_batch.py \
--dir alignments --function gap_percentage --ext .fa
# Evolutionary rate (median across trees)
python skills/tooluniverse-phylogenetics/scripts/phykit_batch.py \
--dir trees --function evolutionary_rate --ext .treefile --stat median
# Mean patristic distance per tree, then mean across trees
python skills/tooluniverse-phylogenetics/scripts/phykit_batch.py \
--dir trees --function patristic_distances --ext .treefile --stat mean
Preferred: use the phykit_batch_analysis ToolUniverse tool instead of running PhyKIT manually:
# Via CLI
tu run phykit_batch_analysis '{"operation":"batch","function":"treeness","directory":"/path/to/trees","extension":".treefile"}'
# Via SDK
tu.run_one_function({"name": "phykit_batch_analysis", "arguments": {"operation": "batch", "function": "saturation", "directory": "/path/to/alignments", "extension": ".fa", "tree_directory": "/path/to/trees"}})
# Gap percentage
tu run phykit_batch_analysis '{"operation":"gap_percentage","directory":"/path/to/alignments","extension":".fa"}'
Key rules:
"per_tree_stat":"mean"references/sequence_alignment.md, references/tree_building.md, references/parsimony_analysis.md, scripts/tree_statistics.py
tools
PCR / qPCR primer and oligo design — design forward/reverse primers for a target region (SantaLucia nearest-neighbor thermodynamics), compute melting temperature (Tm) and annealing temperature (Ta), check GC content, and screen an oligo for hairpins and primer-dimers. Use when you need primers for a sequence, want to QC an existing primer pair, or need the Tm of an oligo. Covers the primer-design rules (Tm matching, GC clamp, 3'-end, length) and the tools' constraint quirks.
tools
Pharmacokinetic (PK) analysis of concentration-time data — non-compartmental analysis (NCA) for Cmax, Tmax, AUC (0-t and 0-∞), terminal half-life, clearance (CL), volume of distribution (Vd), MRT, and absolute bioavailability (F). Also one-compartment fitting. Use when you have plasma/serum drug concentrations over time after a dose and need PK parameters, or to compute bioavailability from IV + oral AUCs. NOT for ADMET property prediction from structure (use tooluniverse-admet-prediction).
tools
Molecular cloning assembly design — Gibson Assembly (overlap design for seamless multi-fragment joining) and Golden Gate Assembly (Type IIS / BsaI / BbsI design with unique 4-bp fusion overhangs). Use when you need to plan how to join DNA fragments into a construct, design assembly overlaps/overhangs, or decide between cloning methods. Covers the domestication (internal-site removal), overhang-uniqueness, and overlap-Tm rules. For PCR primers to generate the fragments, see tooluniverse-primer-design.
tools
Meta-analysis / evidence synthesis — pool effect sizes across studies (odds ratios, risk ratios, hazard ratios, mean differences, correlations, GWAS betas) with fixed- or random-effects models, quantify heterogeneity (Q, I², τ²), and build a forest plot. Use when you have results from MULTIPLE studies and need a single pooled estimate, or to synthesize evidence from a systematic review / multiple GWAS / replicated experiments. Handles the error-prone effect-size + standard-error preparation (converting OR/HR/CI, two-group means±SD, proportions, and correlations into the (effect, SE) the pooling step needs).