Protein Phylogeny Inference

Pipeline for building protein phylogenies across deep evolutionary distances. Designed for single domains, whole single-domain proteins, and multi-domain proteins, from ~10 to several thousand sequences, spanning sponges to all eukaryotes.

If the user says "gene tree" they may mean protein sequences. Confirm if ambiguous, or inspect the input sequences (amino acid alphabet vs nucleotide).

Overview: Two-Phase Workflow

This skill generates a reproducible .qmd analysis script (Python) rather than running commands directly. The workflow is:

Discuss — Resolve all pipeline decisions with the user (input sequences, algorithm, model tier, trimming, etc.)
Generate — Write a complete .qmd script encoding all decisions as configuration variables, following quarto-docs and script-organization skill conventions
Render — User renders with quarto render (or Claude renders it), producing all outputs in outs/<subdirectory>/XX_script_name/

One script per major deliverable (one protein family or domain tree). All pipeline steps go in one .qmd unless there is a strong reason to split.

Phase 1: Discussion

Resolve these questions with the user before generating the script. Use the recommendation logic below to guide the conversation.

1. Input validation

Validate the FASTA file before proceeding. The generated script includes a validation chunk (see Phase 2) that runs these checks automatically, but also inspect the file during the discussion phase to inform algorithm/model recommendations.

File exists and is valid FASTA — parseable by BioPython, non-empty
Sequences are protein — check alphabet. If >90% of characters are A/T/G/C/N, warn the user: "These look like nucleotide sequences, not protein. Confirm?"
Basic stats — sequence count, median length, min/max length, length distribution
Duplicate IDs — flag any duplicate sequence headers (MAFFT will silently use only the last occurrence)
Empty sequences — flag sequences with length 0
Non-standard characters — report any characters outside the standard amino acid alphabet (ACDEFGHIKLMNPQRSTVWY) plus common extras (X, U, B, Z, J, O, *). A few X's are normal; many suggest quality issues
Stop codons — if * appears mid-sequence (not just at the end), warn the user

2. Minimum length cutoff

Calculate median sequence length
Flag sequences shorter than 50% of median: "You have N sequences shorter than 50% of the median length (X aa). Short fragments can degrade alignment quality. Want to remove them, use a different cutoff, or keep all?"

3. CD-HIT redundancy reduction (optional, off by default)

If requested: 99% identity threshold
Report how many sequences would be removed

4. MAFFT algorithm selection

| Condition | Algorithm | MAFFT flags | |-----------|-----------|-------------| | Single domain or whole protein, <= 2000 seq | L-INS-i (default) | --localpair --maxiterate 1000 --reorder | | Multi-domain protein with variable linkers | E-INS-i | --genafpair --maxiterate 1000 --reorder | | > 2000 sequences | auto (MAFFT decides) | --auto --reorder |

Recommendation: "You have N sequences, median length X aa. This looks like a [single-domain / multi-domain / whole-protein] dataset. I'd recommend MAFFT [L-INS-i / E-INS-i / auto]. Does that sound right?"

5. Trimming (optional, off by default)

Do not trim unless the user requests it. For deep divergences, aggressive trimming can remove phylogenetically informative sites. Good models (site-heterogeneous) handle noisy columns.

| Tool | Mode | Best for | |------|------|----------| | ClipKIT | kpic-gappy | General purpose | | BMGE | BLOSUM30 | Very deep divergences with compositional heterogeneity |

Deep datasets (cross-animal, cross-eukaryote): recommend BMGE
Otherwise: recommend ClipKIT

6. IQ-TREE tier selection

| Tier | Model | When to use | Approximate time (166 seq) | |------|-------|-------------|---------------------------| | 1 | MFP (ModelFinder) | Quick exploration, screening | ~15-30 min | | 1.5 | Q.pfam+F+R6 | Fast gene trees, batch screening — skip model search | ~10-20 min | | 2 | ELM+C60+G PMSF (default) | Deep eukaryotic phylogenies | ~3-8 hours | | 2 alt | LG+C60+F+R PMSF | Non-eukaryotic or for comparison | ~3-8 hours | | 3 | PhyloBayes CAT+GTR | Maximum rigor, < 200 seq | Hours to days |

Q.pfam: empirical protein model trained on Pfam alignments — solid general-purpose choice when model selection overhead isn't justified. +F uses empirical amino acid frequencies, +R6 uses 6 free-rate categories. Good for batch phylogenetics on the cluster where you're building many trees and don't need per-dataset model optimization.
ELM (Eukaryotic Linked Mixture): better than LG with profile mixtures for eukaryotic data (Banos et al. 2024, MBE)
C60: 60-class amino acid profile mixture
PMSF: posterior mean site frequency profiles (ML approximation of Bayesian CAT)

7. Branch support

Default: UFBoot2 (1000) + SH-aLRT (1000), simultaneous with -B 1000 -alrt 1000
Optional: Non-parametric bootstrap (-b 1000) for publication-quality final trees
Note: UFBoot >= 95 ~ traditional bootstrap >= 70

8. No automated outlier removal

The user inspects the tree after inference and decides if anything looks wrong. Do not use TreeShrink or similar tools automatically.

Phase 2: Script Generation

Once all decisions are resolved, generate a .qmd script following these conventions.

Naming and placement

Follow the script-organization skill:

Script: scripts/<subdirectory>/XX_name.qmd (next available number)
Outputs: outs/<subdirectory>/XX_name/
Register in project CLAUDE.md script table

.qmd structure

Generate a Python .qmd with these sections in order:

YAML frontmatter (standard quarto-docs Python template)
Setup chunk (PROJECT_ROOT, out_dir, git_hash, imports)
Configuration chunk — all user decisions as named variables
Inputs chunk (FASTA path, clearly labeled)
Input validation chunk (FASTA validation + sequence stats + length filtering)
[Optional] CD-HIT chunk (controlled by config flag)
Alignment chunk (MAFFT via subprocess)
[Optional] Trimming chunk (controlled by config flag)
Tree inference chunk (IQ-TREE, tier from config)
Summary chunk (alignment stats, model, support values)
Build info chunk (standard BUILD_INFO.txt)

Key patterns

Configuration chunk — all decisions as named variables at the top:

#| label: config

# ---- Pipeline Configuration ----
# Resolved during discussion; change here to re-run with different options.

INPUT_FASTA = PROJECT_ROOT / "data/phylogenetics/sequences.fasta"
MIN_LENGTH_FRACTION = 0.5        # Remove sequences shorter than this fraction of median

RUN_CDHIT = False                # Redundancy reduction
CDHIT_THRESHOLD = 0.99

MAFFT_ALGORITHM = "linsi"        # "linsi" | "einsi" | "auto"
MAFFT_THREADS = 8                # Parallel threads for MAFFT alignment

RUN_TRIMMING = False
TRIMMING_TOOL = "clipkit"        # "clipkit" | "bmge"

IQTREE_TIER = 2                  # 1 = MFP, 1.5 = Q.pfam+F+R6, 2 = ELM+C60+G PMSF, 3 = PhyloBayes
IQTREE_MATRIX = "ELM"           # "ELM" | "LG" (Tier 2 only)
IQTREE_THREADS = 8              # Fixed thread count (AUTO crashes with PMSF in v3.0.1)
BOOTSTRAP_REPS = 1000

FASTA validation chunk — runs before any analysis:

#| label: validate-input

from Bio import SeqIO
from collections import Counter

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")
EXTENDED_AA = set("XUBZJO*")

fasta_path = INPUT_FASTA
assert fasta_path.exists(), f"FASTA file not found: {fasta_path}"

records = list(SeqIO.parse(fasta_path, "fasta"))
assert len(records) > 0, "FASTA file is empty or not valid FASTA format"

# Basic stats
lengths = [len(r.seq) for r in records]
median_len = sorted(lengths)[len(lengths) // 2]
print(f"Sequences: {len(records)}")
print(f"Length: median {median_len}, min {min(lengths)}, max {max(lengths)}")

# Duplicate IDs
ids = [r.id for r in records]
dupes = [id for id, count in Counter(ids).items() if count > 1]
if dupes:
    print(f"WARNING: {len(dupes)} duplicate IDs — MAFFT keeps only last: {dupes[:5]}")

# Empty sequences
empty = [r.id for r in records if len(r.seq) == 0]
if empty:
    print(f"WARNING: {len(empty)} empty sequences: {empty[:5]}")

# Alphabet check — nucleotide vs protein
all_chars = Counter()
for r in records:
    all_chars.update(str(r.seq).upper())
total_chars = sum(all_chars.values())
nuc_chars = sum(all_chars.get(c, 0) for c in "ATGCN")
if total_chars > 0 and nuc_chars / total_chars > 0.9:
    print(f"WARNING: {nuc_chars/total_chars:.0%} nucleotide chars — may not be protein")

# Non-standard characters
nonstandard = {c: n for c, n in all_chars.items()
               if c not in VALID_AA and c not in EXTENDED_AA and c != "-"}
if nonstandard:
    print(f"WARNING: Non-standard characters: {nonstandard}")

# Stop codons mid-sequence
for r in records:
    seq = str(r.seq)
    if "*" in seq[:-1]:  # ignore terminal stop
        print(f"WARNING: Internal stop codon (*) in {r.id}")

# Short sequence filter
short_cutoff = median_len * MIN_LENGTH_FRACTION
short = [r.id for r in records if len(r.seq) < short_cutoff]
if short:
    print(f"\n{len(short)} sequences < {short_cutoff:.0f} aa "
          f"({MIN_LENGTH_FRACTION:.0%} of median)")
    records = [r for r in records if len(r.seq) >= short_cutoff]
    print(f"Retained {len(records)} sequences after length filter")
    filtered_fasta = out_dir / "input_filtered.fasta"
    SeqIO.write(records, filtered_fasta, "fasta")
else:
    print(f"\nAll sequences >= {short_cutoff:.0f} aa — no filtering needed")
    filtered_fasta = fasta_path

Caching pattern — skip expensive steps if output exists:

aligned_fasta = out_dir / "aligned.fasta"

if not aligned_fasta.exists():
    print("Running MAFFT alignment...")
    # ... subprocess call ...
    print(f"  Alignment: {n_seqs} sequences x {n_cols} columns")
else:
    print(f"Alignment cached at {aligned_fasta.name}, skipping MAFFT")

Subprocess pattern — consistent error handling:

result = subprocess.run(
    cmd, capture_output=True, text=True
)
if result.returncode != 0:
    print(f"STDERR: {result.stderr[:500]}")
    raise RuntimeError("MAFFT failed")

MAFFT algorithm mapping

MAFFT_FLAGS = {
    "linsi": ["--localpair", "--maxiterate", "1000", "--reorder", "--thread", str(MAFFT_THREADS)],
    "einsi": ["--genafpair", "--maxiterate", "1000", "--reorder", "--thread", str(MAFFT_THREADS)],
    "auto":  ["--auto", "--reorder", "--thread", str(MAFFT_THREADS)],
}

IQ-TREE commands by tier

Tier 1 — Quick baseline:

cmd = [
    "iqtree3", "-s", str(input_aln),
    "-m", "MFP",
    "-B", str(BOOTSTRAP_REPS), "-alrt", str(BOOTSTRAP_REPS),
    "-nstop", "50", "-T", "AUTO",
    "-pre", str(out_dir / "tree")
]

Tier 1.5 — Fast fixed model (batch screening, no model search):

cmd = [
    "iqtree3", "-s", str(input_aln),
    "-m", "Q.pfam+F+R6",
    "-B", str(BOOTSTRAP_REPS), "-alrt", str(BOOTSTRAP_REPS),
    "-nstop", "50", "-T", "AUTO",
    "-pre", str(out_dir / "tree")
]

Tier 2 — Two-pass PMSF (default for deep eukaryotic phylogenies):

model = f"{IQTREE_MATRIX}+C60+G" if IQTREE_MATRIX == "ELM" else f"{IQTREE_MATRIX}+C60+F+R"

# Pass 1: guide tree
guide_tree = out_dir / "guide.treefile"
if not guide_tree.exists():
    cmd_pass1 = [
        "iqtree3", "-s", str(input_aln),
        "-m", model,
        "-nstop", "50", "-T", str(IQTREE_THREADS),
        "-pre", str(out_dir / "guide")
    ]
    subprocess.run(cmd_pass1, check=True)

# Pass 2: PMSF tree with bootstrap
final_tree = out_dir / "pmsf.treefile"
if not final_tree.exists():
    cmd_pass2 = [
        "iqtree3", "-s", str(input_aln),
        "-m", model,
        "-ft", str(guide_tree),
        "-B", str(BOOTSTRAP_REPS), "-alrt", str(BOOTSTRAP_REPS),
        "-nstop", "50", "-T", str(IQTREE_THREADS),
        "-pre", str(out_dir / "pmsf")
    ]
    subprocess.run(cmd_pass2, check=True)

Bug workaround (IQ-TREE 3.0.1): -T AUTO crashes with an assertion error during PMSF site frequency computation (computePartialParsimonyFast). Use a fixed thread count (e.g., -T 8) instead. Re-test with future IQ-TREE releases.

Tier 3 — PhyloBayes (special case):

PhyloBayes runs for hours to days and does not fit the "render once" model. The .qmd should launch chains and note that convergence must be checked separately:

print("PhyloBayes requires long-running chains.")
print("Launch manually:")
print(f"  pb_mpi -d {input_phy} -cat -gtr -x 1 10000 chain1")
print(f"  pb_mpi -d {input_phy} -cat -gtr -x 1 10000 chain2")
print("Check convergence with bpcomp -x 2500 chain1 chain2")

CAT+GTR: infinite mixture model, gold standard for deep phylogenetics
Very slow; not practical for > ~200 sequences

Summary chunk

Parse IQ-TREE output and report:

# Read support values from treefile
import re
tree_text = (out_dir / "pmsf.treefile").read_text()
supports = re.findall(r'\)([\d.]+)/([\d.]+):', tree_text)
ufboot = [float(s[1]) for s in supports]
shalrt = [float(s[0]) for s in supports]

total = len(ufboot)
print(f"=== Tree Inference Summary ===")
print(f"Sequences: {n_seqs}")
print(f"Alignment: {n_cols} columns")
print(f"Model: {model} (Tier {IQTREE_TIER})")
print(f"UFBoot >= 95: {sum(1 for v in ufboot if v >= 95)}/{total}")
print(f"UFBoot >= 70: {sum(1 for v in ufboot if v >= 70)}/{total}")
print(f"SH-aLRT >= 80: {sum(1 for v in shalrt if v >= 80)}/{total}")
print(f"\nTree file: {final_tree}")
print(f"\nNext: use tree-formatting skill for visualization")

Phase 3: After Rendering

Output files

The rendered script produces these files in outs/<subdirectory>/XX_name/:

| File | Description | |------|-------------| | pmsf.treefile | ML tree with UFBoot/SH-aLRT support (Tier 2) | | pmsf.contree | Bootstrap consensus tree | | pmsf.iqtree | Full IQ-TREE report | | pmsf.sitefreq | PMSF site frequency profiles | | guide.treefile | Pass 1 guide tree (Tier 2) | | aligned.fasta | MAFFT alignment | | BUILD_INFO.txt | Script provenance |

How to render

source ~/miniconda3/etc/profile.d/conda.sh && conda activate <env>
quarto render scripts/<subdirectory>/XX_name.qmd --output-dir outs/<subdirectory>/XX_name/

For long-running Tier 2 analyses, consider rendering in a screen/tmux session.

Next steps

"Tree is ready. Want to visualize it? The tree-formatting skill handles layout, coloring, and clade collapsing."
"Want to re-run with trimming / different model / PhyloBayes?"

Optional Sensitivity Analyses

These are not part of the standard pipeline but can be added as additional chunks:

Dayhoff6 recoding: Reduces 20 amino acids to 6 physicochemical groups. Tests whether topology is driven by compositional convergence. Run IQ-TREE with -m GTR+R on the recoded alignment.
AU topology test: If competing hypotheses exist for specific nodes. Run IQ-TREE with -zb 10000 -au.

Required Tools

| Tool | Purpose | Install | |------|---------|---------| | MAFFT | Alignment | conda install -c bioconda mafft | | IQ-TREE 3 | Tree inference | conda install -c bioconda iqtree (ensure v3+) | | BioPython | FASTA I/O, sequence handling | pip install biopython | | CD-HIT | Redundancy reduction (optional) | conda install -c bioconda cd-hit | | ClipKIT | Trimming (optional) | pip install clipkit | | BMGE | Trimming (optional) | conda install -c bioconda bmge | | PhyloBayes-MPI | Bayesian inference (optional) | conda install -c bioconda phylobayes-mpi |

Key References

Katoh & Standley (2013) MAFFT v7. Mol Biol Evol 30:772-780.
Banos et al. (2024) GTRpmix: A linked GTR model for profile mixture models. Mol Biol Evol 41:msae174.
Steenwyk et al. (2020) ClipKIT: a multiple sequence alignment trimming software. PLoS Biol 18:e3001007.
Criscuolo & Gribaldo (2010) BMGE. BMC Evol Biol 10:210.
Wang et al. (2017) PMSF model. Syst Biol 67:216-235.
Lartillot et al. (2013) PhyloBayes-MPI. Syst Biol 62:611-615.

Protein Phylogeny Inference

If the user says "gene tree" they may mean protein sequences. Confirm if ambiguous, or inspect the input sequences (amino acid alphabet vs nucleotide).

Overview: Two-Phase Workflow

This skill generates a reproducible .qmd analysis script (Python) rather than running commands directly. The workflow is:

Discuss — Resolve all pipeline decisions with the user (input sequences, algorithm, model tier, trimming, etc.)
Generate — Write a complete .qmd script encoding all decisions as configuration variables, following quarto-docs and script-organization skill conventions
Render — User renders with quarto render (or Claude renders it), producing all outputs in outs/<subdirectory>/XX_script_name/

One script per major deliverable (one protein family or domain tree). All pipeline steps go in one .qmd unless there is a strong reason to split.

Phase 1: Discussion

Resolve these questions with the user before generating the script. Use the recommendation logic below to guide the conversation.

1. Input validation

File exists and is valid FASTA — parseable by BioPython, non-empty
Sequences are protein — check alphabet. If >90% of characters are A/T/G/C/N, warn the user: "These look like nucleotide sequences, not protein. Confirm?"
Basic stats — sequence count, median length, min/max length, length distribution
Duplicate IDs — flag any duplicate sequence headers (MAFFT will silently use only the last occurrence)
Empty sequences — flag sequences with length 0
Non-standard characters — report any characters outside the standard amino acid alphabet (ACDEFGHIKLMNPQRSTVWY) plus common extras (X, U, B, Z, J, O, *). A few X's are normal; many suggest quality issues
Stop codons — if * appears mid-sequence (not just at the end), warn the user

2. Minimum length cutoff

Calculate median sequence length
Flag sequences shorter than 50% of median: "You have N sequences shorter than 50% of the median length (X aa). Short fragments can degrade alignment quality. Want to remove them, use a different cutoff, or keep all?"

3. CD-HIT redundancy reduction (optional, off by default)

If requested: 99% identity threshold
Report how many sequences would be removed

4. MAFFT algorithm selection

5. Trimming (optional, off by default)

Do not trim unless the user requests it. For deep divergences, aggressive trimming can remove phylogenetically informative sites. Good models (site-heterogeneous) handle noisy columns.

| Tool | Mode | Best for | |------|------|----------| | ClipKIT | kpic-gappy | General purpose | | BMGE | BLOSUM30 | Very deep divergences with compositional heterogeneity |

Deep datasets (cross-animal, cross-eukaryote): recommend BMGE
Otherwise: recommend ClipKIT

6. IQ-TREE tier selection

Q.pfam: empirical protein model trained on Pfam alignments — solid general-purpose choice when model selection overhead isn't justified. +F uses empirical amino acid frequencies, +R6 uses 6 free-rate categories. Good for batch phylogenetics on the cluster where you're building many trees and don't need per-dataset model optimization.
ELM (Eukaryotic Linked Mixture): better than LG with profile mixtures for eukaryotic data (Banos et al. 2024, MBE)
C60: 60-class amino acid profile mixture
PMSF: posterior mean site frequency profiles (ML approximation of Bayesian CAT)

7. Branch support

Default: UFBoot2 (1000) + SH-aLRT (1000), simultaneous with -B 1000 -alrt 1000
Optional: Non-parametric bootstrap (-b 1000) for publication-quality final trees
Note: UFBoot >= 95 ~ traditional bootstrap >= 70

8. No automated outlier removal

The user inspects the tree after inference and decides if anything looks wrong. Do not use TreeShrink or similar tools automatically.

Phase 2: Script Generation

Once all decisions are resolved, generate a .qmd script following these conventions.

Naming and placement

Follow the script-organization skill:

Script: scripts/<subdirectory>/XX_name.qmd (next available number)
Outputs: outs/<subdirectory>/XX_name/
Register in project CLAUDE.md script table

.qmd structure

Generate a Python .qmd with these sections in order:

YAML frontmatter (standard quarto-docs Python template)
Setup chunk (PROJECT_ROOT, out_dir, git_hash, imports)
Configuration chunk — all user decisions as named variables
Inputs chunk (FASTA path, clearly labeled)
Input validation chunk (FASTA validation + sequence stats + length filtering)
[Optional] CD-HIT chunk (controlled by config flag)
Alignment chunk (MAFFT via subprocess)
[Optional] Trimming chunk (controlled by config flag)
Tree inference chunk (IQ-TREE, tier from config)
Summary chunk (alignment stats, model, support values)
Build info chunk (standard BUILD_INFO.txt)

Key patterns

Configuration chunk — all decisions as named variables at the top:

#| label: config

# ---- Pipeline Configuration ----
# Resolved during discussion; change here to re-run with different options.

INPUT_FASTA = PROJECT_ROOT / "data/phylogenetics/sequences.fasta"
MIN_LENGTH_FRACTION = 0.5        # Remove sequences shorter than this fraction of median

RUN_CDHIT = False                # Redundancy reduction
CDHIT_THRESHOLD = 0.99

MAFFT_ALGORITHM = "linsi"        # "linsi" | "einsi" | "auto"
MAFFT_THREADS = 8                # Parallel threads for MAFFT alignment

RUN_TRIMMING = False
TRIMMING_TOOL = "clipkit"        # "clipkit" | "bmge"

IQTREE_TIER = 2                  # 1 = MFP, 1.5 = Q.pfam+F+R6, 2 = ELM+C60+G PMSF, 3 = PhyloBayes
IQTREE_MATRIX = "ELM"           # "ELM" | "LG" (Tier 2 only)
IQTREE_THREADS = 8              # Fixed thread count (AUTO crashes with PMSF in v3.0.1)
BOOTSTRAP_REPS = 1000

FASTA validation chunk — runs before any analysis:

#| label: validate-input

from Bio import SeqIO
from collections import Counter

VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")
EXTENDED_AA = set("XUBZJO*")

fasta_path = INPUT_FASTA
assert fasta_path.exists(), f"FASTA file not found: {fasta_path}"

records = list(SeqIO.parse(fasta_path, "fasta"))
assert len(records) > 0, "FASTA file is empty or not valid FASTA format"

# Basic stats
lengths = [len(r.seq) for r in records]
median_len = sorted(lengths)[len(lengths) // 2]
print(f"Sequences: {len(records)}")
print(f"Length: median {median_len}, min {min(lengths)}, max {max(lengths)}")

# Duplicate IDs
ids = [r.id for r in records]
dupes = [id for id, count in Counter(ids).items() if count > 1]
if dupes:
    print(f"WARNING: {len(dupes)} duplicate IDs — MAFFT keeps only last: {dupes[:5]}")

# Empty sequences
empty = [r.id for r in records if len(r.seq) == 0]
if empty:
    print(f"WARNING: {len(empty)} empty sequences: {empty[:5]}")

# Alphabet check — nucleotide vs protein
all_chars = Counter()
for r in records:
    all_chars.update(str(r.seq).upper())
total_chars = sum(all_chars.values())
nuc_chars = sum(all_chars.get(c, 0) for c in "ATGCN")
if total_chars > 0 and nuc_chars / total_chars > 0.9:
    print(f"WARNING: {nuc_chars/total_chars:.0%} nucleotide chars — may not be protein")

# Non-standard characters
nonstandard = {c: n for c, n in all_chars.items()
               if c not in VALID_AA and c not in EXTENDED_AA and c != "-"}
if nonstandard:
    print(f"WARNING: Non-standard characters: {nonstandard}")

# Stop codons mid-sequence
for r in records:
    seq = str(r.seq)
    if "*" in seq[:-1]:  # ignore terminal stop
        print(f"WARNING: Internal stop codon (*) in {r.id}")

# Short sequence filter
short_cutoff = median_len * MIN_LENGTH_FRACTION
short = [r.id for r in records if len(r.seq) < short_cutoff]
if short:
    print(f"\n{len(short)} sequences < {short_cutoff:.0f} aa "
          f"({MIN_LENGTH_FRACTION:.0%} of median)")
    records = [r for r in records if len(r.seq) >= short_cutoff]
    print(f"Retained {len(records)} sequences after length filter")
    filtered_fasta = out_dir / "input_filtered.fasta"
    SeqIO.write(records, filtered_fasta, "fasta")
else:
    print(f"\nAll sequences >= {short_cutoff:.0f} aa — no filtering needed")
    filtered_fasta = fasta_path

Caching pattern — skip expensive steps if output exists:

aligned_fasta = out_dir / "aligned.fasta"

if not aligned_fasta.exists():
    print("Running MAFFT alignment...")
    # ... subprocess call ...
    print(f"  Alignment: {n_seqs} sequences x {n_cols} columns")
else:
    print(f"Alignment cached at {aligned_fasta.name}, skipping MAFFT")

Subprocess pattern — consistent error handling:

result = subprocess.run(
    cmd, capture_output=True, text=True
)
if result.returncode != 0:
    print(f"STDERR: {result.stderr[:500]}")
    raise RuntimeError("MAFFT failed")

MAFFT algorithm mapping

MAFFT_FLAGS = {
    "linsi": ["--localpair", "--maxiterate", "1000", "--reorder", "--thread", str(MAFFT_THREADS)],
    "einsi": ["--genafpair", "--maxiterate", "1000", "--reorder", "--thread", str(MAFFT_THREADS)],
    "auto":  ["--auto", "--reorder", "--thread", str(MAFFT_THREADS)],
}

IQ-TREE commands by tier

Tier 1 — Quick baseline:

cmd = [
    "iqtree3", "-s", str(input_aln),
    "-m", "MFP",
    "-B", str(BOOTSTRAP_REPS), "-alrt", str(BOOTSTRAP_REPS),
    "-nstop", "50", "-T", "AUTO",
    "-pre", str(out_dir / "tree")
]

Tier 1.5 — Fast fixed model (batch screening, no model search):

cmd = [
    "iqtree3", "-s", str(input_aln),
    "-m", "Q.pfam+F+R6",
    "-B", str(BOOTSTRAP_REPS), "-alrt", str(BOOTSTRAP_REPS),
    "-nstop", "50", "-T", "AUTO",
    "-pre", str(out_dir / "tree")
]

Tier 2 — Two-pass PMSF (default for deep eukaryotic phylogenies):

model = f"{IQTREE_MATRIX}+C60+G" if IQTREE_MATRIX == "ELM" else f"{IQTREE_MATRIX}+C60+F+R"

# Pass 1: guide tree
guide_tree = out_dir / "guide.treefile"
if not guide_tree.exists():
    cmd_pass1 = [
        "iqtree3", "-s", str(input_aln),
        "-m", model,
        "-nstop", "50", "-T", str(IQTREE_THREADS),
        "-pre", str(out_dir / "guide")
    ]
    subprocess.run(cmd_pass1, check=True)

# Pass 2: PMSF tree with bootstrap
final_tree = out_dir / "pmsf.treefile"
if not final_tree.exists():
    cmd_pass2 = [
        "iqtree3", "-s", str(input_aln),
        "-m", model,
        "-ft", str(guide_tree),
        "-B", str(BOOTSTRAP_REPS), "-alrt", str(BOOTSTRAP_REPS),
        "-nstop", "50", "-T", str(IQTREE_THREADS),
        "-pre", str(out_dir / "pmsf")
    ]
    subprocess.run(cmd_pass2, check=True)

Bug workaround (IQ-TREE 3.0.1): -T AUTO crashes with an assertion error during PMSF site frequency computation (computePartialParsimonyFast). Use a fixed thread count (e.g., -T 8) instead. Re-test with future IQ-TREE releases.

Tier 3 — PhyloBayes (special case):

PhyloBayes runs for hours to days and does not fit the "render once" model. The .qmd should launch chains and note that convergence must be checked separately:

print("PhyloBayes requires long-running chains.")
print("Launch manually:")
print(f"  pb_mpi -d {input_phy} -cat -gtr -x 1 10000 chain1")
print(f"  pb_mpi -d {input_phy} -cat -gtr -x 1 10000 chain2")
print("Check convergence with bpcomp -x 2500 chain1 chain2")

CAT+GTR: infinite mixture model, gold standard for deep phylogenetics
Very slow; not practical for > ~200 sequences

Summary chunk

Parse IQ-TREE output and report:

# Read support values from treefile
import re
tree_text = (out_dir / "pmsf.treefile").read_text()
supports = re.findall(r'\)([\d.]+)/([\d.]+):', tree_text)
ufboot = [float(s[1]) for s in supports]
shalrt = [float(s[0]) for s in supports]

total = len(ufboot)
print(f"=== Tree Inference Summary ===")
print(f"Sequences: {n_seqs}")
print(f"Alignment: {n_cols} columns")
print(f"Model: {model} (Tier {IQTREE_TIER})")
print(f"UFBoot >= 95: {sum(1 for v in ufboot if v >= 95)}/{total}")
print(f"UFBoot >= 70: {sum(1 for v in ufboot if v >= 70)}/{total}")
print(f"SH-aLRT >= 80: {sum(1 for v in shalrt if v >= 80)}/{total}")
print(f"\nTree file: {final_tree}")
print(f"\nNext: use tree-formatting skill for visualization")

Phase 3: After Rendering

Output files

The rendered script produces these files in outs/<subdirectory>/XX_name/:

How to render

source ~/miniconda3/etc/profile.d/conda.sh && conda activate <env>
quarto render scripts/<subdirectory>/XX_name.qmd --output-dir outs/<subdirectory>/XX_name/

For long-running Tier 2 analyses, consider rendering in a screen/tmux session.

Next steps

"Tree is ready. Want to visualize it? The tree-formatting skill handles layout, coloring, and clade collapsing."
"Want to re-run with trimming / different model / PhyloBayes?"

Optional Sensitivity Analyses

These are not part of the standard pipeline but can be added as additional chunks:

Dayhoff6 recoding: Reduces 20 amino acids to 6 physicochemical groups. Tests whether topology is driven by compositional convergence. Run IQ-TREE with -m GTR+R on the recoded alignment.
AU topology test: If competing hypotheses exist for specific nodes. Run IQ-TREE with -zb 10000 -au.

Required Tools

Key References

Katoh & Standley (2013) MAFFT v7. Mol Biol Evol 30:772-780.
Banos et al. (2024) GTRpmix: A linked GTR model for profile mixture models. Mol Biol Evol 41:msae174.
Steenwyk et al. (2020) ClipKIT: a multiple sequence alignment trimming software. PLoS Biol 18:e3001007.
Criscuolo & Gribaldo (2010) BMGE. BMC Evol Biol 10:210.
Wang et al. (2017) PMSF model. Syst Biol 67:216-235.
Lartillot et al. (2013) PhyloBayes-MPI. Syst Biol 62:611-615.

Adoption

musserlab/protein-phylogeny

$ install --global

Security Scan Results

SKILL.md

Protein Phylogeny Inference

Overview: Two-Phase Workflow

Phase 1: Discussion

1. Input validation

2. Minimum length cutoff

3. CD-HIT redundancy reduction (optional, off by default)

4. MAFFT algorithm selection

5. Trimming (optional, off by default)

6. IQ-TREE tier selection

7. Branch support

8. No automated outlier removal

Phase 2: Script Generation

Naming and placement

.qmd structure

Key patterns

MAFFT algorithm mapping

IQ-TREE commands by tier

Summary chunk

Phase 3: After Rendering

Output files

How to render

Next steps

Optional Sensitivity Analyses

Required Tools

Key References

Related Skills

musserlab/tree-formatting

musserlab/security-setup

musserlab/script-organization

musserlab/r-renv

musserlab/protein-phylogeny

$ install --global

Security Scan Results

SKILL.md

Protein Phylogeny Inference

Overview: Two-Phase Workflow

Phase 1: Discussion

1. Input validation

2. Minimum length cutoff

3. CD-HIT redundancy reduction (optional, off by default)

4. MAFFT algorithm selection

5. Trimming (optional, off by default)

6. IQ-TREE tier selection

7. Branch support

8. No automated outlier removal

Phase 2: Script Generation

Naming and placement

.qmd structure

Key patterns

MAFFT algorithm mapping

IQ-TREE commands by tier

Summary chunk

Phase 3: After Rendering

Output files

How to render

Next steps

Optional Sensitivity Analyses

Required Tools

Key References

Related Skills

musserlab/tree-formatting

musserlab/security-setup

musserlab/script-organization

musserlab/r-renv