skills/protein-phylogeny/SKILL.md
Protein (gene) phylogeny inference pipeline: generates a .qmd analysis script that performs alignment, optional trimming, and tree building. Use when building phylogenetic trees from protein sequences, aligning protein families, running IQ-TREE or MAFFT for phylogenetics, or when the user says "gene tree" or "protein tree." Covers single domains, whole proteins, and multi-domain proteins across deep evolutionary distances (sponges, animals, eukaryotes). Do NOT load for nucleotide-only phylogenies, species trees from concatenated matrices, or tree visualization (use tree-formatting skill for that).
npx skillsauth add musserlab/lab-claude-skills protein-phylogenyInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Pipeline for building protein phylogenies across deep evolutionary distances. Designed for single domains, whole single-domain proteins, and multi-domain proteins, from ~10 to several thousand sequences, spanning sponges to all eukaryotes.
If the user says "gene tree" they may mean protein sequences. Confirm if ambiguous, or inspect the input sequences (amino acid alphabet vs nucleotide).
This skill generates a reproducible .qmd analysis script (Python) rather than running commands directly. The workflow is:
.qmd script encoding all decisions as configuration
variables, following quarto-docs and script-organization skill conventionsquarto render (or Claude renders it), producing all
outputs in outs/<subdirectory>/XX_script_name/One script per major deliverable (one protein family or domain tree). All pipeline steps
go in one .qmd unless there is a strong reason to split.
Resolve these questions with the user before generating the script. Use the recommendation logic below to guide the conversation.
Validate the FASTA file before proceeding. The generated script includes a validation chunk (see Phase 2) that runs these checks automatically, but also inspect the file during the discussion phase to inform algorithm/model recommendations.
* appears mid-sequence (not just at the end), warn the user| Condition | Algorithm | MAFFT flags |
|-----------|-----------|-------------|
| Single domain or whole protein, <= 2000 seq | L-INS-i (default) | --localpair --maxiterate 1000 --reorder |
| Multi-domain protein with variable linkers | E-INS-i | --genafpair --maxiterate 1000 --reorder |
| > 2000 sequences | auto (MAFFT decides) | --auto --reorder |
Recommendation: "You have N sequences, median length X aa. This looks like a [single-domain / multi-domain / whole-protein] dataset. I'd recommend MAFFT [L-INS-i / E-INS-i / auto]. Does that sound right?"
Do not trim unless the user requests it. For deep divergences, aggressive trimming can remove phylogenetically informative sites. Good models (site-heterogeneous) handle noisy columns.
| Tool | Mode | Best for |
|------|------|----------|
| ClipKIT | kpic-gappy | General purpose |
| BMGE | BLOSUM30 | Very deep divergences with compositional heterogeneity |
| Tier | Model | When to use | Approximate time (166 seq) | |------|-------|-------------|---------------------------| | 1 | MFP (ModelFinder) | Quick exploration, screening | ~15-30 min | | 1.5 | Q.pfam+F+R6 | Fast gene trees, batch screening — skip model search | ~10-20 min | | 2 | ELM+C60+G PMSF (default) | Deep eukaryotic phylogenies | ~3-8 hours | | 2 alt | LG+C60+F+R PMSF | Non-eukaryotic or for comparison | ~3-8 hours | | 3 | PhyloBayes CAT+GTR | Maximum rigor, < 200 seq | Hours to days |
-B 1000 -alrt 1000-b 1000) for publication-quality final treesThe user inspects the tree after inference and decides if anything looks wrong. Do not use TreeShrink or similar tools automatically.
Once all decisions are resolved, generate a .qmd script following these conventions.
Follow the script-organization skill:
scripts/<subdirectory>/XX_name.qmd (next available number)outs/<subdirectory>/XX_name/Generate a Python .qmd with these sections in order:
Configuration chunk — all decisions as named variables at the top:
#| label: config
# ---- Pipeline Configuration ----
# Resolved during discussion; change here to re-run with different options.
INPUT_FASTA = PROJECT_ROOT / "data/phylogenetics/sequences.fasta"
MIN_LENGTH_FRACTION = 0.5 # Remove sequences shorter than this fraction of median
RUN_CDHIT = False # Redundancy reduction
CDHIT_THRESHOLD = 0.99
MAFFT_ALGORITHM = "linsi" # "linsi" | "einsi" | "auto"
MAFFT_THREADS = 8 # Parallel threads for MAFFT alignment
RUN_TRIMMING = False
TRIMMING_TOOL = "clipkit" # "clipkit" | "bmge"
IQTREE_TIER = 2 # 1 = MFP, 1.5 = Q.pfam+F+R6, 2 = ELM+C60+G PMSF, 3 = PhyloBayes
IQTREE_MATRIX = "ELM" # "ELM" | "LG" (Tier 2 only)
IQTREE_THREADS = 8 # Fixed thread count (AUTO crashes with PMSF in v3.0.1)
BOOTSTRAP_REPS = 1000
FASTA validation chunk — runs before any analysis:
#| label: validate-input
from Bio import SeqIO
from collections import Counter
VALID_AA = set("ACDEFGHIKLMNPQRSTVWY")
EXTENDED_AA = set("XUBZJO*")
fasta_path = INPUT_FASTA
assert fasta_path.exists(), f"FASTA file not found: {fasta_path}"
records = list(SeqIO.parse(fasta_path, "fasta"))
assert len(records) > 0, "FASTA file is empty or not valid FASTA format"
# Basic stats
lengths = [len(r.seq) for r in records]
median_len = sorted(lengths)[len(lengths) // 2]
print(f"Sequences: {len(records)}")
print(f"Length: median {median_len}, min {min(lengths)}, max {max(lengths)}")
# Duplicate IDs
ids = [r.id for r in records]
dupes = [id for id, count in Counter(ids).items() if count > 1]
if dupes:
print(f"WARNING: {len(dupes)} duplicate IDs — MAFFT keeps only last: {dupes[:5]}")
# Empty sequences
empty = [r.id for r in records if len(r.seq) == 0]
if empty:
print(f"WARNING: {len(empty)} empty sequences: {empty[:5]}")
# Alphabet check — nucleotide vs protein
all_chars = Counter()
for r in records:
all_chars.update(str(r.seq).upper())
total_chars = sum(all_chars.values())
nuc_chars = sum(all_chars.get(c, 0) for c in "ATGCN")
if total_chars > 0 and nuc_chars / total_chars > 0.9:
print(f"WARNING: {nuc_chars/total_chars:.0%} nucleotide chars — may not be protein")
# Non-standard characters
nonstandard = {c: n for c, n in all_chars.items()
if c not in VALID_AA and c not in EXTENDED_AA and c != "-"}
if nonstandard:
print(f"WARNING: Non-standard characters: {nonstandard}")
# Stop codons mid-sequence
for r in records:
seq = str(r.seq)
if "*" in seq[:-1]: # ignore terminal stop
print(f"WARNING: Internal stop codon (*) in {r.id}")
# Short sequence filter
short_cutoff = median_len * MIN_LENGTH_FRACTION
short = [r.id for r in records if len(r.seq) < short_cutoff]
if short:
print(f"\n{len(short)} sequences < {short_cutoff:.0f} aa "
f"({MIN_LENGTH_FRACTION:.0%} of median)")
records = [r for r in records if len(r.seq) >= short_cutoff]
print(f"Retained {len(records)} sequences after length filter")
filtered_fasta = out_dir / "input_filtered.fasta"
SeqIO.write(records, filtered_fasta, "fasta")
else:
print(f"\nAll sequences >= {short_cutoff:.0f} aa — no filtering needed")
filtered_fasta = fasta_path
Caching pattern — skip expensive steps if output exists:
aligned_fasta = out_dir / "aligned.fasta"
if not aligned_fasta.exists():
print("Running MAFFT alignment...")
# ... subprocess call ...
print(f" Alignment: {n_seqs} sequences x {n_cols} columns")
else:
print(f"Alignment cached at {aligned_fasta.name}, skipping MAFFT")
Subprocess pattern — consistent error handling:
result = subprocess.run(
cmd, capture_output=True, text=True
)
if result.returncode != 0:
print(f"STDERR: {result.stderr[:500]}")
raise RuntimeError("MAFFT failed")
MAFFT_FLAGS = {
"linsi": ["--localpair", "--maxiterate", "1000", "--reorder", "--thread", str(MAFFT_THREADS)],
"einsi": ["--genafpair", "--maxiterate", "1000", "--reorder", "--thread", str(MAFFT_THREADS)],
"auto": ["--auto", "--reorder", "--thread", str(MAFFT_THREADS)],
}
Tier 1 — Quick baseline:
cmd = [
"iqtree3", "-s", str(input_aln),
"-m", "MFP",
"-B", str(BOOTSTRAP_REPS), "-alrt", str(BOOTSTRAP_REPS),
"-nstop", "50", "-T", "AUTO",
"-pre", str(out_dir / "tree")
]
Tier 1.5 — Fast fixed model (batch screening, no model search):
cmd = [
"iqtree3", "-s", str(input_aln),
"-m", "Q.pfam+F+R6",
"-B", str(BOOTSTRAP_REPS), "-alrt", str(BOOTSTRAP_REPS),
"-nstop", "50", "-T", "AUTO",
"-pre", str(out_dir / "tree")
]
Tier 2 — Two-pass PMSF (default for deep eukaryotic phylogenies):
model = f"{IQTREE_MATRIX}+C60+G" if IQTREE_MATRIX == "ELM" else f"{IQTREE_MATRIX}+C60+F+R"
# Pass 1: guide tree
guide_tree = out_dir / "guide.treefile"
if not guide_tree.exists():
cmd_pass1 = [
"iqtree3", "-s", str(input_aln),
"-m", model,
"-nstop", "50", "-T", str(IQTREE_THREADS),
"-pre", str(out_dir / "guide")
]
subprocess.run(cmd_pass1, check=True)
# Pass 2: PMSF tree with bootstrap
final_tree = out_dir / "pmsf.treefile"
if not final_tree.exists():
cmd_pass2 = [
"iqtree3", "-s", str(input_aln),
"-m", model,
"-ft", str(guide_tree),
"-B", str(BOOTSTRAP_REPS), "-alrt", str(BOOTSTRAP_REPS),
"-nstop", "50", "-T", str(IQTREE_THREADS),
"-pre", str(out_dir / "pmsf")
]
subprocess.run(cmd_pass2, check=True)
Bug workaround (IQ-TREE 3.0.1):
-T AUTOcrashes with an assertion error during PMSF site frequency computation (computePartialParsimonyFast). Use a fixed thread count (e.g.,-T 8) instead. Re-test with future IQ-TREE releases.
Tier 3 — PhyloBayes (special case):
PhyloBayes runs for hours to days and does not fit the "render once" model. The .qmd
should launch chains and note that convergence must be checked separately:
print("PhyloBayes requires long-running chains.")
print("Launch manually:")
print(f" pb_mpi -d {input_phy} -cat -gtr -x 1 10000 chain1")
print(f" pb_mpi -d {input_phy} -cat -gtr -x 1 10000 chain2")
print("Check convergence with bpcomp -x 2500 chain1 chain2")
Parse IQ-TREE output and report:
# Read support values from treefile
import re
tree_text = (out_dir / "pmsf.treefile").read_text()
supports = re.findall(r'\)([\d.]+)/([\d.]+):', tree_text)
ufboot = [float(s[1]) for s in supports]
shalrt = [float(s[0]) for s in supports]
total = len(ufboot)
print(f"=== Tree Inference Summary ===")
print(f"Sequences: {n_seqs}")
print(f"Alignment: {n_cols} columns")
print(f"Model: {model} (Tier {IQTREE_TIER})")
print(f"UFBoot >= 95: {sum(1 for v in ufboot if v >= 95)}/{total}")
print(f"UFBoot >= 70: {sum(1 for v in ufboot if v >= 70)}/{total}")
print(f"SH-aLRT >= 80: {sum(1 for v in shalrt if v >= 80)}/{total}")
print(f"\nTree file: {final_tree}")
print(f"\nNext: use tree-formatting skill for visualization")
The rendered script produces these files in outs/<subdirectory>/XX_name/:
| File | Description |
|------|-------------|
| pmsf.treefile | ML tree with UFBoot/SH-aLRT support (Tier 2) |
| pmsf.contree | Bootstrap consensus tree |
| pmsf.iqtree | Full IQ-TREE report |
| pmsf.sitefreq | PMSF site frequency profiles |
| guide.treefile | Pass 1 guide tree (Tier 2) |
| aligned.fasta | MAFFT alignment |
| BUILD_INFO.txt | Script provenance |
source ~/miniconda3/etc/profile.d/conda.sh && conda activate <env>
quarto render scripts/<subdirectory>/XX_name.qmd --output-dir outs/<subdirectory>/XX_name/
For long-running Tier 2 analyses, consider rendering in a screen/tmux session.
These are not part of the standard pipeline but can be added as additional chunks:
-m GTR+R
on the recoded alignment.-zb 10000 -au.| Tool | Purpose | Install |
|------|---------|---------|
| MAFFT | Alignment | conda install -c bioconda mafft |
| IQ-TREE 3 | Tree inference | conda install -c bioconda iqtree (ensure v3+) |
| BioPython | FASTA I/O, sequence handling | pip install biopython |
| CD-HIT | Redundancy reduction (optional) | conda install -c bioconda cd-hit |
| ClipKIT | Trimming (optional) | pip install clipkit |
| BMGE | Trimming (optional) | conda install -c bioconda bmge |
| PhyloBayes-MPI | Bayesian inference (optional) | conda install -c bioconda phylobayes-mpi |
development
Phylogenetic tree visualization and formatting with ggtree (R) or iTOL (web). Use when rendering a phylogenetic tree as a figure, choosing tree layout, coloring branches or labels by taxonomy, collapsing clades, displaying support values, or adding overlays to a tree. Do NOT load for tree inference (use protein-phylogeny skill) or domain annotation (future separate skill).
development
Configure and manage Claude Code security protections for sensitive files, credentials, and data. Use when the user invokes /security-setup to set up or modify protections against unauthorized file access, credential exposure, or sensitive data leaks.
development
Script organization for data science analysis projects with numbered scripts, data/outs/ directories, and reproducibility conventions. Use when creating new analysis scripts in projects that follow data science conventions (numbered XX_ prefix scripts, outs/ directories, BUILD_INFO.txt). Do NOT load for documentation projects (Quarto books), infrastructure repos, or projects without data/outs/ directory structure.
testing
R renv package management for data science projects. Use when working with renv (renv.lock, renv::restore, renv::snapshot) in R analysis projects. Do NOT load for projects that do not use R or renv.