alignment/structural-alignment/SKILL.md
Align protein structures using Foldseek 3Di, TM-align, US-align, DALI, or Foldmason for structural MSA. Predict, score, and superpose backbone coordinates when sequence identity is below the twilight zone or remote-homology detection is required. Use when sequence MSA fails (<25% identity), when the dark proteome is the target, when AlphaFoldDB / ESM Atlas search is needed, or when structural superposition is the goal.
npx skillsauth add GPTomics/bioSkills bio-alignment-structuralInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: Foldseek 8+, TM-align 20220412+, US-align 20231222+, Foldmason 1+, BioPython 1.83+, pymol-open-source 3.0+
Before using code patterns, verify installed versions match. If versions differ:
foldseek --version, TMalign, USalign, foldmason --versionpip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Align two protein structures" -> Compute backbone-aware superposition and a fold-similarity score (TM-score, RMSD, or LDDT).
TMalign A.pdb B.pdb, USalign A.pdb B.pdbfoldseek easy-search query/ AFDB result.m8 tmp/foldmason easy-msa structures/*.pdb out tmp/Bio.PDB.Superimposer, or subprocess wrapping TMalign / USalign (see examples/tm_align_pairwise.py)matchmaker, PyMOL super/cealign"Find structural homologs of an AlphaFold model" -> Search a structure database by 3Di-encoded structural alphabet (Foldseek) or by full TM-align rotation (DALI, US-align).
| Sequence identity | Recommended approach | |-------------------|---------------------| | >= 40% | Sequence DP (Bio.Align, BLASTP) is sufficient | | 25-40% | Sensitive sequence (MMseqs2, jackhmmer, profile-profile HHsearch) | | 15-25% | Profile-profile (HHsearch) OR Foldseek if structures available | | < 15% (dark proteome / twilight zone) | Foldseek (3Di), TM-align, US-align, pLM aligners |
Sequence alignment below 15% identity is statistically indistinguishable from random pairings. The exact twilight-zone cutoff is length-dependent: Rost 1999 (Prot Eng) showed the curve drops to 25% at length 80, 20% at length 250 -- short alignments need higher identity for the same statistical signal, so a 15-25% rule of thumb is shorthand for "twilight zone for proteins of typical domain size (~150-300 residues)". If reasonable structural models exist (PDB, AlphaFoldDB, ESMFold), structural alignment is far more reliable in this regime.
For families with strong functional or structural constraint (ribosomal proteins, class I aminoacyl-tRNA synthetases with HIGH/KMSKS motifs, histone fold, cytochrome c with CXXCH, or any long alignment with well-distributed conservation), sequence MSA may remain reliable down to ~12-15% identity. Verify with a profile-profile method (HHsearch) before committing to structural alignment.
| Tool | Reference | Score | Best for | |------|-----------|-------|----------| | TM-align | Zhang & Skolnick 2005 NAR | TM-score | Single-chain pairwise; standard fold-similarity benchmark | | US-align | Zhang et al 2022 Nat Methods | TM-score | Multi-chain protein, RNA, DNA, complexes; original successor to TM-align | | Foldseek-Multimer | Kim et al 2025 Nat Methods | TM-score | Database-scale multi-chain complex search; 10-100x faster than US-align in pairwise mode, 1000-10000x in DB search; preferred at AFDB-multimer scale | | FATCAT | Ye & Godzik 2003 Bioinf | FATCAT score | Flexible alignment with allowed twists/breaks | | CE | Shindyalov & Bourne 1998 Prot Eng | CE score | Combinatorial extension of fragments; PDB legacy | | DALI | Holm 2022 NAR | Z-score | Distance-matrix alignment; superior at TM 0.3-0.5 (twilight fold); operates on AFDB at scale | | Bio.PDB.Superimposer | Cock et al 2009 Bioinf | RMSD | Pure-Python superposition with known atom correspondence |
TM-align and US-align report two TM-scores by default: one normalised by chain 1 length and one by chain 2 length. The standard fold-similarity metric for asymmetric reporting is the larger of the two (effectively normalising by the shorter chain), since TM-score is length-asymmetric. The -a T flag adds a third score normalised by the average of the two chain lengths (useful for symmetric clustering). TM > 0.5 indicates the same fold; TM > 0.8 indicates equivalent topology / close structural relationship; TM < 0.2 indicates random structural similarity. RMSD alone is misleading: RMSD scales with length, depends on outliers, and an "optimization RMSD" (used to fit) is not the same as an "evaluation RMSD" (used to compare). Always report TM-score alongside RMSD and the alignment length.
| Metric | Threshold | Source / Interpretation | |--------|-----------|-------------------------| | TM-score | > 0.5 | Same fold (Zhang & Skolnick 2004 Proteins; reported by TM-align, US-align, Foldseek, DALI) | | TM-score | > 0.8 | Equivalent topology (homologous) | | TM-score | < 0.2 | Random structural similarity | | DALI Z-score | > 20 | Definitely homologous (Holm 2020 Prot Sci) | | DALI Z-score | 8 - 19 | Probable homology | | DALI Z-score | 2 - 8 | Candidate; verify with TM-score or biology | | DALI Z-score | < 2 | Not significant (random) | | GDT-TS | > 50 | Correct fold (CASP/LGA scoring; reported by LGA, MaxCluster, OpenStructure -- not by TM-align/Foldseek) | | LDDT | > 0.6 | Conventional cutoff for "correctly modelled" residue (Mariani et al 2013 Bioinf shows the score is fold-architecture-dependent and does not define a universal hard threshold; >= 0.6 is widely used in CAMEO and Foldseek output as a working cutoff) | | RMSD | < 2 A over >100 residues | Strong superposition; below 1.5 A is excellent |
Both target structural homolog search, but with different strengths:
| Question | Foldseek | DALI | |----------|----------|------| | Find any same-fold homolog quickly | YES (3Di indexed; 1000-1M seq/s) | Slow (full distance matrix) | | Sensitivity at TM > 0.5 (same-fold) | Comparable to TM-align | Slightly higher recall | | Sensitivity at TM 0.3-0.5 (twilight fold) | Recall drops sharply | DALI Z-score retains signal (Holm 2022) | | Alignment quality for indel-rich pairs | Local 3Di+AA; can miss large indels | Distance-matrix optimal handles indels well | | AFDB-scale all-vs-all | Tractable | Now tractable post-2022 (Holm DALI server update) |
Practical workflow for remote homology: (1) Foldseek easy-search to retrieve same-fold candidates fast, (2) for hits with TM < 0.5 or with structural-indel rich alignments, re-align with DALI for higher-quality residue equivalences. The two tools are complementary; Foldseek-only misses some twilight-fold relationships, DALI-only misses the AFDB-scale opportunity.
Apply the 0.5 same-fold rule only to globular domains of ~100-300 residues; for chains <60 residues TM > 0.5 occurs by chance ~3-5%, so use Xu & Zhang 2010's length-aware Gumbel p-value instead. Always report both length-normalised TM-scores (or use -a T for the symmetrised average) and pair RMSD with the atom count ((RMSD, n_atoms_used)); raw RMSD without length normalisation is misleading.
Goal: Compute TM-score and RMSD between two structures and write a superposed PDB.
Approach: Invoke TMalign or USalign with output flags; parse the structured output for downstream filtering.
TMalign chainA.pdb chainB.pdb -o superposed.sup
TMalign chainA.pdb chainB.pdb -outfmt 2
USalign chainA.pdb chainB.pdb -mol prot -outfmt 2
USalign complex_A.pdb complex_B.pdb -mm 1 -ter 0
-outfmt 2 returns tabular output (one line per pair). For multi-chain complexes, US-align with -mm 1 -ter 0 aligns full assemblies; TM-align is single-chain only.
US-align multi-chain score interpretation. US-align -mm 1 -ter 0 produces multiple TM-scores per complex: normalised by query length, normalised by template length, and (for symmetric homo-oligomers) the best-matching chain permutation. For different stoichiometries (query hetero-A2B2 vs template hetero-A4B4), use -mm 4 and -byresi 0. The "best-of" TM-score is the appropriate cross-complex homology metric; per-chain TM-scores serve as a sanity check (low per-chain + high complex TM = topology match without local fold conservation, suggests promiscuous interaction not deep homology). See Zhang et al 2022 Nat Methods for the full mode taxonomy.
For multi-chain complex search at AFDB-multimer scale (~214k entries) or against PDB complexes, US-align is too slow; Foldseek-Multimer (Kim et al 2025 Nat Methods) is the modern default. Two modes share the chain-pairing prefilter:
| Mode | Algorithm | Use when |
|------|-----------|----------|
| Foldseek-MM (default) | 3Di+AA Gotoh per chain pair | Fast database search; default speed-sensitivity tradeoff |
| Foldseek-MM-TM | TM-align per chain pair after prefilter | Top-hit refinement with full TM-score |
foldseek easy-multimersearch query_complex.pdb afdb_multimer result tmp/
foldseek easy-multimersearch query_complex.pdb afdb_multimer result tmp/ --multimer-tm-threshold 0.5
foldseek easy-multimercluster *.pdb cluster_result tmp/ --multimer-tm-threshold 0.65
Decision guide: pairwise complex pair on a few hundred targets -> US-align with -mm 1 -ter 0. Database search across thousands to millions of complex entries -> Foldseek-Multimer. For AFDB-Multimer or PDB100-Multimer scale, US-align is computationally infeasible; Foldseek-Multimer matches US-align chain-pairing in >99% of cases at 10-100x speedup pairwise and 10^3-10^4x in database mode (Kim et al 2025 benchmark).
Reporting convention: Always quote BOTH the multimer TM-score (whole-complex) AND the worst per-chain TM-score; high multimer TM with one low per-chain score signals topology-matched but locally divergent chains (often promiscuous binders or paralog swaps), which is biologically distinct from a uniformly-conserved complex.
Goal: Superpose a known atom correspondence and compute RMSD without running an external aligner.
Approach: Use when residue equivalence is already established (e.g. same sequence, different conformations). For unknown correspondence, prefer TM-align / US-align.
from Bio.PDB import PDBParser, Superimposer
parser = PDBParser(QUIET=True)
mobile_atoms = list(parser.get_structure('mobile', 'mobile.pdb').get_atoms())
reference_atoms = list(parser.get_structure('ref', 'reference.pdb').get_atoms())
ca_mobile = [a for a in mobile_atoms if a.get_id() == 'CA']
ca_reference = [a for a in reference_atoms if a.get_id() == 'CA']
n = min(len(ca_mobile), len(ca_reference))
sup = Superimposer()
sup.set_atoms(ca_reference[:n], ca_mobile[:n])
sup.apply(mobile_atoms)
print(f'RMSD: {sup.rms:.3f} A over {n} CA atoms')
Run Foldseek for AlphaFoldDB-scale structural search; it indexes a 20-letter 3Di alphabet for thousand- to million-fold speedup over TM-align at comparable same-fold sensitivity.
# Search query structures against AFDB (default: --alignment-type 2)
foldseek easy-search query.pdb afdb_database result.m8 tmp/
# Refine top hits with full TM-align rotation (slower but global TM-score)
foldseek easy-search query.pdb afdb_database result.m8 tmp/ --alignment-type 1
# All-versus-all clustering at TM > 0.5
foldseek easy-cluster structures/*.pdb cluster_result tmp/ --tmscore-threshold 0.5
# Custom output columns
foldseek easy-search query.pdb afdb_database result.m8 tmp/ \
--format-output query,target,evalue,alntmscore,qtmscore,ttmscore,lddt,bits
| Foldseek --alignment-type | Algorithm | Use when |
|-----------------------------|-----------|----------|
| 0 | 3Di Gotoh-Smith-Waterman (local) | Not recommended; 3Di alone, no amino-acid signal |
| 1 | TMalign (global) | Refine top hits with full TM-score; slow |
| 2 | 3Di+AA Gotoh-Smith-Waterman (local) | Default; best speed-sensitivity tradeoff |
Mask residues with pLDDT < 70 before Foldseek indexing or search; their backbone coordinates encode as random 3Di letters and contaminate hits below TM ~ 0.4. AFDB clusters from Barrio-Hernandez et al 2023 are pre-filtered; ESMFold predictions need the same step.
# Mask low-pLDDT residues during database creation (B-factor column stores pLDDT)
# The *.pdb glob requires shell expansion; for Python subprocess calls use glob.glob()
# to expand the file list before passing as argv.
foldseek createdb --mask-bfactor-threshold 70.0 *.pdb afdb_masked
| Tool | Reference | Best for |
|------|-----------|----------|
| Foldmason | Gilchrist et al 2026 Science 391:485 | Billion-protein-scale structural MSA on AFDB; the structural counterpart to MAFFT |
| 3D-Coffee / Expresso | Notredame group | <100 chains with mixed PDB and sequence input |
| mTM-align | Dong et al 2018 Bioinf | Multiple structure alignment; output suitable for tree building. Server: yanglab.qd.sdu.edu.cn/mTM-align/ (Yang Lab moved from Nankai to Shandong University in 2021; verify URL before use) |
| PROMALS3D | Pei, Kim & Grishin 2008 NAR | Hybrid sequence-structure profile MSA; uses PSI-BLAST + DALI/SAP templates |
| MUSTANG | Konagurthu et al 2006 Proteins | Pure structural MSA via residue-residue equivalences |
foldmason easy-msa structures/*.pdb result tmp/ \
--refine-iters 100 \
--report-mode 1
# Outputs: result_aa.fa (amino-acid MSA), result_3di.fa (3Di MSA), result.nw (guide tree), result.html (LDDT report)
--refine-iters controls iterative MSA refinement; --report-mode 1 produces an HTML report with per-column LDDT confidence. The 3Di MSA can be used directly for evolutionary analyses where structure rather than sequence is the appropriate signal.
Per-column LDDT extraction: Run with --report-mode 2 to produce machine-readable JSON output:
foldmason easy-msa structures/*.pdb result tmp/ --report-mode 2
# Produces result.json with per-column LDDT in machine-readable form
import json
with open('result.json') as f:
report = json.load(f)
lddt_per_column = report.get('per_column_lddt')
Verify the exact JSON schema with foldmason easy-msa --help and inspect a sample result.json; the field name may evolve across releases.
When a small dataset has known structures or templates, hybrid tools dramatically improve MSA accuracy:
-template_file templates.txt; uses SAP or TM-align as the structural method.expresso) is a current alternative for structure-informed sequence MSA.t_coffee input.fasta -mode expresso -output fasta_aln -outfile aligned.fasta
t_coffee input.fasta -template_file templates.txt -mode 3dcoffee
The "predict-then-align" workflow has become standard for remote-homology problems:
structural-biology/modern-structure-prediction for prediction workflows.Search AFDB clusters (~2.3 M Foldseek-derived clusters from Barrio-Hernandez et al 2023) as the canonical remote-homology entry point; supplement with the ESM Atlas (~600 M ESMFold metagenomic structures) when AFDB recall is insufficient. For pLDDT semantics and curated AlphaFoldDB entry handling, see structural-biology/alphafold-predictions.
Run a pLM aligner (TM-Vec, vcMSA, DEDAL, pLM-BLAST) when no structure is available but identity is in the twilight zone (<15-25%). They recover signal that DP aligners miss but do NOT replace TM-align / US-align when structures exist; use them as a complementary first pass.
| Tool | Reference | Embedding source |
|------|-----------|------------------|
| vcMSA | McWhite, Armour-Garb & Singh 2023 Genome Res | ProtT5 vector clustering for MSA (alpha-stage tool per its repo README; last release Oct 2023; test on representative inputs before pipeline use) |
| DEDAL | Llinares-Lopez et al 2023 Bioinf | Differentiable end-to-end alignment with pLM features |
| TM-Vec | Hamamsy et al 2024 Nat Biotech | TM-score prediction from ProtT5 embeddings (active fork: valentynbez/tmvec; original tymor22/tm-vec is in limited maintenance) |
| pLM-BLAST | Kaminski et al 2023 Bioinf | BLAST-style hits via pLM cosine similarity |
These run on raw sequence (no structure required) and recover signal at <15% identity that DP aligners miss entirely. They do NOT replace structural alignment when structures exist; treat them as complementary.
pLM aligner accuracy ceiling. TM-Vec, vcMSA, and DEDAL predict alignment-equivalent quantities (TM-score, alignment columns) from sequence-only embeddings. Hamamsy et al 2024 report TM-Vec TM-score prediction RMSE ~0.07 on SCOP -- sufficient for coarse filtering ("is this hit at all structurally similar?") but NOT for fine-grained ranking (e.g. choosing among hits with TM 0.5-0.7). For final structural-similarity scoring, run TM-align or US-align on predicted structures (ColabFold + USalign) rather than relying on the pLM-predicted score.
Structural alignments are read by viewing the superposed structures, not the sequence text:
# PyMOL headless: -cq runs without GUI/quietly; -d passes commands directly
pymol -cq -d "load reference.pdb; load mobile.pdb; super mobile, reference; ray 800,600; png fig.png"
# ChimeraX (modern, replaces Chimera): MatchMaker uses Needleman-Wunsch + iterative refinement
ChimeraX --cmd "open reference.pdb mobile.pdb; matchmaker #2 to #1; save fig.png"
PyMOL super performs cycle-fitting for distantly related structures (better than align when sequence identity is low); cealign runs CE; tmalign (PyMOL plugin) wraps TM-align. ChimeraX MatchMaker is the modern replacement and uses Needleman-Wunsch + iterative refinement with the option to invoke external tools.
| Goal | First-line tool |
|------|-----------------|
| Two structures, known correspondence | Bio.PDB.Superimposer |
| Two structures, unknown correspondence | TMalign or USalign |
| Multi-chain complex, pairwise | USalign with -mm 1 -ter 0 |
| Multi-chain complex, database search | foldseek easy-multimersearch (Foldseek-Multimer) |
| Twilight-fold homology (TM 0.3-0.5) | DALI (Z-score ranks low-similarity hits better than Foldseek) |
| Structural homolog search at AFDB scale (single chain) | foldseek easy-search |
| All-vs-all clustering of structures | foldseek easy-cluster --tmscore-threshold 0.5 |
| Multiple structure alignment, < 100 chains | T-Coffee Expresso or mTM-align |
| Multiple structure alignment, > 1000 chains | Foldmason easy-msa |
| Hybrid sequence-structure MSA | PROMALS3D or T-Coffee Expresso |
| pLM aligner for dark proteome | TM-Vec, vcMSA, DEDAL |
| Distant homology with no structures | Predict with ColabFold/ESMFold first, then Foldseek |
| Error | Cause | Solution |
|-------|-------|----------|
| TM-align "atoms not enough" | Single-residue chains or ligand-only PDB | Filter to canonical amino acids before running |
| Foldseek "no hits" | Wrong database format | Run foldseek databases to confirm AFDB / PDB100 download |
| Bio.PDB Superimposer "different number of atoms" | Atom selection mismatch | Filter both lists to common atom names (CA only, or N/CA/C/O) |
| TM-score normalised by length 1 | Used -L 1 | Drop -L flag; default normalisation is correct |
| Foldmason output empty | Input structures not in same chain ID | Renumber and rename chains uniformly before running |
testing
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.