alignment/pairwise-alignment/SKILL.md
Perform pairwise sequence alignment using Biopython Bio.Align.PairwiseAligner. Use when comparing two sequences, finding optimal alignments, scoring similarity, and identifying local or global matches between DNA, RNA, or protein sequences.
npx skillsauth add GPTomics/bioSkills bio-alignment-pairwiseInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: BioPython 1.83+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Align two sequences" -> Compute an optimal alignment between a pair of sequences using dynamic programming.
PairwiseAligner() (BioPython Bio.Align)needle (global) or water (local) from EMBOSSpairwiseAlignment() (Biostrings)Align two sequences using dynamic programming algorithms (Needleman-Wunsch for global, Smith-Waterman for local).
Goal: Load modules needed for pairwise alignment operations.
Approach: Import the PairwiseAligner class along with sequence and I/O utilities from Biopython.
from Bio.Align import PairwiseAligner
from Bio.Seq import Seq
from Bio import SeqIO
Bio.Align.PairwiseAligner is the right default for interactive use, scripting, and pair sizes up to a few thousand residues, but it is not the fastest or most scalable option. For high-throughput screens, very long sequences, or production pipelines, switch to a SIMD-accelerated or specialised library.
| Library | Speed vs Bio.Align | Alphabet | Scoring | Vectorization | When to use |
|---------|-------------------|----------|---------|---------------|-------------|
| Bio.Align.PairwiseAligner (BioPython) | 1x baseline | DNA / RNA / protein | Matrix + affine | C-backed Gotoh | Default, <10 kb pairs, interactive use |
| parasail (Daily 2016 BMC Bioinf) | 10-100x | DNA / protein | Matrix + affine | SSE / AVX SIMD | High-throughput SW or NW; benchmark loops |
| edlib (Sosic & Sikic 2017 Bioinf) | 100-1000x | DNA only | Edit distance only | Bit-parallel Myers | Read mapping, k-mer search, primer placement |
| pywfa / WFA2 (Marco-Sola 2021 Bioinformatics 37:456; BiWFA: Marco-Sola 2023 Bioinformatics 39:btad074) | Best for low-divergence | DNA | Matrix + affine | Wavefront, O(s) memory | Long, near-identical sequences (>10 kb, <5% diverged) |
| mappy / minimap2 (Li 2018 Bioinf) | Production reads-to-genome | DNA | Chain + base-level | k-mer chain | Long-read mapping, splice-aware DNA |
| Bio.pairwise2 | DEPRECATED | -- | -- | -- | Migrate to PairwiseAligner (deprecated in BioPython 1.80; not yet removed; migrate proactively) |
| EMBOSS needle / water | ~Bio.Align | DNA / protein | Matrix + affine | None | Reproducibility, audit trails (1996 published defaults) |
Speed numbers in the table are rough; benchmark on representative inputs before committing. Critical caveats:
When uncertain which algorithm Biopython's aligner selected internally, inspect aligner.algorithm after configuration -- it returns the resolved variant ("Needleman-Wunsch", "Smith-Waterman", "Gotoh global alignment algorithm", "Gotoh local alignment algorithm", "Waterman-Smith-Beyer global alignment algorithm", "Waterman-Smith-Beyer local alignment algorithm") for deterministic auditing.
| Mode | Algorithm | Use Case |
|------|-----------|----------|
| global | Needleman-Wunsch | Full-length alignment, similar-length sequences |
| local | Smith-Waterman | Find best matching regions, different-length sequences |
| global + free end gaps | Semi-global | Overlap detection, fragment-to-reference alignment |
Common mistake: Global alignment of sequences with very different lengths forces biologically meaningless terminal gaps. If sequences differ substantially in length, use local or semi-global instead.
| Scenario | Align As | Rationale | |----------|----------|-----------| | Nucleotide identity >70% | DNA | Sufficient signal at nucleotide level | | Nucleotide identity <70% | Protein | Codon degeneracy masks signal at DNA level; protein alignment is ~3x more sensitive | | Noncoding sequences (UTRs, intergenic) | DNA | No protein translation possible | | Coding sequences for dN/dS analysis | Protein first, then back-translate codons (PAL2NAL) | Preserves reading frame for selection analysis |
When in doubt, align at the protein level. It captures functional constraint better because 20 amino acids provide richer signal than 4 nucleotides.
Goal: Configure a PairwiseAligner with appropriate scoring for the sequence type.
Approach: Instantiate PairwiseAligner with mode, scoring parameters, or a substitution matrix depending on DNA vs protein input.
# Basic aligner with defaults
aligner = PairwiseAligner()
# Configure mode and scoring
aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)
# For protein alignment with substitution matrix
from Bio.Align import substitution_matrices
aligner = PairwiseAligner(mode='global', substitution_matrix=substitution_matrices.load('BLOSUM62'))
"Align two sequences" -> Compute optimal alignment(s) between a pair of sequences, returning alignment objects or a score.
Goal: Align two sequences and retrieve the optimal alignment(s) or score.
Approach: Call aligner.align() for full alignment objects or aligner.score() for score-only (faster for large sequences).
seq1 = Seq('ACCGGTAACGTAG')
seq2 = Seq('ACCGTTAACGAAG')
# Get all optimal alignments
alignments = aligner.align(seq1, seq2)
print(f'Found {len(alignments)} optimal alignments')
print(alignments[0]) # Print first alignment
# Get score only (faster for large sequences)
score = aligner.score(seq1, seq2)
target 0 ACCGGTAACGTAG 13
0 |||||.||||.|| 13
query 0 ACCGTTAACGAAG 13
Goal: Extract alignment properties including score, shape, aligned sequences, and coordinate mappings.
Approach: Access alignment object attributes and indexing to retrieve per-sequence aligned strings and coordinate arrays.
alignment = alignments[0]
# Basic properties
print(alignment.score) # Alignment score
print(alignment.shape) # (num_seqs, alignment_length)
print(len(alignment)) # Alignment length
# Get aligned sequences with gaps
target_aligned = alignment[0, :] # First sequence (target) with gaps
query_aligned = alignment[1, :] # Second sequence (query) with gaps
# Get coordinate mapping
print(alignment.aligned) # Array of aligned segment coordinates
print(alignment.coordinates) # Full coordinate array
Goal: Quantify identities, mismatches, and gaps in an alignment to calculate percent identity.
Approach: Use the .counts() method on the alignment object and derive percent identity from identity and mismatch totals.
alignment = alignments[0]
counts = alignment.counts()
print(f'Identities: {counts.identities}')
print(f'Mismatches: {counts.mismatches}')
print(f'Gaps: {counts.gaps}')
# Calculate percent identity
total_aligned = counts.identities + counts.mismatches
percent_identity = counts.identities / total_aligned * 100
print(f'Percent identity: {percent_identity:.1f}%')
PairwiseAligner() with no arguments uses match_score=1, mismatch_score=0, open_gap_score=0, extend_gap_score=0. Combined with a positive-scoring substitution matrix (BLOSUM62), this produces alignments with arbitrarily many short gaps -- gaps cost nothing while matches pay positive score. Always specify gap penalties explicitly when using a substitution matrix. BLASTP defaults: open=-11, extend=-1 with BLOSUM62; Smith-Waterman EMBOSS water defaults: open=-10, extend=-0.5. Inspect with print(aligner) after configuration to verify the resolved parameter set.
aligner = PairwiseAligner(mode='global', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)
from Bio.Align import substitution_matrices
blosum62 = substitution_matrices.load('BLOSUM62')
aligner = PairwiseAligner(mode='global', substitution_matrix=blosum62, open_gap_score=-11, extend_gap_score=-1)
aligner = PairwiseAligner(mode='local', match_score=2, mismatch_score=-1, open_gap_score=-10, extend_gap_score=-0.5)
# Free end gaps on query -- for aligning a fragment against a full-length reference
# or detecting overlap between reads
aligner = PairwiseAligner(mode='global')
aligner.query_left_open_gap_score = 0
aligner.query_left_extend_gap_score = 0
aligner.query_right_open_gap_score = 0
aligner.query_right_extend_gap_score = 0
# Free end gaps on BOTH sequences -- for overlap detection between two reads
aligner = PairwiseAligner(mode='global')
aligner.end_gap_score = 0.0
Goal: Select the appropriate substitution matrix based on expected sequence divergence.
Approach: Match matrix to divergence level. BLOSUM and PAM number in opposite directions: higher BLOSUM = closer sequences; higher PAM = more distant sequences.
| Divergence Level | BLOSUM | PAM | When To Use | |-----------------|--------|-----|-------------| | Very close (<20% divergence) | BLOSUM80, BLOSUM90 | PAM30 | Recently duplicated genes, strain comparison | | Moderate | BLOSUM62 (default) | PAM120 | General-purpose, most analyses | | Distant (>50% divergence) | BLOSUM45, BLOSUM50 | PAM250 | Remote homology detection |
BLOSUM62 is the universal default (used by BLAST, most alignment tools). When in doubt, use BLOSUM62. Switch to BLOSUM80 for very similar proteins or BLOSUM45 for distant homologs.
DNA matrices: NUC.4.4 (match=+5, mismatch=-4) handles IUPAC ambiguity codes. HOXD70 is tuned for human-mouse whole-genome alignment from noncoding regions.
from Bio.Align import substitution_matrices
print(substitution_matrices.load()) # List all 30 available matrices
blosum62 = substitution_matrices.load('BLOSUM62') # General protein (default)
blosum80 = substitution_matrices.load('BLOSUM80') # Close homologs
blosum45 = substitution_matrices.load('BLOSUM45') # Distant homologs
nuc44 = substitution_matrices.load('NUC.4.4') # DNA with IUPAC support
Gap penalties control how gaps (insertions/deletions) are scored. The affine model (penalty = open + extend * (L-1)) is almost always preferred over linear because it reflects indel biology: a DNA break introduces the first gap (costly), but extending an existing gap is mechanistically easier (less costly). This models the observation that indels in real sequences tend to occur as single contiguous events.
Typical values with BLOSUM62: gap open = -11, gap extend = -1 (BLASTP defaults). Setting gap open equal to gap extend (linear model) over-penalizes long indels and under-penalizes scattered single-residue gaps, producing biologically unrealistic alignments.
Goal: Align sequences loaded from FASTA files rather than hardcoded strings.
Approach: Parse SeqRecord objects from a FASTA file and pass their .seq attributes to the aligner.
from Bio import SeqIO
records = list(SeqIO.parse('sequences.fasta', 'fasta'))
seq1, seq2 = records[0].seq, records[1].seq
aligner = PairwiseAligner(mode='global', match_score=1, mismatch_score=-1)
alignments = aligner.align(seq1, seq2)
# Limit number of alignments returned (memory efficient)
aligner.max_alignments = 100
for i, alignment in enumerate(alignments):
print(f'Alignment {i+1}: score={alignment.score}')
if i >= 4:
break
Goal: Extract observed substitution frequencies from a completed alignment.
Approach: Access the .substitutions property to get a matrix of observed base/residue substitution counts.
alignment = alignments[0]
substitutions = alignment.substitutions
# View as array (rows=target, cols=query)
print(substitutions)
# Access specific substitution counts
# substitutions['A', 'T'] gives count of A aligned to T
Goal: Convert an alignment to standard bioinformatics file formats for downstream tools.
Approach: Use Python's format() function with format specifiers (fasta, clustal, psl, sam) on the alignment object.
alignment = alignments[0]
# Various output formats
print(format(alignment, 'fasta')) # FASTA format
print(format(alignment, 'clustal')) # Clustal format
print(format(alignment, 'psl')) # PSL format (BLAT)
print(format(alignment, 'sam')) # SAM format
| Parameter | Description | Typical DNA | Typical Protein |
|-----------|-------------|-------------|-----------------|
| match_score | Score for identical bases | 1-2 | Use matrix |
| mismatch_score | Penalty for mismatches | -1 to -3 | Use matrix |
| open_gap_score | Cost to start a gap | -5 to -15 | -10 to -12 |
| extend_gap_score | Cost per gap extension | -0.5 to -2 | -0.5 to -1 |
| substitution_matrix | Scoring matrix | N/A | BLOSUM62 |
| Error | Cause | Solution |
|-------|-------|----------|
| OverflowError | Too many optimal alignments | Set aligner.max_alignments |
| Low scores | Wrong scoring scheme | Use substitution matrix for proteins |
| No alignments in local mode | Scores all negative | Ensure match_score > 0 |
There are four common ways to calculate percent identity from the same alignment, producing different values:
| Method | Denominator | Best For | |--------|-------------|----------| | PID1 | Aligned positions + internal gaps | Gap-aware, conservative | | PID2 | Aligned residue pairs (excluding gaps) | Always highest value | | PID3 | Shorter sequence length | Length-normalized | | PID4 | Mean sequence length | Best correlation with structural similarity |
Practical impact: Up to 11.5% difference between methods on a single alignment. Combined with different alignment algorithms, variation reaches 22%. Always report which method was used. The counts() method above uses aligned non-gap positions (similar to PID2).
Use bit score (database-size-independent) and E-value (expected chance hits) to interpret raw alignment scores; never compare raw scores across scoring schemes. For non-default gap penalties or non-protein/DNA alphabets, generate empirical p-values via sequence shuffling instead of trusting the formula (examples/empirical_pvalue.py; for DNA, use the dinucleotide shuffle of Altschul & Erickson 1985 via ushuffle).
| Bit score | E-value (typical 1e6 db) | Interpretation | |-----------|--------------------------|----------------| | > 200 | < 1e-50 | Essentially certain homology | | 50-200 | 1e-5 to 1e-50 | Likely homology | | 30-50 | 0.01 to 1e-5 | Possible homology; verify with profile methods | | < 30 | > 0.01 | Suspect; not significant |
Karlin-Altschul lambda/K are calibrated empirically per (matrix, gap-open, gap-extend) tuple; switching gap penalties without recalibrating produces wrong E-values.
Mask compositional bias before scoring. Low-complexity regions (poly-Q, proline-rich, leucine-rich TM helices) inflate raw scores. Pre-filter with SEG for protein or DUST for DNA, then run BLAST with -comp_based_stats 2 (modern default; Yu & Altschul 2005 Bioinf, conditional) or -comp_based_stats 3 for repeat-rich queries (Yu & Altschul 2005 unconditional; Schaffer et al 2001 NAR introduced the original level-1 statistics). SEG pre-filtering is complementary to -comp_based_stats, not redundant.
Pick the alignment method by protein identity; below 15% identity DP alignments are statistically indistinguishable from random pairings, so escape to structure or pLM tools.
| Identity (protein) | Recommended approach |
|-------------------|---------------------|
| >= 40% | Any DP aligner; Bio.Align or BLAST is sufficient |
| 25-40% | Use sensitive iterative methods (MMseqs2 iterative, BLASTP with composition-based statistics) |
| 15-25% | Profile-profile (HHsearch, HMMER phmmer/jackhmmer) |
| < 15% | Structural alignment (Foldseek, TM-align, US-align) or pLM embeddings (TM-Vec, ESM-2 + cosine) -- see structural-alignment |
Length amplifies the signal: 30% identity over 200 residues is far more reliable than 30% over 50.
When pairwise becomes the wrong tool. A single DP pairwise alignment is correct for two sequences. For one query against thousands to millions of targets (genome-scale homology search) or for many-vs-many all-by-all (clustering, ortholog detection), the right tool is profile- or k-mer-indexed search, not iterated DP:
--num-iterations 3 matches PSI-BLAST and approaches jackhmmerjackhmmer for single queries on one NVIDIA L40S; use when GPU is available and the dataset is sensitivity-boundalignment/structural-alignment)Other failure modes:
R vs A) as +1 rather than +5; verify behaviour with print(substitution_matrices.load('NUC.4.4')) before relying on ambiguity-aware scoring.tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.