alignment/msa-statistics/SKILL.md
Calculate alignment statistics including sequence identity, conservation scores, substitution matrices, and similarity metrics. Use when comparing alignment quality, measuring sequence divergence, and analyzing evolutionary patterns.
npx skillsauth add GPTomics/bioSkills bio-alignment-msa-statisticsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: BioPython 1.83+, numpy 1.26+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Calculate sequence identity, conservation scores, substitution counts, and other alignment metrics.
Goal: Load modules for alignment I/O, substitution scoring, and statistical calculations.
Approach: Import AlignIO for reading alignments, Counter for column analysis, numpy for matrix operations, and math for entropy calculations.
from Bio import AlignIO
from Bio.Align import substitution_matrices
from collections import Counter
import numpy as np
import math
"Calculate percent identity" -> Compute the fraction of identical aligned residues between sequence pairs.
Goal: Measure sequence similarity as percent identity for individual pairs or across all sequences in an alignment.
Approach: Count matching non-gap positions divided by total aligned positions; optionally compute a full N-by-N identity matrix.
There are four common denominators, producing up to 11.5% difference on the same alignment. Combined with different alignment algorithms, variation reaches 22%. Always report which method was used.
| Method | Denominator | Code |
|--------|-------------|------|
| PID1 | Aligned positions including internal gaps | sum(a != '-' or b != '-' for a, b in zip(s1, s2)) |
| PID2 | Aligned residue pairs only (no gaps) | sum(a != '-' and b != '-' for a, b in zip(s1, s2)) |
| PID3 | Shorter sequence length (ungapped) | min(len(s1.replace('-', '')), len(s2.replace('-', ''))) |
| PID4 | Mean sequence length (ungapped) | (len(s1.replace('-', '')) + len(s2.replace('-', ''))) / 2 |
PID2 always gives the highest value; PID4 correlates best with structural similarity (Raghava & Barton 2006 BMC Bioinf, r=0.86 with Q-score) and is recommended for evolutionary analyses.
Length-asymmetry pathology: When sequences differ greatly in length, PID4 and PID2 diverge sharply. Example: 80 matches between a 500-residue protein and a 100-residue domain fragment yields PID2 ~84% (matches over aligned residue pairs) but PID4 ~27% (matches over mean ungapped length). Neither is wrong; they answer different questions:
For ortholog identification at the protein level (full-length, similar size), PID4 is recommended. For domain detection or fragment-vs-genome alignment, PID2 with explicit length annotation is more interpretable. Always report alignment length alongside any percent identity to disambiguate.
def pairwise_identity(seq1, seq2, method='pid1'):
matches = sum(a == b and a != '-' for a, b in zip(seq1, seq2))
if method == 'pid1':
denom = sum(a != '-' or b != '-' for a, b in zip(seq1, seq2))
elif method == 'pid2':
denom = sum(a != '-' and b != '-' for a, b in zip(seq1, seq2))
elif method == 'pid3':
denom = min(len(seq1.replace('-', '')), len(seq2.replace('-', '')))
elif method == 'pid4':
denom = (len(seq1.replace('-', '')) + len(seq2.replace('-', ''))) / 2
return matches / denom if denom > 0 else 0
alignment = AlignIO.read('alignment.fasta', 'fasta')
seq1, seq2 = str(alignment[0].seq), str(alignment[1].seq)
for method in ['pid1', 'pid2', 'pid3', 'pid4']:
print(f'{method}: {pairwise_identity(seq1, seq2, method) * 100:.1f}%')
The double-loop is O(N^2 * L) and fine for hundreds of sequences; for thousands, vectorize via numpy broadcasting:
def identity_matrix_vectorized(alignment):
# Build N x L character array; for each row, broadcast equality and validity masks against all rows
...
Full implementation: examples/identity_matrix.py. For very large alignments (>10k sequences), switch to k-mer-based distance estimation (e.g. mash) -- exact pairwise identity becomes prohibitive.
Pick a conservation score by what the downstream task needs:
| Pick this | When | |-----------|------| | Majority fraction or Shannon entropy | Quick screening; DNA/RNA logos; coarse column QC | | Capra-Singh JSD (modern default) | Catalytic-residue / functional-site prediction on protein MSAs | | ConSurf rate4site | PDB surface mapping when a phylogenetic tree is available |
Goal: Quantify per-column and overall alignment conservation to identify conserved and variable regions.
Approach: Calculate the fraction of the most common residue at each column, optionally ignoring gaps, and smooth with a sliding window.
def column_conservation(alignment, col_idx, ignore_gaps=True):
column = alignment[:, col_idx]
if ignore_gaps:
column = column.replace('-', '')
if not column:
return 0.0
counts = Counter(column)
most_common_count = counts.most_common(1)[0][1]
return most_common_count / len(column)
alignment = AlignIO.read('alignment.fasta', 'fasta')
for i in range(min(20, alignment.get_alignment_length())):
cons = column_conservation(alignment, i)
print(f'Column {i}: {cons*100:.0f}% conserved')
def average_conservation(alignment, ignore_gaps=True):
scores = []
for col_idx in range(alignment.get_alignment_length()):
scores.append(column_conservation(alignment, col_idx, ignore_gaps))
return sum(scores) / len(scores)
avg_cons = average_conservation(alignment)
print(f'Average conservation: {avg_cons*100:.1f}%')
def conservation_profile(alignment, window=10):
profile = []
for i in range(alignment.get_alignment_length()):
start = max(0, i - window // 2)
end = min(alignment.get_alignment_length(), i + window // 2)
scores = [column_conservation(alignment, j) for j in range(start, end)]
profile.append(sum(scores) / len(scores))
return profile
profile = conservation_profile(alignment, window=10)
Goal: Score columns by divergence from a residue-frequency background, with a window-smoothed neighbour penalty for catalytic residue prediction.
Approach: Compute JSD between the column distribution and a residue-frequency background (Capra & Singh 2007 Bioinf used BLOSUM62-derived; the example uses Robinson & Robinson 1991 PNAS, which gives effectively-equivalent column ranking), then mix with the windowed-neighbour mean. Defaults window=3, lambda_window=0.5 track catalytic-residue annotation in the Catalytic Site Atlas. Full implementation: examples/capra_singh_jsd.py.
def capra_singh_score(alignment, background=None, window=3, lambda_window=0.5):
# raw[i] = JSD(column_i, background) * (1 - gap_fraction_i) # gap-penalty per Capra-Singh reference impl
# smoothed[i] = (1 - lambda) * raw[i] + lambda * mean(raw[i-window:i+window+1] excluding i)
...
Threshold for catalytic-residue prediction: Capra & Singh 2007 (Bioinf 23:1875) report AUC ~0.94 and Top-30 score ~0.75 on the Catalytic Site Atlas using JSD with neighbor mixing (window=3, lambda=0.5); the paper does not prescribe a single threshold. Choose by ROC tradeoff for the specific use case. ConSurf-derived rate4site (rate-of-evolution) is competitive but requires a phylogenetic tree; Capra-Singh JSD is the alignment-only equivalent.
Goal: Tabulate observed substitution frequencies from the alignment for evolutionary analysis or custom scoring matrices.
Approach: Enumerate all pairwise non-gap character comparisons at each column and tally substitution pairs.
def substitution_counts(alignment):
# Tally pairwise non-gap residue mismatches across all column-pair comparisons
...
Full implementation: examples/substitution_counts.py. The script also reports Ti/Tv ratio for DNA alignments.
For pairwise alignments created with PairwiseAligner, use the .substitutions property:
from Bio.Align import PairwiseAligner
aligner = PairwiseAligner(mode='global', match_score=1, mismatch_score=-1)
print(aligner.align(seq1, seq2)[0].substitutions)
Different tools calibrate Karlin-Altschul lambda differently (NCBI BLAST tabulates 0.3176; FASTA/SSEARCH recomputes per query; HMMER phmmer derives it via Forward calibration; Bio.Align does not implement Karlin-Altschul). Expect ~2% bit-score variation across tools on the same alignment. Always record tool and version when citing bit scores; for borderline (~30 bit) hits this matters.
Goal: Measure column variability using Shannon entropy and derive information content for identifying functionally important positions.
Approach: Compute Shannon entropy from character frequencies per column; information content is max entropy minus observed entropy.
import math
def shannon_entropy(column, ignore_gaps=True):
if ignore_gaps:
column = column.replace('-', '')
if not column:
return 0.0
counts = Counter(column)
total = len(column)
entropy = 0.0
for count in counts.values():
p = count / total
if p > 0:
entropy -= p * math.log2(p)
return entropy
alignment = AlignIO.read('alignment.fasta', 'fasta')
for i in range(min(20, alignment.get_alignment_length())):
column = alignment[:, i]
ent = shannon_entropy(column)
print(f'Column {i}: entropy = {ent:.2f} bits')
The classic uniform-background formulation IC = log2(alphabet_size) - H (Schneider & Stephens 1990 NAR) is only valid when the genomic background is uniform. This is approximately true for random DNA but emphatically wrong for protein, where amino acid frequencies range from 1.4% (Trp) to 9.4% (Leu). For amino acids, use Kullback-Leibler divergence IC = sum_i p_i * log2(p_i / b_i) against the Robinson & Robinson 1991 PNAS empirical background. Full implementation: examples/entropy_analysis.py.
ROBINSON_BACKGROUND = {
'A': 0.0780, 'R': 0.0512, 'N': 0.0427, 'D': 0.0530, 'C': 0.0193,
'Q': 0.0419, 'E': 0.0629, 'G': 0.0738, 'H': 0.0224, 'I': 0.0526,
'L': 0.0922, 'K': 0.0596, 'M': 0.0224, 'F': 0.0399, 'P': 0.0508,
'S': 0.0712, 'T': 0.0584, 'W': 0.0133, 'Y': 0.0327, 'V': 0.0653,
}
DNA_UNIFORM = {'A': 0.25, 'C': 0.25, 'G': 0.25, 'T': 0.25}
def information_content(column, background, ignore_gaps=True):
if ignore_gaps:
column = column.replace('-', '')
if not column:
return 0.0
counts = Counter(column)
total = len(column)
return sum((c / total) * math.log2((c / total) / background.get(r, 1e-9))
for r, c in counts.items() if c > 0)
For sequence-logo letter heights, use the Schneider-Stephens form (letter_height = p_i * (log2(alphabet) - H_observed), uniform background) when the comparison is "informative vs random"; use the KL form (against an empirical background such as Robinson-Robinson 1991) for protein logos where amino-acid frequencies are non-uniform or when the comparison is "informative vs the proteome". When the background is unknown, default to the empirical alignment composition rather than uniform.
Goal: Summarize gap distribution across the alignment to assess alignment quality and identify problematic regions.
Approach: Calculate gap fractions per column and aggregate statistics including total gaps, gap-free columns, and gappiest sequence/column.
def gap_profile(alignment):
profile = []
for col_idx in range(alignment.get_alignment_length()):
column = alignment[:, col_idx]
gap_fraction = column.count('-') / len(alignment)
profile.append(gap_fraction)
return profile
gaps = gap_profile(alignment)
avg_gaps = sum(gaps) / len(gaps)
print(f'Average gap fraction: {avg_gaps*100:.1f}%')
def gap_statistics(alignment):
# total gaps, gap fraction, gappiest sequence and column indices, gap-free column count
...
Full implementation: examples/gap_statistics.py.
Goal: Score alignment quality using sum-of-pairs or simple match/mismatch/gap scoring across all columns.
Approach: For each column, score all pairwise residue comparisons and sum across the alignment.
def alignment_score(alignment, match=1, mismatch=-1, gap=-2):
total_score = 0
for col_idx in range(alignment.get_alignment_length()):
column = alignment[:, col_idx]
for i, c1 in enumerate(column):
for c2 in column[i+1:]:
if c1 == '-' or c2 == '-':
total_score += gap
elif c1 == c2:
total_score += match
else:
total_score += mismatch
return total_score
score = alignment_score(alignment)
print(f'Alignment score: {score}')
Biopython's substitution_matrices.load('BLOSUM62') returns a Bio.Align.substitution_matrices.Array object (a numpy-backed 2D array indexed by residue characters), NOT a dict. The correct accessor is matrix[c1, c2]; calling .get((c1, c2), 0) on an Array silently always returns 0. Standard BLOSUM62 includes B, Z, X, and *; pairs containing residues outside the matrix alphabet (e.g. U selenocysteine, J Leu/Ile) raise IndexError and should be skipped or scored zero.
def sum_of_pairs(alignment, substitution_matrix=None):
if substitution_matrix is None:
substitution_matrix = substitution_matrices.load('BLOSUM62')
total = 0.0
for col_idx in range(alignment.get_alignment_length()):
column = alignment[:, col_idx]
for i, c1 in enumerate(column):
for c2 in column[i+1:]:
if c1 == '-' or c2 == '-':
continue
try:
total += substitution_matrix[c1, c2]
except (KeyError, IndexError):
continue
return total
SP-score is biased on unbalanced datasets. The above implementation gives equal weight to every sequence pair. On phylogenetically structured datasets (e.g. 95 mammals + 5 outgroups), 99% of pairs are mammal-mammal and the SP score reports only mammal-internal alignment quality. MUSCLE and T-Coffee internally compute weighted SP using sequence weights (Henikoff or position-based) so pair contributions are downweighted by cluster redundancy. For SP-as-quality-score on real data, multiply each pair contribution by weight[i] * weight[j] from the Henikoff weights in alignment/msa-parsing (examples/henikoff_weights.py).
Goal: Build a position-specific scoring matrix from the alignment for motif analysis or sequence scoring.
Approach: Raw counts give frequencies; without pseudocounts, log-odds against background diverge to negative infinity at any column missing a residue. Henikoff JG & Henikoff S 1996 (Bioinf 12:135-143) introduced data-dependent pseudocount weighting; a simple Laplace add-one is the minimal correct approach for production use. Full implementation: examples/pssm.py.
def pssm_with_pseudocounts(alignment, background, pseudocount=1.0):
# log2((counts[r] + pc * background[r]) / (n + pc) / background[r]) per column
...
pseudocount=1.0 is the Laplace prior; HMMER uses Dirichlet mixtures for sophisticated smoothing. For motif scanning, score a candidate site by summing per-position log-odds; sites above a calibrated threshold are predicted hits. Use ROBINSON_BACKGROUND (defined in the IC section above) for protein.
Goal: Estimate non-redundant sequence count for MSA-depth metrics.
Approach: Cluster at an identity threshold (0.62 protein, 0.80 nucleotide) and weight by inverse cluster size; reference implementation lives in msa-parsing (examples/neff.py). Neff/L > 0.5 is the rule-of-thumb for direct-coupling-analysis contact prediction; AlphaFold's MSA-depth scoring uses a closely related metric.
Goal: Detect coevolving column pairs as a coupling/contact signal.
Approach: Pairwise MI minus average-product correction (Dunn, Wahl, Gloor 2008 Bioinf). Reference implementation lives in msa-parsing (examples/mi_apc.py). For production-grade contact prediction beyond a few hundred columns, switch to plmDCA (Ekeberg et al 2013) or EVcouplings (Hopf et al 2017).
For publication-grade pairwise distances, do NOT pick a model by rule of thumb. Run ModelTest-NG to select the best-fit substitution model by AIC/BIC, then apply that correction via IQ-TREE2 (.mldist output) or EMBOSS distmat. Hand-coded JC69 / K80 / blosum62 corrections via Bio.Phylo.TreeConstruction.DistanceCalculator are appropriate only for exploratory work.
modeltest-ng -i alignment.fasta -d nt -t ml
modeltest-ng -i alignment.fasta -d aa -t ml -p 4
from Bio.Phylo.TreeConstruction import DistanceCalculator
calculator = DistanceCalculator('blosum62')
distance_matrix = calculator.get_distance(alignment)
For substitution-model selection in the context of full phylogenetic inference, see phylogenetics/modern-tree-inference.
| Warning Sign | Implication | Action | |-------------|-------------|--------| | Average pairwise identity <25% (protein) | Twilight zone; alignment may be unreliable | Use GUIDANCE2 to assess; consider structural alignment | | >30% of columns have >50% gaps | Possible non-homologous sequences or misalignment | Remove outlier sequences and re-align | | Identity varies dramatically across regions | Domain architecture mismatch | Align domains separately | | Conservation pattern absent in expected functional regions | Alignment error or non-homology | Verify with BLAST that sequences are truly homologous |
Alignment uncertainty propagates directly into evolutionary inference; for selection analysis (dN/dS), misaligned codons create artificial nonsynonymous differences and false positive signals. For column-confidence quantification (GUIDANCE2, MUSCLE5 ensemble, T-Coffee TCS), see the Confidence Assessment section in alignment/multiple-alignment.
The AlignInfo.SummaryInfo class is deprecated in recent Biopython versions. Use the custom functions in this skill instead:
pssm_with_pseudocounts() aboveinformation_content() function earlier in this skill| Metric | Description | Range | |--------|-------------|-------| | Identity | Fraction of identical residues | 0-1 | | Conservation | Most common residue frequency | 0-1 | | Shannon Entropy | Variability measure | 0 to log2(alphabet) | | Information Content | Max entropy - observed entropy | 0 to log2(alphabet) | | Gap Fraction | Proportion of gaps | 0-1 |
| Error | Cause | Solution |
|-------|-------|----------|
| ZeroDivisionError | Empty column after gap removal | Check for gap-only columns |
| KeyError | Character not in substitution matrix | Handle gaps separately |
| Negative IC | Wrong alphabet size | Use 4 for DNA, 20 for protein |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.