database-access/remote-homology/SKILL.md
Detect distant homologs using profile and structure-aware methods that go beyond standard BLAST. Use when sequence identity falls into the twilight zone (<35% pairwise), when BLAST fails to find homologs that should exist, when working at metagenomic scale (DIAMOND, MMseqs2), or when structure beats sequence (Foldseek). Covers PSI-BLAST (iterative PSSM), jackhmmer (iterative HMM), HHblits/HHsearch (profile-profile), DIAMOND, MMseqs2, and Foldseek (3Di structural alphabet, van Kempen 2024).
npx skillsauth add GPTomics/bioSkills bio-remote-homologyInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: NCBI BLAST+ 2.15+, HMMER 3.4+, MMseqs2 15+, DIAMOND 2.1+, HH-suite3 3.3+, Foldseek 9+
Before using code patterns, verify installed versions match. If versions differ:
<tool> --version then <tool> --help to confirm flagspip show <package> then introspect signaturesIf a flag is unrecognized or behavior changes, introspect with --help and adapt the example to match the installed version rather than retrying.
"Find homologs my BLAST missed" -> Standard BLAST detects similarity reliably down to ~35% pairwise identity (the "twilight zone", Rost 1999 Protein Eng 12:85). Below that, profile methods (PSSMs, HMMs) and structure-aware methods (Foldseek) recover homologs that pairwise alignment misses.
This skill covers the decision: which method, when, against what database. The competition has shifted substantially since 2015: PSI-BLAST is no longer the de-facto standard; MMseqs2 and DIAMOND have replaced BLAST in most large-scale workflows; Foldseek (van Kempen et al. 2024 Nat Biotechnol 42:243) detects homologs no sequence method can reach by searching with a 3Di structural alphabet derived from AlphaFold/ESMFold predictions.
psiblast, jackhmmer, hmmsearch, hhblits, mmseqs, diamond, foldseekBio.SearchIO for output parsing; tool-specific clients exist but subprocess is preferred# Install via conda
conda install -c bioconda hmmer mmseqs2 diamond hhsuite foldseek
# BLAST+ (separate)
conda install -c bioconda blast
# Verify
hmmsearch -h | head -3 # HMMER 3.4+
mmseqs version # MMseqs2 15+
diamond --version # DIAMOND 2.1+
hhblits -h | head -3 # HH-suite3 3.3+
foldseek --version # Foldseek 9+
| Question | Best tool | Why | Sensitivity / Speed |
|---|---|---|---|
| Quick all-vs-all proteome | MMseqs2 or DIAMOND | 100-10,000x faster than BLAST at comparable sensitivity | Highest throughput, near-BLAST sensitivity |
| Identify distant protein homolog (single query) | jackhmmer | Iterative HMM; usually beats PSI-BLAST | Higher sensitivity than PSI-BLAST |
| Distant homology where structure available | Foldseek | 3Di alphabet finds homologs sequence misses | Finds hits PSI-BLAST/HMMER cannot |
| Profile-profile comparison (PDB70 / Pfam) | HHblits + HHsearch | Profile vs profile is most sensitive when target also has profile | Best sensitivity for very-deep homology |
| Domain assignment | hmmscan against Pfam-A | Curated, calibrated thresholds | Standard practice |
| Metagenomic protein clustering | MMseqs2 easy-cluster | Scales to >1B sequences | Production-grade |
| ORF search vs metagenome | DIAMOND blastx --frameshift | Frameshift-aware; long reads | Best for noisy long reads |
| Structure-aware homology (no AF2 prediction available) | Foldseek + ProstT5 | Predicts 3Di alphabet from sequence via PLM | Skip the AF2 step |
Foldseek (van Kempen, Kim, Tumescheit et al. 2024 Nat Biotechnol 42:243) searches protein structures by representing each residue's local geometry as a 21-letter "3Di" alphabet, then running BLAST-style alignment in this alphabet. Two consequences:
Two access modes:
foldseek easy-search query.pdb db_dir result.m8 tmp_dirfoldseek databases ProstT5 prostt5_db tmp then foldseek easy-search seq.fa db result.m8 tmp --prostt5-model prostt5_dbThe major prebuilt Foldseek databases (AlphaFoldDB, PDB100, ESMAtlas) are downloadable via foldseek databases.
PSI-BLAST (Altschul et al. 1997 Nucleic Acids Res 25:3389) builds a PSSM iteratively: each iteration includes hits below -inclusion_ethresh (default 0.005) in the next PSSM. Convergence is when no new hits cross the threshold. Stopping at convergence is often the wrong call -- iterations 2-3 are usually optimal; iterations 4+ frequently drift into paralog inclusion, contaminating the PSSM.
| Parameter | Default | Postdoc tuning |
|---|---|---|
| -num_iterations | 1 | 3 for most workflows; >3 risks drift |
| -inclusion_ethresh | 0.005 | 0.002 if specificity matters (Altschul 1997 recommendation) |
| -evalue | 10 | 0.01 for reporting cutoff |
| -num_threads | 1 | 8 for large DBs |
PSI-BLAST is also non-deterministic in detail: different input order or DB version can produce different PSSMs. For reproducibility, save the PSSM (-out_pssm pssm.asn) and re-use with -in_pssm.
HMMER 3 (Eddy 2011 PLoS Comput Biol 7:e1002195) is profile HMM search. Two main workflows:
hmmsearch profile.hmm seqdb: search a database with a known HMM (Pfam, custom).jackhmmer query.fa seqdb: iterative search like PSI-BLAST but with full HMM math. Typically higher sensitivity than PSI-BLAST at the same number of iterations.For domain assignment, hmmscan query.fa Pfam-A.hmm is the canonical pipeline. Pfam HMMs come with calibrated gathering thresholds (-gathering) -- use them instead of arbitrary E-value cutoffs.
# Build domain database once
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
# Annotate query against Pfam-A with gathering threshold (calibrated cutoff)
hmmscan --cut_ga --domtblout query.domtbl Pfam-A.hmm query.fa
HHblits (Remmert et al. 2012 Nat Methods 9:173; HH-suite3: Steinegger et al. 2019 BMC Bioinformatics 20:473) is profile-profile alignment. The most sensitive method when both query and target have an HMM representation. Standard pipeline:
hhblits against UniRef30 (or HHblits' default DB).hhsearch.Output is in HHM format. For very deep homology (the structural twilight zone), HHsearch vs PDB70 is still the gold standard.
MMseqs2 (Steinegger & Soding 2017 Nat Biotechnol 35:1026) replaces BLAST in nearly all large-scale workflows.
Key advantages:
-s 7.5, matches HMMER sensitivity (but on raw sequence, not profiles).mmseqs search --num-iterations 3 matches PSI-BLAST behavior at much higher speed.# Build target DB
mmseqs createdb target.fasta targetDB
mmseqs createindex targetDB tmp
# Sensitive search
mmseqs easy-search query.fa targetDB results.m8 tmp -s 7.5 --num-iterations 3
# All-vs-all clustering at 50% sequence identity
mmseqs easy-cluster all_proteins.fa cluster tmp --min-seq-id 0.5 -c 0.8
The -s parameter trades sensitivity for speed: 1.0 (fast), 4.0 (default), 7.5 (HMMER-like sensitivity).
DIAMOND (Buchfink et al. 2015 Nat Methods 12:59; v2: Buchfink et al. 2021 Nat Methods 18:366) is the de-facto replacement for blastp on large-scale workflows.
| Feature | DIAMOND v2 | blastp |
|---|---|---|
| Speed | 100-10,000x faster | baseline |
| Sensitivity (default) | ~95% of blastp | baseline |
| --ultra-sensitive | >99% of blastp | baseline |
| Frameshift-aware (long reads) | Yes (--frameshift 15) | No |
| GPU support | No (CPU-only) | No |
# Build DIAMOND DB
diamond makedb --in nr.fa -d nr
# Sensitive search
diamond blastp -d nr -q query.fa -o results.tsv \
--more-sensitive -e 1e-10 -p 16 \
--outfmt 6 qseqid sseqid pident length qcovhsp evalue bitscore stitle
# Long-read frameshift-aware (for nanopore/PacBio metagenomics)
diamond blastx -d nr -q longreads.fa --frameshift 15 -o reads.tsv --outfmt 6
For protein remote homology in 2026, DIAMOND --ultra-sensitive or MMseqs2 -s 7.5 should be the default before reaching for BLAST.
# 3 iterations, save checkpoint HMM at each iteration
jackhmmer -N 3 --chkhmm iter.hmm --tblout hits.tbl query.fa uniref90.fa
# After convergence (no new hits below threshold) use the final HMM for downstream searches
hmmsearch iter-3.hmm target.fa > hits.txt
Goal: Find structural homologs of a protein structure (or sequence via ProstT5) in AlphaFold's predicted structure database.
Approach: Download AlphaFoldDB structure DB (or ProstT5 for sequence-only); search with foldseek easy-search; parse m8 tabular output.
Reference (Foldseek 9+):
#!/bin/bash
# Reference: foldseek 9+ | Verify API if version differs
mkdir -p foldseek_dbs tmp
# Download the AlphaFoldDB Swiss-Prot subset (~few GB; full AFDB is much larger)
foldseek databases Alphafold/Swiss-Prot afdb_sp foldseek_dbs/tmp
# Structure-vs-structure search
foldseek easy-search query.pdb foldseek_dbs/afdb_sp results.m8 tmp \
--format-output query,target,fident,alnlen,evalue,bits,prob,qtmscore,ttmscore
head -5 results.m8
# qtmscore/ttmscore are TM-score equivalents from local Foldseek alignment.
# Hits with prob > 0.9 are confidently structurally homologous.
foldseek databases ProstT5 prostt5_model tmp
foldseek easy-search query.fa foldseek_dbs/afdb_sp seq_results.m8 tmp \
--prostt5-model prostt5_model --threads 8
This is the path to take when only a protein sequence is available -- ProstT5 (a protein language model) predicts the 3Di alphabet directly from sequence.
Goal: Build a position-specific scoring matrix iteratively, then re-use it for downstream searches.
Approach: 3 iterations against UniRef90 (or nr); save ASN.1 + ASCII PSSM; subsequent searches use -in_pssm.
Reference (NCBI BLAST+ 2.15+):
psiblast -query distant_protein.fa -db uniref90 \
-num_iterations 3 \
-inclusion_ethresh 0.002 \
-evalue 0.01 \
-num_threads 8 \
-out_pssm distant.pssm.asn \
-out_ascii_pssm distant.pssm.txt \
-out psiblast_results.txt
# Reuse saved PSSM in subsequent searches against a different DB
psiblast -in_pssm distant.pssm.asn -db swissprot \
-out swissprot_via_pssm.txt
Goal: PSI-BLAST-equivalent iterative profile search, but 100x faster.
Approach: mmseqs search --num-iterations 3 -s 7.5.
Reference (MMseqs2 15+):
mmseqs createdb query.fa queryDB
mmseqs createdb uniref90.fa uniref90DB
mmseqs createindex uniref90DB tmp
mmseqs search queryDB uniref90DB resultDB tmp \
--num-iterations 3 \
-s 7.5 \
-e 1e-5 \
--threads 16
mmseqs convertalis queryDB uniref90DB resultDB results.m8 \
--format-output query,target,fident,alnlen,evalue,bits
# One-time prep
hmmpress Pfam-A.hmm
# Annotate
hmmscan --cut_ga --domtblout query.domtbl --cpu 8 Pfam-A.hmm query.fa
# Filter: gathering threshold passes are already significance-validated
awk '!/^#/ {print $1, $2, $4, $5, $7, $8, $13}' query.domtbl | head
# columns: target_name, accession, query_name, accession, full_evalue, full_score, i_evalue
# Build query MSA via HHblits vs UniRef30
hhblits -i query.fa -d uniref30 -oa3m query.a3m -n 3 -cpu 8
# Search PDB70 with the query profile
hhsearch -i query.a3m -d pdb70 -o query.hhr -cpu 8
head -30 query.hhr # Top hits with probability + alignment statistics
diamond makedb --in uniref90.fa -d uniref90
diamond blastp -d uniref90 -q metagenome_proteins.fa -o hits.tsv \
--ultra-sensitive -e 1e-5 -p 32 \
--outfmt 6 qseqid sseqid pident length qcovhsp evalue bitscore stitle
-inclusion_ethresh 0.001.mmseqs easy-search without -s.-s 4.0 is fast but misses remote homologs.-s 7.5 for distant homology; -s 5.7 is a middle ground.diamond blastp without --more-sensitive or --ultra-sensitive.--more-sensitive for general work, --ultra-sensitive for remote homology.segmasker -infmt fasta -in query.fa) before building the profile.foldseek databases ProstT5 ... and --prostt5-model.| Error / symptom | Cause | Solution |
|---|---|---|
| PSI-BLAST returns implausible hits | Profile drift (too many iterations) | Cap at 3 iterations; tighter -inclusion_ethresh |
| MMseqs2 hits all unrelated | Default sensitivity too low | -s 7.5 |
| DIAMOND misses BLAST hits | Default mode lossy | --more-sensitive |
| Foldseek hits structurally unrelated proteins | Common fold, no homology | Cross-check with sequence and functional residues |
| HHblits prefilter no hits | Query MSA too sparse | Add -n 4 iterations; check input |
| jackhmmer ConvergenceError | Loop bug pre-v3.4 | Upgrade HMMER |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.