database-access/sequence-similarity/SKILL.md
Find homologous sequences using iterative BLAST (PSI-BLAST), profile HMMs (HMMER), and reciprocal best hit analysis. Use when identifying orthologs, distant homologs, or protein family members where standard BLAST is not sensitive enough.
npx skillsauth add GPTomics/bioSkills bio-sequence-similarityInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: BioPython 1.83+, NCBI BLAST+ 2.15+
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signatures<tool> --version then <tool> --help to confirm flagsIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
Advanced methods for finding homologous sequences beyond standard BLAST.
"Find distant homologs" → Use iterative search (PSI-BLAST) or profile HMMs (HMMER) to detect remote sequence similarity that standard BLAST misses.
psiblast -query seq.fa -db nr -num_iterations 3 or hmmsearch profile.hmm seqdbNcbipsiblastCommandline() (BioPython)Builds a position-specific scoring matrix (PSSM) through iterations to find distant homologs.
psiblast -query protein.fasta -db nr -out results.txt -num_iterations 3
psiblast -query protein.fasta -db nr \
-out results.txt \
-out_pssm pssm.asn \
-out_ascii_pssm pssm.txt \
-num_iterations 5
psiblast -in_pssm pssm.asn -db nr -out results.txt
psiblast -query protein.fasta -db nr \
-out results.txt \
-outfmt 6 \
-num_iterations 3 \
-inclusion_ethresh 0.001
psiblast -query protein.fasta -db nr \
-num_iterations 5 \
-inclusion_ethresh 0.001 \
-evalue 0.01 \
-num_threads 8 \
-out results.txt
| Parameter | Default | Description | |-----------|---------|-------------| | -num_iterations | 1 | Number of iterations | | -inclusion_ethresh | 0.002 | E-value for PSSM inclusion | | -evalue | 10 | E-value threshold for reporting | | -num_threads | 1 | CPU threads |
HMMER uses profile hidden Markov models for sensitive sequence searches.
jackhmmer -o results.txt -A aligned.sto --cpu 8 query.fasta database.fasta
hmmbuild profile.hmm alignment.sto
hmmsearch -o results.txt --tblout hits.tbl profile.hmm database.fasta
hmmsearch -o results.txt --domtblout domains.tbl profile.hmm database.fasta
wget https://ftp.ebi.ac.uk/pub/databases/Pfam/current_release/Pfam-A.hmm.gz
gunzip Pfam-A.hmm.gz
hmmpress Pfam-A.hmm
hmmscan --tblout pfam_hits.tbl --domtblout domains.tbl Pfam-A.hmm query.fasta
grep -v "^#" hits.tbl | head
awk '$5 < 1e-10' hits.tbl
| Column | Description | |--------|-------------| | 1 | Target name | | 2 | Accession | | 3 | Query name | | 4 | Query accession | | 5 | E-value (full sequence) | | 6 | Score (full sequence) | | 7 | Bias | | 8 | E-value (best domain) | | 9 | Score (best domain) |
Find orthologs using bidirectional best hits.
makeblastdb -in species_A.fasta -dbtype prot -out species_A_db
makeblastdb -in species_B.fasta -dbtype prot -out species_B_db
blastp -query species_A.fasta -db species_B_db -outfmt 6 -evalue 1e-5 -max_target_seqs 1 > A_vs_B.txt
blastp -query species_B.fasta -db species_A_db -outfmt 6 -evalue 1e-5 -max_target_seqs 1 > B_vs_A.txt
awk 'FNR==NR {a[$1]=$2; next} $2 in a && a[$2]==$1 {print $1"\t"$2}' \
A_vs_B.txt B_vs_A.txt > reciprocal_best_hits.txt
Goal: Identify orthologous gene pairs between two species using the reciprocal best hit criterion.
Approach: Parse forward and reverse BLAST results to extract the top hit per query, then retain only pairs where each sequence is the other's best match.
def find_rbh(forward_blast, reverse_blast):
'''Find reciprocal best hits from BLAST results'''
forward = {}
with open(forward_blast) as f:
for line in f:
parts = line.strip().split('\t')
query, subject = parts[0], parts[1]
if query not in forward:
forward[query] = subject
reverse = {}
with open(reverse_blast) as f:
for line in f:
parts = line.strip().split('\t')
query, subject = parts[0], parts[1]
if query not in reverse:
reverse[query] = subject
rbh = []
for a, b in forward.items():
if b in reverse and reverse[b] == a:
rbh.append((a, b))
return rbh
rbh_pairs = find_rbh('A_vs_B.txt', 'B_vs_A.txt')
for a, b in rbh_pairs:
print(f'{a}\t{b}')
Uses conserved domain database for more sensitive initial search.
deltablast -query protein.fasta -db nr -rpsdb cdd_delta -out results.txt
Search with a pattern plus sequence.
phi_pattern="G-x(2)-[ST]-x-[RK]"
phiblast -query protein.fasta -db nr -pattern "$phi_pattern" -out results.txt
from Bio.Blast import NCBIWWW, NCBIXML
with open('query.fasta') as f:
query = f.read()
result_handle = NCBIWWW.qblast('psiblast', 'nr', query, expect=0.001, word_size=3)
with open('psiblast_result.xml', 'w') as out:
out.write(result_handle.read())
result_handle.close()
with open('psiblast_result.xml') as f:
records = NCBIXML.parse(f)
for record in records:
for alignment in record.alignments:
for hsp in alignment.hsps:
if hsp.expect < 1e-10:
print(f'{alignment.hit_def[:50]}: E={hsp.expect}')
from Bio import SearchIO
results = SearchIO.parse('hmmsearch_output.txt', 'hmmer3-text')
for query_result in results:
print(f'Query: {query_result.id}')
for hit in query_result:
print(f' Hit: {hit.id}, E-value: {hit.evalue}')
for hsp in hit:
print(f' Domain: {hsp.bitscore} bits')
Similar to PSI-BLAST but uses HMM profiles.
jackhmmer -N 5 -o results.txt --tblout hits.tbl query.fasta database.fasta
jackhmmer -N 5 -A iterations.sto --chkhmm checkpoint query.fasta database.fasta
orthofinder -f proteomes/ -t 8
orthofinder -f proteomes/ -t 8 -M msa
mkdir proteomes
cp species_*.fasta proteomes/
| File | Content | |------|---------| | Orthogroups.tsv | All orthogroups | | Orthogroups_SingleCopyOrthologues.txt | 1:1 orthologs | | Species_Tree/ | Inferred species tree | | Gene_Trees/ | Individual gene trees |
| E-value | Interpretation | |---------|----------------| | < 1e-50 | Highly significant, likely homolog | | 1e-50 to 1e-10 | Significant, probable homolog | | 1e-10 to 1e-3 | Marginal, possible remote homolog | | > 0.01 | Not significant |
Goal: Run an end-to-end reciprocal best hit ortholog analysis from two proteome FASTA files.
Approach: Build BLAST databases for both species, run bidirectional best-hit searches, and extract reciprocal pairs using awk.
#!/bin/bash
SPECIES_A=$1
SPECIES_B=$2
EVALUE=1e-10
THREADS=8
echo "Building databases..."
makeblastdb -in $SPECIES_A -dbtype prot -out db_A
makeblastdb -in $SPECIES_B -dbtype prot -out db_B
echo "Running forward BLAST..."
blastp -query $SPECIES_A -db db_B -outfmt 6 -evalue $EVALUE \
-max_target_seqs 1 -num_threads $THREADS > forward.txt
echo "Running reverse BLAST..."
blastp -query $SPECIES_B -db db_A -outfmt 6 -evalue $EVALUE \
-max_target_seqs 1 -num_threads $THREADS > reverse.txt
echo "Finding reciprocal best hits..."
awk 'FNR==NR {best[$1]=$2; next}
$2 in best && best[$2]==$1 {print $1"\t"$2}' \
forward.txt reverse.txt > orthologs.txt
echo "Found $(wc -l < orthologs.txt) ortholog pairs"
rm -f db_A.* db_B.*
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.