Version Compatibility

Reference examples tested with: NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:

CLI: blastn -version then blastn -help to confirm flags
CLI: makeblastdb -help to confirm database build options

If a flag is unrecognized or behavior changes, introspect with -help and adapt the example to match the installed version rather than retrying.

Local BLAST

"Run BLAST locally for speed and control" -> Build or download a BLAST+ database, run the appropriate program with carefully chosen -task, masking, and thread settings, parse tabular output. Local BLAST is the right tool when remote is rate-limited or when the database must be reproducible (frozen).

The biggest mistakes are (a) using nt/nr without realizing they're >250 GB and grow weekly, (b) not building with -parse_seqids and then being unable to extract hit sequences with blastdbcmd, (c) using default blastn for cross-species when dc-megablast is correct, and (d) thinking -num_threads 32 will scale -- past ~16 threads BLAST is I/O bound.

CLI: makeblastdb, blastn/blastp, blastdbcmd, update_blastdb.pl (NCBI BLAST+)
Python: subprocess wrapper (preferred); Bio.Blast.Applications was deprecated and removed -- do not use

Installation

# conda (preferred)
conda install -c bioconda blast

# macOS
brew install blast

# Ubuntu
sudo apt install ncbi-blast+

# Verify
blastn -version    # NCBI BLAST+ 2.15+ expected
update_blastdb.pl --showall pretty | head

Database format: v5 vs v4

NCBI introduced BLAST database v5 in BLAST+ 2.10 (2020). v5 includes taxonomy indexing directly in the database files, enabling -taxids and -taxidlist filtering without a companion file. v4 databases require taxonomy4blast.sqlite3 to be present and discoverable.

| Feature | v4 | v5 | |---|---|---| | Default for prebuilt NCBI dbs | No (legacy) | Yes (since 2020) | | -taxids, -taxidlist support | No | Yes | | blastdbcmd -taxids | No | Yes | | New -info output fields | No | Yes |

update_blastdb.pl downloads v5 by default. When building a database manually with makeblastdb, v5 format requires -blastdb_version 5. Always pass -blastdb_version 5 and -parse_seqids when building from scratch.

`makeblastdb` flag taxonomy

| Flag | Effect | When | |---|---|---| | -dbtype nucl or -dbtype prot | Required | Always | | -parse_seqids | Indexes accessions so blastdbcmd -entry <acc> works | Almost always (downstream extraction) | | -hash_index | Speeds up extraction by accession | Large dbs | | -blastdb_version 5 | Use v5 format | Always | | -taxid 9606 | Single taxid for all seqs | Single-species DB | | -taxid_map file.tsv | Per-sequence taxid mapping (seqid<TAB>taxid) | Multi-species DB | | -mask_data masking.asnb | Apply precomputed soft-masking | Production pipelines | | -title "..." | Free-text label | Cosmetic | | -out path/prefix | DB file path prefix | Always |

makeblastdb -in reference.fasta -dbtype nucl \
            -blastdb_version 5 \
            -parse_seqids \
            -hash_index \
            -title "Custom reference 2026-05" \
            -out custom_db

`-task` taxonomy (the most-misused BLAST setting)

For blastn, the -task flag picks among heuristics with different word sizes and gap parameters.

| -task | Word | Gapped | Use case | Mistake to avoid | |---|---|---|---|---| | megablast (default) | 28 | linear | >=95% identity, intra-species, primer hits, contamination check | Used for cross-species and misses everything | | dc-megablast | 11 (discontiguous) | yes | Cross-species mRNA homology | Underused -- this is what blastn "should" be for cross-species | | blastn | 11 | yes | General sensitive DNA | Slower than dc-megablast for same job | | blastn-short | 7 | yes | Queries <50 nt (primers, small RNAs) | Default megablast can't seed at length 7 | | rmblastn | 11 | yes | Repeat masking; bundled with RepeatModeler | Specialized |

For blastp:

| -task | Word | Use case | |---|---|---| | blastp (default) | 3 | General protein similarity | | blastp-fast | 6 | Faster, less sensitive | | blastp-short | 2 | Peptides <30 aa, with PAM30 + word_size=2 typical |

Soft vs hard masking

| Setting | Effect on seed | Effect on extension | Effect on score | |---|---|---|---| | -soft_masking true (default for several tasks) | Skip masked positions when seeding | Allow extension through masked | Score includes masked positions | | -soft_masking false + -dust yes / -seg yes | Skip masked positions when seeding | Skip masked positions in extension | Score excludes masked positions | | Hard-mask in input FASTA (N or X) | Hard exclusion everywhere | Hard exclusion | Treated as mismatches |

Soft masking is correct for almost all cases. Hard masking creates artificial mismatches at masked boundaries and can split true alignments. The exception: searching against a database of repeats explicitly, where hard masking on the query is the right choice.

Thread scaling

BLAST+ parallelizes per-query (with -num_threads) but is I/O bound past ~16 threads on most hardware. For >100,000 query batches the better answer is splitting the input FASTA into N chunks and running N parallel blastn invocations -- this saturates CPUs better than -num_threads 64.

| Threads | Typical speedup vs single | Notes | |---|---|---| | 1-8 | Near-linear | Default sweet spot | | 8-16 | Sub-linear (1.5-2x over 8) | Useful on big SMP boxes | | 16-32 | Diminishing returns | I/O bound for most DBs | | 32+ | Often slower | Cache thrash + I/O contention |

For massive workflows, prefer DIAMOND (Buchfink et al. 2021 Nat Methods 18:366) or MMseqs2 (Steinegger & Soding 2017 Nat Biotechnol 35:1026) -- 100-10,000x faster than BLASTP at comparable sensitivity. See remote-homology skill.

Output format reference (`-outfmt`)

| -outfmt | Description | Use | |---|---|---| | 0 | Pairwise (default; human-readable) | Debugging, inspection | | 5 | XML | Programmatic parsing (Bio.SearchIO) | | 6 | Tabular (no header) | Most pipelines | | 7 | Tabular with comment headers | Self-documenting | | 11 | ASN.1 binary | Re-parse with later versions |

Custom tabular fields:

blastn -query q.fa -db db -outfmt "6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore staxids sscinames stitle"

Field key fields for analysis:

pident = percent identity over the HSP (NOT the query); for query-level, use qcovhsp
qcovs = total query coverage by all HSPs of this subject (the "coverage" most users want)
qcovhsp = query coverage by best HSP alone (use when there's only one HSP per hit)
staxids = taxonomy IDs (v5 only); critical for any "what species" workflow

Prebuilt NCBI databases via `update_blastdb.pl`

# List available
update_blastdb.pl --showall pretty | grep -E 'refseq|swissprot|nt|nr'

# Download (with decompress)
update_blastdb.pl --decompress refseq_select_rna

# Download specific volume of split database
update_blastdb.pl --decompress refseq_protein

# Download with parallelism
update_blastdb.pl --decompress --num_threads 4 refseq_select_rna

Sizes (approximate, 2026):

refseq_select_rna: ~5 GB
refseq_protein: ~30 GB
swissprot: <1 GB
nt: ~250 GB
nr: ~300 GB

For most use cases, refseq_select_* is the right starting point. nt/nr are storage-heavy and reproducibility-hostile.

Code patterns

Build and search a custom protein database

Goal: Build a BLAST+ protein database from a custom FASTA and search against it.

Approach: makeblastdb with v5 + parse_seqids + hash_index; blastp with explicit outfmt.

Reference (NCBI BLAST+ 2.15+):

#!/bin/bash
# Reference: NCBI BLAST+ 2.15+ | Verify API if version differs

REF=reference_proteins.fasta
DB=ref_prot_db
QUERY=query.fasta
OUT=hits.tsv

makeblastdb -in "$REF" -dbtype prot \
            -blastdb_version 5 -parse_seqids -hash_index \
            -title "$REF $(date +%Y-%m-%d)" \
            -out "$DB"

blastp -query "$QUERY" -db "$DB" \
       -evalue 1e-10 \
       -num_threads 8 \
       -max_target_seqs 500 \
       -outfmt "6 qseqid sseqid pident length qcovs evalue bitscore stitle" \
       -out "$OUT"

# Top hit per query by bit-score (column 7)
sort -k1,1 -k7,7gr "$OUT" | awk '!seen[$1]++' > top_hit_per_query.tsv

Cross-species DNA with dc-megablast

blastn -query mouse_cdna.fa -db human_refseq_rna \
       -task dc-megablast \
       -word_size 11 \
       -evalue 1e-10 \
       -outfmt "6 qseqid sseqid pident length qcovs evalue bitscore" \
       -num_threads 8 \
       -out cross_species.tsv

Short primer search

blastn -query primers.fa -db genome_db \
       -task blastn-short \
       -word_size 7 \
       -evalue 1000 \
       -outfmt 6 \
       -out primer_hits.tsv

Taxonomy-filtered search (BLAST v5 only)

# Restrict to specific taxids
blastp -query query.fa -db nr \
       -taxids 9606,10090,10116 \
       -outfmt "6 qseqid sseqid staxids sscinames evalue bitscore" \
       -out mammalian_hits.tsv

# Or to a taxid subtree (NCBI BLAST+ 2.13+)
echo 9606 > human_only.txt
blastp -query query.fa -db nr -taxidlist human_only.txt -outfmt 6 -out human_hits.tsv

Extract subject sequences for top hits

# Requires database built with -parse_seqids
cut -f2 top_hit_per_query.tsv | sort -u > hit_accessions.txt
blastdbcmd -db ref_prot_db -entry_batch hit_accessions.txt -out hits.fasta

# Pull a range of a sequence
blastdbcmd -db genome_db -entry NC_000001.11 -range 1000000-1001000 -out region.fa

Reciprocal best hit (RBH) for ortholog candidates

See ortholog-inference skill for the principled treatment. Quick version:

blastp -query A.fa -db B_db -outfmt 6 -evalue 1e-5 -num_threads 8 \
       -max_target_seqs 5 -out A_vs_B.tsv
blastp -query B.fa -db A_db -outfmt 6 -evalue 1e-5 -num_threads 8 \
       -max_target_seqs 5 -out B_vs_A.tsv

# Best forward + reverse, intersect
awk '!seen[$1]++ {print $1"\t"$2}' A_vs_B.tsv | sort > A_best
awk '!seen[$1]++ {print $1"\t"$2}' B_vs_A.tsv | sort > B_best
awk 'NR==FNR{a[$1]=$2; next} a[$2]==$1' A_best B_best > rbh.tsv

This works but does NOT handle paralog mis-pairs from gene duplication; for that use OrthoFinder or OMA (in ortholog-inference).

Python wrapper with version pinning

import subprocess
import shutil


def require_tool(name, min_version=None):
    if not shutil.which(name):
        raise RuntimeError(f'{name} not on PATH')
    out = subprocess.run([name, '-version'], capture_output=True, text=True)
    print(f'  {out.stdout.strip().splitlines()[0]}')


def run_blast(query, db, out, program='blastp', evalue=1e-10, threads=8, hitlist=500):
    require_tool(program)
    cmd = [program, '-query', query, '-db', db, '-out', out,
           '-evalue', str(evalue),
           '-num_threads', str(threads),
           '-max_target_seqs', str(hitlist),
           '-outfmt', '6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore stitle']
    subprocess.run(cmd, check=True)


def parse_tabular(path):
    cols = ['qseqid', 'sseqid', 'pident', 'length', 'qcovs', 'qcovhsp', 'evalue', 'bitscore', 'stitle']
    rows = []
    with open(path) as f:
        for line in f:
            vals = line.rstrip('\n').split('\t')
            d = dict(zip(cols, vals))
            for k in ('pident', 'qcovs', 'qcovhsp', 'evalue', 'bitscore'):
                d[k] = float(d[k])
            d['length'] = int(d['length'])
            rows.append(d)
    return rows

Failure modes

`nt`/`nr` size shock

Trigger: update_blastdb.pl --decompress nt without realizing the size.
Mechanism: nt is ~250 GB compressed, ~1 TB indexed.
Symptom: Disk fills mid-download; partial DB unusable.
Fix: Use refseq_select for most workflows; only pull nt/nr with intent and >1 TB free.

Missing `-parse_seqids`

Trigger: Built DB without -parse_seqids; later try blastdbcmd -entry.
Mechanism: Without the parsed index, blastdbcmd can't look up by accession.
Symptom: Error: ... not found in database.
Fix: Rebuild with -parse_seqids (cheap if FASTA still on disk).

Wrong `-task` for the question

Trigger: Default blastn for cross-species mRNA (word=11 but ungapped seeding).
Mechanism: Discontiguous seed (dc-megablast) is much more sensitive across species.
Symptom: Far fewer hits than the question warrants.
Fix: Use -task dc-megablast for cross-species; -task megablast only for >=95% identity.

Thread saturation

Trigger: -num_threads 64 on a 32-core box.
Mechanism: I/O bound past ~16 threads; cache thrash hurts past CPU count.
Symptom: No speedup or slowdown.
Fix: Cap at 8-16; split FASTA and run parallel processes instead for very large batches.

v4 database, expecting v5 features

Trigger: Old prebuilt DB; -taxids flag returns "Taxonomy database not available".
Mechanism: v4 needs taxonomy4blast.sqlite3 companion; v5 has taxonomy indexed in DB.
Symptom: Taxonomy filtering silently no-ops or errors.
Fix: Re-download with update_blastdb.pl --decompress (gets v5); or use v5 explicitly when building.

Soft-masking confusion

Trigger: Hard-masking input (replacing repeats with N or X) instead of using -dust/-seg.
Mechanism: Hard-mask creates artificial mismatches at boundaries.
Symptom: True alignments split into multiple short HSPs.
Fix: Pass unmasked FASTA + soft-mask via -soft_masking true + -dust yes/-seg yes.

`max_target_seqs` truncation

Trigger: -max_target_seqs 10 (Shah et al. 2019 Bioinformatics 35:1613).
Mechanism: Early termination, not top-N filter.
Symptom: Different top-10 than -max_target_seqs 500 + post-filter.
Fix: Set -max_target_seqs large (500+); filter top N in awk/Python.

Common errors

| Error / symptom | Cause | Solution | |---|---|---| | BLAST Database error | DB path wrong, or alias missing | blastdbcmd -db <db> -info to confirm | | Error: entry not found | Built without -parse_seqids | Rebuild | | Taxonomy filter no-op | v4 DB | Upgrade to v5 | | Threads >16 not faster | I/O bound | Split input + parallel invocations | | nt download fills disk | Database is huge | Use refseq_select | | Sequence too short | Query < word_size | Use -task blastn-short (word=7) | | Out of memory | Single large query | Reduce -num_threads, split query |

References

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402.
Shah N, Nute MG, Warnow T, Pop M. (2019) Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics 35:1613-1614.
Boratyn GM, Camacho C, Cooper PS, et al. (2013) BLAST: a more efficient report with usability improvements. Nucleic Acids Res 41:W29-W33.

Related Skills

blast-searches - Remote BLAST against NCBI servers
remote-homology - PSI-BLAST, jackhmmer, HHblits, MMseqs2, DIAMOND, Foldseek for distant homology
ortholog-inference - Reciprocal best hit, OrthoFinder, OMA for ortholog calls
sequence-io/read-sequences - Load query/reference FASTA files
batch-downloads - Download large reference FASTA sets before makeblastdb

Version Compatibility

Reference examples tested with: NCBI BLAST+ 2.15+

Before using code patterns, verify installed versions match. If versions differ:

CLI: blastn -version then blastn -help to confirm flags
CLI: makeblastdb -help to confirm database build options

If a flag is unrecognized or behavior changes, introspect with -help and adapt the example to match the installed version rather than retrying.

Local BLAST

CLI: makeblastdb, blastn/blastp, blastdbcmd, update_blastdb.pl (NCBI BLAST+)
Python: subprocess wrapper (preferred); Bio.Blast.Applications was deprecated and removed -- do not use

Installation

# conda (preferred)
conda install -c bioconda blast

# macOS
brew install blast

# Ubuntu
sudo apt install ncbi-blast+

# Verify
blastn -version    # NCBI BLAST+ 2.15+ expected
update_blastdb.pl --showall pretty | head

Database format: v5 vs v4

`makeblastdb` flag taxonomy

makeblastdb -in reference.fasta -dbtype nucl \
            -blastdb_version 5 \
            -parse_seqids \
            -hash_index \
            -title "Custom reference 2026-05" \
            -out custom_db

`-task` taxonomy (the most-misused BLAST setting)

For blastn, the -task flag picks among heuristics with different word sizes and gap parameters.

For blastp:

Soft vs hard masking

Thread scaling

Output format reference (`-outfmt`)

Custom tabular fields:

blastn -query q.fa -db db -outfmt "6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore staxids sscinames stitle"

Field key fields for analysis:

pident = percent identity over the HSP (NOT the query); for query-level, use qcovhsp
qcovs = total query coverage by all HSPs of this subject (the "coverage" most users want)
qcovhsp = query coverage by best HSP alone (use when there's only one HSP per hit)
staxids = taxonomy IDs (v5 only); critical for any "what species" workflow

Prebuilt NCBI databases via `update_blastdb.pl`

# List available
update_blastdb.pl --showall pretty | grep -E 'refseq|swissprot|nt|nr'

# Download (with decompress)
update_blastdb.pl --decompress refseq_select_rna

# Download specific volume of split database
update_blastdb.pl --decompress refseq_protein

# Download with parallelism
update_blastdb.pl --decompress --num_threads 4 refseq_select_rna

Sizes (approximate, 2026):

refseq_select_rna: ~5 GB
refseq_protein: ~30 GB
swissprot: <1 GB
nt: ~250 GB
nr: ~300 GB

For most use cases, refseq_select_* is the right starting point. nt/nr are storage-heavy and reproducibility-hostile.

Code patterns

Build and search a custom protein database

Goal: Build a BLAST+ protein database from a custom FASTA and search against it.

Approach: makeblastdb with v5 + parse_seqids + hash_index; blastp with explicit outfmt.

Reference (NCBI BLAST+ 2.15+):

#!/bin/bash
# Reference: NCBI BLAST+ 2.15+ | Verify API if version differs

REF=reference_proteins.fasta
DB=ref_prot_db
QUERY=query.fasta
OUT=hits.tsv

makeblastdb -in "$REF" -dbtype prot \
            -blastdb_version 5 -parse_seqids -hash_index \
            -title "$REF $(date +%Y-%m-%d)" \
            -out "$DB"

blastp -query "$QUERY" -db "$DB" \
       -evalue 1e-10 \
       -num_threads 8 \
       -max_target_seqs 500 \
       -outfmt "6 qseqid sseqid pident length qcovs evalue bitscore stitle" \
       -out "$OUT"

# Top hit per query by bit-score (column 7)
sort -k1,1 -k7,7gr "$OUT" | awk '!seen[$1]++' > top_hit_per_query.tsv

Cross-species DNA with dc-megablast

blastn -query mouse_cdna.fa -db human_refseq_rna \
       -task dc-megablast \
       -word_size 11 \
       -evalue 1e-10 \
       -outfmt "6 qseqid sseqid pident length qcovs evalue bitscore" \
       -num_threads 8 \
       -out cross_species.tsv

Short primer search

blastn -query primers.fa -db genome_db \
       -task blastn-short \
       -word_size 7 \
       -evalue 1000 \
       -outfmt 6 \
       -out primer_hits.tsv

Taxonomy-filtered search (BLAST v5 only)

# Restrict to specific taxids
blastp -query query.fa -db nr \
       -taxids 9606,10090,10116 \
       -outfmt "6 qseqid sseqid staxids sscinames evalue bitscore" \
       -out mammalian_hits.tsv

# Or to a taxid subtree (NCBI BLAST+ 2.13+)
echo 9606 > human_only.txt
blastp -query query.fa -db nr -taxidlist human_only.txt -outfmt 6 -out human_hits.tsv

Extract subject sequences for top hits

# Requires database built with -parse_seqids
cut -f2 top_hit_per_query.tsv | sort -u > hit_accessions.txt
blastdbcmd -db ref_prot_db -entry_batch hit_accessions.txt -out hits.fasta

# Pull a range of a sequence
blastdbcmd -db genome_db -entry NC_000001.11 -range 1000000-1001000 -out region.fa

Reciprocal best hit (RBH) for ortholog candidates

See ortholog-inference skill for the principled treatment. Quick version:

blastp -query A.fa -db B_db -outfmt 6 -evalue 1e-5 -num_threads 8 \
       -max_target_seqs 5 -out A_vs_B.tsv
blastp -query B.fa -db A_db -outfmt 6 -evalue 1e-5 -num_threads 8 \
       -max_target_seqs 5 -out B_vs_A.tsv

# Best forward + reverse, intersect
awk '!seen[$1]++ {print $1"\t"$2}' A_vs_B.tsv | sort > A_best
awk '!seen[$1]++ {print $1"\t"$2}' B_vs_A.tsv | sort > B_best
awk 'NR==FNR{a[$1]=$2; next} a[$2]==$1' A_best B_best > rbh.tsv

This works but does NOT handle paralog mis-pairs from gene duplication; for that use OrthoFinder or OMA (in ortholog-inference).

Python wrapper with version pinning

import subprocess
import shutil


def require_tool(name, min_version=None):
    if not shutil.which(name):
        raise RuntimeError(f'{name} not on PATH')
    out = subprocess.run([name, '-version'], capture_output=True, text=True)
    print(f'  {out.stdout.strip().splitlines()[0]}')


def run_blast(query, db, out, program='blastp', evalue=1e-10, threads=8, hitlist=500):
    require_tool(program)
    cmd = [program, '-query', query, '-db', db, '-out', out,
           '-evalue', str(evalue),
           '-num_threads', str(threads),
           '-max_target_seqs', str(hitlist),
           '-outfmt', '6 qseqid sseqid pident length qcovs qcovhsp evalue bitscore stitle']
    subprocess.run(cmd, check=True)


def parse_tabular(path):
    cols = ['qseqid', 'sseqid', 'pident', 'length', 'qcovs', 'qcovhsp', 'evalue', 'bitscore', 'stitle']
    rows = []
    with open(path) as f:
        for line in f:
            vals = line.rstrip('\n').split('\t')
            d = dict(zip(cols, vals))
            for k in ('pident', 'qcovs', 'qcovhsp', 'evalue', 'bitscore'):
                d[k] = float(d[k])
            d['length'] = int(d['length'])
            rows.append(d)
    return rows

Failure modes

`nt`/`nr` size shock

Trigger: update_blastdb.pl --decompress nt without realizing the size.
Mechanism: nt is ~250 GB compressed, ~1 TB indexed.
Symptom: Disk fills mid-download; partial DB unusable.
Fix: Use refseq_select for most workflows; only pull nt/nr with intent and >1 TB free.

Missing `-parse_seqids`

Trigger: Built DB without -parse_seqids; later try blastdbcmd -entry.
Mechanism: Without the parsed index, blastdbcmd can't look up by accession.
Symptom: Error: ... not found in database.
Fix: Rebuild with -parse_seqids (cheap if FASTA still on disk).

Wrong `-task` for the question

Trigger: Default blastn for cross-species mRNA (word=11 but ungapped seeding).
Mechanism: Discontiguous seed (dc-megablast) is much more sensitive across species.
Symptom: Far fewer hits than the question warrants.
Fix: Use -task dc-megablast for cross-species; -task megablast only for >=95% identity.

Thread saturation

Trigger: -num_threads 64 on a 32-core box.
Mechanism: I/O bound past ~16 threads; cache thrash hurts past CPU count.
Symptom: No speedup or slowdown.
Fix: Cap at 8-16; split FASTA and run parallel processes instead for very large batches.

v4 database, expecting v5 features

Trigger: Old prebuilt DB; -taxids flag returns "Taxonomy database not available".
Mechanism: v4 needs taxonomy4blast.sqlite3 companion; v5 has taxonomy indexed in DB.
Symptom: Taxonomy filtering silently no-ops or errors.
Fix: Re-download with update_blastdb.pl --decompress (gets v5); or use v5 explicitly when building.

Soft-masking confusion

Trigger: Hard-masking input (replacing repeats with N or X) instead of using -dust/-seg.
Mechanism: Hard-mask creates artificial mismatches at boundaries.
Symptom: True alignments split into multiple short HSPs.
Fix: Pass unmasked FASTA + soft-mask via -soft_masking true + -dust yes/-seg yes.

`max_target_seqs` truncation

Trigger: -max_target_seqs 10 (Shah et al. 2019 Bioinformatics 35:1613).
Mechanism: Early termination, not top-N filter.
Symptom: Different top-10 than -max_target_seqs 500 + post-filter.
Fix: Set -max_target_seqs large (500+); filter top N in awk/Python.

Common errors

References

Camacho C, Coulouris G, Avagyan V, Ma N, Papadopoulos J, Bealer K, Madden TL. (2009) BLAST+: architecture and applications. BMC Bioinformatics 10:421.
Altschul SF, Madden TL, Schaffer AA, Zhang J, Zhang Z, Miller W, Lipman DJ. (1997) Gapped BLAST and PSI-BLAST: a new generation of protein database search programs. Nucleic Acids Res 25:3389-3402.
Shah N, Nute MG, Warnow T, Pop M. (2019) Misunderstood parameter of NCBI BLAST impacts the correctness of bioinformatics workflows. Bioinformatics 35:1613-1614.
Boratyn GM, Camacho C, Cooper PS, et al. (2013) BLAST: a more efficient report with usability improvements. Nucleic Acids Res 41:W29-W33.

Related Skills

blast-searches - Remote BLAST against NCBI servers
remote-homology - PSI-BLAST, jackhmmer, HHblits, MMseqs2, DIAMOND, Foldseek for distant homology
ortholog-inference - Reciprocal best hit, OrthoFinder, OMA for ortholog calls
sequence-io/read-sequences - Load query/reference FASTA files
batch-downloads - Download large reference FASTA sets before makeblastdb

Adoption

GPTomics/bio-local-blast

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

Local BLAST

Installation

Database format: v5 vs v4

makeblastdb flag taxonomy

-task taxonomy (the most-misused BLAST setting)

Soft vs hard masking

Thread scaling

Output format reference (-outfmt)

Prebuilt NCBI databases via update_blastdb.pl

Code patterns

Build and search a custom protein database

Cross-species DNA with dc-megablast

Short primer search

Taxonomy-filtered search (BLAST v5 only)

Extract subject sequences for top hits

Reciprocal best hit (RBH) for ortholog candidates

Python wrapper with version pinning

Failure modes

nt/nr size shock

Missing -parse_seqids

Wrong -task for the question

Thread saturation

v4 database, expecting v5 features

Soft-masking confusion

max_target_seqs truncation

Common errors

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis

GPTomics/bio-local-blast

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

Local BLAST

Installation

Database format: v5 vs v4

makeblastdb flag taxonomy

-task taxonomy (the most-misused BLAST setting)

Soft vs hard masking

Thread scaling

Output format reference (-outfmt)

Prebuilt NCBI databases via update_blastdb.pl

Code patterns

Build and search a custom protein database

Cross-species DNA with dc-megablast

Short primer search

Taxonomy-filtered search (BLAST v5 only)

Extract subject sequences for top hits

Reciprocal best hit (RBH) for ortholog candidates

Python wrapper with version pinning

Failure modes

nt/nr size shock

Missing -parse_seqids

Wrong -task for the question

Thread saturation

v4 database, expecting v5 features

Soft-masking confusion

max_target_seqs truncation

Common errors

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis

`makeblastdb` flag taxonomy

`-task` taxonomy (the most-misused BLAST setting)

Output format reference (`-outfmt`)

Prebuilt NCBI databases via `update_blastdb.pl`

`nt`/`nr` size shock

Missing `-parse_seqids`

Wrong `-task` for the question

`max_target_seqs` truncation

`makeblastdb` flag taxonomy

`-task` taxonomy (the most-misused BLAST setting)

Output format reference (`-outfmt`)

Prebuilt NCBI databases via `update_blastdb.pl`

`nt`/`nr` size shock

Missing `-parse_seqids`

Wrong `-task` for the question

`max_target_seqs` truncation