database-access/batch-downloads/SKILL.md
Download large datasets from NCBI efficiently using EPost, history server, batching, rate limiting, and retry logic. Use when bulk-fetching tens of thousands of sequences, pulling all results of a large ESearch, designing reproducible pipelines, comparing E-utilities to NCBI Datasets v2 CLI, or implementing checksum-validated downloads. Encodes WebEnv TTL (~8h), EPost 200-ID limit, retmax caps, parallelization design, and integrity verification.
Reference examples tested with: BioPython 1.83+, NCBI Datasets CLI 16.0+, Entrez Direct 21.0+
Before using code patterns, verify installed versions match. If versions differ:
- Python: `pip show biopython`, then `help(Bio.Entrez.efetch)` to check signatures
- CLI: `datasets --version` and `efetch -version`

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
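The version check above can also be scripted. The following is a minimal sketch using only the standard library; the helper names (`installed_version`, `version_tuple`, `meets_minimum`) are illustrative and not part of BioPython or any NCBI tool.

```python
import itertools
from importlib import metadata


def installed_version(package):
    """Return the installed version string for `package`, or None if absent."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


def version_tuple(version):
    """Turn '1.83' or '1.83.dev0' into a tuple of the leading integer parts."""
    parts = []
    for piece in version.split('.'):
        lead = ''.join(itertools.takewhile(str.isdigit, piece))
        if not lead:
            break  # stop at the first non-numeric component such as 'dev0'
        parts.append(int(lead))
    return tuple(parts)


def meets_minimum(package, minimum):
    """True when `package` is installed at `minimum` or newer."""
    found = installed_version(package)
    return found is not None and version_tuple(found) >= version_tuple(minimum)
```

For example, `meets_minimum('biopython', '1.83')` returning `False` is a signal to inspect the installed API before running the reference patterns.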
"Download N thousand records from NCBI without getting blocked" -> The right answer is rarely "parallelize requests". For >5000 records the answer is the history server: search once, fetch in chunks server-side. For >100,000 records or whole genomes, the modern answer is NCBI Datasets v2 CLI -- the E-utilities are not optimized for bulk genome/gene data anymore.
This skill encodes (a) when to use each retrieval strategy, (b) the precise rate-limit math, (c) WebEnv lifecycle for long-running jobs, (d) how to design retry/resume, and (e) when to defect to Datasets CLI instead.
- `Entrez.esearch(usehistory='y')` + chunked `Entrez.efetch()` (BioPython)
- `datasets download genome accession ...` (NCBI Datasets v2 -- preferred for genome/gene bulk)
- `epost | efetch -mode webenv` (Entrez Direct)

```python
from Bio import Entrez
import time

Entrez.email = '[email protected]'
Entrez.api_key = 'YOUR_KEY'   # 3 -> 10 req/sec; mandatory for bulk
Entrez.tool = 'project-name'
```
| Record count | Source | Strategy | Why |
|---|---|---|---|
| < 200 known IDs | Any db | EFetch with comma-joined id= | Single round-trip; trivial |
| 200-5,000 known IDs | Any db | EPost (chunked at 200) -> history -> chunked EFetch | URL length limit + chunked retrieval |
| 5,000-100,000 from a query | Any db | ESearch with usehistory='y' -> chunked EFetch | Push to server once; pull in batches |
| > 100,000 sequences | nucleotide/protein | Consider FTP mirror or Datasets CLI; chunk if E-utils still | NCBI throttles bulk; offline mirror is faster |
| Whole genome assemblies | Assembly/Datasets | datasets download genome accession ... | Datasets v2 is the modern bulk endpoint |
| All RefSeq for a species | Datasets | datasets download genome taxon ... | Replaces assembly_summary.txt scraping |
| All gene records for a list | Datasets | datasets download gene gene-id ... | Cleaner output than EFetch gene XML |
| Raw sequencing reads | SRA | prefetch + fasterq-dump (or ENA mirror) | See sra-data skill |
The Datasets CLI is the right answer for any genome- or gene-centric bulk workflow as of 2023+. The E-utilities remain right for PubMed, ESummary metadata, custom queries, and anything not in the Datasets API. See ncbi-datasets-cli skill.
| Auth | req/sec | Sleep between calls | Bulk-friendly notes |
|---|---|---|---|
| Email only | 3 | 0.34 s | Single-threaded only; parallelism violates ToS |
| Email + API key | 10 | 0.10 s | Modest parallelism (max ~4 workers) safe |
| Institutional bulk | Negotiated | Email [email protected] | For >100K queries; courtesy expected |
NCBI's terms ask that heavy automated downloads run outside US weekday business hours (9 AM-5 PM ET). Cron the job for nights/weekends; pipelines that ignore this get IP-throttled.
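A pipeline can gate itself on the off-peak window described above. This is a minimal sketch; the function name `is_off_peak` is hypothetical, and it assumes the caller passes a datetime that is already expressed in US Eastern time (time zone conversion is left out).

```python
from datetime import datetime


def is_off_peak(now_et):
    """True outside 9 AM-5 PM on weekdays; `now_et` must already be Eastern time."""
    is_weekday = now_et.weekday() < 5          # Monday=0 ... Friday=4
    in_business_hours = 9 <= now_et.hour < 17  # 9:00 up to, not including, 17:00
    return not (is_weekday and in_business_hours)
```

A scheduler wrapper could call this before starting a heavy job and sleep until it returns `True`.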
Critical: parallelizing API calls is the WRONG bulk strategy. One stream with history server + larger batches is faster AND more polite than N parallel streams. The bottleneck is rarely NCBI's throughput at small N -- it's the round-trip count.
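For the single-stream approach, the sleep values in the rate-limit table can be enforced with a small gate instead of scattered `time.sleep` calls. The sketch below is an assumption-laden illustration: `RateGate` is a hypothetical class, it is meant for one thread only, and the injectable `clock` and `sleep` arguments exist only to make it testable.

```python
import time


class RateGate:
    """Keep successive calls at least `min_interval` seconds apart (single thread)."""

    def __init__(self, min_interval, clock=time.monotonic, sleep=time.sleep):
        self.min_interval = min_interval
        self._clock = clock
        self._sleep = sleep
        self._last = None  # time of the previous call, None before the first

    def wait(self):
        """Block until the next request is allowed, then record the call time."""
        now = self._clock()
        if self._last is not None:
            remaining = self.min_interval - (now - self._last)
            if remaining > 0:
                self._sleep(remaining)
                now = self._clock()
        self._last = now
```

Typical use is `gate = RateGate(0.10)` with an API key or `RateGate(0.34)` without, then `gate.wait()` immediately before each E-utilities call. Time spent in the request itself counts toward the interval, so slow responses do not add extra sleep.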
| Property | Value | Failure mode |
|---|---|---|
| TTL | 8 hours absolute (per NCBI E-utils help) | Job started Friday evening dies Saturday morning |
| Idle eviction | ~15 min empirically under load | A worker that stalls loses its WebEnv |
| Per-session isolation | One WebEnv string per session | Don't share across processes if isolation matters |
| Expired session behavior | HTTP 200 with <ERROR>WebEnv not found</ERROR> | Won't surface as HTTP error -- must parse body |
| Recovery | Re-run ESearch; resume at retstart | Need to checkpoint progress to disk |
Production pattern: checkpoint the retstart cursor after each successful chunk to disk; on restart, re-run ESearch (cheap), pick up retstart from checkpoint, continue.
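A checkpoint is only useful if a crash cannot leave it half-written. The following sketch writes the cursor to a temporary file and renames it into place; `save_checkpoint` and `load_checkpoint` are hypothetical helper names, and the JSON layout (`start`, `total`) is an assumption chosen for illustration.

```python
import json
import os
import tempfile
from pathlib import Path


def save_checkpoint(ckpt_path, start, total):
    """Write the cursor to a temp file, then rename it over the checkpoint."""
    ckpt = Path(ckpt_path)
    fd, tmp_name = tempfile.mkstemp(dir=ckpt.parent, prefix=ckpt.name, suffix='.tmp')
    with os.fdopen(fd, 'w') as tmp:
        json.dump({'start': start, 'total': total}, tmp)
        tmp.flush()
        os.fsync(tmp.fileno())   # push the bytes to disk before the rename
    os.replace(tmp_name, ckpt)   # replaces the old checkpoint in one step


def load_checkpoint(ckpt_path):
    """Return the saved retstart cursor, or 0 when no checkpoint exists."""
    ckpt = Path(ckpt_path)
    if not ckpt.exists():
        return 0
    return json.loads(ckpt.read_text())['start']
```

On restart, `load_checkpoint(path)` gives the `retstart` to resume from after ESearch has been re-run.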
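EPost pushes a list of UIDs to the history server so downstream EFetch can pull by WebEnv/QueryKey instead of by ID. Two constraints: post at most 200 IDs per call, and pass the existing WebEnv to each later post so every chunk lands in the same session under its own QueryKey.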
To intersect: term=#{key1} AND #{key2} against the WebEnv produces a new key.
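The intersection term can be built and submitted as below. This is a sketch, not a definitive implementation: `history_term` and `combine_on_server` are hypothetical names, and the second function assumes BioPython is installed and that `webenv` came from an earlier ESearch or EPost in the same session.

```python
def history_term(keys, operator='AND'):
    """Build an ESearch term such as '#1 AND #2' from history query keys."""
    if operator not in ('AND', 'OR', 'NOT'):
        raise ValueError(f'unsupported operator: {operator}')
    if len(keys) < 2:
        raise ValueError('need at least two query keys to combine')
    return f' {operator} '.join(f'#{key}' for key in keys)


def combine_on_server(db, webenv, keys, operator='AND'):
    """Run the combined term against an existing WebEnv; returns (new_key, count)."""
    from Bio import Entrez  # imported lazily so history_term stays dependency-free
    handle = Entrez.esearch(db=db, term=history_term(keys, operator),
                            webenv=webenv, usehistory='y', retmax=0)
    result = Entrez.read(handle)
    handle.close()
    return result['QueryKey'], int(result['Count'])
```

The returned key can then be passed to chunked EFetch exactly like a key from a plain ESearch.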
| Database | rettype | Optimal batch | Per-record payload |
|---|---|---|---|
| nucleotide | fasta | 500-1000 | ~1 KB |
| nucleotide | gb | 100-200 | ~10-50 KB |
| protein | fasta | 500-1000 | ~0.5 KB |
| protein | gp | 100-200 | ~5-30 KB |
| pubmed | medline | 1000-2000 | ~2 KB |
| pubmed | xml | 200-500 | ~10-30 KB |
| any | esummary (docsum) | 500 per call | ~1 KB |
Smaller batches for GenBank/XML because per-record payload is larger; larger batches for FASTA because the per-call HTTP overhead dominates.
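The batch-size guidance can be encoded as a lookup so pipelines do not hard-code one value for every format. This is a sketch: `BATCH_RANGES` copies the ranges from the table above, while `suggest_batch_size` and its fallback of 200 for unlisted combinations are assumptions made for illustration.

```python
# (database, rettype) -> (low, high) batch sizes from the table above
BATCH_RANGES = {
    ('nucleotide', 'fasta'): (500, 1000),
    ('nucleotide', 'gb'): (100, 200),
    ('protein', 'fasta'): (500, 1000),
    ('protein', 'gp'): (100, 200),
    ('pubmed', 'medline'): (1000, 2000),
    ('pubmed', 'xml'): (200, 500),
}


def suggest_batch_size(db, rettype, conservative=True, default=200):
    """Pick the low or high end of the range; unlisted pairs fall back to `default`."""
    low, high = BATCH_RANGES.get((db, rettype), (default, default))
    return low if conservative else high
```

Starting at the conservative end and raising the batch size only after a clean run keeps retries cheap when a chunk fails.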
Goal: Download all records matching a query, robust to mid-job failures and session expiry.
Approach: ESearch with history; checkpoint cursor to disk; on error, retry the chunk; on session expiry, re-run ESearch and resume from checkpoint.
Reference (BioPython 1.83+):
```python
import json
import time
from pathlib import Path
from urllib.error import HTTPError

from Bio import Entrez


def checkpointed_batch_download(db, term, out_path, ckpt_path, rettype='fasta',
                                retmode='text', batch_size=500, max_retries=3):
    '''Download all matching records with disk checkpoint for resumability.'''
    delay = 0.1 if Entrez.api_key else 0.34
    ckpt = Path(ckpt_path)
    start = json.loads(ckpt.read_text())['start'] if ckpt.exists() else 0

    # Search once; the result set lives on the history server
    h = Entrez.esearch(db=db, term=term, usehistory='y', retmax=0)
    s = Entrez.read(h); h.close()
    webenv, query_key, total = s['WebEnv'], s['QueryKey'], int(s['Count'])
    print(f'{total:,} records matched; resuming at {start:,}')

    mode = 'a' if start else 'w'
    with open(out_path, mode) as out:
        while start < total:
            for attempt in range(max_retries):
                try:
                    h = Entrez.efetch(db=db, rettype=rettype, retmode=retmode,
                                      retstart=start, retmax=batch_size,
                                      webenv=webenv, query_key=query_key)
                    body = h.read(); h.close()
                    if isinstance(body, bytes):
                        body = body.decode('utf-8', errors='replace')
                    # Expired sessions come back as HTTP 200 with an <ERROR> body
                    if '<ERROR>' in body[:500]:
                        raise RuntimeError(f'Server error in body: {body[:200]}')
                    out.write(body)
                    break
                except HTTPError as e:
                    if attempt == max_retries - 1:
                        raise  # never skip a chunk silently
                    if e.code == 429:
                        wait = 10 * (attempt + 1)
                        print(f'  Rate-limited; sleeping {wait}s')
                    else:
                        wait = 5 * (attempt + 1)
                    time.sleep(wait)
                except RuntimeError as e:
                    if attempt == max_retries - 1:
                        raise
                    # Likely WebEnv expired; re-run ESearch
                    print(f'  {e}; refreshing WebEnv')
                    h = Entrez.esearch(db=db, term=term, usehistory='y', retmax=0)
                    s = Entrez.read(h); h.close()
                    webenv, query_key = s['WebEnv'], s['QueryKey']
            out.flush()  # chunk is written out before the cursor advances
            start += batch_size
            ckpt.write_text(json.dumps({'start': start, 'total': total}))
            time.sleep(delay)
            print(f'  {min(start, total):,}/{total:,}')
    ckpt.unlink(missing_ok=True)
```
Goal: Download by a known list of 5,000 accessions without 414 URI errors.
Approach: EPost in 200-ID chunks; reuse WebEnv across chunks; final fetch reads from history.
Reference (BioPython 1.83+):
```python
import time

from Bio import Entrez


def epost_and_fetch(db, ids, out_path, rettype='fasta', retmode='text', batch_size=500):
    '''Post IDs in 200-ID chunks to one WebEnv, then fetch each chunk by its key.'''
    delay = 0.1 if Entrez.api_key else 0.34
    webenv = None
    posted_keys = []  # (query_key, n_ids) so we iterate each key's actual size

    for i in range(0, len(ids), 200):
        chunk = [str(x) for x in ids[i:i+200]]  # join() needs strings
        kwargs = {'db': db, 'id': ','.join(chunk)}
        if webenv:
            kwargs['WebEnv'] = webenv  # reuse the session from the first post
        h = Entrez.epost(**kwargs)
        r = Entrez.read(h); h.close()
        webenv = r['WebEnv']
        posted_keys.append((r['QueryKey'], len(chunk)))
        time.sleep(delay)

    with open(out_path, 'w') as out:
        for qk, n in posted_keys:
            for start in range(0, n, batch_size):
                h = Entrez.efetch(db=db, rettype=rettype, retmode=retmode,
                                  retstart=start, retmax=min(batch_size, n - start),
                                  webenv=webenv, query_key=qk)
                out.write(h.read()); h.close()
                time.sleep(delay)
```
Goal: Confirm downloaded FASTA has the expected record count and no truncation.
Approach: Count expected (from ESearch Count) vs observed (from SeqIO.parse).
```python
from Bio import SeqIO


def verify_fasta_count(path, expected):
    observed = sum(1 for _ in SeqIO.parse(path, 'fasta'))
    assert observed == expected, f'Expected {expected:,} records, found {observed:,}'
    return True
```
For genome assemblies and known-checksum files, NCBI provides MD5 manifests (e.g. md5checksums.txt in FTP genome directories). NCBI Datasets CLI verifies checksums automatically; the FTP-direct route needs explicit md5sum -c.
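For the FTP-direct route, the check that `md5sum -c` performs can also be done in Python. The sketch below assumes the manifest uses the usual `md5sum` layout of one `<checksum>  ./<file name>` pair per line; the function names are hypothetical, and files listed in the manifest but not present locally are skipped rather than reported.

```python
import hashlib
from pathlib import Path


def md5_of_file(path, chunk_size=1 << 20):
    """Stream a file through MD5 so large genomes do not need to fit in memory."""
    digest = hashlib.md5()
    with open(path, 'rb') as fh:
        for chunk in iter(lambda: fh.read(chunk_size), b''):
            digest.update(chunk)
    return digest.hexdigest()


def parse_md5_manifest(manifest_path):
    """Map bare file name -> expected checksum (assumed '<md5>  ./<name>' lines)."""
    expected = {}
    for line in Path(manifest_path).read_text().splitlines():
        if not line.strip():
            continue
        checksum, name = line.split(maxsplit=1)
        name = name.strip().lstrip('*')  # '*' marks binary mode in md5sum output
        expected[Path(name).name] = checksum.lower()
    return expected


def verify_against_manifest(manifest_path, directory):
    """Return {file name: True/False} for manifest entries found in `directory`."""
    results = {}
    for name, checksum in parse_md5_manifest(manifest_path).items():
        path = Path(directory) / name
        if path.exists():
            results[name] = md5_of_file(path) == checksum
    return results
```

Any `False` value in the result means the file should be re-downloaded before downstream use.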
```python
def estimate_efetch_calls(total, batch_size):
    return -(-total // batch_size)  # ceiling division
```
For 100,000 nucleotide records at 500/batch with an API key: 200 calls * 0.1 s = 20 s of mandatory sleep alone, before any transfer time. A comparable gene-centric pull via `datasets download gene gene-id ...` is one CLI invocation with parallel download and automatic checksum verification. For genome-scale bulk, Datasets wins by an order of magnitude.
Goal: Pull from two independent queries concurrently without violating rate limits.
Approach: Async with a global semaphore that enforces the API-key-permitted rate. Max 4 concurrent workers is the polite cap.
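```python
import asyncio
from Bio import Entrez

# Bio.Entrez is blocking, so each call runs in a worker thread via
# asyncio.to_thread (Python 3.9+); this stands in for an aiohttp-based client.
def _efetch_text(db, id_, rettype):
    h = Entrez.efetch(db=db, id=id_, rettype=rettype, retmode='text')
    body = h.read(); h.close()
    return body

async def fetch_with_semaphore(sem, db, id_, rettype):
    async with sem:  # at most 4 calls in flight
        body = await asyncio.to_thread(_efetch_text, db, id_, rettype)
        await asyncio.sleep(0.1)  # rate gate, held while the slot is occupied
        return body

async def fetch_all(db, ids, rettype='fasta'):
    sem = asyncio.Semaphore(4)  # 4 with an API key; 1 without
    return await asyncio.gather(*(fetch_with_semaphore(sem, db, i, rettype) for i in ids))
```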
Never exceed 4 concurrent workers with an API key, or 1 without. Above that NCBI throttles by IP and the whole pipeline grinds.
- Expired session: HTTP 200 with a `<ERROR>WebEnv not found</ERROR>` body. Check the body for `<ERROR>`; re-run ESearch and resume at checkpointed retstart.
- Passing comma-joined `id=` to EFetch with 250+ IDs. EPost first.
- E-utilities used for genome/gene bulk. Use `datasets download genome ...` for genomes; `datasets download gene ...` for gene records. See ncbi-datasets-cli.
- ESearch without `usehistory='y'`; Count > 9999. Use `usehistory='y'` for any query expected to return >5000.

| Error / symptom | Cause | Solution |
|---|---|---|
| HTTPError 429 | Rate limit | Sleep with backoff; get API key |
| HTTPError 414 | URL too long | EPost first |
| <ERROR>WebEnv not found</ERROR> (HTTP 200) | Session expired | Re-run ESearch; resume at checkpoint |
| Output file ends mid-record | Crash mid-chunk | Truncate-to-newline on resume |
| Slow despite API key | Too few records per call | Increase batch_size to 500+ for FASTA |
| Datasets CLI faster than EFetch | Workflow is genome/gene bulk | Switch to ncbi-datasets-cli |
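The "truncate-to-newline on resume" fix from the table can be sketched as follows. The helper name `truncate_to_last_newline` and the backward scan window are assumptions; the function only removes a trailing partial line and does not know about record boundaries.

```python
import os


def truncate_to_last_newline(path, window=1 << 16):
    """Drop a trailing partial line in place; return the new size in bytes."""
    size = os.path.getsize(path)
    with open(path, 'rb+') as fh:
        pos = size
        while pos > 0:
            step = min(window, pos)
            fh.seek(pos - step)
            block = fh.read(step)
            idx = block.rfind(b'\n')
            if idx != -1:
                keep = pos - step + idx + 1  # keep everything through the newline
                fh.truncate(keep)
                return keep
            pos -= step  # no newline in this window; scan further back
        fh.truncate(0)   # no newline anywhere: nothing complete to keep
        return 0
```

A file cut at a newline can still end with an incomplete record, for example a FASTA header without its sequence, so a record-count check against the ESearch Count remains the final arbiter.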