database-access/ncbi-datasets-cli/SKILL.md
Download genome assemblies, gene records, and ortholog data from NCBI using the modern Datasets v2 CLI (replaces assembly_summary.txt scraping and many EFetch workflows). Use when bulk-pulling genome assemblies, gene metadata across species, or ortholog sets; when E-utilities are too slow for genome-scale work; or when automatic checksum verification, parallel download, and clean accession-driven retrieval are required. Encodes the JSON-lines output format, dataformat conversion, --dehydrated for cloud workflows, and when Datasets is/isn't the right tool.
Reference examples tested with: NCBI Datasets CLI 16.0+ (2024), dataformat 16.0+
Before using code patterns, verify installed versions match. If versions differ:
datasets --version, dataformat --version
datasets <subcommand> --help
If a subcommand or flag is unrecognized, run datasets --help and adapt. The CLI is under active development; major releases (v15 -> v16) added subcommands and renamed flags.
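The version check above can be scripted. This is a minimal sketch, assuming `--version` prints text containing a dotted version number (for example `datasets version: 16.x.y`); the helper names `parse_major` and `cli_major_version` are hypothetical, not part of the CLI.

```python
import re
import shutil
import subprocess


def parse_major(version_output):
    """Return the major version found in text such as 'datasets version: 16.4.2'."""
    match = re.search(r"(\d+)\.\d+(?:\.\d+)?", version_output)
    return int(match.group(1)) if match else None


def cli_major_version(tool="datasets"):
    """Return the installed tool's major version, or None if it is not on PATH."""
    if shutil.which(tool) is None:
        return None
    out = subprocess.run([tool, "--version"], capture_output=True, text=True)
    # Assumption: the version string is on stdout or stderr.
    return parse_major(out.stdout + out.stderr)


if __name__ == "__main__":
    for tool in ("datasets", "dataformat"):
        major = cli_major_version(tool)
        if major is None:
            print(f"{tool}: not found or version not recognized")
        elif major < 16:
            print(f"{tool}: v{major} is older than the tested 16.0+; check --help")
        else:
            print(f"{tool}: v{major} OK")
```

If the output format differs on a given release, only `parse_major` needs adjusting.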
"Pull genome / gene / ortholog data from NCBI in 2026" -> The Datasets v2 CLI (launched 2023) is the official, supported bulk endpoint for genome and gene-centric data. It replaces the prior best-practice of scraping assembly_summary.txt + parallel FTP + manual checksum verification. For genome-scale data, it is strictly better than E-utilities (EFetch).
The CLI is not the right answer for everything. PubMed, SRA reads, and custom Entrez queries still belong to E-utilities. The decision rule: if the question is about genome assemblies, gene records, or pre-computed orthologs, use Datasets; otherwise stay with E-utilities.
datasets download genome accession GCF_...
datasets summary gene symbol BRCA1 --taxon human
subprocess wrapper; Python client ncbi-datasets-pylib (experimental as of 2024)
# conda (installs both datasets and dataformat)
conda install -c conda-forge ncbi-datasets-cli
# Or direct download (Linux, macOS, Windows binaries); dataformat is a separate binary
curl -O https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/datasets
curl -O https://ftp.ncbi.nlm.nih.gov/pub/datasets/command-line/v2/linux-amd64/dataformat
chmod +x datasets dataformat
datasets --version # 16.0+ expected
dataformat --version # companion tool
| Question | Datasets | Use instead |
|---|---|---|
| Genome assembly download | yes | — |
| All reference genomes for a taxon | yes | — |
| Gene record metadata (multi-species) | yes | — |
| Ortholog data for a gene | yes (datasets summary gene ... --ortholog all) | OrthoDB / Compara for tree-aware orthology |
| Virus data (assemblies, metadata) | yes (datasets download virus) | — |
| Annotation files (GFF3, GTF) for a genome | yes | — |
| Protein records (curated, with cross-refs) | partial | UniProt REST for richer annotation |
| PubMed | no | entrez-search / entrez-fetch |
| SRA reads | no | sra-data |
| BLAST | no | blast-searches / local-blast |
| Custom Entrez queries | no | entrez-search |
| Pre-computed alignments (Compara) | no | ensembl-rest |
| Subcommand | Purpose | Example |
|---|---|---|
| datasets summary genome | Metadata only; JSON output | datasets summary genome accession GCF_000001405.40 |
| datasets download genome | Download data files | datasets download genome accession GCF_... |
| datasets summary gene | Gene record metadata | datasets summary gene symbol BRCA1 --taxon human |
| datasets download gene | Download gene products | datasets download gene symbol BRCA1 --taxon human |
| datasets summary taxonomy | Taxonomy info | datasets summary taxonomy taxon human |
| datasets download virus | Virus assemblies/proteins | datasets download virus genome taxon SARS-CoV-2 |
| dataformat tsv / dataformat excel | Convert JSON-lines to tabular | dataformat tsv gene |
datasets summary returns JSON on stdout by default; add --as-json-lines to get JSON-lines (one object per record), which is the form dataformat reads. datasets download produces a .zip (default) or a "dehydrated" stub for cloud workflows.
| Flag | Effect |
|---|---|
| --filename out.zip | Where to write the archive |
| --include genome,gff3,gtf,protein,cds,rna,seq-report | Which file types to include |
| --reference | Restrict to reference assemblies only (one per species) |
| --annotated | Restrict to annotated assemblies |
| --assembly-source RefSeq / GenBank / all | Database source |
| --assembly-level chromosome,complete | Assembly quality level |
| --released-after 2024-01-01 | Date filter |
| --dehydrated | Skip data; download just stubs + URL list (for parallel pull) |
| --api-key XXX | Optional API key (raises rate limit) |
| --no-progressbar | For non-interactive use |
For very large pulls (1000+ genomes), --dehydrated is the right choice: download the metadata stubs first, then run datasets rehydrate later or pull URLs in parallel from the manifest.
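The flags in the table compose freely, so a script can build the argument list from options. This is a minimal sketch using only flags listed above; the function name `build_download_cmd` and the environment variable `NCBI_API_KEY` read by the example are assumptions of this sketch.

```python
import os
import shlex


def build_download_cmd(taxon, include=("genome",), reference=False,
                       annotated=False, assembly_levels=(), released_after=None,
                       dehydrated=False, filename="dataset.zip", api_key=None):
    """Assemble a `datasets download genome taxon ...` argument list."""
    cmd = ["datasets", "download", "genome", "taxon", taxon,
           "--include", ",".join(include), "--filename", filename]
    if reference:
        cmd.append("--reference")
    if annotated:
        cmd.append("--annotated")
    if assembly_levels:
        cmd += ["--assembly-level", ",".join(assembly_levels)]
    if released_after:
        cmd += ["--released-after", released_after]
    if dehydrated:
        cmd.append("--dehydrated")
    if api_key:
        cmd += ["--api-key", api_key]
    cmd.append("--no-progressbar")  # scripts are non-interactive
    return cmd


cmd = build_download_cmd(
    "Salmonella enterica",
    include=("genome", "gff3"),
    assembly_levels=("chromosome", "complete"),
    released_after="2024-01-01",
    dehydrated=True,
    filename="sal.zip",
    api_key=os.environ.get("NCBI_API_KEY"),  # hypothetical variable name
)
print(shlex.join(cmd))  # pass `cmd` to subprocess.run(cmd, check=True) to execute
```

Building a list rather than a shell string keeps multi-word taxon names such as "Salmonella enterica" intact as one argument.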
With --as-json-lines, datasets summary writes JSON-lines (one JSON object per line) to stdout. Pipe it through dataformat tsv for tabular output:
datasets summary genome taxon "Escherichia coli" --reference --as-json-lines \
| dataformat tsv genome --fields accession,organism-name,assminfo-level,assmstats-scaffold-n50 \
> ecoli_refs.tsv
dataformat subcommands match summary types: genome, gene, virus-genome, etc. The --fields list is documented per type via dataformat tsv <type> --help.
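When dataformat is unavailable, or a field has no mnemonic, the JSON-lines stream can be flattened directly. This is a minimal sketch, not a replacement for dataformat; it assumes records are nested JSON objects with snake_case keys, and the sample records are made up for illustration.

```python
import json


def pick(record, dotted_path):
    """Follow a dotted path such as 'assembly_stats.scaffold_n50'; None if absent."""
    value = record
    for key in dotted_path.split("."):
        if not isinstance(value, dict) or key not in value:
            return None
        value = value[key]
    return value


def jsonl_to_tsv(text, paths):
    """Convert JSON-lines text to TSV, one column per dotted path."""
    rows = ["\t".join(paths)]
    for raw in text.splitlines():
        if not raw.strip():
            continue
        record = json.loads(raw)
        cells = [pick(record, path) for path in paths]
        rows.append("\t".join("" if cell is None else str(cell) for cell in cells))
    return "\n".join(rows)


# Made-up records; real key names should be checked against actual output.
sample = (
    '{"accession": "GCF_EXAMPLE.1", "assembly_stats": {"scaffold_n50": 1000}}\n'
    '{"accession": "GCF_EXAMPLE.2"}\n'
)
print(jsonl_to_tsv(sample, ["accession", "assembly_stats.scaffold_n50"]))
```

Missing keys become empty cells instead of raising, which keeps rows aligned when some records lack a field.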
The "dehydrated" mode separates data discovery from data transfer:
1. datasets download genome taxon human --reference --dehydrated --filename human.zip (fast; ~MB).
2. unzip -p human.zip ncbi_dataset/fetch.txt prints a TSV of all URLs to pull.
3. datasets rehydrate --directory ./human/ or use aria2c --input-file=fetch.txt for parallel pull.
This is essential for HPC / cloud pipelines where inspection of the pending transfer is needed before committing the I/O.
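Inspecting the pending transfer can be done without extracting the archive. This is a minimal sketch, assuming each line of `ncbi_dataset/fetch.txt` is whitespace-separated with the URL first and a byte count second; that layout is an assumption to confirm against a real package, and the demo entries are made up.

```python
import io
import zipfile


def summarize_fetch(zip_source, member="ncbi_dataset/fetch.txt"):
    """Return (file_count, total_bytes) for the files listed in fetch.txt."""
    with zipfile.ZipFile(zip_source) as archive:
        text = archive.read(member).decode("utf-8")
    count, total = 0, 0
    for line in text.splitlines():
        parts = line.split()
        if not parts:
            continue
        count += 1
        # Assumption: second column is the size in bytes; ignored if not numeric.
        if len(parts) > 1 and parts[1].isdigit():
            total += int(parts[1])
    return count, total


# Demo on an in-memory archive with made-up entries (not real NCBI URLs).
buffer = io.BytesIO()
with zipfile.ZipFile(buffer, "w") as archive:
    archive.writestr(
        "ncbi_dataset/fetch.txt",
        "https://example.org/a.fna\t1200\tdata/a.fna\n"
        "https://example.org/b.gff\t300\tdata/b.gff\n",
    )
buffer.seek(0)
print(summarize_fetch(buffer))
```

`summarize_fetch` also accepts a path such as `"human.zip"`, since `zipfile.ZipFile` takes either a filename or a file-like object.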
datasets verifies MD5 checksums for every downloaded file automatically. Rehydrate workflows also verify. If a file fails checksum, Datasets retries up to 3 times then errors. This replaces the md5sum -c step that was required with assembly_summary.txt-based scraping.
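After an extracted package has been copied between filesystems, the check can be repeated by hand. This is a minimal sketch, assuming the extracted directory contains an `md5sum.txt` manifest in the usual `md5sum` format (hash, whitespace, relative path); the manifest name and location are assumptions to confirm against a real package.

```python
import hashlib
from pathlib import Path


def md5_of(path, chunk_size=1 << 20):
    """Compute a file's MD5 hex digest, reading in chunks to bound memory."""
    digest = hashlib.md5()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_manifest(root, manifest="md5sum.txt"):
    """Return relative paths that are missing or whose MD5 differs from the manifest."""
    root = Path(root)
    failures = []
    for line in (root / manifest).read_text().splitlines():
        if not line.strip():
            continue
        expected, rel_path = line.split(maxsplit=1)
        rel_path = rel_path.lstrip("*").strip()  # md5sum marks binary mode with '*'
        target = root / rel_path
        if not target.is_file() or md5_of(target) != expected.lower():
            failures.append(rel_path)
    return failures
```

An empty list from `verify_manifest("human_grch38/")` means every listed file is present and matches.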
Goal: Get human reference assembly with genome + GTF + protein + CDS.
Approach: datasets download genome accession ... --include ....
Reference (NCBI Datasets CLI 16.0+):
#!/bin/bash
# Reference: NCBI Datasets CLI 16.0+ | Verify API if version differs
datasets download genome accession GCF_000001405.40 \
--include genome,gff3,gtf,protein,cds,seq-report \
--filename human_grch38.zip
unzip -q human_grch38.zip -d human_grch38/
ls -lh human_grch38/ncbi_dataset/data/GCF_000001405.40/
Goal: Pull every RefSeq reference bacterial assembly with annotation.
Approach: --dehydrated first for inspection; rehydrate with parallel pull.
Reference (NCBI Datasets CLI 16.0+):
#!/bin/bash
# Step 1: dehydrated discovery
datasets download genome taxon Bacteria \
--reference --annotated --assembly-source RefSeq \
--include genome,gff3,protein \
--dehydrated --filename bact_refs.zip
unzip -q bact_refs.zip -d bact_refs/
wc -l bact_refs/ncbi_dataset/fetch.txt # how many files will be pulled
# Step 2: parallel pull via aria2 (or datasets rehydrate)
aria2c --input-file=bact_refs/ncbi_dataset/fetch.txt \
--dir=bact_refs/ncbi_dataset/data/ \
--max-concurrent-downloads=8 \
--retry-wait=5
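aria2c reads one download per URI line, with per-download options on the indented lines that follow, so a multi-column fetch.txt may need reshaping before it is passed to --input-file. This is a minimal sketch, assuming each fetch.txt line is whitespace-separated with the URL first and the destination path last (paths without spaces); the function name `fetch_to_aria2` is hypothetical.

```python
def fetch_to_aria2(fetch_text):
    """Rewrite fetch.txt lines as aria2c input entries: a URI line, then '  out=<path>'."""
    entries = []
    for line in fetch_text.splitlines():
        parts = line.split()
        if not parts:
            continue
        url, rel_path = parts[0], parts[-1]
        entries.append(url)
        if len(parts) > 1:
            # Indented option line: aria2c saves this URI under --dir at rel_path.
            entries.append(f"  out={rel_path}")
    return "\n".join(entries) + ("\n" if entries else "")


# Made-up entry for illustration (not a real NCBI URL).
print(fetch_to_aria2("https://example.org/a.fna\t1200\tdata/a.fna\n"), end="")
```

Write the result to a file and pass that file to `aria2c --input-file=...` with `--dir` set to the directory the relative paths should resolve against. `datasets rehydrate` remains the supported route and avoids this step.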
datasets summary gene symbol BRCA1 \
--taxon Mammalia \
--as-json-lines \
| dataformat tsv gene --fields gene-id,symbol,tax-name,description,name-authority,chromosomes \
> brca1_mammals.tsv
head brca1_mammals.tsv
datasets summary gene symbol BRCA1 --taxon human --ortholog all --as-json-lines \
| dataformat tsv gene --fields gene-id,symbol,tax-name,description \
> brca1_orthologs.tsv
--ortholog returns NCBI's ortholog set (a single representative per species; tree-aware orthology with multiple co-orthologs is in ortholog-inference / Compara / OMA).
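The one-representative-per-species expectation can be checked on the resulting TSV. This is a minimal sketch, assuming the file has a header row and that the caller supplies the exact header text of the taxon column, which depends on the dataformat version; the sample table and its column names are made up.

```python
import csv
import io
from collections import Counter


def species_with_multiple_rows(tsv_text, taxon_column):
    """Return taxon names that appear on more than one row of the TSV."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter="\t")
    counts = Counter(row[taxon_column] for row in reader)
    return sorted(name for name, n in counts.items() if n > 1)


# Made-up table; column headers here are placeholders, not dataformat's.
sample = (
    "gene_id\ttaxon\n"
    "1\tSpecies A\n"
    "2\tSpecies B\n"
    "3\tSpecies A\n"
)
print(species_with_multiple_rows(sample, "taxon"))
```

An empty result is consistent with a single representative per species; any names returned point to rows worth inspecting before downstream use.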
datasets summary genome taxon "Salmonella enterica" \
--assembly-level chromosome,complete \
--released-after 2024-01-01 \
--as-json-lines \
| dataformat tsv genome --fields accession,organism-name,assminfo-level,assmstats-scaffold-n50,assminfo-release-date \
> sal_2024.tsv
Reference (NCBI Datasets CLI 16.0+):
import json
import subprocess
from pathlib import Path


def datasets_summary(subcommand, *args):
    '''Run `datasets summary` and parse JSON-lines stdout.'''
    cmd = ['datasets', 'summary', subcommand, *args, '--as-json-lines']
    out = subprocess.run(cmd, capture_output=True, text=True, check=True)
    return [json.loads(line) for line in out.stdout.splitlines() if line.strip()]


def datasets_download(subcommand, *args, out='dataset.zip', include=None):
    '''Run `datasets download` and return the path of the written archive.'''
    cmd = ['datasets', 'download', subcommand, *args, '--filename', out]
    if include:
        cmd += ['--include', ','.join(include)]
    subprocess.run(cmd, check=True)
    return Path(out)


genomes = datasets_summary('genome', 'taxon', 'Escherichia coli', '--reference')
print(f'{len(genomes)} reference E. coli assemblies')
for g in genomes[:3]:
    acc = g.get('accession')
    # v2 reports use snake_case keys; inspect one record if this prints None
    n50 = g.get('assembly_stats', {}).get('contig_n50')
    print(f'  {acc} N50={n50}')

datasets_download('genome', 'accession', 'GCF_000005845.2',
                  out='ecoli_k12.zip',
                  include=['genome', 'gff3', 'protein'])
# E-utilities path: ESearch in assembly db -> ESummary -> manual FTP pull
# ~30 API calls + manual md5 + serial download
# Datasets path:
# datasets download genome accession GCF_... # one command, automatic md5, parallel inside
For genome workflows, Datasets is 5-50x faster than the equivalent E-utilities pipeline and far more reliable.
Use the sra-data skill (prefetch/fasterq-dump) for raw reads.
--reference filter loses too much
--reference returns one per species.
Drop --reference for full set; add --assembly-level chromosome,complete for quality filter instead.
--dehydrated.
Use --dehydrated + aria2c with --max-concurrent-downloads.
dataformat tsv genome --fields foo,bar with invented field names.
dataformat tsv genome --help lists valid field names; pull JSON-lines and inspect with jq to discover fields.
https://ftp.ncbi.nlm.nih.gov/genomes/all/refseq/....
--api-key.
Pass --api-key YOUR_KEY to bulk commands; obtain from https://www.ncbi.nlm.nih.gov/account/settings/.
conda update ncbi-datasets-cli.
| Error / symptom | Cause | Solution |
|---|---|---|
| "command not found: datasets" | Not installed | conda install -c conda-forge ncbi-datasets-cli |
| Subcommand not found | Old version | Upgrade to v16+ |
| Slow 1000-genome pull | Serial download | Use --dehydrated + aria2c |
| "Unknown field" in dataformat | Wrong field name | Check dataformat tsv <type> --help |
| Throttled bulk pull | No API key | Pass --api-key |
| --reference returns 1 per species | By design | Drop the flag or use --assembly-level |
| MD5 mismatch retried | Network issue | Datasets retries automatically; persistent failure -> investigate network |