src/agent/skills/ncbi_gene/SKILL.md
Query NCBI Gene via E-utilities/Datasets API. Search by symbol/ID, retrieve gene info (RefSeqs, GO, locations, phenotypes), batch lookups, for gene annotation and functional analysis.

NCBI Gene is a comprehensive database that integrates gene information across diverse species. It provides nomenclature, reference sequences (RefSeqs), chromosomal maps, biological pathways, genetic variations, phenotypes, and cross-references to global genomic resources.

In this project the agent exposes three download tools: fetch gene data by ID, fetch gene data by symbol (both via the Datasets API), and batch lookup by symbols. For finer control, ncbi_operations.py also provides E-utilities query and download functions (esearch, esummary, efetch) and additional batch operations. All query and download functions return rich JSON with the keys status, file_info or content, content_preview, biological_metadata, and execution_context.

Use this skill when working with gene data: searching by gene symbol or ID, retrieving gene sequences and metadata, analyzing gene functions and pathways, or performing batch gene lookups.

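Because every tool returns the same rich JSON envelope, a small helper can check it before anything else touches the payload. The sketch below is only an illustration: `unwrap_result` is a hypothetical helper, the `"success"` status value is taken from the usage examples in this document, and the sample payload (including the `file_path` key) is invented to mirror the keys listed above rather than captured from a real call.

```python
import json
from typing import Any, Dict


def unwrap_result(raw: str) -> Dict[str, Any]:
    """Parse a rich-JSON result string and raise if the call did not succeed."""
    envelope = json.loads(raw)
    # Assumption: a successful call reports status == "success".
    if envelope.get("status") != "success":
        raise RuntimeError(f"NCBI call failed: {envelope}")
    return envelope


# Hypothetical payload shaped after the documented keys (not real tool output).
sample = json.dumps({
    "status": "success",
    "file_info": {"file_path": "output/ncbi_gene_brca1.json"},
    "content_preview": "...",
    "biological_metadata": {},
    "execution_context": {},
})

envelope = unwrap_result(sample)
print(sorted(envelope))
```

Raising on a non-success status keeps downstream code from silently working with an error payload.
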
The skill provides:

- Python modules in src/tools/database/ncbi/: fetch_gene_data.py (Datasets API), query_gene.py (E-utilities), batch_gene_lookup.py (batch operations), ncbi_operations.py (query/download operations); gene download functions are re-exported via the package.
- Reference documents: references/api_reference.md, references/common_workflows.md

NCBI provides two main APIs:

- E-utilities (esearch, esummary, efetch)
- Datasets API, used by the download_ncbi_gene_by_id and download_ncbi_gene_by_symbol tools

| Tool name | Arguments | Purpose |
|-----------|-----------|---------|
| download_ncbi_gene_by_id | gene_id, out_path | Download gene data by NCBI Gene ID (Datasets API) to JSON |
| download_ncbi_gene_by_symbol | symbol, taxon, out_path | Download gene data by symbol and organism (Datasets API) to JSON |
| download_ncbi_batch_lookup_by_symbols | gene_symbols, organism, out_path | Batch lookup multiple genes by symbols to JSON |

Also available as general NCBI tools (useful in gene workflows):

| Tool name | Arguments | Purpose |
|-----------|-----------|---------|
| download_ncbi_sequence | ncbi_id, out_path, db (optional) | Download NCBI sequence by accession (FASTA) |
| download_ncbi_metadata | ncbi_id, out_path, db, rettype (optional) | Download NCBI metadata (GenBank/XML) |
| download_ncbi_blast | sequence, out_path, program, database, etc. | Submit BLAST search and download XML |

Python functions by module:

| Capability | Function | Module | Purpose |
|------------|----------|--------|---------|
| Fetch by ID | fetch_gene_by_id(gene_id, api_key) | fetch_gene_data.py | Datasets API: gene data as dict |
| Fetch by symbol | fetch_gene_by_symbol(symbol, taxon, api_key) | fetch_gene_data.py | Datasets API: gene data as dict |
| Fetch multiple | fetch_multiple_genes(gene_ids, api_key) | fetch_gene_data.py | Datasets API: multiple genes at once |
| Taxon lookup | get_taxon_id(taxon_name) | fetch_gene_data.py | Convert name to NCBI taxon ID |
| E-util search | esearch(query, retmax, api_key) | query_gene.py | Search Gene DB, returns Gene IDs |
| E-util summary | esummary(gene_ids, api_key) | query_gene.py | Get document summaries by Gene IDs |
| E-util fetch | efetch(gene_ids, retmode, api_key) | query_gene.py | Fetch full gene records (XML/text) |
| Search+summarize | search_and_summarize(query, organism, max_results, api_key) | query_gene.py | Convenience: search + display |
| Batch search | batch_esearch(queries, organism, api_key) | batch_gene_lookup.py | Search multiple symbols → ID map |
| Batch summary | batch_esummary(gene_ids, api_key, chunk_size) | batch_gene_lookup.py | Summaries in chunks |
| Batch by IDs | batch_lookup_by_ids(gene_ids, api_key) | batch_gene_lookup.py | Structured gene data by IDs |
| Batch by symbols | batch_lookup_by_symbols(gene_symbols, organism, api_key) | batch_gene_lookup.py | Structured gene data by symbols |
| Query: by ID | query_ncbi_gene_by_id(gene_id) | ncbi_operations.py | Returns rich JSON in memory |
| Query: by symbol | query_ncbi_gene_by_symbol(symbol, taxon) | ncbi_operations.py | Returns rich JSON in memory |
| Query: esearch | query_ncbi_gene_esearch(query, retmax) | ncbi_operations.py | Returns rich JSON in memory |
| Query: esummary | query_ncbi_gene_esummary(gene_ids) | ncbi_operations.py | Returns rich JSON in memory |
| Query: efetch | query_ncbi_gene_efetch(gene_ids, retmode) | ncbi_operations.py | Returns rich JSON in memory |
| Query: batch search | query_ncbi_batch_esearch(queries, organism) | ncbi_operations.py | Returns rich JSON in memory |
| Query: batch by IDs | query_ncbi_batch_lookup_by_ids(gene_ids) | ncbi_operations.py | Returns rich JSON in memory |
| Query: batch by symbols | query_ncbi_batch_lookup_by_symbols(gene_symbols, organism) | ncbi_operations.py | Returns rich JSON in memory |
| Download: by ID | download_ncbi_gene_by_id(gene_id, out_path) | ncbi_operations.py | Save to file, return rich JSON |
| Download: by symbol | download_ncbi_gene_by_symbol(symbol, taxon, out_path) | ncbi_operations.py | Save to file, return rich JSON |
| Download: esearch | download_ncbi_gene_esearch(query, out_path, retmax) | ncbi_operations.py | Save to file, return rich JSON |
| Download: esummary | download_ncbi_gene_esummary(gene_ids, out_path) | ncbi_operations.py | Save to file, return rich JSON |
| Download: efetch | download_ncbi_gene_efetch(gene_ids, out_path, retmode) | ncbi_operations.py | Save to file, return rich JSON |
| Download: batch search | download_ncbi_batch_esearch(queries, out_path, organism) | ncbi_operations.py | Save to file, return rich JSON |
| Download: batch by IDs | download_ncbi_batch_lookup_by_ids(gene_ids, out_path) | ncbi_operations.py | Save to file, return rich JSON |
| Download: batch by symbols | download_ncbi_batch_lookup_by_symbols(gene_symbols, organism, out_path) | ncbi_operations.py | Save to file, return rich JSON |
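
The `chunk_size` argument of batch_esummary suggests that long ID lists are split into smaller requests. How the project does this is not shown here; the sketch below is one plausible way to chunk a list, with `chunked` as a hypothetical helper name.

```python
from typing import Iterator, List, Sequence


def chunked(ids: Sequence[str], chunk_size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of ids, each holding at most chunk_size items."""
    if chunk_size < 1:
        raise ValueError("chunk_size must be at least 1")
    for start in range(0, len(ids), chunk_size):
        yield list(ids[start:start + chunk_size])


gene_ids = ["672", "675", "7157", "5728", "472"]
batches = list(chunked(gene_ids, 2))
print(batches)  # [['672', '675'], ['7157', '5728'], ['472']]
```

Each batch can then be passed to a summary call in turn, which keeps individual requests small.
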

Download a gene by ID:

```python
from src.tools.database.ncbi import download_ncbi_gene_by_id

result = download_ncbi_gene_by_id("672", "output/ncbi_gene_brca1.json")
# Returns rich JSON with gene metadata, RefSeqs, GO annotations, etc.
```

Download a gene by symbol and organism:

```python
from src.tools.database.ncbi import download_ncbi_gene_by_symbol

result = download_ncbi_gene_by_symbol("BRCA1", "human", "output/ncbi_gene_brca1_by_symbol.json")
result = download_ncbi_gene_by_symbol("TP53", "Homo sapiens", "output/ncbi_gene_tp53.json")
```

Batch lookup by symbols:

```python
from src.tools.database.ncbi import download_ncbi_batch_lookup_by_symbols

result = download_ncbi_batch_lookup_by_symbols(
    ["BRCA1", "TP53", "EGFR"], "human", "output/ncbi_genes_batch.json"
)
```

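For orientation, the by-ID and by-symbol tools above are described as Datasets API calls. The sketch below builds request URLs following the public NCBI Datasets v2 REST path layout as I understand it; whether fetch_gene_data.py uses exactly these paths is an assumption, and `gene_by_id_url` and `gene_by_symbol_url` are hypothetical helper names. No request is sent.

```python
from urllib.parse import quote

# Assumption: base URL of the public NCBI Datasets v2 REST API.
DATASETS_BASE = "https://api.ncbi.nlm.nih.gov/datasets/v2"


def gene_by_id_url(gene_id: str) -> str:
    """Build the URL for a gene record addressed by NCBI Gene ID."""
    return f"{DATASETS_BASE}/gene/id/{quote(str(gene_id), safe='')}"


def gene_by_symbol_url(symbol: str, taxon: str) -> str:
    """Build the URL for a gene record addressed by symbol and taxon."""
    return (
        f"{DATASETS_BASE}/gene/symbol/{quote(symbol, safe='')}"
        f"/taxon/{quote(taxon, safe='')}"
    )


print(gene_by_id_url("672"))
print(gene_by_symbol_url("TP53", "Homo sapiens"))
```

Percent-encoding the path segments matters for taxon names that contain spaces, such as "Homo sapiens".
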
E-utilities workflow (search, then download):

```python
from src.tools.database.ncbi.ncbi_operations import (
    query_ncbi_gene_esearch,
    download_ncbi_gene_esummary,
    download_ncbi_gene_efetch,
)

# Step 1: Search for gene IDs
esearch_result = query_ncbi_gene_esearch("BRCA1[gene] AND human[organism]", retmax=10)

# Step 2: Download summaries or full records
download_ncbi_gene_esummary(["672", "7157"], "output/gene_summaries.json")
download_ncbi_gene_efetch(["672"], "output/gene_full_record.xml")
```

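To make the search step more concrete, here is a rough sketch of the HTTP request that an E-utilities esearch against the Gene database corresponds to. It uses the standard esearch.fcgi endpoint and its `db`, `term`, `retmax`, `retmode`, and `api_key` parameters; `esearch_url` is a hypothetical helper, and the project's own query_gene.py may build its requests differently. No request is sent.

```python
from typing import Dict, Optional, Union
from urllib.parse import urlencode

EUTILS_BASE = "https://eutils.ncbi.nlm.nih.gov/entrez/eutils"


def esearch_url(term: str, retmax: int = 20, api_key: Optional[str] = None) -> str:
    """Build an esearch URL for the Gene database, returning JSON."""
    params: Dict[str, Union[str, int]] = {
        "db": "gene",
        "term": term,
        "retmax": retmax,
        "retmode": "json",
    }
    if api_key:
        # Only attach the key when one is supplied.
        params["api_key"] = api_key
    return f"{EUTILS_BASE}/esearch.fcgi?{urlencode(params)}"


url = esearch_url("BRCA1[gene] AND human[organism]", retmax=10)
print(url)
```

`urlencode` escapes the square brackets and spaces in the field-tagged query, so the term can be written exactly as it would be typed into the NCBI search box.
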
Calling the underlying functions directly:

```python
from src.tools.database.ncbi.fetch_gene_data import fetch_gene_by_id, fetch_gene_by_symbol
from src.tools.database.ncbi.query_gene import esearch, esummary, efetch
from src.tools.database.ncbi.batch_gene_lookup import batch_lookup_by_symbols

# Datasets API
gene_data = fetch_gene_by_id("672")
gene_data = fetch_gene_by_symbol("BRCA1", "human")

# E-utilities
gene_ids = esearch("insulin[gene] AND human[organism]")
summaries = esummary(gene_ids)
records = efetch(gene_ids, retmode="xml")

# Batch
results = batch_lookup_by_symbols(["BRCA1", "TP53"], "human")
```

Gene annotation for a single gene or a gene panel:

```python
from src.tools.database.ncbi import (
    download_ncbi_gene_by_symbol,
    download_ncbi_batch_lookup_by_symbols,
)

# Single gene
download_ncbi_gene_by_symbol("BRCA1", "human", "output/brca1_annotation.json")

# Gene panel
download_ncbi_batch_lookup_by_symbols(
    ["BRCA1", "BRCA2", "TP53", "PTEN", "ATM"], "human", "output/cancer_panel.json"
)
```

Search, then download details for each hit:

```python
import json

from src.tools.database.ncbi.ncbi_operations import (
    query_ncbi_gene_esearch,
    download_ncbi_gene_by_id,
)

# Search by complex query
result = query_ncbi_gene_esearch("p53 AND human[organism]", retmax=5)
parsed = json.loads(result)

# Download details for each hit
if parsed.get("status") == "success":
    content = json.loads(parsed["content"])
    for gene_id in content.get("gene_ids", []):
        download_ncbi_gene_by_id(str(gene_id), f"output/gene_{gene_id}.json")
```

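The example above assumes that `content` is itself a JSON-encoded string. If that ever differs (for instance, if `content` arrives as an already-parsed object), a more defensive extraction may help. The sketch below is an illustration under that assumption; `extract_gene_ids` is a hypothetical helper and both sample payloads are invented.

```python
import json
from typing import Any, List


def extract_gene_ids(raw: str) -> List[str]:
    """Return gene IDs from an esearch result envelope, or [] if unavailable."""
    envelope = json.loads(raw)
    if envelope.get("status") != "success":
        return []
    content: Any = envelope.get("content", {})
    if isinstance(content, str):
        # Content may be a JSON string nested inside the envelope.
        try:
            content = json.loads(content)
        except json.JSONDecodeError:
            return []
    if not isinstance(content, dict):
        return []
    return [str(gene_id) for gene_id in content.get("gene_ids", [])]


# Hypothetical payloads covering both shapes of "content".
nested = json.dumps({"status": "success", "content": json.dumps({"gene_ids": [7157]})})
plain = json.dumps({"status": "success", "content": {"gene_ids": ["672"]}})
print(extract_gene_ids(nested), extract_gene_ids(plain))
```

Returning an empty list on failure lets the download loop simply do nothing instead of raising midway.
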
From gene record to sequence and metadata:

```python
from src.tools.database.ncbi import (
    download_ncbi_gene_by_id,
    download_ncbi_sequence,
    download_ncbi_metadata,
)

# Step 1: Get comprehensive gene info
download_ncbi_gene_by_id("672", "output/brca1_gene_info.json")

# Step 2: Download protein sequence
download_ncbi_sequence("NP_009225.1", "output/brca1_protein.fasta", db="protein")

# Step 3: Download GenBank metadata
download_ncbi_metadata("NP_009225.1", "output/brca1_metadata.gb", db="protein")
```

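Once a FASTA file has been downloaded in step 2, it usually needs to be read back. The sketch below is a minimal FASTA reader using only the standard library; `read_fasta` is a hypothetical helper, not part of the project, and the demo writes a toy record to a temporary file so that it runs offline. The toy residues are placeholders, not the real NP_009225.1 sequence.

```python
import tempfile
from pathlib import Path
from typing import Dict, Optional


def read_fasta(path: str) -> Dict[str, str]:
    """Map each FASTA header (without '>') to its concatenated sequence."""
    records: Dict[str, str] = {}
    header: Optional[str] = None
    for line in Path(path).read_text().splitlines():
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            header = line[1:]
            records[header] = ""
        elif header is not None:
            records[header] += line
    return records


# Toy record with placeholder residues, written to a temporary directory.
with tempfile.TemporaryDirectory() as tmp:
    fasta_path = Path(tmp) / "toy_protein.fasta"
    fasta_path.write_text(">toy_protein placeholder record\nMDLSAL\nRVEEVQ\n")
    records = read_fasta(str(fasta_path))

for header, sequence in records.items():
    print(header, len(sequence))
```

For anything beyond quick inspection, a dedicated parser such as Biopython's SeqIO is likely a better fit.
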

Example E-utilities query patterns for NCBI Gene:

- insulin[gene name] AND human[organism]
- dystrophin[gene name] AND muscular dystrophy[disease]
- human[organism] AND 17q21[chromosome]
- GO:0006915[biological process] (apoptosis)
- diabetes[phenotype] AND mouse[organism]
- insulin signaling pathway[pathway]

Rate limits: E-utilities accept up to 3 requests per second without an API key and up to 10 requests per second with one.

Authentication: Register for a free NCBI API key at https://www.ncbi.nlm.nih.gov/account/ to increase rate limits.
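
Since NCBI enforces per-second request limits, batch scripts usually space their calls out. The sketch below shows one simple client-side throttle; `Throttle` is a hypothetical class, not part of the project, and the clock and sleep functions are injectable so that the demo runs instantly with a frozen fake clock.

```python
import time
from typing import Callable, List, Optional


class Throttle:
    """Enforce a minimum interval between consecutive requests."""

    def __init__(
        self,
        max_per_second: float,
        clock: Callable[[], float] = time.monotonic,
        sleep: Callable[[float], None] = time.sleep,
    ) -> None:
        self.min_interval = 1.0 / max_per_second
        self._clock = clock
        self._sleep = sleep
        self._last: Optional[float] = None

    def wait(self) -> float:
        """Block until the next request is allowed; return the seconds slept."""
        now = self._clock()
        delay = 0.0
        if self._last is not None:
            delay = max(0.0, self._last + self.min_interval - now)
            if delay > 0:
                self._sleep(delay)
        # Record the time at which this request is considered sent.
        self._last = now + delay
        return delay


# Demo with a frozen fake clock, as if three calls arrived at the same instant.
slept: List[float] = []
throttle = Throttle(max_per_second=4, clock=lambda: 0.0, sleep=slept.append)
delays = [throttle.wait() for _ in range(3)]
print(delays)  # [0.0, 0.25, 0.5]
```

In real use, construct the throttle with whichever limit applies to your key status and call `wait()` immediately before each request.
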

NCBI Gene data can be retrieved in multiple formats:

| Format | Use case |
|--------|----------|
| JSON | Modern applications, programmatic processing |
| XML | Legacy systems, detailed metadata |
| GenBank | Sequence data with annotations |
| FASTA | Sequence analysis workflows |

- fetch_gene_data.py (NCBI Datasets API): fetch_gene_by_id(), fetch_gene_by_symbol(), fetch_multiple_genes()
- query_gene.py (E-utilities): esearch(), esummary(), efetch(), search_and_summarize()
- batch_gene_lookup.py (batch): batch_esearch(), batch_esummary(), batch_lookup_by_ids(), batch_lookup_by_symbols()
- references/api_reference.md: E-utilities and Datasets API documentation, endpoints, parameters, response formats
- references/common_workflows.md: additional examples and use case patterns