skills/genomics-bioinformatics/ensembl-database/SKILL.md
Ensembl REST API for gene/transcript/variant annotations in 300+ species. Gene info by symbol/ID, sequence, cross-refs (HGNC, RefSeq, UniProt), regulatory features. For bulk local use pyensembl; for pathways use kegg-database.
Ensembl is a comprehensive genome annotation database covering 300+ vertebrate and non-vertebrate species. The Ensembl REST API provides programmatic access to gene models, transcript/protein sequences, variant annotations, cross-references, regulatory features, and comparative genomics without requiring any login or API key.
For bulk local queries, use pyensembl instead.
For pathway annotations, use kegg-database or reactome-database instead.
Requires the requests package: pip install requests
Use expand=1 and batch endpoints to minimize calls.
import requests
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
def ensembl_get(endpoint, params=None):
    r = requests.get(f"{BASE}{endpoint}", headers=HEADERS, params=params)
    r.raise_for_status()
    return r.json()
# Look up human BRCA1
gene = ensembl_get("/lookup/symbol/homo_sapiens/BRCA1", params={"expand": 1})
print(f"ID: {gene['id']}, Chr: {gene['seq_region_name']}:{gene['start']}-{gene['end']}")
print(f"Transcripts: {len(gene.get('Transcript', []))}")
Retrieve gene metadata from a gene symbol or Ensembl stable ID.
import requests
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
# By gene symbol
r = requests.get(
f"{BASE}/lookup/symbol/homo_sapiens/TP53",
headers=HEADERS,
params={"expand": 1}
)
gene = r.json()
print(f"Ensembl ID : {gene['id']}")
print(f"Location : {gene['seq_region_name']}:{gene['start']}-{gene['end']} ({gene['strand']})")
print(f"Biotype : {gene['biotype']}")
print(f"Transcripts: {len(gene.get('Transcript', []))}")
# By stable ID (works for genes, transcripts, proteins)
r = requests.get(
f"{BASE}/lookup/id/ENSG00000141510",
headers=HEADERS,
params={"expand": 0}
)
obj = r.json()
print(f"Symbol: {obj.get('display_name')}, Species: {obj.get('species')}")
Retrieve information for multiple IDs in one call (POST endpoint).
import requests, json
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
# Batch lookup by symbols
symbols = ["BRCA1", "BRCA2", "TP53", "EGFR", "MYC"]
r = requests.post(
f"{BASE}/lookup/symbol/homo_sapiens",
headers=HEADERS,
data=json.dumps({"symbols": symbols})
)
results = r.json()
for sym, data in results.items():
    if data:
        print(f"{sym}: {data['id']} ({data['seq_region_name']}:{data['start']}-{data['end']})")
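A symbol that Ensembl cannot resolve may be absent from the batch result or map to an empty value, so it helps to track which inputs came back. A minimal sketch, assuming the response has already been parsed into a dict keyed by symbol; `split_found` is a hypothetical helper name, and the records below are placeholders rather than real API output:

```python
def split_found(symbols, results):
    """Split requested symbols into resolved records and unresolved names."""
    found, missing = {}, []
    for sym in symbols:
        record = results.get(sym)
        if record:
            found[sym] = record
        else:
            # Covers both a missing key and an empty/None value
            missing.append(sym)
    return found, missing


# Placeholder records for illustration only, not real API output
example = {"GENE_A": {"id": "ID_A"}, "GENE_B": None}
found, missing = split_found(["GENE_A", "GENE_B", "GENE_C"], example)
print(sorted(found), missing)
```

Reporting the `missing` list makes it easier to spot typos or outdated symbols before downstream analysis.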
Fetch genomic, cDNA, CDS, or protein sequences.
import requests
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "text/plain"}
# Protein sequence for canonical transcript
r = requests.get(
f"{BASE}/sequence/id/ENST00000269305",
headers=HEADERS,
params={"type": "protein"}
)
seq = r.text
print(f"Protein sequence ({len(seq)} aa): {seq[:60]}...")
# Genomic region sequence
HEADERS_JSON = {"Content-Type": "application/json"}
r = requests.get(
f"{BASE}/sequence/region/human/17:43044295..43125364",
headers=HEADERS_JSON,
params={"coord_system_version": "GRCh38"}
)
result = r.json()
print(f"Retrieved {len(result['seq'])} bp of genomic sequence")
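Sequences retrieved as plain text or JSON are often written to FASTA for downstream tools. A minimal sketch of that formatting step, assuming the sequence is already in memory as a string; `to_fasta` is a hypothetical helper and the 60-column width is a common convention rather than a requirement:

```python
def to_fasta(header, seq, width=60):
    """Format a sequence string as a single FASTA record."""
    if width < 1:
        raise ValueError("width must be at least 1")
    # Drop any whitespace or newlines left over from a text response
    seq = "".join(seq.split())
    lines = [seq[i:i + width] for i in range(0, len(seq), width)]
    return f">{header}\n" + "\n".join(lines) + "\n"


# Short made-up sequence, used only to show the wrapping behaviour
record = to_fasta("example_id", "ACGTACGTAC", width=4)
print(record, end="")
```

The header is passed in explicitly so the caller decides whether to use the stable ID, the gene symbol, or both.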
Map Ensembl IDs to external database identifiers.
import requests
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
# All xrefs for a gene
r = requests.get(
f"{BASE}/xrefs/id/ENSG00000141510",
headers=HEADERS
)
xrefs = r.json()
# Group by database
from collections import defaultdict
by_db = defaultdict(list)
for x in xrefs:
    by_db[x["dbname"]].append(x["primary_id"])
for db in ["HGNC", "RefSeq_gene_name", "Uniprot_gn", "MIM_gene"]:
    if db in by_db:
        print(f"{db}: {by_db[db]}")
Predict functional consequences of variants via REST VEP endpoint.
import requests, json
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
# Annotate a list of hgvs notations
variants = ["17:g.43094692C>T", "13:g.32929387C>T"]
r = requests.post(
f"{BASE}/vep/human/hgvs",
headers=HEADERS,
data=json.dumps({"hgvs_notations": variants})
)
for v in r.json():
    print(f"\nVariant: {v.get('input')}")
    for tc in v.get("transcript_consequences", [])[:2]:
        print(f"  Gene: {tc.get('gene_symbol')}, Impact: {tc.get('impact')}, Consequence: {tc.get('consequence_terms')}")
# Annotate by rsID
r = requests.get(
f"{BASE}/vep/human/id/rs699",
headers=HEADERS
)
v = r.json()[0]
print(f"rsID rs699 in gene: {v['transcript_consequences'][0]['gene_symbol']}")
print(f"Consequence: {v['transcript_consequences'][0]['consequence_terms']}")
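Taking the first entry of `transcript_consequences` is arbitrary, because VEP reports one entry per overlapping transcript. One alternative is to pick the entry with the highest impact. The sketch below assumes the four VEP impact levels (HIGH, MODERATE, LOW, MODIFIER); `most_severe` is a hypothetical helper and the input dicts are placeholders, not real annotations:

```python
# Higher number means more severe; unknown impacts sort last
IMPACT_RANK = {"HIGH": 3, "MODERATE": 2, "LOW": 1, "MODIFIER": 0}


def most_severe(transcript_consequences):
    """Return the consequence dict with the highest impact, or None if empty."""
    return max(
        transcript_consequences,
        key=lambda tc: IMPACT_RANK.get(tc.get("impact"), -1),
        default=None,
    )


# Placeholder consequences for illustration only
example = [
    {"gene_symbol": "GENE_A", "impact": "MODIFIER"},
    {"gene_symbol": "GENE_A", "impact": "MODERATE"},
]
print(most_severe(example))
```

When several entries share the top impact, `max` keeps the first one, so results depend on the order returned by the server.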
Query regulatory build features in a genomic region.
import requests
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
# Regulatory features in BRCA1 region
r = requests.get(
f"{BASE}/overlap/region/human/17:43044000-43126000",
headers=HEADERS,
params={"feature": "regulatory"}
)
features = r.json()
print(f"Found {len(features)} regulatory features")
for f in features[:5]:
    print(f"  {f.get('feature_type')}: {f.get('start')}-{f.get('end')} ({f.get('description', 'n/a')})")
Find orthologs and paralogs across species.
import requests
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
# Get mouse ortholog for human TP53
r = requests.get(
f"{BASE}/homology/symbol/human/TP53",
headers=HEADERS,
params={"target_species": "mus_musculus", "type": "orthologues"}
)
data = r.json()
for homo in data["data"][0]["homologies"][:3]:
    tgt = homo["target"]
    print(f"Mouse ortholog: {tgt['id']} ({tgt.get('perc_id', 'n/a')}% identity)")
Ensembl uses stable IDs with optional version suffixes (e.g., ENSG00000141510.17). Genes (ENSG), transcripts (ENST), proteins (ENSP), and exons (ENSE) each have their own prefix. IDs are preserved across releases when possible; retired IDs can still be resolved via the archive API.
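Because IDs may arrive with or without a version suffix, it can be useful to normalise them before lookups or joins. A minimal sketch, assuming the 11-digit ID layout and only the four feature prefixes named above (other ID families are deliberately rejected); `parse_stable_id` is a hypothetical helper:

```python
import re

# ENS + optional species letters + feature letter + 11 digits + optional .version
STABLE_ID = re.compile(r"^(ENS[A-Z]*?)([GTPE])(\d{11})(?:\.(\d+))?$")
FEATURE_TYPES = {"G": "gene", "T": "transcript", "P": "protein", "E": "exon"}


def parse_stable_id(text):
    """Return (unversioned_id, version_or_None, feature_type)."""
    match = STABLE_ID.match(text.strip())
    if match is None:
        raise ValueError(f"not a recognised Ensembl stable ID: {text!r}")
    prefix, letter, digits, version = match.groups()
    return (
        f"{prefix}{letter}{digits}",
        int(version) if version is not None else None,
        FEATURE_TYPES[letter],
    )


print(parse_stable_id("ENSG00000141510.17"))
print(parse_stable_id("ENST00000269305"))
```

Storing the unversioned ID as the join key and the version as a separate column avoids silent mismatches between datasets annotated against different releases.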
Human genome: GRCh38 (current) and GRCh37 (legacy, via grch37.rest.ensembl.org). Always specify which assembly your coordinates belong to when making region-based queries.
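One way to keep the assembly explicit is to route every region query through a small helper that selects the server from the assembly name. This is a sketch under the assumption that only the two hosts mentioned above are needed; `region_url` and `SERVERS` are hypothetical names:

```python
# Hosts taken from the text above; extend only if you have verified others
SERVERS = {
    "GRCh38": "https://rest.ensembl.org",
    "GRCh37": "https://grch37.rest.ensembl.org",
}


def region_url(assembly, species, chrom, start, end):
    """Build an overlap/region URL on the server matching the assembly."""
    if assembly not in SERVERS:
        raise ValueError(f"unsupported assembly: {assembly!r}")
    return f"{SERVERS[assembly]}/overlap/region/{species}/{chrom}:{start}-{end}"


print(region_url("GRCh38", "human", "17", 43044295, 43125364))
print(region_url("GRCh37", "human", "17", 1, 100))
```

Failing loudly on an unknown assembly name is a deliberate choice here: a typo such as "hg19" raises an error instead of silently querying GRCh38.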
Goal: Retrieve key annotations for a gene list: coordinates, transcripts, and the protein-coding transcript with the longest translation.
import requests, json, time
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
def batch_lookup(symbols, species="homo_sapiens"):
    r = requests.post(
        f"{BASE}/lookup/symbol/{species}",
        headers=HEADERS,
        data=json.dumps({"symbols": symbols, "expand": 1})
    )
    r.raise_for_status()
    return r.json()
def canonical_transcript(gene_data):
    """Return the protein-coding transcript with the longest translation (a proxy for the canonical transcript)."""
    transcripts = gene_data.get("Transcript", [])
    coding = [t for t in transcripts if t.get("biotype") == "protein_coding"]
    if not coding:
        return None
    return max(coding, key=lambda t: t.get("Translation", {}).get("length", 0))
genes = ["BRCA1", "BRCA2", "TP53"]
lookup = batch_lookup(genes)
for sym in genes:
    g = lookup.get(sym)
    if not g:
        print(f"{sym}: not found")
        continue
    canon = canonical_transcript(g)
    print(f"\n{sym} ({g['id']})")
    print(f"  Location: {g['seq_region_name']}:{g['start']}-{g['end']}")
    if canon:
        prot_len = canon.get("Translation", {}).get("length", "n/a")
        print(f"  Canonical transcript: {canon['id']} ({prot_len} aa)")
    time.sleep(0.1)  # be polite
Goal: Annotate a VCF-style variant list with gene, consequence, and impact.
import requests, json, pandas as pd
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
# Input: list of hgvs notations
hgvs_list = [
"17:g.43094692C>T",
"17:g.43063873A>G",
"13:g.32929387C>T",
]
# Annotate one batch of HGVS notations; canonical=1 asks VEP to flag canonical transcripts
def vep_batch(hgvs_batch):
    r = requests.post(
        f"{BASE}/vep/human/hgvs",
        headers=HEADERS,
        params={"canonical": 1},
        data=json.dumps({"hgvs_notations": hgvs_batch})
    )
    r.raise_for_status()
    return r.json()
records = []
for ann in vep_batch(hgvs_list):
    for tc in ann.get("transcript_consequences", []):
        # Keep only the consequence on the canonical transcript
        if tc.get("canonical") == 1:
            records.append({
                "variant": ann["input"],
                "gene": tc.get("gene_symbol"),
                "consequence": ",".join(tc.get("consequence_terms", [])),
                "impact": tc.get("impact"),
                "biotype": tc.get("biotype"),
            })
df = pd.DataFrame(records)
print(df.to_string(index=False))
df.to_csv("vep_results.csv", index=False)
print(f"\nSaved {len(df)} variant annotations → vep_results.csv")
| Parameter | Module | Default | Range / Options | Effect |
|-----------|--------|---------|-----------------|--------|
| expand | Lookup | 0 | 0 or 1 | Include nested transcripts/translations |
| type | Sequence | "genomic" | "genomic", "cdna", "cds", "protein" | Sequence type to return |
| target_species | Homology | None | Species name or taxon ID | Filter homologs to target species |
| feature | Overlap | required | "gene", "transcript", "regulatory", "variation" | Feature type to retrieve |
| coord_system_version | Region | "GRCh38" | "GRCh38", "GRCh37" | Genome assembly |
| content_type | All | via header | "application/json", "text/plain" | Response format |
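Use batch endpoints: POST /lookup/symbol/{species} accepts up to 1000 symbols per request, and POST /vep/human/hgvs also takes a list of notations (the VEP endpoints may enforce a smaller per-request maximum, so check the endpoint documentation); single-ID GET requests in a loop will hit rate limits quickly.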
Pin assembly version: For region-based queries always specify coord_system_version=GRCh38 (or use grch37.rest.ensembl.org for legacy coordinates) to avoid silent mismatch errors.
Cache responses: Gene metadata rarely changes between Ensembl releases; cache results to disk (joblib.Memory) to avoid redundant API calls during development.
from joblib import Memory
mem = Memory("cache/", verbose=0)
cached_lookup = mem.cache(batch_lookup)
Use expand=0 for metadata: When you only need gene coordinates and biotype (not transcript details), keep expand=0 for smaller payloads and faster responses.
Check canonical flag in VEP: VEP returns consequences for all overlapping transcripts; request the flag with canonical=1 and filter on tc.get("canonical") == 1 to keep the consequence on the canonical transcript for each variant.
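Batch endpoints cap the number of items per request, so longer lists need to be split before posting. A minimal sketch of that splitting step; `chunked` is a hypothetical helper, and the batch size should be set to whatever limit the target endpoint documents:

```python
def chunked(items, size):
    """Yield consecutive slices of `items` with at most `size` elements each."""
    if size < 1:
        raise ValueError("size must be at least 1")
    for start in range(0, len(items), size):
        yield items[start:start + size]


symbols = ["EGFR", "KRAS", "BRAF", "PIK3CA", "PTEN"]
for batch in chunked(symbols, 2):
    # Each batch would be sent as one POST request
    print(batch)
```

Merging the per-batch results back into one dict (for example with `dict.update`) keeps the rest of the workflow unchanged.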
When to use: Build a lookup table from gene symbols to Ensembl IDs for downstream analysis.
import requests, json, pandas as pd
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
symbols = ["EGFR", "KRAS", "BRAF", "PIK3CA", "PTEN", "AKT1", "MYC", "RB1"]
r = requests.post(
f"{BASE}/lookup/symbol/homo_sapiens",
headers=HEADERS,
data=json.dumps({"symbols": symbols})
)
data = r.json()
rows = [{"symbol": s, "ensembl_id": d["id"] if d else None,
"chrom": d["seq_region_name"] if d else None} for s, d in data.items()]
df = pd.DataFrame(rows)
df.to_csv("symbol_to_ensembl.csv", index=False)
print(df.to_string(index=False))
When to use: Find all genes overlapping a genomic interval (e.g., a GWAS locus).
import requests, pandas as pd
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
chrom, start, end = "17", 43044295, 43125364
r = requests.get(
f"{BASE}/overlap/region/human/{chrom}:{start}-{end}",
headers=HEADERS,
params={"feature": "gene", "biotype": "protein_coding"}
)
genes = r.json()
df = pd.DataFrame([{
"id": g["id"], "name": g.get("external_name"),
"start": g["start"], "end": g["end"], "strand": g["strand"]
} for g in genes])
print(df.to_string(index=False))
print(f"\n{len(df)} protein-coding genes in region")
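Region endpoints limit how large an interval a single request may cover, so a wide locus may need to be tiled into smaller windows. The sketch below assumes a 5 Mb default, which is an assumption to verify against the endpoint documentation; `split_region` is a hypothetical helper:

```python
def split_region(chrom, start, end, max_len=5_000_000):
    """Tile an inclusive interval into 'chrom:start-end' windows of at most max_len bp."""
    if start > end:
        raise ValueError("start must not exceed end")
    if max_len < 1:
        raise ValueError("max_len must be at least 1")
    regions = []
    pos = start
    while pos <= end:
        stop = min(pos + max_len - 1, end)
        regions.append(f"{chrom}:{pos}-{stop}")
        pos = stop + 1
    return regions


print(split_region("17", 1, 10, max_len=4))
```

A feature that spans a window boundary will be returned by both neighbouring queries, so de-duplicate on the feature ID after combining results.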
When to use: Check which species are available in Ensembl before querying.
import requests
BASE = "https://rest.ensembl.org"
HEADERS = {"Content-Type": "application/json"}
r = requests.get(f"{BASE}/info/species", headers=HEADERS)
species_list = r.json()["species"]
print(f"Total species: {len(species_list)}")
vertebrates = [s for s in species_list if s.get("division") == "EnsemblVertebrates"]
print(f"Vertebrates: {len(vertebrates)}")
for s in vertebrates[:5]:
    print(f"  {s['common_name']} ({s['name']}): {s['assembly']}")
| Problem | Cause | Solution |
|---------|-------|----------|
| HTTP 429 Too Many Requests | Exceeding ~15 req/s rate limit | Add time.sleep(0.1) between requests; use batch POST endpoints |
| HTTP 400 Bad Request on VEP | Malformed HGVS notation | Verify format: chr:g.posREF>ALT (e.g., 17:g.43094692C>T) |
| Gene not found | Gene symbol not in Ensembl | Try alternative symbol; check species name (use homo_sapiens not human for symbols) |
| Region query returns wrong genes | Assembly mismatch | Set coord_system_version=GRCh38 or use grch37.rest.ensembl.org |
| Old ID not resolving | Retired Ensembl ID | Query GET /archive/id/{id} to get current mapping |
| HTTP 503 Service Unavailable | Server maintenance | Retry after a few minutes; check Ensembl status at status.ensembl.org |
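For HTTP 429 in particular, a retry loop that waits before resending is a common remedy. This sketch assumes the server sends a Retry-After header with the number of seconds to wait and falls back to exponential backoff when it does not; `get_with_retry` is a hypothetical helper that takes any zero-argument callable returning an object with `status_code` and `headers`, such as a `requests` response:

```python
import time


def get_with_retry(send, max_retries=3, sleep=time.sleep):
    """Call send() until it returns a non-429 response or retries run out."""
    for attempt in range(max_retries + 1):
        response = send()
        if response.status_code != 429 or attempt == max_retries:
            return response
        # Prefer the server's hint; otherwise back off 1 s, 2 s, 4 s, ...
        try:
            delay = float(response.headers.get("Retry-After", ""))
        except ValueError:
            delay = 2 ** attempt
        sleep(delay)
```

With `requests`, the callable can be a lambda such as `lambda: requests.get(url, headers=HEADERS)`. The `sleep` argument is injectable so the function can be exercised without real waiting.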
gget-genomic-databases: CLI/Python wrapper covering Ensembl + 20 other databases; use for quick lookups without raw API code
biopython-molecular-biology: Biopython's Entrez module for NCBI databases (alternative for RefSeq/GenBank queries)
kegg-database: Pathway/metabolic annotations for the same gene set
reactome-database: Pathway enrichment and hierarchy queries