database-access/biomart-queries/SKILL.md
Bulk-query Ensembl BioMart (and other BioMart instances) for cross-database ID mapping, gene/transcript/exon coordinates, and ortholog tables. Use when batch-converting Ensembl IDs to other namespaces (HGNC, RefSeq, UniProt, Entrez), pulling gene coordinate tables for thousands of genes, building ortholog wide-tables across species, or replacing slow Ensembl REST loops with one-shot bulk export. Encodes BioMart's XML query format, R biomaRt vs Python pybiomart trade-off, mart-vs-dataset hierarchy, and the URL endpoint that's BioMart-specific (separate from rest.ensembl.org).
Reference examples tested with: pybiomart 0.9+, R biomaRt 2.58+ (Bioconductor); Ensembl BioMart (release 110+)
Before using code patterns, verify installed versions match. If versions differ:
- Python: pip show pybiomart
- R: packageVersion('biomaRt')

The BioMart XML query format is stable across Ensembl releases, but the underlying mart names and attribute IDs can change between releases. For published work, pin the Ensembl release via useEnsembl(version=110).
"Bulk-convert IDs / pull coordinate tables / extract ortholog wide tables" -> BioMart is the right answer for any Ensembl-rooted query producing >5,000 rows. It is a separate service from the Ensembl REST API, with separate rate behavior and a different query model (XML-based, batch-oriented). For one-off lookups (<100 records), Ensembl REST is more convenient; for bulk anything, BioMart wins.
The single most important fact: BioMart returns a flat table from a single query. There is no per-record loop, no rate-limit cascade, no async polling. One XML query in; one TSV out.
- pybiomart (https://github.com/jrderuiter/pybiomart) is the lightest client
- biomaRt Bioconductor (Durinck et al. 2009 Nat Protoc 4:1184) is the canonical client
- curl against the XML endpoint works but is rarely used directly
- https://www.ensembl.org/biomart/martview for interactive query design

pip install pybiomart pandas
# R:
# BiocManager::install('biomaRt')
| Level | Examples |
|---|---|
| Mart | ENSEMBL_MART_ENSEMBL (genes), ENSEMBL_MART_SNP (variants), ENSEMBL_MART_MOUSE (mouse-specific) |
| Dataset | hsapiens_gene_ensembl, mmusculus_gene_ensembl, etc. (per species) |
| Attribute | Fields to return: ensembl_gene_id, external_gene_name, chromosome_name, etc. |
| Filter | Constraints on the query: chromosome_name = 17, biotype = protein_coding, etc. |
A query is: pick a mart, pick a dataset, list attributes to return, list filters to constrain. BioMart returns a single TSV.
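The mart/dataset/attribute/filter structure maps directly onto the XML document that BioMart accepts. The following is a minimal sketch that only builds the query text and the request URL, without sending anything. The helper name build_biomart_xml is hypothetical, and the Query attribute values (header, uniqueRows, datasetConfigVersion) are common defaults assumed here rather than settings taken from this skill.

```python
import xml.etree.ElementTree as ET
from urllib.parse import urlencode

# BioMart endpoint (separate from rest.ensembl.org).
MARTSERVICE_URL = 'https://www.ensembl.org/biomart/martservice'


def build_biomart_xml(dataset, attributes, filters=None):
    """Build a BioMart XML query string (hypothetical helper, not a library API)."""
    query = ET.Element('Query', {
        'virtualSchemaName': 'default',
        'formatter': 'TSV',             # one flat TSV back
        'header': '1',                  # assumed: include a header row
        'uniqueRows': '1',              # assumed: drop duplicate rows
        'datasetConfigVersion': '0.6',
    })
    ds = ET.SubElement(query, 'Dataset', {'name': dataset, 'interface': 'default'})
    for name, value in (filters or {}).items():
        # List-valued filters are sent as one comma-separated value.
        if isinstance(value, (list, tuple)):
            value = ','.join(str(v) for v in value)
        ET.SubElement(ds, 'Filter', {'name': name, 'value': str(value)})
    for name in attributes:
        ET.SubElement(ds, 'Attribute', {'name': name})
    body = ET.tostring(query, encoding='unicode')
    return '<?xml version="1.0" encoding="UTF-8"?><!DOCTYPE Query>' + body


xml_query = build_biomart_xml(
    'hsapiens_gene_ensembl',
    attributes=['ensembl_gene_id', 'external_gene_name'],
    filters={'chromosome_name': '17', 'biotype': 'protein_coding'},
)
# The XML goes in the `query` parameter of the martservice endpoint.
request_url = MARTSERVICE_URL + '?' + urlencode({'query': xml_query})
print(xml_query)
```

pybiomart and biomaRt generate an equivalent document internally, so writing the XML by hand is mainly useful for curl-based exports or for debugging what a client actually sent.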
Discovery:
from pybiomart import Server
server = Server(host='http://www.ensembl.org')
print(server.marts) # list marts
mart = server['ENSEMBL_MART_ENSEMBL']
print(mart.datasets) # list datasets (species)
ds = mart['hsapiens_gene_ensembl']
print(ds.attributes) # list attributes
print(ds.filters) # list filters
| Question | BioMart | Ensembl REST |
|---|---|---|
| Bulk ID mapping (>5000 IDs) | yes (1 query) | rate-limited cascade |
| Single-gene lookup | overkill | yes |
| Coordinate tables for thousands of genes | yes | rate-limited |
| Ortholog wide-table across species | yes (multi-species mart) | per-gene loop |
| VEP variant annotation | no | yes (or local VEP) |
| Sequence retrieval | partial | yes |
| Real-time | no (batch) | yes (per-record) |
| Reproducibility (version pin) | useEnsembl(version=110) | archive URL e110.rest.ensembl.org |
For >5K rows, BioMart is the right tool. For real-time per-record lookups, REST.
| Attribute | Returns |
|---|---|
| ensembl_gene_id | Stable Ensembl Gene ID |
| ensembl_gene_id_version | With .N version suffix |
| external_gene_name | HGNC symbol (or species-equivalent) |
| hgnc_id, hgnc_symbol | HGNC permanent ID and symbol |
| entrezgene_id | NCBI Gene ID |
| refseq_mrna, refseq_peptide | RefSeq accessions |
| uniprotswissprot, uniprotsptrembl | UniProt accessions |
| chromosome_name, start_position, end_position, strand | Gene coordinates |
| transcript_count, exon_count | Counts |
| gene_biotype | protein_coding, lncRNA, miRNA, etc. (the matching filter is named biotype) |
| description | Free-text gene description |
| go_id, name_1006, namespace_1003 | GO term ID, name, namespace |
| Filter | Constraint |
|---|---|
| ensembl_gene_id | List of Gene IDs |
| external_gene_name | List of symbols |
| entrezgene_id | List of NCBI Gene IDs |
| chromosome_name | One or more chromosomes |
| start / end | Coordinate range |
| biotype | One or more biotypes |
| with_<source> | Boolean: has cross-ref to <source> (e.g. with_hpa = has Human Protein Atlas) |
Goal: Convert 5,000 Ensembl Gene IDs to HGNC symbols, RefSeq mRNA accessions, and UniProt accessions in one query.
Approach: One pybiomart query with the ID list as a filter and one attribute per target namespace; returns one table.
Reference (pybiomart 0.9+, Ensembl release 110+):
from pybiomart import Server
import pandas as pd
server = Server(host='http://www.ensembl.org')
mart = server['ENSEMBL_MART_ENSEMBL']
ds = mart['hsapiens_gene_ensembl']
ensembl_ids = ['ENSG00000139618', 'ENSG00000141510', 'ENSG00000171862'] # ...up to 5K+
df = ds.query(
    attributes=['ensembl_gene_id', 'external_gene_name', 'hgnc_id',
                'refseq_mrna', 'uniprotswissprot'],
    filters={'ensembl_gene_id': ensembl_ids},
)
print(df.head())
# One row per (gene, cross-ref) pair; genes with multiple RefSeq mRNAs get multiple rows.
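If a very long ID list is rejected by the server (for example because the request becomes too large), one workaround is to split the list and run one query per chunk. This is a minimal sketch under that assumption; chunked and query_in_chunks are hypothetical helpers, and the chunk size of 500 is an arbitrary starting point, not a documented limit.

```python
from typing import Callable, Iterator, List, Sequence


def chunked(ids: Sequence[str], size: int) -> Iterator[List[str]]:
    """Yield consecutive slices of `ids` with at most `size` items each."""
    if size < 1:
        raise ValueError('size must be >= 1')
    for start in range(0, len(ids), size):
        yield list(ids[start:start + size])


def query_in_chunks(run_query: Callable[[List[str]], object],
                    ids: Sequence[str], size: int = 500) -> list:
    """Call `run_query` once per chunk of de-duplicated IDs; return all results."""
    unique_ids = list(dict.fromkeys(ids))  # drop duplicates, keep input order
    return [run_query(chunk) for chunk in chunked(unique_ids, size)]


# Usage with the dataset object from the example above (needs network access):
# frames = query_in_chunks(
#     lambda chunk: ds.query(
#         attributes=['ensembl_gene_id', 'external_gene_name', 'hgnc_id'],
#         filters={'ensembl_gene_id': chunk}),
#     ensembl_ids)
# df = pd.concat(frames, ignore_index=True)
```

Passing the query as a callable keeps the chunking logic independent of the client, so the same helper works with pybiomart or with a hand-built XML request.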
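# Coordinate table: all protein-coding genes on chromosome 17.
# The attribute is gene_biotype; the filter is named biotype.
df = ds.query(
    attributes=['ensembl_gene_id', 'external_gene_name', 'chromosome_name',
                'start_position', 'end_position', 'strand', 'gene_biotype'],
    filters={'chromosome_name': '17', 'biotype': 'protein_coding'},
)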
print(f'{len(df)} protein-coding genes on chr17')
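To extend the single-chromosome query to the whole genome without one very large request, the same query can be run once per chromosome. A minimal sketch, assuming the primary human chromosome names 1 to 22, X, Y, and MT; query_by_chromosome is a hypothetical helper, and other species or patch/scaffold sequences would need a different list.

```python
from typing import Callable, Dict, Iterable

# Assumed primary-assembly chromosome names for human.
HUMAN_CHROMOSOMES = [str(n) for n in range(1, 23)] + ['X', 'Y', 'MT']


def query_by_chromosome(run_query: Callable[[str], object],
                        chromosomes: Iterable[str] = HUMAN_CHROMOSOMES) -> Dict[str, object]:
    """Run `run_query` once per chromosome and collect the results by name."""
    return {chrom: run_query(chrom) for chrom in chromosomes}


# Usage with the dataset object from the example above (needs network access):
# per_chrom = query_by_chromosome(
#     lambda chrom: ds.query(
#         attributes=['ensembl_gene_id', 'chromosome_name',
#                     'start_position', 'end_position', 'strand'],
#         filters={'chromosome_name': chrom, 'biotype': 'protein_coding'}))
# df_all = pd.concat(per_chrom.values(), ignore_index=True)
```

Keeping results keyed by chromosome makes it easy to re-run only the chromosome that timed out.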
Goal: One TSV with human Ensembl ID, mouse ortholog Ensembl ID, zebrafish ortholog Ensembl ID per row.
Approach: Request the mouse and zebrafish homolog attributes from the human dataset; a single query returns orthologs for both species.
df = ds.query(
    attributes=['ensembl_gene_id', 'external_gene_name',
                'mmusculus_homolog_ensembl_gene', 'mmusculus_homolog_orthology_type',
                'drerio_homolog_ensembl_gene', 'drerio_homolog_orthology_type'],
    filters={'chromosome_name': '17'},
)
# pybiomart columns use the mart display names, which can vary across releases.
# Resolve column names defensively rather than hardcoding strings:
mouse_type_col = next(c for c in df.columns if 'Mouse' in c and 'type' in c)
zebra_type_col = next(c for c in df.columns if 'Zebrafish' in c and 'type' in c)
df_one2one = df[(df[mouse_type_col] == 'ortholog_one2one') &
                (df[zebra_type_col] == 'ortholog_one2one')]
print(f'{len(df_one2one)} 1:1 orthologs across all three species on chr17')
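The defensive column lookup above stops with a bare StopIteration when nothing matches and silently takes the first hit when several columns match. A slightly stricter variant is sketched below; find_column is a hypothetical helper, and the example display names are assumptions about what a release might return, not values taken from a live query.

```python
from typing import Iterable


def find_column(columns: Iterable[str], *keywords: str) -> str:
    """Return the single column whose name contains every keyword (case-insensitive)."""
    matches = [c for c in columns
               if all(k.lower() in c.lower() for k in keywords)]
    if len(matches) != 1:
        raise KeyError(f'expected exactly one column matching {keywords}, got {matches}')
    return matches[0]


# Assumed display names, for illustration only.
example_columns = [
    'Gene stable ID', 'Gene name',
    'Mouse gene stable ID', 'Mouse homology type',
    'Zebrafish gene stable ID', 'Zebrafish homology type',
]
mouse_type = find_column(example_columns, 'mouse', 'type')
zebra_type = find_column(example_columns, 'zebrafish', 'type')
print(mouse_type, '|', zebra_type)
```

Failing loudly on zero or multiple matches surfaces a renamed column at lookup time instead of producing a wrong filter later.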
df = ds.query(
    attributes=['ensembl_gene_id', 'external_gene_name',
                'go_id', 'name_1006', 'namespace_1003'],
    filters={'external_gene_name': ['TP53', 'BRCA1', 'MYC', 'EGFR']},
)
# Long format: one row per (gene, GO term) pair
# Reference: Bioconductor biomaRt 2.58+ | Verify API if version differs
library(biomaRt)
# Pin to release 110 for reproducibility
ensembl <- useEnsembl(biomart='genes', dataset='hsapiens_gene_ensembl', version=110)
# Or via host URL (for older or specific assemblies)
# ensembl <- useMart('ENSEMBL_MART_ENSEMBL',
# dataset='hsapiens_gene_ensembl',
# host='https://nov2020.archive.ensembl.org')
df <- getBM(
  attributes = c('ensembl_gene_id', 'external_gene_name', 'entrezgene_id',
                 'uniprotswissprot', 'refseq_mrna'),
  filters = 'ensembl_gene_id',
  values = c('ENSG00000139618', 'ENSG00000141510'),
  mart = ensembl
)
head(df)
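pybiomart has no version= argument comparable to useEnsembl, so a Python equivalent of the pinning shown in the R example is to point the Server at an archive host. The sketch below assumes that release 110 is served from the jul2023 archive and release 102 from the nov2020 archive; verify both against the Ensembl archive list before relying on them. ARCHIVE_HOSTS and archive_host are hypothetical names.

```python
# Assumed release -> archive host mapping; check the Ensembl archive list.
ARCHIVE_HOSTS = {
    102: 'http://nov2020.archive.ensembl.org',
    110: 'http://jul2023.archive.ensembl.org',
}


def archive_host(release: int) -> str:
    """Return the archive host for a pinned Ensembl release."""
    try:
        return ARCHIVE_HOSTS[release]
    except KeyError:
        raise ValueError(f'no archive host recorded for release {release}') from None


pinned_host = archive_host(110)
print(pinned_host)

# Usage (needs network access):
# from pybiomart import Server
# server = Server(host=pinned_host)
# ds = server['ENSEMBL_MART_ENSEMBL']['hsapiens_gene_ensembl']
```

Recording the host string alongside the results makes it possible to re-run the same query against the same release later.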
# What attributes are available?
attrs = ds.attributes
ortho_attrs = [a for a in attrs if 'homolog' in a]
print(f'{len(ortho_attrs)} ortholog attributes; first 5: {ortho_attrs[:5]}')
# What filters?
filts = ds.filters
chrom_filts = [f for f in filts if 'chrom' in f]
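Common pitfalls:
- Unpinned release: useMart('ensembl', ...) without version=. Fix: useEnsembl(version=110) or archive host URL.
- Row inflation from many-to-many cross-references: querying ensembl_gene_id, refseq_mrna; a gene with 10 RefSeq mRNAs produces 10 rows. Fix: use the ensembl_canonical filter where available.
- Renamed gene symbols: filters={'external_gene_name': ['MARCH1']} post-2020. Fix: filter by ensembl_gene_id or hgnc_id; these are stable.
- Oversized ortholog queries: mmusculus_homolog_ensembl_gene for 30K human genes. Fix: chunk by chromosome.
- Per-gene REST loops: repeated /lookup/symbol calls. Fix: one BioMart query.
- Wrong mart: querying ENSEMBL_MART_SNP. Fix: list server.marts; pick ENSEMBL_MART_ENSEMBL for genes.

| Error / symptom | Cause | Solution |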
|---|---|---|
| Empty result | Wrong attribute / filter name | List with ds.attributes and ds.filters |
| Timeout on big query | No filter, too many rows | Chunk by chromosome |
| Drift between re-runs | No version pinning | useEnsembl(version=110) |
| Row count > expected | Many-to-many cross-ref joins | Filter to canonical isoform |
| Symbol filter returns nothing | HGNC rename | Filter by Ensembl ID or HGNC ID |
| Slow on ortholog wide-table | Multi-species join expensive | Chunk by chromosome |
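When the canonical-isoform filter is not an option, another way to handle a higher-than-expected row count is to collapse the many-to-many rows into one row per gene after the export. The sketch below works on the TSV text BioMart returns; collapse_crossrefs is a hypothetical helper, and the column names and IDs in the example are placeholders, not real accessions.

```python
import csv
import io
from typing import Dict


def collapse_crossrefs(tsv_text: str, key_col: str, value_col: str,
                       sep: str = ';') -> Dict[str, str]:
    """Group a long-format TSV by `key_col`, joining distinct `value_col` entries."""
    reader = csv.DictReader(io.StringIO(tsv_text), delimiter='\t')
    grouped: Dict[str, list] = {}
    for row in reader:
        values = grouped.setdefault(row[key_col], [])
        value = row[value_col]
        # Skip empty cells and repeated cross-references.
        if value and value not in values:
            values.append(value)
    return {key: sep.join(vals) for key, vals in grouped.items()}


# Placeholder data: GENE_A has two cross-references, GENE_B has none.
example_tsv = (
    'Gene stable ID\tRefSeq mRNA ID\n'
    'GENE_A\tNM_A1\n'
    'GENE_A\tNM_A2\n'
    'GENE_A\tNM_A1\n'
    'GENE_B\t\n'
)
collapsed = collapse_crossrefs(example_tsv, 'Gene stable ID', 'RefSeq mRNA ID')
print(collapsed)
```

With a pandas DataFrame, a groupby on the gene ID column followed by a join over the cross-reference column gives the same one-row-per-gene shape.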