Version Compatibility

Reference examples tested with: biomaRt 2.58+, AnnotationDbi 1.66+, org.Hs.eg.db 3.18+, org.Mm.eg.db 3.18+, GenomicFeatures 1.54+, mygene 1.38+ (Python), pyensembl 2.3+, pandas 2.2+, rtracklayer 1.62+

Before using code patterns, verify installed versions match. If versions differ:

R: packageVersion('<pkg>') then ?function_name to verify parameters
Python: pip show <package> then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Gene ID Mapping

"Convert gene IDs from X to Y" -> Query the appropriate annotation source (local org.db for speed, biomaRt for Ensembl-specific attributes, mygene for cross-database aliases, Ensembl REST for low-level access), with version pinning for reproducibility and explicit handling of one-to-many mappings, withdrawn symbols, and species-specific naming.

The Single Most Important Modern Insight -- Excel autocorrect renamed real genes

Ziemann, Eren, El-Osta 2016 Genome Biol 17:177 scanned 18 leading genomics journals and found ~20% of papers with Excel-attached supplementary gene lists had silently mangled symbols (SEPT2 -> 2-Sep, MARCH1 -> 1-Mar, ...). Five years later the problem persisted. HGNC's response (Bruford, Braschi, Denny, Jones, Seal, Tweedie 2020 Nat Genet 52:754) was to rename the affected genes:

| Old | New | Affected | |-----|-----|----------| | SEPT# | SEPTIN# | SEPT1 - SEPT14 -> SEPTIN1 - SEPTIN14 | | MARCH# | MARCHF# | MARCH1 - MARCH11 -> MARCHF1 - MARCHF11 | | MARC# | MTARC# | MARC1, MARC2 -> MTARC1, MTARC2 | | DEC1 | DELEC1 | DEC1 -> DELEC1 |

Code that hard-codes old symbols silently drops these genes when joined against post-2020 annotations. Detection on import: if a gene column contains ^\d{1,2}-(Jan|Feb|Mar|...|Dec)$ patterns, the file was Excel-corrupted. Always read.csv(colClasses=c(gene='character')) (R) or pd.read_csv(dtype={'gene': str}) (Python) -- but the damage is at Excel-save time, not import time.

Two related insights that determine half the practical work:

Ensembl version suffixes matter sometimes and not others. ENSG00000123456.7 is release-specific; the unversioned ENSG00000123456 is the stable cross-release ID. STRIP for cross-release joins, MSigDB lookups, gene-set databases. KEEP for intra-release reproducibility and clinical reports. CRITICAL: the naive sub('\\..*', '', x) regex ALSO strips the GENCODE _PAR_Y suffix in releases 25-43, collapsing chrY PAR duplicates onto their chrX counterparts. Use sub('\\.[0-9]+(_PAR_Y)?$', '\\1', x).
Never use HGNC symbols as the primary computational key. Symbols change. Use Ensembl or Entrez as keys; carry symbols only as display labels in the final results table.

Algorithmic Taxonomy

| Tool | Source | Speed | Strength | Use for | |------|--------|-------|----------|---------| | AnnotationDbi + org.Hs.eg.db / org.Mm.eg.db | NCBI Gene snapshot, pinned at Bioc install | Fast, local | Stable, version-pinned | Default for Ensembl <-> Entrez <-> Symbol within Bioconductor | | biomaRt | Ensembl BioMart over HTTP | Slow for >5k queries; timeouts | Ensembl-specific attributes (biotype, transcript versions, paralogs, orthologs) | Need Ensembl-specific fields; archive endpoints for release pinning | | mygene.info / mygene (Python) | REST API to a curated meta-database | Server-side batching of 1000 IDs | Best for symbol/alias/prev_symbol resolution | Cross-database; HGNC withdrawn symbol resolution; non-R environments | | Ensembl REST | Direct REST API to Ensembl | Rate-limited (15 req/sec) | Low-level access to variant consequence, sequence, etc. | Specialized queries not covered by biomaRt | | pyensembl | Local Ensembl database (Python) | Fast, local, version-pinned | Reproducible offline; gene objects with transcript and exon access | Python pipelines needing rich annotation | | HGNC API direct | https://rest.genenames.org | REST | Authoritative source for HGNC | Symbol provenance, prev/alias detection |

Decision Tree by Scenario

| Scenario | Recommended approach | Why | |----------|---------------------|-----| | R Bioconductor pipeline, Ensembl <-> Entrez <-> Symbol | AnnotationDbi::mapIds(org.Hs.eg.db, ...) | Fastest, version-pinned, stable | | Need Ensembl-only attributes (biotype, paralog, ortholog) | biomaRt::useEnsembl(version=N) | Only biomaRt exposes these | | Cross-database with alias and withdrawn-symbol fallback | mygene querymany(scopes='symbol,alias,prev_symbol') | Designed for this case | | Python pipeline, reproducible | pyensembl with pinned release | Offline, version-locked | | Clinical report needing canonical transcript per gene | MANE Select (Morales 2022 Nature 604:310) | Cross-database consensus (RefSeq + Ensembl) | | Cross-species mouse <-> human | Ensembl Compara getLDS filtered to one2one | Compara has best coverage; one2one most defensible | | Building tx2gene for tximport | GenomicFeatures::makeTxDbFromGFF on the SAME GTF used in quantification | Annotation pinning matters | | Need to reproduce a 2023 analysis exactly | useEnsembl(version=109) (or whichever release was used) | Without version=, biomaRt floats to current release | | GRCh37 (legacy clinical) | useEnsembl(GRCh=37) dedicated permanent endpoint | GRCh37 -> GRCh38 mappings are not 1:1 |

AnnotationDbi + org.db

Goal: Map Ensembl gene IDs to symbols, Entrez IDs, or descriptions using a local Bioconductor annotation package.

Approach: mapIds() with the source keytype and target column; handle one-to-many via multiVals.

library(org.Hs.eg.db)
library(AnnotationDbi)

ensembl_ids <- sub('\\.[0-9]+(_PAR_Y)?$', '\\1', rownames(counts))

symbols  <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                    keytype = 'ENSEMBL', column = 'SYMBOL',
                    multiVals = 'first')
entrez   <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                    keytype = 'ENSEMBL', column = 'ENTREZID',
                    multiVals = 'first')
descrips <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                    keytype = 'ENSEMBL', column = 'GENENAME',
                    multiVals = 'first')

keytypes(org.Hs.eg.db)

multiVals options: 'first' (silent), 'asNA' (NA for ambiguous), 'list' (preserve all). For DE results tables, 'first' is typical but the mapping rate should be reported.

For mouse: org.Mm.eg.db. For other organisms: check Bioconductor AnnotationData -> OrgDb list.

biomaRt with Version Pinning

Goal: Query Ensembl BioMart with the EXACT release version, for reproducibility.

Approach: useEnsembl(version=N) pins; listEnsemblArchives() lists available archives.

library(biomaRt)

ensembl <- useEnsembl(biomart = 'genes',
                       dataset = 'hsapiens_gene_ensembl',
                       version = 110)

ensembl_grch37 <- useEnsembl(biomart = 'genes',
                              dataset = 'hsapiens_gene_ensembl',
                              GRCh = 37)

mapping <- getBM(
    attributes = c('ensembl_gene_id', 'hgnc_symbol', 'entrezgene_id',
                   'gene_biotype', 'description'),
    filters    = 'ensembl_gene_id',
    values     = ensembl_ids,
    mart       = ensembl
)

The filters= argument is PLURAL. The singular filter= may work via R's partial matching but breaks unpredictably if another argument starts with f. Always spell filters= and values= fully.

Multiple filters:

genes_in_region <- getBM(
    attributes = c('ensembl_gene_id', 'hgnc_symbol'),
    filters    = c('chromosome_name', 'start', 'end'),
    values     = list('16', 1100000, 1250000),
    mart       = ensembl
)

Without version=, biomaRt floats to the current release -- a script written in 2023 against Ensembl 109 produces different mappings in 2026 against Ensembl 113. ALWAYS pin for any published analysis. Cache the mapping table alongside the analysis for reproducibility.

listEnsemblArchives() shows the available historical releases.

mygene (Python)

Goal: Map between any identifier systems using the curated MyGene.info meta-database with alias fallback.

Approach: MyGeneInfo().querymany(ids, scopes, fields, species); auto-batches at 1000 IDs server-side.

import mygene

mg = mygene.MyGeneInfo()

results = mg.querymany(['ENSG00000141510', 'ENSG00000012048', 'ENSG00000141736'],
                        scopes='ensembl.gene', fields='symbol,entrezgene,uniprot',
                        species='human')

mapping = {r['query']: r.get('symbol', None) for r in results}

results = mg.querymany(['SEPT1', 'MARCH1', 'OCT4'],
                        scopes='symbol,alias,prev_symbol',
                        fields='symbol,entrezgene,ensembl.gene',
                        species='human')

For paper-derived gene lists where symbols may be old or aliases (OCT4 vs POU5F1, MARCH1 vs MARCHF1, SEPT2 vs SEPTIN2), scopes='symbol,alias,prev_symbol' handles the resolution. The MyGene database aggregates HGNC's prev/alias columns.

OCT4 is the common usage; POU5F1 is the official HGNC symbol; in MSigDB the gene is POU5F1; in a Western blot legend it's "Oct4". For mapping a stem-cell paper to an Ensembl-quantified matrix, scope to aliases.

pyensembl

from pyensembl import EnsemblRelease

ensembl = EnsemblRelease(110, species='human')

gene = ensembl.gene_by_id('ENSG00000141510')
gene.gene_name

gene = ensembl.genes_by_name('TP53')[0]
gene.gene_id

mapping = {}
for eid in ensembl_ids:
    try:
        gene = ensembl.gene_by_id(eid.split('.')[0])
        mapping[eid] = gene.gene_name
    except ValueError:
        mapping[eid] = None

pyensembl downloads and caches the release database on first use; thereafter offline and version-locked.

Apply Mapping to Count Matrix -- Handling One-to-Many

Goal: Convert the gene index of a count matrix to a different ID type, summing reads from multiple source IDs that map to the same target.

Approach: Look up mapping, replace index, aggregate duplicates by SUM (not mean -- counts add).

import pandas as pd
import mygene

def map_count_matrix_ids(counts, from_type='ensembl.gene', to_type='symbol',
                         species='human'):
    '''Map gene IDs in count matrix index, summing reads when multiple source map to one target.'''
    mg = mygene.MyGeneInfo()
    clean = [g.split('.')[0] for g in counts.index]
    results = mg.querymany(clean, scopes=from_type, fields=to_type, species=species)
    mapping = {r['query']: r[to_type] for r in results if to_type in r}
    new_index = [mapping.get(g.split('.')[0], g) for g in counts.index]
    counts_mapped = counts.copy()
    counts_mapped.index = new_index
    counts_mapped = counts_mapped.groupby(counts_mapped.index).sum()
    return counts_mapped

mapped = map_count_matrix_ids(counts, 'ensembl.gene', 'symbol')

Counts ADD when collapsing multiple source genes to one target. Means or medians would be wrong (they understate library size for the merged target).

library(biomaRt)

ensembl <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)

clean <- sub('\\.[0-9]+(_PAR_Y)?$', '\\1', rownames(counts))

mapping <- getBM(
    attributes = c('ensembl_gene_id', 'hgnc_symbol'),
    filters    = 'ensembl_gene_id',
    values     = clean,
    mart       = ensembl
)

counts_df <- as.data.frame(counts)
counts_df$ensembl <- clean
merged <- merge(counts_df, mapping, by.x = 'ensembl', by.y = 'ensembl_gene_id')
counts_by_symbol <- aggregate(. ~ hgnc_symbol,
                              data = merged[, setdiff(colnames(merged), 'ensembl')],
                              FUN = sum)
rownames(counts_by_symbol) <- counts_by_symbol$hgnc_symbol
counts_by_symbol$hgnc_symbol <- NULL

Handle Unmapped IDs

def robust_id_mapping(gene_ids, from_type, to_type, species='human'):
    import mygene
    mg = mygene.MyGeneInfo()
    clean = [g.split('.')[0] for g in gene_ids]
    results = mg.querymany(clean, scopes=from_type, fields=to_type, species=species)
    mapping, unmapped = {}, []
    for r in results:
        original = gene_ids[clean.index(r['query'])]
        if to_type in r:
            mapping[original] = r[to_type]
        else:
            mapping[original] = original
            unmapped.append(original)
    print(f'Mapped: {len(gene_ids) - len(unmapped)}/{len(gene_ids)}')
    return mapping, unmapped

Unmapped fraction is a QC signal:

<5% unmapped: normal (rare genes, recent deprecations)
5-20% unmapped: check Ensembl release alignment; check for HGNC renames
20% unmapped: wrong annotation release, wrong species, or wrong source ID type

MANE Select for Clinical Reporting

Goal: Use the single representative transcript per gene with identical exon/CDS in RefSeq AND Ensembl for clinical variant reporting.

Approach: Download the MANE TSV; join on Ensembl_Gene -> Ensembl_nuc (transcript) and RefSeq_nuc.

Morales J, Pujar S, Loveland JE et al. 2022 Nature 604:310-315 established MANE Select. ~19,000+ protein-coding genes have a single agreed transcript with matched coordinates across RefSeq (NM_xxxxxx) and Ensembl/GENCODE (ENST00000xxxxxxx). MANE Plus Clinical adds extra transcripts at loci where Select misses clinical variants.

For clinical reports with HGVS notation like NM_000546.6:c.215C>G, use the MANE Select RefSeq accession. The MANE TSV (downloadable from NCBI) provides the Ensembl crosswalk.

Cross-Species Orthologs

Goal: Map mouse <-> human (or any pair) for cross-species integration or pathway transfer.

Approach: Ensembl Compara via biomaRt getLDS; filter to orthology type appropriate to use.

library(biomaRt)

human <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)
mouse <- useEnsembl(biomart = 'genes', dataset = 'mmusculus_gene_ensembl', version = 110)

orthologs <- getLDS(
    attributes  = c('hgnc_symbol', 'ensembl_gene_id'),
    filters     = 'ensembl_gene_id',
    values      = human_gene_ids,
    mart        = human,
    attributesL = c('mgi_symbol', 'ensembl_gene_id', 'mmusculus_homolog_orthology_type'),
    martL       = mouse
)

| Strategy | When | Trade-off | |----------|------|-----------| | one2one orthologs only | Cross-species scRNA-seq integration; conservative DE comparison | Loses genes with paralog expansions; lower coverage | | Include one2many | Broader gene coverage needed | Must select within group (highest confidence; highest expression) | | Include many2many | Maximum inclusivity | Introduces ambiguity; use with caution |

The "homology threshold" problem: no automatic threshold reliably separates true orthologs from paralogs across all gene families. For pathway transfer (mouse signature -> human), filter to one2one and accept the coverage loss.

Alternative sources: OMA (Hierarchical Orthologous Groups, cleaner one2one when present, smaller coverage); OrthoDB (hierarchical at multiple taxonomic levels). OrthoFinder for custom genomes.

PAR Gene Complications

Pseudo-autosomal region (PAR) genes exist on both X and Y with identical sequences. In GENCODE 25-43, the chrY copy has a _PAR_Y suffix. In GENCODE 44+ (Ensembl 110+), chrY PAR genes get their own ENSG accessions.

par_genes_human = ['SHOX', 'IL3RA', 'SLC25A6', 'P2RY8', 'AKAP17A', 'ASMT', 'DHRSX']
dup_ids = counts.index[counts.index.duplicated()].unique()
if len(dup_ids) > 0:
    print(f'Duplicate gene entries: {len(dup_ids)}')
    counts = counts.groupby(counts.index).sum()

Reads from PAR regions cannot be unambiguously assigned to X or Y. Some references mask the Y-chromosome PAR to avoid double-counting; verify what the alignment reference does before building the matrix.

Build tx2gene for tximport

Goal: Create the transcript-to-gene mapping needed by tximport for gene-level summarization.

Approach: Build from the SAME GTF used to construct the Salmon/kallisto index, OR pull from biomaRt with version pinning.

library(GenomicFeatures)

txdb <- makeTxDbFromGFF('annotation.gtf.gz')
k <- keys(txdb, keytype = 'TXNAME')
tx2gene <- AnnotationDbi::select(txdb, k, 'GENEID', 'TXNAME')

library(biomaRt)

mart <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)
tx2gene <- getBM(
    attributes = c('ensembl_transcript_id_version', 'ensembl_gene_id_version'),
    mart       = mart
)
colnames(tx2gene) <- c('TXNAME', 'GENEID')

import pandas as pd

def tx2gene_from_gtf(gtf_path):
    records = []
    with open(gtf_path) as f:
        for line in f:
            if line.startswith('#') or '\ttranscript\t' not in line:
                continue
            attrs = line.strip().split('\t')[8]
            gene_id = [a.split('"')[1] for a in attrs.split(';') if 'gene_id' in a][0]
            tx_id   = [a.split('"')[1] for a in attrs.split(';') if 'transcript_id' in a][0]
            records.append({'TXNAME': tx_id, 'GENEID': gene_id})
    return pd.DataFrame(records).drop_duplicates()

CRITICAL: the tx2gene MUST use the same versioning convention as the Salmon/kallisto index. If the index used ENST00000269305.9 and tx2gene has ENST00000269305 (unversioned), tximport drops the transcripts. Mismatched versions silently lose data.

ID Type Reference

| Type | Example | Stability | Use case | |------|---------|-----------|----------| | Ensembl Gene | ENSG00000141510 | Stable across releases; versioned | RNA-seq, GTFs, primary computational key | | Ensembl Transcript | ENST00000269305 | Stable; versioned | Transcript-level analysis | | Entrez Gene | 7157 | Stable; never reused | NCBI databases, KEGG pathways | | HGNC Symbol | TP53 | Changes (see SEPT/MARCH renames) | Display labels only | | UniProt | P04637 | Stable; versioned releases | Protein databases | | RefSeq mRNA | NM_000546 | Stable; versioned | Clinical reports, HGVS notation | | MANE Select | NM_000546.6 / ENST00000269305.9 | Stable consensus | Clinical variant reporting |

Per-Method Failure Modes

`_PAR_Y` stripped, chrY duplicates collapsed

Trigger: GENCODE v40 count matrix; rownames(counts) <- sub('\\..*', '', rownames(counts)); duplicate row indices and inflated chrY PAR gene counts.

Mechanism: Default regex strips _PAR_Y along with the version suffix. Two distinct rows (chrX and chrY copies) become the same ENSG ID; aggregate sums them.

Symptom: Counts for PAR genes double; sex check shows females expressing chrY genes; downstream rowGroupBy returns warnings.

Fix: Use the preserving regex: sub('\\.[0-9]+(_PAR_Y)?$', '\\1', x). Or upgrade quantification to GENCODE 44+ where _PAR_Y is retired.

`biomaRt` returned 0 rows without warning

Trigger: getBM(attributes=..., filter='ensembl_gene_id', values=ids, mart=mart) -- note singular filter.

Mechanism: R's partial matching usually resolves filter -> filters, but in some package versions or with conflicting argument names, the call silently passes nothing.

Symptom: Empty result data frame; no error.

Fix: Always spell filters= and values= fully.

HGNC SEPT/MARCH symbols silently dropped

Trigger: Code copies a pre-2020 list of septin genes (SEPT1, SEPT2, ...); current org.db / biomaRt returns no matches.

Mechanism: HGNC renamed all SEPT# to SEPTIN# in 2020.

Symptom: 0% mapping rate for septin genes; functional analyses missing septin pathways.

Fix: Use mygene scopes='symbol,alias,prev_symbol'; or update the input list to current symbols.

biomaRt drift between runs

Trigger: A 2023 analysis used useEnsembl() without version=; rerun in 2026 produces 200 fewer significant genes.

Mechanism: Without version=, biomaRt floats to the current release. Symbols, biotypes, and gene boundaries change between releases.

Symptom: Non-reproducible results across runs of the same script.

Fix: Pin useEnsembl(version=N) where N is the release used in the original analysis. Cache the mapping table.

tx2gene version mismatch with Salmon index

Trigger: tximport(files, type='salmon', tx2gene) runs but the gene-level counts have far fewer genes than expected.

Mechanism: Salmon index built with versioned transcript IDs (ENST00000269305.9) but tx2gene has unversioned IDs (ENST00000269305). Transcripts silently drop during the mapping step.

Symptom: Lower-than-expected gene count; warning from tximport about missing transcript IDs.

Fix: Match versioning convention: rebuild tx2gene with the same versioning as the index. GenomicFeatures::makeTxDbFromGFF on the same GTF as the index is the safest path.

Cross-species mapping reports many2many, user picks one arbitrarily

Trigger: Mouse-to-human mapping returns 1.3 mouse genes per human gene on average; user takes the first row of each duplicate.

Mechanism: Many2many orthology is genuinely ambiguous; "first row" is unprincipled and irreproducible across biomaRt API versions.

Symptom: Different mappings on rerun; conflicting downstream gene sets.

Fix: Either filter to mmusculus_homolog_orthology_type == 'ortholog_one2one' (conservative) or aggregate via highest homology confidence score (mmusculus_homolog_perc_id_r1).

Common errors

| Error / symptom | Cause | Fix | |-----------------|-------|-----| | filters returns empty | Singular filter= partial-matched against another argument | Spell filters= fully | | 1-Mar in gene column | Excel autocorrected MARCH1 | Re-import with explicit string type; map back to MARCHF1 | | pyensembl ValueError: gene not found | ID not in pinned release; or unversioned ID against versioned database | Strip version before lookup; verify release | | Duplicate rownames after aggregate | Collapsed multiple source IDs to one target; OR _PAR_Y stripped | Sum-collapse expected; for PAR_Y use preserving regex | | biomaRt timeout for >5k IDs | Query too large | Chunk into batches of 1000 | | Wrong species mapping | Default species='human' in mygene; mouse query returns nothing | Pass species='mouse' explicitly | | ENSEMBL keytype not available | Older org.db package or non-human/mouse | keytypes(orgdb) to verify |

References

Ziemann M, Eren Y, El-Osta A. 2016. Gene name errors are widespread in the scientific literature. Genome Biol 17:177. doi:10.1186/s13059-016-1044-7
Bruford EA, Braschi B, Denny P, Jones TEM, Seal RL, Tweedie S. 2020. Guidelines for human gene nomenclature. Nat Genet 52:754-758. doi:10.1038/s41588-020-0669-3
Morales J, Pujar S, Loveland JE, et al. 2022. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604:310-315. doi:10.1038/s41586-022-04558-8
Durinck S, Spellman PT, Birney E, Huber W. 2009. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc 4(8):1184-1191. doi:10.1038/nprot.2009.97
Carlson M, Falcon S, Pages H, Li N. 2019. org.Hs.eg.db: Genome wide annotation for Human. R package. Bioconductor.
Wu C et al. 2013. BioGPS and MyGene.info: organizing online, gene-centric information. Nucleic Acids Res 41(D1):D561-D565. doi:10.1093/nar/gks1114
Frankish A et al. 2021. GENCODE 2021. Nucleic Acids Res 49(D1):D916-D923. doi:10.1093/nar/gkaa1087
Howe KL et al. 2021. Ensembl 2021. Nucleic Acids Res 49(D1):D884-D891. doi:10.1093/nar/gkaa942
Soneson C, Love MI, Robinson MD. 2015. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 4:1521. doi:10.12688/f1000research.7563.2

Related Skills

counts-ingest - Building count matrices and tx2gene
metadata-joins - Joining annotation with sample tables
normalization - Biotype filtering before normalization
sparse-handling - Single-cell row metadata in AnnData
differential-expression/de-results - Annotating DE results; gene-symbol display
rna-quantification/tximport-workflow - Detailed tximport + tx2gene workflow
pathway-analysis/go-enrichment - Entrez IDs required
pathway-analysis/kegg-pathways - Entrez IDs; strain-specific organism codes
database-access/biomart-queries - General biomaRt patterns
database-access/uniprot-access - UniProt mapping details
database-access/ortholog-inference - De novo ortholog inference for custom genomes

Version Compatibility

Before using code patterns, verify installed versions match. If versions differ:

R: packageVersion('<pkg>') then ?function_name to verify parameters
Python: pip show <package> then help(module.function) to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.

Gene ID Mapping

The Single Most Important Modern Insight -- Excel autocorrect renamed real genes

Two related insights that determine half the practical work:

Ensembl version suffixes matter sometimes and not others. ENSG00000123456.7 is release-specific; the unversioned ENSG00000123456 is the stable cross-release ID. STRIP for cross-release joins, MSigDB lookups, gene-set databases. KEEP for intra-release reproducibility and clinical reports. CRITICAL: the naive sub('\\..*', '', x) regex ALSO strips the GENCODE _PAR_Y suffix in releases 25-43, collapsing chrY PAR duplicates onto their chrX counterparts. Use sub('\\.[0-9]+(_PAR_Y)?$', '\\1', x).
Never use HGNC symbols as the primary computational key. Symbols change. Use Ensembl or Entrez as keys; carry symbols only as display labels in the final results table.

Algorithmic Taxonomy

Decision Tree by Scenario

AnnotationDbi + org.db

Goal: Map Ensembl gene IDs to symbols, Entrez IDs, or descriptions using a local Bioconductor annotation package.

Approach: mapIds() with the source keytype and target column; handle one-to-many via multiVals.

library(org.Hs.eg.db)
library(AnnotationDbi)

ensembl_ids <- sub('\\.[0-9]+(_PAR_Y)?$', '\\1', rownames(counts))

symbols  <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                    keytype = 'ENSEMBL', column = 'SYMBOL',
                    multiVals = 'first')
entrez   <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                    keytype = 'ENSEMBL', column = 'ENTREZID',
                    multiVals = 'first')
descrips <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                    keytype = 'ENSEMBL', column = 'GENENAME',
                    multiVals = 'first')

keytypes(org.Hs.eg.db)

multiVals options: 'first' (silent), 'asNA' (NA for ambiguous), 'list' (preserve all). For DE results tables, 'first' is typical but the mapping rate should be reported.

For mouse: org.Mm.eg.db. For other organisms: check Bioconductor AnnotationData -> OrgDb list.

biomaRt with Version Pinning

Goal: Query Ensembl BioMart with the EXACT release version, for reproducibility.

Approach: useEnsembl(version=N) pins; listEnsemblArchives() lists available archives.

library(biomaRt)

ensembl <- useEnsembl(biomart = 'genes',
                       dataset = 'hsapiens_gene_ensembl',
                       version = 110)

ensembl_grch37 <- useEnsembl(biomart = 'genes',
                              dataset = 'hsapiens_gene_ensembl',
                              GRCh = 37)

mapping <- getBM(
    attributes = c('ensembl_gene_id', 'hgnc_symbol', 'entrezgene_id',
                   'gene_biotype', 'description'),
    filters    = 'ensembl_gene_id',
    values     = ensembl_ids,
    mart       = ensembl
)

The filters= argument is PLURAL. The singular filter= may work via R's partial matching but breaks unpredictably if another argument starts with f. Always spell filters= and values= fully.

Multiple filters:

genes_in_region <- getBM(
    attributes = c('ensembl_gene_id', 'hgnc_symbol'),
    filters    = c('chromosome_name', 'start', 'end'),
    values     = list('16', 1100000, 1250000),
    mart       = ensembl
)

listEnsemblArchives() shows the available historical releases.

mygene (Python)

Goal: Map between any identifier systems using the curated MyGene.info meta-database with alias fallback.

Approach: MyGeneInfo().querymany(ids, scopes, fields, species); auto-batches at 1000 IDs server-side.

import mygene

mg = mygene.MyGeneInfo()

results = mg.querymany(['ENSG00000141510', 'ENSG00000012048', 'ENSG00000141736'],
                        scopes='ensembl.gene', fields='symbol,entrezgene,uniprot',
                        species='human')

mapping = {r['query']: r.get('symbol', None) for r in results}

results = mg.querymany(['SEPT1', 'MARCH1', 'OCT4'],
                        scopes='symbol,alias,prev_symbol',
                        fields='symbol,entrezgene,ensembl.gene',
                        species='human')

pyensembl

from pyensembl import EnsemblRelease

ensembl = EnsemblRelease(110, species='human')

gene = ensembl.gene_by_id('ENSG00000141510')
gene.gene_name

gene = ensembl.genes_by_name('TP53')[0]
gene.gene_id

mapping = {}
for eid in ensembl_ids:
    try:
        gene = ensembl.gene_by_id(eid.split('.')[0])
        mapping[eid] = gene.gene_name
    except ValueError:
        mapping[eid] = None

pyensembl downloads and caches the release database on first use; thereafter offline and version-locked.

Apply Mapping to Count Matrix -- Handling One-to-Many

Goal: Convert the gene index of a count matrix to a different ID type, summing reads from multiple source IDs that map to the same target.

Approach: Look up mapping, replace index, aggregate duplicates by SUM (not mean -- counts add).

import pandas as pd
import mygene

def map_count_matrix_ids(counts, from_type='ensembl.gene', to_type='symbol',
                         species='human'):
    '''Map gene IDs in count matrix index, summing reads when multiple source map to one target.'''
    mg = mygene.MyGeneInfo()
    clean = [g.split('.')[0] for g in counts.index]
    results = mg.querymany(clean, scopes=from_type, fields=to_type, species=species)
    mapping = {r['query']: r[to_type] for r in results if to_type in r}
    new_index = [mapping.get(g.split('.')[0], g) for g in counts.index]
    counts_mapped = counts.copy()
    counts_mapped.index = new_index
    counts_mapped = counts_mapped.groupby(counts_mapped.index).sum()
    return counts_mapped

mapped = map_count_matrix_ids(counts, 'ensembl.gene', 'symbol')

Counts ADD when collapsing multiple source genes to one target. Means or medians would be wrong (they understate library size for the merged target).

library(biomaRt)

ensembl <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)

clean <- sub('\\.[0-9]+(_PAR_Y)?$', '\\1', rownames(counts))

mapping <- getBM(
    attributes = c('ensembl_gene_id', 'hgnc_symbol'),
    filters    = 'ensembl_gene_id',
    values     = clean,
    mart       = ensembl
)

counts_df <- as.data.frame(counts)
counts_df$ensembl <- clean
merged <- merge(counts_df, mapping, by.x = 'ensembl', by.y = 'ensembl_gene_id')
counts_by_symbol <- aggregate(. ~ hgnc_symbol,
                              data = merged[, setdiff(colnames(merged), 'ensembl')],
                              FUN = sum)
rownames(counts_by_symbol) <- counts_by_symbol$hgnc_symbol
counts_by_symbol$hgnc_symbol <- NULL

Handle Unmapped IDs

def robust_id_mapping(gene_ids, from_type, to_type, species='human'):
    import mygene
    mg = mygene.MyGeneInfo()
    clean = [g.split('.')[0] for g in gene_ids]
    results = mg.querymany(clean, scopes=from_type, fields=to_type, species=species)
    mapping, unmapped = {}, []
    for r in results:
        original = gene_ids[clean.index(r['query'])]
        if to_type in r:
            mapping[original] = r[to_type]
        else:
            mapping[original] = original
            unmapped.append(original)
    print(f'Mapped: {len(gene_ids) - len(unmapped)}/{len(gene_ids)}')
    return mapping, unmapped

Unmapped fraction is a QC signal:

<5% unmapped: normal (rare genes, recent deprecations)
5-20% unmapped: check Ensembl release alignment; check for HGNC renames
20% unmapped: wrong annotation release, wrong species, or wrong source ID type

MANE Select for Clinical Reporting

Goal: Use the single representative transcript per gene with identical exon/CDS in RefSeq AND Ensembl for clinical variant reporting.

Approach: Download the MANE TSV; join on Ensembl_Gene -> Ensembl_nuc (transcript) and RefSeq_nuc.

For clinical reports with HGVS notation like NM_000546.6:c.215C>G, use the MANE Select RefSeq accession. The MANE TSV (downloadable from NCBI) provides the Ensembl crosswalk.

Cross-Species Orthologs

Goal: Map mouse <-> human (or any pair) for cross-species integration or pathway transfer.

Approach: Ensembl Compara via biomaRt getLDS; filter to orthology type appropriate to use.

library(biomaRt)

human <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)
mouse <- useEnsembl(biomart = 'genes', dataset = 'mmusculus_gene_ensembl', version = 110)

orthologs <- getLDS(
    attributes  = c('hgnc_symbol', 'ensembl_gene_id'),
    filters     = 'ensembl_gene_id',
    values      = human_gene_ids,
    mart        = human,
    attributesL = c('mgi_symbol', 'ensembl_gene_id', 'mmusculus_homolog_orthology_type'),
    martL       = mouse
)

Alternative sources: OMA (Hierarchical Orthologous Groups, cleaner one2one when present, smaller coverage); OrthoDB (hierarchical at multiple taxonomic levels). OrthoFinder for custom genomes.

PAR Gene Complications

par_genes_human = ['SHOX', 'IL3RA', 'SLC25A6', 'P2RY8', 'AKAP17A', 'ASMT', 'DHRSX']
dup_ids = counts.index[counts.index.duplicated()].unique()
if len(dup_ids) > 0:
    print(f'Duplicate gene entries: {len(dup_ids)}')
    counts = counts.groupby(counts.index).sum()

Build tx2gene for tximport

Goal: Create the transcript-to-gene mapping needed by tximport for gene-level summarization.

Approach: Build from the SAME GTF used to construct the Salmon/kallisto index, OR pull from biomaRt with version pinning.

library(GenomicFeatures)

txdb <- makeTxDbFromGFF('annotation.gtf.gz')
k <- keys(txdb, keytype = 'TXNAME')
tx2gene <- AnnotationDbi::select(txdb, k, 'GENEID', 'TXNAME')

library(biomaRt)

mart <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)
tx2gene <- getBM(
    attributes = c('ensembl_transcript_id_version', 'ensembl_gene_id_version'),
    mart       = mart
)
colnames(tx2gene) <- c('TXNAME', 'GENEID')

import pandas as pd

def tx2gene_from_gtf(gtf_path):
    records = []
    with open(gtf_path) as f:
        for line in f:
            if line.startswith('#') or '\ttranscript\t' not in line:
                continue
            attrs = line.strip().split('\t')[8]
            gene_id = [a.split('"')[1] for a in attrs.split(';') if 'gene_id' in a][0]
            tx_id   = [a.split('"')[1] for a in attrs.split(';') if 'transcript_id' in a][0]
            records.append({'TXNAME': tx_id, 'GENEID': gene_id})
    return pd.DataFrame(records).drop_duplicates()

ID Type Reference

Per-Method Failure Modes

`_PAR_Y` stripped, chrY duplicates collapsed

Trigger: GENCODE v40 count matrix; rownames(counts) <- sub('\\..*', '', rownames(counts)); duplicate row indices and inflated chrY PAR gene counts.

Mechanism: Default regex strips _PAR_Y along with the version suffix. Two distinct rows (chrX and chrY copies) become the same ENSG ID; aggregate sums them.

Symptom: Counts for PAR genes double; sex check shows females expressing chrY genes; downstream rowGroupBy returns warnings.

Fix: Use the preserving regex: sub('\\.[0-9]+(_PAR_Y)?$', '\\1', x). Or upgrade quantification to GENCODE 44+ where _PAR_Y is retired.

`biomaRt` returned 0 rows without warning

Trigger: getBM(attributes=..., filter='ensembl_gene_id', values=ids, mart=mart) -- note singular filter.

Mechanism: R's partial matching usually resolves filter -> filters, but in some package versions or with conflicting argument names, the call silently passes nothing.

Symptom: Empty result data frame; no error.

Fix: Always spell filters= and values= fully.

HGNC SEPT/MARCH symbols silently dropped

Trigger: Code copies a pre-2020 list of septin genes (SEPT1, SEPT2, ...); current org.db / biomaRt returns no matches.

Mechanism: HGNC renamed all SEPT# to SEPTIN# in 2020.

Symptom: 0% mapping rate for septin genes; functional analyses missing septin pathways.

Fix: Use mygene scopes='symbol,alias,prev_symbol'; or update the input list to current symbols.

biomaRt drift between runs

Trigger: A 2023 analysis used useEnsembl() without version=; rerun in 2026 produces 200 fewer significant genes.

Mechanism: Without version=, biomaRt floats to the current release. Symbols, biotypes, and gene boundaries change between releases.

Symptom: Non-reproducible results across runs of the same script.

Fix: Pin useEnsembl(version=N) where N is the release used in the original analysis. Cache the mapping table.

tx2gene version mismatch with Salmon index

Trigger: tximport(files, type='salmon', tx2gene) runs but the gene-level counts have far fewer genes than expected.

Mechanism: Salmon index built with versioned transcript IDs (ENST00000269305.9) but tx2gene has unversioned IDs (ENST00000269305). Transcripts silently drop during the mapping step.

Symptom: Lower-than-expected gene count; warning from tximport about missing transcript IDs.

Fix: Match versioning convention: rebuild tx2gene with the same versioning as the index. GenomicFeatures::makeTxDbFromGFF on the same GTF as the index is the safest path.

Cross-species mapping reports many2many, user picks one arbitrarily

Trigger: Mouse-to-human mapping returns 1.3 mouse genes per human gene on average; user takes the first row of each duplicate.

Mechanism: Many2many orthology is genuinely ambiguous; "first row" is unprincipled and irreproducible across biomaRt API versions.

Symptom: Different mappings on rerun; conflicting downstream gene sets.

Fix: Either filter to mmusculus_homolog_orthology_type == 'ortholog_one2one' (conservative) or aggregate via highest homology confidence score (mmusculus_homolog_perc_id_r1).

Common errors

References

Ziemann M, Eren Y, El-Osta A. 2016. Gene name errors are widespread in the scientific literature. Genome Biol 17:177. doi:10.1186/s13059-016-1044-7
Bruford EA, Braschi B, Denny P, Jones TEM, Seal RL, Tweedie S. 2020. Guidelines for human gene nomenclature. Nat Genet 52:754-758. doi:10.1038/s41588-020-0669-3
Morales J, Pujar S, Loveland JE, et al. 2022. A joint NCBI and EMBL-EBI transcript set for clinical genomics and research. Nature 604:310-315. doi:10.1038/s41586-022-04558-8
Durinck S, Spellman PT, Birney E, Huber W. 2009. Mapping identifiers for the integration of genomic datasets with the R/Bioconductor package biomaRt. Nat Protoc 4(8):1184-1191. doi:10.1038/nprot.2009.97
Carlson M, Falcon S, Pages H, Li N. 2019. org.Hs.eg.db: Genome wide annotation for Human. R package. Bioconductor.
Wu C et al. 2013. BioGPS and MyGene.info: organizing online, gene-centric information. Nucleic Acids Res 41(D1):D561-D565. doi:10.1093/nar/gks1114
Frankish A et al. 2021. GENCODE 2021. Nucleic Acids Res 49(D1):D916-D923. doi:10.1093/nar/gkaa1087
Howe KL et al. 2021. Ensembl 2021. Nucleic Acids Res 49(D1):D884-D891. doi:10.1093/nar/gkaa942
Soneson C, Love MI, Robinson MD. 2015. Differential analyses for RNA-seq: transcript-level estimates improve gene-level inferences. F1000Res 4:1521. doi:10.12688/f1000research.7563.2

Related Skills

counts-ingest - Building count matrices and tx2gene
metadata-joins - Joining annotation with sample tables
normalization - Biotype filtering before normalization
sparse-handling - Single-cell row metadata in AnnData
differential-expression/de-results - Annotating DE results; gene-symbol display
rna-quantification/tximport-workflow - Detailed tximport + tx2gene workflow
pathway-analysis/go-enrichment - Entrez IDs required
pathway-analysis/kegg-pathways - Entrez IDs; strain-specific organism codes
database-access/biomart-queries - General biomaRt patterns
database-access/uniprot-access - UniProt mapping details
database-access/ortholog-inference - De novo ortholog inference for custom genomes

Adoption

GPTomics/bio-expression-matrix-gene-id-mapping

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

Gene ID Mapping

The Single Most Important Modern Insight -- Excel autocorrect renamed real genes

Algorithmic Taxonomy

Decision Tree by Scenario

AnnotationDbi + org.db

biomaRt with Version Pinning

mygene (Python)

pyensembl

Apply Mapping to Count Matrix -- Handling One-to-Many

Handle Unmapped IDs

MANE Select for Clinical Reporting

Cross-Species Orthologs

PAR Gene Complications

Build tx2gene for tximport

ID Type Reference

Per-Method Failure Modes

_PAR_Y stripped, chrY duplicates collapsed

biomaRt returned 0 rows without warning

HGNC SEPT/MARCH symbols silently dropped

biomaRt drift between runs

tx2gene version mismatch with Salmon index

Cross-species mapping reports many2many, user picks one arbitrarily

Common errors

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis

GPTomics/bio-expression-matrix-gene-id-mapping

$ install --global

Security Scan Results

SKILL.md

Version Compatibility

Gene ID Mapping

The Single Most Important Modern Insight -- Excel autocorrect renamed real genes

Algorithmic Taxonomy

Decision Tree by Scenario

AnnotationDbi + org.db

biomaRt with Version Pinning

mygene (Python)

pyensembl

Apply Mapping to Count Matrix -- Handling One-to-Many

Handle Unmapped IDs

MANE Select for Clinical Reporting

Cross-Species Orthologs

PAR Gene Complications

Build tx2gene for tximport

ID Type Reference

Per-Method Failure Modes

_PAR_Y stripped, chrY duplicates collapsed

biomaRt returned 0 rows without warning

HGNC SEPT/MARCH symbols silently dropped

biomaRt drift between runs

tx2gene version mismatch with Salmon index

Cross-species mapping reports many2many, user picks one arbitrarily

Common errors

References

Related Skills

Related Skills

GPTomics/bio-workflows-clip-pipeline

GPTomics/bio-comparative-genomics-whole-genome-duplication

GPTomics/bio-comparative-genomics-whole-genome-alignment

GPTomics/bio-comparative-genomics-synteny-analysis

`_PAR_Y` stripped, chrY duplicates collapsed

`biomaRt` returned 0 rows without warning

`_PAR_Y` stripped, chrY duplicates collapsed

`biomaRt` returned 0 rows without warning