expression-matrix/gene-id-mapping/SKILL.md
Maps between gene identifier systems (Ensembl, Entrez, HGNC symbol, UniProt, RefSeq, MANE) using AnnotationDbi, biomaRt, mygene, pyensembl, and Ensembl REST. Encodes Ensembl version stripping with GENCODE _PAR_Y preservation, the Ziemann 2016 Excel autocorrect debacle and Bruford 2020 HGNC renames (SEPT*->SEPTIN*, MARCH*->MARCHF*, MARC*->MTARC*, DEC1->DELEC1), OCT4/POU5F1 alias resolution, biomaRt archive endpoints for release pinning, the `filters` (plural) gotcha, MANE Select for clinical reporting, cross-species orthology via Ensembl Compara / OMA / OrthoDB, and tx2gene construction for tximport. Use when converting gene IDs across systems, handling renamed symbols, building tx2gene, pinning to a specific Ensembl release for reproducibility, or mapping cross-species orthologs.
Reference examples tested with: biomaRt 2.58+, AnnotationDbi 1.66+, org.Hs.eg.db 3.18+, org.Mm.eg.db 3.18+, GenomicFeatures 1.54+, mygene 1.38+ (Python), pyensembl 2.3+, pandas 2.2+, rtracklayer 1.62+
Before using code patterns, verify installed versions match. If versions differ:
- R: `packageVersion('<pkg>')` then `?function_name` to verify parameters
- Python: `pip show <package>` then `help(module.function)` to check signatures

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Convert gene IDs from X to Y" -> Query the appropriate annotation source (local org.db for speed, biomaRt for Ensembl-specific attributes, mygene for cross-database aliases, Ensembl REST for low-level access), with version pinning for reproducibility and explicit handling of one-to-many mappings, withdrawn symbols, and species-specific naming.
Ziemann, Eren, El-Osta 2016 Genome Biol 17:177 scanned 18 leading genomics journals and found ~20% of papers with Excel-attached supplementary gene lists had silently mangled symbols (SEPT2 -> 2-Sep, MARCH1 -> 1-Mar, ...). Five years later the problem persisted. HGNC's response (Bruford, Braschi, Denny, Jones, Seal, Tweedie 2020 Nat Genet 52:754) was to rename the affected genes:
| Old | New | Affected |
|-----|-----|----------|
| SEPT# | SEPTIN# | SEPT1 - SEPT14 -> SEPTIN1 - SEPTIN14 |
| MARCH# | MARCHF# | MARCH1 - MARCH11 -> MARCHF1 - MARCHF11 |
| MARC# | MTARC# | MARC1, MARC2 -> MTARC1, MTARC2 |
| DEC1 | DELEC1 | DEC1 -> DELEC1 |
Code that hard-codes old symbols silently drops these genes when joined against post-2020 annotations. Detection on import: if a gene column contains ^\d{1,2}-(Jan|Feb|Mar|...|Dec)$ patterns, the file was Excel-corrupted. Always read.csv(colClasses=c(gene='character')) (R) or pd.read_csv(dtype={'gene': str}) (Python) -- but the damage is at Excel-save time, not import time.
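The detection-on-import check above can be sketched as follows. This is a minimal illustration, not part of any library: `find_excel_mangled` is a hypothetical helper, and the full list of month abbreviations is an assumption filled in for the sketch.

```python
import re

# Month abbreviations Excel emits when it coerces a symbol to a date
# (full list assumed for this sketch)
_MONTHS = 'Jan|Feb|Mar|Apr|May|Jun|Jul|Aug|Sep|Oct|Nov|Dec'
_EXCEL_DATE = re.compile(rf'^\d{{1,2}}-({_MONTHS})$')

def find_excel_mangled(symbols):
    '''Return the entries that look like Excel date conversions (e.g. '2-Sep').'''
    return [s for s in symbols
            if isinstance(s, str) and _EXCEL_DATE.match(s.strip())]

genes = ['TP53', '2-Sep', 'BRCA1', '1-Mar', 'SEPTIN2']
print(find_excel_mangled(genes))
```

A non-empty result means the file passed through Excel at some point; the original symbols must be recovered from an upstream copy, since `2-Sep` alone does not say whether it came from SEPT2 or SEPTIN2.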
Two related insights that determine half the practical work:
Ensembl version suffixes matter sometimes and not others. ENSG00000123456.7 is release-specific; the unversioned ENSG00000123456 is the stable cross-release ID. STRIP for cross-release joins, MSigDB lookups, gene-set databases. KEEP for intra-release reproducibility and clinical reports. CRITICAL: the naive sub('\\..*', '', x) regex ALSO strips the GENCODE _PAR_Y suffix in releases 25-43, collapsing chrY PAR duplicates onto their chrX counterparts. Use sub('\\.[0-9]+(_PAR_Y)?$', '\\1', x).
Never use HGNC symbols as the primary computational key. Symbols change. Use Ensembl or Entrez as keys; carry symbols only as display labels in the final results table.
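The version-stripping rule above translates directly to Python. A minimal sketch, assuming Python 3.5+ (where an unmatched group in the replacement becomes an empty string); the IDs and version numbers are illustrative only.

```python
import re

def strip_version_naive(gene_id):
    # Drops everything after the first dot, including any _PAR_Y suffix
    return gene_id.split('.')[0]

def strip_version_keep_par(gene_id):
    # Removes only the numeric version; a trailing _PAR_Y is kept
    return re.sub(r'\.[0-9]+(_PAR_Y)?$', r'\1', gene_id)

ids = ['ENSG00000182378.14', 'ENSG00000182378.14_PAR_Y', 'ENSG00000141510.18']
naive = [strip_version_naive(i) for i in ids]
kept = [strip_version_keep_par(i) for i in ids]

# The naive form yields a duplicate key; the preserving form does not
print(len(set(naive)), len(set(kept)))
```

Comparing the number of unique IDs before and after stripping is a cheap check for an accidental chrX/chrY PAR collapse.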
| Tool | Source | Speed | Strength | Use for |
|------|--------|-------|----------|---------|
| AnnotationDbi + org.Hs.eg.db / org.Mm.eg.db | NCBI Gene snapshot, pinned at Bioc install | Fast, local | Stable, version-pinned | Default for Ensembl <-> Entrez <-> Symbol within Bioconductor |
| biomaRt | Ensembl BioMart over HTTP | Slow for >5k queries; timeouts | Ensembl-specific attributes (biotype, transcript versions, paralogs, orthologs) | Need Ensembl-specific fields; archive endpoints for release pinning |
| mygene.info / mygene (Python) | REST API to a curated meta-database | Queries sent in batches of 1000 IDs per request | Best for symbol/alias/prev_symbol resolution | Cross-database; HGNC withdrawn symbol resolution; non-R environments |
| Ensembl REST | Direct REST API to Ensembl | Rate-limited (15 req/sec) | Low-level access to variant consequence, sequence, etc. | Specialized queries not covered by biomaRt |
| pyensembl | Local Ensembl database (Python) | Fast, local, version-pinned | Reproducible offline; gene objects with transcript and exon access | Python pipelines needing rich annotation |
| HGNC API direct | https://rest.genenames.org | REST | Authoritative source for HGNC | Symbol provenance, prev/alias detection |
| Scenario | Recommended approach | Why |
|----------|---------------------|-----|
| R Bioconductor pipeline, Ensembl <-> Entrez <-> Symbol | AnnotationDbi::mapIds(org.Hs.eg.db, ...) | Fastest, version-pinned, stable |
| Need Ensembl-only attributes (biotype, paralog, ortholog) | biomaRt::useEnsembl(version=N) | Only biomaRt exposes these |
| Cross-database with alias and withdrawn-symbol fallback | mygene querymany(scopes='symbol,alias,prev_symbol') | Designed for this case |
| Python pipeline, reproducible | pyensembl with pinned release | Offline, version-locked |
| Clinical report needing canonical transcript per gene | MANE Select (Morales 2022 Nature 604:310) | Cross-database consensus (RefSeq + Ensembl) |
| Cross-species mouse <-> human | Ensembl Compara getLDS filtered to one2one | Compara has best coverage; one2one most defensible |
| Building tx2gene for tximport | GenomicFeatures::makeTxDbFromGFF on the SAME GTF used in quantification | Annotation pinning matters |
| Need to reproduce a 2023 analysis exactly | useEnsembl(version=109) (or whichever release was used) | Without version=, biomaRt floats to current release |
| GRCh37 (legacy clinical) | useEnsembl(GRCh=37) dedicated permanent endpoint | GRCh37 -> GRCh38 mappings are not 1:1 |
Goal: Map Ensembl gene IDs to symbols, Entrez IDs, or descriptions using a local Bioconductor annotation package.
Approach: mapIds() with the source keytype and target column; handle one-to-many via multiVals.
```r
library(org.Hs.eg.db)
library(AnnotationDbi)

# Strip the version suffix but keep a GENCODE _PAR_Y tag
ensembl_ids <- sub('\\.[0-9]+(_PAR_Y)?$', '\\1', rownames(counts))

symbols <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                  keytype = 'ENSEMBL', column = 'SYMBOL',
                  multiVals = 'first')

entrez <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                 keytype = 'ENSEMBL', column = 'ENTREZID',
                 multiVals = 'first')

descrips <- mapIds(org.Hs.eg.db, keys = ensembl_ids,
                   keytype = 'ENSEMBL', column = 'GENENAME',
                   multiVals = 'first')

# List the ID types this OrgDb supports
keytypes(org.Hs.eg.db)
```
multiVals options: 'first' (silent), 'asNA' (NA for ambiguous), 'list' (preserve all). For DE results tables, 'first' is typical but the mapping rate should be reported.
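The three multiVals behaviours can be mimicked on a long mapping table in pandas, which also gives the ambiguity count to report. This is an analogue written for illustration, not the AnnotationDbi API; `collapse_multi` and the toy IDs are hypothetical.

```python
import pandas as pd

def collapse_multi(mapping, key, value, how='first'):
    '''Collapse a long key -> value table to one entry per key.'''
    grouped = mapping.groupby(key, sort=False)[value].agg(list)
    if how == 'first':   # keep the first hit, silently
        return grouped.map(lambda v: v[0])
    if how == 'asNA':    # missing value for any key with more than one hit
        return grouped.map(lambda v: v[0] if len(v) == 1 else None)
    if how == 'list':    # keep every hit
        return grouped
    raise ValueError(f'unknown how={how!r}')

table = pd.DataFrame({
    'ensembl': ['ENSG_A', 'ENSG_B', 'ENSG_B'],
    'symbol': ['GENE1', 'GENE2', 'GENE2B'],
})
first = collapse_multi(table, 'ensembl', 'symbol', 'first')
as_na = collapse_multi(table, 'ensembl', 'symbol', 'asNA')
ambiguous = int(as_na.isna().sum())
print(f'Ambiguous keys: {ambiguous}/{len(as_na)}')
```

Running the 'asNA' variant next to 'first' shows how many rows of a results table were decided by row order alone.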
For mouse: org.Mm.eg.db. For other organisms: check Bioconductor AnnotationData -> OrgDb list.
Goal: Query Ensembl BioMart with the EXACT release version, for reproducibility.
Approach: useEnsembl(version=N) pins; listEnsemblArchives() lists available archives.
```r
library(biomaRt)

# Pin to a specific Ensembl release
ensembl <- useEnsembl(biomart = 'genes',
                      dataset = 'hsapiens_gene_ensembl',
                      version = 110)

# GRCh37 is served from its own dedicated endpoint
ensembl_grch37 <- useEnsembl(biomart = 'genes',
                             dataset = 'hsapiens_gene_ensembl',
                             GRCh = 37)

mapping <- getBM(
  attributes = c('ensembl_gene_id', 'hgnc_symbol', 'entrezgene_id',
                 'gene_biotype', 'description'),
  filters = 'ensembl_gene_id',
  values = ensembl_ids,
  mart = ensembl
)
```
The filters= argument is PLURAL. The singular filter= may work via R's partial matching but breaks unpredictably if another argument starts with f. Always spell filters= and values= fully.
Multiple filters:
```r
genes_in_region <- getBM(
  attributes = c('ensembl_gene_id', 'hgnc_symbol'),
  filters = c('chromosome_name', 'start', 'end'),
  values = list('16', 1100000, 1250000),
  mart = ensembl
)
```
Without version=, biomaRt floats to the current release: a script written in 2023 against Ensembl 109 produces different mappings when rerun later against whatever release is current at that time. ALWAYS pin for any published analysis. Cache the mapping table alongside the analysis for reproducibility.
listEnsemblArchives() shows the available historical releases.
Goal: Map between any identifier systems using the curated MyGene.info meta-database with alias fallback.
Approach: MyGeneInfo().querymany(ids, scopes, fields, species); large inputs are sent in batches of 1000 IDs per request automatically.
```python
import mygene

mg = mygene.MyGeneInfo()

# Ensembl gene IDs -> symbol, Entrez ID, UniProt accession
results = mg.querymany(['ENSG00000141510', 'ENSG00000012048', 'ENSG00000141736'],
                       scopes='ensembl.gene', fields='symbol,entrezgene,uniprot',
                       species='human')
mapping = {r['query']: r.get('symbol', None) for r in results}

# Old symbols and aliases -> current symbol, Entrez ID, Ensembl gene ID
results = mg.querymany(['SEPT1', 'MARCH1', 'OCT4'],
                       scopes='symbol,alias,prev_symbol',
                       fields='symbol,entrezgene,ensembl.gene',
                       species='human')
```
For paper-derived gene lists where symbols may be old or aliases (OCT4 vs POU5F1, MARCH1 vs MARCHF1, SEPT2 vs SEPTIN2), scopes='symbol,alias,prev_symbol' handles the resolution. The MyGene database aggregates HGNC's prev/alias columns.
OCT4 is the common usage; POU5F1 is the official HGNC symbol; in MSigDB the gene is POU5F1; in a Western blot legend it's "Oct4". For mapping a stem-cell paper to an Ensembl-quantified matrix, scope to aliases.
```python
from pyensembl import EnsemblRelease

ensembl = EnsemblRelease(110, species='human')

# ID -> name
gene = ensembl.gene_by_id('ENSG00000141510')
gene.gene_name

# Name -> ID
gene = ensembl.genes_by_name('TP53')[0]
gene.gene_id

# Batch lookup; IDs missing from the pinned release map to None
mapping = {}
for eid in ensembl_ids:
    try:
        gene = ensembl.gene_by_id(eid.split('.')[0])
        mapping[eid] = gene.gene_name
    except ValueError:
        mapping[eid] = None
```
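pyensembl needs the release downloaded and indexed once (for example `pyensembl install --release 110` on the command line, or `ensembl.download()` followed by `ensembl.index()` in Python); thereafter it runs offline and version-locked.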
Goal: Convert the gene index of a count matrix to a different ID type, summing reads from multiple source IDs that map to the same target.
Approach: Look up mapping, replace index, aggregate duplicates by SUM (not mean -- counts add).
```python
import re

import mygene
import pandas as pd

def strip_version(gene_id):
    '''Remove the Ensembl version suffix but keep a GENCODE _PAR_Y tag.'''
    return re.sub(r'\.[0-9]+(_PAR_Y)?$', r'\1', gene_id)

def map_count_matrix_ids(counts, from_type='ensembl.gene', to_type='symbol',
                         species='human'):
    '''Map gene IDs in count matrix index, summing reads when multiple source IDs map to one target.'''
    mg = mygene.MyGeneInfo()
    clean = [strip_version(g) for g in counts.index]
    results = mg.querymany(clean, scopes=from_type, fields=to_type, species=species)
    # Keep only hits that carry the requested field; unmapped IDs keep their original label
    mapping = {r['query']: r[to_type] for r in results if to_type in r}
    new_index = [mapping.get(c, g) for c, g in zip(clean, counts.index)]
    counts_mapped = counts.copy()
    counts_mapped.index = new_index
    # Counts add, so collapse duplicate targets by sum
    counts_mapped = counts_mapped.groupby(counts_mapped.index).sum()
    return counts_mapped

mapped = map_count_matrix_ids(counts, 'ensembl.gene', 'symbol')
```
Counts ADD when collapsing multiple source genes to one target. Means or medians would be wrong (they understate library size for the merged target).
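A toy matrix makes the difference concrete. This is an illustrative sketch with made-up gene labels and counts: two source IDs collapse to `GENE_A`, and only the sum keeps the per-sample library size intact.

```python
import pandas as pd

counts = pd.DataFrame(
    {'sample1': [100, 50, 30], 'sample2': [80, 40, 20]},
    index=['GENE_A', 'GENE_A', 'GENE_B'],  # two source IDs collapsed to GENE_A
)

summed = counts.groupby(counts.index).sum()
averaged = counts.groupby(counts.index).mean()

print('library size before:', counts.sum().to_dict())
print('after sum:          ', summed.sum().to_dict())
print('after mean:         ', averaged.sum().to_dict())
```

After the mean-collapse the library sizes no longer match the input, which shifts any downstream size-factor or CPM calculation.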
```r
library(biomaRt)

ensembl <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)

# Strip the version suffix but keep a GENCODE _PAR_Y tag
clean <- sub('\\.[0-9]+(_PAR_Y)?$', '\\1', rownames(counts))

mapping <- getBM(
  attributes = c('ensembl_gene_id', 'hgnc_symbol'),
  filters = 'ensembl_gene_id',
  values = clean,
  mart = ensembl
)
# Genes without an HGNC symbol come back empty; drop them so they are not summed into one row
mapping <- mapping[!is.na(mapping$hgnc_symbol) & mapping$hgnc_symbol != '', ]

counts_df <- as.data.frame(counts)
counts_df$ensembl <- clean
merged <- merge(counts_df, mapping, by.x = 'ensembl', by.y = 'ensembl_gene_id')

# Sum counts across Ensembl IDs that share a symbol
counts_by_symbol <- aggregate(. ~ hgnc_symbol,
                              data = merged[, setdiff(colnames(merged), 'ensembl')],
                              FUN = sum)
rownames(counts_by_symbol) <- counts_by_symbol$hgnc_symbol
counts_by_symbol$hgnc_symbol <- NULL
```
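```python
import re

import mygene

def robust_id_mapping(gene_ids, from_type, to_type, species='human'):
    '''Map IDs and report failures; unmapped IDs keep their original value.'''
    mg = mygene.MyGeneInfo()
    # Strip the version suffix but keep a GENCODE _PAR_Y tag
    clean = [re.sub(r'\.[0-9]+(_PAR_Y)?$', r'\1', g) for g in gene_ids]
    results = mg.querymany(clean, scopes=from_type, fields=to_type, species=species)
    # First hit per query wins; hits without the target field are ignored
    hits = {}
    for r in results:
        if to_type in r:
            hits.setdefault(r['query'], r[to_type])
    mapping, unmapped = {}, []
    for original, c in zip(gene_ids, clean):
        if c in hits:
            mapping[original] = hits[c]
        else:
            mapping[original] = original
            unmapped.append(original)
    print(f'Mapped: {len(gene_ids) - len(unmapped)}/{len(gene_ids)}')
    return mapping, unmapped
```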
Unmapped fraction is a QC signal: more than about 20% unmapped points to a wrong annotation release, wrong species, or wrong source ID type.
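The QC check can be automated with a small guard. A minimal sketch: `mapping_qc` is a hypothetical helper, the 20% cut-off is taken from the text above, and treating it as a strict upper bound is an assumption.

```python
def unmapped_fraction(n_total, n_unmapped):
    '''Fraction of input IDs that did not map; 0.0 when there is no input.'''
    return n_unmapped / n_total if n_total else 0.0

def mapping_qc(n_total, n_unmapped, threshold=0.20):
    '''Return (fraction, ok); ok is False when the fraction exceeds the threshold.'''
    frac = unmapped_fraction(n_total, n_unmapped)
    return frac, frac <= threshold

# Illustrative numbers only
frac, ok = mapping_qc(n_total=20000, n_unmapped=5200)
print(f'{frac:.1%} unmapped, ok={ok}')
```

On failure, the first things to check are the ones listed above: the annotation release, the species argument, and whether the source IDs really are of the type passed as the scope.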
Goal: Use the single representative transcript per gene with identical exon/CDS in RefSeq AND Ensembl for clinical variant reporting.
Approach: Download the MANE TSV; join on Ensembl_Gene -> Ensembl_nuc (transcript) and RefSeq_nuc.
Morales J, Pujar S, Loveland JE et al. 2022 Nature 604:310-315 established MANE Select. More than 19,000 protein-coding genes have a single agreed transcript with matched coordinates across RefSeq (NM_xxxxxx) and Ensembl/GENCODE (ENST00000xxxxxxx). MANE Plus Clinical adds extra transcripts at loci where Select misses clinical variants.
For clinical reports with HGVS notation like NM_000546.6:c.215C>G, use the MANE Select RefSeq accession. The MANE TSV (downloadable from NCBI) provides the Ensembl crosswalk.
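The join described in the approach can be sketched with pandas. The two rows below are inline stand-ins for the downloaded TSV and are illustrative only; the column names beyond Ensembl_Gene, Ensembl_nuc and RefSeq_nuc (here `#NCBI_GeneID`, `symbol`, `MANE_status`) are assumptions about the file layout that should be checked against the actual download.

```python
import io

import pandas as pd

# Stand-in for the MANE summary TSV; rows and extra column names are illustrative
mane_tsv = (
    '#NCBI_GeneID\tEnsembl_Gene\tsymbol\tRefSeq_nuc\tEnsembl_nuc\tMANE_status\n'
    'GeneID:7157\tENSG00000141510.18\tTP53\tNM_000546.6\tENST00000269305.9\tMANE Select\n'
    'GeneID:672\tENSG00000012048.23\tBRCA1\tNM_007294.4\tENST00000357654.9\tMANE Select\n'
)

mane = pd.read_csv(io.StringIO(mane_tsv), sep='\t')
select = mane[mane['MANE_status'] == 'MANE Select'].copy()

# Unversioned gene ID as the join key; transcript accessions keep their versions
select['gene_id'] = select['Ensembl_Gene'].str.replace(r'\.[0-9]+$', '', regex=True)
crosswalk = select.set_index('gene_id')[['symbol', 'RefSeq_nuc', 'Ensembl_nuc']]

print(crosswalk.loc['ENSG00000141510', 'RefSeq_nuc'])
```

Transcript versions are kept on purpose: an HGVS string is only unambiguous with the versioned accession.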
Goal: Map mouse <-> human (or any pair) for cross-species integration or pathway transfer.
Approach: Ensembl Compara via biomaRt getLDS; filter to orthology type appropriate to use.
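```r
library(biomaRt)
human <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)
mouse <- useEnsembl(biomart = 'genes', dataset = 'mmusculus_gene_ensembl', version = 110)
# Linked query: human gene -> mouse gene
orthologs <- getLDS(
  attributes = c('hgnc_symbol', 'ensembl_gene_id'),
  filters = 'ensembl_gene_id',
  values = human_gene_ids,
  mart = human,
  attributesL = c('mgi_symbol', 'ensembl_gene_id'),
  martL = mouse
)
# Orthology type is a homolog attribute of the human dataset, so query it with getBM
ortho_type <- getBM(
  attributes = c('ensembl_gene_id', 'mmusculus_homolog_ensembl_gene',
                 'mmusculus_homolog_orthology_type'),
  filters = 'ensembl_gene_id',
  values = human_gene_ids,
  mart = human
)
```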
| Strategy | When | Trade-off |
|----------|------|-----------|
| one2one orthologs only | Cross-species scRNA-seq integration; conservative DE comparison | Loses genes with paralog expansions; lower coverage |
| Include one2many | Broader gene coverage needed | Must select within group (highest confidence; highest expression) |
| Include many2many | Maximum inclusivity | Introduces ambiguity; use with caution |
The "homology threshold" problem: no automatic threshold reliably separates true orthologs from paralogs across all gene families. For pathway transfer (mouse signature -> human), filter to one2one and accept the coverage loss.
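Once an ortholog table is exported, the two common strategies reduce to a filter and a deterministic pick. A minimal pandas sketch on a made-up table: the column names are illustrative rather than biomaRt attribute names, and using percent identity as the tie-breaking score is an assumption.

```python
import pandas as pd

# Hypothetical ortholog table; gene labels and scores are invented
orth = pd.DataFrame({
    'human_gene': ['H1', 'H2', 'H2', 'H3'],
    'mouse_gene': ['M1', 'M2a', 'M2b', 'M3'],
    'orthology_type': ['ortholog_one2one', 'ortholog_one2many',
                       'ortholog_one2many', 'ortholog_one2one'],
    'perc_id': [92.0, 71.0, 85.0, 88.0],
})

# Conservative: one2one pairs only
one2one = orth[orth['orthology_type'] == 'ortholog_one2one']

# Inclusive: best-scoring mouse gene per human gene, ties broken by name
best = (orth.sort_values(['human_gene', 'perc_id', 'mouse_gene'],
                         ascending=[True, False, True])
            .drop_duplicates('human_gene'))

print(one2one['human_gene'].tolist())
print(best.set_index('human_gene')['mouse_gene'].to_dict())
```

Sorting on an explicit score and a name tie-break keeps the choice reproducible, unlike taking whichever row happens to come first.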
Alternative sources: OMA (Hierarchical Orthologous Groups, cleaner one2one when present, smaller coverage); OrthoDB (hierarchical at multiple taxonomic levels). OrthoFinder for custom genomes.
Pseudo-autosomal region (PAR) genes exist on both X and Y with identical sequences. In GENCODE 25-43, the chrY copy has a _PAR_Y suffix. In GENCODE 44+ (Ensembl 110+), chrY PAR genes get their own ENSG accessions.
```python
par_genes_human = ['SHOX', 'IL3RA', 'SLC25A6', 'P2RY8', 'AKAP17A', 'ASMT', 'DHRSX']

dup_ids = counts.index[counts.index.duplicated()].unique()
if len(dup_ids) > 0:
    print(f'Duplicate gene entries: {len(dup_ids)}')
    counts = counts.groupby(counts.index).sum()
```
Reads from PAR regions cannot be unambiguously assigned to X or Y. Some references mask the Y-chromosome PAR to avoid double-counting; verify what the alignment reference does before building the matrix.
Goal: Create the transcript-to-gene mapping needed by tximport for gene-level summarization.
Approach: Build from the SAME GTF used to construct the Salmon/kallisto index, OR pull from biomaRt with version pinning.
```r
library(GenomicFeatures)
txdb <- makeTxDbFromGFF('annotation.gtf.gz')
k <- keys(txdb, keytype = 'TXNAME')
tx2gene <- AnnotationDbi::select(txdb, k, 'GENEID', 'TXNAME')
```
```r
library(biomaRt)
mart <- useEnsembl(biomart = 'genes', dataset = 'hsapiens_gene_ensembl', version = 110)
tx2gene <- getBM(
  attributes = c('ensembl_transcript_id_version', 'ensembl_gene_id_version'),
  mart = mart
)
colnames(tx2gene) <- c('TXNAME', 'GENEID')
```
```python
import gzip

import pandas as pd

def tx2gene_from_gtf(gtf_path):
    '''Build a TXNAME/GENEID table from the transcript records of a GTF (plain or .gz).'''
    opener = gzip.open if str(gtf_path).endswith('.gz') else open
    records = []
    with opener(gtf_path, 'rt') as f:
        for line in f:
            # Keep only feature lines of type "transcript"
            if line.startswith('#') or '\ttranscript\t' not in line:
                continue
            attrs = line.strip().split('\t')[8]
            gene_id = [a.split('"')[1] for a in attrs.split(';') if 'gene_id' in a][0]
            tx_id = [a.split('"')[1] for a in attrs.split(';') if 'transcript_id' in a][0]
            records.append({'TXNAME': tx_id, 'GENEID': gene_id})
    return pd.DataFrame(records).drop_duplicates()
```
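CRITICAL: the tx2gene MUST use the same versioning convention as the Salmon/kallisto index. If the index used ENST00000269305.9 and tx2gene has ENST00000269305 (unversioned), the IDs do not match and tximport cannot assign those transcripts to genes. When only some IDs mismatch, the affected transcripts are dropped with nothing more than a short message, so data is lost almost silently.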
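A quick pre-flight comparison of the two ID sets catches the mismatch before tximport runs. A minimal sketch, assuming the transcript IDs have already been read from the quantification output and from the tx2gene table; `check_tx2gene_overlap` is a hypothetical helper, not part of tximport.

```python
import re

_VERSIONED = re.compile(r'\.[0-9]+$')

def versioned_fraction(ids):
    '''Share of IDs that end in a numeric version suffix such as ".9".'''
    ids = list(ids)
    if not ids:
        return 0.0
    return sum(bool(_VERSIONED.search(i)) for i in ids) / len(ids)

def check_tx2gene_overlap(quant_ids, tx2gene_ids):
    '''Summarise how well quantified transcript IDs match the tx2gene table.'''
    quant, ref = set(quant_ids), set(tx2gene_ids)
    return {
        'matched_fraction': len(quant & ref) / len(quant) if quant else 0.0,
        'quant_versioned': versioned_fraction(quant),
        'tx2gene_versioned': versioned_fraction(ref),
    }

report = check_tx2gene_overlap(
    ['ENST00000269305.9', 'ENST00000357654.9'],   # IDs as quantified
    ['ENST00000269305', 'ENST00000357654'],       # IDs in tx2gene
)
print(report)
```

A low matched fraction combined with different versioned fractions on the two sides points at the versioning convention rather than at a wrong annotation.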
| Type | Example | Stability | Use case |
|------|---------|-----------|----------|
| Ensembl Gene | ENSG00000141510 | Stable across releases; versioned | RNA-seq, GTFs, primary computational key |
| Ensembl Transcript | ENST00000269305 | Stable; versioned | Transcript-level analysis |
| Entrez Gene | 7157 | Stable; never reused | NCBI databases, KEGG pathways |
| HGNC Symbol | TP53 | Changes (see SEPT/MARCH renames) | Display labels only |
| UniProt | P04637 | Stable; versioned releases | Protein databases |
| RefSeq mRNA | NM_000546 | Stable; versioned | Clinical reports, HGVS notation |
| MANE Select | NM_000546.6 / ENST00000269305.9 | Stable consensus | Clinical variant reporting |
_PAR_Y stripped, chrY duplicates collapsed
Trigger: GENCODE v40 count matrix; rownames(counts) <- sub('\\..*', '', rownames(counts)); duplicate row indices and inflated chrY PAR gene counts.
Mechanism: Default regex strips _PAR_Y along with the version suffix. Two distinct rows (chrX and chrY copies) become the same ENSG ID; aggregate sums them.
Symptom: Counts for PAR genes double; sex check shows females expressing chrY genes; downstream rowGroupBy returns warnings.
Fix: Use the preserving regex: sub('\\.[0-9]+(_PAR_Y)?$', '\\1', x). Or upgrade quantification to GENCODE 44+ where _PAR_Y is retired.
biomaRt returned 0 rows without warning
Trigger: getBM(attributes=..., filter='ensembl_gene_id', values=ids, mart=mart) -- note singular filter.
Mechanism: R's partial matching usually resolves filter -> filters, but in some package versions or with conflicting argument names, the call silently passes nothing.
Symptom: Empty result data frame; no error.
Fix: Always spell filters= and values= fully.
Trigger: Code copies a pre-2020 list of septin genes (SEPT1, SEPT2, ...); current org.db / biomaRt returns no matches.
Mechanism: HGNC renamed all SEPT# to SEPTIN# in 2020.
Symptom: 0% mapping rate for septin genes; functional analyses missing septin pathways.
Fix: Use mygene scopes='symbol,alias,prev_symbol'; or update the input list to current symbols.
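For offline use, the four 2020 rename rules from the table earlier in this document can be applied directly. A minimal sketch: `update_symbol` is a hypothetical helper that covers only those rules and applies them to any numbered symbol with the matching prefix; anything else should still go through HGNC or mygene.

```python
import re

# Rename rules from the 2020 HGNC changes listed in this document
_RENAMES = [
    (re.compile(r'^SEPT(\d+)$'), r'SEPTIN\1'),
    (re.compile(r'^MARCH(\d+)$'), r'MARCHF\1'),
    (re.compile(r'^MARC(\d+)$'), r'MTARC\1'),
    (re.compile(r'^DEC1$'), 'DELEC1'),
]

def update_symbol(symbol):
    '''Return the post-2020 symbol for a renamed gene, else the input unchanged.'''
    for pattern, replacement in _RENAMES:
        if pattern.match(symbol):
            return pattern.sub(replacement, symbol)
    return symbol

old = ['SEPT2', 'MARCH1', 'MARC1', 'DEC1', 'TP53', 'SEPTIN2']
print([update_symbol(s) for s in old])
```

Anchoring each pattern at both ends keeps MARCH1 from being caught by the MARC rule and leaves already-updated symbols untouched.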
Trigger: A 2023 analysis used useEnsembl() without version=; rerun in 2026 produces 200 fewer significant genes.
Mechanism: Without version=, biomaRt floats to the current release. Symbols, biotypes, and gene boundaries change between releases.
Symptom: Non-reproducible results across runs of the same script.
Fix: Pin useEnsembl(version=N) where N is the release used in the original analysis. Cache the mapping table.
Trigger: tximport(files, type='salmon', tx2gene) runs but the gene-level counts have far fewer genes than expected.
Mechanism: Salmon index built with versioned transcript IDs (ENST00000269305.9) but tx2gene has unversioned IDs (ENST00000269305). Transcripts absent from tx2gene are dropped during gene-level summarization.
Symptom: Lower-than-expected gene count and a tximport message about transcripts missing from tx2gene; if no IDs match at all, tximport stops with an error instead.
Fix: Match versioning convention: rebuild tx2gene with the same versioning as the index. GenomicFeatures::makeTxDbFromGFF on the same GTF as the index is the safest path. Where only the version suffix differs, tximport's ignoreTxVersion = TRUE is an alternative.
Trigger: Mouse-to-human mapping returns 1.3 mouse genes per human gene on average; user takes the first row of each duplicate.
Mechanism: Many2many orthology is genuinely ambiguous; "first row" is unprincipled and irreproducible across biomaRt API versions.
Symptom: Different mappings on rerun; conflicting downstream gene sets.
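Fix: Either filter to mmusculus_homolog_orthology_type == 'ortholog_one2one' (conservative) or pick one ortholog per gene by an explicit rule such as highest percent identity (mmusculus_homolog_perc_id_r1), optionally restricted to high-confidence pairs (mmusculus_homolog_orthology_confidence == 1).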
| Error / symptom | Cause | Fix |
|-----------------|-------|-----|
| filters returns empty | Singular filter= partial-matched against another argument | Spell filters= fully |
| 1-Mar in gene column | Excel autocorrected MARCH1 | Re-import with explicit string type; map back to MARCHF1 |
| pyensembl ValueError: gene not found | ID not in pinned release; or versioned ID passed where an unversioned ID is expected | Strip version before lookup; verify release |
| Duplicate rownames after aggregate | Collapsed multiple source IDs to one target; OR _PAR_Y stripped | Sum-collapse expected; for PAR_Y use preserving regex |
| biomaRt timeout for >5k IDs | Query too large | Chunk into batches of 1000 |
| Wrong species mapping | species left at 'human' (the default in the helper functions above) for a mouse query, which then returns nothing | Pass species='mouse' explicitly |
| ENSEMBL keytype not available | Older org.db package or non-human/mouse | keytypes(orgdb) to verify |