pathway-analysis/go-enrichment/SKILL.md
Gene Ontology over-representation analysis using clusterProfiler enrichGO. Use when identifying biological functions enriched in a gene list from differential expression or other analyses. Supports all three ontologies (BP, MF, CC), multiple ID types, and customizable statistical thresholds.
Reference examples tested with: R stats (base), clusterProfiler 4.10+
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_name to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
| Scenario | Method | Why |
|----------|--------|-----|
| Clear DE gene list with arbitrary cutoff (padj + FC) | ORA, but consider GSEA instead | ORA discards magnitude; GSEA uses all genes ranked by statistic |
| Genes from co-expression module, GWAS loci, screen hits | ORA | No ranking available; ORA is appropriate |
| All genes with DE statistics available | GSEA (gseGO) | Avoids arbitrary cutoff; detects subtle coordinated changes |
| Very few DE genes (< 20) | GSEA | ORA has no power with small lists |
| RNA-seq with known length bias | GOseq (goseq package) | Standard ORA ignores length bias; longer genes are more likely DE |
ORA converts continuous measures into binary (significant/not), losing information. When in doubt, run both ORA and GSEA and compare.
Goal: Identify enriched Gene Ontology terms in a gene list from differential expression or similar analyses.
Approach: Test for over-representation of GO terms using the hypergeometric test via clusterProfiler enrichGO.
"Run GO enrichment on my gene list" -> Test whether biological process, molecular function, or cellular component terms are over-represented among significant genes.
library(clusterProfiler)
library(org.Hs.eg.db) # Human - change for other organisms
ego <- enrichGO(
gene = gene_list, # Character vector of gene IDs
OrgDb = org.Hs.eg.db, # Organism annotation database
keyType = 'ENTREZID', # ID type: ENSEMBL, SYMBOL, ENTREZID, etc.
ont = 'BP', # BP, MF, CC, or ALL
pAdjustMethod = 'BH', # p-value adjustment method
pvalueCutoff = 0.05,
qvalueCutoff = 0.2
)
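The hypergeometric test behind ORA can be sketched in base R for a single GO term. The counts below are invented for illustration; enrichGO derives the real ones from the annotation database and the supplied gene list.

```r
# Hypothetical counts for one GO term (illustration only, not real data)
N <- 12000  # genes in the background (universe)
M <- 200    # background genes annotated to the term
n <- 300    # genes in the query list
k <- 15     # query genes annotated to the term

# Over-representation p-value: P(X >= k) under the hypergeometric null
p_over <- phyper(k - 1, M, N - M, n, lower.tail = FALSE)

# The same test expressed as a one-sided Fisher's exact test on the 2x2 table
# (rows: in term / not in term; columns: in list / not in list)
tab <- matrix(c(k, n - k, M - k, N - M - n + k), nrow = 2)
p_fisher <- fisher.test(tab, alternative = 'greater')$p.value

# Overlap expected by chance, for comparison with the observed k
expected <- n * M / N
```

Comparing k with expected gives a quick sense of how far the observed overlap sits above chance before any multiple-testing adjustment is applied.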
Goal: Extract significant gene IDs from differential expression results and convert to the format required by enrichGO.
Approach: Filter DE results by adjusted p-value and fold change, then convert gene symbols to Entrez IDs using bitr.
library(dplyr)
de_results <- read.csv('de_results.csv')
sig_genes <- de_results %>%
filter(padj < 0.05, abs(log2FoldChange) > 1) %>%
pull(gene_id)
# If using gene symbols, convert to Entrez IDs
gene_ids <- bitr(sig_genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
gene_list <- gene_ids$ENTREZID
Goal: Convert between gene identifier types (Ensembl, Symbol, Entrez) for compatibility with enrichment tools.
Approach: Use clusterProfiler bitr to map between ID types using organism annotation databases.
# Check available key types
keytypes(org.Hs.eg.db)
# Convert between ID types
converted <- bitr(genes, fromType = 'ENSEMBL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
# Multiple output types
converted <- bitr(genes, fromType = 'SYMBOL', toType = c('ENTREZID', 'ENSEMBL'), OrgDb = org.Hs.eg.db)
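ID conversion is rarely one-to-one, so it can help to check the mapping before running enrichment. The sketch below uses a toy data frame standing in for bitr() output; the gene names and IDs are hypothetical.

```r
# Toy stand-in for the data frame returned by bitr() (hypothetical values)
converted <- data.frame(
  SYMBOL = c('GENE1', 'GENE2', 'GENE2', 'GENE3'),
  ENTREZID = c('101', '202', '203', '303'),
  stringsAsFactors = FALSE
)
genes <- c('GENE1', 'GENE2', 'GENE3', 'GENE4')  # GENE4 has no mapping here

# Inputs that failed to map and would silently drop out of the analysis
unmapped <- setdiff(genes, converted$SYMBOL)

# Inputs that mapped to more than one target ID
multi_mapped <- unique(converted$SYMBOL[duplicated(converted$SYMBOL)])

# Unique target IDs to pass on as the gene list
gene_list <- unique(converted$ENTREZID)
```

How to resolve multi-mapped symbols (keep all, keep the first, or drop them) is a judgment call that depends on the dataset.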
Goal: Improve enrichment specificity by restricting the background to genes actually tested in the experiment.
Approach: Pass all expressed genes (not just significant ones) as the universe parameter to enrichGO.
The background must be genes that could have appeared in the list. Getting this wrong is the single most common ORA error (95% of published analyses fail to specify an appropriate background). Using the whole genome (~20,000 genes) when only 12,000 were expressed inflates significance for tissue-specific pathways.
| Experiment Type | Correct Background |
|----------------|-------------------|
| RNA-seq | All genes with detectable expression (e.g., > 1 CPM in >= N samples) |
| Microarray | All probes on the array (mapped to genes) |
| Proteomics | All detected proteins |
| Targeted panel | Only genes on the panel |
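To see why the background matters, the same overlap can be tested against two universes in base R. All counts are hypothetical, and the example assumes a tissue-specific term whose genes are almost all expressed.

```r
# Hypothetical counts (illustration only)
n <- 300  # genes in the query list
k <- 15   # query genes annotated to the term

# Whole-genome background: 160 of 20,000 genes annotated to the term
p_genome <- phyper(k - 1, 160, 20000 - 160, n, lower.tail = FALSE)

# Expressed-gene background: 150 of 12,000 genes annotated to the term
p_expressed <- phyper(k - 1, 150, 12000 - 150, n, lower.tail = FALSE)

# Overlap expected by chance under each background
expected_genome <- n * 160 / 20000
expected_expressed <- n * 150 / 12000
```

Under these assumed counts the whole-genome background expects fewer term genes in the list, so the same 15 hits look more surprising and the p-value is smaller than with the expressed-gene background.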
# Background = all genes that were tested (NOT the full genome)
# For DESeq2: independent filtering sets padj (not pvalue) to NA,
# so genes with non-NA padj are the ones that survived filtering
all_tested <- de_results$gene_id[!is.na(de_results$padj)]
universe_ids <- bitr(all_tested, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
ego <- enrichGO(
gene = gene_list,
universe = universe_ids$ENTREZID,
OrgDb = org.Hs.eg.db,
keyType = 'ENTREZID',
ont = 'BP',
pAdjustMethod = 'BH',
pvalueCutoff = 0.05
)
Warning: clusterProfiler silently drops unannotated genes from the background. To prevent this, set options(enrichment_force_universe = TRUE) before running enrichGO.
# Run all ontologies at once
ego_all <- enrichGO(
gene = gene_list,
OrgDb = org.Hs.eg.db,
keyType = 'ENTREZID',
ont = 'ALL', # BP, MF, and CC combined
pAdjustMethod = 'BH',
pvalueCutoff = 0.05
)
# Results include ONTOLOGY column
head(as.data.frame(ego_all))
# Convert Entrez IDs to gene symbols in results
ego_readable <- setReadable(ego, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')
# Or use readable = TRUE directly (only works with ENTREZID input)
ego <- enrichGO(
gene = gene_list,
OrgDb = org.Hs.eg.db,
keyType = 'ENTREZID',
ont = 'BP',
readable = TRUE # Converts to symbols
)
# View top results
head(ego)
# Convert to data frame
results_df <- as.data.frame(ego)
# Key columns: ID, Description, GeneRatio, BgRatio, pvalue, p.adjust, qvalue, geneID, Count
# Export to CSV
write.csv(results_df, 'go_enrichment_results.csv', row.names = FALSE)
# Filter for specific criteria
sig_terms <- results_df[results_df$p.adjust < 0.01 & results_df$Count >= 5, ]
Goal: Remove highly similar GO terms to reduce redundancy in enrichment results.
Approach: Cluster GO terms by semantic similarity and retain representative terms using the simplify function.
GO terms form a DAG (directed acyclic graph), not a flat list. If "mitotic cell cycle" is enriched, parent terms ("cell cycle", "cell cycle process") will also be enriched because they contain supersets of the same genes. Always simplify before interpretation.
# Remove redundant GO terms (keeps representative terms)
ego_simplified <- simplify(ego, cutoff = 0.7, by = 'p.adjust', select_fun = min)
# measure options: 'Wang' (default, graph-based, stable across releases),
# 'Resnik', 'Lin', 'Jiang', 'Rel' (IC-based, depend on annotation version)
ego_simplified <- simplify(ego, cutoff = 0.7, measure = 'Wang')
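Limitations: simplify() does NOT work with ont='ALL' -- run BP, MF, CC separately. Cutoff 0.7 is a reasonable default. The cutoff is a similarity threshold: terms more similar than the cutoff are treated as redundant, so a lower cutoff removes more terms (more aggressive) and a higher cutoff retains more.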
# Mouse
library(org.Mm.eg.db)
ego_mouse <- enrichGO(gene = genes, OrgDb = org.Mm.eg.db, ont = 'BP')
# Zebrafish
library(org.Dr.eg.db)
ego_zfish <- enrichGO(gene = genes, OrgDb = org.Dr.eg.db, ont = 'BP')
# Yeast
library(org.Sc.sgd.db)
ego_yeast <- enrichGO(gene = genes, OrgDb = org.Sc.sgd.db, ont = 'BP', keyType = 'ORF')
Goal: Classify genes by broad GO slim categories for a high-level functional overview.
Approach: Use groupGO to assign genes to GO terms at a specific hierarchy level.
# Classify genes by GO slim categories
ggo <- groupGO(
gene = gene_list,
OrgDb = org.Hs.eg.db,
ont = 'BP',
level = 3, # GO hierarchy level
readable = TRUE
)
| Parameter | Default | Description |
|-----------|---------|-------------|
| gene | required | Vector of gene IDs |
| OrgDb | required | Organism database |
| keyType | ENTREZID | Input ID type |
| ont | MF | BP, MF, CC, or ALL |
| pvalueCutoff | 0.05 | P-value threshold |
| qvalueCutoff | 0.2 | Q-value (FDR) threshold |
| pAdjustMethod | BH | BH, bonferroni, etc. |
| universe | NULL | Background genes |
| minGSSize | 10 | Min genes per term |
| maxGSSize | 500 | Max genes per term |
| readable | FALSE | Convert to symbols |
Always examine effect size alongside p-values. A pathway with 500 genes can achieve p < 1e-15 with a modest 1.2x fold enrichment, while a 10-gene pathway with 4x enrichment at p = 0.01 is biologically more interesting.
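Fold enrichment is the ratio of GeneRatio to BgRatio, which enrichGO reports as 'a/b' strings. A minimal base R sketch, using a toy data frame in place of as.data.frame(ego); the term names, counts, and the parse_ratio helper are hypothetical.

```r
# Toy stand-in for as.data.frame(ego); values are invented for illustration
results_df <- data.frame(
  ID = c('term_large', 'term_small'),
  GeneRatio = c('60/300', '4/300'),
  BgRatio = c('2000/12000', '40/12000'),
  stringsAsFactors = FALSE
)

# Hypothetical helper: turn an 'a/b' string into the number a / b
parse_ratio <- function(x) {
  parts <- strsplit(x, '/', fixed = TRUE)
  vapply(parts, function(p) as.numeric(p[1]) / as.numeric(p[2]), numeric(1))
}

# Fold enrichment = (fraction of list in term) / (fraction of background in term)
results_df$FoldEnrichment <- parse_ratio(results_df$GeneRatio) / parse_ratio(results_df$BgRatio)

# Rank by effect size rather than by p-value alone
results_df <- results_df[order(-results_df$FoldEnrichment), ]
```

Sorting or filtering on this column next to p.adjust keeps large, weakly enriched terms from dominating the top of the table.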
minGSSize=10, maxGSSize=500 filters out uninformative extremes.

Deduplicate IDs after bitr() to avoid counting genes multiple times.

In RNA-seq, longer transcripts produce more fragments, increasing statistical power to detect DE. This systematically biases ORA toward pathways enriched in long genes (extracellular matrix, cell adhesion) and against short-gene pathways (ribosomal, mitochondrial). Standard normalization (RPKM, TMM) does NOT fix this.
For length-corrected GO enrichment, use GOseq:
library(goseq)
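# de_vector: named 0/1 vector over all tested genes (1 = DE), names are Ensembl gene IDs
# gene_lengths: numeric vector of gene lengths in the same order as de_vector
pwf <- nullp(de_vector, 'hg38', 'ensGene', bias.data = gene_lengths)
goseq_results <- goseq(pwf, 'hg38', 'ensGene', method = 'Wallenius')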