workflows/expression-to-pathways/SKILL.md
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.
npx skillsauth add GPTomics/bioSkills bio-workflows-expression-to-pathwaysInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: clusterProfiler 4.10+, org.Hs.eg.db 3.18+, ReactomePA 1.46+, enrichplot 1.22+.
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_name to verify parametersIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Find enriched pathways from my differential expression results" -> Decide the generation (ORA vs GSEA) FIRST, then convert IDs to the form each method needs, run enrichment against the chosen database, and collapse redundancy before interpreting - because the enrichment result is a claim conditioned on the method, the background universe, and the database version, not a discovery the algorithm hands back.
enrichGO(...) / gseGO(...) / enrichKEGG(...) / enrichPathway(...) (clusterProfiler, ReactomePA)Scope: the ORCHESTRATION of a DE-to-enrichment pipeline - the generation fork, per-method ID conversion, the universe decision, the live-vs-local database caveat, and the handoff to redundancy-collapsed visualization. This workflow does NOT re-teach each method. The null/universe/reproducibility theory and the master method-selection tree -> pathway-analysis/enrichment-foundations. ORA mechanics -> go-enrichment; GSEA mechanics -> gsea; per-database IDs and live-DB behavior -> kegg-pathways, reactome-pathways, wikipathways; the DE list and ranking statistic -> differential-expression/de-results; plotting -> enrichment-visualization.
Pathway analysis has three generations (Khatri 2012 PLoS Comput Biol 8:e1002375): over-representation analysis (ORA), functional class scoring / GSEA (FCS), and pathway topology. A workflow that "runs enrichment" without first deciding which generation applies has already made the choice silently - usually ORA, the worst-calibrated corner of the space. The fork is mechanical:
stat, or -sign(log2FC)*log10(p) for every tested gene -> GSEA (a NAMED vector sorted in DECREASING order; the ranking metric IS the experiment). No arbitrary cutoff; detects coordinated weak shifts that ORA misses.The dangerous default is running ORA on data that has a full ranking (binarizing away the signal) or running ORA against the genome (measuring expression, not enrichment). Decide the fork out loud, record it, and route the why to pathway-analysis/enrichment-foundations - this workflow owns the routing, not the derivation.
DE results (differential-expression/de-results)
|
v
[0. Decide the generation: ranking for all genes? -> GSEA | pre-selected list? -> ORA] (enrichment-foundations)
|
+--> ORA branch: define the TESTABLE-gene universe, convert IDs per method
| +--> enrichGO (OrgDb keyType) -> go-enrichment
| +--> enrichKEGG ('kegg' / 'ncbi-geneid', LIVE DB) -> kegg-pathways
| +--> enrichPathway (ENTREZ, local DB) -> reactome-pathways
| +--> enrichWP (ENTREZ, LIVE GMT) -> wikipathways
|
+--> GSEA branch: build a NAMED decreasing vector of ALL genes, set.seed
| +--> gseGO / gseKEGG / GSEA(+msigdbr) -> gsea
|
v
[Redundancy collapse + visualization: simplify, pairwise_termsim, dotplot/emapplot/gseaplot2] (enrichment-visualization)
|
v
A claim conditioned on universe + method + database version (record provenance)
| Stage | Goal | Owns the nuance | |-------|------|-----------------| | 0. Decide generation | ORA vs GSEA from the available input | pathway-analysis/enrichment-foundations | | 1. Prepare input | Build the gene list AND/OR the named ranked vector; define the universe | differential-expression/de-results (the stat); foundations (the universe rule) | | 2. Convert IDs | Map to the form each method needs (OrgDb keyType / kegg-id / ENTREZ) | go-enrichment, kegg-pathways, reactome-pathways | | 3a. ORA | Hypergeometric test of the list vs background | go-enrichment, kegg-pathways, reactome-pathways, wikipathways | | 3b. GSEA | Running-sum over the full ranking | gsea | | 4. Collapse + visualize | Reduce redundancy, then plot | enrichment-visualization |
| Scenario | Route | Why | |----------|-------|-----| | All genes carry a DE statistic, a cutoff would be arbitrary | GSEA (gseGO/gseKEGG) -> gsea | uses the full ranking; no cutoff; named decreasing vector | | Pre-selected list (module, GWAS loci, screen hits), no ranking | ORA (enrichGO/enrichKEGG) -> go-enrichment | no ranking available; define the universe | | Broad function annotation | enrichGO / gseGO -> go-enrichment, gsea | GO is the broadest LOCAL resource (reproducible) | | Metabolic / signaling pathways | enrichKEGG / gseKEGG -> kegg-pathways | KEGG maps query a LIVE DB (pin the date) | | Reaction-level, peer-reviewed, reproducible offline | enrichPathway -> reactome-pathways | local reactome.db, version-pinned | | Disease/drug sets the others miss, broad species | enrichWP -> wikipathways | community-curated; LIVE versioned GMT | | Bacterial / prokaryotic data | enrichKEGG with locus tags + KEGG organism code -> kegg-pathways | KEGG covers prokaryotes; OrgDb usually does not | | RNA-seq with strong gene-length bias | GOseq -> go-enrichment | length-aware ORA null | | Multiple conditions/clusters side by side | compareCluster -> any DB | one model, faceted dotplot; never compare p across separate runs | | The DE list / ranking statistic itself | -> differential-expression/de-results | that is upstream, not enrichment | | Why this null, which universe, version reporting | -> pathway-analysis/enrichment-foundations | the conceptual spine |
Goal: Turn a DE table into the two possible inputs - a gene LIST for ORA and a NAMED decreasing vector for GSEA - and define the background universe as the testable genes.
Approach: Read the DE result, derive the significant list, build the ranked vector from a signed statistic (not a bare log2FC), and set the universe to exactly the genes that entered the DE test. The DE mechanics (the $padj vs $adj.P.Val column, shrinkage) live at differential-expression/de-results - this is only input shaping.
library(clusterProfiler)
library(org.Hs.eg.db)
res <- read.csv('deseq2_results.csv', row.names = 1)
# ORA input: a pre-selected list (DESeq2 padj column; limma/edgeR name it differently)
sig_genes <- rownames(subset(res, padj < 0.05 & abs(log2FoldChange) > 1))
# Background universe = genes that were TESTABLE (entered the DE test), NOT the genome.
# Using the genome measures expression bias, not enrichment (foundations: the universe rule).
universe_genes <- rownames(res[!is.na(res$pvalue), ])
# GSEA input: a NAMED vector of ALL genes, sorted DECREASING by a signed metric.
# Prefer the Wald stat (magnitude + precision); a bare log2FC over-weights noisy low-count genes.
ranked <- res$stat
names(ranked) <- rownames(res)
ranked <- sort(ranked[!is.na(ranked)], decreasing = TRUE)
Goal: Map identifiers to the exact ID type each enrichment function expects, because a mismatch returns zero hits silently.
Approach: Use bitr (OrgDb) for SYMBOL/ENSEMBL -> ENTREZ, keep both list and ranked vector in the same ID space, deduplicate, and track the conversion rate. Per-method ID rules are owned by each DB skill; the table below is the routing summary.
# enrichGO accepts ENSEMBL/SYMBOL/ENTREZ via keyType=; ENTREZ is the safe lingua franca downstream
sig_entrez <- bitr(sig_genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
bg_entrez <- bitr(universe_genes, fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
# Carry the ranking through conversion: name the kept stat by its ENTREZ id
ranked_map <- bitr(names(ranked), fromType = 'SYMBOL', toType = 'ENTREZID', OrgDb = org.Hs.eg.db)
ranked_list <- ranked[ranked_map$SYMBOL]
names(ranked_list) <- ranked_map$ENTREZID
ranked_list <- ranked_list[!duplicated(names(ranked_list))] # dedup or GSEA biases the score
conv_rate <- nrow(sig_entrez) / length(sig_genes) # report it; <0.85 -> wrong ID type/organism
| Method | keyType / ID required | Convert with | |--------|-----------------------|--------------| | enrichGO / gseGO | OrgDb keyType ('ENSEMBL', 'SYMBOL', 'ENTREZID') | bitr | | enrichKEGG / gseKEGG | 'kegg' or 'ncbi-geneid' (NOT ENSEMBL/OrgDb) | bitr to ENTREZID, pass keyType='ncbi-geneid' (bitr_kegg only converts among KEGG ID flavors) | | enrichPathway / gsePathway (ReactomePA) | ENTREZ | bitr | | enrichWP / gseWP (WikiPathways) | ENTREZ + organism string | bitr |
Goal: Test each gene set for over-representation of the list against the testable-gene background.
Approach: Always pass universe=; run GO ontologies separately; KEGG/Reactome/WikiPathways each need their own ID form. KEGG and WikiPathways query a LIVE database (internet-dependent, not reproducible across releases - pin the run date); GO and Reactome read local annotation (reproducible given the Bioconductor release).
# GO ORA - universe is the decision; simplify() collapses DAG redundancy (BP/MF/CC separately, not 'ALL')
go_bp <- enrichGO(sig_entrez$ENTREZID, universe = bg_entrez$ENTREZID, OrgDb = org.Hs.eg.db,
ont = 'BP', pAdjustMethod = 'BH', pvalueCutoff = 0.05, readable = TRUE)
go_bp <- simplify(go_bp, cutoff = 0.7, by = 'p.adjust')
# KEGG ORA - LIVE KEGG REST API; needs internet; record the access date for reproducibility
kegg <- enrichKEGG(sig_entrez$ENTREZID, organism = 'hsa', keyType = 'ncbi-geneid', pvalueCutoff = 0.05) # Entrez input; 'kegg' default is the prokaryote locus-tag path
kegg <- setReadable(kegg, OrgDb = org.Hs.eg.db, keyType = 'ENTREZID')
# Reactome ORA - ENTREZ required; LOCAL reactome.db so reproducible given the release
library(ReactomePA)
reactome <- enrichPathway(sig_entrez$ENTREZID, organism = 'human', pvalueCutoff = 0.05, readable = TRUE)
Goal: Find gene sets whose genes shift coordinately across the full ranking, without a significance cutoff.
Approach: Run on the named decreasing ranked_list, fix the permutation seed so p-values are reproducible, then read the leading edge as the interpretable core. clusterProfiler GSEA is preranked / gene-permutation (the inter-gene-correlation-UNcorrected null) - a discovery screen; see pathway-analysis/gsea and enrichment-foundations for the calibration caveat (CAMERA/ROAST).
set.seed(123) # permutation reproducibility; without it p-values drift across runs
gsea_go <- gseGO(ranked_list, OrgDb = org.Hs.eg.db, ont = 'BP',
minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE)
gsea_kegg <- gseKEGG(ranked_list, organism = 'hsa',
minGSSize = 10, maxGSSize = 500, pvalueCutoff = 0.05, verbose = FALSE)
Goal: Reduce overlapping terms to distinct findings before drawing conclusions, then plot deliberately.
Approach: A list of 40 significant GO terms is often a few biological stories told many times (shared genes via the GO true-path rule). Collapse with simplify/pairwise_termsim, then plot - emapplot/cnetplot require pairwise_termsim() first, and gseaplot2 is for a gseaResult not an enrichResult. Encoding choice and required pre-steps are owned by pathway-analysis/enrichment-visualization.
library(enrichplot)
go_bp <- pairwise_termsim(go_bp) # required before emapplot/treeplot
dotplot(go_bp, showCategory = 20) # GeneRatio vs Count: pick the encoding deliberately
emapplot(go_bp, showCategory = 30) # redundancy-collapsed term-similarity map
gseaplot2(gsea_go, geneSetID = 1:3) # gseaResult only, not enrichResult
Goal: Compare enrichment across conditions in one model instead of comparing p-values from separate runs.
Approach: compareCluster fits all gene lists together and facets the dotplot; never compare raw -log10(p) across separate enrichments (it scales with set size and sample size). For GSEA, compare NES, not p.
gene_clusters <- list(A = sig_A, B = sig_B, C = sig_C)
cc <- compareCluster(gene_clusters, fun = 'enrichKEGG', organism = 'hsa')
dotplot(cc, showCategory = 10)
Trigger: universe= left at default while only ~12k genes were expressed. Mechanism: the hypergeometric p-value is fully determined by the denominator; the genome inflates any set whose members are expressed in the tissue. Symptom: many tissue-specific terms enrich with tiny p. Fix: set universe to the genes that entered the DE test (the testable set).
Trigger: filtering all-gene DE results to a list and running ORA. Mechanism: binarizing at an arbitrary cutoff discards magnitude and the coordinated-weak signal. Symptom: GSEA finds sets ORA missed. Fix: if a ranking exists for all genes, run GSEA; reserve ORA for genuinely unranked lists.
Trigger: ENSEMBL/SYMBOL passed to enrichKEGG/enrichPathway/enrichWP. Mechanism: those expect kegg-id/ENTREZ; unmatched IDs are dropped. Symptom: zero terms, no error. Fix: bitr/bitr_kegg to the required ID; check conv_rate.
Trigger: no seed, or a list that is not named and decreasing. Mechanism: permutation p-values drift run to run; an unsorted/unnamed vector errors or mis-ranks. Symptom: different leading edges each run, or a names error. Fix: build the named decreasing vector and set.seed.
Trigger: KEGG/WikiPathways result with no recorded date. Mechanism: those query the current data release; the same code returns different pathways later. Symptom: a collaborator cannot reproduce the figure. Fix: record the access date and data version; prefer local GO/Reactome when reproducibility is paramount.
Trigger: interpreting 40 overlapping GO terms as 40 findings. Mechanism: the true-path rule and pathway overlap mean shared genes drive many sets. Symptom: the same 3-5 genes explain the top 20 terms. Fix: simplify/pairwise_termsim, inspect the leading-edge/geneID core, report clusters of terms.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| pvalueCutoff = 0.05 | clusterProfiler default | filters on p.adjust by default in enrichResult; standard FDR gate |
| qvalueCutoff = 0.2 | clusterProfiler default | secondary q-value gate |
| pAdjustMethod = 'BH' | Benjamini-Hochberg | valid FDR control under positive dependence (overlapping sets); Bonferroni over-corrects |
| minGSSize = 10 | enrichGO/gseGO default | drop tiny sets that overfit |
| maxGSSize = 500 | enrichGO/gseGO default | drop overly broad sets that always "enrich" |
| simplify(cutoff = 0.7) | GOSemSim semantic similarity | GO DAG redundancy cutoff; lower keeps more terms |
| conversion rate > 0.85 | practical QC | <85% ID conversion flags a wrong ID type/organism |
| set.seed(123) | reproducibility | any fixed seed; the point is to fix the permutation draw |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| enrichKEGG returns 0 terms | ENSEMBL passed (needs kegg-id/ENTREZ), wrong organism code, or KEGG API down | convert with bitr_kegg; check organism; retry (live DB) |
| --> No gene can be mapped | wrong keyType/OrgDb for the input IDs | match keyType to the actual ID type |
| gseGO error about names | vector not named or not sorted decreasing | build a named vector sorted decreasing = TRUE |
| cnetplot/emapplot empty or errors | pairwise_termsim() not run first | run pairwise_termsim() before the plot |
| simplify fails on ont='ALL' | simplify needs one ontology | run BP/MF/CC separately, then simplify each |
| different results each run | no set.seed, or the live KEGG/WP DB changed | set.seed; pin and record the DB version/date |
| all terms have NA Description | readable/setReadable not applied | set readable = TRUE or call setReadable |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
End-to-end pooled and single-cell CRISPR screen analysis from FASTQ to hit genes. Orchestrates library design QC, guide counting, six-stage screen QC (plasmid Gini, replicate Pearson, CEGv2 PR-AUC, copy-number artifact), method-appropriate hit calling across MAGeCK RRA/MLE, BAGEL2, drugZ, JACKS, and Chronos, cancer-cell-line copy-number correction (CRISPRcleanR / Chronos), batch correction for multi-batch screens, and the specialized branches for combinatorial paralog screens, single-cell Perturb-seq, base-editor variant-function screens, prime-editor screens, and in vivo bottleneck-aware screens. Use when analyzing any pooled CRISPR screen end-to-end, choosing the correct hit-calling method by experimental design, integrating copy-number correction into the pipeline, or branching the workflow for single-cell, combinatorial, base-editor, prime-editor, or in vivo variants.