pathway-analysis/enrichment-foundations/SKILL.md
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
Reference examples tested with: clusterProfiler 4.18+, org.Hs.eg.db 3.22+.
Before using code patterns, verify installed versions match. If versions differ:
- packageVersion('<pkg>') then ?function_name to verify parameters

If code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
This skill is the conceptual spine for the category; it routes the per-tool calls to the sibling skills rather than running them, so the binding versions are the clusterProfiler stack the siblings share. Note that KEGG and WikiPathways query LIVE databases (internet-dependent, not reproducible across data releases) while GO and Reactome use local version-pinned annotation - record the database release and access date for any result.
"Which pathways are enriched here?" -> Pick the enrichment GENERATION from the input shape, make the null explicit, and run the trustworthiness checklist - because an enrichment result is not a discovery, it is a claim about a gene list versus a specific database version under a null that was probably not chosen on purpose.
- ORA (enrichGO/enrichKEGG) vs GSEA (gseGO/GSEA): choose by whether a ranking exists, then defend the background, FDR, and leading edge

Scope: the meta-decision - which generation (ORA/FCS/topology) is valid, which null each tool computes, which database fits the question, and whether the result survives the checklist. ORA mechanics -> go-enrichment. GSEA mechanics, ranking metric, CAMERA/ROAST -> gsea. Per-database content, IDs, topology -> kegg-pathways, reactome-pathways, wikipathways. Redundancy collapse and plots -> enrichment-visualization. The gene list / ranking statistic itself -> differential-expression/de-results.
The same gene list, run against a different database version, with a different background universe, under a competitive instead of a self-contained null, with gene-permutation instead of phenotype-permutation, gives a different list of "enriched pathways" - and all of them can be technically valid p-values, because they answer DIFFERENT questions (Goeman & Buhlmann 2007 Bioinformatics 23:980). The decision-grade posture treats the deliverable as a defended claim, not a printout, and three facts drive every choice:
The classification every method falls into, set by the available input (Khatri 2012 PLoS Comput Biol 8:e1002375).
| Generation | Input required | Question / mechanism | Owns / route to |
|------------|----------------|----------------------|-----------------|
| 1st - Over-Representation (ORA) | a gene LIST + a background universe | is set S over-represented among the hits more than chance? hypergeometric / Fisher on a 2x2 table | enrichGO/enrichKEGG/enrichPathway/enrichWP -> go-enrichment, kegg-pathways, reactome-pathways, wikipathways |
| 2nd - Functional Class Scoring (FCS) | a per-gene STATISTIC for (nearly) all genes, no cutoff | are set S genes systematically shifted along the ranking? running-sum ES + permutation | gseGO/gseKEGG/GSEA/fgsea -> gsea |
| 3rd - Pathway Topology (PT) | ranked list + a curated signed/directed graph | is the pathway IMPACTED given where in the cascade the change lands? perturbation propagation | SPIA/graphite (KEGG/Reactome, model organisms) -> kegg-pathways |
ORA throws away magnitude (binarizes at a cutoff), assumes gene independence, and is acutely sensitive to the background. FCS fixes the first two: it uses the full ranking and detects coordinated WEAK signals ORA misses. PT is the most biologically faithful and the least general (needs a curated directed graph). The agent's first move is to name the generation from the input, never from preference.
| Source / method | Citation | Mechanism / role | When |
|-----------------|----------|------------------|------|
| GO ORA (enrichGO) | Ashburner 2000 Nat Genet 25:25; Wu 2021 The Innovation 2:100141 | hypergeometric vs the GO DAG; local OrgDb | function annotation of a gene LIST; broadest coverage |
| Preranked GSEA (gseGO/GSEA) | Subramanian 2005 PNAS 102:15545 | running-sum ES over a ranked vector; gene permutation | all genes carry a statistic; no cutoff; subtle coordinated shifts |
| fgsea engine | (engine behind gseGO/gseKEGG) | fast preranked permutation | the default engine; preranked = gene-sampling |
| CAMERA | Wu & Smyth 2012 Nucleic Acids Res 40:e133 | competitive test that estimates and corrects inter-gene correlation (VIF) | a CALIBRATED competitive test when gene-sampling anti-conservatism matters -> gsea |
| ROAST / fry | (limma) | self-contained rotation test; reports directional vs mixed | "is this set involved at all" on the real matrix -> gsea |
| GSVA / ssGSEA | (transform, not a test) | per-sample pathway activity score | features for clustering/modeling; never a p-value -> gsea |
| GOseq | Young 2010 Genome Biol 11:R14 | length-bias-corrected ORA (Wallenius null) | RNA-seq with gene-length selection bias -> go-enrichment |
| SPIA / graphite | Tarca 2009 Bioinformatics 25:75 | over-representation + perturbation on the signed KEGG graph | signaling directionality / causality -> kegg-pathways |
| MSigDB (H/C2/C5) | Liberzon 2015 Cell Syst 1:417; Subramanian 2005 PNAS 102:15545 | curated gene-set collections fed via TERM2GENE | Hallmark H is the low-redundancy default; C2:CP/C5 double-count KEGG/GO -> gsea |
| Scenario | Recommended | Why |
|----------|-------------|-----|
| A ranked statistic (t, signed -log10 p, shrunken log2FC) for nearly all genes | GSEA (gseGO/gseKEGG/GSEA) -> gsea | uses the full ranking; no arbitrary cutoff; catches coordinated weak signals |
| A pre-selected list (co-expression module, GWAS loci, screen hits) + a defensible background | ORA (enrichGO/enrichKEGG) -> go-enrichment, kegg-pathways | no ranking exists; binarized membership is all that is available |
| Broad function annotation | enrichGO / gseGO -> go-enrichment, gsea | GO is the broadest local resource |
| Metabolic / signaling pathway maps | enrichKEGG / gseKEGG -> kegg-pathways | KEGG pathway maps (live DB) |
| Reaction-level, peer-reviewed, reproducible offline | enrichPathway -> reactome-pathways | local version-pinned reactome.db |
| Community-curated, broad species coverage | enrichWP -> wikipathways | versioned GMT; variable curation depth |
| A signed signaling topology + fold changes, human/mouse | SPIA / graphite -> kegg-pathways | directionality and cascade position, not just membership |
| RNA-seq with strong gene-length bias | GOseq -> go-enrichment | length-aware null |
| Per-sample pathway activity SCORE (a model feature, not a test) | ssGSEA / GSVA -> gsea | a transform; its scores are not enrichment p-values |
| A calibrated competitive test (correlation-aware) on the expression matrix | CAMERA -> gsea | corrects the gene-sampling anti-conservatism |
| The DE list / ranking statistic itself | -> differential-expression/de-results | that is upstream, not enrichment |
| FDR / p.adjust method theory | -> experimental-design/multiple-testing | the rationale for BH vs Bonferroni lives there |
Default when uncertain: if a ranking exists for almost all genes, run GSEA; otherwise run ORA with the testable-gene universe, FDR-correct, collapse redundancy before interpreting, and inspect the leading-edge genes.
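The routing rule above (name the generation from the input shape, never from preference) can be sketched as a small decision function. This is an illustrative sketch in Python, not part of clusterProfiler; the class and function names are hypothetical, and the per-tool R calls stay with the sibling skills.

```python
from dataclasses import dataclass


@dataclass
class EnrichmentInput:
    """Hypothetical description of what the analyst actually has in hand."""
    has_ranked_statistic: bool = False  # per-gene statistic for (nearly) all genes
    has_gene_list: bool = False         # pre-selected hits (module, screen hits, ...)
    has_background: bool = False        # defensible testable-gene universe
    has_signed_topology: bool = False   # curated signed/directed pathway graph


def choose_generation(x: EnrichmentInput) -> list:
    """Return the generations that are valid for this input, default first."""
    if x.has_ranked_statistic:
        # A full ranking exists: FCS (GSEA) is the default.
        valid = ["FCS"]
        if x.has_signed_topology:
            # Topology methods additionally need the curated graph.
            valid.append("PT")
        return valid
    if x.has_gene_list and x.has_background:
        # No ranking: binarized membership plus a background is all there is.
        return ["ORA"]
    raise ValueError(
        "No valid generation: ORA needs a gene list AND a background universe; "
        "FCS/PT need a per-gene statistic for (nearly) all genes."
    )


print(choose_generation(EnrichmentInput(has_ranked_statistic=True)))
print(choose_generation(EnrichmentInput(has_gene_list=True, has_background=True)))
```

A gene list without a background raises an error on purpose: per the text, the background rather than the list decides ORA significance, so ORA without a defended universe is not a valid request.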
The conceptual core a tutorial skips. Two orthogonal distinctions (Goeman & Buhlmann 2007 Bioinformatics 23:980) classify essentially every method and expose why the common ones are mis-calibrated.
What is being tested (competitive vs self-contained). A SELF-CONTAINED null is "no gene in set S is associated with the phenotype" - S against zero effect, genes outside S irrelevant; rejecting it means "something in S is differential." A COMPETITIVE null is "the genes in S are no more associated than the genes not in S" - S against its complement; rejecting it means "S is MORE enriched than background." These are different scientific questions: a competitive test can be non-significant simply because everything is changing, and a self-contained test can be significant for a set no more interesting than average. State which one the result answers.
Where the randomness comes from (gene-sampling vs subject-sampling). SUBJECT sampling permutes the phenotype labels across samples; each permuted dataset preserves the gene-gene covariance, so this null respects the real correlation structure but needs a sample-level design with adequate replication. GENE sampling treats genes as exchangeable (draw random same-size sets, or permute gene labels); it is the only option with a bare ranked list and no sample data, and it assumes genes are INDEPENDENT.
The load-bearing consequence. A competitive test calibrated by gene sampling has the WRONG variance: the variance of the mean of n positively-correlated scores is (sigma^2/n)[1 + (n-1)*rho-bar], and gene-sampling implicitly sets rho-bar = 0, so it underestimates the variance and the p-values are anti-conservative (too many false positives) exactly on the correlated pathway sets of interest. Every clusterProfiler GSEA (gseGO/gseKEGG/GSEA) and fgsea is preranked = gene-permutation = this uncorrected competitive null. clusterProfiler does NOT expose phenotype permutation, so in this ecosystem the calibrated subject-sampling GSEA is not available - which is acceptable for a discovery screen but must be stated. CAMERA (Wu & Smyth 2012 Nucleic Acids Res 40:e133) estimates the inter-gene correlation and corrects the statistic; ROAST/fry give a self-contained subject-sampling test on the real matrix. Phenotype permutation preserves correlation but needs adequate n.
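The variance formula above can be checked with a few lines of arithmetic. A minimal sketch in Python, assuming n genes with a common variance sigma^2 and a mean pairwise correlation rho_bar; the function names are hypothetical, and the example values (n = 100, rho_bar = 0.05) are illustrative, not taken from any dataset.

```python
import math


def variance_inflation_factor(n: int, rho_bar: float) -> float:
    """The bracketed term 1 + (n - 1) * rho_bar from the text."""
    return 1.0 + (n - 1) * rho_bar


def variance_of_set_mean(sigma2: float, n: int, rho_bar: float) -> float:
    """Variance of the mean of n equally correlated scores."""
    return (sigma2 / n) * variance_inflation_factor(n, rho_bar)


n, sigma2, rho_bar = 100, 1.0, 0.05  # illustrative values only

# Gene sampling implicitly sets rho_bar = 0.
naive = variance_of_set_mean(sigma2, n, rho_bar=0.0)
actual = variance_of_set_mean(sigma2, n, rho_bar=rho_bar)

vif = variance_inflation_factor(n, rho_bar)
print(f"VIF = {vif:.2f}")
print(f"naive variance = {naive:.4f}, correlation-aware variance = {actual:.4f}")
print(f"standard error understated by a factor of {math.sqrt(vif):.2f}")
```

Even a modest mean correlation inflates the variance several-fold for a set of 100 genes, which is why a gene-sampling null that assumes rho_bar = 0 produces anti-conservative p-values on correlated sets.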
Run before reporting any enrichment, in this order. Each item maps to a documented, published failure (Wijesooriya 2022; Reimand 2019 Nat Protoc 14:482).
1. Testable-gene universe: leaving universe= at clusterProfiler's default measures expression bias, not enrichment.
2. FDR: report p.adjust/qvalue, never nominal pvalue. pAdjustMethod='BH' is the clusterProfiler default and is valid under positive dependence.
3. Redundancy collapse: collapse overlapping terms (simplify) or use EnrichmentMap BEFORE interpreting, and report clusters of terms, not the raw count -> enrichment-visualization.
4. Leading-edge check: inspect the genes driving each top set (GSEA core_enrichment, ORA geneID). If the same 3-5 multifunctional hub genes explain the top 20 sets, that is ONE finding wearing twenty pathway costumes, not twenty (Gillis & Pavlidis 2011 PLoS One 6:e17258).
5. Version reporting: record the package and database release, the access date, pAdjustMethod, and the universe. Without this the result is unreproducible.

GO and MSigDB are the cross-cutting collections everything else samples; KEGG/Reactome/WikiPathways are pathway sources owned by their own skills. Route the per-database mechanics OUT; do not re-teach them here.
- MSigDB: retrieve the collections with msigdbr and feed a TERM2GENE frame to GSEA. -> gsea

Never treat "GO enrichment" and "KEGG enrichment" as the same kind of object: GO has thousands of redundant terms (a heavy multiple-testing burden), pathway DBs have hundreds of mechanistic sets.
Trigger: running enrichGO/enrichKEGG with universe= at the default while only ~12k genes were expressed/testable. Mechanism: the hypergeometric p-value is fully set by the denominator; an over-large universe inflates over-representation of any set whose members happen to be expressed in the tissue. Symptom: a long list of tissue-specific terms with tiny p-values that runs without warning. Fix: set universe to the genes that passed the same filter as the DE list.
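The background effect described above can be reproduced with the hypergeometric upper tail alone. A minimal sketch in Python using only the standard library; `ora_pvalue` is a hypothetical helper, not a clusterProfiler function, and all counts are made-up illustrative numbers (a 200-gene hit list with 15 hits in a set, tested against a 20,000-gene genome versus a 12,000-gene testable universe).

```python
from math import comb


def ora_pvalue(k: int, n: int, K: int, N: int) -> float:
    """Upper-tail hypergeometric p-value P(X >= k).

    N: genes in the universe, K: set members inside the universe,
    n: genes in the hit list, k: hits that are set members.
    """
    upper = min(n, K)
    tail = sum(comb(K, i) * comb(N - K, n - i) for i in range(k, upper + 1))
    return tail / comb(N, n)


hits, overlap = 200, 15  # same gene list in both runs (illustrative)

# Whole-genome background: all 100 annotated set members count.
p_genome = ora_pvalue(overlap, hits, K=100, N=20000)

# Testable background: only the 90 set members that passed the expression
# filter count, and the universe shrinks to the testable genes.
p_testable = ora_pvalue(overlap, hits, K=90, N=12000)

print(f"genome universe:   p = {p_genome:.3e}")
print(f"testable universe: p = {p_testable:.3e}")
```

The gene list and the overlap are identical in both calls; only the universe changes, and the whole-genome run reports the smaller p-value because the expected overlap under the null is lower.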
Trigger: reporting a gseGO/fgsea p.adjust as evidence the set is specially involved. Mechanism: preranked GSEA is competitive + gene-sampling, the inter-gene-correlation-uncorrected null with the wrong variance. Symptom: an inflated false-positive rate under permuted phenotypes (Geistlinger 2021 Brief Bioinform 22:545). Fix: treat it as a discovery screen; run CAMERA for a calibrated competitive test, or ROAST/fry for a self-contained one.
Trigger: reading 40 overlapping significant terms as 40 independent findings. Mechanism: GO's true-path rule and pathway overlap mean the same genes drive many sets; BH is valid but controls the FDR of tests, not the redundancy of findings. Symptom: the top terms are obvious parent/child variants of one theme. Fix: collapse with simplify/EnrichmentMap and report term clusters -> enrichment-visualization.
Trigger: a few highly-expressed hub genes (easily called DE) belong to dozens of sets. Mechanism: multifunctionality alone predicts apparent enrichment (Gillis & Pavlidis 2011). Symptom: the same 3-5 genes appear in the core_enrichment/geneID of most top sets. Fix: inspect the leading edge; if a handful of genes explain the top sets, report one finding, not many.
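The leading-edge inspection above amounts to counting how often each gene recurs across the top sets. A minimal sketch in Python, assuming the gene field has been exported as the "/"-separated string that clusterProfiler stores in core_enrichment and geneID; the helper names, the `min_fraction` threshold, and the toy sets are hypothetical.

```python
from collections import Counter


def parse_gene_field(field: str) -> list:
    """Split a '/'-separated core_enrichment or geneID string into genes."""
    return [gene for gene in field.split("/") if gene]


def recurrent_genes(leading_edges: dict, min_fraction: float = 0.5) -> list:
    """Genes present in at least `min_fraction` of the supplied sets."""
    n_sets = len(leading_edges)
    counts = Counter(
        gene for genes in leading_edges.values() for gene in set(genes)
    )
    return sorted(g for g, c in counts.items() if c >= min_fraction * n_sets)


# Toy data: made-up set names and gene fields, for illustration only.
top_sets = {
    "set_A": parse_gene_field("TP53/AKT1/GENE1"),
    "set_B": parse_gene_field("TP53/AKT1/GENE2"),
    "set_C": parse_gene_field("TP53/GENE3"),
    "set_D": parse_gene_field("AKT1/TP53/GENE4"),
}

hubs = recurrent_genes(top_sets, min_fraction=0.75)
print(hubs)  # ['AKT1', 'TP53']
```

If the returned list is short and covers most of the top sets, the text's advice applies: report one finding driven by those genes rather than one finding per set.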
Trigger: "cancer / immune / apoptosis pathways enriched" across unrelated experiments. Mechanism: genes are annotated in proportion to how studied they are, not how important (Haynes 2018 Sci Rep 8:1362), so well-annotated sets light up everywhere. Symptom: the same canonical pathways enrich regardless of the biology. Fix: treat densely-annotated-set enrichment as the field's null result; weight specific, less-studied hits and confirm with the leading edge.
Trigger: no record of the GO/KEGG/Reactome release or access date. Mechanism: annotations evolve between releases and the interpretation drifts (Tomczak 2018). Symptom: the same code returns different pathways next year and the paper cannot be reproduced. Fix: record package + database release + access date in a provenance block.
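One way to make the provenance block concrete is a small record written alongside the results. A sketch in Python under stated assumptions: the field names and the `provenance_block` helper are hypothetical, the package versions are placeholders to be replaced with the output of packageVersion(), and the database release string must be filled in from the actual annotation source.

```python
import json
from datetime import date


def provenance_block(packages, database, release, p_adjust_method,
                     universe_size, access_date=None):
    """Assemble the fields the text says must accompany any enrichment result."""
    accessed = access_date if access_date is not None else date.today()
    return {
        "packages": dict(packages),
        "database": database,
        "database_release": release,
        "access_date": accessed.isoformat(),
        "p_adjust_method": p_adjust_method,
        "universe_size": universe_size,
    }


block = provenance_block(
    packages={"clusterProfiler": "PLACEHOLDER", "org.Hs.eg.db": "PLACEHOLDER"},
    database="GO",
    release="PLACEHOLDER: record the release reported by the annotation source",
    p_adjust_method="BH",
    universe_size=12000,             # illustrative testable-gene count
    access_date=date(2025, 1, 15),   # illustrative date; defaults to today
)
print(json.dumps(block, indent=2))
```

Storing the universe size and adjustment method next to the database release covers the items the checklist names, so the same call can be rerun and compared later.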
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| Universe = the testable genes (same filter as the DE list), not the genome | Wijesooriya 2022 PLoS Comput Biol 18:e1009935 | the ORA p-value is fully determined by the denominator; the genome inflates it |
| pvalueCutoff=0.05 (filters on p.adjust by default) | clusterProfiler default | standard FDR gate; the enrichResult cutoff applies to the adjusted p, not nominal |
| qvalueCutoff=0.2 | clusterProfiler default | secondary q-value gate on top of p.adjust |
| pAdjustMethod='BH' | clusterProfiler default; Benjamini-Hochberg | valid under positive dependence (overlapping sets); Bonferroni is over-conservative here |
| minGSSize=10, maxGSSize=500 | enrichGO/gseGO defaults | drop tiny sets that overfit and huge sets that always enrich |
| GSEA permutation seed fixed (e.g. set.seed(123)) | reproducibility | permutation p-values drift run-to-run without a fixed seed |
| Leading-edge concentration: if 3-5 genes explain the top ~20 sets | Gillis & Pavlidis 2011 PLoS One 6:e17258 | a multifunctionality artifact, not independent findings |
| No single method is best | Tarca 2013 PLoS One 8:e79217; Geistlinger 2021 Brief Bioinform 22:545 | benchmarks find no method dominates sensitivity + prioritization + specificity; do not claim one beats another by a number |
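The BH row above is easier to reason about with the adjustment written out. A minimal sketch of the Benjamini-Hochberg step-up adjustment in plain Python; `bh_adjust` is a hypothetical helper written for illustration and is not the code clusterProfiler runs, and the four p-values are made up.

```python
def bh_adjust(pvalues):
    """Benjamini-Hochberg adjusted p-values, returned in the input order."""
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value down, enforcing monotonicity.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


nominal = [0.01, 0.04, 0.03, 0.005]  # illustrative nominal p-values
adjusted = bh_adjust(nominal)

for p, q in zip(nominal, adjusted):
    print(f"nominal {p:.3f} -> BH-adjusted {q:.3f}")
```

The adjusted values, not the nominal ones, are what the 0.05 cutoff in the table is compared against.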
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Implausibly many tissue-specific terms enriched | whole-genome background | set universe to the testable genes |
| GSEA hit fails to replicate | preranked gene-sampling null (discovery screen) | confirm with CAMERA/ROAST; report which null was run |
| 40 significant GO terms, all variants of one theme | GO redundancy read as replication | simplify/EnrichmentMap; report clusters |
| Top 20 sets all driven by the same few genes | multifunctional hub genes | inspect leading edge; report one finding |
| Same canonical pathways enrich regardless of biology | annotation/study bias | weight specific hits; treat well-studied sets as the null |
| Result not reproducible next year | live DB changed / version unrecorded | pin and report database release + access date |
| Reporting a ssGSEA/GSVA score as a p-value | a transform, not a hypothesis test | use the scores as model features only |