differential-expression/de-results/SKILL.md
Extracts, filters, annotates, and exports differential expression results from DESeq2 or edgeR with proper handling of padj=NA (independent filtering, Cook's outliers, all-zero), multiple-testing correction choice (BH vs Storey q-value vs IHW vs lfsr), TREAT vs post-hoc fold-change filtering, p-value histogram diagnostics, gene annotation via org.db/biomaRt/mygene, GSEA preranked input, ORA background construction, replication reality (Schurch 2016 small-n result), and SABV/sex-stratified reporting. Use when extracting and interpreting DE results, troubleshooting padj=NA, choosing FDR method, preparing ranked lists for pathway analysis, annotating gene IDs, or comparing DESeq2 vs edgeR outputs.
npx skillsauth add GPTomics/bioSkills bio-differential-expression-de-resultsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: DESeq2 1.42+, edgeR 4.0+, IHW 1.34+, qvalue 2.34+, ashr 2.2+, AnnotationDbi 1.66+, org.Hs.eg.db 3.18+, biomaRt 2.58+, mygene 1.38+ (Python), dplyr 1.1+, openxlsx 4.2+
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_name to verify parameterspip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"What are my significant genes?" -> Extract DE estimates and p-values from the fitted model, handle missing padj correctly, apply FDR control appropriate to the design, and produce the table or ranked list the downstream tool actually needs.
padj = NA has three distinct meaningsA NA in the padj column is not a missing value; it is a flag indicating which filter excluded the gene. The three causes -- independent filtering, Cook's distance outlier, and all-zero in a group -- have completely different remediations. Dropping all NA rows blindly silently discards real signal, most often from low-count master regulators (transcription factors expressed at ~10 counts) that pass biology but fail the data-driven baseMean threshold.
| padj = NA cause | DESeq2 detection | What it means | Fix if undesired |
|-------------------|------------------|---------------|-------------------|
| Independent filtering | finite pvalue, NA padj, baseMean below auto threshold | Removed before BH adjustment to maximize rejections at alpha | results(dds, independentFiltering = FALSE) OR filterFun = ihw |
| Cook's distance outlier | NA pvalue, NA padj, baseMean > 0, group has >=3 reps | One sample has Cook's > qf(0.99, p, m-p) | results(dds, cooksCutoff = FALSE) |
| All-zero or near-zero in a group | NA pvalue AND baseMean very low | Insufficient information to test | Filter at preprocess time; or accept |
Independent filtering (Bourgon, Gentleman, Huber 2010 PNAS 107:9546) chooses the baseMean threshold to maximize rejections. The filter MUST be independent of the test statistic under the null -- this is why baseMean (the across-sample mean) is the canonical choice. Using "min count in treatment group" as a filter VIOLATES the independence requirement and inflates type-I error. Most pipelines unknowingly do this; do not.
A second axis: at n>=7 per group, DESeq() also REPLACES outlier counts via replaceOutliers() and refits (default minReplicatesForReplace = 7). Cook's filtering is NOT computed for continuous covariates -- a continuous-covariate analysis has effectively no outlier filtering.
| Method | What it computes | When to use | Failure mode |
|--------|------------------|-------------|--------------|
| BH (p.adjust(method='BH'), DESeq2 default pAdjustMethod='BH') | FDR at fixed alpha; Benjamini-Hochberg 1995 | Default for most RNA-seq DE | Assumes independence or PRDS; many overlapping tests violate |
| Storey q-value (qvalue::qvalue) | q-value using estimated pi_0 | Genome-scale with many true nulls | Pi_0 estimation can fail at small test counts |
| IHW (results(filterFun=ihw), Ignatiadis 2016 Nat Methods 13:577) | Weighted BH with covariate-informed weights | Modern default for DESeq2; +5-20% discoveries at same FDR | Covariate MUST be independent under null |
| ashr local false sign rate (lfsr / svalue, with svalue=TRUE) | P(sign of estimate is wrong) | When effect-direction certainty is what matters | Conservative lower bound on FDR; not interchangeable with padj |
| BY (p.adjust(method='BY')) | Benjamini-Yekutieli; arbitrary dependence | Strongly correlated tests | Uniformly conservative; rarely needed for DE |
| Holm / Bonferroni | FWER | Small confirmatory test sets | Far too conservative for genome-scale |
| TREAT / lfcThreshold= | FDR for "|LFC| > tau" hypothesis | Pre-specified biologically meaningful threshold | tau must be set BEFORE looking at data |
| Scenario | Recommended approach | Why |
|----------|---------------------|-----|
| Standard bulk DE, two groups | DESeq2 results with default BH; report padj < 0.05 | Default works |
| Want more power at same FDR | results(dds, filterFun = ihw) | IHW typically gains 5-20% |
| Pre-specified biological fold-change matters | results(dds, lfcThreshold = log2(1.5), altHypothesis = 'greaterAbs') OR glmTreat(fit, lfc = log2(1.5)) | Post-hoc padj<0.05 & abs(LFC)>1 does NOT control FDR for the magnitude claim |
| Ranking for GSEA preranked | stat (Wald Z) for DESeq2 OR shrunken LFC | Never use unshrunken LFC -- low-count noise dominates |
| ORA input | Subset by padj<0.05; background = ALL TESTED genes (post-independent-filtering) | Background = "all genes in genome" is wrong; pre-filtering already excluded many |
| Many NA padj including biologically interesting genes | Diagnose: independent filtering vs Cook's vs all-zero; turn off the offending filter only for that gene set | Blanket na.omit discards signal |
| Multi-condition design | LRT for "any change" first; pairwise per-level Wald for effect sizes | LRT padj is omnibus; LRT LFC is one specific coefficient |
| Small n (<=3/group) | Report as exploratory, top hits only | Schurch 2016: tools miss 20-40% of true positives at n=3 |
| Human / mouse with mixed sexes | Include sex as covariate; run sex-stratified sensitivity | SABV mandate; sex effect is real and chromosomal |
| Prokaryotic | Use Prokka/Bakta GFF, KEGG strain code | Ensembl/org.db are eukaryote-only |
Goal: Pull DE estimates and p-values from a fitted DESeq2 or edgeR object into a usable data frame with explicit contrast naming.
Approach: results() (DESeq2) or topTags() (edgeR) with explicit name= / coef=; convert to data.frame; preserve row order if planning to join with annotation.
library(DESeq2)
library(dplyr)
resultsNames(dds)
res <- results(dds, name = 'condition_treated_vs_control', alpha = 0.05)
res_shrunk <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
res_df <- as.data.frame(res)
res_df$gene <- rownames(res_df)
library(edgeR)
tt <- topTags(qlf, n = Inf, sort.by = 'none')$table
tt$gene <- rownames(tt)
sort.by = 'none' in topTags preserves the original gene order -- critical when joining with an annotation table by row index. Default is sort by p-value.
Column name reminder (a recurring cross-tool bug):
| Tool | LFC column | Adjusted p-value column |
|------|------------|------------------------|
| DESeq2 | log2FoldChange | padj |
| edgeR | logFC | FDR |
| limma topTable | logFC | adj.P.Val |
| limma topTreat | logFC | adj.P.Val (post-TREAT) |
Goal: Make a defensible FDR claim about "biologically meaningful fold change" genes.
Approach: Use TREAT or lfcThreshold= to test a magnitude hypothesis with proper FDR control. Post-hoc filtering of padj<0.05 & abs(LFC)>tau does NOT control FDR for the magnitude claim.
res_treat <- results(dds, lfcThreshold = log2(1.5), altHypothesis = 'greaterAbs', alpha = 0.05)
# edgeR equivalent
tr <- glmTreat(fit, coef = 2, lfc = log2(1.5))
What a reviewer is really probing with a "200 genes >2x changed at FDR 5%" claim: is the FDR for the change>2x claim or for the change-non-zero claim? Post-hoc filtering controls FDR only for the latter. TREAT (or lfcThreshold=) controls FDR for the former. McCarthy & Smyth 2009 Bioinformatics 25:765 is the canonical citation.
Goal: Gain 5-20% more discoveries at the same FDR by weighting p-values with a covariate (typically baseMean) that informs power but is independent of the null.
Approach: results(dds, filterFun = ihw) replaces independent filtering with Ignatiadis 2016 hypothesis weighting.
library(IHW)
res_ihw <- results(dds, filterFun = ihw, alpha = 0.05)
When IHW does NOT help:
Storey q-value as an alternative (different framework -- estimates pi_0 fraction of true nulls):
library(qvalue)
qv <- qvalue(res$pvalue[!is.na(res$pvalue)])
res$qvalue <- NA
res$qvalue[!is.na(res$pvalue)] <- qv$qvalues
ashr lfsr (local false sign rate -- probability the estimated direction is wrong):
res_ashr <- lfcShrink(dds, coef = 'condition_treated_vs_control',
type = 'ashr', svalue = TRUE)
res_ashr$svalue # FDR-like, based on lfsr; requires svalue=TRUE
svalue=TRUE is required to populate the svalue column; the default returns the standard pvalue/padj columns only. lfsr and padj are NOT interchangeable. lfsr asks "P(sign wrong)"; padj asks "expected fraction of false discoveries". When reporting, state which.
Goal: Diagnose model misspecification, hidden batch effects, or over-correction by inspecting the raw p-value distribution.
Approach: Plot raw p-values; under a correctly specified null, the histogram is uniform with an upward spike near zero (the true DE genes).
library(ggplot2)
ggplot(res_df, aes(x = pvalue)) +
geom_histogram(bins = 50, fill = 'steelblue', color = 'white') +
labs(x = 'P-value', y = 'Frequency', title = 'P-value distribution') +
theme_bw()
| Shape | Meaning | Action |
|-------|---------|--------|
| Uniform + spike near 0 | Correct: null genes uniform, true DE near 0 | Proceed |
| Anti-conservative (U-shape; both ends spiked) | Hidden batch effect, unmodeled confounder, dispersion misspecified | Inspect PCA for batch; add covariate; check plotDispEsts |
| Conservative (depleted near 0, spike near 1) | Over-correction; too many covariates; wrong dispersion | Simplify model; check dispersion plot for excess shrinkage |
| Spike only at p = 1 | Discrete artifact from very-low-count genes | Pre-filter more aggressively |
| Bimodal with spike at 0.5 | Unusual; suggests a discrete categorical test masquerading | Investigate |
The histogram is one of the cheapest sanity checks in a DE pipeline; always plot it before believing the gene list.
Goal: Subset to significant genes and rank by p-value, fold change, or expression level for downstream use.
Approach: dplyr-style filter + arrange; handle NA padj explicitly per the three-meanings table at the top.
sig <- res_df %>%
filter(!is.na(padj), padj < 0.05, abs(log2FoldChange) > 1, baseMean > 10) %>%
arrange(padj)
# Up- vs down-regulated
up <- sig %>% filter(log2FoldChange > 0)
down <- sig %>% filter(log2FoldChange < 0)
# Summary
n_tested <- sum(!is.na(res$padj))
n_sig <- sum(res$padj < 0.05, na.rm = TRUE)
cat(sprintf('Tested: %d Significant (padj<0.05): %d Up: %d Down: %d\n',
n_tested, n_sig, sum(sig$log2FoldChange > 0), sum(sig$log2FoldChange < 0)))
Goal: Map gene IDs to symbols, descriptions, and cross-database identifiers for human-readable results.
Approach: Prefer AnnotationDbi::mapIds with org.db (fast, local, version-pinned); fall back to biomaRt or mygene for symbols/aliases not in org.db; for prokaryotes, use Prokka/Bakta GFF.
library(org.Hs.eg.db)
library(AnnotationDbi)
res_df$symbol <- mapIds(org.Hs.eg.db, keys = sub('\\..*', '', res_df$gene),
keytype = 'ENSEMBL', column = 'SYMBOL', multiVals = 'first')
res_df$entrez <- mapIds(org.Hs.eg.db, keys = sub('\\..*', '', res_df$gene),
keytype = 'ENSEMBL', column = 'ENTREZID', multiVals = 'first')
The sub('\\..*', '', ...) strips the Ensembl version. CAUTION: this regex destroys the _PAR_Y suffix in GENCODE 25-43 PAR genes -- use sub('\\.[0-9]+(_PAR_Y)?$', '\\1', ...) to preserve. See expression-matrix/gene-id-mapping for full details.
For HGNC symbols changed since 2020 (SEPT1 -> SEPTIN1, MARCH1 -> MARCHF1, MARC1 -> MTARC1, DEC1 -> DELEC1) old symbol-keyed downstream tools silently drop genes. Always join on stable Ensembl or Entrez IDs; use symbols as display labels only.
For prokaryotes:
library(rtracklayer)
gff <- import('annotation.gff3')
gene_info <- as.data.frame(gff[gff$type == 'gene',
c('locus_tag', 'Name', 'product')])
res_annotated <- merge(res_df, gene_info, by.x = 'gene',
by.y = 'locus_tag', all.x = TRUE)
Goal: Produce a ranked list of all genes (no significance filter) for fgsea / clusterProfiler GSEA.
Approach: Rank by Wald statistic (DESeq2 stat) or shrunken LFC. NEVER use a filtered set as GSEA input -- GSEA's permutation null requires the full background.
gsea_ranks <- res_df$stat
names(gsea_ranks) <- res_df$gene
gsea_ranks <- sort(gsea_ranks[!is.na(gsea_ranks)], decreasing = TRUE)
# edgeR equivalent
gsea_ranks_edger <- sign(tt$logFC) * -log10(tt$PValue)
names(gsea_ranks_edger) <- rownames(tt)
gsea_ranks_edger <- sort(gsea_ranks_edger[is.finite(gsea_ranks_edger)],
decreasing = TRUE)
stat (Wald Z) is preferred over raw LFC for GSEA because it combines effect and precision in one number. Unshrunken LFC is dominated by low-count noise.
Goal: Run over-representation analysis (enrichGO, enrichKEGG) on a significant gene list with the correct background.
Approach: Subset to padj<0.05; background = ALL TESTED genes (post-independent-filtering), NOT the genome.
library(clusterProfiler)
sig_entrez <- na.omit(res_df$entrez[res_df$padj < 0.05])
bg_entrez <- na.omit(res_df$entrez[!is.na(res_df$padj)])
ora <- enrichGO(gene = sig_entrez,
universe = bg_entrez,
OrgDb = org.Hs.eg.db,
keyType = 'ENTREZID',
ont = 'BP',
pAdjustMethod = 'BH')
Common mistake: omitting universe= lets clusterProfiler default to "all annotated genes for this organism" -- which includes thousands of genes never tested. The resulting enrichment p-values are wrong (too small). The background MUST be the tested set.
deseq2_sig <- rownames(subset(deseq2_res, padj < 0.05))
edger_sig <- rownames(subset(edger_tt, FDR < 0.05))
common <- intersect(deseq2_sig, edger_sig)
deseq2_only <- setdiff(deseq2_sig, edger_sig)
edger_only <- setdiff(edger_sig, deseq2_sig)
cat(sprintf('DESeq2 sig: %d edgeR sig: %d Common: %d (%.1f%%)\n',
length(deseq2_sig), length(edger_sig), length(common),
100 * length(common) / min(length(deseq2_sig), length(edger_sig))))
Concordance >70% at the top 500: robust. <60%: suspect filtering, normalization, or design difference -- not a tool difference. Run both pipelines with the same filtering and design to isolate.
Trigger: Pipeline does res_df <- na.omit(res_df); downstream gene of interest is missing from results.
Mechanism: Gene was flagged by independent filtering OR Cook's distance; padj is NA but the biology is real.
Symptom: A gene with clear differential expression in the count matrix is absent from the results table.
Fix: Diagnose which filter fired (independent filtering vs Cook's vs all-zero); rerun results() with the appropriate filter off (independentFiltering = FALSE or cooksCutoff = FALSE).
Trigger: Methods section says "genes with padj < 0.05 and abs(LFC) > 1 (FDR < 5%)".
Mechanism: BH controls FDR for the |LFC| > 0 hypothesis, not the |LFC| > 1 hypothesis. The post-hoc filter adds no FDR control.
Symptom: Reviewer challenges the FDR claim; replication studies show many of the filtered genes are not the magnitude expected.
Fix: Use TREAT (glmTreat) or lfcThreshold= to test the magnitude hypothesis with proper FDR control. Re-do the methods sentence to match what was actually computed.
Trigger: ORA p-values look implausibly small for a small significant gene set.
Mechanism: universe= argument omitted; clusterProfiler defaulted to all annotated genes in the organism, including thousands never in the tested set.
Symptom: Many enriched pathways at strict thresholds; results don't replicate; reviewer questions the background.
Fix: Explicitly pass universe = bg_entrez where bg_entrez is the set of tested gene IDs (i.e., those with non-NA padj).
Trigger: Small RNA-seq study finds 200 DE genes; validation in independent cohort recovers 60.
Mechanism: Schurch 2016 RNA 22:839: at n=3/group, all tools miss 20-40% of true positives compared to n=30. Variability of the gene list itself is high.
Symptom: 30-50% replication of the gene list across independent runs of the SAME data.
Fix: Frame the small-n DE list as hypothesis-generating, not as a stable set of facts. Validate top hits orthogonally before drawing conclusions. Use TREAT for biologically meaningful thresholds to require larger effects.
Trigger: Mixed-sex cohort; sex not in the design; many chrY genes call as DE.
Mechanism: Sex distribution differs across the experimental groups; the "treatment effect" partially captures sex.
Symptom: chrY genes (DDX3Y, RPS4Y1, UTY) and XIST dominate the top DE list.
Fix: Include sex in the design (~ sex + condition); rerun. For chrX/chrY-specific analyses, sex MUST be in the model or the analysis is uninterpretable. Mauvais-Jarvis et al. 2020 Lancet 396:565 reviews the SABV requirement.
| Error / symptom | Cause | Fix |
|-----------------|-------|-----|
| $FDR not found on DESeq2 result | DESeq2 uses padj; edgeR uses FDR | Check tool, use correct column |
| summary(res) shows different cutoff than results(alpha=) | summary(res, alpha=) defaults to 0.1 | Pass alpha explicitly to summary() |
| All padj NA | All genes filtered (rare; usually a data problem) | Check independentFilteringResults(res); inspect baseMean distribution |
| Direction of LFC reversed | Reference level not set; alphabetical default | relevel() BEFORE DESeq() |
| Gene symbol mapping rate <50% | Mixed Ensembl versions; recent HGNC renames | Verify Ensembl release, check for SEPT/MARCH/MARC renames |
| enrichGO reports thousands of pathways | Wrong universe= | Pass universe = bg_entrez (tested set, not genome) |
stat or shrunken LFCtools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.