differential-expression/deseq2-basics/SKILL.md
Performs differential expression on bulk RNA-seq count data with DESeq2's negative-binomial GLM, Wald and LRT testing, apeglm/ashr/normal LFC shrinkage, independent filtering, Cook's outlier handling, VST/rlog transforms, and design formulas including paired, batch, and interaction terms. Use when running bulk DE, choosing DESeq2 over edgeR or limma-voom, building a paired or interaction design, applying LFC shrinkage for ranking or GSEA, choosing Wald vs LRT, troubleshooting padj=NA, picking VST vs rlog, importing salmon/kallisto via tximport, or analyzing prokaryotic RNA-seq.
npx skillsauth add GPTomics/bioSkills bio-differential-expression-deseq2-basicsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: DESeq2 1.42+, apeglm 1.28+, ashr 2.2+, IHW 1.34+, tximport 1.30+, edgeR 4.0+ (for cross-comparison), PyDESeq2 0.5+
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_name to verify parameterspip show <package> then help(module.function) to check signaturesIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Find differentially expressed genes between conditions" -> Fit a negative-binomial GLM per gene with shared dispersion shrinkage, test the coefficient of interest (Wald) or the joint effect of a factor (LRT), and report a shrunken effect-size estimate for ranking.
lfcShrink() returns LFCs from a Bayesian posterior with apeglm/ashr/normal priors, BUT the p-value column it carries forward is still the unshrunken Wald p-value from results(). This is a deliberate design choice (Zhu, Ibrahim, Love 2019 Bioinformatics 35:2084) -- the shrunken estimate is for ranking and visualization; the p-value is for inference. Reporting "shrunken LFC = 0.4, padj = 1e-8" mixes two models, which is fine because both are correct for their stated purpose. What is NOT fine: using the shrunken LFC in a downstream filter and then claiming FDR control on that filter (it has none). For threshold-based FDR claims, use lfcThreshold= or TREAT (glmTreat in edgeR).
A second consequence: results(dds) with no name= or contrast= argument silently returns the last coefficient in resultsNames(dds) -- which depends on factor level order and design formula order. Always specify the contrast explicitly. Tutorials that hard-code results(dds) are setting an example that breaks the moment another factor is added.
| Test / estimator | What it tests | When mandatory | Failure mode |
|------------------|---------------|----------------|--------------|
| Wald | One coefficient = 0 | Two-level factor or single contrast | Anti-conservative with many low-count outliers |
| LRT (test='LRT', reduced=) | Joint effect of dropped terms (>=1 df) | Multi-level factor, omnibus, interaction with >1 df | Reports LFC of the LAST coefficient, not omnibus -- read p-value but never report the LFC as "the effect" |
| lfcShrink(type='apeglm') (Zhu 2019) | Posterior LFC under heavy-tailed Cauchy prior | DEFAULT for ranking and visualization | Requires coef=; cannot use contrast= or numeric vectors |
| lfcShrink(type='ashr') (Stephens 2017) | Posterior LFC under unimodal prior; reports lfsr/svalue with svalue=TRUE | Arbitrary contrasts via contrast= | Slightly different inferential frame (sign-error rather than null-FDR) |
| lfcShrink(type='normal') | Posterior LFC under zero-centered normal; accepts coef= or contrast= (with res=) | Quasi-deprecated since v1.16; only path to get shrunken p-values | Cannot be used with formulas containing interaction terms |
| TREAT / lfcThreshold= (McCarthy & Smyth 2009) | LFC magnitude exceeds threshold tau | Want FDR control for "|LFC| > 1.5x" claims | Conservative; use only when threshold is biologically pre-specified |
| Scenario | Recommended path | Why |
|----------|------------------|-----|
| Two-group bulk RNA-seq, n>=3/group | DESeq() + results(name=...) + lfcShrink(coef=..., type='apeglm') | Modern default; apeglm is the right prior |
| Factor with 3+ levels, "any change" question | DESeq(test='LRT', reduced=~1); read padj only | Wald + p-value combining is wrong |
| Interaction ~ genotype * treatment | Build combined factor group = paste(genotype, treatment) and design ~ 0 + group; contrast pairs of interest | Avoids the resultsNames trap; works with apeglm via relevel |
| Paired design (tumor/normal same patient) | ~ patient + tissue; pairing variable FIRST | Absorbs subject variability; n_paired effective sample size |
| Salmon/kallisto input | tximport() -> DESeqDataSetFromTximport() | Carries length offsets automatically; DESeqDataSetFromMatrix(round(...)) loses length correction |
| n=2/group, no choice | Continue but report results as exploratory; consider edgeR QL F-test as sensitivity | Schurch 2016 RNA 22:839: all tools miss 20-40% of true positives at n=3 |
| Single-cell pseudobulk (counts aggregated per donor) | DESeq2 standard pipeline on pseudobulk matrix | Crowell 2020 Nat Commun 11:6077: pseudobulk avoids the FDR inflation of cell-level DE |
| Many DE genes expected (>50% of genome) | estimateSizeFactors(controlGenes=stable) or spike-in normalization | Median-of-ratios assumes most genes unchanged |
| GSEA preranked input | Shrunken LFC OR stat (Wald Z) as the rank | Unshrunken LFC dominated by low-count noise |
| Cross-sample heatmap, PCA, ML feature | vst(dds, blind=FALSE) (or rlog if n<30 and library sizes vary >4x) | Raw counts make PC1 = library size |
Goal: Take a raw integer count matrix and a sample table to a ranked, shrunken DE result table.
Approach: Construct DESeqDataSet with the design formula, set reference levels explicitly, run the pipeline, extract by explicit contrast, shrink for downstream use.
library(DESeq2)
library(apeglm)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata, design = ~ condition)
dds$condition <- relevel(dds$condition, ref = 'control')
keep <- rowSums(counts(dds)) >= 10
dds <- dds[keep, ]
dds <- DESeq(dds)
resultsNames(dds)
res <- results(dds, name = 'condition_treated_vs_control', alpha = 0.05)
res_shrunk <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
summary(res)
sig <- subset(res, padj < 0.05)
The reference level fix is non-cosmetic: DESeq2 picks alphabetically if not told otherwise, so c('Treated','Untreated') makes 'Treated' the reference and the LFC reads inverted. Set it BEFORE DESeq().
For salmon/kallisto/RSEM input, use DESeqDataSetFromTximport() which carries the per-sample length matrix as an offset automatically:
library(tximport)
txi <- tximport(files, type = 'salmon', tx2gene = tx2gene)
dds <- DESeqDataSetFromTximport(txi, colData = samples, design = ~ condition)
dds <- DESeq(dds)
The tximport(..., countsFromAbundance='lengthScaledTPM') form is for limma-voom (no offset mechanism). Full mechanics, the four countsFromAbundance options, RSEM zero-length traps, and tximeta provenance: see expression-matrix/counts-ingest.
Goal: Encode batch, paired, and interaction structure correctly and extract the intended contrast.
Approach: Put the variable of interest LAST for readability, but never trust the default results(dds) -- inspect resultsNames(dds) and pass name= or contrast= explicitly.
design(dds) <- ~ batch + condition
dds <- DESeq(dds)
resultsNames(dds)
# "Intercept" "batch_B_vs_A" "condition_treated_vs_control"
res <- results(dds, name = 'condition_treated_vs_control')
Interaction design with the canonical trap:
design(dds) <- ~ genotype + treatment + genotype:treatment
dds <- DESeq(dds)
resultsNames(dds)
# "Intercept" "genotype_KO_vs_WT" "treatment_drug_vs_vehicle" "genotypeKO.treatmentdrug"
results(dds, name='treatment_drug_vs_vehicle') returns the drug effect IN THE WT REFERENCE only, NOT a marginal average. This is the single most common misinterpretation.results(dds, name='genotypeKO.treatmentdrug') returns the DIFFERENCE in drug effect between KO and WT.results(dds, contrast=list(c('treatment_drug_vs_vehicle','genotypeKO.treatmentdrug'))).Cleaner alternative for interactions when many pairwise contrasts are needed: combined factor + ~ 0 + group.
dds$group <- factor(paste(dds$genotype, dds$treatment, sep = '_'))
design(dds) <- ~ 0 + group
dds <- DESeq(dds)
res_drug_in_ko <- results(dds, contrast = c('group', 'KO_drug', 'KO_vehicle'))
Goal: Choose Wald for single-coefficient hypotheses, LRT for joint hypotheses involving more than 1 df.
Approach: Wald is default. LRT is mandatory for multi-level factors tested as "any change", interactions with >1 df, and ANOVA-style omnibus tests. With LRT, the reported LFC is for the LAST coefficient in resultsNames(dds) -- not the omnibus effect.
dds <- DESeq(dds, test = 'LRT', reduced = ~ batch)
res_lrt <- results(dds)
# res_lrt$padj is the LRT joint p-value (correct).
# res_lrt$log2FoldChange is for the last coefficient in resultsNames -- NOT an omnibus summary.
When reporting LRT results, name what the LFC actually represents or extract specific Wald coefficients per level for the effect-size table.
Goal: Get a stable effect-size estimate appropriate for ranking, GSEA input, volcano-plot x-axis, and reporting.
Approach: Default to apeglm. Switch to ashr when arbitrary contrasts are needed. Never use normal for new analyses.
res_apeglm <- lfcShrink(dds, coef = 'condition_treated_vs_control', type = 'apeglm')
res_ashr <- lfcShrink(dds, contrast = c('condition','treated','control'), type = 'ashr')
| Method | Prior | Accepts | Use when |
|--------|-------|---------|----------|
| apeglm | Cauchy (heavy-tailed) | coef= only | Default; preserves large effects, suppresses low-count noise |
| ashr | Unimodal scale-mixture | coef= or contrast= (incl. numeric) | Need contrast= for interaction sums or pairwise from ~ 0 + group |
| normal | Zero-centered normal | coef= only; not interaction designs | Only when the old shrunken p-value is required (legacy) |
The apeglm-cannot-use-contrast footgun: if the question is "drug effect in KO" from ~ genotype * treatment, apeglm cannot directly shrink that contrast. Workarounds: (a) rebuild as combined factor ~ 0 + group and relevel so the desired comparison is a coefficient; (b) use ashr; (c) accept the unshrunken LFC for that one comparison.
p-values do NOT change when shrinking. lfcShrink() preserves the Wald p-value from results(). See the Single Most Important Insight at the top.
padj = NA has three distinct causes (independent filtering, Cook's outlier, all-zero in a group), each with a different remediation -- see de-results for the full diagnostic table, IHW alternative, and recovery code.
Two DESeq2-specific points worth knowing at this layer:
results(dds, cooksCutoff = FALSE)). At n>=7 per group, DESeq() REPLACES outliers via replaceOutliers() and refits (minReplicatesForReplace = 7).rowSums(counts(dds)) >= 10) is for memory and speed ONLY. It does NOT replace independent filtering, which operates downstream at results() time. Independent filtering can be swapped for IHW via results(dds, filterFun = ihw).Goal: Produce homoskedastic log2-scale counts for PCA, heatmaps, clustering, ML features.
Approach: Use vst() by default. Switch to rlog() only for n<30 with size factors varying >4x. Never use normTransform() for distance-based plots. Never use VST/rlog values as input to DE.
vsd <- vst(dds, blind = FALSE)
rld <- rlog(dds, blind = FALSE)
The vst() function default is blind=TRUE, but the current DESeq2 vignette recommends blind=FALSE for any downstream visualization AFTER the model is fit (it uses the design when fitting dispersions; appropriate when the design is already settled). Reserve blind=TRUE for unsupervised QC where the design should not influence the transformation (e.g., "is this sample consistent with its group?"). The vignette has flip-flopped over the years on which to recommend by default -- pass blind= explicitly.
vst() uses 1000 most-variable genes to fit the dispersion trend by default. With <1000 genes after filtering, set nsub lower.
DESeq2 v1.0-v1.15: betaPrior = TRUE was the default. Shrinkage was baked in and p-values were computed on the shrunken estimates. v1.16 (2017) flipped the default to FALSE and introduced lfcShrink(). Today: betaPrior = TRUE is quasi-deprecated and is the ONLY path to a p-value of the shrunken estimate. Most users do not want this. lfcShrink() with apeglm and the unshrunken Wald p-value is the modern compromise.
The contrast= vs name= numerical difference that older tutorials describe was specific to betaPrior = TRUE. With the current default, name= and contrast= for the same comparison return identical LFCs.
estimateSizeFactors() defaults to type='ratio' (median-of-ratios). Edge cases:
| Situation | Use |
|-----------|-----|
| Zero counts in some samples for many genes | type='poscounts' -- uses only positive entries per gene |
| Very small libraries, hard to converge | type='iterate' |
| Spike-ins (ERCC) or known stable housekeeping genes | controlGenes = indices |
| Majority-DE biology (prokaryotic stress, viral host shutoff, MYC amplification) | controlGenes with curated stable genes; or spike-in SBN (Jiang 2011 Genome Res 21:1543) |
Single-cell pseudobulk: most genes have zeros across donors, so type='poscounts' is often required for pseudobulk DESeq2.
Trigger: Question of the form "drug effect in KO genotype" from ~ genotype * treatment; user tries lfcShrink(dds, contrast=list(...), type='apeglm').
Mechanism: apeglm fits per-coefficient priors. A numeric or list contrast is a linear combination of coefficients, not a coefficient -- no prior to apply.
Symptom: Error: "type='apeglm' shrinkage only for use with 'coef'"
Fix: Rebuild design as ~ 0 + group with group = paste(genotype, treatment), relevel so the comparison is a coefficient, refit. Or use type='ashr' which accepts contrasts.
Trigger: Multi-level factor analyzed with DESeq(dds, test='LRT', reduced=~1); user reports the log2FoldChange column as "the effect".
Mechanism: The LRT p-value is the omnibus test of "any difference among levels". The LFC reported by results() after LRT is for the LAST coefficient in resultsNames(dds), which is one specific level-vs-reference comparison.
Symptom: A 4-level factor produces a single LFC value per gene; reviewer asks "the effect of which condition?"
Fix: Treat LRT padj as a screen for "any change". For effect sizes, extract per-level Wald coefficients individually via results(dds, name='<specific coefficient>') for each non-reference level.
Trigger: Rare-disease cohort or CNV-amplified patient where ONE sample drives the biology; that gene has padj=NA.
Mechanism: Cook's distance flagged the patient as an outlier and zeroed the gene's p-value (or replaced its counts if n>=7).
Symptom: A gene of clear biological interest comes back as NA in the results table even though the count matrix shows the expected pattern.
Fix: results(dds, cooksCutoff = FALSE). Optionally cross-validate the result by running a sensitivity analysis with and without the outlier sample.
Trigger: A transcription factor expressed at ~10 counts but consistently across all samples shows padj=NA despite obvious biology.
Mechanism: baseMean is below the data-driven independent filtering threshold; the gene was excluded from FDR adjustment.
Symptom: Low-count genes with clean signal end up NA.
Fix: results(dds, independentFiltering = FALSE); or filterFun = ihw from the IHW package (often less aggressive on low-count genes).
Trigger: plotDispEsts(dds) shows the red parametric trend curve nowhere near the cloud of gene-wise (blue) and final (black) dispersion estimates.
Mechanism: Default fitType='parametric' assumes dispersion ~ a/mean + b. Fails when the experiment has very few samples per group with highly heterogeneous biology, many very-low-count genes pulling the trend, or a continuous covariate driving large variability.
Symptom: Curved trend that doesn't match the cloud; mismatch shows up most clearly in low-baseMean genes.
Fix: Refit with DESeq(dds, fitType='local') (local regression) or fitType='mean' (flat trend); compare plotDispEsts between fits and pick the one tracking the cloud. Falling back to 'mean' is a sign the data is unusual; investigate before trusting results.
Trigger: Bacterial RNA-seq under stress where >50% of genes change in one direction; PCA shows huge global shift.
Mechanism: Median-of-ratios assumes most genes are not DE. Under massive global perturbation, the reference is dominated by DE genes; size factors absorb the biology.
Symptom: MA plot shows the bulk cloud shifted off zero; reported fold changes don't match qPCR.
Fix: Use controlGenes with curated stable housekeeping genes; or spike-in normalization (Jiang 2011); or RUVg with negative controls.
from pydeseq2.dds import DeseqDataSet
from pydeseq2.ds import DeseqStats
dds = DeseqDataSet(counts=count_df, metadata=metadata, design='~condition')
dds.deseq2()
stat_res = DeseqStats(dds, contrast=('condition', 'treated', 'control'))
stat_res.summary()
results_df = stat_res.results_df
PyDESeq2 0.5+ supports Wald, multi-factor designs, and apeglm shrinkage. No LRT yet. Results are numerically close to R DESeq2 but with small differences from MLE solver choice.
controlGenes or spike-ins.pae for P. aeruginosa PAO1): clusterProfiler::search_kegg_organism().| Error / symptom | Cause | Fix |
|-----------------|-------|-----|
| design matrix not full rank | Confounded covariates | alias(model.matrix(design, coldata))$Complete to find the redundant column |
| counts matrix should be integers | Salmon/kallisto/RSEM counts are fractional | Use DESeqDataSetFromTximport() -- it rounds AND carries length offsets |
| Wrong sign of LFC vs expected | Reference level set alphabetically | relevel() BEFORE DESeq() |
| padj = NA for biologically meaningful gene | Independent filtering or Cook's outlier | See padj=NA section |
| LRT LFC doesn't match Wald LFC for the same comparison | LRT reports last coefficient; Wald reports the named coefficient | Extract specific Wald per level for the effect size |
| summary(res) shows fewer DE genes than expected | summary() default alpha=0.1, NOT the alpha passed to results() | summary(res, alpha = 0.05) |
tools
--- name: bio-phasing-imputation-foundations description: Frames the phasing/imputation pipeline before any tool runs: phasing and imputation are one Li-Stephens copying HMM (recombination is the transition, mutation the emission, the genetic map and Ne set the rates), imputation's honest output is a dosage with a self-estimated quality (INFO/R2/DR2) not a hard genotype, and the stages are ordered and each fails silently (QC, align build and strand to the panel, phase, impute per chromosome, fil
tools
Chooses the enrichment generation before any tool runs, mapping the input shape to a method class - a pre-selected gene list plus a background to over-representation analysis (ORA, hypergeometric), a ranked statistic for all genes to gene set enrichment (GSEA), a signed signaling topology to pathway-topology (SPIA) - then making the null explicit (competitive vs self-contained, gene vs subject sampling) and running a trustworthiness checklist (testable-gene universe, FDR, redundancy collapse, leading-edge check, version reporting). Covers why every clusterProfiler GSEA is the inter-gene-correlation-uncorrected competitive null, why the background not the gene list decides ORA significance, and why no method is universally best. Use when deciding ORA vs GSEA vs topology, which gene-set DB, whether a result is trustworthy, or which null a tool computes. For ORA see go-enrichment, GSEA see gsea, databases kegg-pathways/reactome-pathways/wikipathways; the ranking comes from differential-expression/de-results.
testing
End-to-end GWAS workflow from VCF to association results. Covers PLINK QC, population structure correction, and association testing for case-control or quantitative traits. Use when running genome-wide association studies.
development
Orchestrates the full path from differential expression results to redundancy-collapsed functional enrichment: choose ORA vs GSEA, convert gene IDs per method, run enrichGO/enrichKEGG/enrichPathway/enrichWP or gseGO/gseKEGG (clusterProfiler, ReactomePA, rWikiPathways), and visualize. Routes the ORA-vs-GSEA generation fork and the null/universe/reproducibility theory to pathway-analysis/enrichment-foundations. Use when a DESeq2/edgeR/limma result must become enriched GO terms, KEGG/Reactome/WikiPathways pathways, or a GSEA leading edge; when deciding whether a ranking exists for all genes (GSEA, named decreasing vector) or only a pre-selected list (ORA plus a defensible background universe); or when assembling DE-to-pathway end to end. The DE list and ranking statistic come from differential-expression/de-results; per-method nuance lives in the pathway-analysis skills.