differential-expression/batch-correction/SKILL.md
Handles batch effects in bulk RNA-seq via design-matrix inclusion (the correct path for DE), ComBat/ComBat-seq for visualization, SVA for unknown latent factors, RUVSeq for negative-control-gene-anchored unwanted variation, and limma::removeBatchEffect for plotting only. Encodes the Nygaard 2016 cardinal sin against testing on a batch-corrected matrix, the choice between SVA/RUVg/RUVs/RUVr, the confounding non-identifiability problem, the single-cell boundary (Harmony/MNN are NOT for bulk), and the Goh 2017 harmonization critique. Use when designing a DE analysis with batch structure, troubleshooting batch-dominated PCA, choosing ComBat vs ComBat-seq, handling unknown batch via SVA, integrating across studies, or deciding when (rarely) to subtract batch.
npx skillsauth add GPTomics/bioSkills bio-differential-expression-batch-correctionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: sva 3.50+ (includes ComBat + ComBat_seq), DESeq2 1.42+, edgeR 4.0+, limma 3.58+, RUVSeq 1.36+, ggplot2 3.5+, harmony 1.2+ (single-cell context only)
Before using code patterns, verify installed versions match. If versions differ:
packageVersion('<pkg>') then ?function_name to verify parametersIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Remove the batch effect before DE" -> Almost always WRONG. Include batch as a covariate in the design formula (~ batch + condition) so DESeq2/edgeR/limma model it without subtracting. Subtraction is for visualization only.
Nygaard, Rødland, Hovig 2016 Biostatistics 17(1):29-39, "Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses." Translation: never run ComBat (or ComBat-seq, or removeBatchEffect, or SVA-subtract-then-test) and then run DE on the corrected matrix.
Mechanism: batch-correction methods fit a model y_ij = alpha + X_ij beta + gamma_i + delta_i epsilon_ij and subtract the batch terms. The downstream DE test then computes p-values as if those degrees of freedom had never been spent. Residual df is lower than what DESeq2/edgeR/limma assume. Type-I error inflates -- the gene list looks more significant than it should.
The right approach: include batch in the design. ~ batch + condition. DESeq2/edgeR/limma will properly partial out the batch effect from the condition estimate and account for the spent degrees of freedom in the inference. The batch is corrected at the inference stage, not by mutating counts.
removeBatchEffect (limma) is for visualization only -- the function's help page says so. ComBat/ComBat-seq output is for visualization, clustering, or downstream tools that cannot take a design matrix (rare; mostly ML).
A second clarification: structural confounding (every "treated" sample in batch 1, every "control" in batch 2) is non-identifiable. No method fixes this -- the "treatment effect" and "batch effect" are mathematically the same vector. The fix is experimental design (randomize batches in advance). ComBat/SVA on a fully-confounded design silently removes the treatment effect along with the batch effect.
| Method | Input | Mechanism | Use for |
|--------|-------|-----------|---------|
| Design-matrix inclusion (~ batch + condition) | Raw counts; known batch | Partial out batch in the GLM, spent df accounted | DE testing -- the correct path |
| ComBat (Johnson, Li, Rabinovic 2007) | Log-transformed / continuous expression | Empirical-Bayes location and scale shifts per batch | Visualization of microarray / continuous data |
| ComBat-seq (Zhang, Parmigiani, Johnson 2020) | Raw RNA-seq counts | NB-GLM equivalent of ComBat; returns integer counts | Visualization of RNA-seq counts; cross-study harmonization for ML (with caveat) |
| limma::removeBatchEffect | Normalized expression | Linear regression subtraction with design protection | Visualization only -- explicit help-page warning |
| SVA (Leek, Storey 2007 + Leek 2012) | Normalized expression | Estimate latent surrogate variables explaining residual variance independent of variable-of-interest | Unknown batch / hidden technical structure -- add SVs to design |
| svaseq (Leek 2014) | Counts | Count-data variant of SVA | Counts version |
| RUVg (Risso, Ngai, Speed, Dudoit 2014) | Counts; negative control genes | Factor analysis of control-gene residual; W as covariate | Strong negative controls (ERCC, housekeeping) -- add W to design |
| RUVs | Counts; replicate samples | Factor analysis within replicate groups; assumes replicates differ only by unwanted variation | Multi-condition with biological replicates as "controls" |
| RUVr | Counts; design only | Factor analysis of residuals from initial fit | Most data-driven; most likely to absorb biology -- caution |
| Harmony, MNN, Scanorama, BBKNN | Single-cell embeddings | Iterative alignment of clusters across samples | Single-cell ONLY; not bulk |
| Scenario | Recommended approach | Why |
|----------|---------------------|-----|
| Known batch, want DE | ~ batch + condition in design; do NOT subtract | Cardinal sin avoidance |
| PCA shows batch separation | Include batch in design; for the figure, removeBatchEffect is OK (visualization only) | Two purposes, two tools |
| Unknown batch structure | sva / svaseq; add SVs as covariates in design | Captures latent technical factors |
| Have ERCC spike-ins or trusted housekeeping | RUVSeq::RUVg with control gene indices; add W to design | Most principled UV removal |
| Have replicate samples (technical reps within biological) | RUVSeq::RUVs | Replicate structure indicates "this differs only by UV" |
| Cross-study integration (TCGA + ICGC + own data) | ComBat-seq for counts, ComBat for log-expression, THEN meta-analysis -- do NOT pool then DE | Goh 2017 warning |
| Fully confounded batch and condition | Re-collect samples; no method fixes this | Non-identifiable |
| Visualizing batch removal for a figure | removeBatchEffect(expr, batch = batch, design = model.matrix(~condition)) | Visualization is what it's for |
| Single-cell data | Harmony / MNN / Scanorama in the single-cell category, not here | Different problem |
Goal: Test the condition effect while accounting for known batch structure with proper degrees-of-freedom accounting.
Approach: Include batch as a covariate in the design formula; DESeq2/edgeR/limma will partial it out and compute correct p-values.
library(DESeq2)
dds <- DESeqDataSetFromMatrix(countData = counts, colData = coldata,
design = ~ batch + condition)
dds <- DESeq(dds)
res <- results(dds, name = 'condition_treated_vs_control')
library(edgeR)
y <- DGEList(counts = counts, group = coldata$condition)
keep <- filterByExpr(y, design = model.matrix(~ batch + condition, coldata))
y <- y[keep, , keep.lib.sizes = FALSE]
y <- normLibSizes(y)
design <- model.matrix(~ batch + condition, coldata)
y <- estimateDisp(y, design, robust = TRUE)
fit <- glmQLFit(y, design, robust = TRUE)
qlf <- glmQLFTest(fit, coef = 'conditiontreated')
When known confounders are continuous (RIN, library prep date as days), include them as continuous covariates -- no batch correction needed for those:
design = ~ RIN + library_prep_day + condition
Goal: Adjust raw counts to remove known batch effects while preserving biology, for VISUALIZATION or for downstream tools that need a single corrected matrix.
Approach: ComBat_seq(counts, batch, group) returns batch-adjusted integer counts via NB-GLM. Use the output for PCA, clustering, ML -- NOT for DE testing.
library(sva)
corrected_counts <- ComBat_seq(counts = as.matrix(counts),
batch = coldata$batch,
group = coldata$condition,
full_mod = TRUE)
vsd_corrected <- vst(DESeqDataSetFromMatrix(corrected_counts, coldata, ~1))
plotPCA(vsd_corrected, intgroup = 'condition')
full_mod = TRUE keeps biological covariates protected (the group argument). With full_mod = FALSE, ComBat-seq removes batch AND group differences -- a serious failure mode.
ComBat (the non-seq version) is for log-transformed or microarray data, NOT raw counts. Running ComBat on raw counts produces fractional values and assumes Gaussian residuals where counts are NB. Wrong tool, silent failure.
Goal: Discover latent technical factors when batch is not recorded, add them as covariates.
Approach: Estimate the number of surrogate variables; estimate the SVs themselves; add them to the design and re-run DE.
library(sva)
library(DESeq2)
dds <- DESeqDataSetFromMatrix(counts, coldata, design = ~ condition)
dds <- estimateSizeFactors(dds)
norm_counts <- counts(dds, normalized = TRUE)
mod <- model.matrix(~ condition, coldata)
mod0 <- model.matrix(~ 1, coldata)
n_sv <- num.sv(norm_counts, mod, method = 'leek')
svobj <- svaseq(norm_counts, mod, mod0, n.sv = n_sv)
for (i in seq_len(ncol(svobj$sv))) {
colData(dds)[[paste0('SV', i)]] <- svobj$sv[, i]
}
sv_formula <- as.formula(paste('~', paste(paste0('SV', seq_len(ncol(svobj$sv))),
collapse = ' + '), '+ condition'))
design(dds) <- sv_formula
dds <- DESeq(dds)
CRITICAL CHECK: compute the correlation between each SV and the variable of interest. If any SV correlates with treatment at r > 0.3, do NOT include it -- doing so would partial out biology, deflating the effect of interest.
sapply(seq_len(ncol(svobj$sv)),
function(i) cor(svobj$sv[, i], as.numeric(coldata$condition)))
svaseq is the count-data version (Leek 2014); sva is for log-transformed / microarray data.
Goal: Estimate unwanted variation using ERCC spike-ins, housekeeping genes, replicate samples, or post-fit residuals; add as covariates in design.
Approach: Pick the RUV variant matching the available controls; add the resulting W matrix to the design.
library(RUVSeq)
library(edgeR)
control_idx <- which(rownames(counts) %in% ercc_genes)
set <- newSeqExpressionSet(as.matrix(counts), phenoData = coldata)
set <- betweenLaneNormalization(set, which = 'upper')
ruv <- RUVg(set, control_idx, k = 2)
design <- model.matrix(~ pData(ruv)$W_1 + pData(ruv)$W_2 + condition,
data = coldata)
y <- DGEList(counts = counts, group = coldata$condition)
y <- normLibSizes(y)
y <- estimateDisp(y, design, robust = TRUE)
fit <- glmQLFit(y, design, robust = TRUE)
qlf <- glmQLFTest(fit, coef = 'conditiontreated')
| Variant | Control source | Most appropriate when | Failure mode | |---------|----------------|----------------------|--------------| | RUVg | Negative control genes (ERCC, housekeeping) | Strong, validated controls exist | Bad controls -> partials out biology | | RUVs | Replicate samples / technical reps | Multi-condition with reps that should differ only by UV | Wrong replicate structure absorbs condition effect | | RUVr | Residuals from an initial GLM | No external controls; most data-driven | Most likely to absorb true biology -- use last |
Choice of k (number of unwanted factors): no automatic procedure. Try k = 1, 2, 3 and inspect PCA after RUV correction. Treatment effect should remain visible; if it disappears, k is too high. 5-10 is typical; >10 is suspicious.
library(limma)
design <- model.matrix(~ condition, coldata)
log_expr_corrected <- removeBatchEffect(log_expr,
batch = coldata$batch,
design = design)
library(ggplot2)
pca <- prcomp(t(log_expr_corrected), scale. = TRUE)
ggplot(data.frame(PC1 = pca$x[,1], PC2 = pca$x[,2],
condition = coldata$condition, batch = coldata$batch),
aes(PC1, PC2, color = condition, shape = batch)) +
geom_point(size = 3) +
ggtitle('After removeBatchEffect (visualization only)')
The design argument protects condition while removing batch. Common error: passing batch ALSO in the design ("Coefficients not estimable" warning) -- pass batch via batch= only, NOT also in design=.
DO NOT feed log_expr_corrected back into limma lmFit for DE testing. The function's own help page says so. The right approach: lmFit(log_expr, model.matrix(~ batch + condition, coldata)) -- include batch as a covariate.
ct <- table(coldata$condition, coldata$batch)
ct
| Pattern | Status | Action | |---------|--------|--------| | All cells > 0 with roughly equal proportions | Balanced | Include batch as covariate | | Some cells low (e.g., 4/4/2/0) | Partially confounded | Include batch; report reduced power | | Some cells zero AND one factor entirely in one batch | Perfectly confounded | UNFIXABLE -- re-collect or drop the affected comparison |
alias() reveals collinear columns in a design matrix:
alias(model.matrix(~ batch + condition, coldata))$Complete
Harmony (Korsunsky et al. 2019 Nat Methods 16:1289), MNN (Haghverdi et al. 2018 Nat Biotechnol 36:421), Scanorama, BBKNN are designed for single-cell integration where the goal is aligning cell-type clusters across samples/batches. They modify the embedding (PCA coordinates) for downstream UMAP/clustering -- they assume the structural alignment problem of single-cell (many similar cells; align clusters).
DO NOT apply to bulk RNA-seq. Bulk lacks the cluster structure these methods assume.
Conversely, DO NOT use ComBat / ComBat-seq for single-cell. ComBat assumes much more homogeneity than scRNA exhibits; it does not handle zeros well.
See single-cell/batch-integration for the single-cell methods.
Goh, Wang, Wong 2017 Trends Biotechnol 35(6):498-507 warn that batch-correction methods can:
Their recommendation for cross-study work: per-cohort DE first, then meta-analyze the effect sizes (metafor, limma treat across cohorts). DO NOT pool data, harmonize, then re-test.
"Harmonization" in clinical genomics often means ComBat across cohorts then DE on the harmonized matrix -- the exact failure mode Nygaard 2016 describes. Same problem, different vocabulary.
| Pattern | Likely cause | Action |
|---------|--------------|--------|
| Design-inclusion (~ batch + condition) and SVA give different gene lists | SVA captured biology (high SV-condition correlation) OR known batch was incomplete | Check SV-vs-condition correlations; if any |r|>0.3, drop that SV. If SVA still differs after pruning, hidden technical factor exists beyond known batch -- combine ~ batch + SV_clean + condition |
| Design-inclusion and RUVg disagree | RUVg control genes (ERCC, housekeeping) had real biological variation | Validate control set; switch RUVg -> RUVs (replicate-based) or back to design only |
| ComBat-seq corrected gene list larger than design-included gene list | Nygaard 2016 cardinal sin: spent df not accounted for in downstream DE on corrected matrix | DISCARD the ComBat-seq DE result; report design-included only |
| Cross-cohort meta-analysis shows different DE per cohort | Real biological heterogeneity OR cohort-specific technical drift | Per-cohort DE then meta-analyze effect sizes (metafor); do NOT pool then test (Goh 2017) |
| All methods (design, SVA, RUVg) agree on top 100 genes | Robust signal | Report the intersection as high-confidence |
| All methods disagree | Confounded design OR weak signal at high noise floor | Suspect non-identifiability; verify alias() and balance |
The design-inclusion result (~ batch + condition or ~ confounders + condition) is the reference. SVA/RUV results are sensitivity analyses; when they agree with the design result, confidence rises. When they diverge, investigate before believing either.
Trigger: Pipeline: ComBat_seq(counts, batch) -> DESeq2 on corrected_counts. Many more "significant" genes than expected.
Mechanism: ComBat-seq removed batch terms; DESeq2 then computed inference as if those df had not been spent. Type-I error inflates.
Symptom: Implausibly long DE gene list; replication in independent cohort recovers <50%; p-value histogram leans anti-conservative.
Fix: Re-run with ~ batch + condition design on RAW counts. Discard the corrected-counts DE result.
Trigger: sva returned 5 SVs; SV2 correlates with condition at r = 0.7; user added all SVs to design.
Mechanism: When biology is correlated with the hidden factor (e.g., disease severity drives both expression AND blood-draw timing), SVA's SVs capture biology along with technical noise. Partialling them out deflates the effect of interest.
Symptom: Condition effect that was clear in raw PCA disappears after SV adjustment; few or no DE genes.
Fix: Compute SV-vs-condition correlations; exclude any SV with |r| > 0.3 from the design. Re-fit.
Trigger: Pipeline uses ComBat(counts, batch) (not ComBat_seq); output is fractional.
Mechanism: ComBat is for log-transformed / Gaussian data. Counts are NB. The Gaussian assumption is wrong and the output is uninterpretable as counts.
Symptom: Corrected matrix has fractional values; DESeq2 errors out ("counts matrix should be integers"); naive round() gives garbage.
Fix: Use ComBat_seq() for counts. Or better, include batch as a design covariate.
Trigger: All treated samples in batch 1, all control in batch 2; user runs SVA hoping it will rescue the design.
Mechanism: Batch and treatment vectors are identical (or nearly so). SVA estimates "the unwanted factor"; "the unwanted factor" is treatment.
Symptom: SV1 correlates with treatment at r ~ 1; after adjustment, no DE genes.
Fix: Acknowledge the design is non-identifiable. No statistical method fixes structural confounding. Re-collect with randomized batches.
Trigger: Bulk RNA-seq with batch effect; user reaches for Harmony because they've used it for single-cell.
Mechanism: Harmony aligns cluster centroids in single-cell embeddings. Bulk has no cluster structure -- typically 6-30 samples, not 10000+ cells.
Symptom: Harmony "succeeds" but the result is meaningless; downstream DE is nonsense.
Fix: Use design-matrix inclusion (~ batch + condition) or ComBat-seq for visualization. Harmony belongs to single-cell/batch-integration.
| Error / symptom | Cause | Fix |
|-----------------|-------|-----|
| Coefficients not estimable from removeBatchEffect | Batch included in both batch= and design= | Pass batch via batch= only |
| ComBat output has fractional values | Wrong tool for counts | Use ComBat_seq |
| SV adjustment kills condition effect | SVs captured biology | Check correlation; exclude high-correlation SVs |
| Inconsistent dimensions in RUVg | control_idx is symbols vs ENSEMBL rownames | Match index type |
| full_mod not specified in ComBat-seq | Defaults; biology may not be protected | Set full_mod = TRUE and pass group = |
~ batch + condition designtesting
Analyze multi-modal single-cell data (CITE-seq, Multiome, spatial). Use when working with data that measures multiple modalities per cell like RNA + protein or RNA + ATAC. Use when analyzing CITE-seq, Multiome, or other multi-modal single-cell data.
data-ai
Analyze metabolite-mediated cell-cell communication using MeboCost for metabolic signaling inference between cell types. Predict metabolite secretion and sensing patterns from scRNA-seq data. Use when studying metabolic crosstalk between cell populations or metabolite-receptor interactions.
development
Find marker genes and annotate cell types in single-cell RNA-seq using Seurat (R) and Scanpy (Python). Use for differential expression between clusters, identifying cluster-specific markers, scoring gene sets, and assigning cell type labels. Use when finding marker genes and annotating clusters.
development
Reconstruct cell lineage trees from CRISPR barcode tracing or mitochondrial mutations. Use when studying clonal dynamics, cell fate decisions, or developmental trajectories.