Version Compatibility

Reference examples tested with: designit 0.5+, OSAT 1.50+ (Bioconductor), sva 3.50+, RUVSeq 1.36+, limma 3.58+, edgeR 4.0+.

Before using code patterns, verify installed versions match. If versions differ:

R: packageVersion('<pkg>') then ?function_name to verify parameters

If code throws an error, introspect the installed package and adapt to the actual API. Notes: OSAT uses optimal.shuffle() on a setup object (there is no bare osat() function); designit is R6 (BatchContainer$new(), optimize_design(), *_score_generator()) and its signatures drift between releases; sva::ComBat_seq() is for integer counts while ComBat() expects log-normalized values. Confirm each against the installed vignette before relying on it.

Batch Design

"Design my experiment so batch effects don't ruin it" -> Assign samples to batches/lanes/plates so the biological variable is balanced against (orthogonal to) every technical nuisance factor, making batch estimable rather than confounded — because no post-hoc correction recovers a design where batch and condition are aliased.

R: designit::optimize_design(), OSAT::optimal.shuffle() — constrained assignment at design time
R: sva::sva()/num.sv() — detect hidden batches; sva::ComBat_seq(), RUVSeq::RUVg() — DOWNSTREAM correction (executed in differential-expression/batch-correction)

The Single Most Important Modern Insight -- No Post-Hoc Method Recovers a Confounded Design

When a technical factor is perfectly aliased with the biological factor (all treated in batch 1, all controls in batch 2), batch and condition occupy the same column space and are mathematically non-identifiable; ComBat, SVA, and RUV cannot separate them, and "removing the batch effect" removes the biology with it. Worse, on partially confounded or merely unbalanced designs, mean-centering batches can manufacture false positives and inflate downstream confidence — Nygaard, Rødland & Hovig 2016 Biostatistics 17:29 showed a pipeline that returned >1000 spurious DE probes where the honest analysis (batch kept in the model) found 11. The operative rules: (1) balance the biological variable across batches at design time because correction is not a rescue (Leek 2010 Nat Rev Genet 11:733); (2) for inference, keep batch in the model so its degrees of freedom are charged honestly — reserve a batch-"cleaned" matrix for visualization and clustering only. The experimental unit, not the measurement, still defines replication (see experimental-design/randomization-blocking).
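
Rule (2) can be illustrated with a small base-R simulation. This is a sketch under stated assumptions: the data are simulated, and the effect sizes (a treatment effect of 1, a batch shift of 3) and all variable names are hypothetical rather than taken from any cited study. In a balanced layout the condition estimate is the same with or without the batch term, but only the model that keeps batch reports the smaller residual spread and the honest degrees of freedom.

```r
# Simulated balanced design: 2 batches x 2 conditions, 6 samples per cell.
# Effect sizes and names are illustrative assumptions.
set.seed(1)
design <- expand.grid(rep = 1:6,
                      condition = c("ctrl", "treat"),
                      batch = c("b1", "b2"))
design$y <- 1 * (design$condition == "treat") +  # treatment effect
            3 * (design$batch == "b2") +         # batch shift
            rnorm(nrow(design), sd = 1)          # residual noise

fit_with_batch <- lm(y ~ condition + batch, data = design)  # batch kept in the model
fit_no_batch   <- lm(y ~ condition, data = design)          # batch ignored

effect_with    <- unname(coef(fit_with_batch)["conditiontreat"])
effect_without <- unname(coef(fit_no_batch)["conditiontreat"])
sigma_with     <- sigma(fit_with_batch)        # residual SD with batch modeled
sigma_without  <- sigma(fit_no_batch)          # residual SD absorbs the batch shift
df_with        <- df.residual(fit_with_batch)  # 24 samples - 3 coefficients
```

Because the layout is balanced, ignoring batch does not bias the condition estimate here; it only inflates the residual variance. In an unbalanced or confounded layout the two estimates would no longer agree.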

Confounding vs Blocking vs Nuisance -- the Causal-Graph View

Batch is the genomics face of confounding, and not all metadata should be "adjusted for". A confounder is a common cause of treatment and outcome (adjust for it); a mediator lies on the causal path (adjusting removes signal); a collider is a common effect (adjusting induces spurious association). "Adjust for everything measured" is therefore wrong in general: conditioning on a collider opens a non-causal path that was previously blocked, and conditioning on a mediator removes part of the effect under study. In a properly randomized design the biological variable has no confounders by construction, so batch is handled by balanced assignment plus a block/covariate term, not by scrubbing every measured variable.
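
The collider case can be shown with a minimal base-R simulation. This is a sketch on simulated data; the variable names and the unit-variance normal model are assumptions made for illustration. Treatment and outcome are generated independently, yet adjusting for their common effect produces a clearly non-zero coefficient.

```r
# Simulated data: treatment and outcome are independent by construction.
set.seed(42)
n <- 5000
treatment <- rnorm(n)
outcome   <- rnorm(n)
collider  <- treatment + outcome + rnorm(n)  # common effect of both variables

# Unadjusted model: the treatment coefficient is near zero, as it should be.
unadjusted <- unname(coef(lm(outcome ~ treatment))["treatment"])

# Adjusting for the collider induces a spurious negative association.
adjusted <- unname(coef(lm(outcome ~ treatment + collider))["treatment"])
```

Under these assumptions the adjusted coefficient converges to -0.5 as n grows, even though there is no causal link between treatment and outcome.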

Algorithmic Taxonomy -- Design Strategies for Technical Variation

| Strategy | What it does | When to use | Fails when |
|----------|--------------|-------------|------------|
| Balanced (orthogonal) assignment | every condition appears equally in every batch | always achievable when batches hold >=1 of each condition | not all conditions fit per batch |
| Block randomization across batches | randomize condition within each batch | batch = a block; conditions fit per batch | batch variance is genuinely zero (rare) |
| Incomplete block + batch in model | conditions split across smaller batches, batch term retained | plate/chip smaller than #conditions | unbalanced split inflates artifacts (Nygaard 2016) |
| Reference / bridge sample per batch | shared anchor measured in every batch | cross-batch normalization (TMT proteomics, large cohorts) | anchor not representative |
| Multiplexing + demultiplexing | pool biological units in one lane, split by barcode/genotype | breaking the donor<->lane confound (scRNA-seq) | insufficient SNPs/hashes to assign |
| Run-order randomization | randomize processing/injection order | position/time gradients (LC-MS, plate edge) | order set by convenience |
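
The run-order row can be sketched in base R. This is a minimal illustration, assuming samples have already been assigned to three batches of eight; the `layout` data frame and its column names are hypothetical.

```r
# Hypothetical layout: 24 samples already assigned to 3 batches of 8.
set.seed(7)
layout <- data.frame(id = sprintf("S%02d", 1:24),
                     batch = rep(1:3, each = 8))

# Draw an independent random processing order (1..8) inside each batch,
# so position/time gradients are not tied to the order samples were listed in.
layout$run_order <- ave(seq_len(nrow(layout)), layout$batch,
                        FUN = function(idx) sample(length(idx)))

# Sort into the order the bench or instrument should follow.
layout <- layout[order(layout$batch, layout$run_order), ]
```

Fixing the seed and storing the resulting table alongside the sample metadata lets the processing order be reconstructed later.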

Decision Tree by Scenario

| Scenario | Recommended design | Why |
|----------|--------------------|-----|
| 24 samples, 3 batches, 2 conditions | balanced: 4 of each condition per batch | batch orthogonal to condition; estimable |
| Conditions outnumber batch capacity | incomplete block; keep batch in the DE model | preserves estimability; no scrubbing |
| Large cohort across many runs | include a shared reference sample per batch | enables cross-batch normalization |
| scRNA-seq, several donors, few lanes | pool donors per lane, demultiplex (demuxlet / hashing) | removes donor<->lane confound (Kang 2018) |
| Hidden/unknown technical structure suspected | estimate surrogate variables (SVA), include in model | captures unmodeled variation (Leek & Storey 2007) |
| Design already confounds batch with condition | redesign; no correction will rescue it | non-identifiable (Nygaard 2016; Leek 2010) |
| General randomization / blocking / unit choice | -> experimental-design/randomization-blocking | foundational design structure |
| Running ComBat-seq / RUVSeq / SVA on real data | -> differential-expression/batch-correction | execution lives there; this skill decides |
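
The first scenario (24 samples, 3 batches, 2 conditions) can be worked through with base R alone. This is a sketch of plain stratified randomization, not the optimizer-based approach; the sample IDs and column names are made up for illustration.

```r
# Hypothetical sample sheet: 12 controls and 12 treated samples.
set.seed(2024)
samples <- data.frame(id = sprintf("S%02d", 1:24),
                      condition = rep(c("ctrl", "treat"), each = 12))
n_batches <- 3

# Within each condition, shuffle the samples and deal them round-robin
# into batches, so every batch receives the same count from each condition.
samples$batch <- NA_integer_
for (cond in unique(samples$condition)) {
  rows <- which(samples$condition == cond)
  shuffled <- rows[sample.int(length(rows))]
  samples$batch[shuffled] <- rep_len(seq_len(n_batches), length(shuffled))
}

# Cross-tabulate to confirm the 4-per-condition-per-batch layout.
balance <- table(samples$condition, samples$batch)
```

This balances a single factor exactly. Balancing several covariates at once (condition and sex, for example) is where a scoring-based optimizer becomes useful.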

Confounded vs Balanced -- the Canonical Contrast

Goal: Make batch effects correctable by keeping the biological variable orthogonal to batch.

Approach: Never place all of one condition in one batch. Distribute conditions (and known covariates such as sex) equally across batches so a linear model can estimate batch and condition separately.

# BAD (confounded): batch is aliased with condition -> non-identifiable
#   batch 1: treat, treat, treat, treat       batch 2: ctrl, ctrl, ctrl, ctrl
# GOOD (balanced): batch is orthogonal to condition -> batch effect estimable, removable
#   batch 1: 2 treat + 2 ctrl                 batch 2: 2 treat + 2 ctrl
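
The contrast above can be checked numerically before any sample is processed. The sketch below builds the two 8-sample layouts as hypothetical data frames and tests whether the design matrix for `~ condition + batch` has full column rank; `is_estimable()` is a helper defined here for illustration, not a function from any package.

```r
# Hypothetical 8-sample layouts matching the BAD and GOOD cases.
confounded <- data.frame(
  condition = factor(rep(c("treat", "ctrl"), each = 4)),
  batch     = factor(rep(c("b1", "b2"), each = 4)))
balanced <- data.frame(
  condition = factor(rep(c("treat", "treat", "ctrl", "ctrl"), times = 2)),
  batch     = factor(rep(c("b1", "b2"), each = 4)))

# Full column rank means batch and condition can be estimated separately;
# a rank deficit means at least one of them is aliased with the other.
is_estimable <- function(layout) {
  X <- model.matrix(~ condition + batch, data = layout)
  qr(X)$rank == ncol(X)
}

confounded_ok <- is_estimable(confounded)
balanced_ok   <- is_estimable(balanced)
```

A rank check only detects perfect aliasing. Partial confounding still yields a full-rank matrix, so a cross-tabulation of batch against condition remains worth inspecting.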

Constrained Sample-to-Batch Assignment

Goal: Allocate samples to batches/lanes/plates to minimize correlation between batch and the biological variables of interest.

Approach: Use a block-randomization-with-optimization tool that scores assignments by how evenly the biological factors spread across batches and returns a near-optimal layout.

library(designit)                              # verify API against installed vignette
set.seed(1)                                    # optimization is stochastic; fix the seed to reproduce the layout
samples <- data.frame(id = sprintf('S%02d', 1:24),
                      condition = rep(c('ctrl', 'treat'), each = 12),
                      sex = rep(c('M', 'F'), 12))
bc <- BatchContainer$new(dimensions = list(batch = 3, position = 8))   # 3 batches x 8 positions = 24 slots
bc <- assign_in_order(bc, samples = samples)   # starting layout, refined by the optimizer below
bc <- optimize_design(
  bc,
  scoring = osat_score_generator(batch_vars = 'batch',
                                 feature_vars = c('condition', 'sex')))   # balance both factors
assignment <- bc$get_samples()                # R6 method on the container (no standalone get_samples())

# OSAT alternative (Bioconductor): build a setup object, then optimal.shuffle() -- NOT a bare osat().
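
Whichever tool produces the layout, the result is worth verifying independently. The sketch below is a base-R check, assuming the assignment is available as a data frame with one row per sample and columns for the batch and each balanced factor; `check_balance()` and the column names are hypothetical and are not part of designit or OSAT.

```r
# For each feature, report the largest spread (max - min) of any level's
# count across batches; 0 means that feature is perfectly balanced.
check_balance <- function(assignment, batch_var, feature_vars) {
  sapply(feature_vars, function(v) {
    counts <- table(assignment[[v]], assignment[[batch_var]])
    max(apply(counts, 1, function(level_counts) max(level_counts) - min(level_counts)))
  })
}

# Hypothetical balanced layout: 3 batches of 8, with 4 ctrl + 4 treat
# and 4 M + 4 F in each batch.
example_layout <- data.frame(
  batch     = rep(1:3, each = 8),
  condition = rep(rep(c("ctrl", "treat"), each = 4), times = 3),
  sex       = rep(c("M", "F"), times = 12))
spread <- check_balance(example_layout, "batch", c("condition", "sex"))

# A confounded variant for comparison: each batch holds a single condition.
skewed <- example_layout
skewed$condition <- rep(c("ctrl", "treat", "ctrl"), each = 8)
skewed_spread <- check_balance(skewed, "batch", "condition")
```

When sample counts do not divide evenly across batches, a spread of 1 is the best achievable, so the acceptable threshold depends on the design.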

Downstream Correction -- Choose by Design, Execute Elsewhere

Correction method selection is a design decision; the execution lives in differential-expression/batch-correction (and single-cell/batch-integration for scRNA-seq). Prefer keeping batch in the analysis model over producing a "cleaned" matrix for inference.

| Method | When it applies | Assumption / caveat | Owner of execution |
|--------|-----------------|---------------------|--------------------|
| Batch as a model covariate | batch known, balanced | charges df honestly; the default for inference | differential-expression |
| ComBat-seq | known batch, integer counts | batch ~ orthogonal to biology; Nygaard caveat if unbalanced | differential-expression/batch-correction |
| ComBat (parametric eB) | known batch, log-normalized data | Gaussian; not for raw counts | differential-expression/batch-correction |
| RUVSeq (RUVg/RUVs/RUVr) | negative-control genes/samples available | controls must be truly null to the biology | differential-expression/batch-correction |
| SVA | hidden/unknown structure | surrogate variables can absorb biology if confounded | this skill estimates; DE consumes |
| limma removeBatchEffect | visualization/clustering ONLY | not for the hypothesis test | data-visualization |
| Harmony / scVI / Seurat anchors | scRNA-seq integration | integration, not DE inference | single-cell/batch-integration |

Detecting Hidden Batch Effects (SVA)

Goal: Estimate unmodeled technical structure (hidden batches) so it can be included in the downstream model.

Approach: Fit a model matrix for the biological variable and a null matrix, estimate the number of surrogate variables, then compute them for inclusion as covariates in the DE analysis.

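library(sva)
# Assumed inputs: expr_normalized is a numeric matrix (features x samples) of normalized,
# log-scale values; colData is a data frame with one row per column of expr_normalized.
# For raw counts, sva provides svaseq() instead.
mod  <- model.matrix(~ condition, data = colData)   # full model: biological variable of interest
mod0 <- model.matrix(~ 1, data = colData)           # null model: intercept only
n_sv <- num.sv(expr_normalized, mod)                # estimated number of surrogate variables
svobj <- sva(expr_normalized, mod, mod0, n.sv = n_sv)
# Add svobj$sv to the design used by differential-expression/de-results; do NOT subtract them
# from the data for the test (subtracting is for visualization only).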
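
Adding the surrogate variables to the design amounts to binding them onto the full model matrix. The sketch below runs without sva by using a simulated stand-in: `sv_mat` takes the place of `svobj$sv` (samples x n_sv), and `sample_info` is a hypothetical annotation table. Both are assumptions for illustration only.

```r
# Stand-ins so the sketch is self-contained; real values come from sva().
set.seed(3)
sample_info <- data.frame(condition = factor(rep(c("ctrl", "treat"), each = 6)))
mod <- model.matrix(~ condition, data = sample_info)   # full model matrix
sv_mat <- matrix(rnorm(nrow(mod) * 2), ncol = 2)       # placeholder for svobj$sv
colnames(sv_mat) <- paste0("SV", seq_len(ncol(sv_mat)))

# Augmented design: biological term plus surrogate variables as covariates.
design_sv <- cbind(mod, sv_mat)

# The augmented design must stay full rank, otherwise an SV is aliased
# with the biological variable and the condition effect is not estimable.
full_rank <- qr(design_sv)$rank == ncol(design_sv)
```

The augmented matrix is what the downstream model would receive as its design (for example as the `design` argument of limma's `lmFit()`), with the condition coefficient as the tested term and the SV columns treated as nuisance covariates.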

Per-Method Failure Modes

Batch confounded with condition

Trigger: all of one condition processed in one batch/run.
Mechanism: batch and condition are aliased -> non-identifiable.
Symptom: condition effect vanishes (or an artifact appears) after correction.
Fix: redesign with balanced assignment; no post-hoc method recovers it (Leek 2010).

ComBat on unbalanced groups

Trigger: ComBat applied when condition is partially confounded with batch.
Mechanism: mean-centering batches injects between-group differences and understates residual variance.
Symptom: inflated DE counts and over-confident downstream inference (Nygaard 2016).
Fix: keep batch in the model (ComBat-seq with the biological covariate, or batch as a DE covariate); best is balanced design.

Subtracting surrogate variables before testing

Trigger: feeding an SV-"cleaned" matrix into the DE test.
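Mechanism: the test treats the cleaned matrix as raw data, so the degrees of freedom spent estimating and removing the SVs are never charged and residual variance is understated.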
Symptom: anti-conservative p-values.
Fix: include SVs as covariates in the model; reserve cleaned matrices for plots.
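
The degrees-of-freedom bookkeeping behind this failure mode can be seen on a single simulated feature. This is a minimal sketch, assuming one hypothetical surrogate variable `sv` and no true condition effect; it only shows the accounting, not the full effect on p-values.

```r
# One simulated feature measured on 20 samples; names and values are illustrative.
set.seed(11)
n <- 20
condition <- factor(rep(c("ctrl", "treat"), each = n / 2))
sv <- rnorm(n)                  # hypothetical surrogate variable
y  <- 0.5 * sv + rnorm(n)       # depends on the SV, not on condition

# Correct: the SV is a covariate, so its degree of freedom is charged.
fit_model <- lm(y ~ condition + sv)

# Problematic: regress the SV out first, then test on the "cleaned" values.
y_cleaned   <- residuals(lm(y ~ sv)) + mean(y)
fit_cleaned <- lm(y_cleaned ~ condition)

df_model   <- df.residual(fit_model)    # n - 3
df_cleaned <- df.residual(fit_cleaned)  # n - 2: the SV's df goes uncounted
```

Each surrogate variable removed before testing adds one uncounted degree of freedom, so the discrepancy grows with the number of SVs and matters most when sample sizes are small.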

Adjusting for a collider

Trigger: "adjust for everything measured" includes a downstream/common-effect variable.
Mechanism: conditioning on a collider opens a spurious path.
Symptom: associations that appear only after adjustment.
Fix: adjust for confounders (common causes), not mediators or colliders.

Quantitative Thresholds

| Threshold | Source | Rationale |
|-----------|--------|-----------|
| Balance every condition equally across batches | Leek 2010 Nat Rev Genet 11:733 | makes batch estimable and removable |
| Unbalanced ComBat can inflate DE (>1000 vs 11 in one case) | Nygaard 2016 Biostatistics 17:29 | mean-centering injects group differences |
| ~50 SNPs/cell suffice to demultiplex pooled donors | Kang 2018 Nat Biotechnol 36:89 | breaks donor<->lane confound |
| Keep batch in the model for inference; clean only for viz | Nygaard 2016; Leek 2010 | honest degrees of freedom |

Common Errors

| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Condition effect disappears after ComBat | batch confounded with condition | balance at design time |
| Inflated DE list after batch correction | unbalanced ComBat | keep batch in model; ComBat-seq with covariate |
| scRNA-seq donor effect equals lane effect | one donor per lane | pool + demultiplex (demuxlet / hashing) |
| Spurious associations after "adjusting for all metadata" | conditioning on a collider/mediator | adjust only for confounders |
| Cannot reconstruct who/when/which-lot | no metadata captured | record date, lot, operator, lane, position |

Reproducibility Metadata

Record for every sample, because these become the batch/blocking variables: processing date, reagent and kit lot numbers, operator, instrument/flow-cell/lane and well/plate position, library prep batch, and any protocol deviations. Unrecorded technical variation cannot be modeled or balanced after the fact, and is a leakage source for any downstream machine-learning model (see machine-learning/model-validation).
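
A sample sheet can be checked for these fields before the experiment starts. The sketch below is one possible base-R check; the field names in `required_fields` are hypothetical labels derived from the list above, and `missing_metadata()` is a helper defined here, not part of any package.

```r
# Hypothetical column names for the metadata listed above.
required_fields <- c("sample_id", "processing_date", "reagent_lot", "operator",
                     "instrument", "lane", "plate_position", "library_prep_batch",
                     "protocol_deviation")

# Report required columns that are absent, and present columns that
# contain missing (NA) or empty entries.
missing_metadata <- function(sheet, required = required_fields) {
  absent  <- setdiff(required, names(sheet))
  present <- intersect(required, names(sheet))
  has_gap <- vapply(sheet[present],
                    function(col) any(is.na(col) | !nzchar(as.character(col))),
                    logical(1))
  list(absent_columns = absent, incomplete_columns = present[has_gap])
}

# Made-up example sheet with two gaps and several absent columns.
sheet <- data.frame(
  sample_id       = c("S01", "S02"),
  processing_date = c("2024-03-01", NA),
  reagent_lot     = c("LOT-A", "LOT-A"),
  operator        = c("op1", ""),
  stringsAsFactors = FALSE)
report <- missing_metadata(sheet)
```

Running such a check at sample intake, rather than at analysis time, catches gaps while the information can still be recovered.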

References

Leek JT, Scharpf RB, Bravo HC, Simcha D, Langmead B, Johnson WE, Geman D, Baggerly K, Irizarry RA. 2010. Tackling the widespread and critical impact of batch effects in high-throughput data. Nat Rev Genet 11:733-739.
Nygaard V, Rødland EA, Hovig E. 2016. Methods that remove batch effects while retaining group differences may lead to exaggerated confidence in downstream analyses. Biostatistics 17:29-39.
Johnson WE, Li C, Rabinovic A. 2007. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8:118-127.
Zhang Y, Parmigiani G, Johnson WE. 2020. ComBat-seq: batch effect adjustment for RNA-seq count data. NAR Genom Bioinform 2:lqaa078.
Leek JT, Storey JD. 2007. Capturing heterogeneity in gene expression studies by surrogate variable analysis. PLoS Genet 3:e161.
Gagnon-Bartsch JA, Speed TP. 2012. Using control genes to correct for unwanted variation in microarray data. Biostatistics 13:539-552.
Kang HM, Subramaniam M, Targ S, et al. 2018. Multiplexed droplet single-cell RNA-sequencing using natural genetic variation. Nat Biotechnol 36:89-94.
Yan L, Ma C, Wang D, Hu Q, Qin M, Conroy JM, Sucheston LE, Ambrosone CB, Johnson CS, Wang J, Liu S. 2012. OSAT: a tool for sample-to-batch allocations in genomics experiments. BMC Genomics 13:689.

Related Skills

randomization-blocking - The experimental unit, randomization, and blocking concepts behind a good batch layout
power-analysis - Account for blocking/batch factors in the power calculation
sample-size - Balanced designs assume equal n per group
multiple-testing - Surrogate variables change the effective number of tests
differential-expression/batch-correction - Executes ComBat-seq/RUVSeq/SVA on real data
single-cell/batch-integration - scRNA-seq integration (Harmony, scVI, Seurat anchors)
machine-learning/model-validation - Batch confounding is a data-leakage source for ML
clinical-biostatistics/power-and-sample-size - Trial randomization and design (regulated regime)
