experimental-design/sample-size/SKILL.md
Estimates the minimum biological replicates (or cells/events) for a target power at a target FDR in genomics experiments using ssizeRNA, PROPER, powsimR for scRNA-seq, and pilot-data dispersion estimation from DESeq2/edgeR. Covers the biological-versus-technical replication distinction (technical replicates do not add degrees of freedom for biological inference), replicate-number-versus-sequencing-depth budgeting, scRNA-seq sample-versus-cell allocation under a pseudobulk model, and the critique that "n=3" is a publication convention rather than a power calculation. Use when budgeting a sequencing experiment, writing the sample-size justification in a grant, estimating replicates from pilot data, allocating a fixed budget between samples and depth, or planning scRNA-seq cohort size. For clinical-trial sample size see clinical-biostatistics/power-and-sample-size; for the power-given-n direction see experimental-design/power-analysis.
Reference examples tested with: ssizeRNA 1.3+, PROPER 1.34+, powsimR 1.2+ (GitHub), DESeq2 1.42+, edgeR 4.0+.
Before using code patterns, verify installed versions match. If versions differ:
- packageVersion('<pkg>') then ?function_name to verify parameters

If code throws an error, introspect the installed package and adapt to the actual API.

Notes: ssizeRNA provides ssizeRNA_single() (one mean/dispersion for all genes), ssizeRNA_vary() (genes vary), and check.power() (average power and true FDR for a given n); powsimR is GitHub-only with drifting signatures. Verify against the installed help before use.
"How many samples do I need?" -> Find the smallest number of biological replicates per group that achieves a target marginal power at a target FDR, given the dispersion and effect-size distribution expected for the assay — counting biological units, not measurements.
- ssizeRNA::ssizeRNA_vary(), ssizeRNA::check.power(): FDR-aware NB sample size; pilot dispersions from DESeq2/edgeR

Sample size is a count of biological replicates: independent experimental units (animals, donors, cultures from independent passages), not measurements. Technical replicates (one library split across lanes, one RNA split into preps) reduce measurement noise but add no degrees of freedom for biological inference. Averaging them into their biological unit is correct, and presenting "n = 3 samples x 3 technical reps = 9" as biological power is a common mistake (Blainey, Krzywinski & Altman 2014 Nat Methods 11:879).

The ubiquitous "n=3" is a publication convention, not a calculation: in the 48-vs-48 yeast benchmark, >=6 biological replicates were needed to recover most true DE genes at realistic effect sizes, and below that the choice of DE tool mattered more than at higher n (Schurch 2016 RNA 22:839). Human and primary material, with higher dispersion, needs more.

For single-cell, the corollary is sharp: population-level DE power is set by the number of donors, not the number of cells, because cells are pseudoreplicates; pseudobulk per donor is the correct unit (Squair 2021 Nat Commun 12:5692; Murphy & Skene 2022 Nat Commun 13:7851).
| Approach | Model | Tool | Strength | Fails / costs when |
|----------|-------|------|----------|--------------------|
| FDR-aware NB sample size | NB, varying mean/dispersion | ssizeRNA::ssizeRNA_vary | controls average power at a true FDR | needs a dispersion/expression model |
| Pilot-dispersion simulation | empirical dispersions from pilot | PROPER, powsimR | most defensible; study-specific | requires a pilot dataset |
| Single-parameter NB | one mean/dispersion for all genes | ssizeRNA::ssizeRNA_single | quick; transparent | ignores the mean-dispersion trend |
| Verify a planned n | average power + true FDR at fixed n | ssizeRNA::check.power | sanity-checks a budget-driven n | not a search over n |
| scRNA-seq cohort sizing | pseudobulk over donors | powsimR | counts the right unit (donors) | cell-level sizing is wrong unit |
| Per-feature t-test n | Gaussian (Cohen's d) | pwr::pwr.t.test | proteomics/continuous after transform | wrong for raw counts |
| Scenario | Recommended approach | Why |
|----------|---------------------|-----|
| Bulk RNA-seq, pilot available | estimate dispersions, then ssizeRNA_vary/PROPER | study-specific dispersion beats a guess |
| Bulk RNA-seq, no pilot | ssizeRNA_vary with a literature dispersion, stated as approximate | transparent starting point |
| Budget already fixed at some n | check.power to report achieved power and true FDR | answers "is this n adequate?" |
| scRNA-seq disease vs control | size the number of DONORS (pseudobulk; powsimR) | population power scales with donors |
| ChIP/ATAC/methylation | NB sample size per region; assay floor as minimum | overdispersed counts; detection floor |
| Proteomics (continuous) | pwr::pwr.t.test per protein, with missingness caveat | Gaussian after transform |
| Have technical replicates | collapse to biological units first | technical reps add no biological df |
| Clinical-trial endpoint | -> clinical-biostatistics/power-and-sample-size | regulated regime |
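For the continuous-assay row above (proteomics after log transform), a per-feature calculation can be sketched in base R. This minimal example uses stats::power.t.test as a stand-in for pwr::pwr.t.test: both solve the same two-sample t-test problem, but power.t.test takes a raw difference and SD where pwr takes Cohen's d. The effect size and SD below are illustrative assumptions, not values from any dataset, and the nominal 0.05 level is per feature with no multiplicity correction.

```r
# Per-feature sample size for a continuous (log-transformed) assay.
# ASSUMED inputs: a log2 fold change of 1 and a per-protein SD of 0.8 (log2 scale).
log2_fc <- 1.0
sd_log2 <- 0.8

# Solve for n per group (n is left unspecified, so power.t.test solves for it).
res <- power.t.test(delta = log2_fc, sd = sd_log2,
                    sig.level = 0.05, power = 0.80,
                    type = "two.sample", alternative = "two.sided")

# Round up: a fractional replicate is not a replicate.
n_per_group <- ceiling(res$n)
n_per_group
```

A stricter per-feature significance level (to approximate FDR control across many proteins) raises the required n, and missing values reduce the effective n per protein.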
Goal: Find the minimum biological replicates per group for a target power at a target FDR, accounting for the proportion of DE genes and the mean-dispersion structure.
Approach: Specify the number of genes, the proportion non-DE (pi0), the mean count and dispersion (ideally from pilot data), the fold change, the target FDR, and the target power; let ssizeRNA_vary search replicate numbers and return the smallest that reaches the target.
```r
library(ssizeRNA)

# Search replicate numbers for the smallest n per group that reaches the target power
res <- ssizeRNA_vary(nGenes = 20000, pi0 = 0.95,          # 5% DE
                     mu = 10, disp = 0.2,                 # mean count + dispersion (from pilot ideally)
                     fc = 1.5, fdr = 0.05, power = 0.80,
                     maxN = 30)
res$ssize  # minimum n per group

# Verify a budget-fixed n: average power and TRUE realized FDR
check.power(nGenes = 20000, pi0 = 0.95, m = 6, mu = 10, disp = 0.2, fc = 1.5, fdr = 0.05, sims = 50)
```
Goal: Replace a guessed CV with a measured dispersion-mean trend from pilot data.
Approach: Fit dispersions on the pilot with DESeq2 or edgeR, summarize them, and feed them into the simulation-based estimator (PROPER or powsimR) rather than a single-CV closed form.
```r
library(DESeq2)

# Fit the pilot data to obtain per-gene dispersion estimates
dds <- DESeqDataSetFromMatrix(pilot_counts, pilot_coldata, ~ condition)
dds <- DESeq(dds)
disp <- dispersions(dds)           # per-gene dispersion estimates
summary(disp[is.finite(disp)])     # feed median/trend to PROPER/powsimR
# A literature CV can be off by ~2x; a pilot dispersion is the defensible input.
```
Technical replicates estimate measurement variance; biological replicates estimate the variance that generalizes to the population, and only the latter supports inference about the biology. Average or sum technical replicates into their biological unit before any test. "n = 3 samples x 3 technical reps" is n = 3, not n = 9 (Blainey 2014). This is the sample-size face of the experimental-unit principle (see experimental-design/randomization-blocking).
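Collapsing technical replicates into their biological unit can be sketched in base R as follows. The count matrix and sample labels are toy values made up for illustration. For a DESeqDataSet, DESeq2::collapseReplicates() performs the same summation.

```r
# Toy count matrix: 4 genes x 6 libraries (ASSUMED data for illustration).
# Each biological unit (A, B, C) was sequenced in two technical runs.
counts <- matrix(1:24, nrow = 4,
                 dimnames = list(paste0("gene", 1:4),
                                 c("A_run1", "A_run2", "B_run1",
                                   "B_run2", "C_run1", "C_run2")))

# Map every library to the biological unit it came from.
bio_unit <- c("A", "A", "B", "B", "C", "C")

# rowsum() sums rows by group, so transpose to put libraries in rows,
# sum within each biological unit, then transpose back to genes x units.
collapsed <- t(rowsum(t(counts), group = bio_unit))

ncol(counts)     # 6 libraries
ncol(collapsed)  # 3 biological units: this is the n that counts
```

Summing (rather than averaging) keeps the data as integer counts, which is what negative-binomial tools such as DESeq2 and edgeR expect.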
Once depth is adequate (roughly >=10-20M mapped reads for bulk RNA-seq DE), additional biological replicates buy more power than additional depth (Liu 2014 Bioinformatics 30:301). Allocate a fixed budget toward more biological units first. scRNA-seq has an analogous rule at the donor level: more donors beat more cells per donor for population DE, with cells per cell type showing diminishing returns past a few hundred (Squair 2021; Murphy-Skene 2022).
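The replicates-first rule can be turned into a simple budget calculation: fix depth at the floor, then spend the rest on biological units. All prices below are hypothetical placeholders, and the 20M-read floor is taken from the upper end of the range quoted above.

```r
# ASSUMED, hypothetical costs; substitute real quotes from the sequencing core.
budget       <- 12000  # total budget (arbitrary currency units)
cost_library <- 150    # library prep per sample
cost_per_M   <- 5      # sequencing cost per million reads
min_depth_M  <- 20     # depth floor, millions of mapped reads per sample
n_groups     <- 2      # e.g. treatment vs control

# Cost of one sample sequenced at the depth floor.
cost_per_sample <- cost_library + cost_per_M * min_depth_M

# Spend on biological replicates first; keep groups balanced.
n_total     <- floor(budget / cost_per_sample)
n_per_group <- floor(n_total / n_groups)

# Whatever is left can buy extra depth or a failure margin.
leftover <- budget - n_per_group * n_groups * cost_per_sample

n_per_group
leftover
```

The resulting n per group is what the budget affords, not what the study needs. It still has to be checked against the target power (for example with check.power).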
| Assay | Practical minimum | For small effects | Source / note |
|-------|-------------------|-------------------|---------------|
| Bulk RNA-seq | 3 (convention) | 6-12 | Schurch 2016 RNA 22:839: >=6 recovers most true DE |
| scRNA-seq (population DE) | 3 donors | 6+ donors | Squair 2021; donors, not cells, drive power |
| ATAC-seq | 2 | 4-6 | library complexity + peak detection floor |
| ChIP-seq | 2 | 3-4 | IDR reproducibility framework (ENCODE) |
| Proteomics (DIA/TMT) | 3 | 6-10 | higher missingness; MNAR |
| Methylation (array/WGBS) | 4 | 8-12 | high per-CpG variance |
The "Practical minimum" values are floors that assume low dispersion and large effects; treat them as the smallest defensible n only after a pilot or literature dispersion supports them.
| Threshold | Source | Rationale |
|-----------|--------|-----------|
| >=6 biological replicates for bulk RNA-seq DE | Schurch 2016 RNA 22:839 | recovers most true DE at realistic effects |
| n=3 is a convention, not a calculation | Schurch 2016 | low power and tool-dependent below 6 |
| Donors, not cells, set scRNA-seq DE power | Squair 2021 Nat Commun 12:5692 | cells are pseudoreplicates |
| Technical reps add 0 biological df | Blainey 2014 Nat Methods 11:879 | only biological reps generalize |
| Depth saturates ~10-20M reads; add replicates | Liu 2014 Bioinformatics 30:301 | biological variance dominates |
| Add 10-20% extra units for failures | common practice | RNA degradation, failed libraries |
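The failure-margin row can be applied mechanically once the powered n is known. The helper below is a hypothetical illustration (its name is not from any package). It works in integer percent so that floating-point error cannot push a whole number over the next ceiling.

```r
# Hypothetical helper: inflate a powered n by a failure margin given in percent.
# Integer percent avoids cases like 10 * 1.1 evaluating to slightly above 11.
with_margin <- function(n, pct) {
  ceiling(n * (100 + pct) / 100)
}

n_powered <- 6                             # n per group from the power calculation
n_low     <- with_margin(n_powered, 10)    # 10% margin
n_high    <- with_margin(n_powered, 20)    # 20% margin
c(low = n_low, high = n_high)
```

With small n, rounding up means even a 10% margin adds a whole extra unit per group.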
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Over-stated power | technical reps counted as n | collapse to biological units |
| Underpowered at n=3 | convention not calculation | size to >=6 (or pilot-driven) |
| scRNA-seq DE does not replicate | sized on cells | size on donors (pseudobulk) |
| Planned n off by a large factor | guessed CV | estimate dispersion from pilot |
| Study fails after sample loss | no failure margin | add 10-20% extra units |
| Pushback | Response |
|----------|----------|
| "Why this n?" | smallest n reaching marginal power >= 0.8 at FDR 0.05 for the minimum meaningful FC; power curve provided |
| "Where did dispersion come from?" | estimated from pilot (DESeq2); literature value used only as a cross-check |
| "Is n=3 enough?" | no; sized to >=6 per Schurch 2016 for realistic effects |
| "Why so many donors for scRNA-seq?" | population DE power scales with donors, not cells (Squair 2021) |
| "Technical replicates?" | collapsed to biological units; they add no biological degrees of freedom |