clinical-biostatistics/power-and-sample-size/SKILL.md
Computes sample size and power for clinical trials including continuous, binary, and time-to-event endpoints; superiority, non-inferiority, and equivalence designs; FDA 2016 non-inferiority margin selection with M1/M2 framework; Schoenfeld 1981 and Lakatos 1988 for survival; Schuirmann TOST and 80-125% bioequivalence; minimum clinically important difference (MCID) vs δ distinction. Use when justifying trial size in protocol or SAP per CONSORT 2025 item 7.
npx skillsauth add GPTomics/bioSkills bio-clinical-biostatistics-power-sample-sizeInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Reference examples tested with: statsmodels 0.14+, scipy 1.12+, numpy 1.26+, pandas 2.1+. R packages cited: pwr, gsDesign (Anderson/Merck), gsDesign2, rpact (Wassmer/Brannath), presize, npsurvSS, nph, simtrial.
Before using code patterns, verify installed versions match. If versions differ:
pip show <package> then help(module.function) to check signaturespackageVersion('<pkg>') then ?function_nameIf code throws ImportError, AttributeError, or TypeError, introspect the installed package and adapt the example to match the actual API rather than retrying.
"Justify the trial's sample size" -> Compute the n needed to detect a pre-specified alternative δ with power 1-β at significance α, accounting for endpoint distribution, design (superiority/NI/equivalence), expected dropout, multiplicity, and stratification — and distinguish δ (the effect the trial is powered to detect) from MCID (the clinically meaningful difference).
δ (the alternative effect): what the trial is powered to detect. Usually set above the MCID because sponsors want a strong signal that exceeds noise + design uncertainty.
MCID (Minimum Clinically Important Difference): the smallest effect size considered clinically meaningful. Jaeschke-Singer-Guyatt 1989 Control Clin Trials 10:407 (anchor-based) and Norman-Sloan-Wyrwich 2003 Med Care 41:582 ("the remarkable universality of half a standard deviation") established the modern conventions.
Confusing the two has produced both:
Postdoc rule of thumb: for superiority, δ >= 1.5 × MCID; for NI, M2 <= 0.5 × MCID.
| Design | Formula / approach | Software | Strength | Fails when |
|--------|--------------------|----------|----------|------------|
| Two-sample t-test, continuous | Cohen's d; n = 2 × (z_α/2 + z_β)² / d² | pwr::pwr.t.test (R); statsmodels.power.tt_ind_solve_power (Py) | Standard | Heteroscedasticity; non-normal outcomes |
| Two-sample proportions (Fleiss) | Asymptotic normal approximation with/without continuity correction | power.prop.test (R) -- uncorrected; pwr::pwr.2p.test; statsmodels | Standard | n < 100/arm: continuity correction debate (D'Agostino 1988) |
| Survival (Schoenfeld 1981) | events ≈ 4(z_α/2 + z_β)² / (log HR)² for 1:1 | gsDesign::nSurv; npsurvSS::size_two_arm | Standard PH-conformant | PH violated (immuno-oncology) under-estimates by 20-50% |
| Survival under non-PH (Lakatos 1988) | Markov chain accommodating time-varying HR, accrual, dropout | gsDesign::nSurv; npsurvSS; simtrial | Handles immuno-oncology delayed effects | Requires explicit specification of HR(t) and accrual |
| MaxCombo SS under NPH | Simulation-based; pre-specify weight family | nphRCT; simtrial | Robust to NPH pattern | Computationally heavier |
| Non-inferiority fixed-margin | n = (z_α + z_β)² × variance / M² | pwr::pwr.t2n.test adapted; rpact::getSampleSizeMeans | Pre-discounted M | Constancy assumption violation invisible |
| Non-inferiority synthesis | Pool historical control-vs-placebo + current test-vs-control | gsDesign::ssTwoArmTest | More efficient than fixed-margin | Constancy assumption MUST hold exactly |
| Equivalence TOST | Two one-sided tests at α each | pwr::pwr.t.test adapted; presize | No multiplicity adjustment needed | Wrong question when superiority/NI is intended |
| Group-sequential | Lan-DeMets spending function | rpact; gsDesign | Interim analyses; early stopping | More complex SAP |
| Sample-size re-estimation (Mehta-Pocock) | Promising-zone conditional power | rpact::getSampleSizeMeans with reestimation | Recovers power if interim shows promise | Unblinded SSR scares FDA |
| Cluster-randomised | Adjust for design effect = 1 + (m-1)ICC | clusterPower; pwr adapted | Standard | ICC misspecification |
Postdoc reading list:
| Scenario | Recommended approach | Why |
|----------|---------------------|-----|
| Continuous outcome, two-arm parallel, superiority | Fleiss / Cohen's d formula via power.t.test or statsmodels | Standard; cite Fleiss 1981 |
| Binary outcome, n > 100/arm | Uncorrected normal approximation via power.prop.test; statsmodels power_proportions_2indep | Adequate for n > 100 |
| Binary outcome, n < 100/arm | Fisher exact-based simulation OR Fleiss with continuity correction | Continuity correction debate; cite D'Agostino 1988 if uncorrected |
| Time-to-event with PH | Schoenfeld 1981 via gsDesign::nSurv | Standard; pre-specify hazards |
| Time-to-event under expected NPH (immuno-oncology) | Lakatos 1988 OR simulation via simtrial::simtrial/nphRCT | Schoenfeld under-estimates by 20-50%; cite Lin 2020 |
| Non-inferiority continuous | Fixed-margin with M2 = 0.5 × historical M1 (discounted) | FDA 2016 NI guidance; cite Temple-Ellenberg 2000 |
| Non-inferiority binary | Fixed-margin with Miettinen-Nurminen CI for RD | Cite EMA NI guidance; MN-CI standard |
| Bioequivalence (Cmax/AUC) | TOST with 80-125% margins on geometric mean ratio | FDA 1992 BE guidance; Schuirmann 1987 |
| Cluster-randomised | n × design effect = 1 + (m-1)ICC | Cite Murray 1998 cluster RCT methodology |
| Group-sequential | Lan-DeMets spending function | rpact or gsDesign |
| Pilot for sample size re-estimation | Blinded SSR (Friede-Kieser 2006) | No Type-I inflation; safer than unblinded |
| Promising-zone reestimation | Mehta-Pocock 2011 with pre-specified increase rule | Recovers power; cite caveat re Jennison-Turnbull 2015 critique |
from statsmodels.stats.power import tt_ind_solve_power
# Solve for n per group
n = tt_ind_solve_power(
effect_size=0.5, # Cohen's d = (mu1 - mu2) / sigma
alpha=0.05,
power=0.80,
alternative='two-sided'
)
print(f'n per arm = {np.ceil(n):.0f}')
# Solve for power given n
power = tt_ind_solve_power(effect_size=0.5, alpha=0.05, nobs1=100, alternative='two-sided')
Cohen's d benchmarks: small = 0.2, medium = 0.5, large = 0.8. Choose d to detect based on prior literature or clinically meaningful effect, NOT post-hoc to fit the affordable n.
Inflate for dropout: if dropout rate is q, multiply final n by 1/(1-q). For q=0.20, n_total = n/0.80.
Stratified randomisation efficiency: if randomisation stratifies on prognostic factors with combined R² = r against outcome, the effective n is n/(1-r²). Senn 2013 Stat Med 32:1439 makes the precision argument explicit.
from statsmodels.stats.power import NormalIndPower
import math
# Without continuity correction
p1, p2 = 0.30, 0.20
pbar = (p1 + p2) / 2
z_alpha = 1.96 # two-sided alpha=0.05
z_beta = 0.84 # power=0.80
n = ((z_alpha * math.sqrt(2 * pbar * (1 - pbar)) +
z_beta * math.sqrt(p1 * (1 - p1) + p2 * (1 - p2))) / (p1 - p2)) ** 2
print(f'n per arm (uncorrected) = {math.ceil(n)}')
# Continuity-corrected (Fleiss formula)
correction = (1 + math.sqrt(1 + 4 / (n * abs(p1 - p2)))) ** 2 / 4
n_corrected = math.ceil(n * correction)
print(f'n per arm (continuity corrected) = {n_corrected}')
The continuity correction debate (D'Agostino, Casagrande, Pike 1988 Stat Med 7:347):
R's power.prop.test uses uncorrected normal approximation — can over-state power 5-10% in small samples (n<100). For confirmatory work with n<100, use simulation-based SS via simr or presize.
import math
# Events needed for two-sample log-rank under PH (Schoenfeld 1981)
from scipy.stats import norm
def schoenfeld_events(hr, alpha=0.05, power=0.80, allocation_ratio=1.0, two_sided=True):
"""
events = (z_crit + z_beta)^2 / (p*(1-p) * log(HR)^2)
where z_crit = z_{1-alpha/2} for two-sided test OR z_{1-alpha} for one-sided.
"""
z_crit = norm.ppf(1 - alpha/2) if two_sided else norm.ppf(1 - alpha)
z_beta = norm.ppf(power)
p = 1 / (1 + allocation_ratio)
return math.ceil((z_crit + z_beta) ** 2 / (p * (1 - p) * math.log(hr) ** 2))
# For HR=0.70, alpha=0.05 two-sided, 80% power, 1:1 -> approx 247 events
events_needed = schoenfeld_events(hr=0.70)
print(f'Events needed for HR=0.70: {events_needed}')
Events drive power, not subjects. Convert to n via expected event rate and follow-up:
# n needed = events / overall event probability over follow-up
expected_overall_event_prob = 0.40 # from pilot or historical
n_per_arm = math.ceil(events_needed / (2 * expected_overall_event_prob))
# Inflate for dropout
n_per_arm_with_dropout = math.ceil(n_per_arm / (1 - 0.15)) # 15% dropout
Use R gsDesign::nSurv() or npsurvSS::size_two_arm() for complex scenarios with:
library(gsDesign)
n_lakatos <- nSurv(
lambdaC = 0.04, # control hazard per month
hr = 0.70, # treatment HR
eta = 0.005, # dropout hazard per month
T = 24, # total study duration in months
minfup = 12, # minimum follow-up
accrualTime = 12, # accrual duration
alpha = 0.025, # one-sided
beta = 0.10 # power = 90%
)
print(n_lakatos)
Schoenfeld 1981 ASSUMES PH. Under immuno-oncology with delayed separation, the formula under-estimates required events by 20-50%. The fix:
# Simulate under expected hazard pattern (Lin 2020 NPH Working Group recommendation)
library(simtrial)
sim_result <- simtrial(
n_per_arm = 250,
enroll_rate = piecewise_enroll(),
fail_rate = piecewise_fail( # delayed-effect specification
duration = c(6, 18),
fail_rate = c(0.04, 0.04), # control
hr = c(1.0, 0.5), # treatment HR is 1 for first 6 months, then 0.5
dropout_rate = c(0.005, 0.005)
),
total_duration = 36
)
# Compute empirical power via MaxCombo or weighted log-rank
FDA NI Trials Guidance (November 2016) establishes the canonical M1/M2 framework:
The "double discount" trap: M2 = 0.5 × (lower CI bound of M1) effectively requires 50% retention of a discounted historical estimate, yielding a much tighter margin than naive point-estimate retention. This has effectively halted new antibiotic NDA development under stringent NI margins.
# Fixed-margin NI sample size (binary endpoint)
import math
p_control = 0.85 # historical/observed control success rate
p_test = 0.85 # null assumption: test = control (NI null is test - control < -M2)
margin_M2 = 0.05 # pre-specified clinically acceptable loss
z_alpha = 1.645 # one-sided alpha=0.025
z_beta = 0.84 # power=0.80
# Approximate: n per arm
variance_term = p_control * (1 - p_control) + p_test * (1 - p_test)
n = math.ceil((z_alpha + z_beta) ** 2 * variance_term / margin_M2 ** 2)
print(f'NI sample size per arm: {n}')
Use Miettinen-Nurminen score CI for RD (the regulatory standard) when computing the NI margin from observed proportions. Cite ratesci::scoreci (R) for production.
Constancy assumption: the historical placebo-vs-active effect must persist in the NI trial. If standard of care drifted, M1 over-estimates and margin is too liberal. Hung-Wang-O'Neill 2005 Biom J critique: NI trials have NO within-trial Type-I error guarantee — alpha is conditional on constancy.
Assay sensitivity (Temple-Ellenberg 2000 Ann Intern Med 133:455 + 133:464): NI trials have only external assay sensitivity (inferred from historical data + constancy); placebo-controlled trials have internal assay sensitivity. This is why NI trials are second-best when placebo is ethical.
TOST (Two One-Sided Tests; Schuirmann 1987): reject H0 of inequivalence iff both one-sided tests reject at α each — equivalent to the 1-2α CI lying within (-δ, +δ). Closed under intersection-union -> no multiplicity adjustment needed despite two tests.
# TOST sample size for continuous outcome
import math
def tost_sample_size(mu_diff, sigma, margin, alpha=0.05, power=0.80):
"""Sample size per arm for TOST equivalence test."""
z_alpha = 1.645 # one-sided
z_beta = 0.84
effect = abs(margin - abs(mu_diff))
if effect <= 0:
raise ValueError('|mu_diff| must be < margin')
n = math.ceil(2 * sigma ** 2 * (z_alpha + z_beta) ** 2 / effect ** 2)
return n
n_eq = tost_sample_size(mu_diff=0.0, sigma=10, margin=5)
print(f'Equivalence sample size per arm: {n_eq}')
FDA 1992 bioequivalence (80-125% rule): geometric mean ratio of Cmax and AUC must have 90% CI within (0.80, 1.25). Implemented in PowerTOST R package for crossover BE designs.
Crossover trials randomise each subject to receive both treatments in different periods. Power calculation is fundamentally different from parallel: the within-subject SD (sigma_w) drives power, not between-subject SD (sigma_b). With reasonable carryover-free designs, crossover requires ~25-50% the n of parallel for the same precision.
# Two-period crossover sample size (continuous endpoint)
import math
def crossover_n_continuous(mean_diff, sd_within, alpha=0.05, power=0.80):
"""n per sequence (so total n = 2*n) for two-period crossover."""
z_alpha = 1.96 # two-sided
z_beta = 0.84 # power=0.80
n_per_seq = math.ceil(2 * (z_alpha + z_beta)**2 * sd_within**2 / mean_diff**2)
return n_per_seq, 2 * n_per_seq # n_per_seq, n_total
Average bioequivalence: geometric mean ratio (GMR) of test/reference Cmax and AUC must have 90% CI within (0.80, 1.25). Log-transform pharmacokinetic parameters; analyse with mixed-effects ANOVA (period, sequence, treatment, subject random).
library(PowerTOST)
# Sample size for 2x2 crossover bioequivalence with CV (within-subject)
n_be <- sampleN.TOST(
alpha = 0.05,
targetpower = 0.80,
theta0 = 0.95, # expected GMR
theta1 = 0.80, # lower BE bound
theta2 = 1.25, # upper BE bound
CV = 0.25, # within-subject CV
design = '2x2x2' # standard 2-period 2-sequence
)
print(n_be)
Highly variable drugs (CV > 30%): standard 80-125% fails with feasible n. Use scaled average bioequivalence (SABE) per FDA 2010:
PowerTOST::sampleN.scABEL() for sample sizeGrizzle test for carryover: compares baseline-adjusted period 1 vs period 2 effects. Now controversial: Senn 2002 Statistics in Medicine showed Grizzle test has poor power; significant carryover should drive design choice (washout extension), not analysis switch.
Modern operational rule: pre-specify adequate washout (>=5 half-lives); do NOT routinely test for carryover in primary analysis. If carryover is biologically plausible, use parallel design or extend washout.
Period effects (calendar/learning) are estimable in 2x2 crossover. Standard analysis includes period as fixed effect:
library(nlme)
fit <- lme(response ~ treatment + period, random = ~1 | subject, data = df)
If period significant, treatment effect is still unbiased (orthogonal in balanced 2x2); report period effect for transparency.
| Scenario | Use crossover? | Why | |----------|----------------|-----| | Bioequivalence of pharmacokinetic parameters | YES | Standard regulatory; within-subject precision much higher | | Chronic stable disease (HTN, GERD, asthma) | YES | Reversible response; ~50% sample size savings | | Acute disease (sepsis, MI) | NO | Cannot retreat same patient | | Curative intent (oncology, surgery) | NO | Treatment alters disease state irreversibly | | Long-half-life drug (>1 week) | NO | Washout impractical | | Patient-reported outcomes with strong period bias | CAUTION | Period bias may dominate |
| Method | Source | Description | |--------|--------|-------------| | Anchor-based | Jaeschke 1989 | Patient-reported anchor question; MCID = mean change in subjects who report "small important" change | | 1 SEM (Wyrwich 1999) | J Clin Epidemiol | MCID ≈ standard error of measurement | | 0.5 SD (Norman 2003) | Med Care 41:582 | "Remarkable universality of half a standard deviation" across PROMs | | 0.5 SD baseline | Common rule of thumb | Quick approximation for unknown PROMs |
Postdoc rule: δ in power calculation should EXCEED MCID (often 1.5-2× MCID) so that trials are powered to detect effects clearly larger than the noise threshold.
For co-primary endpoints, power-adjust per FDA Multiple Endpoints Guidance (October 2022):
| Pattern | Likely cause | Action | |---------|--------------|--------| | Schoenfeld n much smaller than Lakatos simulation n | PH violated under expected HR(t) pattern (immuno-oncology delayed effect) | Use Lakatos OR simulation under expected HR(t); cite Lin 2020 NPH Working Group | | Continuity-corrected vs uncorrected proportion SS differ by ~10% | Asymptotic approximation debate (D'Agostino 1988) | For confirmatory n < 100/arm, use simulation-based SS; document correction approach in SAP | | NI margin per FDA double discount (50% of M1 lower CI) vs sponsor proposed margin (50% of M1 point estimate) | Sponsor uses point estimate; FDA expects lower-CI discount | FDA 2016 NI guidance: M2 from lower CI bound is the regulatory expectation | | Bayesian predictive probability of success vs frequentist power give different n | Bayesian integrates prior uncertainty; frequentist conditions on hypothesised δ | Bayesian PPoS more honest about pre-trial uncertainty; frequentist preferred for confirmatory regulatory; report both for adaptive contexts | | TOST and CI-inclusion approach give different equivalence conclusions | TOST is two-one-sided at α; CI is 1-2α — mathematically equivalent so should NOT disagree | If disagree, check α conventions (Schuirmann 1987: TOST at α, CI at 1-2α) | | Sponsor's MCID vs published anchor-based MCID differ | Different anchor questions or population | Cite primary MCID source; cross-check with Norman 2003 0.5 SD heuristic; declare in SAP | | Stratified-randomisation SS calculation vs unstratified differ substantially | Stratification efficiency gain via Senn 2013 formula | Use stratified formula; cite efficiency = 1/(1-r²) where r² is R² of strata against outcome | | Cluster-RCT SS underpowered after enrollment | ICC misspecified at design (most common SS failure in cluster RCTs) | Use historical ICC + sensitivity analysis at ±50% of estimate; cite Murray 1998 |
| Threshold | Source | Rationale | |-----------|--------|-----------| | δ >= 1.5 × MCID for superiority | Postdoc rule; Norman 2003 implication | δ = MCID gives no margin for sampling variation | | M2 <= 0.5 × historical M1 (FDA double discount) | FDA NI 2016 guidance | Conservative retention against biocreep | | M2 <= 0.5 × MCID for NI | Standard regulatory expectation | Acceptable loss < clinically meaningful difference | | Schoenfeld under non-PH under-estimates by 20-50% | Lin 2020 NPH WG | Switch to Lakatos or simulation | | Continuity correction wastes ~10% sample size | D'Agostino 1988 Stat Med 7:347 | Modern computing makes exact tests cheap | | 90% CI within (0.80, 1.25) for BE | FDA 1992 BE guidance | Geometric mean ratio of Cmax/AUC | | Unblinded SSR Type-I requires CHW weights | Cui-Hung-Wang 1999 | Naive increase inflates Type-I | | Promising-zone CP range ~30-80% | Mehta-Pocock 2011 Stat Med 30:3267 | Mathematical calibration for Type-I preservation |
| Error / symptom | Cause | Solution |
|-----------------|-------|----------|
| Trial under-powered for immunotherapy | Schoenfeld formula used under non-PH | Lakatos or simulation; cite Lin 2020 |
| δ = MCID, CI straddles zero | δ-MCID confusion | Pre-specify δ >= 1.5 × MCID |
| NI "success" but active much worse than historical | Constancy violation | Cite Hung 2005; consider synthesis or 3-arm |
| Unblinded SSR with naive sample increase | Type-I inflation | CHW-weighted test; cite Cui-Hung-Wang 1999 |
| power.prop.test over-states power n < 100 | Uncorrected normal approximation | Simulation-based SS or Fleiss continuity correction |
| BE failure after large n with point estimate near 1.0 | Variability higher than assumed | Re-estimate from pilot; consider replicate design for highly variable drugs |
| Promising-zone tipping to favourable not pre-specified | Operational confusion | Document increase rule in SAP; pre-specify cap |
| Cluster RCT n based on individual-level | Design effect ignored | n × (1 + (m-1)ICC); cite Murray 1998 |
| TOST sample size with non-zero true difference | Treating null as abs(mu_diff)=0 | Include realistic mu_diff in calculation |
| Pushback | Response | |----------|----------| | "Is δ realistic?" | Justified from prior phase 2; δ exceeds MCID by factor X; pre-specified | | "Why this NI margin?" | M2 = 50% of historical lower-CI M1 per FDA 2016 double discount; M2 < MCID | | "Constancy assumption?" | Reviewed historical placebo-vs-active effect over time; stable; sensitivity to lower-end constancy provided | | "Schoenfeld vs Lakatos?" | PH plausible from phase 2 KM; pre-specified Schoenfeld; Lakatos simulation under expected delayed-effect as sensitivity | | "Continuity correction?" | n > 200/arm; uncorrected normal approximation valid (Fleiss 1981) | | "Bioequivalence variability assumption?" | Conservative (upper bound of historical CV); pilot will trigger re-estimation if CV >X% | | "Cluster ICC source?" | Historical ICC from prior cluster trial in similar setting; sensitivity at ICC ±50% provided | | "MCID source?" | Anchor-based MCID from validation study (Jaeschke 1989); supported by 0.5 SD heuristic (Norman 2003) | | "Promising zone or just blinded SSR?" | Promising zone with pre-specified CP boundaries; CHW weights for Type-I control |
development
Find restriction enzyme cut sites in DNA sequences using Biopython Bio.Restriction. Search with single enzymes, batches of enzymes, or commercially available enzyme sets. Returns cut positions for linear or circular DNA. Use when finding restriction enzyme cut sites in sequences.
development
Create restriction maps showing enzyme cut positions on DNA sequences using Biopython Bio.Restriction. Visualize cut sites, calculate distances between sites, and generate text or graphical maps. Use when creating or analyzing restriction maps.
development
Analyze restriction digest fragments using Biopython Bio.Restriction. Predict fragment sizes, get fragment sequences, simulate gel electrophoresis patterns, and perform double digests. Use when analyzing restriction digest fragment patterns.
development
Select restriction enzymes by criteria using Biopython Bio.Restriction. Find enzymes that cut once, don't cut, produce specific overhangs, are commercially available, or have compatible ends for cloning. Use when selecting restriction enzymes for cloning or analysis.