skills/analysis/statistics/hypothesis-testing-guide/SKILL.md
Statistical hypothesis testing, power analysis, and significance reporting
`npx skillsauth add wentorai/research-plugins hypothesis-testing-guide`
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Hypothesis testing is the backbone of empirical research. It provides a principled framework for deciding whether observed differences in data reflect genuine effects or merely random variation. Misuse of hypothesis tests (p-hacking, ignoring assumptions, confusing statistical with practical significance) is a leading cause of irreproducible findings.
This guide covers the core hypothesis testing framework, the most commonly used tests across disciplines, assumption checking, effect size reporting, power analysis for sample size planning, and multiple comparison corrections. Each test is accompanied by Python code using scipy, statsmodels, and pingouin, ready to integrate into research workflows.
The goal is not just to help you run tests, but to help you run the right test correctly and report results following modern standards (APA 7th edition, journal best practices).
| Error Type | Definition | Probability |
|-----------|-----------|-------------|
| Type I (False Positive) | Reject H0 when it is true | alpha (usually 0.05) |
| Type II (False Negative) | Fail to reject H0 when it is false | beta (usually 0.20) |
| Power | Probability of correctly detecting an effect | 1 - beta (target: 0.80) |
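The error rates in the table can be checked empirically by simulation. The sketch below is illustrative only: the helper name `rejection_rate`, the sample size of 30 per group, the effect of 0.5 standard deviations, and the number of simulated experiments are all assumptions chosen for the example.

```python
import numpy as np
from scipy import stats

def rejection_rate(mean_diff, n_per_group=30, n_sims=5000, alpha=0.05, seed=0):
    """Fraction of simulated two-sample t-tests that reject H0 at level alpha."""
    rng = np.random.default_rng(seed)
    # Each row is one simulated experiment with unit-variance normal data
    group_a = rng.normal(0.0, 1.0, size=(n_sims, n_per_group))
    group_b = rng.normal(mean_diff, 1.0, size=(n_sims, n_per_group))
    p_values = stats.ttest_ind(group_a, group_b, axis=1).pvalue
    return float(np.mean(p_values < alpha))

# H0 true (no difference): every rejection is a false positive
type_i_rate = rejection_rate(mean_diff=0.0)
# H0 false (true difference of 0.5 SD): every rejection is a correct detection
power = rejection_rate(mean_diff=0.5)

print(f"Estimated Type I error rate: {type_i_rate:.3f}")
print(f"Estimated power: {power:.3f} (Type II error rate: {1 - power:.3f})")
```

Under these assumptions the first estimate should land close to alpha, while the second falls well short of the 0.80 target, which is the usual motivation for planning sample size before collecting data.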
| Research Question | Data Type | Groups | Test |
|-------------------|-----------|--------|------|
| Two group means differ? | Continuous, normal | 2 independent | Independent t-test |
| Before/after difference? | Continuous, normal | 2 paired | Paired t-test |
| Multiple group means differ? | Continuous, normal | 3+ independent | One-way ANOVA |
| Two group medians differ? | Ordinal / non-normal | 2 independent | Mann-Whitney U |
| Before/after (non-normal)? | Ordinal / non-normal | 2 paired | Wilcoxon signed-rank |
| Multiple groups (non-normal)? | Ordinal / non-normal | 3+ independent | Kruskal-Wallis |
| Association between categories? | Categorical | 2 variables | Chi-square test |
| Correlation? | Continuous | 2 variables | Pearson or Spearman |
```python
from scipy import stats
import numpy as np
import pingouin as pg

# Generate example data (seeded so the run is reproducible)
rng = np.random.default_rng(42)
control = rng.normal(50, 10, size=30)
treatment = rng.normal(55, 10, size=30)

# Check normality assumption
stat_c, p_c = stats.shapiro(control)
stat_t, p_t = stats.shapiro(treatment)
print(f"Normality p-values: control={p_c:.3f}, treatment={p_t:.3f}")

# Check homogeneity of variance
stat_l, p_l = stats.levene(control, treatment)
print(f"Levene's test p={p_l:.3f}")

# Run t-test; equal_var=False (Welch's t-test) is used when Levene's test rejects.
# Treatment is passed first so the sign of t matches the sign of Cohen's d below.
t_stat, p_val = stats.ttest_ind(treatment, control, equal_var=(p_l > 0.05))

# Effect size (Cohen's d) using the pooled sample variance (ddof=1)
n_c, n_t = len(control), len(treatment)
pooled_sd = np.sqrt(
    ((n_c - 1) * control.var(ddof=1) + (n_t - 1) * treatment.var(ddof=1))
    / (n_c + n_t - 2)
)
cohens_d = (treatment.mean() - control.mean()) / pooled_sd
print(f"t={t_stat:.3f}, p={p_val:.4f}, Cohen's d={cohens_d:.3f}")
```
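Reporting standards such as APA 7th edition ask for the test statistic with its degrees of freedom, the p-value, and an effect size. The sketch below shows one way to assemble such a string for a Student's t-test. The helpers `format_p` and `report_ttest` are hypothetical names introduced for this example, and a full APA report would normally also include group means, standard deviations, and a confidence interval.

```python
import numpy as np
from scipy import stats

def format_p(p):
    """Format a p-value without a leading zero; very small values become 'p < .001'."""
    if p < 0.001:
        return "p < .001"
    return "p = " + f"{p:.3f}".lstrip("0")

def report_ttest(a, b):
    """Hypothetical helper: summary string for an equal-variance independent t-test."""
    a = np.asarray(a, dtype=float)
    b = np.asarray(b, dtype=float)
    t_stat, p_val = stats.ttest_ind(a, b, equal_var=True)
    df = len(a) + len(b) - 2
    # Pooled standard deviation from the sample variances (ddof=1)
    pooled_sd = np.sqrt(
        ((len(a) - 1) * a.var(ddof=1) + (len(b) - 1) * b.var(ddof=1)) / df
    )
    d = (a.mean() - b.mean()) / pooled_sd
    return f"t({df}) = {t_stat:.2f}, {format_p(p_val)}, d = {d:.2f}"

# Illustrative data only
rng = np.random.default_rng(1)
treatment_scores = rng.normal(55, 10, size=30)
control_scores = rng.normal(50, 10, size=30)
summary = report_ttest(treatment_scores, control_scores)
print(summary)
```

If Welch's t-test is used instead, the degrees of freedom are no longer n1 + n2 - 2 and the string would need the Welch-Satterthwaite value.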
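```python
import numpy as np
import pandas as pd
import pingouin as pg

# Example data: three independent groups of 30 (seeded for reproducibility)
rng = np.random.default_rng(42)
df = pd.DataFrame({
    'score': np.concatenate([
        rng.normal(50, 10, 30),
        rng.normal(55, 10, 30),
        rng.normal(60, 10, 30)
    ]),
    'group': np.repeat(['A', 'B', 'C'], 30)
})

# One-way ANOVA
aov = pg.anova(data=df, dv='score', between='group', detailed=True)
print(aov)

# Post-hoc pairwise comparisons (Tukey HSD)
posthoc = pg.pairwise_tukey(data=df, dv='score', between='group')
# Column names can vary between pingouin versions; check posthoc.columns if this fails
print(posthoc[['A', 'B', 'diff', 'p-tukey', 'hedges']])
```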
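To make the ANOVA output less of a black box, the F statistic and eta-squared can be rebuilt from sums of squares. This is a minimal sketch on illustrative data (the group means, spread, and seed are assumptions), cross-checked against `scipy.stats.f_oneway`.

```python
import numpy as np
from scipy import stats

# Illustrative data: three groups of 30 with assumed means 50, 55, 60
rng = np.random.default_rng(0)
groups = [rng.normal(mu, 10, 30) for mu in (50, 55, 60)]

all_scores = np.concatenate(groups)
grand_mean = all_scores.mean()

# Between-group and within-group sums of squares
ss_between = sum(len(g) * (g.mean() - grand_mean) ** 2 for g in groups)
ss_within = sum(((g - g.mean()) ** 2).sum() for g in groups)
ss_total = ss_between + ss_within

df_between = len(groups) - 1
df_within = len(all_scores) - len(groups)

# F is the ratio of the two mean squares
f_manual = (ss_between / df_between) / (ss_within / df_within)
# Eta-squared: share of total variance explained by group membership
eta_squared = ss_between / ss_total

f_scipy, p_scipy = stats.f_oneway(*groups)
print(f"F manual={f_manual:.3f}, F scipy={f_scipy:.3f}, p={p_scipy:.4f}")
print(f"eta-squared={eta_squared:.3f}")
```

For a one-way design, eta-squared and partial eta-squared coincide, so this value should match the effect size a one-way ANOVA table reports.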
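```python
import numpy as np
import pandas as pd
from scipy import stats

# Contingency table
observed = pd.DataFrame(
    [[45, 30], [25, 50]],
    index=['Method A', 'Method B'],
    columns=['Success', 'Failure']
)

# Chi-square test of independence
# (scipy applies Yates' continuity correction by default for 2x2 tables)
chi2, p, dof, expected = stats.chi2_contingency(observed)

# Effect size (Cramer's V)
n_total = observed.values.sum()
cramers_v = np.sqrt(chi2 / (n_total * (min(observed.shape) - 1)))
print(f"chi2={chi2:.3f}, p={p:.4f}, Cramer's V={cramers_v:.3f}")
```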
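The test selection table also lists rank-based alternatives for ordinal or non-normal data. A minimal sketch of those calls with `scipy.stats` follows. The skewed example data (exponential draws, with assumed scales and seed) are purely illustrative.

```python
import numpy as np
from scipy import stats

# Illustrative skewed data (assumed scales; not from a real study)
rng = np.random.default_rng(7)
group_a = rng.exponential(scale=10, size=30)
group_b = rng.exponential(scale=15, size=30)
group_c = rng.exponential(scale=20, size=30)
before = rng.exponential(scale=10, size=30)
after = before + rng.normal(2, 3, size=30)  # paired follow-up measurements

# Two independent groups: Mann-Whitney U
u_stat, p_mwu = stats.mannwhitneyu(group_a, group_b, alternative='two-sided')

# Two paired samples: Wilcoxon signed-rank
w_stat, p_wil = stats.wilcoxon(before, after)

# Three or more independent groups: Kruskal-Wallis
h_stat, p_kw = stats.kruskal(group_a, group_b, group_c)

# Monotonic association between two variables: Spearman rank correlation
rho, p_rho = stats.spearmanr(before, after)

print(f"Mann-Whitney U={u_stat:.1f}, p={p_mwu:.4f}")
print(f"Wilcoxon W={w_stat:.1f}, p={p_wil:.4f}")
print(f"Kruskal-Wallis H={h_stat:.3f}, p={p_kw:.4f}")
print(f"Spearman rho={rho:.3f}, p={p_rho:.4f}")
```

A significant Kruskal-Wallis result says only that at least one group differs, so pairwise follow-up tests with a multiple comparison correction are still needed.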
Power analysis answers: "How many participants do I need?"
```python
import numpy as np
import matplotlib.pyplot as plt
from statsmodels.stats.power import TTestIndPower

# For a two-sample t-test
analysis = TTestIndPower()

# Calculate required sample size
n_required = analysis.solve_power(
    effect_size=0.5,  # Cohen's d (medium effect)
    alpha=0.05,
    power=0.80,
    ratio=1.0,  # Equal group sizes
    alternative='two-sided'
)
print(f"Required n per group: {int(np.ceil(n_required))}")

# Power curve: power as a function of per-group sample size
sample_sizes = np.arange(10, 200, 5)
powers = [analysis.power(effect_size=0.5, nobs1=n_obs, ratio=1.0, alpha=0.05)
          for n_obs in sample_sizes]

fig, ax = plt.subplots()
ax.plot(sample_sizes, powers)
ax.axhline(0.8, color='red', linestyle='--', label='Power = 0.80')
ax.set_xlabel('Sample Size per Group')
ax.set_ylabel('Statistical Power')
ax.legend()
fig.savefig('power_curve.pdf')
```
| Effect Size | Small | Medium | Large |
|-------------|-------|--------|-------|
| Cohen's d (t-test) | 0.2 | 0.5 | 0.8 |
| eta-squared (ANOVA) | 0.01 | 0.06 | 0.14 |
| Cramer's V (chi-square) | 0.1 | 0.3 | 0.5 |
| Pearson r (correlation) | 0.1 | 0.3 | 0.5 |
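Power calculations for ANOVA are usually parameterized by Cohen's f rather than eta-squared, so the benchmarks in the table need converting first. The sketch below converts the medium eta-squared value to f and searches for a per-group sample size using the noncentral F distribution from `scipy.stats`. The helper names are hypothetical, and the calculation assumes a balanced one-way design with noncentrality f² × N (N being the total sample size).

```python
import numpy as np
from scipy import stats

def eta_squared_to_f(eta_squared):
    """Convert eta-squared to Cohen's f: f = sqrt(eta^2 / (1 - eta^2))."""
    return np.sqrt(eta_squared / (1.0 - eta_squared))

def anova_power(f_effect, n_per_group, k_groups, alpha=0.05):
    """Power of a balanced one-way ANOVA (assumes noncentrality = f^2 * N_total)."""
    n_total = n_per_group * k_groups
    df_num = k_groups - 1
    df_denom = n_total - k_groups
    noncentrality = f_effect ** 2 * n_total
    # Critical value under H0, then tail probability under the alternative
    critical_f = stats.f.ppf(1 - alpha, df_num, df_denom)
    return float(stats.ncf.sf(critical_f, df_num, df_denom, noncentrality))

f_medium = eta_squared_to_f(0.06)  # medium effect from the table

# Smallest per-group n that reaches 80% power with three groups
n_per_group = 2
while anova_power(f_medium, n_per_group, k_groups=3) < 0.80:
    n_per_group += 1

print(f"Cohen's f = {f_medium:.3f}")
print(f"Required n per group (3 groups): {n_per_group}")
```

The same loop can be rerun with the small or large benchmark to see how sharply the required sample size depends on the assumed effect.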
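When running multiple tests, the family-wise error rate inflates: with m independent tests at alpha = 0.05, the probability of at least one false positive is 1 - 0.95^m (about 0.40 for m = 10). Use corrections: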
```python
from statsmodels.stats.multitest import multipletests

p_values = [0.01, 0.04, 0.03, 0.08, 0.002]

# Bonferroni (conservative)
reject_bonf, pvals_bonf, _, _ = multipletests(p_values, method='bonferroni')

# Benjamini-Hochberg FDR (less conservative)
reject_bh, pvals_bh, _, _ = multipletests(p_values, method='fdr_bh')

for i, p in enumerate(p_values):
    print(f"p={p:.3f} | Bonferroni: {pvals_bonf[i]:.3f} ({reject_bonf[i]}) "
          f"| BH-FDR: {pvals_bh[i]:.3f} ({reject_bh[i]})")
```
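To see what the two corrections actually compute, both adjustments can be written out by hand with NumPy. This is a sketch for understanding, not a replacement for `multipletests`. The function name `benjamini_hochberg` is introduced here for illustration.

```python
import numpy as np

def benjamini_hochberg(p_values):
    """Benjamini-Hochberg adjusted p-values, returned in the original order."""
    p = np.asarray(p_values, dtype=float)
    m = len(p)
    order = np.argsort(p)
    # Scale the i-th smallest p-value by m / i
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity: running minimum taken from the largest p-value downward
    monotone = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.clip(monotone, 0.0, 1.0)
    return adjusted

p_values = [0.01, 0.04, 0.03, 0.08, 0.002]

# Bonferroni: multiply each p-value by the number of tests, capped at 1
bonferroni = np.minimum(np.array(p_values) * len(p_values), 1.0)
bh_adjusted = benjamini_hochberg(p_values)

for raw, bonf, bh in zip(p_values, bonferroni, bh_adjusted):
    print(f"p={raw:.3f} | Bonferroni: {bonf:.3f} | BH-FDR: {bh:.3f}")
```

For these five p-values the Bonferroni-adjusted values are 0.05, 0.20, 0.15, 0.40, and 0.01, while the Benjamini-Hochberg values are 0.025, 0.05, 0.05, 0.08, and 0.01, which shows how much less conservative the FDR approach is.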