skills/by-role/data-scientist/experiment-design/SKILL.md
Design statistically rigorous experiments to test data science hypotheses. Use when the user says "design an A/B test", "run an experiment", "test this hypothesis", "is this difference significant", "how many samples do I need", "randomized controlled trial", "causal inference", "uplift test", "holdout group", "significance test", "power calculation", "avoid p-hacking", or needs to determine whether an observed effect is real before making a product or model decision.
Install this skill globally with one command: npx skillsauth add qa-aman/claude-skills experiment-design
Works with Claude Code, Cursor, and Windsurf.
Based on "The Art of Statistics" by David Spiegelhalter. The core principle: statistical significance is not the same as practical significance, and a poorly designed experiment produces confidently wrong answers. Rigorous experiment design means defining the question, estimating the required sample size, and setting success criteria before any data is collected - then interpreting results within the limits of what the experiment can actually prove.
A hypothesis must specify: the intervention, the population, the outcome metric, and the expected direction.
Format: "Applying [intervention] to [population] will [increase/decrease] [metric] compared to the control condition."
Example: "Showing personalized recommendations to new users in the first session will increase 7-day retention compared to showing trending content."
Also state the null hypothesis explicitly: "There is no difference in 7-day retention between personalized and trending content groups."
Spiegelhalter's warning: if you cannot state what result would cause you to reject your hypothesis, you do not have a hypothesis - you have a belief.
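One way to enforce the four required parts is to record the hypothesis as a small structured object before the experiment starts. The sketch below is only an illustration: the `Hypothesis` class and its field names are hypothetical, and the example values are taken from the retention example above.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Hypothesis:
    """Hypothetical container for the parts a hypothesis must specify."""
    intervention: str
    population: str
    direction: str  # "increase" or "decrease"
    metric: str
    control: str

    def __post_init__(self):
        # Refuse vague directions such as "change" or "affect".
        if self.direction not in ("increase", "decrease"):
            raise ValueError("direction must be 'increase' or 'decrease'")

    def statement(self) -> str:
        return (f"{self.intervention} to {self.population} will "
                f"{self.direction} {self.metric} compared to {self.control}.")

    def null(self) -> str:
        return (f"There is no difference in {self.metric} "
                f"between the treatment and control groups.")


h = Hypothesis(
    intervention="Showing personalized recommendations",
    population="new users in the first session",
    direction="increase",
    metric="7-day retention",
    control="showing trending content",
)
print(h.statement())
print(h.null())
```

Because the direction is validated, a hypothesis that cannot say which way the metric should move fails at construction time rather than at analysis time.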
Primary metric: The single number that determines whether the experiment succeeded. Choose one. Multiple primary metrics make interpretation ambiguous.
Guard metrics: Metrics that must not significantly worsen. If the primary metric improves but a guard metric degrades, the experiment fails.
Example:
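As a hypothetical illustration of the primary/guard rule (the metric names other than 7-day retention, the numbers, and the `MetricResult` type are assumptions, not part of the source), the pass/fail decision could be sketched like this. It treats "significantly worsen" as a significant move in the bad direction at the chosen alpha, which is a simplification; a non-inferiority margin is another option.

```python
from dataclasses import dataclass


@dataclass
class MetricResult:
    """Hypothetical result record for one metric."""
    name: str
    lift: float                   # treatment minus control, in the metric's units
    p_value: float
    higher_is_better: bool = True


def experiment_passes(primary, guards, alpha=0.05):
    """Pass only if the primary metric improves significantly and no guard
    metric moves significantly in the bad direction."""
    def improved(m):
        return m.lift > 0 if m.higher_is_better else m.lift < 0

    def worsened(m):
        return m.lift < 0 if m.higher_is_better else m.lift > 0

    primary_ok = improved(primary) and primary.p_value < alpha
    failed = [g.name for g in guards if worsened(g) and g.p_value < alpha]
    return primary_ok and not failed, failed


# Made-up numbers for illustration only.
primary = MetricResult("7_day_retention", lift=0.021, p_value=0.01)
guards = [
    MetricResult("page_load_ms", lift=35.0, p_value=0.002, higher_is_better=False),
    MetricResult("unsubscribe_rate", lift=0.0001, p_value=0.60, higher_is_better=False),
]
passed, failed_guards = experiment_passes(primary, guards)
print(passed, failed_guards)
```

Here the primary metric improves, but the experiment still fails because one guard metric degraded significantly.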
Use a power calculation to determine how many observations you need. Running underpowered experiments wastes time and produces false negatives.
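Inputs needed: the baseline rate of the primary metric, the minimum detectable effect (MDE), the significance level (alpha), and the target power.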
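```python
from math import ceil
from statsmodels.stats.power import NormalIndPower

baseline = 0.22  # baseline rate of the primary metric
mde = 0.02       # minimum detectable effect, absolute (2 percentage points)

analysis = NormalIndPower()
n = analysis.solve_power(
    # Standardized effect: absolute lift divided by the baseline standard deviation
    effect_size=mde / (baseline * (1 - baseline)) ** 0.5,
    alpha=0.05,   # two-sided significance level
    power=0.80,   # probability of detecting a true effect of size mde
    alternative="two-sided",
)
print(f"Required n per group: {ceil(n)}")  # round up to whole observations
```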
If the required sample size exceeds your available traffic within a reasonable window, increase the MDE or extend the timeline. Do not run an underpowered test.
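To see how the MDE trades off against traffic, the same normal approximation can be inverted to ask what effect is detectable with the sample you actually have. This is a rough standard-library sketch, not the statsmodels implementation: it uses the baseline variance p(1 - p) for both groups, as the power calculation above does, and the `available` traffic figure is a made-up assumption.

```python
from math import ceil, sqrt
from statistics import NormalDist


def n_per_group(baseline, mde, alpha=0.05, power=0.80):
    """Approximate per-group sample size for a two-sided test on a rate."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return ceil(2 * z ** 2 * baseline * (1 - baseline) / mde ** 2)


def detectable_mde(baseline, n, alpha=0.05, power=0.80):
    """Smallest absolute lift detectable with n observations per group."""
    z = NormalDist().inv_cdf(1 - alpha / 2) + NormalDist().inv_cdf(power)
    return z * sqrt(2 * baseline * (1 - baseline) / n)


needed = n_per_group(baseline=0.22, mde=0.02)
available = 4000  # hypothetical per-group traffic in the planned window
print(f"Needed per group for a 2pp MDE: {needed}")
print(f"MDE at {available} per group: {detectable_mde(0.22, available):.4f}")
```

If the detectable MDE at your available traffic is larger than any effect you could plausibly expect, the test is not worth running in that window.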
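Randomization unit must match the analysis unit. For example, if you randomize by user, analyze per-user outcomes; treating each session as an independent observation overstates the sample size and understates the variance.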
Document:
Run a sanity check (A/A test) before the real experiment if possible - confirm the two groups are balanced on baseline characteristics.
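A common way to keep the randomization unit stable is to hash the unit's id together with the experiment name, so the same user always lands in the same group. The sketch below assumes the user is the randomization unit; the `assign_group` helper, the experiment name, and the user ids are all hypothetical.

```python
import hashlib
from collections import Counter


def assign_group(user_id: str, experiment: str, treatment_share: float = 0.5) -> str:
    """Deterministically map a user to a group by hashing the user id with
    the experiment name, so repeat visits land in the same group."""
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode("utf-8")).hexdigest()
    bucket = int(digest[:8], 16) / 0x100000000  # value in [0, 1)
    return "treatment" if bucket < treatment_share else "control"


# Hypothetical user ids; in practice these come from your own user table.
user_ids = [f"user-{i}" for i in range(10_000)]
groups = Counter(assign_group(u, "personalized-recs-v1") for u in user_ids)
print(groups)

# Basic balance check on group sizes before looking at any outcome.
treatment_fraction = groups["treatment"] / len(user_ids)
print(f"Treatment fraction: {treatment_fraction:.3f}")
```

Including the experiment name in the hash keeps assignments independent across experiments, so a user in treatment for one test is not automatically in treatment for the next.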
Never stop an experiment early because the results look good. Peeking inflates false positive rates.
Define in advance:
If early stopping is operationally required, use a sequential testing method (e.g., always-valid p-values, group sequential design) instead of repeatedly applying a fixed-sample significance test.
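The cost of peeking can be checked with a small simulation of A/A tests, where there is no true effect and every rejection is a false positive. This is a minimal sketch under simplifying assumptions (normally distributed outcomes with known unit variance, a z-test, five evenly spaced looks); the function name and parameters are illustrative.

```python
import random
from math import sqrt
from statistics import NormalDist


def simulate_false_positive_rates(n_experiments=1000, n_per_group=500,
                                  n_looks=5, alpha=0.05, seed=0):
    """Simulate A/A tests and compare two rules: reject if ANY interim look
    has p < alpha, versus testing once at the planned sample size."""
    rng = random.Random(seed)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    look_points = [n_per_group * (i + 1) // n_looks for i in range(n_looks)]
    peeking_hits = final_hits = 0
    for _ in range(n_experiments):
        sum_c = sum_t = 0.0
        n = 0
        rejected_at_any_look = False
        significant = False
        for look in look_points:
            while n < look:
                sum_c += rng.gauss(0, 1)  # control outcome
                sum_t += rng.gauss(0, 1)  # treatment outcome, same distribution
                n += 1
            z = (sum_t / n - sum_c / n) / sqrt(2 / n)
            significant = abs(z) > z_crit
            rejected_at_any_look = rejected_at_any_look or significant
        peeking_hits += rejected_at_any_look
        final_hits += significant  # result at the last (planned) look only
    return peeking_hits / n_experiments, final_hits / n_experiments


peeking_rate, final_rate = simulate_false_positive_rates()
print(f"False positive rate with peeking: {peeking_rate:.3f}")
print(f"False positive rate, single final test: {final_rate:.3f}")
```

The single final test should land near the nominal alpha, while the "stop at the first significant look" rule rejects noticeably more often, even though nothing differs between the groups.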
Spiegelhalter's emphasis: p-values tell you how compatible the data are with no effect at all; effect sizes tell you whether the effect matters.
Report: the absolute effect size, the relative lift, the 95% confidence interval, and the p-value.
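```python
import numpy as np
from scipy import stats

# Assumes `df` is a pandas DataFrame with a "group" column and a 0/1 "converted" column
control = df[df["group"] == "control"]["converted"]
treatment = df[df["group"] == "treatment"]["converted"]

# Treatment first, so the sign of t matches the sign of the effect
t_stat, p_value = stats.ttest_ind(treatment, control)
effect_size = treatment.mean() - control.mean()

# Standard error of the difference in means (not of the combined sample)
se_diff = np.sqrt(control.var(ddof=1) / len(control) + treatment.var(ddof=1) / len(treatment))
ci = stats.t.interval(0.95, len(control) + len(treatment) - 2, loc=effect_size, scale=se_diff)

print(f"Effect: {effect_size:.4f} ({effect_size / control.mean() * 100:.1f}% relative lift)")
print(f"95% CI: ({ci[0]:.4f}, {ci[1]:.4f})")
print(f"p-value: {p_value:.4f}")
```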
State the conclusion as: "The data [supports / does not support] rejecting the null hypothesis. The observed effect was [X], with a 95% CI of [Y to Z]. This [is / is not] practically significant because [reason]."
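The template can be filled in mechanically once the effect, interval, and p-value are known. The sketch below is one possible reading, not a prescribed rule: it assumes a positive lift is the desired direction and treats "practically significant" as "statistically significant and at least as large as the MDE". The `summarize` function is hypothetical, and the numbers reuse the 0.4pp example from the common-mistakes list in this skill.

```python
def summarize(effect, ci_low, ci_high, p_value, mde, alpha=0.05):
    """Fill in the conclusion template. Effect, CI bounds, and MDE are
    absolute differences expressed as proportions."""
    if p_value >= alpha:
        supports, verdict = "does not support", "is not"
        reason = "the effect cannot be distinguished from zero at the chosen alpha"
    elif effect >= mde:
        supports, verdict = "supports", "is"
        reason = f"the lift meets the {mde * 100:.1f}pp MDE"
    else:
        supports, verdict = "supports", "is not"
        reason = f"the lift is below the {mde * 100:.1f}pp MDE"
    return (f"The data {supports} rejecting the null hypothesis. "
            f"The observed effect was {effect * 100:.1f}pp, with a 95% CI of "
            f"{ci_low * 100:.1f}pp to {ci_high * 100:.1f}pp. "
            f"This {verdict} practically significant because {reason}.")


conclusion = summarize(effect=0.004, ci_low=0.001, ci_high=0.007, p_value=0.03, mde=0.01)
print(conclusion)
```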
1. Peeking and stopping early
Bad: Checking results daily and stopping as soon as p < 0.05.
Good: Pre-commit to a sample size and stop date. Do not look at significance until the stopping criterion is met.
2. Multiple primary metrics
Bad: "We'll call it a win if conversion OR retention OR revenue improves."
Good: One primary metric. Guard metrics are binary pass/fail, not additional wins.
3. Reporting only p-values
Bad: "The result was statistically significant (p = 0.03), so we should ship."
Good: "The result was significant (p = 0.03). The absolute lift was 0.4pp (95% CI: 0.1 to 0.7). This is below our 1pp MDE threshold, so we will not ship."
4. Ignoring novelty effects
Bad: Running an experiment for 3 days and seeing a spike from users trying something new.
Good: Running for at least 1-2 weeks and examining whether the effect size stabilizes over time.
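A simple way to look for a novelty effect is to compare the lift in the first few days with the lift afterwards. The sketch below uses made-up daily counts purely for illustration; the helper names and the three-day "early" window are assumptions, and a real analysis would also put intervals around each window's lift.

```python
def daily_lifts(daily_counts):
    """Absolute conversion lift (treatment minus control) for each day.
    Each row: (control_conversions, control_users,
               treatment_conversions, treatment_users)."""
    return [t_conv / t_n - c_conv / c_n for c_conv, c_n, t_conv, t_n in daily_counts]


def novelty_check(daily_counts, early_days=3):
    """Mean lift in the first `early_days` versus the remaining days."""
    lifts = daily_lifts(daily_counts)
    early, late = lifts[:early_days], lifts[early_days:]
    return sum(early) / len(early), sum(late) / len(late)


# Made-up counts: a spike in the first three days that fades afterwards.
counts = [
    (220, 1000, 270, 1000),
    (215, 1000, 262, 1000),
    (222, 1000, 255, 1000),
    (218, 1000, 231, 1000),
    (221, 1000, 230, 1000),
    (219, 1000, 229, 1000),
    (220, 1000, 231, 1000),
]
early_lift, late_lift = novelty_check(counts)
print(f"Mean lift, days 1-3: {early_lift:.3f}")
print(f"Mean lift, days 4-7: {late_lift:.3f}")
```

A late-window lift well below the early-window lift suggests the headline number is inflated by novelty, and the later window is the better guide to the long-run effect.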