skills/content/ab-testing/SKILL.md
Run email A/B tests with statistical rigor. Use when testing subject lines, content variants, send times, CTAs, or measuring experiment significance.
npx skillsauth add chunkydotdev/email-skills ab-testing

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Test email variations systematically to improve open rates, click rates, and conversions with statistical confidence.
Related skills:

- email-copywriting - writing the actual content variations to test
- template-design - HTML template variations for layout and visual tests
- spam-filter-avoidance - ensure test variants don't accidentally trigger spam filters
- sender-reputation - monitor whether testing impacts your sending reputation
- email-sequences - testing within drip campaigns and automated sequences

Not all tests deliver equal value. Start with high-impact, easy-to-measure elements and work your way down.
Tier 1 (highest impact):

| Element | What to vary | Primary metric | Why it matters |
|---------|-------------|----------------|----------------|
| Subject line | Length, personalization, question vs statement, emoji, urgency | Open rate | The single biggest lever. A bad subject line means nobody sees anything else. |
| From name | Company name vs person name vs "Person at Company" | Open rate | Recipients decide to open based on who sent it as much as the subject. |
| Send time | Day of week, hour of day, timezone-adjusted vs fixed | Open rate | Same email sent at 6 AM vs 10 AM can see 20-40% open rate differences. |

Tier 2:

| Element | What to vary | Primary metric | Why it matters |
|---------|-------------|----------------|----------------|
| CTA | Button text, color, placement, number of CTAs | Click rate | "Get started" vs "Start your free trial" can shift click rates by 10-30%. |
| Preview text | First 40-90 characters visible in inbox | Open rate | Often overlooked - many senders leave this as the default HTML boilerplate. |
| Content length | Short vs long, single-topic vs multi-topic | Click rate | Depends heavily on audience and email type. No universal "right" length. |

Tier 3:

| Element | What to vary | Primary metric | Why it matters |
|---------|-------------|----------------|----------------|
| Layout | Single column vs multi-column, image placement | Click rate | Visual hierarchy affects scanning behavior. |
| Personalization depth | Name only vs company vs role-specific content | Click rate, conversion | Diminishing returns - basic personalization matters most. |
| Tone | Formal vs casual, first person vs third person | Click rate, reply rate | Audience-dependent. B2B enterprise vs startup is a different world. |
Rule of thumb: If you're sending fewer than 50,000 emails per month, focus on tier 1. You probably don't have the volume to detect tier 3 differences.
This is where most email A/B tests go wrong. People call winners based on gut feeling or tiny sample sizes.
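The sample size you need depends on three things:

- Baseline rate - the current value of the metric you are testing (for example, a 20% open rate)
- Minimum detectable effect (MDE) - the smallest relative improvement you want to be able to detect
- Confidence level and statistical power - how much risk of a false positive or a missed effect you are willing to accept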
Here are practical minimums per variant for a 95% confidence level and 80% power:
| Baseline rate | MDE (relative) | Sample per variant | Total for 2 variants |
|--------------|----------------|-------------------|---------------------|
| 20% open rate | 20% (detect 24% vs 20%) | ~1,700 | ~3,400 |
| 20% open rate | 10% (detect 22% vs 20%) | ~6,500 | ~13,000 |
| 5% click rate | 20% (detect 6% vs 5%) | ~8,200 | ~16,400 |
| 5% click rate | 30% (detect 6.5% vs 5%) | ~3,800 | ~7,600 |
| 2% conversion | 50% (detect 3% vs 2%) | ~3,800 | ~7,600 |

Translation: If your open rate is 20% and you want to detect a 20% relative improvement (4 percentage point lift to 24%), you need about 1,700 recipients in each variant - roughly 3,400 total sends.
If you can only detect a 50%+ relative change, the test is probably not worth running. You'll only catch massive differences, and you won't learn anything about incremental improvements.
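Minimums like these can be estimated with the standard normal-approximation formula for comparing two proportions. The sketch below is one way to do that, assuming a two-sided test; the function name and its parameters are illustrative and not part of any ESP's API. Different calculators use slightly different approximations, so treat the output as a planning figure.

```python
from math import ceil
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate recipients needed per variant (two-sided test).

    baseline     -- current rate, e.g. 0.20 for a 20% open rate
    relative_mde -- relative lift to detect, e.g. 0.20 for 20% -> 24%
    """
    p1 = baseline
    p2 = baseline * (1 + relative_mde)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # ~1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


# 20% open rate, detect a 20% relative lift (24% vs 20%)
n_open = sample_size_per_variant(0.20, 0.20)
# 2% conversion rate, detect a 50% relative lift (3% vs 2%)
n_conv = sample_size_per_variant(0.02, 0.50)
print(n_open, n_conv)
```

Halving the MDE roughly quadruples the required sample, which is why small lifts on low-baseline metrics are so expensive to detect.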
The standard significance test for email A/B testing is the two-proportion z-test. It compares two conversion rates and tells you whether the difference is statistically significant.
p1 = control conversions / control total
p2 = variant conversions / variant total
p_pool = (control conversions + variant conversions) / (control total + variant total)
standard_error = sqrt(p_pool * (1 - p_pool) * (1/control_total + 1/variant_total))
z = (p2 - p1) / standard_error
A z-score above 1.96 (or below -1.96) means p < 0.05 - the result is significant at 95% confidence.
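The formulas above translate directly into a few lines of Python. This is a minimal sketch using only the standard library; the function name and the example counts are hypothetical.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_z_test(control_conv: int, control_total: int,
                          variant_conv: int, variant_total: int):
    """Return (z, two-sided p-value) for variant vs control."""
    p1 = control_conv / control_total
    p2 = variant_conv / variant_total
    p_pool = (control_conv + variant_conv) / (control_total + variant_total)
    se = sqrt(p_pool * (1 - p_pool) * (1 / control_total + 1 / variant_total))
    if se == 0:
        # Happens when nobody (or everybody) converted in both groups
        raise ValueError("standard error is zero; no variation to test")
    z = (p2 - p1) / se
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    return z, p_value


# Hypothetical result: 400/2,000 opens (20%) vs 460/2,000 opens (23%)
z, p_value = two_proportion_z_test(400, 2000, 460, 2000)
print(f"z = {z:.2f}, p = {p_value:.3f}")
```

With these made-up counts the z-score lands above 1.96, so the difference would count as significant at the 95% level.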
What 95% confidence actually means: If there were truly no difference between the variants, a gap at least this large would show up by chance less than 5% of the time. It does NOT mean there's a 95% chance the variant is better - that's a common misinterpretation.
A result can be "statistically significant" but practically meaningless. Always look at the confidence interval for the difference:
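One common way to get that interval is the Wald (normal-approximation) interval, which uses the unpooled standard error. The sketch below assumes that method; the function name and counts are illustrative only.

```python
from math import sqrt
from statistics import NormalDist


def diff_confidence_interval(control_conv: int, control_total: int,
                             variant_conv: int, variant_total: int,
                             confidence: float = 0.95):
    """Wald interval for (variant rate - control rate), in absolute terms."""
    p1 = control_conv / control_total
    p2 = variant_conv / variant_total
    # Unpooled standard error: each group keeps its own variance
    se = sqrt(p1 * (1 - p1) / control_total + p2 * (1 - p2) / variant_total)
    z_crit = NormalDist().inv_cdf(1 - (1 - confidence) / 2)
    diff = p2 - p1
    return diff - z_crit * se, diff + z_crit * se


# Hypothetical result: 400/2,000 (20%) vs 460/2,000 (23%)
ci_low, ci_high = diff_confidence_interval(400, 2000, 460, 2000)
print(f"lift is between {ci_low:.1%} and {ci_high:.1%} (absolute)")
```

An interval that excludes zero but whose lower bound sits near zero says the variant is probably better, though possibly by too little to matter.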
Good A/B tests require truly random, consistent assignment. A recipient who receives variant A should always be in variant A if they encounter the experiment again.
Hash-based deterministic assignment is the gold standard. Hash the experiment ID + recipient email to produce a stable bucket assignment:
bucket = SHA256(experimentId + ":" + contactEmail) -> normalize to [0, 1)
This approach:
Random list splits in your ESP work for one-off campaigns, but break down for sequences or journeys where the same person should consistently see the same variant.
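A minimal sketch of hash-based assignment, assuming SHA-256 over `experimentId + ":" + contactEmail` as described above. The function names and the weighted-variant dictionary are illustrative assumptions, not a specific ESP's implementation.

```python
import hashlib


def bucket_for(experiment_id: str, contact_email: str) -> float:
    """Map (experiment, recipient) to a stable number in [0, 1)."""
    key = f"{experiment_id}:{contact_email}".encode("utf-8")
    digest = hashlib.sha256(key).digest()
    # Use the first 48 bits so the division is exact and always below 1.0
    return int.from_bytes(digest[:6], "big") / 2**48


def assign_variant(experiment_id: str, contact_email: str,
                   weights: dict) -> str:
    """Pick a variant by walking cumulative weights, e.g. {"A": 1, "B": 1}."""
    bucket = bucket_for(experiment_id, contact_email)
    total = sum(weights.values())
    cumulative = 0.0
    for name, weight in weights.items():
        cumulative += weight / total
        if bucket < cumulative:
            return name
    return name  # Guard against floating-point rounding on the last bucket


first = assign_variant("subject-test-1", "ada@example.com", {"A": 1, "B": 1})
second = assign_variant("subject-test-1", "ada@example.com", {"A": 1, "B": 1})
print(first, second)  # Same variant both times
```

Because the experiment ID is part of the hash input, the same recipient can land in different buckets across experiments while staying fixed within one.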
Minimum: 48 hours. Email open behavior has strong day-of-week patterns. A test that runs only during Tuesday morning will miss the Thursday openers.
Recommended: 5-7 days. This captures a full weekly cycle and accounts for people who don't check email daily.
Maximum: 14 days. Beyond two weeks, external factors (seasonality, news events, list decay) start to contaminate your results.
Rules for when to stop:
Change only one element per test. If you change the subject line AND the CTA AND the send time, and variant B wins, you have no idea which change caused the improvement. You can't apply what you learned.
Exception: multivariate testing (covered below) can test multiple variables simultaneously, but requires much larger sample sizes.
| Factor | A/B testing | Multivariate testing |
|--------|------------|---------------------|
| Variables tested | 1 | 2+ simultaneously |
| Variants needed | 2-4 | Every combination (2x2=4, 2x3=6, 3x3=9...) |
| Sample size | Moderate (1,000+ per variant) | Large (1,000+ per combination) |
| What you learn | Which variant wins | Which combination wins AND which variables have the most impact |
| When to use | Most of the time | When you have high volume (100k+ sends) and want to understand variable interactions |
Only if ALL of these are true:
For most email programs: stick with A/B tests. Run them sequentially. Subject line test in January, CTA test in February, send time test in March. You'll learn more from three clean A/B tests than one muddy multivariate test.
Traditional A/B tests run for a fixed duration, then you pick the winner and deploy. Bandit algorithms (multi-armed bandit, Thompson sampling) dynamically shift traffic toward the better-performing variant during the test.
Use fixed-horizon A/B tests when:
Use bandit algorithms when:
Tradeoff: You sacrifice statistical rigor for better aggregate performance. You may not know if variant B is truly better - but more people saw the better-performing option.
Most ESPs that offer "auto-winner" selection are doing a basic version of this: send to a test portion, wait a fixed time, then send the winner to the remainder. This is better than nothing but is not a true bandit algorithm - it doesn't continuously adapt.
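To make the difference concrete, here is a small Thompson sampling simulation. It is a sketch under simplifying assumptions: feedback arrives immediately after each send (real email engagement arrives hours later, so production systems typically update in batches), and the true click rates are made-up values used only to drive the simulation.

```python
import random


def thompson_pick(stats: dict, rng: random.Random) -> str:
    """stats maps variant -> (successes, failures). Sample a plausible rate
    for each variant from its Beta posterior and send the highest draw."""
    draws = {name: rng.betavariate(1 + wins, 1 + losses)
             for name, (wins, losses) in stats.items()}
    return max(draws, key=draws.get)


def simulate(true_rates: dict, sends: int, seed: int = 42) -> dict:
    """Count how many sends each variant receives over the simulation."""
    rng = random.Random(seed)
    stats = {name: (0, 0) for name in true_rates}
    sent = {name: 0 for name in true_rates}
    for _ in range(sends):
        name = thompson_pick(stats, rng)
        wins, losses = stats[name]
        if rng.random() < true_rates[name]:  # Simulated click
            stats[name] = (wins + 1, losses)
        else:
            stats[name] = (wins, losses + 1)
        sent[name] += 1
    return sent


# Hypothetical true click rates: A = 5%, B = 15%
sent = simulate({"A": 0.05, "B": 0.15}, sends=5000)
print(sent)
```

As evidence accumulates, the weaker variant receives fewer and fewer sends, which is also why its final rate estimate ends up less precise than in a fixed 50/50 test.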
A holdout group is a randomly selected subset of your audience that does NOT receive the email being measured (or, for a program-level holdout, receives no email at all). Comparing the holdout with everyone else measures the true incremental lift of your email program.
A/B tests tell you which variant is better. Holdouts tell you whether sending email at all is better than not sending.
Without holdouts, you can't distinguish between:
lift = (treatment_conversion_rate - holdout_conversion_rate) / holdout_conversion_rate
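As a quick worked example of the lift formula, the sketch below uses hypothetical counts; the function name is illustrative.

```python
def incremental_lift(treatment_conv: int, treatment_total: int,
                     holdout_conv: int, holdout_total: int) -> float:
    """Relative lift of the emailed group over the holdout group."""
    treatment_rate = treatment_conv / treatment_total
    holdout_rate = holdout_conv / holdout_total
    if holdout_rate == 0:
        raise ValueError("holdout had no conversions; lift is undefined")
    return (treatment_rate - holdout_rate) / holdout_rate


# Hypothetical: 585 of 9,000 emailed contacts convert (6.5%),
# 50 of 1,000 holdout contacts convert (5.0%)
lift = incremental_lift(585, 9000, 50, 1000)
print(f"incremental lift: {lift:.0%}")
```

The holdout rate is the baseline here: it captures conversions that would have happened without any email.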
| Audience size | Holdout % | Holdout size | Expected baseline conversion | Can detect lift of |
|--------------|-----------|-------------|------------------------------|-------------------|
| 10,000 | 10% | 1,000 | 5% | ~50% relative |
| 50,000 | 10% | 5,000 | 5% | ~25% relative |
| 100,000 | 5% | 5,000 | 5% | ~25% relative |
| 500,000 | 5% | 25,000 | 5% | ~10% relative |
Larger audiences can use smaller holdout percentages (5%) because the absolute holdout size is still large enough.
Warning: Holdout results often show lower incrementality than you expect. An email program showing 200% ROI based on last-click attribution might show 30% incremental lift in a holdout test. That's normal - it means your email is capturing credit for conversions that would have happened anyway, plus generating real incremental value.
Choose your primary metric BEFORE running the test. Optimizing for multiple metrics simultaneously leads to cherry-picking results.
| Metric | When to optimize for it | Gotchas |
|--------|------------------------|---------|
| Open rate | Subject line tests, from name tests, send time tests | Apple Mail Privacy Protection inflates opens by 30-60%. Unreliable as sole metric for Apple-heavy audiences. |
| Click rate | CTA tests, content tests, layout tests | More reliable than opens. Measures actual engagement. |
| Click-to-open rate (CTOR) | Content effectiveness independent of subject line | Combines the Apple MPP noise from opens with click data. Less useful than it was pre-2021. |
| Conversion rate | When you have clear downstream actions (signup, purchase) | Requires conversion tracking beyond the email. Longer attribution windows. |
| Revenue per email | E-commerce, when you can tie revenue to individual sends | Best metric for bottom-line impact but needs robust attribution. |
| Reply rate | Sales emails, cold outreach | Only relevant for emails that expect replies. |
| Unsubscribe rate | Safety metric - always monitor alongside your primary metric | A variant can win on clicks but lose subscribers. Check both. |
Since iOS 15 (September 2021), Apple Mail pre-fetches images and tracking pixels for users who enable Mail Privacy Protection, generating false "opens" whether or not the email is ever read. This affects roughly 50-60% of consumer email audiences.
Impact on A/B testing:
One-off tests are useful. A systematic testing program compounds learning.
Run tests in this order for maximum learning:
After each test, record:
Without documentation, you'll re-run the same tests or, worse, make changes that contradict what you've already learned.
The single most common mistake. After 200 sends, variant B has a 25% open rate vs variant A's 20%. "B wins!" No - with 200 sends, that 5-point difference is well within the margin of error. You need thousands of observations for open rate tests.
Fix: Calculate your required sample size before starting. Don't look at results until you've reached it.
If your list is under 1,000 contacts, most A/B tests are statistically meaningless. You won't have enough data to distinguish a real effect from noise.
Fix: For small lists, skip formal A/B tests. Instead, make bigger, bolder changes between campaigns and observe trends over time. Or batch multiple campaigns together to accumulate sample size.
Changing the subject line, CTA, images, and send time simultaneously. When variant B wins, you don't know which change caused it.
Fix: One variable per test. Always.
Variant A loses. You archive it. But variant A might have outperformed on a secondary metric (lower unsubscribes, higher reply rate) or performed better in a specific segment.
Fix: Analyze test results by segment (mobile vs desktop, new subscribers vs long-term, engagement level). A "loser" overall might be a winner for a subset.
If 50% of your audience uses Apple Mail, your open rate data includes a large number of phantom opens. This dilutes real differences and makes tests harder to call.
Fix: Filter Apple Mail opens from your analysis if your ESP supports it, or use click rate as your primary metric.
Most ESP "auto-winner" features send to a test subset (10-20%), wait a fixed time (often just 2-4 hours), and send the "winner" to the rest. Two hours is nowhere near enough time for reliable results.
Fix: If you use auto-winner, set the wait time to at least 24 hours. Better yet, set it to 48 hours. If your ESP doesn't allow a long enough wait, run the test manually.
Testing "Sale ends today!" vs "Last chance - 24 hours left" is not a reusable learning. It's a one-off optimization.
Fix: Test frameworks and patterns, not specific copy. Test "urgency vs curiosity" as a subject line approach, then apply the winner to future campaigns with different specific copy.
You're optimizing variant A vs B, but never asking "should we be sending this email at all?"
Fix: Run a holdout test on your main email programs at least once per quarter.
If you send variant A to 1,000 people and variant B to 10,000 people, the test is not valid even if you set it up as 50/50. Technical issues (send failures, bounce spikes, ESP throttling) can create uneven distribution.
Fix: Always verify actual send counts per variant before analyzing results. If the split is more than 5% off from your target, investigate before drawing conclusions.
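A simple check along these lines can run before any analysis. The sketch below reads "more than 5% off" as a relative deviation from each variant's expected count, which is an assumption about the rule's meaning; the function name and counts are hypothetical.

```python
def split_deviations(actual_counts: dict, target_shares: dict,
                     tolerance: float = 0.05) -> dict:
    """Return variants whose send count deviates from the target share
    by more than `tolerance` (relative), mapped to that deviation."""
    total = sum(actual_counts.values())
    flagged = {}
    for name, share in target_shares.items():
        expected = total * share
        deviation = (actual_counts.get(name, 0) - expected) / expected
        if abs(deviation) > tolerance:
            flagged[name] = deviation
    return flagged


target = {"A": 0.5, "B": 0.5}
ok = split_deviations({"A": 4950, "B": 5050}, target)    # Within tolerance
bad = split_deviations({"A": 1000, "B": 10000}, target)  # Badly skewed
print(ok, bad)
```

An empty result means the split is close enough to the plan; anything flagged is a reason to investigate delivery before looking at the metrics.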
Variant B didn't win on open rate, but it won on click-to-open rate! Let's call that the winner. This is cherry-picking and dramatically inflates false positives.
Fix: Declare your primary metric before the test starts. Secondary metrics are informational, not decision-making.
Most email service providers (ESPs) have built-in A/B testing. When evaluating tools, look for:
molted.email implements deterministic hash-based variant assignment with weighted buckets, holdout group support, and two-proportion z-test significance testing with 95% confidence intervals. Experiments are tied to journey steps, so variant assignment persists across a sequence rather than randomizing per-send.