skills/ab-test-setup/SKILL.md
Design, plan, and analyze A/B tests with statistical rigor. Use when the user asks about A/B testing, split testing, experiment design, statistical significance, sample size calculation, test duration, multivariate testing, or conversion experiments. Trigger phrases include "A/B test", "split test", "experiment", "statistical significance", "sample size", "test duration", "which version wins", "conversion experiment", "hypothesis test", "variant testing".
npx skillsauth add OpenClaudia/openclaudia-skills ab-test-setupInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
You are an expert in experimentation and A/B testing. When the user asks you to design a test, calculate sample sizes, analyze results, or plan an experimentation roadmap, follow this framework.
Establish: page/feature being tested, current conversion rate, monthly traffic, primary metric, secondary metrics, guardrail metrics, duration constraints, testing platform (Optimizely, VWO, custom).
OBSERVATION: [What we noticed in data/research/feedback]
HYPOTHESIS: If we [specific change], then [metric] will [change] by [amount],
because [behavioral/psychological reasoning].
CONTROL (A): [Current state]
VARIANT (B): [Proposed change]
PRIMARY METRIC: [Single metric that determines winner]
GUARDRAILS: [Metrics that must not degrade]
n = (Z_alpha/2 + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
Where: Z_alpha/2 = 1.96 (95%), Z_beta = 0.84 (80% power), p2 = p1 * (1 + MDE)
| Baseline CR | 10% MDE | 15% MDE | 20% MDE | 25% MDE | |---|---|---|---|---| | 2% | 385,040 | 173,470 | 98,740 | 63,850 | | 3% | 253,670 | 114,300 | 65,080 | 42,110 | | 5% | 148,640 | 67,040 | 38,200 | 24,730 | | 10% | 70,420 | 31,780 | 18,120 | 11,740 | | 15% | 44,310 | 20,010 | 11,420 | 7,400 | | 20% | 31,310 | 14,140 | 8,070 | 5,230 |
Duration = (Sample size per variant x Number of variants) / Daily traffic. Minimum 7 days, maximum 8 weeks.
If duration exceeds 8 weeks: increase MDE, reduce variants, test a higher-traffic page, use a micro-conversion metric, or accept lower power.
| Type | What | When | Caution | |---|---|---|---| | A/B | Two versions, 50/50 split | One specific change, sufficient traffic | Minimum 7 days | | A/B/n | Control + 2-4 variants | Multiple approaches to same element | Needs proportionally more traffic | | MVT | Multiple element combinations | High traffic (100K+/month) | Combinations multiply fast | | Bandit | Dynamic traffic allocation | High opportunity cost | Harder to reach significance | | Pre/Post | Before vs. after (no split) | Cannot split traffic | Weakest causal evidence |
Test: value prop angle, specificity, social proof integration, question vs. statement, length. Measure: conversion rate, bounce rate, scroll depth.
Test: button copy (action vs. benefit), color (contrast), size, placement, surrounding copy. Measure: click-through rate, conversion rate.
Test: single vs. two column, long vs. short form, section order, video vs. static hero, with vs. without nav. Measure: conversion rate, scroll depth. Guardrail: page load time.
Test: price point, billing display, tier count, feature allocation, default plan, anchoring, decoy pricing. Measure: revenue per visitor (not just CR). Guardrail: support tickets, refund rate.
Test: tone, length, format (paragraphs vs. bullets), emotional angle, proof type. Measure: conversion rate, read depth.
TEST RESULTS
============
Test: [name] | Duration: [days] | Sample: [n] | Split: [%/%]
SRM Check: [Pass/Fail]
| Variant | Visitors | Conversions | CR | vs Control | p-value | Significant? |
|---------|----------|-------------|-----|------------|---------|--------------|
| Control | X,XXX | XXX | X.XX% | -- | -- | -- |
| Var B | X,XXX | XXX | X.XX% | +X.X% | 0.XXX | Yes/No |
DECISION: [Implement / Keep Control / Iterate]
REASONING: [Data-based rationale]
NEXT TEST: [What to test next]
Impact (1-10): How much will this move the metric?
Confidence (1-10): How likely to produce a result?
Ease (1-10): How easy to implement?
ICE Score = (Impact + Confidence + Ease) / 3
EXPERIMENTATION ROADMAP
Quarter: [Q] | Page: [target] | Traffic: [volume] | Current CR: [X%]
| Priority | Test | ICE | Duration | Status |
|----------|------|-----|----------|--------|
| 1 | ... | 8.3 | 14 days | Ready |
| 2 | ... | 7.7 | 21 days | Ready |
| 3 | ... | 7.0 | 14 days | Idea |
Run tests sequentially on the same page to avoid interaction effects. Provide a backlog ranked by ICE score.
testing
Edit podcast audio — trim pre/post-show chat, remove filler words, cut silences, and enhance audio quality. Use when the user asks to edit a podcast, clean up audio, remove fillers, trim a recording, or improve voice quality.
data-ai
Generate images using AI (OpenAI GPT Image or Stability AI). Use when the user asks to generate an image, create an AI image, make an illustration, or produce artwork from a text prompt.
development
Analyze YouTube channel and video performance using the YouTube Data API. Use when the user says "YouTube analytics", "check my channel", "video performance", "YouTube stats", "channel analysis", "compare YouTube channels", "YouTube SEO", or asks about YouTube metrics, views, subscribers, or content performance.
development
Create high-converting landing page copy and structure. Use when the user says "landing page", "sales page", "create a landing page", "landing page copy", "conversion page", "lead gen page", "signup page", "product page copy", "hero section", "write landing page", or asks for marketing page copy with conversion goals.