skills/ab-test-setup/SKILL.md
When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," or "hypothesis." For tracking implementation, see analytics-tracking. For CRO strategy and test ideas, see page-cro. For statistical methods beyond experimentation, see statistical-analysis.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf): `npx skillsauth add sharkitect-solutions/sharkitect-claude-toolkit ab-test-setup`
| File | What It Contains | Load When |
|---|---|---|
| SKILL.md | Test design decisions, hypothesis framework, analysis methodology, anti-patterns | Always loaded (you are here) |
| statistical-pitfalls.md | SRM detection, multiple comparisons, Bayesian vs frequentist, sequential testing, when calculators lie | User asks about statistics, significance, sample size issues, or test results seem wrong |
| platform-implementation.md | Tool comparison (PostHog/Optimizely/VWO/LaunchDarkly/Statsig/Eppo), client-side vs server-side gotchas, CDN interference, feature flag migration | User asks about implementation, tool selection, or debugging test setup |
| test-velocity-management.md | Prioritization frameworks (ICE/PIE gotchas), test backlog management, velocity benchmarks, when NOT to test, rollout procedures | User asks about test prioritization, program management, or scaling experimentation |
Do NOT load companion files unless the user's question specifically requires that depth. Most test design questions are answerable from this file alone.
| Topic | This Skill | Not This Skill (Use Instead) |
|---|---|---|
| Designing experiments | YES | |
| Hypothesis formulation | YES | |
| Sample size decisions | YES | |
| Test analysis methodology | YES | |
| Statistical significance | YES | statistical-analysis (for general stats) |
| Bayesian A/B testing | YES | statistical-analysis (for Bayesian theory) |
| CRO test ideas | | page-cro, signup-flow-cro, popup-cro |
| Event tracking setup | | analytics-tracking |
| Multivariate copy testing | | copywriting |
| Feature flag management (non-experiment) | | Use engineering judgment |
| Pricing experiments | Hypothesis + measurement | pricing-strategy (for pricing decisions) |
Pick the first match:
| Signal | Test Type | Why |
|---|---|---|
| Major page redesign or new flow | Split URL test | Different URLs avoid client-side modification complexity. Cleaner implementation for structural changes |
| Multiple simultaneous element changes + >50K weekly visitors | Multivariate (MVT) | Tests interaction effects between changes. Requires 4-10x traffic of simple A/B. Most teams overestimate their traffic for MVT |
| Testing a new feature rollout | Feature flag with measurement | Not every rollout needs full A/B infrastructure. Binary on/off with pre/post comparison is sufficient when effect size is large (>30%) |
| Comparing 3+ creative approaches | A/B/n | Multiple variants, single change dimension. Requires Bonferroni or Holm correction for multiple comparisons (most tools don't do this automatically) |
| Single change hypothesis | A/B (50/50 split) | Default. Simple, fast to reach significance, easiest to analyze correctly |
The MVT trap: Teams run MVT because it feels more "scientific." Reality: a 2x2 MVT with 4 cells needs ~4x the traffic of a simple A/B per cell. A site with 10K weekly visitors running a 2x2 MVT needs 16+ weeks vs 4 weeks for simple A/B. Run sequential A/Bs instead.
| Level | Example | Problem |
|---|---|---|
| 1 (Weak) | "Let's test a new button color" | No reasoning, no prediction, no measurement criteria |
| 2 (Directional) | "A green button will increase clicks" | Has prediction but no reasoning or quantification |
| 3 (Reasoned) | "Because heatmaps show users miss the CTA, a higher-contrast button will increase clicks" | Has reasoning + prediction but no quantification |
| 4 (Quantified) | "Because heatmaps show 60% of users never scroll to the CTA, moving it above the fold will increase click-through rate by 15%+ for new visitors" | Has observation + specific change + quantified prediction + audience + metric |
| 5 (Falsifiable) | Level 4 + "We'll measure over 14 days with 5K visitors per variant. If CTR increase is <10%, we'll reject the hypothesis" | Adds pre-committed sample size, duration, and rejection criteria |
Always aim for Level 4+. Level 1-2 hypotheses produce tests that can't fail -- every result gets interpreted as "interesting," which means you learn nothing.
OBSERVE: [specific data point or user behavior]
BELIEVE: [specific change] will cause [metric] to [increase/decrease] by [X%]
FOR: [audience segment]
MEASURE: [primary metric] over [duration] with [sample size] per variant
REJECT IF: [metric change] < [minimum threshold]
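One way to keep hypotheses at Level 5 is to store them as structured records, so that a missing field shows up as an error. The sketch below is illustrative only: the `Hypothesis` class and its field names are hypothetical and not part of this skill, and the example values are taken from the Level 4 and Level 5 rows of the hypothesis table.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Hypothetical container mirroring the OBSERVE/BELIEVE/FOR/MEASURE/REJECT IF template."""
    observation: str
    change: str
    metric: str
    direction: str            # "increase" or "decrease"
    expected_change_pct: float
    audience: str
    duration_days: int
    sample_per_variant: int
    reject_below_pct: float

    def render(self) -> str:
        # Produce the five template lines in the same order as the template.
        return "\n".join([
            f"OBSERVE: {self.observation}",
            f"BELIEVE: {self.change} will cause {self.metric} to "
            f"{self.direction} by {self.expected_change_pct:g}%",
            f"FOR: {self.audience}",
            f"MEASURE: {self.metric} over {self.duration_days} days with "
            f"{self.sample_per_variant} visitors per variant",
            f"REJECT IF: {self.metric} change < {self.reject_below_pct:g}%",
        ])


example = Hypothesis(
    observation="heatmaps show 60% of users never scroll to the CTA",
    change="moving the CTA above the fold",
    metric="click-through rate",
    direction="increase",
    expected_change_pct=15,
    audience="new visitors",
    duration_days=14,
    sample_per_variant=5000,
    reject_below_pct=10,
)
print(example.render())
```

Because every field is required, constructing a `Hypothesis` without a rejection threshold or sample size raises a `TypeError`, which makes a Level 1-3 hypothesis hard to file by accident.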
The standard formula (Evan Miller calculator, etc.) assumes conditions that rarely hold:
| Assumption | Reality | Impact |
|---|---|---|
| Fixed sample size, single analysis | Most teams peek at results | Inflates false positive rate from 5% to 20-30%. Use sequential testing or commit to NO peeking |
| Single primary metric | Teams track 5-10 metrics | Multiple comparisons: 5 metrics at 95% confidence = 23% chance of at least one false positive. Apply Bonferroni correction or pre-commit to ONE primary metric |
| Equal variance across segments | Mobile/desktop, new/returning have different baselines | Segment-level effects can be real but opposite in direction (Simpson's paradox). Pre-stratify or analyze segments independently |
| Stable baseline during test | Seasonality, promotions, PR events shift baselines | Run tests for full business cycles (minimum 1 week, ideally 2). Don't start tests on Black Friday |
| No interference between variants | Users may switch devices, share URLs, use multiple accounts | Cookie-based assignment leaks. Use user-ID assignment when possible. Accept ~5% contamination as unavoidable |
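The multiple-comparisons figure above follows from 1 - (1 - alpha)^k for k independent metrics. The sketch below reproduces that arithmetic and shows the Bonferroni and Holm corrections. It uses only the standard library, assumes the metrics are independent (real metrics are usually correlated, so treat the result as an approximation), and the function names and example p-values are illustrative.

```python
def family_wise_error_rate(n_metrics: int, alpha: float = 0.05) -> float:
    """Chance of at least one false positive across independent metrics."""
    return 1 - (1 - alpha) ** n_metrics


def bonferroni_alpha(n_metrics: int, alpha: float = 0.05) -> float:
    """Per-metric significance threshold under Bonferroni correction."""
    return alpha / n_metrics


def holm_reject(p_values: list[float], alpha: float = 0.05) -> list[bool]:
    """Holm step-down procedure: returns a reject flag per input p-value."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The smallest p-value is compared to alpha/m, the next to alpha/(m-1), ...
        if p_values[idx] <= alpha / (m - rank):
            reject[idx] = True
        else:
            break  # once one comparison fails, all larger p-values fail too
    return reject


print(round(family_wise_error_rate(5), 2))   # 0.23
print(holm_reject([0.004, 0.03, 0.02, 0.20, 0.011]))
```

In the example p-values, Holm rejects the 0.004 and 0.011 results, while Bonferroni (threshold 0.01) would reject only the 0.004 result. Holm is never less powerful than Bonferroni at the same family-wise error rate.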
Sample ratio mismatch (SRM) is the most under-checked test validity issue. If your 50/50 split shows a 51.2%/48.8% actual distribution, your test may be broken.
| Split Deviation | With 10K Users | Assessment |
|---|---|---|
| <0.5% | 50.2/49.8 | Normal random variation |
| 0.5-1.5% | 50.5/49.5 to 51.5/48.5 | Check SRM (chi-squared test, p<0.01 = problem) |
| >1.5% | 51.5/48.5+ | Almost certainly SRM. DO NOT trust results. Debug assignment logic |
Common SRM causes: bot traffic hitting one variant more, redirects dropping users, ad blockers blocking test JavaScript, CDN caching serving stale variant, broken assignment logic.
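The chi-squared SRM check mentioned in the table can be sketched with the standard library alone. For a two-arm test there is one degree of freedom, so the p-value can be computed with `math.erfc` instead of a stats package. The function name `srm_check` and the example counts are illustrative, and the p<0.01 threshold is the one from the table.

```python
import math


def srm_check(control_n: int, variant_n: int,
              expected_control_share: float = 0.5,
              threshold: float = 0.01) -> tuple[float, float, bool]:
    """Chi-squared goodness-of-fit test for a two-arm split.

    Returns (chi2 statistic, p-value, srm_flag).
    """
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (variant_n - expected_variant) ** 2 / expected_variant)
    # Survival function of chi-squared with 1 degree of freedom.
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return chi2, p_value, p_value < threshold


# 10K users split 51.5/48.5, the upper edge of the "check SRM" band.
chi2, p_value, flagged = srm_check(5150, 4850)
print(f"chi2={chi2:.2f} p={p_value:.4f} srm={flagged}")
```

The same percentage deviation becomes far more significant as the sample grows, so run the test on actual counts instead of judging by the split percentage alone.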
| Scenario | Action | Common Mistake |
|---|---|---|
| Reached sample size, p<0.05, meaningful effect | Ship the winner | Waiting for "more data" after reaching pre-committed sample size. You committed to a methodology -- honor it |
| Reached sample size, p<0.05, tiny effect (1-2%) | Consider implementation cost | A 1.5-percentage-point conversion lift on 10K monthly visitors = 150 more conversions per month. Worth it if implementation is a CSS change. Not worth it if it's a 2-week refactor |
| Reached sample size, p>0.05 | Keep control, document learning | Calling it "inconclusive" and re-running. If you reached your pre-committed sample size and didn't detect an effect, the effect is likely smaller than your MDE. That IS a result |
| Haven't reached sample size, variant looks like it's winning | Keep running | Peeking + stopping = false positive inflation. The apparent winner at 40% of sample size reverts ~30% of the time |
| Haven't reached sample size, variant looks harmful | Check guardrail metrics | If guardrails are significantly negative (p<0.01), stop the test. This is the ONE exception to "don't stop early" |
| Conflicting segment results | Report the overall result, note segments | Cherry-picking the segment where the variant won. Post-hoc segments are hypotheses for the NEXT test, not conclusions |
| When You Peek and Stop | Actual False Positive Rate (vs stated 5%) |
|---|---|
| After 25% of sample | ~26% |
| After 50% of sample | ~16% |
| After 75% of sample | ~10% |
| At pre-committed sample size only | ~5% (as intended) |
If you MUST peek: Use sequential testing (SPRT, always-valid p-values, or mSPRT). These methods control error rates across multiple looks but require 20-30% more sample size.
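A small Monte Carlo sketch can show why peeking inflates false positives. It simulates A/A tests (no true effect), checks a z-statistic at several evenly spaced looks, and compares "stop at the first significant look" with "analyze once at the end". The model is a simplification: it uses a normal approximation with equally sized data blocks, and the number of looks and the seed are arbitrary choices, so the rates it prints are illustrative and are not meant to reproduce the table above.

```python
import math
import random


def simulate_false_positive_rates(n_sims: int = 5000, n_looks: int = 10,
                                  z_crit: float = 1.96, seed: int = 0
                                  ) -> tuple[float, float]:
    """Return (peeking_rate, fixed_horizon_rate) for A/A tests."""
    rng = random.Random(seed)
    peek_hits = 0
    fixed_hits = 0
    for _ in range(n_sims):
        cumulative = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # Each block of new data contributes a unit-variance increment
            # to the running difference between variant and control.
            cumulative += rng.gauss(0.0, 1.0)
            z = cumulative / math.sqrt(look)
            if abs(z) > z_crit:
                crossed_at_any_look = True
        peek_hits += crossed_at_any_look      # would have stopped and "won"
        fixed_hits += abs(z) > z_crit         # significant at the final look only
    return peek_hits / n_sims, fixed_hits / n_sims


peeking_rate, fixed_rate = simulate_false_positive_rates()
print(f"peek-and-stop: {peeking_rate:.1%}  fixed horizon: {fixed_rate:.1%}")
```

The fixed-horizon rate stays near the nominal 5%, while the peek-and-stop rate is several times higher, because every extra look is another chance for noise to cross the threshold.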
| Situation | Why Skip A/B Testing | Do Instead |
|---|---|---|
| <1K weekly visitors to test page | Will take months to reach significance for any reasonable MDE | Make the change, compare pre/post with caution. Or test on a higher-traffic page first |
| Fixing an obvious bug | You don't A/B test whether to fix broken things | Fix it. Monitor for regression |
| Legal/compliance requirement | No choice in implementation | Implement. Measure impact but don't delay for a test |
| Already 95%+ confident in direction | The "test everything" dogma wastes resources | Ship it. Save testing capacity for genuine uncertainty |
| Effect is binary (works/doesn't) | No spectrum of outcomes to measure | Feature flag with monitoring, not A/B test |
What happens: Check results daily, stop test when variant is ahead, declare victory.
Why it fails: At 50% of target sample size, apparent winners revert 30% of the time. You're making decisions on noise.
The rationalization: "We could see the trend clearly"

What happens: Change 5 things at once in the variant. "New headline, new image, new CTA, new layout, new social proof."
Why it fails: When it wins (or loses), you don't know which change caused it. Zero learning, pure gambling.
The rationalization: "We wanted a bold test"

What happens: Test doesn't hit significance on primary metric. Analyst searches secondary metrics until finding one that's significant.
Why it fails: With 20 metrics at a 5% false positive rate, you'll find ~1 significant result by chance alone. This IS the multiple comparisons problem.
The rationalization: "We found an interesting insight"

What happens: Test runs for 3+ months because "we want more data." Traffic allocation never freed up.
Why it fails: External factors accumulate over long tests (seasonality, product changes, audience shifts). Results become meaningless.
The rationalization: "We want to be really sure"

What happens: Overall result is flat. Team slices by every available dimension until finding a segment where the variant wins. Ships variant for that segment.
Why it fails: Post-hoc segments are hypotheses, not conclusions. With 10 segments at a 5% false positive rate, there is a ~40% chance that at least one looks like a "winner" by chance alone.
The rationalization: "It works for mobile users from California on Tuesdays"

What happens: Test designed to detect 20% lift with 80% power. Actual effect is 3%. Test runs to completion, variant shows +3% (p=0.4). Team calls it "directionally positive" and ships.
Why it fails: A non-significant result is NOT evidence of a small positive effect. It's evidence that the effect (if any) is smaller than your MDE.
The rationalization: "The data suggests a trend"
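To see how far apart a 20% MDE and a 3% true effect are, the sketch below applies the usual two-proportion sample size approximation at 95% confidence and 80% power. The 5% baseline conversion rate is an assumption chosen for illustration, not a figure from this skill, and `sample_size_per_variant` is an illustrative name. Dedicated calculators may differ slightly because they use other variance approximations.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Approximate visitors needed per variant for a two-sided test."""
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_lift)
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)   # ~1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)            # ~0.84 for 80% power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    return math.ceil((z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2)


baseline = 0.05  # assumed baseline conversion rate
n_for_20_pct = sample_size_per_variant(baseline, 0.20)
n_for_3_pct = sample_size_per_variant(baseline, 0.03)
print(n_for_20_pct, n_for_3_pct, round(n_for_3_pct / n_for_20_pct))
```

Under these assumptions, detecting a 3% relative lift needs roughly 40 times the sample of a test sized for a 20% lift, which is why a test sized for 20% cannot say anything reliable about a 3% effect.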