Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

OpenClaudia/ab-test-setup

Name: ab-test-setup
Author: OpenClaudia

skills/ab-test-setup/SKILL.md

npx skillsauth add OpenClaudia/openclaudia-skills ab-test-setup

A/B Test Design and Analysis

You are an expert in experimentation and A/B testing. When the user asks you to design a test, calculate sample sizes, analyze results, or plan an experimentation roadmap, follow this framework.

Step 1: Gather Test Context

Establish: page/feature being tested, current conversion rate, monthly traffic, primary metric, secondary metrics, guardrail metrics, duration constraints, testing platform (Optimizely, VWO, custom).

Step 2: Hypothesis Framework

Hypothesis Template

OBSERVATION: [What we noticed in data/research/feedback]
HYPOTHESIS: If we [specific change], then [metric] will [change] by [amount],
            because [behavioral/psychological reasoning].
CONTROL (A): [Current state]
VARIANT (B): [Proposed change]
PRIMARY METRIC: [Single metric that determines winner]
GUARDRAILS: [Metrics that must not degrade]

Hypothesis Categories

Clarity: "Users don't understand what we offer" -- test headline, value prop
Motivation: "Users aren't motivated to act" -- test social proof, urgency, benefits
Friction: "Process is too difficult" -- test form length, step count, layout
Trust: "Users don't trust us" -- test testimonials, guarantees, badges
Relevance: "Content doesn't match intent" -- test personalization, segmentation

Step 3: Sample Size and Duration

Sample Size Formula

n = (Z_alpha/2 + Z_beta)^2 * (p1*(1-p1) + p2*(1-p2)) / (p2 - p1)^2
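Where: n = sample size per variant, Z_alpha/2 = 1.96 (95% confidence, two-sided), Z_beta = 0.84 (80% power), p1 = baseline conversion rate, p2 = p1 * (1 + MDE), with MDE expressed as a relative lift (e.g. 0.10 for 10%)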
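The formula above can be sketched in a few lines of Python. This is a minimal illustration, not part of the original skill: the function name `sample_size_per_variant` is hypothetical, and the MDE is assumed to be a relative lift, as in p2 = p1 * (1 + MDE).

```python
import math

Z_ALPHA_2 = 1.96  # two-sided 95% confidence, as stated in the formula
Z_BETA = 0.84     # 80% power, as stated in the formula


def sample_size_per_variant(baseline_cr: float, relative_mde: float) -> int:
    """Visitors needed per variant (hypothetical helper name).

    baseline_cr  -- current conversion rate p1, e.g. 0.05 for 5%
    relative_mde -- relative minimum detectable effect, e.g. 0.10 for 10%
    """
    p1 = baseline_cr
    p2 = p1 * (1 + relative_mde)
    variance_sum = p1 * (1 - p1) + p2 * (1 - p2)
    n = (Z_ALPHA_2 + Z_BETA) ** 2 * variance_sum / (p2 - p1) ** 2
    return math.ceil(n)


if __name__ == "__main__":
    # 5% baseline, 10% relative MDE
    print(sample_size_per_variant(0.05, 0.10))
```

For a 5% baseline and a 10% relative MDE, the formula as written gives roughly 31,200 visitors per variant. That is noticeably lower than the corresponding Quick Reference entry, so the table may rest on assumptions not stated here; recompute for your own inputs before committing to a duration.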

Quick Reference (per variant, 95% significance, 80% power)

| Baseline CR | 10% MDE | 15% MDE | 20% MDE | 25% MDE |
|---|---|---|---|---|
| 2% | 385,040 | 173,470 | 98,740 | 63,850 |
| 3% | 253,670 | 114,300 | 65,080 | 42,110 |
| 5% | 148,640 | 67,040 | 38,200 | 24,730 |
| 10% | 70,420 | 31,780 | 18,120 | 11,740 |
| 15% | 44,310 | 20,010 | 11,420 | 7,400 |
| 20% | 31,310 | 14,140 | 8,070 | 5,230 |

Duration = (Sample size per variant x Number of variants) / Daily traffic. Minimum 7 days, maximum 8 weeks.

If duration exceeds 8 weeks: increase MDE, reduce variants, test a higher-traffic page, use a micro-conversion metric, or accept lower power.
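The duration rule can be expressed as a small helper. This is a sketch under the stated limits (7-day minimum, 8-week maximum); the function and constant names are hypothetical.

```python
import math

MIN_DAYS = 7   # minimum run length from the rule above
MAX_DAYS = 56  # 8 weeks


def planned_duration_days(n_per_variant: int, num_variants: int, daily_traffic: int) -> int:
    """Days needed to collect the full sample, never shorter than MIN_DAYS."""
    if daily_traffic <= 0:
        raise ValueError("daily_traffic must be positive")
    raw_days = math.ceil(n_per_variant * num_variants / daily_traffic)
    return max(raw_days, MIN_DAYS)


def exceeds_max_duration(days: int) -> bool:
    """True when the plan runs past 8 weeks and the test should be redesigned."""
    return days > MAX_DAYS


if __name__ == "__main__":
    days = planned_duration_days(n_per_variant=30_000, num_variants=2, daily_traffic=3_000)
    print(days, exceeds_max_duration(days))  # 20 False
```

When `exceeds_max_duration` returns True, apply one of the remedies listed above (larger MDE, fewer variants, a higher-traffic page, a micro-conversion metric, or lower power) and recompute.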

Step 4: Test Types

| Type | What | When | Caution |
|---|---|---|---|
| A/B | Two versions, 50/50 split | One specific change, sufficient traffic | Minimum 7 days |
| A/B/n | Control + 2-4 variants | Multiple approaches to same element | Needs proportionally more traffic |
| MVT | Multiple element combinations | High traffic (100K+/month) | Combinations multiply fast |
| Bandit | Dynamic traffic allocation | High opportunity cost | Harder to reach significance |
| Pre/Post | Before vs. after (no split) | Cannot split traffic | Weakest causal evidence |

Step 5: Test Design by Element

Headline Tests

Test: value prop angle, specificity, social proof integration, question vs. statement, length. Measure: conversion rate, bounce rate, scroll depth.

CTA Tests

Test: button copy (action vs. benefit), color (contrast), size, placement, surrounding copy. Measure: click-through rate, conversion rate.

Layout Tests

Test: single vs. two column, long vs. short form, section order, video vs. static hero, with vs. without nav. Measure: conversion rate, scroll depth. Guardrail: page load time.

Pricing Tests

Test: price point, billing display, tier count, feature allocation, default plan, anchoring, decoy pricing. Measure: revenue per visitor (not just CR). Guardrail: support tickets, refund rate.

Copy Tests

Test: tone, length, format (paragraphs vs. bullets), emotional angle, proof type. Measure: conversion rate, read depth.

Step 6: Running the Test

Pre-Launch Checklist

[ ] Hypothesis documented with primary metric defined
[ ] Sample size calculated, traffic sufficient
[ ] QA on both variants across devices and browsers
[ ] Tracking verified -- conversions fire correctly for both variants
[ ] No other tests on same page/funnel
[ ] Traffic allocation set (50/50)
[ ] Exclusion criteria defined (bots, internal IPs)
[ ] Stakeholders aligned on decision criteria before launch

During the Test

Do not peek for the first 3-5 days (early results are misleading)
Do not stop early unless guardrail metrics are violated
Monitor for technical issues and tracking accuracy
Watch for sample ratio mismatch (SRM): a deviation of more than 1% from the planned split points to a setup problem
Do not add variants mid-test
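The SRM check in the list above can be automated. The sketch below reports the deviation from the planned split, matching the 1% rule of thumb, and also runs a chi-square goodness-of-fit test. The chi-square test and its 0.001 threshold are assumptions added for illustration, not part of the original skill; the function name is hypothetical.

```python
import math


def srm_check(control_n: int, variant_n: int,
              expected_control_share: float = 0.5,
              alpha: float = 0.001) -> dict:
    """Compare observed traffic split with the planned split (two arms only)."""
    total = control_n + variant_n
    expected_control = total * expected_control_share
    expected_variant = total - expected_control

    # Absolute deviation of the control share from the planned share
    deviation = abs(control_n / total - expected_control_share)

    # Chi-square goodness-of-fit statistic with 1 degree of freedom
    chi2 = ((control_n - expected_control) ** 2 / expected_control
            + (variant_n - expected_variant) ** 2 / expected_variant)
    # Survival function of chi-square with 1 dof: erfc(sqrt(x / 2))
    p_value = math.erfc(math.sqrt(chi2 / 2))

    return {
        "deviation": deviation,
        "chi2": chi2,
        "p_value": p_value,
        "srm_suspected": deviation > 0.01 or p_value < alpha,
    }


if __name__ == "__main__":
    print(srm_check(5_300, 4_700))
```

On large samples the chi-square test can flag a mismatch that is smaller than 1%, so the two signals are combined here with a simple OR.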

Post-Test Analysis

TEST RESULTS
============
Test: [name] | Duration: [days] | Sample: [n] | Split: [%/%]
SRM Check: [Pass/Fail]

| Variant | Visitors | Conversions | CR | vs Control | p-value | Significant? |
|---------|----------|-------------|-----|------------|---------|--------------|
| Control | X,XXX | XXX | X.XX% | -- | -- | -- |
| Var B | X,XXX | XXX | X.XX% | +X.X% | 0.XXX | Yes/No |

DECISION: [Implement / Keep Control / Iterate]
REASONING: [Data-based rationale]
NEXT TEST: [What to test next]
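To fill in the CR, vs Control, and p-value columns of the results template, one common choice is a pooled two-proportion z-test. The original does not say which test to use, so treat this as an assumption; the function name and the 0.05 threshold are illustrative.

```python
import math


def two_proportion_z_test(control_visitors: int, control_conversions: int,
                          variant_visitors: int, variant_conversions: int,
                          alpha: float = 0.05) -> dict:
    """Two-sided pooled z-test for a difference in conversion rates."""
    cr_control = control_conversions / control_visitors
    cr_variant = variant_conversions / variant_visitors

    # Pooled conversion rate under the null hypothesis of no difference
    pooled = ((control_conversions + variant_conversions)
              / (control_visitors + variant_visitors))
    std_err = math.sqrt(pooled * (1 - pooled)
                        * (1 / control_visitors + 1 / variant_visitors))

    z = (cr_variant - cr_control) / std_err
    # Two-sided p-value from the standard normal distribution
    p_value = math.erfc(abs(z) / math.sqrt(2))

    return {
        "cr_control": cr_control,
        "cr_variant": cr_variant,
        "relative_lift": (cr_variant - cr_control) / cr_control,
        "z": z,
        "p_value": p_value,
        "significant": p_value < alpha,
    }


if __name__ == "__main__":
    print(two_proportion_z_test(10_000, 1_000, 10_000, 1_100))
```

A significant p-value only answers the statistical question; the decision line in the template should still weigh guardrail metrics and practical significance.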

Step 7: Common Pitfalls

Peeking: Checking results daily and stopping at the first significant reading can inflate the false-positive rate to 25-30%. Commit to the sample size upfront.
Underpowered tests: "No result" often means "not enough data."
Too many variables: Isolate one variable per test.
Ignoring segments: A flat overall result can hide a mobile win and a desktop loss. Always segment.
Novelty effect: Run 2+ weeks to account for novelty wearing off.
Multiple comparisons: Use one primary metric. Apply a Bonferroni correction to any additional metrics.
Practical significance: A significant 0.1% lift may not be worth implementing.
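The Bonferroni correction mentioned above divides the significance level by the number of metrics being compared. A minimal sketch, with hypothetical metric names and made-up p-values used purely for illustration:

```python
def bonferroni_significant(p_values: dict, alpha: float = 0.05) -> dict:
    """Flag each metric as significant at the Bonferroni-adjusted threshold."""
    if not p_values:
        return {}
    threshold = alpha / len(p_values)  # adjusted per-metric significance level
    return {metric: p < threshold for metric, p in p_values.items()}


if __name__ == "__main__":
    # Hypothetical secondary metrics and p-values
    secondary = {"bounce_rate": 0.01, "scroll_depth": 0.03, "time_on_page": 0.20}
    print(bonferroni_significant(secondary))
    # {'bounce_rate': True, 'scroll_depth': False, 'time_on_page': False}
```

With three metrics the threshold drops from 0.05 to about 0.0167, so a p-value of 0.03 that would pass on its own no longer counts as significant.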

Step 8: Test Prioritization (ICE Scoring)

Impact (1-10): How much will this move the metric?
Confidence (1-10): How likely to produce a result?
Ease (1-10): How easy to implement?
ICE Score = (Impact + Confidence + Ease) / 3
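The ICE formula can be applied to a backlog as follows. This is an illustrative sketch; the `BacklogItem` class and the test ideas are hypothetical.

```python
from dataclasses import dataclass


@dataclass
class BacklogItem:
    name: str
    impact: int      # 1-10
    confidence: int  # 1-10
    ease: int        # 1-10


def ice_score(impact: int, confidence: int, ease: int) -> float:
    """ICE Score = (Impact + Confidence + Ease) / 3."""
    for value in (impact, confidence, ease):
        if not 1 <= value <= 10:
            raise ValueError("each ICE component must be between 1 and 10")
    return (impact + confidence + ease) / 3


def rank_backlog(items: list) -> list:
    """Return (name, rounded score) pairs, highest ICE score first."""
    scored = [(item.name, round(ice_score(item.impact, item.confidence, item.ease), 1))
              for item in items]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)


if __name__ == "__main__":
    backlog = [
        BacklogItem("Shorter signup form", 7, 7, 7),
        BacklogItem("New hero headline", 9, 8, 8),
        BacklogItem("Add testimonials", 8, 7, 8),
    ]
    for name, score in rank_backlog(backlog):
        print(f"{score:>4} {name}")
```

Scores of 25/3, 23/3, and 21/3 round to 8.3, 7.7, and 7.0.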

Roadmap Template

EXPERIMENTATION ROADMAP
Quarter: [Q] | Page: [target] | Traffic: [volume] | Current CR: [X%]

| Priority | Test | ICE | Duration | Status |
|----------|------|-----|----------|--------|
| 1 | ... | 8.3 | 14 days | Ready |
| 2 | ... | 7.7 | 21 days | Ready |
| 3 | ... | 7.0 | 14 days | Idea |

Run tests sequentially on the same page to avoid interaction effects. Provide a backlog ranked by ICE score.

Design, plan, and analyze A/B tests with statistical rigor. Use when the user asks about A/B testing, split testing, experiment design, statistical significance, sample size calculation, test duration, multivariate testing, or conversion experiments. Trigger phrases include "A/B test", "split test", "experiment", "statistical significance", "sample size", "test duration", "which version wins", "conversion experiment", "hypothesis test", "variant testing".

400 stars

development

Updated Apr 25, 2026

npx skillsauth add OpenClaudia/openclaudia-skills ab-test-setup

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage (as reported) |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 25, 2026, 8:56 AM (115.3s, 1 file scanned)

Related Skills

OpenClaudia/podcast-edit

testing

Verified, Trusted, Community

Edit podcast audio: trim pre/post-show chat, remove filler words, cut silences, and enhance audio quality. Use when the user asks to edit a podcast, clean up audio, remove fillers, trim a recording, or improve voice quality.

431 | SKILL.md | Updated May 19, 2026

OpenClaudia/generate-image

data-ai

Verified, Trusted, Community

Generate images using AI (OpenAI GPT Image or Stability AI). Use when the user asks to generate an image, create an AI image, make an illustration, or produce artwork from a text prompt.

431 | SKILL.md | Updated Apr 25, 2026

OpenClaudia/youtube-analytics

development

Verified, Trusted, Community

Analyze YouTube channel and video performance using the YouTube Data API. Use when the user says "YouTube analytics", "check my channel", "video performance", "YouTube stats", "channel analysis", "compare YouTube channels", "YouTube SEO", or asks about YouTube metrics, views, subscribers, or content performance.

413 | SKILL.md | Updated May 5, 2026

OpenClaudia/write-landing

development

Verified, Trusted, Community

Create high-converting landing page copy and structure. Use when the user says "landing page", "sales page", "create a landing page", "landing page copy", "conversion page", "lead gen page", "signup page", "product page copy", "hero section", "write landing page", or asks for marketing page copy with conversion goals.

413 | SKILL.md | Updated May 5, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

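# Clone the repo
git clone https://github.com/OpenClaudia/openclaudia-skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r openclaudia-skills/skills/ab-test-setup ~/.claude/skills/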

Repository

OpenClaudia/openclaudia-skills

400 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT