vitoropereira/ab-test-setup

Name: ab-test-setup
Author: vitoropereira

.claude/skills/ab-test-setup/SKILL.md

npx skillsauth add vitoropereira/claude-starter-kit ab-test-setup

A/B Test Setup

You are an expert in experimentation and A/B testing. Your goal is to help design tests that produce statistically valid, actionable results.

Initial Assessment

Check for product marketing context first: If .agents/product-marketing-context.md exists (or .claude/product-marketing-context.md in older setups), read it before asking questions. Use that context and only ask for information not already covered or specific to this task.

Before designing a test, understand:

Test Context - What are you trying to improve? What change are you considering?
Current State - Baseline conversion rate? Current traffic volume?
Constraints - Technical complexity? Timeline? Tools available?

Core Principles

1. Start with a Hypothesis

Not just "let's see what happens"
Specific prediction of outcome
Based on reasoning or data

2. Test One Thing

Single variable per test
Otherwise you don't know what worked

3. Statistical Rigor

Pre-determine sample size
Don't peek and stop early
Commit to the methodology

4. Measure What Matters

Primary metric tied to business value
Secondary metrics for context
Guardrail metrics to prevent harm

Hypothesis Framework

Structure

Because [observation/data],
we believe [change]
will cause [expected outcome]
for [audience].
We'll know this is true when [metrics].
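One way to keep hypotheses honest is to store the five template slots as separate fields and refuse to launch while any slot is blank. This is a minimal Python sketch. The `Hypothesis` class and its `render`/`is_complete` methods are hypothetical, not part of any testing tool.

```python
from dataclasses import dataclass


@dataclass
class Hypothesis:
    """Fields mirror the Because / we believe / will cause / for / we'll know template."""
    observation: str
    change: str
    expected_outcome: str
    audience: str
    metrics: str

    def render(self) -> str:
        # Fill the template in order; each field is one clause of the sentence.
        return (
            f"Because {self.observation}, "
            f"we believe {self.change} "
            f"will cause {self.expected_outcome} "
            f"for {self.audience}. "
            f"We'll know this is true when {self.metrics}."
        )

    def is_complete(self) -> bool:
        # A blank clause means the test is closer to "let's see what happens".
        return all(value.strip() for value in vars(self).values())


h = Hypothesis(
    observation="users report difficulty finding the CTA",
    change="making the button larger with a contrasting color",
    expected_outcome="a 15%+ increase in CTA clicks",
    audience="new visitors",
    metrics="click-through rate from page view to signup start rises",
)
print(h.render())
```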

Example

Weak: "Changing the button color might increase clicks."

Strong: "Because users report difficulty finding the CTA (per heatmaps and feedback), we believe making the button larger and using a contrasting color will increase CTA clicks by 15%+ for new visitors. We'll measure click-through rate from page view to signup start."

Test Types

| Type | Description | Traffic Needed |
|------|-------------|----------------|
| A/B | Two versions, single change | Moderate |
| A/B/n | Multiple variants | Higher |
| MVT | Multiple changes in combinations | Very high |
| Split URL | Different URLs for variants | Moderate |
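The "Very high" traffic requirement for MVT follows from combinatorics: every combination of factor levels becomes its own cell. This sketch uses hypothetical factor names and assumes, as a simplification, that each cell needs about as much traffic as one arm of an A/B test.

```python
from itertools import product
from math import prod

# Hypothetical MVT: each factor lists the versions being tested (control first).
factors = {
    "headline": ["control", "benefit-led"],
    "cta_copy": ["control", "Start free", "Get started"],
    "hero_image": ["control", "product shot"],
}

# Every combination of levels is a separate cell that needs its own traffic.
cells = list(product(*factors.values()))
assert len(cells) == prod(len(levels) for levels in factors.values())
print(len(cells))  # 2 * 3 * 2 = 12 cells

# Example per-cell figure: the 5% baseline / 20% lift entry in the quick-reference table.
per_cell = 7_000
print(len(cells) * per_cell)  # 84000 visitors, vs. 14000 for a two-arm A/B test
```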

Sample Size

Quick Reference

| Baseline conversion rate | 10% relative lift | 20% relative lift | 50% relative lift |
|----------|----------|----------|----------|
| 1% | 150k/variant | 39k/variant | 6k/variant |
| 3% | 47k/variant | 12k/variant | 2k/variant |
| 5% | 27k/variant | 7k/variant | 1.2k/variant |
| 10% | 12k/variant | 3k/variant | 550/variant |

Calculators:

Evan Miller's sample size calculator
Optimizely's sample size calculator

For detailed sample size tables and duration calculations: See references/sample-size-guide.md
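The quick-reference numbers can be approximated with the standard normal-approximation formula for comparing two proportions. This Python sketch assumes a two-sided test at α = 0.05 with 80% power and treats "lift" as relative to the baseline. Those defaults are assumptions, since the table does not state them. This more detailed formula returns somewhat larger figures than the rounded table (about 8,200 rather than 7k per variant for a 5% baseline and 20% lift), so treat both as planning estimates. `sample_size_per_variant` and `duration_days` are hypothetical helper names.

```python
from math import ceil, sqrt
from statistics import NormalDist


def sample_size_per_variant(baseline: float, relative_lift: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors per variant for a two-sided, two-proportion z-test."""
    p1 = baseline
    p2 = baseline * (1 + relative_lift)          # expected variant rate
    p_bar = (p1 + p2) / 2
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    numerator = (z_alpha * sqrt(2 * p_bar * (1 - p_bar))
                 + z_beta * sqrt(p1 * (1 - p1) + p2 * (1 - p2))) ** 2
    return ceil(numerator / (p2 - p1) ** 2)


def duration_days(n_per_variant: int, variants: int, daily_visitors: int) -> int:
    """Days needed if all eligible daily traffic is split evenly across variants."""
    return ceil(n_per_variant * variants / daily_visitors)


n = sample_size_per_variant(0.05, 0.20)
print(n)                           # roughly 8.2k per variant
print(duration_days(n, 2, 1_000))  # about 17 days at 1,000 eligible visitors/day
```

Running the numbers before launch also answers "how long should I run this test?": if the duration comes out at many months, the test needs a bolder change or a higher-traffic page.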

Metrics Selection

Primary Metric

Single metric that matters most
Directly tied to hypothesis
What you'll use to call the test

Secondary Metrics

Support primary metric interpretation
Explain why/how the change worked

Guardrail Metrics

Things that shouldn't get worse
Stop test if significantly negative

Example: Pricing Page Test

Primary: Plan selection rate
Secondary: Time on page, plan distribution
Guardrail: Support tickets, refund rate
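Writing the metric roles down as data makes the "single primary metric" rule checkable before launch. This is a minimal Python sketch; `MetricPlan`, `validate`, and the snake_case metric identifiers are hypothetical names for the pricing page example above.

```python
from dataclasses import dataclass, field


@dataclass
class MetricPlan:
    primary: str                                         # the one metric used to call the test
    secondary: list[str] = field(default_factory=list)   # context: why/how the change worked
    guardrails: list[str] = field(default_factory=list)  # must not get significantly worse

    def validate(self) -> None:
        if not self.primary:
            raise ValueError("exactly one primary metric is required")
        # A metric with two roles (e.g. primary and guardrail) muddies the decision.
        overlap = set(self.secondary) & set(self.guardrails)
        if self.primary in self.secondary or self.primary in self.guardrails or overlap:
            raise ValueError("each metric should have a single role")


pricing_test = MetricPlan(
    primary="plan_selection_rate",
    secondary=["time_on_page", "plan_distribution"],
    guardrails=["support_tickets", "refund_rate"],
)
pricing_test.validate()  # passes: one primary, no metric in two roles
```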

Designing Variants

What to Vary

| Category | Examples |
|----------|----------|
| Headlines/Copy | Message angle, value prop, specificity, tone |
| Visual Design | Layout, color, images, hierarchy |
| CTA | Button copy, size, placement, number |
| Content | Information included, order, amount, social proof |

Best Practices

Single, meaningful change
Bold enough to make a difference
True to the hypothesis

Traffic Allocation

| Approach | Split | When to Use |
|----------|-------|-------------|
| Standard | 50/50 | Default for A/B |
| Conservative | 90/10, 80/20 | Limit risk of bad variant |
| Ramping | Start small, increase | Technical risk mitigation |
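Conservative splits cost sample size. Assuming both arms have roughly the same per-visitor variance σ² (a reasonable approximation when conversion rates are close), a total of N visitors split r : (1 - r) gives the variance below. Setting it equal to the variance of a 50/50 split with N₀ total visitors shows how much more traffic an uneven split needs.

```latex
\operatorname{Var}(\hat{p}_B - \hat{p}_A) \approx \sigma^2\left(\frac{1}{rN} + \frac{1}{(1-r)N}\right) = \frac{\sigma^2}{r(1-r)\,N}
% A 50/50 split with N_0 visitors has variance 4\sigma^2 / N_0; equating the two:
N = \frac{N_0}{4r(1-r)}
% 80/20: N = N_0 / 0.64 \approx 1.56\,N_0 \qquad 90/10: N = N_0 / 0.36 \approx 2.78\,N_0
```

So a 90/10 split needs nearly three times the total traffic of a 50/50 split to detect the same effect. That is the price of limiting exposure to a risky variant.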

Considerations:

Consistency: users see the same variant when they return
Balance: exposure spread across times of day and days of the week
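Consistency is usually achieved by deterministic hashing rather than by storing a random draw: hash a stable user ID together with the experiment key, map the hash to [0, 1), and cut that interval according to the split weights. The sketch below illustrates this common pattern and is not the method used by any particular tool. The function name, the SHA-256 choice, and the key format are assumptions. Balanced exposure over time depends on how long the test runs, not on assignment.

```python
import hashlib


def assign_variant(user_id: str, experiment: str,
                   weights: dict[str, float]) -> str:
    """Deterministically map a user to a variant according to split weights."""
    # Hashing experiment + user keeps a user's bucket stable on return visits,
    # while different experiments get independent bucketings.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).hexdigest()
    bucket = int(digest[:15], 16) / 16**15  # roughly uniform value in [0, 1)
    total = sum(weights.values())
    cumulative = 0.0
    for variant, weight in weights.items():
        cumulative += weight / total
        if bucket < cumulative:
            return variant
    return variant  # guard against floating-point rounding at the top end


split = {"control": 90, "variant": 10}  # conservative 90/10 split
first = assign_variant("user-42", "pricing-page-v2", split)
assert first == assign_variant("user-42", "pricing-page-v2", split)  # stable on return
```

Changing the weights mid-test moves some users to a different variant, which is one more reason not to change things mid-test. Ramping tools typically handle this for you.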

Implementation

Client-Side

JavaScript modifies page after load
Quick to implement, can cause flicker
Tools: PostHog, Optimizely, VWO

Server-Side

Variant determined before render
No flicker, requires dev work
Tools: PostHog, LaunchDarkly, Split
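In a server-side setup, the variant is decided and an exposure event is recorded before any HTML is produced, so the user never sees the control flash first. This is a framework-free Python sketch. `render_signup_page`, the `experiment_exposure` event name, and the JSON event shape are hypothetical, and it does not use the PostHog, LaunchDarkly, or Split SDKs, whose APIs differ.

```python
import hashlib
import json
import time


def assign_variant(user_id: str, experiment: str) -> str:
    # Hypothetical 50/50 hash split; stable for a given user and experiment.
    digest = hashlib.sha256(f"{experiment}:{user_id}".encode()).digest()
    return "variant" if digest[0] % 2 else "control"


def log_exposure(user_id: str, experiment: str, variant: str) -> str:
    # In production this would go to your analytics pipeline; here we return JSON.
    return json.dumps({"event": "experiment_exposure", "experiment": experiment,
                       "variant": variant, "user_id": user_id, "ts": int(time.time())})


def render_signup_page(user_id: str) -> tuple[str, str]:
    # Decide the variant before building the response, so there is no flicker.
    variant = assign_variant(user_id, "signup-cta")
    event = log_exposure(user_id, "signup-cta", variant)
    cta = "Start your free trial" if variant == "variant" else "Sign up"
    html = f"<button class='cta'>{cta}</button>"
    return html, event


html, event = render_signup_page("user-42")
print(html)
print(event)
```

Logging exposure at the moment of assignment, rather than on page load in the browser, keeps the analysis population equal to the users who were actually bucketed.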

Running the Test

Pre-Launch Checklist

[ ] Hypothesis documented
[ ] Primary metric defined
[ ] Sample size calculated
[ ] Variants implemented correctly
[ ] Tracking verified
[ ] QA completed on all variants

During the Test

Do:

Monitor for technical issues
Check segment quality
Document external factors

Avoid:

Peeking at results and stopping early
Making changes to variants
Adding traffic from new sources
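One concrete technical check while a test runs is a sample ratio mismatch (SRM) test: if a 50/50 test shows a traffic split that is very unlikely under 50/50, assignment or tracking is probably broken, and the results should not be trusted. This Python sketch uses a chi-square goodness-of-fit test with one degree of freedom. The function name is hypothetical. Many teams use a strict threshold such as p < 0.001 here, but the threshold is a judgment call.

```python
from math import sqrt
from statistics import NormalDist


def srm_p_value(control_n: int, variant_n: int,
                expected_control_share: float = 0.5) -> float:
    """Chi-square goodness-of-fit (1 df) for the observed traffic split."""
    total = control_n + variant_n
    expected = (total * expected_control_share, total * (1 - expected_control_share))
    chi2 = sum((obs - exp) ** 2 / exp
               for obs, exp in zip((control_n, variant_n), expected))
    # With 1 degree of freedom chi2 = z^2, so the p-value is a two-sided normal tail.
    return 2 * (1 - NormalDist().cdf(sqrt(chi2)))


print(srm_p_value(50_000, 48_800))       # far below 0.001: investigate before analyzing
print(srm_p_value(50_100, 49_900))       # about 0.53: consistent with 50/50
print(srm_p_value(9_000, 1_000, 0.9))    # 1.0: exactly the planned 90/10 split
```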

The Peeking Problem

Checking results before reaching the planned sample size, and stopping as soon as they look significant, inflates the false-positive rate well above the nominal 5% and leads to wrong decisions. Pre-commit to the sample size and trust the process.
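The inflation is easy to demonstrate with an A/A simulation, where there is no true difference, so every "significant" result is a false positive. This stylized Python sketch assumes equally sized batches between looks and a normal approximation of the test statistic, so the cumulative z-score behaves like a scaled random walk under the null. The function name and defaults (10 looks, 20,000 simulated tests) are illustrative choices.

```python
import random
from math import sqrt


def false_positive_rates(looks: int = 10, trials: int = 20_000,
                         z_crit: float = 1.96, seed: int = 7) -> tuple[float, float]:
    """A/A simulation: returns (rate if you stop at any look, rate with one final look)."""
    rng = random.Random(seed)
    peek_hits = final_hits = 0
    for _ in range(trials):
        running = 0.0
        crossed = False
        for k in range(1, looks + 1):
            # Each batch adds an independent N(0, 1) increment to the cumulative
            # statistic under the null; dividing by sqrt(k) rescales it to a z-score.
            running += rng.gauss(0.0, 1.0)
            z = running / sqrt(k)
            if abs(z) > z_crit:
                crossed = True  # a peeker would stop here and declare a winner
        peek_hits += crossed
        final_hits += abs(z) > z_crit  # only the pre-planned final look counts
    return peek_hits / trials, final_hits / trials


peeking, fixed = false_positive_rates()
print(f"stop at any of 10 looks: {peeking:.1%}, single final look: {fixed:.1%}")
```

The single final look should come out near the nominal 5%, while stopping at the first of ten looks that crosses the threshold should land several times higher (roughly 19% in this setup).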

Analyzing Results

Statistical Significance

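Significance threshold: α = 0.05, often described as "95% confidence"; call a result significant when p < 0.05
Meaning: if there were truly no difference, a result at least this extreme would occur less than 5% of the time (it is not the probability that the result is random)
Not a guarantee, just a threshold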
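To make the threshold concrete, here is a minimal two-sided, two-proportion z-test that also returns a confidence interval for the difference in conversion rates. It is a Python sketch with invented counts. The function name is hypothetical, and dedicated testing tools may use different methods.

```python
from math import sqrt
from statistics import NormalDist


def two_proportion_test(conv_a: int, n_a: int, conv_b: int, n_b: int,
                        alpha: float = 0.05):
    """Two-sided z-test for B vs. A, plus a CI for the difference (B - A)."""
    p_a, p_b = conv_a / n_a, conv_b / n_b
    diff = p_b - p_a
    # Pooled standard error for the test (assumes no difference under H0).
    p_pool = (conv_a + conv_b) / (n_a + n_b)
    se_pool = sqrt(p_pool * (1 - p_pool) * (1 / n_a + 1 / n_b))
    z = diff / se_pool
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))
    # Unpooled standard error for the confidence interval.
    se = sqrt(p_a * (1 - p_a) / n_a + p_b * (1 - p_b) / n_b)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    return z, p_value, (diff - z_crit * se, diff + z_crit * se)


# Invented example: control 500/10,000 (5.0%), variant 600/10,000 (6.0%).
z, p, ci = two_proportion_test(500, 10_000, 600, 10_000)
print(f"z = {z:.2f}, p = {p:.4f}, 95% CI for difference = [{ci[0]:.4f}, {ci[1]:.4f}]")
```

Here z is about 3.10 and p is about 0.002, and the interval runs from roughly +0.37 to +1.63 percentage points. The interval is what shows whether even the low end of the effect is worth shipping.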

Analysis Checklist

Reached sample size? If not, the result is preliminary
Statistically significant? Check confidence intervals, not just the p-value
Effect size meaningful? Compare to MDE, project impact
Secondary metrics consistent? Do they support the primary?
Guardrail concerns? Did anything get worse?
Segment differences? Mobile vs. desktop? New vs. returning?
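The "effect size meaningful?" step is simple arithmetic: convert the observed difference to a relative lift, compare it to the minimum detectable effect (MDE) you planned for, and project the impact. This Python sketch uses a hypothetical function name and invented inputs. The projection assumes the observed lift persists and applies to all future traffic, which is often optimistic.

```python
def effect_summary(control_rate: float, variant_rate: float,
                   mde_relative: float, monthly_visitors: int) -> dict:
    """Relative lift, whether it clears the planned MDE, and a naive monthly projection."""
    relative_lift = (variant_rate - control_rate) / control_rate
    extra_conversions = (variant_rate - control_rate) * monthly_visitors
    return {
        "relative_lift": relative_lift,
        "meets_mde": relative_lift >= mde_relative,
        "extra_conversions_per_month": extra_conversions,
    }


summary = effect_summary(control_rate=0.05, variant_rate=0.06,
                         mde_relative=0.10, monthly_visitors=200_000)
print(summary)  # about a 20% relative lift and about 2,000 extra conversions/month
```

Run the same arithmetic on both ends of the confidence interval to see the plausible range of impact, not just the point estimate.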

Interpreting Results

| Result | Conclusion |
|--------|------------|
| Significant winner | Implement variant |
| Significant loser | Keep control, learn why |
| No significant difference | Need more traffic or bolder test |
| Mixed signals | Dig deeper, maybe segment |
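The table can be read as a rule on the confidence interval for (variant - control) on the primary metric, once the planned sample size is reached. This Python sketch is one simple encoding of it. `interpret` and its `other_metrics_disagree` flag (for secondary or guardrail metrics pointing the other way) are hypothetical.

```python
def interpret(ci_low: float, ci_high: float,
              other_metrics_disagree: bool = False) -> str:
    """Map a CI for (variant - control) on the primary metric to the table's conclusions."""
    if ci_low > 0 and other_metrics_disagree:
        return "Mixed signals: dig deeper, maybe segment"
    if ci_low > 0:
        return "Significant winner: implement variant"
    if ci_high < 0:
        return "Significant loser: keep control, learn why"
    # The interval includes zero: the data cannot distinguish the variants.
    return "No significant difference: need more traffic or a bolder test"


print(interpret(0.0037, 0.0163))
print(interpret(-0.004, 0.006))
```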

Documentation

Document every test with:

Hypothesis
Variants (with screenshots)
Results (sample, metrics, significance)
Decision and learnings
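Keeping each test record in a structured, machine-readable form makes past learnings searchable ("Have you tested this area before?"). This Python sketch serializes the four documentation items to JSON. The `ExperimentRecord` fields, file paths, and all values are illustrative, not taken from a real test.

```python
import json
from dataclasses import asdict, dataclass, field


@dataclass
class ExperimentRecord:
    name: str
    hypothesis: str
    variants: dict[str, str]            # variant name -> screenshot path
    sample_per_variant: dict[str, int]
    primary_metric: dict[str, float]    # variant name -> observed rate
    p_value: float
    decision: str
    learnings: list[str] = field(default_factory=list)


# Illustrative values only.
record = ExperimentRecord(
    name="pricing-page-plan-order",
    hypothesis=("Because most visitors compare plans, we believe listing the annual "
                "plan first will raise plan selection for new visitors."),
    variants={"control": "screenshots/control.png", "variant": "screenshots/annual-first.png"},
    sample_per_variant={"control": 12_000, "variant": 12_000},
    primary_metric={"control": 0.041, "variant": 0.046},
    p_value=0.07,
    decision="Keep control; inconclusive at planned sample size",
    learnings=["Plan order alone may be too small a change to detect"],
)
print(json.dumps(asdict(record), indent=2))
```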

For templates: See references/test-templates.md

Common Mistakes

Test Design

Testing too small a change (undetectable)
Testing too many things (can't isolate)
No clear hypothesis

Execution

Stopping early
Changing things mid-test
Not checking implementation

Analysis

Ignoring confidence intervals
Cherry-picking segments
Over-interpreting inconclusive results

Task-Specific Questions

What's your current conversion rate?
How much traffic does this page get?
What change are you considering and why?
What's the smallest improvement worth detecting?
What tools do you have for testing?
Have you tested this area before?

Related Skills

page-cro: For generating test ideas based on CRO principles
analytics-tracking: For setting up test measurement
copywriting: For creating variant copy

When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," or "how long should I run this test." Use this whenever someone is comparing two approaches and wants to measure which performs better. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.

Category: testing

Updated Apr 17, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add vitoropereira/claude-starter-kit ab-test-setup

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status; review each row below.

| Scanner | Purpose | Status | Score |
|---------|---------|--------|-------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| VirusTotal | Multi-engine malware detection | Error | 70% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Mar 20, 2026, 7:39 AM | 241.8s | 4 files scanned

SKILL.md

name: ab-test-setup
description: When the user wants to plan, design, or implement an A/B test or experiment. Also use when the user mentions "A/B test," "split test," "experiment," "test this change," "variant copy," "multivariate test," "hypothesis," "should I test this," "which version is better," "test two versions," "statistical significance," or "how long should I run this test." Use this whenever someone is comparing two approaches and wants to measure which performs better. For tracking implementation, see analytics-tracking. For page-level conversion optimization, see page-cro.
version: 1.1.0


Install manually

```shell
# Clone the repo
git clone https://github.com/vitoropereira/claude-starter-kit.git

# Copy into Claude Code skills folder (global)
cp -r claude-starter-kit/.claude/skills/ab-test-setup ~/.claude/skills/
```

See the official Claude Code Skills documentation for the skills folder path.

Repository

vitoropereira/claude-starter-kit

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT