Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

mohitagw15856/ab-test-planner

Name: ab-test-planner
Author: mohitagw15856

plugins/pm-delivery/skills/ab-test-planner/SKILL.md

npx skillsauth add mohitagw15856/pm-claude-skills ab-test-planner

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

A/B Test Planner Skill

Design experiments that produce trustworthy results — not just directional signals. Every test output includes hypothesis, success metrics, sample size, duration, and a results interpretation guide.

Required Inputs

Ask the user for these if not provided:

What is being tested (feature, UI change, copy, pricing, onboarding step)
Hypothesis (or ask to help formulate one)
Primary metric (conversion rate, click-through, completion rate, etc.)
Baseline rate and minimum detectable effect (MDE)
Daily eligible users (to calculate duration)

Experiment Design Checklist

Before running any test, confirm:

[ ] Clear hypothesis with predicted direction
[ ] Single primary metric (plus up to 2 guardrail metrics)
[ ] Minimum detectable effect (MDE) defined
[ ] Sample size calculated
[ ] Test duration estimated
[ ] Segment isolated (no overlap with other running tests)
[ ] Rollback plan defined

Hypothesis Template

"We believe that [change] will cause [primary metric] to [increase/decrease] by [X%] for [user segment], because [rationale based on data or insight]."

Never run a test without a directional hypothesis. "Let's just see what happens" is not a hypothesis.

Sample Size Calculator Logic

Use this formula (provide the output, not the formula, to the user):

Baseline conversion rate: Current rate of primary metric
MDE: Smallest change worth detecting (recommend 10–20% relative lift for most features)
Statistical power: 80% (standard)
Significance level: 95% (p < 0.05)

For common scenarios, provide pre-calculated estimates:

| Baseline Rate | MDE (Relative) | Required Sample per Variant | |---|---|---| | 5% | 20% | ~19,000 | | 10% | 15% | ~14,000 | | 20% | 10% | ~15,000 | | 40% | 10% | ~9,500 | | 60% | 5% | ~42,000 |

Always warn: "These are estimates. Use a tool like Evan Miller's calculator or Statsig for precision."

Test Duration Guidance

Minimum: 2 full weeks (to capture weekly seasonality) Maximum: 4 weeks (novelty effect distorts results beyond this)

Duration = Required sample ÷ (Daily traffic × % exposed)

Flag if traffic is too low to reach significance in under 8 weeks — recommend a different approach (e.g., holdout test, qualitative research).

Output Format

A/B Test Plan — [Test Name] — [Date]

Hypothesis:

[Filled hypothesis template]

Variants:

Control (A): [Current experience]
Treatment (B): [Changed experience — be specific]

Primary Metric: [Metric name + how measured] Guardrail Metrics: [Metrics that must not degrade]

Target Segment: [Who sees the test — % of traffic, user type] Traffic Split: [50/50 recommended unless ramp-up needed]

Sample Size Required: ~[N] users per variant Estimated Duration: [X] weeks (based on [Y] daily eligible users) Significance Threshold: 95% confidence, 80% power

Exclusions: [Any user segments to exclude and why]

Rollback Trigger: If [guardrail metric] degrades by [X%], stop the test immediately.

Results Interpretation Guide:

✅ Ship if: Treatment shows [X%]+ lift on primary metric at 95% confidence AND guardrail metrics are stable
🔄 Iterate if: Direction is positive but not significant — consider extending or redesigning
❌ Reject if: No lift or negative direction at significance
⚠️ Inconclusive: Do not ship. Do not call it a win.

Guidelines

Always recommend against peeking at results before the test reaches planned sample size — explain p-hacking risk
If user wants to test multiple variants, explain the multiple comparisons problem and recommend a Bonferroni correction or a Bayesian approach
If traffic is very low (<1,000 users/day), recommend qualitative alternatives: moderated testing, 5-second tests, or user interviews
Never approve a test with no guardrail metrics — always protect revenue, retention, or core engagement

Anti-Patterns

[ ] Do not run a test without a directional hypothesis — "let's see what happens" produces uninterpretable results
[ ] Do not declare a winner before reaching the pre-planned sample size — peeking at results inflates false positive rates
[ ] Do not test multiple independent changes in a single variant — you won't know which change caused the result
[ ] Do not use engagement metrics (clicks, time-on-page) as the primary metric when the goal is revenue or retention — proxy metrics mislead
[ ] Do not ignore guardrail metrics — a conversion lift that causes a support ticket spike is not a win

Quality Checks

[ ] Hypothesis is directional (predicts a specific direction and magnitude, not "let's see")
[ ] Primary metric is singular (guardrail metrics are secondary)
[ ] Sample size is calculated from actual MDE and baseline (not guessed)
[ ] Test duration accounts for weekly seasonality (minimum 2 weeks)
[ ] Guardrail metrics are defined (at least one to protect revenue or core engagement)
[ ] Rollback trigger is specified with a concrete threshold

mohitagw15856/ab-test-planner

plugins/pm-delivery/skills/ab-test-planner/SKILL.md

Design statistically rigorous A/B tests for product features, UI changes, onboarding flows, and pricing experiments. Use when asked to set up an experiment, design an A/B test, calculate sample size, or interpret test results. Produces a complete test plan with hypothesis, variant definitions, sample size, duration estimate, guardrail metrics, and a results interpretation guide.

946 stars

development

Updated Jun 9, 2026

$ install --global

skillsauth

npx skillsauth add mohitagw15856/pm-claude-skills ab-test-planner

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Jun 9, 2026, 7:19 AM170.2s1 file scanned

SKILL.md

name:: ab-test-planner
description:: Design statistically rigorous A/B tests for product features, UI changes, onboarding flows, and pricing experiments. Use when asked to set up an experiment, design an A/B test, calculate sample size, or interpret test results. Produces a complete test plan with hypothesis, variant definitions, sample size, duration estimate, guardrail metrics, and a results interpretation guide.

A/B Test Planner Skill

Required Inputs

Ask the user for these if not provided:

What is being tested (feature, UI change, copy, pricing, onboarding step)
Hypothesis (or ask to help formulate one)
Primary metric (conversion rate, click-through, completion rate, etc.)
Baseline rate and minimum detectable effect (MDE)
Daily eligible users (to calculate duration)

Experiment Design Checklist

Before running any test, confirm:

[ ] Clear hypothesis with predicted direction
[ ] Single primary metric (plus up to 2 guardrail metrics)
[ ] Minimum detectable effect (MDE) defined
[ ] Sample size calculated
[ ] Test duration estimated
[ ] Segment isolated (no overlap with other running tests)
[ ] Rollback plan defined

Hypothesis Template

"We believe that [change] will cause [primary metric] to [increase/decrease] by [X%] for [user segment], because [rationale based on data or insight]."

Never run a test without a directional hypothesis. "Let's just see what happens" is not a hypothesis.

Sample Size Calculator Logic

Use this formula (provide the output, not the formula, to the user):

Baseline conversion rate: Current rate of primary metric
MDE: Smallest change worth detecting (recommend 10–20% relative lift for most features)
Statistical power: 80% (standard)
Significance level: 95% (p < 0.05)

For common scenarios, provide pre-calculated estimates:

| Baseline Rate | MDE (Relative) | Required Sample per Variant | |---|---|---| | 5% | 20% | ~19,000 | | 10% | 15% | ~14,000 | | 20% | 10% | ~15,000 | | 40% | 10% | ~9,500 | | 60% | 5% | ~42,000 |

Always warn: "These are estimates. Use a tool like Evan Miller's calculator or Statsig for precision."

Test Duration Guidance

Minimum: 2 full weeks (to capture weekly seasonality) Maximum: 4 weeks (novelty effect distorts results beyond this)

Duration = Required sample ÷ (Daily traffic × % exposed)

Flag if traffic is too low to reach significance in under 8 weeks — recommend a different approach (e.g., holdout test, qualitative research).

Output Format

A/B Test Plan — [Test Name] — [Date]

Hypothesis:

[Filled hypothesis template]

Variants:

Control (A): [Current experience]
Treatment (B): [Changed experience — be specific]

Primary Metric: [Metric name + how measured] Guardrail Metrics: [Metrics that must not degrade]

Target Segment: [Who sees the test — % of traffic, user type] Traffic Split: [50/50 recommended unless ramp-up needed]

Sample Size Required: ~[N] users per variant Estimated Duration: [X] weeks (based on [Y] daily eligible users) Significance Threshold: 95% confidence, 80% power

Exclusions: [Any user segments to exclude and why]

Rollback Trigger: If [guardrail metric] degrades by [X%], stop the test immediately.

Results Interpretation Guide:

✅ Ship if: Treatment shows [X%]+ lift on primary metric at 95% confidence AND guardrail metrics are stable
🔄 Iterate if: Direction is positive but not significant — consider extending or redesigning
❌ Reject if: No lift or negative direction at significance
⚠️ Inconclusive: Do not ship. Do not call it a win.

Guidelines

Always recommend against peeking at results before the test reaches planned sample size — explain p-hacking risk
If user wants to test multiple variants, explain the multiple comparisons problem and recommend a Bonferroni correction or a Bayesian approach
If traffic is very low (<1,000 users/day), recommend qualitative alternatives: moderated testing, 5-second tests, or user interviews
Never approve a test with no guardrail metrics — always protect revenue, retention, or core engagement

Anti-Patterns

[ ] Do not run a test without a directional hypothesis — "let's see what happens" produces uninterpretable results
[ ] Do not declare a winner before reaching the pre-planned sample size — peeking at results inflates false positive rates
[ ] Do not test multiple independent changes in a single variant — you won't know which change caused the result
[ ] Do not use engagement metrics (clicks, time-on-page) as the primary metric when the goal is revenue or retention — proxy metrics mislead
[ ] Do not ignore guardrail metrics — a conversion lift that causes a support ticket spike is not a win

Quality Checks

[ ] Hypothesis is directional (predicts a specific direction and magnitude, not "let's see")
[ ] Primary metric is singular (guardrail metrics are secondary)
[ ] Sample size is calculated from actual MDE and baseline (not guessed)
[ ] Test duration accounts for weekly seasonality (minimum 2 weeks)
[ ] Guardrail metrics are defined (at least one to protect revenue or core engagement)
[ ] Rollback trigger is specified with a concrete threshold

Related Skills

mohitagw15856/win-loss-analysis

business

VerifiedTrustedCommunity

Analyze why deals are won and lost and turn it into an action plan. Use when asked to run a win/loss analysis, review closed-won and closed-lost deals, understand why the team is losing to a competitor, or summarize sales feedback into patterns. Produces a structured win/loss report with themes, win/loss rates by segment and competitor, representative quotes, and prioritized actions for product, marketing, and sales.

1,117SKILL.mdUpdated Jul 2, 2026

mohitagw15856/win-loss-analysis

mohitagw15856/which-skill

development

VerifiedTrustedCommunity

Route a fuzzy request to the right skill in this library. Use when the user is unsure which skill fits, asks 'which skill should I use for X', describes a task without naming a skill, or when a request could plausibly match several skills. Produces a best-fit recommendation with the inputs to gather, a runner-up with the tie-breaker, and a workflow recipe when the job spans multiple skills.

1,117SKILL.mdUpdated Jul 2, 2026

mohitagw15856/which-skill

mohitagw15856/vuln-triage

testing

VerifiedTrustedCommunity

Triage a vulnerability or scanner finding — assess real severity, exploitability, and how urgently to fix. Use when asked to triage a CVE, prioritize scanner/pentest findings, assess a vuln's risk, or decide what to patch first. Produces a triage verdict: CVSS-informed severity adjusted for your context, exploitability, real risk, a fix/mitigation, and an SLA — so you fix what matters, not just what's red.

1,117SKILL.mdUpdated Jul 2, 2026

mohitagw15856/vuln-triage

mohitagw15856/voice-of-customer-program

development

VerifiedTrustedCommunity

Stand up a Voice of Customer (VoC) program that turns feedback into action. Use when asked to build a VoC program, design a customer feedback loop, consolidate feedback sources, or set up a closed-loop feedback process. Produces a VoC program design — objectives, feedback sources and channels, a taxonomy, collection and analysis cadence, closed-loop routing, ownership, and success metrics.

1,117SKILL.mdUpdated Jul 2, 2026

mohitagw15856/voice-of-customer-program

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/mohitagw15856/pm-claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r pm-claude-skills/plugins/pm-delivery/skills/ab-test-planner ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

mohitagw15856/pm-claude-skills

946 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT