Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

phuryn/ab-test-analysis

Name: ab-test-analysis
Author: phuryn

pm-data-analytics/skills/ab-test-analysis/SKILL.md

npx skillsauth add phuryn/pm-skills ab-test-analysis


A/B Test Analysis

Evaluate A/B test results with statistical rigor and translate findings into clear product decisions.

Context

You are analyzing A/B test results for $ARGUMENTS.

If the user provides data files (CSV, Excel, or analytics exports), read and analyze them directly. Generate Python scripts for statistical calculations when needed.

Instructions

Understand the experiment:
- What was the hypothesis?
- What was changed (the variant)?
- What is the primary metric? Any guardrail metrics?
- How long did the test run?
- What is the traffic split?
Validate the test setup:
- Sample size: Is the sample large enough for the expected effect size?
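  - Use the formula: n = ((Z_{α/2} + Z_β)² × 2 × p × (1-p)) / MDE² per group, where p is the baseline conversion rate, MDE is the absolute minimum detectable effect, and Z_β ≈ 0.84 for 80% power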
  - Flag if the test is underpowered (<80% power)
- Duration: Did the test run for at least 1-2 full business cycles?
- Randomization: Any evidence of sample ratio mismatch (SRM)?
- Novelty/primacy effects: Was there enough time to wash out initial behavior changes?
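
The setup checks above can be sketched in Python using only the standard library. This is a minimal sketch, not the skill's own script: it assumes the common normal-approximation formula n = (z_{α/2} + z_β)² × 2p(1-p) / MDE² with an explicit power term z_β, an absolute MDE, and a 50/50 traffic split by default. The function names and the example numbers are hypothetical.

```python
import math
from statistics import NormalDist

_STD_NORMAL = NormalDist()  # standard normal: mean 0, standard deviation 1


def required_sample_size(baseline_rate, mde, alpha=0.05, power=0.80):
    """Per-group sample size needed to detect an absolute lift of `mde`."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)  # about 1.96 for alpha = 0.05
    z_beta = _STD_NORMAL.inv_cdf(power)           # about 0.84 for 80% power
    variance_term = 2 * baseline_rate * (1 - baseline_rate)
    return math.ceil((z_alpha + z_beta) ** 2 * variance_term / mde ** 2)


def achieved_power(n_per_group, baseline_rate, mde, alpha=0.05):
    """Approximate power of a two-sided test (the far tail is ignored)."""
    z_alpha = _STD_NORMAL.inv_cdf(1 - alpha / 2)
    standard_error = math.sqrt(2 * baseline_rate * (1 - baseline_rate) / n_per_group)
    return _STD_NORMAL.cdf(mde / standard_error - z_alpha)


def srm_p_value(n_control, n_variant, expected_control_share=0.5):
    """Chi-squared goodness-of-fit test (1 degree of freedom) for SRM."""
    total = n_control + n_variant
    expected_control = total * expected_control_share
    expected_variant = total - expected_control
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # With 1 degree of freedom, P(X > chi2) equals 2 * (1 - Phi(sqrt(chi2))).
    return 2 * (1 - _STD_NORMAL.cdf(math.sqrt(chi2)))


if __name__ == "__main__":
    # Hypothetical inputs: 10% baseline conversion, 2 percentage point MDE.
    print("needed per group:", required_sample_size(0.10, 0.02))
    print("power at 2,000 per group:", round(achieved_power(2000, 0.10, 0.02), 2))
    print("SRM p-value for 5000 vs 5300:", round(srm_p_value(5000, 5300), 4))
```

A very small SRM p-value (many teams use a strict cutoff such as 0.01, which is an assumption here rather than something the skill specifies) suggests the randomization or logging is broken, and the remaining analysis should be treated with caution.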
Calculate statistical significance:
- Conversion rate for control and variant
- Relative lift: (variant - control) / control × 100
- p-value: Using a two-tailed z-test or chi-squared test
- Confidence interval: 95% CI for the difference
- Statistical significance: Is p < 0.05?
- Practical significance: Is the lift meaningful for the business?
If the user provides raw data, generate and run a Python script to calculate these.
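
As an illustration of the calculations listed above, here is a minimal sketch of a two-tailed two-proportion z-test in Python. It assumes the raw data has already been reduced to conversion counts and sample sizes per arm. The function name, the returned keys, and the example counts are hypothetical. The test statistic uses the pooled standard error, while the confidence interval for the difference uses the unpooled one, which is a common convention rather than something the skill prescribes.

```python
import math
from statistics import NormalDist


def analyze_conversion(conv_control, n_control, conv_variant, n_variant, alpha=0.05):
    """Two-tailed two-proportion z-test with lift and confidence interval."""
    rate_control = conv_control / n_control
    rate_variant = conv_variant / n_variant
    diff = rate_variant - rate_control
    # Relative lift in percent; undefined if the control rate is zero.
    relative_lift = diff / rate_control * 100

    # Pooled standard error: assumes both arms share one rate under H0.
    pooled = (conv_control + conv_variant) / (n_control + n_variant)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = diff / se_pooled
    p_value = 2 * (1 - NormalDist().cdf(abs(z)))

    # Unpooled standard error for the interval around the observed difference.
    se_unpooled = math.sqrt(rate_control * (1 - rate_control) / n_control
                            + rate_variant * (1 - rate_variant) / n_variant)
    z_crit = NormalDist().inv_cdf(1 - alpha / 2)
    ci = (diff - z_crit * se_unpooled, diff + z_crit * se_unpooled)

    return {
        "rate_control": rate_control,
        "rate_variant": rate_variant,
        "relative_lift_pct": relative_lift,
        "z": z,
        "p_value": p_value,
        "ci_diff": ci,
        "significant": p_value < alpha,
    }


if __name__ == "__main__":
    # Hypothetical counts: 500/5000 conversions in control, 580/5000 in variant.
    result = analyze_conversion(500, 5000, 580, 5000)
    for key, value in result.items():
        print(key, value)
```

For a 2×2 table of counts, the chi-squared test without continuity correction gives the same p-value as this z-test, so either option named in the list leads to the same significance call.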
Check guardrail metrics:
- Did any guardrail metrics (revenue, engagement, page load time) degrade?
- A winning primary metric with degraded guardrails may not be a true win
Interpret results:

| Outcome | Recommendation |
|---|---|
| Significant positive lift, no guardrail issues | Ship it: roll out to 100% |
| Significant positive lift, guardrail concerns | Investigate: understand trade-offs before shipping |
| Not significant, positive trend | Extend the test: need more data or larger effect |
| Not significant, flat | Stop the test: no meaningful difference detected |
| Significant negative lift | Don't ship: revert to control, analyze why |
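
The decision table can be expressed as a small function, sketched below. Two points are assumptions rather than part of the skill: the `flat_band_pct` threshold that separates a "positive trend" from a "flat" result, and the handling of a non-significant negative trend, which the table does not cover and which this sketch treats as flat.

```python
def recommend(relative_lift_pct, p_value, guardrail_degraded,
              alpha=0.05, flat_band_pct=1.0):
    """Map test results onto the outcome table's recommendations.

    flat_band_pct is a hypothetical cutoff: a non-significant lift at or
    below it (including negative trends) is treated as flat.
    """
    significant = p_value < alpha
    if significant and relative_lift_pct < 0:
        return "Don't ship"
    if significant and relative_lift_pct > 0:
        return "Investigate" if guardrail_degraded else "Ship it"
    if relative_lift_pct > flat_band_pct:
        return "Extend the test"
    return "Stop the test"


if __name__ == "__main__":
    print(recommend(16.0, 0.01, guardrail_degraded=False))
    print(recommend(16.0, 0.01, guardrail_degraded=True))
    print(recommend(4.0, 0.20, guardrail_degraded=False))
    print(recommend(0.3, 0.80, guardrail_degraded=False))
    print(recommend(-8.0, 0.02, guardrail_degraded=False))
```

Keeping the rule in one function makes the thresholds explicit, so a team can adjust `alpha` or the flat band to its own practical-significance bar.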

Provide the analysis summary:

## A/B Test Results: [Test Name]

**Hypothesis**: [What we expected]
**Duration**: [X days] | **Sample**: [N control / M variant]

| Metric | Control | Variant | Lift | p-value | Significant? |
|---|---|---|---|---|---|
| [Primary] | X% | Y% | +Z% | 0.0X | Yes/No |
| [Guardrail] | ... | ... | ... | ... | ... |

**Recommendation**: [Ship / Extend / Stop / Investigate]
**Reasoning**: [Why]
**Next steps**: [What to do]

Think step by step. Save as markdown. Generate Python scripts for calculations if raw data is provided.
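
One way the summary template could be filled in and saved as markdown is sketched below. The function name, the row dictionary keys, and the example values are hypothetical. Rates are expected as fractions and lift as a percentage.

```python
def render_summary(test_name, hypothesis, days, n_control, n_variant,
                   rows, recommendation, reasoning, next_steps, alpha=0.05):
    """Return the analysis summary as a markdown string."""
    lines = [
        f"## A/B Test Results: {test_name}",
        "",
        f"**Hypothesis**: {hypothesis}",
        f"**Duration**: {days} days | **Sample**: {n_control} control / {n_variant} variant",
        "",
        "| Metric | Control | Variant | Lift | p-value | Significant? |",
        "|---|---|---|---|---|---|",
    ]
    for row in rows:
        significant = "Yes" if row["p_value"] < alpha else "No"
        lines.append(
            f"| {row['metric']} | {row['control']:.1%} | {row['variant']:.1%} "
            f"| {row['lift_pct']:+.1f}% | {row['p_value']:.3f} | {significant} |"
        )
    lines += [
        "",
        f"**Recommendation**: {recommendation}",
        f"**Reasoning**: {reasoning}",
        f"**Next steps**: {next_steps}",
    ]
    return "\n".join(lines) + "\n"


if __name__ == "__main__":
    # Hypothetical example values, for illustration only.
    report = render_summary(
        "Checkout button", "A clearer label increases conversion", 14, 5000, 5000,
        [{"metric": "Conversion", "control": 0.10, "variant": 0.116,
          "lift_pct": 16.0, "p_value": 0.012}],
        "Ship", "Significant lift with no guardrail issues", "Roll out to 100%",
    )
    with open("ab_test_results.md", "w", encoding="utf-8") as handle:
        handle.write(report)
```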

Analyze A/B test results with statistical significance, sample size validation, confidence intervals, and ship/extend/stop recommendations. Use when evaluating experiment results, checking if a test reached significance, interpreting split test data, or deciding whether to ship a variant.

7,806 stars

testing

Updated Mar 20, 2026

$ install --global

skillsauth

npx skillsauth add phuryn/pm-skills ab-test-analysis

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| VirusTotal | Multi-engine malware detection | Error | 70% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Mar 20, 2026, 8:11 AM | 142.1s | 1 file scanned

SKILL.md

name: ab-test-analysis
description: Analyze A/B test results with statistical significance, sample size validation, confidence intervals, and ship/extend/stop recommendations. Use when evaluating experiment results, checking if a test reached significance, interpreting split test data, or deciding whether to ship a variant.


Related Skills

phuryn/shipping-artifacts

tools

Verified, Trusted, Community

The durable documentation set that makes an AI-built (vibe-coded) app reviewable before shipping. A small core every app needs — architecture, user/permission flows, permissions, variables/secrets, and a test-coverage map — plus conditional docs added only when they apply: emails, scheduled work, SEO, and embedded agents/automation. Defines what each doc must capture and how a reviewer or auditor uses it. Use when documenting a codebase for handoff, mapping user journeys and trust-boundary crossings, planning test coverage, or preparing for a security or performance audit.

22,371 | SKILL.md | Updated Jun 6, 2026

phuryn/intended-vs-implemented

development

Verified, Trusted, Community

The method for finding the gap between what a system is supposed to do and what the code actually does — the class of bug generic scanners miss because they have no model of intent. Defines what counts as documented intent, what counts as implementation evidence, which mismatches matter, and how to avoid hand-wavy findings. Use when auditing AI-built code, reviewing access control against documented permissions, or checking whether a codebase matches its own documentation.

22,371 | SKILL.md | Updated Jun 6, 2026

phuryn/strategy-red-team

testing

Verified, Trusted, Community

Red-team a PRD, roadmap, or strategy by attacking its load-bearing assumptions before reality does. Steelmans then attacks each claim, ranks failure modes by impact × likelihood × cheapness-to-test, and returns the cheapest test and kill criteria for each. Use when stress-testing a plan, pressure-testing a strategy, challenging assumptions, or preparing a doc for executive review.

12,047 | SKILL.md | Updated Jun 6, 2026

phuryn/review-resume

testing

Verified, Trusted, Community

Comprehensive PM resume review and tailoring against 10 best practices including XYZ+S formula, keyword optimization, job-specific tailoring, and structure. Use when reviewing a PM resume, preparing for job applications, or improving resume impact.

7,806 | SKILL.md | Updated Mar 20, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.


Install manually


# Clone the repo
git clone https://github.com/phuryn/pm-skills.git

# Make sure the Claude Code skills folder (global) exists, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r pm-skills/pm-data-analytics/skills/ab-test-analysis ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

phuryn/pm-skills

7,806 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
