frank-luongt/skills/codex/ab-test-analysis

Name: skills/codex/ab-test-analysis
Author: frank-luongt

skills/codex/ab-test-analysis/SKILL.md

npx skillsauth add frank-luongt/faos-skills-marketplace skills/codex/ab-test-analysis

name: ab-test-analysis
description: Analyze A/B test results with statistical rigor; calculate significance, check guardrails, and make ship/extend/stop decisions. Use when evaluating experiment results or interpreting test data.

A/B Test Analysis

Analyze experiment results with statistical rigor and produce a clear Ship / Investigate / Extend / Stop recommendation.

This skill complements ab-test-setup (which handles experiment design). Use this skill when you have results to analyze.

Purpose

Most A/B test interpretations are wrong — teams either call tests too early, ignore guardrail metrics, or ship on directional trends without statistical significance. This skill enforces disciplined analysis.

When to Use

An A/B test has completed its planned duration
You have conversion data for control and variant groups
Stakeholders are asking "did the test win?"
You need to decide: ship, extend, or kill

When NOT to Use

Designing or setting up an experiment (use ab-test-setup)
The test hasn't reached minimum sample size yet
You're analyzing observational data (not a controlled experiment)

Required Data (Ask If Missing)

| Field | Description |
| --- | --- |
| Primary metric | What the test is trying to improve (e.g., conversion rate) |
| Control group | Sample size (N) and conversions (C) for the control |
| Variant group | Sample size (N) and conversions (C) for the variant |
| Test duration | How long the test ran |
| Planned duration | How long it was designed to run |
| Guardrail metrics | Metrics that must not degrade (e.g., revenue, page load time) |
| MDE | Minimum Detectable Effect used in power calculation |
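The "ask if missing" rule can be made mechanical. The following is a minimal sketch, assuming the required fields are collected into a single record; the class name `ExperimentInput`, its field names, and the helper `missing_fields` are hypothetical and not part of the skill.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class ExperimentInput:
    """Hypothetical container mirroring the Required Data table."""
    primary_metric: Optional[str] = None
    control_n: Optional[int] = None
    control_conversions: Optional[int] = None
    variant_n: Optional[int] = None
    variant_conversions: Optional[int] = None
    test_duration_days: Optional[int] = None
    planned_duration_days: Optional[int] = None
    guardrail_metrics: Optional[List[str]] = None
    mde: Optional[float] = None  # assumed to be a relative lift, e.g. 0.05


def missing_fields(exp: ExperimentInput) -> List[str]:
    """Return the names of fields that still need to be requested."""
    return [name for name, value in vars(exp).items() if value is None]


if __name__ == "__main__":
    partial = ExperimentInput(primary_metric="conversion rate", control_n=10_000)
    print("Ask the user for:", missing_fields(partial))
```

Running the analysis only when `missing_fields` returns an empty list keeps the later steps from silently working with incomplete inputs.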

Analysis Process

Step 1: Validate the Setup

Before analyzing results, check:

[ ] Sample size adequate? Compare actual N to planned N from power analysis
[ ] Duration sufficient? Must cover at least 1–2 full business cycles (e.g., weekday + weekend)
[ ] SRM check? Sample Ratio Mismatch — control and variant should have ~equal N (within 1%). If skewed, the test is invalid.
[ ] No novelty effects? If you can, check early vs. late behavior. New UI elements get more clicks initially.

If any check fails, the test results may be unreliable. Flag this before proceeding.
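The SRM check above uses a 1% rule of thumb. A common alternative, shown here as a sketch rather than as the skill's own method, is a chi-square goodness-of-fit test on the two group sizes. The function name `srm_check`, the 50/50 expected split, and the `alpha=0.001` threshold are assumptions chosen for illustration.

```python
import math


def srm_check(n_control, n_variant, expected_ratio=0.5, alpha=0.001):
    """Chi-square goodness-of-fit test for Sample Ratio Mismatch.

    expected_ratio is the planned share of traffic in control (assumed 50/50).
    alpha=0.001 is an assumed, deliberately strict threshold.
    """
    total = n_control + n_variant
    expected_control = total * expected_ratio
    expected_variant = total * (1 - expected_ratio)
    chi2 = ((n_control - expected_control) ** 2 / expected_control
            + (n_variant - expected_variant) ** 2 / expected_variant)
    # With 1 degree of freedom, the chi-square tail equals erfc(sqrt(chi2 / 2)).
    p_value = math.erfc(math.sqrt(chi2 / 2))
    return {"chi2": chi2, "p_value": p_value, "srm_detected": p_value < alpha}


if __name__ == "__main__":
    print(srm_check(5000, 5000))
    print(srm_check(5000, 5500))
```

A test on the counts scales with sample size, so a 1% gap that is harmless at a few thousand users can still be flagged when the groups are in the millions.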

Step 2: Calculate Core Statistics

For conversion rate tests:

Control conversion rate: p_c = C_control / N_control
Variant conversion rate: p_v = C_variant / N_variant
Relative lift: (p_v - p_c) / p_c × 100%

Pooled proportion: p = (C_control + C_variant) / (N_control + N_variant)
Standard error: SE = sqrt(p × (1-p) × (1/N_control + 1/N_variant))
Z-score: Z = (p_v - p_c) / SE
P-value: two-tailed from Z

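95% Confidence Interval (absolute difference): (p_v - p_c) ± 1.96 × SE_diff
where SE_diff = sqrt(p_c × (1-p_c) / N_control + p_v × (1-p_v) / N_variant)
(The pooled SE above assumes no difference between groups, so it is used for the Z-score; the interval uses the unpooled SE_diff.)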
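The Step 2 formulas can be sketched in a few lines of standard-library Python. The function name `two_proportion_z_test` and the returned dictionary keys are illustrative assumptions. The z-score uses the pooled standard error; the interval in this sketch uses the unpooled standard error, which is the usual choice for a Wald interval on the difference.

```python
import math


def two_proportion_z_test(c_control, n_control, c_variant, n_variant):
    """Two-sided z-test for a difference in conversion rates."""
    p_c = c_control / n_control
    p_v = c_variant / n_variant
    diff = p_v - p_c

    # Pooled proportion and standard error (used for the test statistic).
    pooled = (c_control + c_variant) / (n_control + n_variant)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    if se_pooled == 0:
        raise ValueError("Standard error is zero; rates are all 0% or all 100%.")
    z = diff / se_pooled

    # Two-tailed p-value from the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # Unpooled standard error for the 95% interval on the absolute difference.
    se_diff = math.sqrt(p_c * (1 - p_c) / n_control + p_v * (1 - p_v) / n_variant)
    ci = (diff - 1.96 * se_diff, diff + 1.96 * se_diff)

    return {
        "p_control": p_c,
        "p_variant": p_v,
        "relative_lift": diff / p_c,
        "z": z,
        "p_value": p_value,
        "ci_95": ci,
    }


if __name__ == "__main__":
    # Made-up numbers for illustration only.
    result = two_proportion_z_test(1000, 10_000, 1100, 10_000)
    for key, value in result.items():
        print(key, value)
```

Note that the relative lift and the confidence interval are on different scales: the lift is relative to the control rate, while the interval is in absolute percentage points.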

Step 3: Assess Significance

| Criterion | Threshold | Status |
| --- | --- | --- |
| Statistical significance | p-value < 0.05 | Pass / Fail |
| Practical significance | Lift > MDE | Pass / Fail |
| Confidence interval | Does CI exclude 0? | Pass / Fail |

Both statistical AND practical significance are required to ship.
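The three criteria can be expressed as a small check. This is a sketch under stated assumptions: the function name `assess_significance` is hypothetical, and the MDE is assumed to be expressed as a relative lift on the same scale as the observed lift.

```python
def assess_significance(p_value, relative_lift, mde, ci_low, ci_high, alpha=0.05):
    """Evaluate the Step 3 criteria.

    Assumes relative_lift and mde are both relative lifts (e.g. 0.05 for 5%),
    and that ci_low/ci_high bound the difference between variant and control.
    """
    statistical = p_value < alpha
    practical = relative_lift > mde
    ci_excludes_zero = ci_low > 0 or ci_high < 0
    return {
        "statistical": statistical,
        "practical": practical,
        "ci_excludes_zero": ci_excludes_zero,
        # Both statistical AND practical significance are required to ship.
        "ship_eligible": statistical and practical,
    }


if __name__ == "__main__":
    # Made-up inputs: p = 0.021, observed lift 10%, MDE 5%.
    print(assess_significance(0.021, 0.10, 0.05, 0.0015, 0.0185))
```

Keeping the two criteria separate in the output makes it easy to report a result that is statistically detectable but too small to matter.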

Step 4: Check Guardrail Metrics

For each guardrail metric:

| Guardrail | Control | Variant | Change | Status |
| --- | --- | --- | --- | --- |
| [metric name] | [value] | [value] | [+/- %] | OK / Warning / Degraded |

A guardrail is degraded if it shows a statistically significant negative change.
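A guardrail status can be derived from the direction of the change and a p-value for that metric. In this sketch the function name `guardrail_status` and the `higher_is_better` flag are hypothetical, the p-value is assumed to come from whatever test suits the metric, and the meaning of "Warning" (a negative change that is not statistically significant) is an assumption, since the text defines only "Degraded".

```python
def guardrail_status(control, variant, p_value, higher_is_better=True, alpha=0.05):
    """Classify one guardrail metric as OK, Warning, or Degraded.

    p_value is supplied by the caller from a test appropriate to the metric.
    """
    change = (variant - control) / control
    # For metrics like page load time, an increase is the bad direction.
    worse = change < 0 if higher_is_better else change > 0
    if worse and p_value < alpha:
        status = "Degraded"   # statistically significant negative change
    elif worse:
        status = "Warning"    # assumed meaning: negative but not significant
    else:
        status = "OK"
    return change, status


if __name__ == "__main__":
    # Made-up values: revenue per user and page load time in seconds.
    print(guardrail_status(100.0, 97.0, 0.01))
    print(guardrail_status(2.0, 2.1, 0.30, higher_is_better=False))
```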

Step 5: Make the Decision

Use this decision matrix:

| Primary Metric | Guardrails | Recommendation |
| --- | --- | --- |
| Significant positive | All OK | Ship: roll out to 100% |
| Significant positive | Some degraded | Investigate: understand trade-off before deciding |
| Not significant, positive trend | All OK | Extend: run longer if sample size was insufficient |
| Not significant, flat | All OK | Stop: no effect detected, free up the experiment slot |
| Significant negative | Any | Don't Ship: revert and learn from the result |
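The decision matrix maps directly onto a lookup function. This is a sketch, not the skill's implementation: `recommend` and `flat_threshold` are hypothetical names, the 1% cutoff separating "flat" from "positive trend" is an assumed value, and only a "Degraded" status is treated as failing the guardrail condition. Combinations the matrix does not cover return "Unclassified" rather than a guess.

```python
def recommend(significant, relative_lift, guardrail_statuses, flat_threshold=0.01):
    """Map primary-metric and guardrail results onto the decision matrix.

    flat_threshold is an assumed cutoff: |lift| at or below it counts as flat.
    """
    any_degraded = any(status == "Degraded" for status in guardrail_statuses)

    if significant and relative_lift < 0:
        return "Don't Ship"
    if significant and relative_lift > 0:
        return "Investigate" if any_degraded else "Ship"
    if not significant and not any_degraded:
        if relative_lift > flat_threshold:
            return "Extend"
        if abs(relative_lift) <= flat_threshold:
            return "Stop"
    # Cases the matrix leaves open, e.g. a non-significant result
    # with degraded guardrails, need human judgment.
    return "Unclassified"


if __name__ == "__main__":
    print(recommend(True, 0.10, ["OK", "OK"]))
    print(recommend(False, 0.03, ["OK"]))
```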

Output Format

# A/B Test Results: [Test Name]

## Summary

| Field | Value |
| --- | --- |
| Test name | [name] |
| Hypothesis | [We believed X would cause Y] |
| Primary metric | [metric name] |
| Duration | [start] — [end] ([N] days) |
| Decision | **Ship / Investigate / Extend / Stop / Don't Ship** |

---

## Results

| Group | Sample Size | Conversions | Rate |
| --- | --- | --- | --- |
| Control | [N] | [C] | [rate]% |
| Variant | [N] | [C] | [rate]% |

**Relative lift:** [+/- X.X%]
**P-value:** [value]
**95% CI:** [[lower]%, [upper]%]
**Statistically significant:** Yes / No
**Practically significant:** Yes / No (MDE was [X]%)

---

## Guardrail Metrics

| Metric | Control | Variant | Change | Status |
| --- | --- | --- | --- | --- |
| [metric] | [val] | [val] | [change] | OK / Warning / Degraded |

---

## Recommendation

**Decision: [Ship / Investigate / Extend / Stop / Don't Ship]**

**Rationale:** [2–3 sentences explaining the decision]

**Next steps:**
1. [action]
2. [action]

---

## Learnings

- [What we learned from this test, regardless of outcome]
- [How this informs future experiments]

Common Pitfalls

| Pitfall | Why It's Wrong | Correct Approach |
| --- | --- | --- |
| Peeking at results daily | Inflates false positive rate | Wait for planned duration and sample size |
| Calling it at p=0.06 | "Almost significant" isn't significant | Set the threshold before the test, stick to it |
| Ignoring guardrails | Winning on one metric while losing on another | Always check guardrails before shipping |
| Post-hoc segmentation | Finding "it worked for mobile users!" after the fact is data mining | Pre-register segments or treat as hypothesis for next test |
| Running too many variants | Each variant needs full sample size | Limit to 1–2 variants per test |
| Not learning from losses | "It didn't work" is not a learning | Document WHY it didn't work and what to try next |
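The peeking pitfall can be illustrated with a small simulation of A/A tests, where there is no true effect and every "significant" result is a false positive. This is a sketch under simplifying assumptions: each look adds an equal-sized batch of data, the test statistic is modeled as a cumulative sum of standard normal draws, and the function name and parameters are hypothetical.

```python
import math
import random


def false_positive_rates(n_experiments=2000, n_looks=20, z_crit=1.96, seed=0):
    """Compare stopping at any look versus testing once at the end.

    Under the null, the z-statistic after k equal batches is modeled as
    (sum of k standard normal draws) / sqrt(k).
    """
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_experiments):
        running_sum = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, n_looks + 1):
            running_sum += rng.gauss(0.0, 1.0)
            z = running_sum / math.sqrt(look)
            if abs(z) > z_crit:
                crossed_at_any_look = True
        peeking_hits += crossed_at_any_look
        final_hits += abs(z) > z_crit  # z now holds the final-look statistic
    return peeking_hits / n_experiments, final_hits / n_experiments


if __name__ == "__main__":
    peeking, final_only = false_positive_rates()
    print(f"False positive rate when stopping at any look: {peeking:.3f}")
    print(f"False positive rate with one final test:       {final_only:.3f}")
```

The single final test stays near the nominal 5%, while checking at every look and stopping on the first crossing yields a rate several times higher.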

Anti-Patterns

| Avoid | Why | Instead |
| --- | --- | --- |
| "Directional win" | Not a statistical standard | Require p < 0.05 and lift > MDE |
| Shipping without guardrail check | May degrade critical metrics | Always check before shipping |
| Ending early because it "looks good" | Sequential testing bias | Run to planned duration |
| Not documenting learnings | Same failed experiments get repeated | Maintain an experiment log |

References

Kohavi, R., Tang, D., & Xu, Y. Trustworthy Online Controlled Experiments (2020)
Evan Miller's A/B Test Calculator
Sample Size Calculator

12 stars

development

Updated Apr 20, 2026

npx skillsauth add frank-luongt/faos-skills-marketplace skills/codex/ab-test-analysis

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Reported % |
| --- | --- | --- | --- |
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 20, 2026, 3:02 PM. Scan time: 17.4s. 2 files scanned.

Install manually

# Clone the repo
git clone https://github.com/frank-luongt/faos-skills-marketplace.git

# Copy into Claude Code skills folder (global)
cp -r faos-skills-marketplace/skills/codex/ab-test-analysis ~/.claude/skills/

Repository

frank-luongt/faos-skills-marketplace

12 stars

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT