skills/codex/ab-test-analysis/SKILL.md
<!-- AUTO-GENERATED by export-skills.py — DO NOT EDIT -->
---
name: ab-test-analysis
description: Analyze A/B test results with statistical rigor: calculate significance, check guardrails, and make ship/extend/stop decisions. Use when evaluating experiment results or interpreting test data.
---

# A/B Test Analysis
Analyze experiment results with statistical rigor and produce a clear **Ship / Investigate / Extend / Stop** recommendation.

This skill complements `ab-test-setup` (which handles experiment design). Use this skill when you have results to analyze.
Many A/B test interpretations go wrong in one of three ways: teams call tests too early, ignore guardrail metrics, or ship on directional trends that lack statistical significance. This skill enforces disciplined analysis.
| Field | Description |
| --- | --- |
| Primary metric | What the test is trying to improve (e.g., conversion rate) |
| Control group | Sample size (N) and conversions (C) for the control |
| Variant group | Sample size (N) and conversions (C) for the variant |
| Test duration | How long the test ran |
| Planned duration | How long it was designed to run |
| Guardrail metrics | Metrics that must not degrade (e.g., revenue, page load time) |
| MDE | Minimum Detectable Effect used in power calculation |
Before analyzing results, check:
If any check fails, the test results may be unreliable. Flag this before proceeding.
For conversion rate tests:
Control conversion rate: p_c = C_control / N_control
Variant conversion rate: p_v = C_variant / N_variant
Relative lift: (p_v - p_c) / p_c × 100%
Pooled proportion: p = (C_control + C_variant) / (N_control + N_variant)
Standard error: SE = sqrt(p × (1-p) × (1/N_control + 1/N_variant))
Z-score: Z = (p_v - p_c) / SE
P-value: two-tailed from Z
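95% Confidence Interval: (p_v - p_c) ± 1.96 × SE_unpooled, where SE_unpooled = sqrt(p_c × (1-p_c) / N_control + p_v × (1-p_v) / N_variant). The pooled SE above assumes no difference between groups, so it belongs to the Z-test; the interval for the difference uses the unpooled form.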
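The formulas above can be sketched in a few lines of standard-library Python. This is a minimal illustration, not a reference implementation: the function name `two_proportion_z_test` and the example counts are hypothetical, the p-value relies on the normal approximation (reasonable for large samples), and the interval in this sketch uses the unpooled standard error.

```python
import math


def two_proportion_z_test(c_control, n_control, c_variant, n_variant):
    """Two-sided z-test for the difference between two conversion rates."""
    p_c = c_control / n_control
    p_v = c_variant / n_variant
    diff = p_v - p_c

    # Pooled proportion and standard error, used for the test statistic.
    pooled = (c_control + c_variant) / (n_control + n_variant)
    se_pooled = math.sqrt(pooled * (1 - pooled) * (1 / n_control + 1 / n_variant))
    z = diff / se_pooled

    # Two-tailed p-value under the standard normal: 2 * (1 - Phi(|z|)).
    p_value = math.erfc(abs(z) / math.sqrt(2))

    # 95% interval for the absolute difference, using the unpooled SE
    # (an assumption of this sketch).
    se_unpooled = math.sqrt(p_c * (1 - p_c) / n_control + p_v * (1 - p_v) / n_variant)
    ci = (diff - 1.96 * se_unpooled, diff + 1.96 * se_unpooled)

    return {
        "p_control": p_c,
        "p_variant": p_v,
        "relative_lift_pct": diff / p_c * 100,
        "z": z,
        "p_value": p_value,
        "ci_95": ci,
    }


# Hypothetical counts: 1,000 of 10,000 convert in control, 1,100 of 10,000 in variant.
result = two_proportion_z_test(1000, 10000, 1100, 10000)
print(result)
```

With these made-up counts the relative lift is 10% and the p-value falls below 0.05. Whether that is enough to ship still depends on the MDE and the guardrails.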
| Criterion | Threshold | Status |
| --- | --- | --- |
| Statistical significance | p-value < 0.05 | Pass / Fail |
| Practical significance | Lift > MDE | Pass / Fail |
| Confidence interval | Does CI exclude 0? | Pass / Fail |
Both statistical AND practical significance are required to ship.
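As a rough sketch, the three criteria can be checked mechanically once the statistics are in hand. The function name `evaluate_significance` and its parameters are hypothetical, and the sketch assumes the lift and the MDE are both expressed as relative percentages so that they are comparable.

```python
def evaluate_significance(p_value, relative_lift_pct, mde_pct, ci_low, ci_high, alpha=0.05):
    """Apply the three pass/fail criteria and the 'both required' ship rule.

    Assumes relative_lift_pct and mde_pct use the same unit (relative %),
    and that (ci_low, ci_high) is the interval for the difference in rates.
    """
    statistically_significant = p_value < alpha
    practically_significant = relative_lift_pct > mde_pct
    ci_excludes_zero = ci_low > 0 or ci_high < 0

    return {
        "statistically_significant": statistically_significant,
        "practically_significant": practically_significant,
        "ci_excludes_zero": ci_excludes_zero,
        # Both statistical AND practical significance are required to ship.
        "ship_eligible": statistically_significant and practically_significant,
    }


# Hypothetical result: significant, but the lift is below a 5% MDE.
print(evaluate_significance(0.021, 3.0, 5.0, 0.0015, 0.0185))
```

Keeping the two checks as separate fields makes the common failure visible: a result can clear p < 0.05 and still be too small to matter.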
For each guardrail metric:
| Guardrail | Control | Variant | Change | Status |
| --- | --- | --- | --- | --- |
| [metric name] | [value] | [value] | [+/- %] | OK / Warning / Degraded |
A guardrail is degraded if it shows a statistically significant negative change.
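One way to apply this rule to a continuous guardrail (revenue per user, page load time) is a large-sample z-test on the difference in means. The sketch below is an assumption-laden illustration: the function name `guardrail_status`, the `higher_is_better` flag, and the example numbers are all hypothetical, and it returns only OK or Degraded because no threshold for Warning is defined here.

```python
import math


def guardrail_status(mean_c, sd_c, n_c, mean_v, sd_v, n_v,
                     higher_is_better=True, alpha=0.05):
    """Classify a guardrail from per-group summary statistics.

    Uses a normal approximation to the difference in means (unequal
    variances), which assumes reasonably large samples in both groups.
    Returns (status, change_pct, p_value).
    """
    diff = mean_v - mean_c
    se = math.sqrt(sd_c ** 2 / n_c + sd_v ** 2 / n_v)
    z = diff / se
    p_value = math.erfc(abs(z) / math.sqrt(2))  # two-tailed

    # "Negative" depends on the metric: a drop hurts revenue, a rise hurts latency.
    harmful = diff < 0 if higher_is_better else diff > 0
    status = "Degraded" if (harmful and p_value < alpha) else "OK"
    change_pct = diff / mean_c * 100
    return status, change_pct, p_value


# Hypothetical page load times in seconds; lower is better.
print(guardrail_status(1.20, 0.40, 10000, 1.25, 0.40, 10000, higher_is_better=False))
```

The direction flag matters: the same positive change is good news for revenue and bad news for latency.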
Use this decision matrix:
| Primary Metric | Guardrails | Recommendation |
| --- | --- | --- |
| Significant positive | All OK | Ship: roll out to 100% |
| Significant positive | Some degraded | Investigate: understand trade-off before deciding |
| Not significant, positive trend | All OK | Extend: run longer if sample size was insufficient |
| Not significant, flat | All OK | Stop: no effect detected, free up the experiment slot |
| Significant negative | Any | Don't Ship: revert and learn from the result |
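The matrix can be encoded as a small lookup, sketched below. The function name `recommend` and the string labels for the primary-metric outcome are hypothetical. The matrix does not cover a non-significant primary metric combined with degraded guardrails; this sketch returns "Investigate" for that case, which is an assumption rather than a rule from the matrix.

```python
def recommend(primary, guardrails_ok):
    """Map a primary-metric outcome and guardrail state to a recommendation.

    primary is one of: "significant_positive", "not_significant_positive_trend",
    "not_significant_flat", "significant_negative" (hypothetical labels).
    """
    if primary == "significant_negative":
        return "Don't Ship"  # applies regardless of guardrails
    if primary == "significant_positive":
        return "Ship" if guardrails_ok else "Investigate"
    if primary in ("not_significant_positive_trend", "not_significant_flat"):
        if not guardrails_ok:
            # Not covered by the matrix; treated as Investigate by assumption.
            return "Investigate"
        return "Extend" if primary == "not_significant_positive_trend" else "Stop"
    raise ValueError(f"unknown primary outcome: {primary!r}")


print(recommend("significant_positive", guardrails_ok=False))
```

Raising on an unknown label keeps a typo from silently producing a recommendation.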
# A/B Test Results: [Test Name]
## Summary
| Field | Value |
| --- | --- |
| Test name | [name] |
| Hypothesis | [We believed X would cause Y] |
| Primary metric | [metric name] |
| Duration | [start] — [end] ([N] days) |
| Decision | **Ship / Investigate / Extend / Stop / Don't Ship** |
---
## Results
| Group | Sample Size | Conversions | Rate |
| --- | --- | --- | --- |
| Control | [N] | [C] | [rate]% |
| Variant | [N] | [C] | [rate]% |
**Relative lift:** [+/- X.X%]
**P-value:** [value]
**95% CI:** [[lower]%, [upper]%]
**Statistically significant:** Yes / No
**Practically significant:** Yes / No (MDE was [X]%)
---
## Guardrail Metrics
| Metric | Control | Variant | Change | Status |
| --- | --- | --- | --- | --- |
| [metric] | [val] | [val] | [change] | OK / Warning / Degraded |
---
## Recommendation
**Decision: [Ship / Investigate / Extend / Stop / Don't Ship]**
**Rationale:** [2–3 sentences explaining the decision]
**Next steps:**
1. [action]
2. [action]
---
## Learnings
- [What we learned from this test, regardless of outcome]
- [How this informs future experiments]
| Pitfall | Why It's Wrong | Correct Approach |
| --- | --- | --- |
| Peeking at results daily | Inflates false positive rate | Wait for planned duration and sample size |
| Calling it at p=0.06 | "Almost significant" isn't significant | Set the threshold before the test, stick to it |
| Ignoring guardrails | Winning on one metric while losing on another | Always check guardrails before shipping |
| Post-hoc segmentation | Finding "it worked for mobile users!" after the fact is data mining | Pre-register segments or treat as hypothesis for next test |
| Running too many variants | Each variant needs full sample size | Limit to 1–2 variants per test |
| Not learning from losses | "It didn't work" is not a learning | Document WHY it didn't work and what to try next |
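The peeking pitfall is easy to see in a simulation. The sketch below runs A/A tests (no true effect) and compares two stopping rules: declare a win the first time any daily look crosses |Z| > 1.96, or look only once at the end. It is a simplified model under stated assumptions: equal traffic each day, and each day's contribution to the test statistic approximated as a standard normal increment. The function name and parameters are hypothetical.

```python
import math
import random


def false_positive_rates(n_sims=2000, n_looks=20, z_crit=1.96, seed=0):
    """Simulate A/A tests and return (peeking_rate, fixed_horizon_rate)."""
    rng = random.Random(seed)
    peeking_hits = 0
    final_hits = 0
    for _ in range(n_sims):
        total = 0.0
        crossed_at_any_look = False
        z = 0.0
        for look in range(1, n_looks + 1):
            # One day's standardized increment under the null hypothesis.
            total += rng.gauss(0.0, 1.0)
            # Z-score of the cumulative data after `look` days.
            z = total / math.sqrt(look)
            if abs(z) > z_crit:
                crossed_at_any_look = True
        if crossed_at_any_look:
            peeking_hits += 1
        if abs(z) > z_crit:  # only the final look counts
            final_hits += 1
    return peeking_hits / n_sims, final_hits / n_sims


peeking, fixed = false_positive_rates()
print(f"stop at first significant look: {peeking:.1%} false positives")
print(f"single look at planned end:     {fixed:.1%} false positives")
```

The single-look rate should sit near the nominal 5%, while the stop-at-first-crossing rate comes out several times higher, even though nothing about the data changed.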
| Avoid | Why | Instead |
| --- | --- | --- |
| "Directional win" | Not a statistical standard | Require p < 0.05 and lift > MDE |
| Shipping without guardrail check | May degrade critical metrics | Always check before shipping |
| Ending early because it "looks good" | Sequential testing bias | Run to planned duration |
| Not documenting learnings | Same failed experiments get repeated | Maintain an experiment log |