plugins/pm/skills/experiment-decision/SKILL.md
Decide when to A/B test vs just ship. Framework for experiment planning and prioritization.
npx skillsauth add coalesce-labs/catalyst experiment-decision
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
/experiment-decision
Then provide:
I'll walk you through the decision tree: reversibility, hypothesis strength, detectable impact, and risk level. You'll get a clear recommendation: A/B test, ship + monitor, or just ship.
Output: Decision documented inline or saved to thoughts/shared/product/decisions/
Time: ~5 min for clear-cut cases, ~15 min for nuanced decisions
When to use: Before building any feature, when stakeholders demand "data-driven" decisions, or when unsure if testing is worth the effort
Framework source: Aakash Gupta's "When to A/B Test vs Just Ship"
Use this decision tree:
If YES → Ship it
Why: Reversible changes have low risk. Ship, monitor, and roll back if needed.
If NO → Continue to Question 2
If NO → Don't test
Why: Testing without a hypothesis is wasteful. Either clarify the hypothesis or don't build it.
If YES → Continue to Question 3
Run a power calculation:
Minimum Detectable Effect (MDE) = Effect you need to see to justify the work
If your feature is expected to improve conversion by 0.5%, but you need 10M users to detect it → Don't test, just ship and monitor
If impact is too small to detect → Ship without test
If impact is detectable → Continue to Question 4
High risk scenarios:
If HIGH risk → A/B test
If LOW risk → Ship without test
| Risk Level | Impact Size | Reversible? | Decision                            |
| ---------- | ----------- | ----------- | ----------------------------------- |
| High       | Large       | No          | A/B Test                            |
| High       | Large       | Yes         | A/B Test (or ship with kill switch) |
| High       | Small       | No          | Don't build                         |
| High       | Small       | Yes         | Ship + Monitor                      |
| Low        | Large       | No          | Ship + Monitor                      |
| Low        | Large       | Yes         | Just Ship                           |
| Low        | Small       | No          | Just Ship                           |
| Low        | Small       | Yes         | Just Ship                           |
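If you want to apply the matrix above programmatically (for example in a planning script), one option is a plain lookup table. The names `DECISION_MATRIX` and `lookup_decision` are hypothetical; the values are copied from the table.

```python
# Keys are (risk level, impact size, reversible?); values come from the matrix.
DECISION_MATRIX = {
    ("high", "large", False): "A/B Test",
    ("high", "large", True): "A/B Test (or ship with kill switch)",
    ("high", "small", False): "Don't build",
    ("high", "small", True): "Ship + Monitor",
    ("low", "large", False): "Ship + Monitor",
    ("low", "large", True): "Just Ship",
    ("low", "small", False): "Just Ship",
    ("low", "small", True): "Just Ship",
}


def lookup_decision(risk: str, impact: str, reversible: bool) -> str:
    """Return the matrix decision for a risk/impact/reversibility combination."""
    key = (risk.strip().lower(), impact.strip().lower(), reversible)
    if key not in DECISION_MATRIX:
        raise ValueError(f"Unknown combination: {key!r}")
    return DECISION_MATRIX[key]


if __name__ == "__main__":
    print(lookup_decision("High", "Small", reversible=True))
```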
1. High-stakes decisions
2. Controversial hypotheses
3. Long-term bets
4. Optimization work
1. Fast iteration needed
2. Low risk, high certainty
3. Qualitative insights are strong
4. Testing would take too long
Time costs:
Engineering costs:
Opportunity costs:
When testing costs exceed value → Just ship
Decision: A/B Test
Decision: Just Ship
Decision: A/B Test
❌ Testing everything "to be data-driven"
❌ Shipping without monitoring
❌ Running underpowered tests
❌ Testing when qualitative data is clear
Before building any feature, ask:
Before committing to an A/B test, estimate whether you have enough traffic to detect a meaningful difference.
Three inputs you need:
Minimum Detectable Effect (MDE) -- what's the smallest improvement worth detecting?
Baseline conversion rate -- what's the current rate you're trying to improve?
Daily traffic to the experiment -- how many users will enter the test per day?
You need approximately 1,000 conversions per variant to detect a 5% relative change at 80% power (95% confidence).
| Baseline Rate | MDE (Relative) | Conversions Needed Per Variant | At 1K daily visitors, days needed |
| ------------- | -------------- | ------------------------------ | --------------------------------- |
| 50%           | 5%             | ~3,200                         | ~7 days                           |
| 20%           | 5%             | ~12,500                        | ~63 days                          |
| 5%            | 10%            | ~15,000                        | ~300 days                         |
| 2%            | 10%            | ~40,000                        | ~800 days                         |
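For your own inputs, the estimate can be sketched with the standard normal-approximation formula for comparing two proportions. This is a rough sketch, not the method used to build the table: it assumes a two-sided test at 95% confidence and 80% power, counts visitors (not conversions) per variant, and assumes daily traffic is split evenly across two variants. The function names are hypothetical, and results will not necessarily match the rule-of-thumb figures above.

```python
import math
from statistics import NormalDist


def sample_size_per_variant(baseline_rate: float, relative_mde: float,
                            alpha: float = 0.05, power: float = 0.80) -> int:
    """Visitors needed per variant to detect a relative lift (two-sided z-test)."""
    if not 0 < baseline_rate < 1:
        raise ValueError("baseline_rate must be between 0 and 1")
    if relative_mde <= 0:
        raise ValueError("relative_mde must be positive")
    p1 = baseline_rate
    p2 = baseline_rate * (1 + relative_mde)  # rate in the treatment variant
    if p2 >= 1:
        raise ValueError("baseline_rate * (1 + relative_mde) must be below 1")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # critical value for significance
    z_beta = NormalDist().inv_cdf(power)           # critical value for power
    variance = p1 * (1 - p1) + p2 * (1 - p2)
    n = (z_alpha + z_beta) ** 2 * variance / (p2 - p1) ** 2
    return math.ceil(n)


def days_to_run(visitors_per_variant: int, daily_visitors: int,
                variants: int = 2) -> int:
    """Days needed if daily traffic is split evenly across all variants."""
    return math.ceil(visitors_per_variant * variants / daily_visitors)


if __name__ == "__main__":
    n = sample_size_per_variant(baseline_rate=0.20, relative_mde=0.05)
    print(f"Visitors per variant: {n}")
    print(f"Days at 1,000 visitors/day: {days_to_run(n, 1000)}")
```

Compare the resulting duration against your own cutoff for how long a test is worth running.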
If your power calculation shows the test would take longer than 4-6 weeks:
Some decisions don't need the full decision tree:
Action: Just ship it. You don't have a choice. But: Document the change, set up monitoring, track any user impact.
Action: Just fix it. No one A/B tests bug fixes. But: If the "bug fix" changes user behavior significantly, monitor post-fix metrics.
Action: Document the decision and ship. Set up measurement so you can report on impact. But: Frame your measurement as "proving the impact" rather than "testing whether to do it." This builds credibility for future data-driven decisions.
Action: If a competitor just shipped a similar feature and your users are asking for it, speed matters more than experimentation. Ship fast, measure after. But: Don't use "competitive pressure" as an excuse for every feature. Reserve this for genuine market urgency.
Action: If you're removing a feature that <1% of users touch, just remove it with advance notice. But: If the feature has any paying customers relying on it, communicate early and provide alternatives.
Before delivering the experiment decision, verify:
/experiment-metrics - Choose the right metrics to measure
/activation-analysis - Test activation improvements
/metrics-framework - Understand leading vs lagging metrics
/define-north-star - Align tests to North Star
Framework credit: Adapted from Aakash Gupta's experiment decision frameworks. Read: https://www.news.aakashg.com/p/when-to-ab-test
When the PM uses /experiment-decision, I automatically:
Source: thoughts/shared/product/decisions/, past decisions
Source: thoughts/shared/pm/metrics/, active PRDs
Source: Connection to /experiment-metrics skill
Source: thoughts/shared/pm/context/stakeholder-template.md, recent discussions
Source: Team capacity, past experiment timelines