plugins/discovery/skills/experiment-metrics/SKILL.md
STEDII framework for selecting trustworthy experiment metrics. Ensures metric validity and reliability.
npx skillsauth add coalesce-labs/catalyst experiment-metrics
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When to use: Before launching any experiment, when metrics feel unreliable, or when experiment results are confusing
Framework source: Aakash Gupta's "How to Choose the Right Metrics to Evaluate Experiments"
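Choose experiment metrics that are Sensitive, Timely, Efficient, Debuggable, Interpretable, and Isolated (STEDII). Each dimension is described below, in that order.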
What it means: The metric moves when your feature actually improves the experience
Bad example:
Good example:
How to check: Ask: "If this experiment succeeds, will this metric move within the experiment window?"
Common mistake: Using metrics that are too aggregated (MAU, total revenue) when you need something more granular (daily activation, conversion rate by cohort).
What it means: You get signal fast enough to make decisions
Bad example:
Good example:
Tradeoff alert: Sometimes you NEED slow metrics (LTV, annual retention). In those cases:
How to check: Ask: "Can I get actionable results within [1 week / 2 weeks / 1 month]?"
What it means: You can detect the effect with reasonable sample size and time
Bad example:
Good example:
Statistical power explained:
How to check: Run a power calculation:
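Minimum sample size per variant = 2 × (Z_α/2 + Z_β)² × (σ² / δ²)
Where:
- Z_α/2 = critical value for the confidence level (usually 1.96 for 95%)
- Z_β = critical value for the desired power (usually 0.84 for 80%)
- σ = standard deviation of the metric
- δ = minimum detectable effect, in the same units as the metric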
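The power calculation can be sketched in a few lines of Python. This is a minimal illustration, assuming a two-arm test with equal group sizes and a two-sided comparison of means, where the per-variant sample size carries a factor of 2. The function name and the default `alpha` and `power` values are assumptions for the example, not part of the framework.

```python
from math import ceil
from statistics import NormalDist


def min_sample_size_per_variant(sigma, delta, alpha=0.05, power=0.80):
    """Approximate users needed per variant (two-sided test, equal arms).

    sigma: standard deviation of the metric
    delta: minimum detectable effect, in the same units as the metric
    """
    if sigma <= 0 or delta <= 0:
        raise ValueError("sigma and delta must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)  # about 1.96 for 95% confidence
    z_beta = NormalDist().inv_cdf(power)           # about 0.84 for 80% power
    return ceil(2 * (z_alpha + z_beta) ** 2 * sigma ** 2 / delta ** 2)


# Hypothetical inputs: a 10% baseline conversion rate (sigma = sqrt(0.1 * 0.9) = 0.3)
# and a 5% relative lift, i.e. an absolute effect of 0.005.
n = min_sample_size_per_variant(sigma=0.3, delta=0.005)
print(f"Users needed per variant: {n}")
```

Because the required sample grows with 1/δ², halving the minimum detectable effect roughly quadruples the users needed.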
Practical tip: If you need >1M users to detect a 5% lift, your metric isn't efficient enough.
What it means: When something goes wrong, you can figure out why
Bad example:
Good example:
How to check: Ask: "If this metric tanks, can I quickly understand what happened?"
What makes metrics debuggable:
Red flags:
What it means: Stakeholders can understand what the metric represents
Bad example:
Good example:
The grandma test: Can you explain this metric to your grandma? If not, it fails interpretability.
How to check:
What it means: The metric moves because of your experiment, not external factors
Bad example:
Good example:
Common isolation failures:
How to check: Ask: "Could something OTHER than my experiment cause this metric to move?"
Use /experiment-metrics
I'm running an experiment to: [describe your experiment]
Help me brainstorm 5-10 candidate metrics we could measure.
Create a table:
| Metric | Sensitive? | Timely? | Efficient? | Debuggable? | Interpretable? | Isolated? | Total Score |
|--------|------------|---------|------------|-------------|----------------|-----------|-------------|
| Metric 1 | 2/3 | 3/3 | 2/3 | 3/3 | 3/3 | 2/3 | 15/18 |
| Metric 2 | 3/3 | 1/3 | 3/3 | 2/3 | 3/3 | 3/3 | 15/18 |
Scoring:
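As a minimal sketch, the totals in the table above can be tallied in Python. The dictionary layout and function names are hypothetical, and the code assumes each dimension is scored from 0 to 3, with 3 as the maximum shown in the table.

```python
DIMENSIONS = (
    "sensitive", "timely", "efficient",
    "debuggable", "interpretable", "isolated",
)
MAX_PER_DIMENSION = 3  # assumed scale, taken from the "x/3" cells in the table


def stedii_total(scores):
    """Sum the six STEDII dimension scores for one candidate metric."""
    missing = [d for d in DIMENSIONS if d not in scores]
    if missing:
        raise ValueError(f"missing dimensions: {missing}")
    for dim in DIMENSIONS:
        if not 0 <= scores[dim] <= MAX_PER_DIMENSION:
            raise ValueError(f"{dim} score must be between 0 and {MAX_PER_DIMENSION}")
    return sum(scores[dim] for dim in DIMENSIONS)


def rank_metrics(candidates):
    """Return (name, total) pairs, highest total first; ties keep input order."""
    totals = [(name, stedii_total(scores)) for name, scores in candidates.items()]
    return sorted(totals, key=lambda pair: pair[1], reverse=True)


# The two example rows from the table.
candidates = {
    "Metric 1": dict(zip(DIMENSIONS, (2, 3, 2, 3, 3, 2))),
    "Metric 2": dict(zip(DIMENSIONS, (3, 1, 3, 2, 3, 3))),
}
for name, total in rank_metrics(candidates):
    print(f"{name}: {total}/{MAX_PER_DIMENSION * len(DIMENSIONS)}")
```

Both example metrics total 15/18, which shows why the total alone is not enough: Metric 2 scores only 1/3 on timeliness, so the per-dimension scores still need a look.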
Primary metric: The ONE metric your experiment is designed to move
Guardrail metrics (3-5): Metrics you DON'T want to hurt
Example:
Before launching:
A/A Test - Run the experiment with no actual change between groups
Sample Ratio Check - Verify that the intended 50/50 split is actually 50/50 in the observed data
Metric Stability - Check historical variance
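The sample ratio check can be sketched as a chi-square goodness-of-fit test on the observed group sizes. This is a minimal standard-library example, not the skill's own implementation. The function name and the user counts are hypothetical, and the 0.001 alert threshold is an assumed convention rather than something the framework prescribes.

```python
from math import erfc, sqrt


def srm_p_value(control_n, treatment_n, expected_control_share=0.5):
    """P-value of a chi-square test (1 degree of freedom) on the traffic split."""
    total = control_n + treatment_n
    if total == 0:
        raise ValueError("no users observed")
    if not 0 < expected_control_share < 1:
        raise ValueError("expected_control_share must be between 0 and 1")
    expected_control = total * expected_control_share
    expected_treatment = total - expected_control
    chi2 = (
        (control_n - expected_control) ** 2 / expected_control
        + (treatment_n - expected_treatment) ** 2 / expected_treatment
    )
    # For 1 degree of freedom, the chi-square survival function is erfc(sqrt(x / 2)).
    return erfc(sqrt(chi2 / 2))


SRM_THRESHOLD = 0.001  # assumed alert threshold

p = srm_p_value(control_n=5000, treatment_n=5300)  # hypothetical counts
print(f"p-value: {p:.4f}, mismatch flagged: {p < SRM_THRESHOLD}")
```

A very small p-value suggests the assignment or logging is broken, in which case the metric results should not be trusted until the cause is found.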
Problem: Optimize one thing, break another
Solution: Always have guardrail metrics
Lagging metrics:
Leading metrics:
Best practice: Use leading metrics to get fast signal, validate with lagging metrics on a sample.
Problem: Testing a small feature but measuring site-wide metrics
Example:
Solution: Measure metrics scoped to exposed users
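The dilution effect is simple arithmetic, sketched below. The example assumes exposed and unexposed users share the same baseline rate and variance. The function names and the 5% exposure figure are hypothetical.

```python
def site_wide_lift(exposed_share, lift_among_exposed):
    """Relative lift seen on a site-wide metric when only some users are exposed.

    With a common baseline rate r, the site-wide rate becomes
    (1 - s) * r + s * r * (1 + L) = r * (1 + s * L), so the relative lift is s * L.
    """
    if not 0 <= exposed_share <= 1:
        raise ValueError("exposed_share must be between 0 and 1")
    return exposed_share * lift_among_exposed


def sample_size_inflation(exposed_share):
    """How many times more users a site-wide metric needs than an exposed-only one.

    Required sample size scales with 1 / delta**2, and dilution shrinks delta
    by a factor of exposed_share.
    """
    if not 0 < exposed_share <= 1:
        raise ValueError("exposed_share must be in (0, 1]")
    return 1 / exposed_share ** 2


# Hypothetical feature seen by 5% of users, with a 10% lift among those users.
print(f"Site-wide lift: {site_wide_lift(0.05, 0.10):.3%}")
print(f"Sample size inflation: {sample_size_inflation(0.05):.0f}x")
```

Under these assumptions, a feature that reaches 5% of users needs roughly 400 times the sample to show the same effect on a site-wide metric.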
Problem: Aggregate metric moves one way, segments move the opposite way
Example:
Solution: Always segment your metrics (new vs returning, mobile vs desktop, etc.)
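The reversal can be reproduced with a small example. The counts below are entirely hypothetical and are chosen so that treatment wins in every segment yet loses in aggregate, because the two arms have very different segment mixes.

```python
# Hypothetical (conversions, users) per arm and segment.
data = {
    "control": {"mobile": (20, 200), "desktop": (320, 800)},
    "treatment": {"mobile": (96, 800), "desktop": (84, 200)},
}


def segment_rates(arm):
    """Conversion rate per segment for one experiment arm."""
    return {segment: conv / users for segment, (conv, users) in arm.items()}


def aggregate_rate(arm):
    """Conversion rate across all segments of one experiment arm."""
    conversions = sum(conv for conv, _ in arm.values())
    users = sum(users for _, users in arm.values())
    return conversions / users


for arm_name, arm in data.items():
    rates = ", ".join(f"{seg} {r:.0%}" for seg, r in segment_rates(arm).items())
    print(f"{arm_name}: {rates}, aggregate {aggregate_rate(arm):.0%}")
```

In a randomized experiment the segment mix should be similar across arms, so an imbalance this large is itself a warning sign worth investigating alongside the segmented results.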
Experiment: Testing new thumbnail images
Bad metric: Monthly viewing hours
Good metric: Click-through rate on thumbnails
Experiment: Showing "Only 2 rooms left!" urgency message
Bad metric: Bookings per visitor
Good metrics:
Result: +2.5% conversion, but -5% satisfaction and -3% return visits
Decision: Don't ship. Guardrails caught a bad long-term tradeoff.
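The guardrail logic behind this decision can be sketched as follows. The function name and the 1% tolerance are hypothetical, and the sketch assumes every change passed in has already been judged statistically reliable.

```python
def ship_decision(primary_lift, guardrail_changes, max_guardrail_drop=0.01):
    """Return (ship, breached) for an experiment readout.

    primary_lift: relative change in the primary metric (0.025 means +2.5%)
    guardrail_changes: mapping of guardrail name to relative change
    max_guardrail_drop: largest tolerated drop per guardrail (assumed value)
    """
    breached = [
        name
        for name, change in guardrail_changes.items()
        if change < -max_guardrail_drop
    ]
    ship = primary_lift > 0 and not breached
    return ship, breached


# The result above: +2.5% conversion, -5% satisfaction, -3% return visits.
ship, breached = ship_decision(
    primary_lift=0.025,
    guardrail_changes={"satisfaction": -0.05, "return_visits": -0.03},
)
print(f"Ship: {ship}, breached guardrails: {breached}")
```

Setting the tolerance before launch keeps a positive primary result from being rationalized after the fact.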
Before you launch an experiment, verify:
[ ] Primary metric clearly defined
[ ] STEDII checklist passed
[ ] Guardrails defined (3-5 metrics)
[ ] Statistical plan complete
[ ] Segmentation plan
- /experiment-decision - Decide when to A/B test vs ship
- /metrics-framework - Understand leading vs lagging metrics
- /define-north-star - Choose your North Star Metric
- /retention-analysis - Measure long-term impact

Framework credit: Adapted from Aakash Gupta's STEDII framework. Read the full article: https://www.news.aakashg.com/p/metrics-experiments
When the PM uses /experiment-metrics, I automatically:
Source: thoughts/shared/pm/prds/, success metrics defined there
Source: PostHog (if connected)
Source: thoughts/shared/pm/metrics/, company guardrails
Source: thoughts/shared/pm/metrics/, A/B test results
Source: Connection to /experiment-decision skill
Before presenting output to the PM, verify:
thoughts/shared/pm/metrics/ for existing experiments and baselines, and thoughts/shared/pm/prds/ for pre-defined success metrics