skills/statistics/SKILL.md
--- name: statistics description: >- Statistical analysis and hypothesis testing for data-driven decisions. Use when: choosing the right statistical test for a question, calculating sample sizes, running A/B test analysis, comparing distributions, measuring correlation, building confidence intervals, validating assumptions before applying a test, interpreting p-values and effect sizes, or selecting the right summary statistics for a dataset. Covers descriptive statistics, hypothesi
| Question Type | What You're Asking | Go To |
|--------------|-------------------|-------|
| Describe | What does the data look like? | Descriptive Statistics |
| Compare | Are these groups different? | Comparison Tests |
| Relate | Do these variables move together? | Correlation & Regression |
| Predict | What will happen next? | Regression & Modeling (beyond this skill; use ML) |
| Classify | Does this differ from expected? | Goodness-of-Fit Tests |
| Factor | Options | Why It Matters |
|--------|---------|---------------|
| Number of groups | 1, 2, or 3+ | Determines which test family to use |
| Paired or independent? | Same subjects measured twice vs. different subjects | Paired tests are more powerful but require matched data |
| Variable type | Continuous (interval/ratio) vs. categorical (nominal/ordinal) | Determines parametric vs. non-parametric |
| Sample size | Small (< 30) vs. large (≥ 30) | Small samples need non-parametric or exact tests |
| Distribution shape | Normal vs. skewed vs. unknown | Parametric tests assume normality |
Use these to summarize and understand data before testing anything.
| Measure | When to Use | Sensitive To |
|---------|------------|-------------|
| Mean | Symmetric, roughly normal data | Outliers (a single extreme value shifts it) |
| Median | Skewed data or when outliers are present | Nothing; robust by design |
| Mode | Categorical data or multimodal distributions | Ties, small samples |
Rule of thumb: if the mean and median differ by more than 10–15%, the data is likely skewed; prefer the median.
| Measure | When to Use | Notes |
|---------|------------|-------|
| Standard deviation | Normal-ish data; pair with mean | Same units as data |
| IQR (Q3 − Q1) | Skewed data; pair with median | Robust to outliers |
| Range | Quick sense of spread | Useless with outliers |
| Coefficient of variation (CV) | Comparing spread across datasets with different scales | CV = std / mean |
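The center and spread measures above can be computed in a few lines. This is a minimal sketch assuming NumPy is available; the `summarize` helper and the sample values are hypothetical, chosen so that a single extreme value pulls the mean away from the median.

```python
import numpy as np


def summarize(values):
    """Return center and spread measures for a 1-D numeric sample."""
    x = np.asarray(values, dtype=float)
    q1, q3 = np.percentile(x, [25, 75])
    mean = x.mean()
    std = x.std(ddof=1)  # sample standard deviation
    return {
        "mean": mean,
        "median": float(np.median(x)),
        "std": std,
        "iqr": q3 - q1,
        "range": x.max() - x.min(),
        # CV only makes sense for ratio-scale data with a nonzero mean
        "cv": std / mean if mean != 0 else float("nan"),
    }


# Hypothetical right-skewed sample: one extreme value (95) inflates the mean
sample = [12, 14, 15, 15, 16, 18, 19, 21, 95]
summary = summarize(sample)
print(summary)
```

Here the mean sits well above the median, which is the signal to report median and IQR instead of mean and standard deviation.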
| Check | Method | Interpretation |
|-------|--------|---------------|
| Normality | Shapiro-Wilk (n < 5000), Anderson-Darling, or Q-Q plot | p < 0.05 → not normal |
| Skewness | `scipy.stats.skew` | \|skew\| > 1 = substantially skewed |
| Kurtosis | `scipy.stats.kurtosis` (excess kurtosis; normal = 0) | > 0 = heavy tails, < 0 = light tails |
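The checks in the table might be bundled as follows. This is a sketch assuming SciPy; `check_distribution` is a hypothetical helper, the samples are simulated, and the 0.05 cutoff is the convention from the table rather than a hard rule.

```python
import numpy as np
from scipy import stats


def check_distribution(x, alpha=0.05):
    """Run Shapiro-Wilk plus skewness and excess kurtosis on one sample."""
    result = stats.shapiro(x)
    return {
        "shapiro_p": result.pvalue,
        "looks_normal": bool(result.pvalue >= alpha),
        "skew": float(stats.skew(x)),
        # Fisher definition (the SciPy default): a normal distribution gives 0
        "excess_kurtosis": float(stats.kurtosis(x)),
    }


# Simulated data for illustration only
rng = np.random.default_rng(42)
normal_report = check_distribution(rng.normal(loc=50, scale=5, size=200))
skewed_report = check_distribution(rng.exponential(scale=5, size=200))
print(normal_report)
print(skewed_report)
```

With large samples Shapiro-Wilk flags even trivial departures from normality, so it is worth reading the skewness value and a Q-Q plot alongside the p-value.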
How many groups?
├── 1 group vs. known value
│   ├── Normal → One-sample t-test
│   └── Not normal → Wilcoxon signed-rank
├── 2 groups
│   ├── Paired (same subjects)?
│   │   ├── Normal → Paired t-test
│   │   └── Not normal → Wilcoxon signed-rank
│   └── Independent?
│       ├── Normal, equal variance → Independent t-test
│       ├── Normal, unequal variance → Welch's t-test
│       └── Not normal → Mann-Whitney U
└── 3+ groups
    ├── Normal, equal variance → One-way ANOVA
    ├── Not normal → Kruskal-Wallis
    └── Repeated measures → Repeated measures ANOVA or Friedman test
| Test | Compares | Assumptions | Non-Parametric Alternative |
|------|---------|-------------|--------------------------|
| One-sample t-test | Sample mean vs. known value | Normal, continuous | Wilcoxon signed-rank |
| Independent t-test | Means of 2 independent groups | Normal, equal variance, independent | Mann-Whitney U |
| Welch's t-test | Means of 2 independent groups | Normal, independent (no equal variance needed) | Mann-Whitney U |
| Paired t-test | Means of 2 paired measurements | Normal differences, continuous | Wilcoxon signed-rank |
| One-way ANOVA | Means of 3+ independent groups | Normal, equal variance, independent | Kruskal-Wallis |
| Chi-squared test | Observed vs. expected frequencies | Expected count ≥ 5 in each cell | Fisher's exact test |
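For the two-independent-groups branch, the selection logic could be sketched as below. It assumes SciPy; `compare_two_independent` is a hypothetical helper, and choosing a test from pre-tests on the same data is a simplification (many analysts default to Welch's t-test when the data are roughly normal).

```python
import numpy as np
from scipy import stats


def compare_two_independent(a, b, alpha=0.05):
    """Pick and run a two-sample test following the decision tree."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    normal = (stats.shapiro(a).pvalue >= alpha
              and stats.shapiro(b).pvalue >= alpha)
    if not normal:
        res = stats.mannwhitneyu(a, b, alternative="two-sided")
        return "Mann-Whitney U", res.statistic, res.pvalue
    # Levene's test decides between the pooled and the Welch variant
    equal_var = bool(stats.levene(a, b).pvalue >= alpha)
    res = stats.ttest_ind(a, b, equal_var=equal_var)
    name = "Independent t-test" if equal_var else "Welch's t-test"
    return name, res.statistic, res.pvalue


# Simulated skewed samples for illustration only
rng = np.random.default_rng(1)
skewed_a = rng.exponential(scale=2.0, size=200)
skewed_b = rng.exponential(scale=3.0, size=200)
test_name, statistic, p_value = compare_two_independent(skewed_a, skewed_b)
print(test_name, statistic, p_value)
```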
If the omnibus test is significant (groups differ), use post-hoc tests to find which groups differ:
| Test | When to Use |
|------|------------|
| Tukey HSD | All pairwise comparisons, equal sample sizes |
| Bonferroni correction | Few planned comparisons |
| Dunn's test | Post-hoc for Kruskal-Wallis |
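A Bonferroni correction is simple enough to sketch by hand. The example below assumes SciPy and uses Welch's t-test for each pair, which is an illustrative choice rather than part of the correction itself; `pairwise_bonferroni` and the group values are hypothetical.

```python
from itertools import combinations

from scipy import stats


def pairwise_bonferroni(groups, alpha=0.05):
    """Compare every pair of groups and Bonferroni-adjust the p-values.

    groups: dict mapping group name -> list of observations.
    Returns a list of (name_a, name_b, raw_p, adjusted_p, significant).
    """
    pairs = list(combinations(groups, 2))
    results = []
    for a, b in pairs:
        raw_p = stats.ttest_ind(groups[a], groups[b], equal_var=False).pvalue
        # Multiply by the number of comparisons, capped at 1
        adjusted_p = min(float(raw_p) * len(pairs), 1.0)
        results.append((a, b, float(raw_p), adjusted_p, adjusted_p < alpha))
    return results


# Hypothetical data: A and B are close, C is far from both
groups = {
    "A": [10, 11, 12, 13, 14],
    "B": [10.5, 11.5, 12.5, 13.5, 14.5],
    "C": [30, 31, 32, 33, 34],
}
omnibus = stats.f_oneway(*groups.values())
results = pairwise_bonferroni(groups)
for row in results:
    print(row)
```

Bonferroni is conservative when there are many comparisons. For all-pairs comparisons after ANOVA, recent SciPy releases also provide `scipy.stats.tukey_hsd`.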
| Method | Variable Types | Measures | Assumptions |
|--------|---------------|----------|-------------|
| Pearson's r | Both continuous | Linear relationship strength | Normal, linear, no outliers |
| Spearman's ρ | Ordinal or continuous | Monotonic relationship strength | None (rank-based) |
| Kendall's τ | Ordinal, small samples | Monotonic relationship strength | None (rank-based, more robust than Spearman for small n) |
| Chi-squared test of independence | Both categorical | Whether variables are associated | Expected counts ≥ 5 |
| Point-biserial | One binary, one continuous | Correlation | Normal continuous variable |
Correlation ≠ causation. Always state this explicitly when reporting. A correlation is a starting point for investigation, not a conclusion.
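To see how the three coefficients differ, the sketch below runs them on simulated data with a monotonic but nonlinear relationship. It assumes SciPy; the data-generating formula is made up for illustration.

```python
import numpy as np
from scipy import stats

# Simulated data: y grows exponentially with x, plus noise
rng = np.random.default_rng(0)
x = rng.uniform(0, 10, size=100)
y = np.exp(x / 2) + rng.normal(0, 1, size=100)

pearson_r, pearson_p = stats.pearsonr(x, y)
spearman_rho, spearman_p = stats.spearmanr(x, y)
kendall_tau, kendall_p = stats.kendalltau(x, y)

print(f"Pearson r    = {pearson_r:.3f}")
print(f"Spearman rho = {spearman_rho:.3f}")
print(f"Kendall tau  = {kendall_tau:.3f}")
```

Because the relationship is monotonic but curved, the rank-based Spearman coefficient comes out higher than Pearson's r, which only captures the linear component.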
Always report effect size alongside significance. A tiny p-value with a tiny effect is not actionable.
| Context | Measure | Small | Medium | Large |
|---------|---------|-------|--------|-------|
| 2-group comparison | Cohen's d | 0.2 | 0.5 | 0.8 |
| ANOVA | Eta-squared (η²) | 0.01 | 0.06 | 0.14 |
| Correlation | r (absolute value; square the thresholds for R²) | 0.1 | 0.3 | 0.5 |
| Chi-squared | Cramér's V | 0.1 | 0.3 | 0.5 |
| Binary outcome | Odds ratio | 1.5 | 2.5 | 4.0 |
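Cohen's d for a two-group comparison can be computed directly from the pooled standard deviation. This is a minimal sketch assuming NumPy; `cohens_d` and `label_effect` are hypothetical helpers, and the "negligible" label for values below 0.2 is an added convention, not part of the table.

```python
import numpy as np


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample variance."""
    a, b = np.asarray(a, dtype=float), np.asarray(b, dtype=float)
    n1, n2 = len(a), len(b)
    pooled_var = ((n1 - 1) * a.var(ddof=1) + (n2 - 1) * b.var(ddof=1)) / (n1 + n2 - 2)
    return float((a.mean() - b.mean()) / np.sqrt(pooled_var))


def label_effect(d):
    """Map |d| onto the small / medium / large thresholds in the table."""
    d = abs(d)
    if d < 0.2:
        return "negligible"  # assumption: label for values below "small"
    if d < 0.5:
        return "small"
    if d < 0.8:
        return "medium"
    return "large"


# Hypothetical measurements
group_a = [2, 4, 6, 8, 10]
group_b = [1, 3, 5, 7, 9]
d = cohens_d(group_a, group_b)
print(round(d, 3), label_effect(d))
```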
Before running an experiment, calculate the required sample size:
## Power Analysis
**Test**: <e.g., "Independent t-test">
**Desired power**: 0.80 (standard) or 0.90 (high confidence)
**Significance level (α)**: 0.05
**Minimum detectable effect (MDE)**: <e.g., "Cohen's d = 0.3" or "5% conversion lift">
**Required sample size per group**: <computed value>
**Expected duration to collect**: <e.g., "2 weeks at current traffic">
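The "required sample size per group" field can be estimated with the standard normal approximation for a two-sided, two-sample comparison of means. This is a sketch assuming SciPy; `sample_size_per_group` is a hypothetical helper, and an exact t-based calculation (for example via `statsmodels.stats.power`) typically returns a slightly larger n.

```python
import math

from scipy.stats import norm


def sample_size_per_group(effect_size_d, alpha=0.05, power=0.80):
    """Approximate n per group: n = 2 * (z_{1-alpha/2} + z_{power})^2 / d^2."""
    z_alpha = norm.ppf(1 - alpha / 2)  # two-sided critical value
    z_beta = norm.ppf(power)
    return math.ceil(2 * (z_alpha + z_beta) ** 2 / effect_size_d ** 2)


n_small_effect = sample_size_per_group(0.3)
n_medium_effect = sample_size_per_group(0.5)
print(n_small_effect, n_medium_effect)
```

With the defaults, d = 0.3 needs about 175 subjects per group and d = 0.5 about 63. Because the required n scales with 1/d², halving the detectable effect roughly quadruples the sample.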
Rules of thumb:
A/B tests are among the most common applied statistics scenarios. Follow this workflow:
## A/B Test Results
**Test**: <name>
**Duration**: <start – end>
**Sample size**: Control: <n>, Variant: <n>
**Primary metric**: <metric name>
| Group | Value | 95% CI |
|-------|-------|--------|
| Control | <X> | [<lo>, <hi>] |
| Variant | <X> | [<lo>, <hi>] |
**Relative lift**: <X%> [<CI lo>, <CI hi>]
**p-value**: <X>
**Effect size**: <measure and value>
**Guardrail metrics**: <all stable / degradation in X>
**Decision**: <Ship variant | Keep control | Inconclusive — extend or redesign>
**Reasoning**: <why this decision, referencing effect size and practical significance>
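For a conversion-rate metric, the p-value, lift, and interval fields in the template could be filled in with a two-proportion z-test. This is a sketch assuming SciPy; `ab_test_proportions` is a hypothetical helper and the counts are made-up numbers, not real results.

```python
import math

from scipy.stats import norm


def ab_test_proportions(conv_c, n_c, conv_v, n_v, alpha=0.05):
    """Two-sided two-proportion z-test with a Wald CI for the difference."""
    p_c, p_v = conv_c / n_c, conv_v / n_v
    diff = p_v - p_c
    # Pooled standard error under H0 (no difference) for the test statistic
    p_pool = (conv_c + conv_v) / (n_c + n_v)
    se_pool = math.sqrt(p_pool * (1 - p_pool) * (1 / n_c + 1 / n_v))
    z = diff / se_pool
    p_value = 2 * norm.sf(abs(z))
    # Unpooled standard error for the confidence interval
    se = math.sqrt(p_c * (1 - p_c) / n_c + p_v * (1 - p_v) / n_v)
    z_crit = norm.ppf(1 - alpha / 2)
    return {
        "diff": diff,
        "relative_lift": diff / p_c,
        "z": z,
        "p_value": p_value,
        "ci_diff": (diff - z_crit * se, diff + z_crit * se),
    }


# Hypothetical counts: 4.0% control vs. 4.6% variant conversion
result = ab_test_proportions(conv_c=400, n_c=10_000, conv_v=460, n_v=10_000)
print(result)
```

The interval here is for the absolute difference in conversion rate. A statistically significant result still needs to clear the practical significance threshold before shipping.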
Before applying any parametric test, verify:
| Assumption | How to Check | If Violated |
|-----------|-------------|-------------|
| Normality | Shapiro-Wilk test, Q-Q plot, histogram | Use non-parametric alternative |
| Equal variance | Levene's test, F-test | Use Welch's t-test or rank-based test |
| Independence | Study design (not a statistical test) | Use paired/repeated measures test |
| Linearity (regression) | Scatter plot of residuals vs. fitted | Transform variables or use non-linear model |
| No multicollinearity (regression) | VIF < 5 for each predictor | Remove or combine correlated predictors |
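The VIF check in the last row can be computed from first principles: regress each predictor on the others and take 1 / (1 − R²). This is a sketch using only NumPy; `variance_inflation_factors` is a hypothetical helper and the predictors are simulated so that two of them are nearly collinear.

```python
import numpy as np


def variance_inflation_factors(X):
    """Return one VIF per column of X (rows = observations)."""
    X = np.asarray(X, dtype=float)
    n, k = X.shape
    vifs = []
    for j in range(k):
        target = X[:, j]
        others = np.delete(X, j, axis=1)
        design = np.column_stack([np.ones(n), others])  # add an intercept
        coef, *_ = np.linalg.lstsq(design, target, rcond=None)
        residuals = target - design @ coef
        r_squared = 1 - residuals.var() / target.var()
        vifs.append(float(1 / (1 - r_squared)))
    return vifs


# Simulated predictors: x2 is almost a copy of x1, x3 is independent
rng = np.random.default_rng(3)
x1 = rng.normal(size=200)
x2 = x1 + 0.1 * rng.normal(size=200)
x3 = rng.normal(size=200)
vifs = variance_inflation_factors(np.column_stack([x1, x2, x3]))
print([round(v, 1) for v in vifs])
```

The first two predictors exceed the VIF < 5 guideline by a wide margin, while the independent one stays near 1.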
## Statistical Question
**Question**: <What are you trying to learn? Be specific.>
**Population**: <Who or what does this apply to?>
**Hypothesis** (if testing):
- H₀: <null hypothesis — no effect, no difference>
- H₁: <alternative hypothesis — the effect you expect>
**Practical significance threshold**: <What size effect would actually matter?>
## Data Profile
**Source**: <where the data comes from>
**Sample size**: <n>
**Variables**:
- <var1>: <type> — <description>
- <var2>: <type> — <description>
**Missing data**: <count, percentage, pattern>
**Distribution**: <per variable — normal, skewed, categorical frequencies>
**Assumption checks**: <normality, variance, independence — pass/fail>
Use the selection guide above. Document:
## Method Selection
**Test**: <name>
**Why**: <reasoning tied to question type, data structure, and assumption checks>
**Alternative considered**: <what else could work and why it wasn't chosen>
**Correction applied**: <e.g., "Bonferroni for 3 pairwise comparisons" or "none — single test">
## Results
**Test statistic**: <name> = <value>
**p-value**: <value>
**Effect size**: <measure> = <value> (<small | medium | large>)
**Confidence interval**: [<lower>, <upper>] at <confidence level>
**Interpretation**: <Plain-language statement of what this means. Not just "p < 0.05 so we reject H₀" — state the practical implication.>
**Limitations**: <What this result does NOT tell us. Confounders, generalizability, assumptions that were borderline.>
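The confidence interval field can be filled in with a t-based interval when the estimate is a sample mean. This is a sketch assuming SciPy; `mean_confidence_interval` is a hypothetical helper and the measurements are invented for illustration.

```python
import numpy as np
from scipy import stats


def mean_confidence_interval(values, confidence=0.95):
    """Return (mean, lower, upper) using the t distribution with n - 1 df."""
    x = np.asarray(values, dtype=float)
    mean = float(x.mean())
    sem = stats.sem(x)  # standard error of the mean (ddof=1)
    lower, upper = stats.t.interval(confidence, len(x) - 1, loc=mean, scale=sem)
    return mean, float(lower), float(upper)


# Hypothetical measurements
measurements = [4.1, 3.9, 4.3, 4.0, 4.2, 4.4, 3.8, 4.1]
mean, lower, upper = mean_confidence_interval(measurements)
print(f"{mean:.2f} [{lower:.2f}, {upper:.2f}] at 95%")
```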
Every statistical analysis should produce:
Regression is one of the most widely used applied statistics methods. Use this decision guide:
| Outcome Variable | Predictor(s) | Method | Use When |
|-----------------|-------------|--------|----------|
| Continuous | 1 continuous | Simple linear regression | Predicting one variable from another (e.g., revenue from ad spend) |
| Continuous | Multiple | Multiple linear regression | Predicting with several factors; controlling for confounders |
| Binary (0/1) | Any | Logistic regression | Predicting yes/no outcomes (e.g., churn, conversion) |
| Count (0, 1, 2...) | Any | Poisson regression | Predicting event counts (e.g., support tickets per day) |
Before running regression, check:
Key outputs to report: R², adjusted R², coefficients with CIs, residual plots, and F-test p-value.
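For the simple linear regression case, most of these outputs are available from `scipy.stats.linregress`. This is a sketch on simulated data; the variable names and the true slope and intercept are assumptions made for the example. With a single predictor, the slope's p-value is equivalent to the F-test p-value.

```python
import numpy as np
from scipy import stats

# Simulated data: revenue = 3 * ad_spend + 5 + noise
rng = np.random.default_rng(7)
ad_spend = rng.uniform(1, 10, size=50)
revenue = 3.0 * ad_spend + 5.0 + rng.normal(0, 2.0, size=50)

fit = stats.linregress(ad_spend, revenue)

# 95% CI for the slope, using the t distribution with n - 2 degrees of freedom
t_crit = stats.t.ppf(0.975, df=len(ad_spend) - 2)
slope_ci = (fit.slope - t_crit * fit.stderr, fit.slope + t_crit * fit.stderr)
r_squared = fit.rvalue ** 2

# Residuals for a residuals-vs-fitted plot
fitted = fit.intercept + fit.slope * ad_spend
residuals = revenue - fitted

print(f"slope = {fit.slope:.2f}, 95% CI [{slope_ci[0]:.2f}, {slope_ci[1]:.2f}]")
print(f"R^2 = {r_squared:.3f}, p = {fit.pvalue:.2g}")
```

For multiple predictors, adjusted R², or logistic and Poisson models, a library such as statsmodels is the more natural fit.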
Frequentist methods (everything above) are the default. Consider Bayesian approaches when:
| Signal | Why Bayesian Helps |
|--------|-------------------|
| Small sample size (n < 30) | Priors regularize unstable estimates |
| You have strong prior knowledge | Incorporating domain expertise improves estimates |
| You need P(hypothesis \| data), not P(data \| hypothesis) | Credible intervals answer "what's the probability the effect is > X?" directly |
| Sequential testing / continuous monitoring | Posterior probabilities stay interpretable at every look, but peeking against a fixed decision threshold can still raise false positive rates, so define the stopping rule in advance |
| Stakeholders struggle with p-values | "95% probability the effect is between 2% and 8%" is more intuitive |
Practical tools: pymc, arviz for general Bayesian analysis; Bayesian A/B testing via bayesian-testing or built-in platform features (Optimizely, LaunchDarkly).
Default stance: Use frequentist methods unless one of the above signals is present. Don't switch to Bayesian for complexity's sake.
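When one of those signals is present, a conversion-rate A/B test can be analyzed with a conjugate Beta-Binomial model and no extra dependencies beyond NumPy. This is a sketch under stated assumptions: a uniform Beta(1, 1) prior, a hypothetical `prob_variant_beats_control` helper, and made-up counts.

```python
import numpy as np


def prob_variant_beats_control(conv_c, n_c, conv_v, n_v, draws=100_000, seed=0):
    """Monte Carlo estimate of P(variant rate > control rate | data)."""
    rng = np.random.default_rng(seed)
    # Beta(1, 1) prior + binomial data -> Beta(1 + successes, 1 + failures)
    post_c = rng.beta(1 + conv_c, 1 + n_c - conv_c, size=draws)
    post_v = rng.beta(1 + conv_v, 1 + n_v - conv_v, size=draws)
    lift = post_v - post_c
    lower, upper = np.percentile(lift, [2.5, 97.5])
    return {
        "p_variant_better": float((lift > 0).mean()),
        "credible_interval_95": (float(lower), float(upper)),
    }


# Hypothetical counts: 4.0% control vs. 4.6% variant conversion
bayes = prob_variant_beats_control(400, 10_000, 460, 10_000)
print(bayes)
```

The output reads directly as "the probability the variant is better", which is the statement stakeholders usually want. An informative prior would replace the two 1s with parameters reflecting past conversion data.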
| Library | Language | Best For |
|---------|----------|----------|
| scipy.stats | Python | Hypothesis tests, distributions, descriptive stats |
| statsmodels | Python | Regression, ANOVA, time series, assumption diagnostics |
| pingouin | Python | Clean API for t-tests, ANOVA, correlation, effect sizes |
| scikit-learn | Python | Train/test splits, cross-validation, preprocessing |
| pymc | Python | Bayesian modeling and inference |
| power_analysis / statsmodels.stats.power | Python | Sample size and power calculations |
| Pitfall | Why It Fails |
|---------|-------------|
| p-hacking | Running multiple tests and reporting only significant results inflates false positives. Pre-register your hypothesis. |
| Ignoring effect size | p = 0.001 with Cohen's d = 0.05 means "we're very sure about a trivially small difference." Not actionable. |
| Small sample, big claims | A study with n = 12 that finds p = 0.04 is fragile. One outlier changes the conclusion. |
| Violating independence | Using the same users in both groups, or multiple measurements without paired tests. Results are invalid. |
| Peeking at A/B tests | Checking daily and stopping when p < 0.05 dramatically inflates false positive rate. Use sequential testing if you must peek. |
| Treating non-significant as "no effect" | Absence of evidence ≠ evidence of absence. You may be underpowered. Report power. |
| Applying parametric tests to ordinal data | A Likert scale (1–5) is not continuous. Use non-parametric methods. |
| Confusing correlation with causation | Pearson r = 0.8 does not mean X causes Y. It means they move together. |
| Cherry-picking subgroups | "It wasn't significant overall, but it was significant for users aged 25–30 on Tuesdays." This is noise. |
| Reporting without uncertainty | "Conversion rate is 4.2%" is less useful than "4.2% ± 0.8% (95% CI)." Always show the interval. |