a-green-hand-jack/statistical-analysis-planner

Name: statistical-analysis-planner
Author: a-green-hand-jack

skills/statistical-analysis-planner/SKILL.md

npx skillsauth add a-green-hand-jack/ml-research-skills statistical-analysis-planner

Statistical Analysis Planner

Design the statistical analysis before running experiments, and report it correctly once results exist. This skill helps prevent underpowered claims, misleading averages without variance, and significance theater in ML papers.

Use this skill when:

deciding which significance tests or confidence intervals to report for a result table
seed variance is high and a single-run result may not be representative
comparing methods and wanting to know if the difference is statistically meaningful
a paper or rebuttal needs to defend a claim quantitatively against reviewer variance concerns
an ablation result is close and the decision to include it depends on whether the difference is real
multiple comparisons are being made and type-I error accumulation needs to be controlled

Do not use this skill to run the experiments — use run-experiment. Do not use this skill to interpret surprising results scientifically — use result-diagnosis. Use this skill after results exist (or in planning mode before deciding how many seeds to run).

Pair this skill with:

experiment-design-planner to plan the number of seeds, runs, and controls before running
result-diagnosis when the statistical analysis reveals that a result is not reliable
paper-evidence-board to update evidence slots with confidence-annotated claims
table-results-review to ensure result tables report variance and pass statistical requirements

Skill Directory Layout

<installed-skill-dir>/
├── SKILL.md
└── references/
    └── test-selection.md

Progressive Loading

Always read references/test-selection.md when choosing a statistical test or confidence interval method.
Read memory/claim-board.md and memory/evidence-board.md to understand what claims need statistical backing.

Core Principles

A mean result without variance is not an empirical claim — it is an anecdote.
Report the number of seeds and independent runs, not just the metric value.
Choose the test before seeing the results, not after. Post-hoc test selection biases results.
Effect size matters more than p-value for practical significance in ML.
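Multiple comparisons require corrections. If you run 10 ablation tests at α = 0.05 and none has a real effect, you expect 0.5 false positives on average, and (for independent tests) the chance of at least one spurious "significant" result is about 40% (1 − 0.95^10 ≈ 0.40).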
Reviewer variance concerns are common at NeurIPS/ICLR. Anticipate them with pre-planned variance analysis.
If compute prevents many seeds, acknowledge the limitation explicitly rather than overclaiming.
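The multiple-comparisons principle above can be checked with a little arithmetic. The sketch below assumes the tests are independent and that every null hypothesis is true, which is an idealization; the function name `false_positive_risk` is illustrative and not part of the skill.

```python
def false_positive_risk(num_tests: int, alpha: float = 0.05) -> tuple[float, float]:
    """Return (expected false positives, P(at least one false positive)).

    Assumes independent tests with every null hypothesis true.
    """
    expected = num_tests * alpha
    family_wise = 1.0 - (1.0 - alpha) ** num_tests
    return expected, family_wise


expected, family_wise = false_positive_risk(10)
print(f"expected false positives: {expected:.2f}")    # 0.50
print(f"chance of at least one:   {family_wise:.2f}")  # 0.40
```

The second number is the family-wise error rate, which is the quantity that Bonferroni and Holm corrections are designed to keep below α.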

Step 1 — Identify What Needs Statistical Analysis

For each result that will appear in the paper, record:

the claim being made ("Method A outperforms Baseline B on Task C")
the metric and its expected distribution
how many independent runs (seeds) exist
whether the comparison is within-subject (same data, different methods) or between-subject (different data splits)

Classify each result as:

requires-analysis: main claim or primary comparison
supporting-analysis: ablation or secondary result
descriptive-only: mean reported, no significance claim
single-run: only one run exists, limitations must be acknowledged
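One way to keep the Step 1 inventory consistent is a small record type. The sketch below is hypothetical: the `ResultRecord` class, its field names, and the precedence of the classification rules are assumptions made for illustration, not something the skill prescribes.

```python
from dataclasses import dataclass


@dataclass
class ResultRecord:
    claim: str                 # e.g. "Method A outperforms Baseline B on Task C"
    metric: str                # metric name
    num_runs: int              # independent runs (seeds)
    paired: bool               # True for within-subject comparisons
    is_main_claim: bool        # main claim or primary comparison
    claims_significance: bool  # whether the paper will claim significance

    def category(self) -> str:
        # Precedence is an assumption of this sketch: a single run always
        # wins, then the absence of a significance claim, then claim role.
        if self.num_runs <= 1:
            return "single-run"
        if not self.claims_significance:
            return "descriptive-only"
        return "requires-analysis" if self.is_main_claim else "supporting-analysis"


record = ResultRecord(
    claim="Method A outperforms Baseline B on Task C",
    metric="accuracy",
    num_runs=5,
    paired=True,
    is_main_claim=True,
    claims_significance=True,
)
print(record.category())  # requires-analysis
```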

Step 2 — Choose the Analysis Plan

Read references/test-selection.md.

For each requires-analysis result:

Result: <claim or comparison>
Metric: <metric name>
N seeds / runs: <count>
Distribution assumption: normal / non-normal / unknown
Test: <paired t-test / Wilcoxon / bootstrap CI / permutation test / McNemar>
Significance threshold: α = 0.05 (or 0.01 for primary claim)
Effect size measure: Cohen's d / Cliff's delta / relative improvement %
Multiple comparison correction: <Bonferroni / Holm / Benjamini-Hochberg / none>
Report format: mean ± std / 95% CI / p-value + effect size

For seed variance analysis, plan:

minimum number of seeds to detect the expected effect size at power 0.8
how to report variance: standard deviation across seeds, bootstrap CI, or min/max range
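For a rough seed budget, a normal-approximation power calculation is often enough to start from. The sketch below assumes a two-sided paired comparison and that the effect size is the standardized mean of per-seed differences (mean difference divided by its standard deviation). The function name `min_seeds` is illustrative. Because it uses the normal rather than the t distribution, it tends to underestimate the requirement when the answer is small, so treat the result as a lower bound.

```python
import math
from statistics import NormalDist


def min_seeds(effect_size: float, alpha: float = 0.05, power: float = 0.8) -> int:
    """Approximate number of paired runs needed to detect `effect_size`.

    Uses n = ((z_{1-alpha/2} + z_{power}) / effect_size)^2, rounded up.
    """
    if effect_size <= 0:
        raise ValueError("effect_size must be positive")
    z_alpha = NormalDist().inv_cdf(1.0 - alpha / 2.0)
    z_power = NormalDist().inv_cdf(power)
    return math.ceil(((z_alpha + z_power) / effect_size) ** 2)


for d in (0.5, 0.8, 1.0):
    print(f"d = {d}: at least {min_seeds(d)} seeds")
```

Halving the expected effect size roughly quadruples the number of seeds, which is why small improvements are expensive to defend statistically.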

Step 3 — Run or Verify the Analysis

For results that already exist, compute:

mean and standard deviation across seeds
95% confidence interval (bootstrap recommended for non-normal distributions)
p-value from the chosen test (if significance is being claimed)
effect size (Cohen's d or relative improvement %)
corrected p-values if multiple comparisons are made
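The quantities listed above can be computed with the Python standard library alone. The following is a minimal sketch for a paired (within-subject) comparison, using a percentile bootstrap for the confidence interval and an exact sign-flip permutation test for the p-value. The function names are illustrative, and the scores are invented purely to show the calls; the actual test should come from references/test-selection.md.

```python
import itertools
import random
import statistics


def paired_differences(method, baseline):
    """Per-seed differences; both lists must be ordered by the same seeds."""
    if len(method) != len(baseline):
        raise ValueError("paired comparison needs one score per seed for each method")
    return [m - b for m, b in zip(method, baseline)]


def bootstrap_ci(values, confidence=0.95, num_resamples=10_000, seed=0):
    """Percentile bootstrap confidence interval for the mean."""
    rng = random.Random(seed)
    means = sorted(
        statistics.fmean(rng.choices(values, k=len(values)))
        for _ in range(num_resamples)
    )
    tail = (1.0 - confidence) / 2.0
    lower = means[int(tail * num_resamples)]
    upper = means[int((1.0 - tail) * num_resamples) - 1]
    return lower, upper


def sign_flip_p_value(differences):
    """Exact two-sided paired permutation (sign-flip) test on the mean difference."""
    observed = abs(sum(differences))
    hits = 0
    total = 0
    for signs in itertools.product((1, -1), repeat=len(differences)):
        total += 1
        flipped = abs(sum(s * d for s, d in zip(signs, differences)))
        if flipped >= observed - 1e-12:  # tolerance for floating-point ties
            hits += 1
    return hits / total


def cohens_dz(differences):
    """Cohen's d for paired samples: mean difference over its standard deviation."""
    return statistics.fmean(differences) / statistics.stdev(differences)


# Invented scores for five seeds, for illustration only.
method = [82.1, 83.4, 81.0, 83.9, 81.1]
baseline = [80.2, 81.9, 80.5, 81.7, 80.0]
diffs = paired_differences(method, baseline)

print(f"mean ± std: {statistics.fmean(method):.1f} ± {statistics.stdev(method):.1f}")
print(f"95% bootstrap CI of the difference: {bootstrap_ci(diffs)}")
print(f"sign-flip p-value: {sign_flip_p_value(diffs)}")
print(f"Cohen's d (paired): {cohens_dz(diffs):.2f}")
```

Note that with five seeds the smallest two-sided p-value this exact test can produce is 2/32 = 0.0625, even when every seed favors the method, so a small-seed result can be consistent and still fall short of α = 0.05.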

For compute-limited settings (1–3 seeds):

report mean and range (min/max) rather than standard deviation
acknowledge the limitation explicitly in the paper
do not claim statistical significance with fewer than 5 independent runs for parametric tests

Step 4 — Report Format for Paper

For main result tables:

Method A: 82.3 ± 1.2 (mean ± std, N=5 seeds)
          [80.4, 84.1] 95% CI
          p < 0.05 vs Baseline B (paired t-test, Bonferroni-corrected)
          Effect size: d = 0.83 (large)

For text claims:

"X outperforms Y by Z% (p < 0.05, d = 0.6)" is preferred over "X significantly outperforms Y"
"X achieves [metric] = A ± B across N seeds" is preferred over "X achieves A"
Avoid "significantly" without a reported test and threshold

For low-seed settings:

"X achieves [metric] = A (range: [B, C], N=3 seeds); we note this result is based on limited seeds"
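The text-claim templates above are easy to generate mechanically, which avoids mixing the two formats by accident. This is a minimal sketch: the function name `format_result` and the `low_seed_threshold` parameter are assumptions, with the threshold of 3 taken from the "1–3 seeds" compute-limited setting in Step 3.

```python
import statistics


def format_result(name, metric, scores, low_seed_threshold=3):
    """Render a claim as mean ± std, or as mean and range when seeds are few."""
    n = len(scores)
    mean = statistics.fmean(scores)
    if n <= low_seed_threshold:
        return (
            f"{name} achieves {metric} = {mean:.1f} "
            f"(range: [{min(scores):.1f}, {max(scores):.1f}], N={n} seeds); "
            "we note this result is based on limited seeds"
        )
    std = statistics.stdev(scores)
    return f"{name} achieves {metric} = {mean:.1f} ± {std:.1f} across {n} seeds"


# Invented scores, for illustration only.
print(format_result("X", "accuracy", [80.0, 81.0, 82.0, 83.0, 84.0]))
print(format_result("X", "accuracy", [80.0, 82.0, 84.0]))
```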

Step 5 — Multiple Comparison Audit

If the paper reports more than 3 comparisons on the same held-out set:

list all comparisons
apply Bonferroni correction (divide α by number of tests) or Holm correction (less conservative)
flag any comparison that loses significance after correction
decide whether to keep each flagged comparison in the paper and, if kept, describe it as a "trend" rather than "significant"
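The audit above can be sketched in a few lines. The code below computes Bonferroni- and Holm-adjusted p-values and flags comparisons that lose significance after correction. The function names are illustrative and the p-values are invented for the example.

```python
def bonferroni(p_values):
    """Bonferroni-adjusted p-values: multiply each by the number of tests."""
    m = len(p_values)
    return [min(1.0, p * m) for p in p_values]


def holm(p_values):
    """Holm step-down adjusted p-values, returned in the original order."""
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])
    adjusted = [0.0] * m
    running_max = 0.0
    for rank, idx in enumerate(order):
        candidate = min(1.0, (m - rank) * p_values[idx])
        running_max = max(running_max, candidate)  # keep adjusted values monotone
        adjusted[idx] = running_max
    return adjusted


def lost_significance(p_values, adjusted, alpha=0.05):
    """Indices that were significant before correction but not after."""
    return [i for i, (p, q) in enumerate(zip(p_values, adjusted)) if p < alpha <= q]


# Invented p-values for four comparisons on the same held-out set.
raw = [0.001, 0.013, 0.020, 0.040]
print("Bonferroni:", bonferroni(raw), "lost:", lost_significance(raw, bonferroni(raw)))
print("Holm:      ", holm(raw), "lost:", lost_significance(raw, holm(raw)))
```

In this example Bonferroni flags three of the four comparisons while Holm flags none, which illustrates why Holm is described as the less conservative option.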

Memory Writeback

Update memory/evidence-board.md when statistical analysis changes the confidence level of a claim
Update memory/claim-board.md to reflect corrected or strengthened claim wording
Update memory/risk-board.md when low seed count or failed significance is a reviewer risk

Final Sanity Check

Before finalizing:

every main result table row has at least N, mean, and variance reported
significance tests were chosen before seeing the specific results, or the analysis plan was declared a priori
multiple-comparison corrections are applied when > 3 comparisons share a test set
effect sizes are reported alongside p-values for claimed differences
compute-limited seed counts are acknowledged as limitations
claims in the paper match the statistical evidence (no overclaiming)

Plan and report statistical rigor for ML experiment results. Use when significance testing, effect size reporting, confidence intervals, seed variance analysis, or multiple-comparison corrections are needed before including results in a paper or rebuttal.

4 stars

development

Updated May 16, 2026

Install this skill globally with one command:

npx skillsauth add a-green-hand-jack/ml-research-skills statistical-analysis-planner

Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each entry below.

Trivy (Container and dependency vulnerability scanner): Clean, 95%
Semgrep (Static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (Open source security scanning): Skipped, 50%
Socket.dev (Supply chain security analysis): Skipped, 50%
VirusTotal (Multi-engine malware detection): Skipped, 50%
CrowdStrike (Advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 16, 2026, 4:27 AM (128.6 s, 2 files scanned)

SKILL.md

name: statistical-analysis-planner
description: Plan and report statistical rigor for ML experiment results. Use when significance testing, effect size reporting, confidence intervals, seed variance analysis, or multiple-comparison corrections are needed before including results in a paper or rebuttal.
argument-hint: [project-dir] [--mode plan|report|audit] [--test <test-type>]
allowed-tools: Read, Write, Edit, Bash, Glob

Install manually

# Clone the repo
git clone https://github.com/a-green-hand-jack/ml-research-skills.git

# Copy into Claude Code skills folder (global)
cp -r ml-research-skills/skills/statistical-analysis-planner ~/.claude/skills/

Repository: a-green-hand-jack/ml-research-skills (4 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT