rigorous-experiments/SKILL.md
This skill should be used when designing, running, validating, or auditing statistical experiments on personal or observational time-series data (health metrics, speech/text corpora, behavioral logs, diaries, n-of-1 self-tracking). It enforces pre-registration, exact permutation tests, FDR discipline, data-validation gates, adversarial code review, and cross-validation with external models. Triggers on "design an experiment", "test this hypothesis on my data", "is this correlation real", "audit these findings", "pre-register", "validate this dataset", or any n-of-1 / quantified-self analysis request.
npx skillsauth add glebis/claude-skills rigorous-experimentsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Run statistical experiments on observational/personal time-series data that survive scrutiny. Distilled from a 54-experiment n-of-1 program in which sampled permutation tests, missing-data artifacts, app-categorization bugs and collinear mechanisms repeatedly manufactured — and then destroyed — "findings". Every rule here exists because its absence once produced a wrong conclusion.
Pick the mode matching the request; chain them for a full study.
| Mode | When | Reference |
|------|------|-----------|
| design | New hypothesis or study | references/design.md |
| conduct | Implementing + running the experiment | references/statistics.md |
| validate-data | Before trusting ANY new data source | references/data-validation.md |
| cross-validate | Findings worth defending; code review; external model review (e.g. GPT Pro) | references/cross-validation.md |
| investigate-leads | A sweep/run produced leads (p<0.06, not FDR-confirmed) | references/lead-investigation.md |
| audit | Re-examining past claims, registries of findings | references/statistics.md §Audit |
scripts/perm_stats.py.len(tests) (that defeats pre-registration; the linter
rejects it). Assert the run matches the declared m. Confirmatory
families small and separate from exploratory sweeps; pooling everything
into one BH buries true effects, cherry-picking families manufactures
them. Plain BH assumes independent/positively-dependent tests; for
strongly dependent lag families use BH-Yekutieli or maxT resampling.validate-data gate on any new source (see reference — the checklist
has caught: zero-vs-missing conflation, dedup semantics, substring
category bugs, rolling purge windows, timezone conventions).design: pre-registered hypotheses + family + power sanity.conduct: implement with scripts/perm_stats.py; run; write results
JSON with tests, statuses, and caveats including known limitations.cross-validate: adversarial code review (e.g. Codex read-only) BEFORE
trusting results; fix findings; re-run. For major claims, external
model review with a privacy-screened archive.investigate-leads on anything that surfaced as a lead (not at the
same scale — the triage battery: LOO, directionality, detrend-vs-step,
within-cycle, prewhiten+bootstrap; consolidate same-direction leads
into one composite). Mark diagnostic runs descriptive_only: true.Launch the bundled explorer over any directory of results JSONs:
python3 scripts/explorer.py <results_dir> [--port 8799] [--pattern "exp*.json"] [--sort newest|oldest]
Generates explorer.html in the directory, starts (or reuses) a loopback
http server on the port, and opens the browser: experiment list with
confirmed/lead badges, filter, sortable test tables color-coded by
status, verdicts, caveats, raw JSON. The page fetches result files live —
re-running experiments updates the view; re-run the script only when new
result files appear. Serve over localhost, never file:// (CDN fonts) and
never on a non-loopback interface (results may contain personal
statistics).
Run python3 evals/run_evals.py (from the skill directory) to lint an
experiment script/results pair against the standards (pre-registration
present, fixed literal m, exact perm usage, caveats, no raw text in
outputs). A diagnostic/triage run that intentionally mints no new tests
sets descriptive_only: true in its results JSON to satisfy the
"has tests" check. Eval cases in evals/cases/ document expected
pass/fail examples.
development
Create Tufte-inspired data reports and infographic dashboards as standalone HTML files. Uses EB Garamond for text, Monaspace Argon for numbers, Chart.js for interactive charts, and inline SVG sparklines. Produces publication-quality reports with 2-column narrative+data layouts, status dashboards, scroll animations, and responsive mobile support. Use this skill whenever the user wants to create a data report, activity dashboard, infographic, personal analytics page, health tracker visualization, or any document that combines narrative text with interactive charts and tables. Also triggers for "make a report like Tufte", "create an infographic", "build a dashboard", "visualize my data", or requests for beautiful data-driven documents.
documentation
Cut a software release and maintain a tiered compatibility policy. Use when the user wants to release, ship a version, bump the version, tag a release, write a changelog, or update COMPATIBILITY. Config-driven via release.config.json; bumps version files, runs a readiness gate, updates COMPATIBILITY.md tiers and deprecations, tags (→ release workflow), and reports closed issues. Teaches the underlying standards as it runs.
development
Sync and manage bilingual (EN/RU) library content for agency-docs. Use when adding, updating, or reviewing library articles. Handles translation, sync checks, and Russian stylistic review.
development
This skill should be used to watch a long-running background job (ffmpeg/media encode, qmd or other embedding/vector-DB run, batch agent/LLM pipeline, or a real-browser/agent-browser daemon) until it finishes or wedges, then deliver a verdict (done, needs-attention, or blocked) plus the exact next command, without burning dozens of manual poll commands. Triggers on "babysit this job", "watch this until it's done", "ping me when the encode/embed/batch finishes", "is this background process stuck", "monitor this ffmpeg/qmd run", or any request to wait on a long-running process and be told when it's complete or hung.