
glebis/rigorous-experiments

Name: rigorous-experiments
Author: glebis

rigorous-experiments/SKILL.md

npx skillsauth add glebis/claude-skills rigorous-experiments


Rigorous Experiments

Run statistical experiments on observational/personal time-series data that survive scrutiny. Distilled from a 54-experiment n-of-1 program in which sampled permutation tests, missing-data artifacts, app-categorization bugs and collinear mechanisms repeatedly manufactured — and then destroyed — "findings". Every rule here exists because its absence once produced a wrong conclusion.

Modes

Pick the mode matching the request; chain them for a full study.

| Mode | When | Reference |
|------|------|-----------|
| design | New hypothesis or study | references/design.md |
| conduct | Implementing + running the experiment | references/statistics.md |
| validate-data | Before trusting ANY new data source | references/data-validation.md |
| cross-validate | Findings worth defending; code review; external model review (e.g. GPT Pro) | references/cross-validation.md |
| investigate-leads | A sweep/run produced leads (p<0.06, not FDR-confirmed) | references/lead-investigation.md |
| audit | Re-examining past claims, registries of findings | references/statistics.md §Audit |

Non-negotiable core (all modes)

1. Pre-register before computing. Hypotheses, exact tests, family size m, and the acceptance threshold go in the script docstring BEFORE the first run. Post-hoc tests are reported as descriptive, never promoted.
2. Exact permutation, never sampled, on small n. A session sequence of n=19 has 18 non-identity circular shifts (19 alignments counting the observed one), so the minimum honest p is ~1/19≈0.05. Sampling 2000 shifts with replacement fabricates precision (this killed a flagship "q=0.028" finding). Use scripts/perm_stats.py.
3. Permute over the full calendar, not the compressed series. Shifting a gap-compressed series breaks the timeline; keep missingness as NaN masks re-applied per shift. Event indicators must be pure 0/1 with no gaps; missingness lives only in the outcome series.
4. BH with FIXED family size m, a LITERAL CONSTANT declared at design time, never len(tests) (that defeats pre-registration; the linter rejects it). Assert the run matches the declared m. Keep confirmatory families small and separate from exploratory sweeps: pooling everything into one BH buries true effects, and cherry-picking families manufactures them. Plain BH assumes independent or positively dependent tests; for strongly dependent lag families use Benjamini-Yekutieli (BY) or maxT resampling.
5. Stationarity check before correlating trending series. Exact circular shift on a trending series is "exactly, reproducibly wrong": report prewhitened r (AR(1) residuals) and a stationary bootstrap alongside.
6. Stratify before pooling (Simpson check): within group (e.g. therapy/coaching) and within regime (pre/post known breaks). A pooled r=−0.25 once hid therapy −0.64 vs coaching +0.53.
7. Controls can re-describe a finding, not just kill it. When a control collapses an effect, check collinearity of control and predictor: r(self-focus, session-length)=0.79 meant "mechanism ambiguous", not "effect fake". Report the decomposition.
8. Honest statuses: confirmed (q<0.10 exact) ≠ lead (p<0.06) ≠ null ≠ descriptive. Status flips are recorded, never silently edited. Nulls with adequate power are findings. Robust ≠ significant: a lead surviving leave-one-out at small n is still underpowered, so it is a candidate for a prospective test, not a finding.
8b. Series scope is part of the test. A lagged "[t+1]" means the next unit in the series the hypothesis is about, not the next pooled row; define scope before lagging (it once flipped a sign). When recomputing a prior result, reproduce a stored artifact on that scope first.
9. Privacy: raw text/audio never enters output files or external uploads. Only statistics, rates and embedding-derived scores do.
10. Plain-language reporting: every statistic carries its practical meaning inline; define r/p/q/n once per report; no untranslated jargon calques. Narrative first, numbers as support.
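The exact-permutation and full-calendar rules above can be sketched in a few lines of standard-library Python. This is an illustration under stated assumptions, not the contents of scripts/perm_stats.py: the function names, the choice of Pearson r as the statistic, and the toy data are all hypothetical. It shifts the 0/1 event indicator over the whole calendar, leaves the outcome's NaN gaps where they are, and enumerates every shift instead of sampling.

```python
import math

def pearson_r(xs, ys):
    """Pearson r for two equal-length lists; None when it is undefined."""
    n = len(xs)
    if n < 3:
        return None
    mx, my = sum(xs) / n, sum(ys) / n
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    if sxx == 0 or syy == 0:
        return None
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    return sxy / math.sqrt(sxx * syy)

def masked_r(events, outcome):
    """Correlate only the calendar days where the outcome was observed."""
    kept = [(e, o) for e, o in zip(events, outcome) if not math.isnan(o)]
    return pearson_r([e for e, _ in kept], [o for _, o in kept])

def exact_circular_shift_p(events, outcome):
    """Two-sided exact p over all n circular shifts of the event indicator."""
    n = len(events)
    if n != len(outcome):
        raise ValueError("events and outcome must cover the same calendar")
    if any(e not in (0, 1) for e in events):
        raise ValueError("event indicator must be pure 0/1 with no gaps")
    observed = masked_r(events, outcome)
    if observed is None:
        raise ValueError("observed statistic is undefined")
    extreme = 0
    for k in range(n):  # k = 0 is the observed alignment, so p >= 1/n
        shifted = events[k:] + events[:k]
        r = masked_r(shifted, outcome)  # outcome NaNs stay in calendar position
        # Assumption: an undefined shifted statistic counts as extreme (conservative).
        if r is None or abs(r) >= abs(observed) - 1e-12:
            extreme += 1
    return observed, extreme / n

# Toy 19-day calendar (made-up data): missingness lives only in the outcome.
NAN = float("nan")
events = [1, 0, 0, 0, 1, 0, 0, 0, 0, 1, 0, 0, 0, 1, 0, 0, 0, 0, 0]
outcome = [5.0 + 2.0 * e for e in events]
outcome[2] = NAN
outcome[7] = NAN
observed, p = exact_circular_shift_p(events, outcome)
print(f"r={observed:.3f}, exact p={p:.4f}, floor={1 / len(events):.4f}")
```

Because the observed alignment is always counted, p can never fall below 1/n, which is the floor rule 2 describes for n=19.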

Workflow (full study)

1. validate-data gate on any new source (see reference; the checklist has caught: zero-vs-missing conflation, dedup semantics, substring category bugs, rolling purge windows, timezone conventions).
2. design: pre-registered hypotheses + family + power sanity.
3. conduct: implement with scripts/perm_stats.py; run; write results JSON with tests, statuses, and caveats including known limitations.
4. cross-validate: adversarial code review (e.g. Codex read-only) BEFORE trusting results; fix findings; re-run. For major claims, external model review with a privacy-screened archive.
5. investigate-leads on anything that surfaced as a lead (not at the same scale; the triage battery: LOO, directionality, detrend-vs-step, within-cycle, prewhiten+bootstrap; consolidate same-direction leads into one composite). Mark diagnostic runs descriptive_only: true.
6. Verdicts in honest prose (mixed/rejected allowed); report; registry update with status provenance.
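The conduct step's output might be assembled roughly as follows. This is a minimal sketch, not the skill's actual schema: the JSON field names, the `FAMILY_M` constant, and the p-values are hypothetical placeholders. Only the thresholds (q<0.10 for confirmed, p<0.06 for lead) and the fixed-m rule come from the text above.

```python
import json

FAMILY_M = 4  # hypothetical pre-registered family size: a literal, never len(tests)

def bh_qvalues(pvals, m):
    """Benjamini-Hochberg adjusted p-values for a family of fixed size m."""
    if len(pvals) != m:
        raise ValueError("run does not match the pre-registered family size")
    order = sorted(range(m), key=lambda i: pvals[i])
    qvals = [0.0] * m
    running_min = 1.0
    for rank in range(m, 0, -1):  # walk from the largest p down to the smallest
        i = order[rank - 1]
        running_min = min(running_min, pvals[i] * m / rank)
        qvals[i] = running_min
    return qvals

def status(p, q):
    """Map a test to the status labels used in the core rules."""
    if q < 0.10:
        return "confirmed"
    if p < 0.06:
        return "lead"
    return "null"

# Placeholder p-values; in a real run these come from exact permutation tests.
pvals = [0.004, 0.055, 0.30, 0.70]
qvals = bh_qvalues(pvals, FAMILY_M)
statuses = [status(p, q) for p, q in zip(pvals, qvals)]

results = {
    "family_m": FAMILY_M,
    "tests": [
        {"id": f"H{i + 1}", "p": p, "q": round(q, 4), "status": s}
        for i, (p, q, s) in enumerate(zip(pvals, qvals, statuses))
    ],
    "caveats": ["illustrative p-values, not real data"],
}
results_json = json.dumps(results, indent=2)
print(results_json)
```

Passing `FAMILY_M` explicitly and raising on a length mismatch is one way to "assert the run matches the declared m"; a test silently dropped or added after design time then fails loudly instead of shrinking or growing the family.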

Viewing results

Launch the bundled explorer over any directory of results JSONs:

python3 scripts/explorer.py <results_dir> [--port 8799] [--pattern "exp*.json"] [--sort newest|oldest]

The script generates explorer.html in the directory, starts (or reuses) a loopback HTTP server on the port, and opens the browser. The page shows an experiment list with confirmed/lead badges, a filter, sortable test tables color-coded by status, verdicts, caveats, and raw JSON. It fetches result files live, so re-running experiments updates the view; re-run the script only when new result files appear. Serve over localhost, never file:// (CDN fonts) and never on a non-loopback interface (results may contain personal statistics).
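The loopback-only constraint can be illustrated with Python's standard library. This is a sketch of the idea rather than the code in scripts/explorer.py, and the function name `start_loopback_server` is hypothetical. Binding to 127.0.0.1 instead of 0.0.0.0 keeps the results directory unreachable from other machines on the network.

```python
import functools
import http.server
import threading

def start_loopback_server(directory, port=8799):
    """Serve `directory` over HTTP on 127.0.0.1 only and return the server."""
    handler = functools.partial(
        http.server.SimpleHTTPRequestHandler, directory=directory
    )
    # The host is hard-coded to the loopback address on purpose.
    server = http.server.ThreadingHTTPServer(("127.0.0.1", port), handler)
    threading.Thread(target=server.serve_forever, daemon=True).start()
    return server
```

A caller would keep the returned object and call `server.shutdown()` followed by `server.server_close()` when finished; passing `port=0` asks the operating system for any free port.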

Evals

Run python3 evals/run_evals.py (from the skill directory) to lint an experiment script/results pair against the standards (pre-registration present, fixed literal m, exact perm usage, caveats, no raw text in outputs). A diagnostic/triage run that intentionally mints no new tests sets descriptive_only: true in its results JSON to satisfy the "has tests" check. Eval cases in evals/cases/ document expected pass/fail examples.
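One of the listed checks, "fixed literal m", could be approximated with the standard `ast` module. The sketch below is not the linter shipped in evals/run_evals.py; the function name and the variable name `FAMILY_M` are assumptions made for illustration.

```python
import ast

def family_size_is_literal(source, name="FAMILY_M"):
    """True if `name` is assigned exactly once at module level to a positive int literal."""
    tree = ast.parse(source)
    assigns = [
        node
        for node in tree.body
        if isinstance(node, ast.Assign)
        and any(isinstance(t, ast.Name) and t.id == name for t in node.targets)
    ]
    if len(assigns) != 1:
        return False
    value = assigns[0].value
    # type(...) is int rejects booleans, which are also ast.Constant values.
    return (
        isinstance(value, ast.Constant)
        and type(value.value) is int
        and value.value > 0
    )

good_script = "FAMILY_M = 4\n"
bad_script = "tests = ['h1', 'h2']\nFAMILY_M = len(tests)\n"
print(family_size_is_literal(good_script))  # True
print(family_size_is_literal(bad_script))   # False
```

Inspecting the syntax tree instead of running the script means a computed family size is rejected even when it would evaluate to the right number.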


This skill should be used when designing, running, validating, or auditing statistical experiments on personal or observational time-series data (health metrics, speech/text corpora, behavioral logs, diaries, n-of-1 self-tracking). It enforces pre-registration, exact permutation tests, FDR discipline, data-validation gates, adversarial code review, and cross-validation with external models. Triggers on "design an experiment", "test this hypothesis on my data", "is this correlation real", "audit these findings", "pre-register", "validate this dataset", or any n-of-1 / quantified-self analysis request.

251 stars

development

Updated Jun 9, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add glebis/claude-skills rigorous-experiments

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Reported % |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Jun 9, 2026, 2:02 AM · 52.7s · 15 files scanned



Install manually


# Clone the repo
git clone https://github.com/glebis/claude-skills.git

# Copy into Claude Code skills folder (global)
cp -r claude-skills/rigorous-experiments ~/.claude/skills/


Repository: glebis/claude-skills (251 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT