plugins/autonomous-dev/skills/scientific-validation/SKILL.md
Scientific method for validating claims with pre-registration, power analysis, statistical rigor, and Bayesian methods. Use when testing hypotheses, running experiments, or validating claims from papers. TRIGGER when: validate, hypothesis, experiment, backtest, evidence, statistical test. DO NOT TRIGGER when: routine coding, config changes, documentation, non-experimental tasks.
npx skillsauth add akaszubski/autonomous-dev scientific-validation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Rigorous methodology for validating claims from any source - books, papers, theories, or intuition.
Data is the arbiter. Sources can be wrong.
| Phase | Name | Key Requirement |
|-------|------|-----------------|
| 0 | Claim Verification | Understand what source ACTUALLY claims |
| 1 | Claims Extraction | Document with source citations |
| 1.5 | Publication Bias Prevention | Document ALL claims before selecting |
| 2 | Pre-Registration | Hypothesis BEFORE seeing results |
| 2.3 | Power Analysis | Calculate required n (MANDATORY) |
| 3 | Bias Prevention | Look-ahead, survivorship, selection |
| 3.5 | Walk-Forward | Required for time series (MANDATORY) |
| 4 | Statistical Requirements | p-values, effect sizes, corrections |
| 4.7 | Bayesian Complement | Bayes Factors for ambiguous results |
| 5 | Multi-Source Validation | Test across 3+ contexts |
| 5.3 | Sensitivity Analysis | ±20% parameter stability (MANDATORY) |
| 5.5 | Adversarial Review | Invoke experiment-critic agent |
| 6 | Classification | VALIDATED / REJECTED / INSUFFICIENT |
| 7 | Documentation | Complete audit trail |
| 7.3 | Negative Results | Structured failure documentation |
See: workflow.md for detailed step-by-step instructions per phase.
| Type | Testable? | Example |
|------|-----------|---------|
| PERFORMANCE | YES | "A beats B on metric X" |
| METHODOLOGICAL | YES | "A enables capability X" |
| PHILOSOPHICAL | MAYBE | "X is important because Y" |
| BEHAVIORAL | HARD | "Humans do X in situation Y" |
| Effect Size | Cohen's d | Required n (per group) |
|-------------|-----------|------------------------|
| Small | 0.2 | 394 |
| Medium | 0.5 | 64 |
| Large | 0.8 | 26 |
See: code-examples.md#power-analysis for calculation code.
| Status | Criteria |
|--------|----------|
| VALIDATED | OOS meets all criteria + critic PROCEED |
| CONDITIONAL | OOS meets relaxed criteria (p < 0.10) |
| REJECTED | OOS fails OR negative effect |
| INSUFFICIENT | n < 15 in OOS |
| UNTESTABLE | Required data unavailable |
| INVALID | Circular validation detected |
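The status table can be read as a decision procedure. The following is a minimal sketch, not the skill's own implementation: the `OosResult` record and its field names are hypothetical, and the order in which the checks are applied (INVALID, then UNTESTABLE, then INSUFFICIENT) is an assumption, since the table does not state a precedence.

```python
from dataclasses import dataclass


@dataclass
class OosResult:
    """Hypothetical summary of one out-of-sample (OOS) test."""
    n: int                                # number of OOS observations
    data_available: bool = True           # False -> UNTESTABLE
    circular: bool = False                # True -> INVALID
    negative_effect: bool = False         # effect points the wrong way
    meets_all_criteria: bool = False      # pre-registered criteria met
    meets_relaxed_criteria: bool = False  # relaxed criteria, e.g. p < 0.10
    critic_proceed: bool = False          # experiment-critic returned PROCEED


def classify(r: OosResult) -> str:
    # Assumed precedence: structural problems are checked before outcomes.
    if r.circular:
        return "INVALID"
    if not r.data_available:
        return "UNTESTABLE"
    if r.n < 15:
        return "INSUFFICIENT"
    if r.negative_effect:
        return "REJECTED"
    if r.meets_all_criteria and r.critic_proceed:
        return "VALIDATED"
    # Assumption: a result without critic PROCEED can be at most CONDITIONAL.
    if r.meets_relaxed_criteria:
        return "CONDITIONAL"
    return "REJECTED"
```

For example, `classify(OosResult(n=10, meets_all_criteria=True, critic_proceed=True))` returns `"INSUFFICIENT"` under this ordering, because the sample-size check runs before the outcome checks.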
| Metric | Minimum | Strong | Exceptional |
|--------|---------|--------|-------------|
| Sharpe Ratio | > 0.5 | > 1.0 | > 2.0 |
| Win Rate | > 55% | > 60% | > 70% |
| Profit Factor | > 1.2 | > 1.5 | > 2.0 |
See: code-examples.md#effect-thresholds for other domains.
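As an illustration of how the trading thresholds above might be applied, here is a small sketch that grades a metric value against the three tiers. The `THRESHOLDS` mapping, the `grade` function, and the "Below minimum" label are hypothetical names introduced for this example; win rate is assumed to be expressed as a fraction rather than a percentage.

```python
# Cutoffs copied from the table: (Minimum, Strong, Exceptional).
THRESHOLDS = {
    "sharpe_ratio": (0.5, 1.0, 2.0),
    "win_rate": (0.55, 0.60, 0.70),   # assumed to be a fraction, not a percent
    "profit_factor": (1.2, 1.5, 2.0),
}
TIERS = ("Minimum", "Strong", "Exceptional")


def grade(metric: str, value: float) -> str:
    """Return the highest tier whose cutoff the value strictly exceeds."""
    tier = "Below minimum"  # hypothetical label for values under every cutoff
    for name, cutoff in zip(TIERS, THRESHOLDS[metric]):
        if value > cutoff:
            tier = name
    return tier
```

The comparison is strict, matching the `>` signs in the table, so a win rate of exactly 0.55 does not reach the Minimum tier.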
| BF | Evidence |
|----|----------|
| < 1 | Supports null |
| 1-3 | Anecdotal |
| 3-10 | Moderate |
| 10-30 | Strong |
| > 30 | Very strong |
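The Bayes Factor bands can be turned into a lookup. This is a sketch under one assumption the table leaves open: values that fall exactly on a boundary (3, 10) are assigned to the higher band, while exactly 30 stays in "Strong" because the table writes "> 30". The function name `bf_evidence` is hypothetical.

```python
def bf_evidence(bf: float) -> str:
    """Map a Bayes Factor to the evidence label from the table."""
    if bf <= 0:
        raise ValueError("A Bayes Factor must be positive")
    if bf < 1:
        return "Supports null"
    if bf < 3:
        return "Anecdotal"
    if bf < 10:      # boundary handling at 3 and 10 is an assumption
        return "Moderate"
    if bf <= 30:
        return "Strong"
    return "Very strong"
```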
from statsmodels.stats.power import TTestIndPower
n = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)
Rule: Underpowered studies cannot achieve VALIDATED status.
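For a quick cross-check of the required sample sizes without the statsmodels dependency, the sketch below uses the normal approximation for a two-sided, two-sample t-test plus a small-sample correction term of z²/4. This is an approximation chosen for illustration, not what `solve_power` does internally (statsmodels solves the power equation numerically); the function name `required_n_per_group` is hypothetical.

```python
import math
from statistics import NormalDist


def required_n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate per-group n for a two-sided, two-sample t-test."""
    if d <= 0:
        raise ValueError("Effect size d must be positive")
    z = NormalDist()
    z_alpha = z.inv_cdf(1 - alpha / 2)  # critical value for two-sided alpha
    z_beta = z.inv_cdf(power)           # quantile matching the target power
    # Normal approximation plus a z_alpha^2 / 4 small-sample correction.
    n = 2 * ((z_alpha + z_beta) / d) ** 2 + z_alpha ** 2 / 4
    return math.ceil(n)


for d in (0.2, 0.5, 0.8):
    print(d, required_n_per_group(d))
```

With the default power of 0.80 and alpha of 0.05, this approximation gives 394, 64, and 26 per group for d = 0.2, 0.5, and 0.8.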
See: code-examples.md#walk-forward for implementation.
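As a rough picture of what walk-forward validation involves, the sketch below generates expanding-window splits in which every test block lies strictly after its training data. It is an illustrative assumption, not the implementation in code-examples.md; `walk_forward_splits` and its parameters are hypothetical names.

```python
from typing import Iterator, Tuple


def walk_forward_splits(
    n_samples: int, n_folds: int, min_train: int
) -> Iterator[Tuple[range, range]]:
    """Yield (train_indices, test_indices) with an expanding training window."""
    if n_folds < 1 or min_train < 1 or min_train >= n_samples:
        raise ValueError("Need n_folds >= 1 and 1 <= min_train < n_samples")
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("Not enough samples for the requested number of folds")
    for k in range(n_folds):
        train_end = min_train + k * test_size
        # Training always ends where testing begins: no future data leaks in.
        yield range(0, train_end), range(train_end, train_end + test_size)


for train, test in walk_forward_splits(n_samples=10, n_folds=3, min_train=4):
    print(list(train), list(test))
```

An expanding window is one common choice; a fixed-length rolling window would change only the start index of the training range.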
alpha_corrected = 0.05 / num_claims # Bonferroni
For trading claims: require t-ratio > 3.0 (Harvey et al. standard).
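The two rules above can be sketched as small helpers. The function names are hypothetical, and the default hurdle of 3.0 simply restates the t-ratio requirement quoted above.

```python
from typing import List, Sequence


def bonferroni_significant(p_values: Sequence[float], alpha: float = 0.05) -> List[bool]:
    """Flag which p-values survive a Bonferroni correction."""
    if not p_values:
        raise ValueError("Need at least one p-value")
    alpha_corrected = alpha / len(p_values)  # same formula as above
    return [p < alpha_corrected for p in p_values]


def passes_t_hurdle(t_ratio: float, hurdle: float = 3.0) -> bool:
    """Check the stricter t-ratio requirement for trading claims."""
    return t_ratio > hurdle


print(bonferroni_significant([0.001, 0.02, 0.04]))
```

With three claims the corrected threshold is 0.05 / 3 ≈ 0.0167, so only the first p-value in the example remains significant.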
Test ±20% parameter variation:
See: code-examples.md#sensitivity-analysis for implementation.
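One way to picture the ±20% check is to perturb each parameter in turn and re-evaluate the metric. The sketch below is an assumption-laden illustration, not the implementation in code-examples.md: `sensitivity`, `is_stable`, and `toy_metric` are hypothetical names, the stability rule (no perturbed result loses more than half of a positive baseline) is an invented placeholder, and a real criterion should be pre-registered.

```python
from typing import Callable, Dict, Tuple


def sensitivity(
    metric_fn: Callable[..., float], params: Dict[str, float], variation: float = 0.20
) -> Tuple[float, Dict[Tuple[str, int], float]]:
    """Evaluate the metric at baseline and with each parameter moved by ±variation."""
    baseline = metric_fn(**params)
    results = {}
    for name, value in params.items():
        for sign in (-1, +1):
            perturbed = dict(params)
            perturbed[name] = value * (1 + sign * variation)
            results[(name, sign)] = metric_fn(**perturbed)
    return baseline, results


def is_stable(baseline: float, results: Dict[Tuple[str, int], float],
              tolerance: float = 0.5) -> bool:
    """Hypothetical rule: every perturbed metric keeps (1 - tolerance) of the baseline."""
    return all(r >= baseline * (1 - tolerance) for r in results.values())


def toy_metric(lookback: float, threshold: float) -> float:
    # Made-up metric that peaks at lookback=20, threshold=0.5.
    return 1.0 - abs(lookback - 20) / 100 - abs(threshold - 0.5)


baseline, results = sensitivity(toy_metric, {"lookback": 20, "threshold": 0.5})
print(baseline, is_stable(baseline, results))
```

Varying one parameter at a time keeps the number of evaluations at two per parameter; it does not detect interactions between parameters.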
Use Task tool:
subagent_type: "experiment-critic"
prompt: "Review experiment EXP-XXX"
MANDATORY before any classification.
| Bias | Prevention |
|------|------------|
| Look-ahead | Process data sequentially, compare batch vs streaming |
| Survivorship | Track ALL attempts, not just completions |
| Selection | Report ALL experiments including failures |
| Data snooping | Strict train/test split, no tuning on test data |
| Publication | Document ALL claims before selecting which to test |
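The "compare batch vs streaming" check for look-ahead bias can be sketched as follows: compute a feature once over the full series, then recompute it step by step using only the data seen so far, and flag any disagreement. All function names here are hypothetical, and the two demeaning functions are toy features chosen to show one leaking and one clean case.

```python
from typing import Callable, List, Sequence

Feature = Callable[[Sequence[float]], List[float]]


def streaming_feature(values: Sequence[float], feature_fn: Feature) -> List[float]:
    """Recompute the feature at each step from the prefix seen so far."""
    return [feature_fn(values[: i + 1])[-1] for i in range(len(values))]


def has_look_ahead(values: Sequence[float], feature_fn: Feature, tol: float = 1e-9) -> bool:
    """True if the batch result differs from the streaming result anywhere."""
    batch = feature_fn(values)
    stream = streaming_feature(values, feature_fn)
    return any(abs(b - s) > tol for b, s in zip(batch, stream))


def demean_full(values: Sequence[float]) -> List[float]:
    # Leaks: the mean is taken over the whole series, including the future.
    mean = sum(values) / len(values)
    return [v - mean for v in values]


def demean_expanding(values: Sequence[float]) -> List[float]:
    # Clean: each point is demeaned with the mean of data up to that point.
    out, total = [], 0.0
    for i, v in enumerate(values, start=1):
        total += v
        out.append(v - total / i)
    return out


series = [1.0, 2.0, 4.0, 8.0]
print(has_look_ahead(series, demean_full), has_look_ahead(series, demean_expanding))
```

The streaming recomputation is quadratic in the series length, which is acceptable for a one-off audit but not for production feature pipelines.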
| Topic | File |
|-------|------|
| Step-by-step workflow | workflow.md |
| Python code examples | code-examples.md |
| Markdown templates | templates.md |
| Adversarial review | ../../agents/experiment-critic.md |
FORBIDDEN:
REQUIRED: