akaszubski/scientific-validation

Name: scientific-validation
Author: akaszubski

plugins/autonomous-dev/skills/scientific-validation/SKILL.md

npx skillsauth add akaszubski/autonomous-dev scientific-validation

Scientific Validation Skill

Rigorous methodology for validating claims from any source - books, papers, theories, or intuition.

When This Skill Activates

Testing claims from books, papers, or expert sources
Validating rules, strategies, or hypotheses
Running experiments or backtests
Keywords: "validate", "test hypothesis", "experiment", "backtest", "prove", "evidence"

Core Principle

Data is the arbiter. Sources can be wrong.

Expert books can be wrong
Only empirical validation decides what works
Document negative results - they're valuable

Phase Overview

| Phase | Name | Key Requirement |
|-------|------|-----------------|
| 0 | Claim Verification | Understand what source ACTUALLY claims |
| 1 | Claims Extraction | Document with source citations |
| 1.5 | Publication Bias Prevention | Document ALL claims before selecting |
| 2 | Pre-Registration | Hypothesis BEFORE seeing results |
| 2.3 | Power Analysis | Calculate required n (MANDATORY) |
| 3 | Bias Prevention | Look-ahead, survivorship, selection |
| 3.5 | Walk-Forward | Required for time series (MANDATORY) |
| 4 | Statistical Requirements | p-values, effect sizes, corrections |
| 4.7 | Bayesian Complement | Bayes Factors for ambiguous results |
| 5 | Multi-Source Validation | Test across 3+ contexts |
| 5.3 | Sensitivity Analysis | ±20% parameter stability (MANDATORY) |
| 5.5 | Adversarial Review | Invoke experiment-critic agent |
| 6 | Classification | VALIDATED / REJECTED / INSUFFICIENT |
| 7 | Documentation | Complete audit trail |
| 7.3 | Negative Results | Structured failure documentation |

See: workflow.md for detailed step-by-step instructions per phase.

Quick Reference

Claim Types

| Type | Testable? | Example |
|------|-----------|---------|
| PERFORMANCE | YES | "A beats B on metric X" |
| METHODOLOGICAL | YES | "A enables capability X" |
| PHILOSOPHICAL | MAYBE | "X is important because Y" |
| BEHAVIORAL | HARD | "Humans do X in situation Y" |

Sample Size Requirements (80% Power)

| Effect Size | Cohen's d | Required n (per group) |
|-------------|-----------|------------------------|
| Small | 0.2 | 394 |
| Medium | 0.5 | 64 |
| Large | 0.8 | 26 |

See: code-examples.md#power-analysis for calculation code.

Classification Criteria

| Status | Criteria |
|--------|----------|
| VALIDATED | OOS meets all criteria + critic PROCEED |
| CONDITIONAL | OOS meets relaxed criteria (p < 0.10) |
| REJECTED | OOS fails OR negative effect |
| INSUFFICIENT | n < 15 in OOS |
| UNTESTABLE | Required data unavailable |
| INVALID | Circular validation detected |
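
The criteria above can be read as a small decision procedure. The sketch below is one possible encoding, not the skill's own implementation: the `OosResult` fields, the order in which the checks run, and the treatment of a zero effect are all assumptions made for illustration.

```python
from dataclasses import dataclass


@dataclass
class OosResult:
    """Hypothetical summary of an out-of-sample (OOS) experiment."""
    data_available: bool
    circular_validation: bool
    n_oos: int
    effect: float               # signed effect size; negative means harmful
    p_value: float
    meets_all_criteria: bool    # pre-registered success criteria met in OOS
    critic_verdict: str         # e.g. "PROCEED" from the adversarial review


def classify(r: OosResult) -> str:
    # Assumed precedence: structural problems first, then sample size,
    # then the outcome itself.
    if not r.data_available:
        return "UNTESTABLE"
    if r.circular_validation:
        return "INVALID"
    if r.n_oos < 15:
        return "INSUFFICIENT"
    if r.effect < 0:
        return "REJECTED"
    if r.meets_all_criteria and r.critic_verdict == "PROCEED":
        return "VALIDATED"
    if r.p_value < 0.10:  # relaxed criteria
        return "CONDITIONAL"
    return "REJECTED"
```

Putting the structural checks first means a result with circular validation can never be reported as merely REJECTED, which keeps the audit trail honest about why it failed.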

Domain Effect Thresholds (Trading)

| Metric | Minimum | Strong | Exceptional |
|--------|---------|--------|-------------|
| Sharpe Ratio | > 0.5 | > 1.0 | > 2.0 |
| Win Rate | > 55% | > 60% | > 70% |
| Profit Factor | > 1.2 | > 1.5 | > 2.0 |

See: code-examples.md#effect-thresholds for other domains.
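
As an illustration of how the trading thresholds might be applied, the following sketch grades a metric against the table and computes win rate and profit factor from a list of per-trade returns. The function names and the `THRESHOLDS` layout are hypothetical. Sharpe ratio calculation is left out on purpose, because the source does not say whether its thresholds refer to annualized values.

```python
# Thresholds copied from the table: (minimum, strong, exceptional).
THRESHOLDS = {
    "sharpe_ratio": (0.5, 1.0, 2.0),
    "win_rate": (0.55, 0.60, 0.70),
    "profit_factor": (1.2, 1.5, 2.0),
}


def grade(metric: str, value: float) -> str:
    """Return the highest tier whose strict '>' threshold the value clears."""
    minimum, strong, exceptional = THRESHOLDS[metric]
    if value > exceptional:
        return "exceptional"
    if value > strong:
        return "strong"
    if value > minimum:
        return "minimum"
    return "below minimum"


def trade_metrics(returns: list[float]) -> dict[str, float]:
    """Win rate and profit factor from per-trade returns."""
    if not returns:
        raise ValueError("returns must not be empty")
    wins = [r for r in returns if r > 0]
    gross_loss = -sum(r for r in returns if r < 0)
    return {
        "win_rate": len(wins) / len(returns),
        # Profit factor = gross profit / gross loss (infinite with no losses).
        "profit_factor": sum(wins) / gross_loss if gross_loss > 0 else float("inf"),
    }
```

For example, `grade("win_rate", 0.55)` returns `"below minimum"` because the table uses strict inequalities.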

Bayes Factor Interpretation

| BF | Evidence |
|----|----------|
| < 1 | Supports null |
| 1-3 | Anecdotal |
| 3-10 | Moderate |
| 10-30 | Strong |
| > 30 | Very strong |
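
A minimal lookup for the table above might look like the following. The table does not say which band owns the shared boundary values 3 and 10, so assigning them to the higher band is an assumption of this sketch. The function name is hypothetical.

```python
def interpret_bayes_factor(bf: float) -> str:
    """Map a Bayes factor (evidence for H1 over H0) to the table's labels."""
    if bf <= 0:
        raise ValueError("a Bayes factor must be positive")
    if bf < 1:
        return "Supports null"
    if bf < 3:
        return "Anecdotal"
    if bf < 10:        # assumption: exactly 3 counts as Moderate
        return "Moderate"
    if bf <= 30:       # assumption: exactly 10 counts as Strong
        return "Strong"
    return "Very strong"
```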

Critical Rules

1. Pre-Registration

Document hypothesis BEFORE seeing any results
Define success criteria BEFORE testing
No peeking at test data

2. Power Analysis (Phase 2.3)

from statsmodels.stats.power import TTestIndPower
# Per-group n for a two-sample t-test; the result is a float, so round it up
n = TTestIndPower().solve_power(effect_size=0.5, power=0.80, alpha=0.05)

Rule: Underpowered studies cannot achieve VALIDATED status.
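
When statsmodels is not available, the sample sizes in the table can be cross-checked with the standard library. The sketch below uses the normal approximation for a two-sided, two-sample comparison plus a small-sample correction term of z²/4. It is an approximation, not the exact t-test calculation, and the function names are hypothetical.

```python
import math
from statistics import NormalDist


def required_n_per_group(d: float, power: float = 0.80, alpha: float = 0.05) -> int:
    """Approximate per-group n for a two-sided, two-sample t-test."""
    if d <= 0:
        raise ValueError("effect size d must be positive")
    z_alpha = NormalDist().inv_cdf(1 - alpha / 2)
    z_power = NormalDist().inv_cdf(power)
    # Normal approximation plus a z^2/4 correction for using a t-test.
    n = 2 * ((z_alpha + z_power) / d) ** 2 + z_alpha ** 2 / 4
    return math.ceil(n)


def is_adequately_powered(n_per_group: int, d: float) -> bool:
    """True when the achieved per-group n meets the requirement."""
    return n_per_group >= required_n_per_group(d)
```

Under these assumptions the function gives 394, 64, and 26 for d = 0.2, 0.5, and 0.8, matching the Sample Size Requirements table.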

3. Walk-Forward for Time Series (Phase 3.5)

Standard K-fold CV → INVALID (temporal leakage)
Single train/test → CONDITIONAL at best
Walk-forward → Can achieve VALIDATED

See: code-examples.md#walk-forward for implementation.
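
For orientation, a walk-forward split can be sketched in a few lines. This is not the implementation in code-examples.md. It assumes an expanding training window and equally sized, non-overlapping test blocks, and the function and parameter names are hypothetical.

```python
from typing import Iterator


def walk_forward_splits(
    n_samples: int, n_folds: int, min_train: int
) -> Iterator[tuple[range, range]]:
    """Yield (train, test) index ranges; training always precedes testing."""
    if n_folds < 1 or min_train < 1:
        raise ValueError("n_folds and min_train must be at least 1")
    test_size = (n_samples - min_train) // n_folds
    if test_size < 1:
        raise ValueError("not enough samples for the requested folds")
    for k in range(n_folds):
        train_end = min_train + k * test_size  # expanding training window
        yield range(0, train_end), range(train_end, train_end + test_size)


splits = list(walk_forward_splits(n_samples=100, n_folds=4, min_train=60))
```

Each test block starts exactly where its training window ends, so no observation from the future can leak into training, which is the property standard K-fold CV lacks.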

4. Multiple Comparison Correction

alpha_corrected = 0.05 / num_claims  # Bonferroni

For trading claims: require t-ratio > 3.0 (Harvey et al. standard).
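
A short sketch of both checks, assuming each claim has a single p-value and that the t-ratio hurdle is applied as a strict inequality. The function names and claim labels are hypothetical.

```python
def bonferroni_significant(
    p_values: dict[str, float], alpha: float = 0.05
) -> dict[str, bool]:
    """Flag each claim as significant at the Bonferroni-corrected alpha."""
    if not p_values:
        raise ValueError("at least one claim is required")
    alpha_corrected = alpha / len(p_values)
    return {claim: p < alpha_corrected for claim, p in p_values.items()}


def passes_trading_hurdle(t_ratio: float, threshold: float = 3.0) -> bool:
    """Stricter hurdle for trading claims: t-ratio must exceed 3.0."""
    return t_ratio > threshold


# Three claims tested together: corrected alpha is 0.05 / 3, about 0.0167.
result = bonferroni_significant({"A": 0.004, "B": 0.02, "C": 0.30})
```

Claim B would pass at an uncorrected alpha of 0.05 but fails once three claims are tested together.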

5. Sensitivity Analysis (Phase 5.3)

Test ±20% parameter variation:

All variations positive → Can achieve VALIDATED
1-2 sign flips → CONDITIONAL at best
3+ sign flips → REJECTED (fragile)

See: code-examples.md#sensitivity-analysis for implementation.
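
The sign-flip rule above can be sketched as follows. The sketch varies one parameter at a time, which is an assumption: the source does not say whether parameters are varied individually or jointly. The `evaluate` callback is a hypothetical stand-in for whatever computes the effect for a given parameter set.

```python
from typing import Callable


def sensitivity_ceiling(
    evaluate: Callable[[dict[str, float]], float],
    params: dict[str, float],
    variation: float = 0.20,
) -> tuple[int, str]:
    """Vary each parameter by +/-variation and count non-positive effects."""
    flips = 0
    for name, value in params.items():
        for factor in (1 - variation, 1 + variation):
            varied = {**params, name: value * factor}
            if evaluate(varied) <= 0:  # effect lost its positive sign
                flips += 1
    if flips == 0:
        return flips, "VALIDATED"    # can still achieve VALIDATED
    if flips <= 2:
        return flips, "CONDITIONAL"  # CONDITIONAL at best
    return flips, "REJECTED"         # fragile
```

The returned status is a ceiling, not a verdict: zero sign flips means only that sensitivity analysis does not block VALIDATED.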

6. Adversarial Review (Phase 5.5)

Use Task tool:
  subagent_type: "experiment-critic"
  prompt: "Review experiment EXP-XXX"

MANDATORY before any classification.

Bias Prevention Checklist

| Bias | Prevention |
|------|------------|
| Look-ahead | Process data sequentially, compare batch vs streaming |
| Survivorship | Track ALL attempts, not just completions |
| Selection | Report ALL experiments including failures |
| Data snooping | Strict train/test split, no tuning on test data |
| Publication | Document ALL claims before selecting which to test |
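
The "compare batch vs streaming" check for look-ahead bias can be illustrated with a small test harness. The idea is that a causal feature at time t must not change when later observations are appended. All names below are hypothetical, and the two feature functions exist only to show one passing and one failing case.

```python
from typing import Callable, Sequence


def has_lookahead(
    feature_fn: Callable[[Sequence[float]], list[float]],
    data: Sequence[float],
) -> bool:
    """True if any feature value differs between batch and streaming runs."""
    batch = feature_fn(data)
    for t in range(1, len(data) + 1):
        streamed = feature_fn(data[:t])  # only data available at time t
        if streamed[t - 1] != batch[t - 1]:
            return True
    return False


def causal_mean(xs: Sequence[float]) -> list[float]:
    """Running mean that uses only observations up to each point."""
    out, total = [], 0.0
    for i, x in enumerate(xs, start=1):
        total += x
        out.append(total / i)
    return out


def leaky_demean(xs: Sequence[float]) -> list[float]:
    """Subtracts the full-sample mean, which depends on future values."""
    m = sum(xs) / len(xs)
    return [x - m for x in xs]
```

Exact equality works here because both runs perform the same arithmetic in the same order. A real pipeline would likely need a tolerance.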

Pre-Experiment Checklist

[ ] Claim extracted with source citation
[ ] ALL claims documented (not just tested ones)
[ ] Hypothesis documented BEFORE results
[ ] Power analysis: required n calculated
[ ] Success criteria defined
[ ] Walk-forward configured (time series)
[ ] Costs/constraints specified

Post-Experiment Checklist

[ ] Sample size adequate per power analysis
[ ] p-value AND effect size reported
[ ] Bayesian analysis if ambiguous
[ ] Sensitivity analysis passed
[ ] Adversarial review completed
[ ] Negative results documented if REJECTED

Red Flags

100% success rate → Possible bias
OOS better than training → Possible leakage
Result flips with ±20% params → Fragile
Only tested "interesting" claims → Selection bias

Key Principles

Hypothesis BEFORE data - No peeking
Power analysis BEFORE experiment - Know required n
Walk-forward for time series - Preserve temporal order
Sensitivity analysis - Results must survive ±20% changes
Adversarial self-critique - Challenge your methodology
Document negative results - Failures are valuable
Sources can be wrong - Even experts, even textbooks

Detailed Documentation

| Topic | File |
|-------|------|
| Step-by-step workflow | workflow.md |
| Python code examples | code-examples.md |
| Markdown templates | templates.md |
| Adversarial review | ../../agents/experiment-critic.md |

Hard Rules

FORBIDDEN:

Reporting results without confidence intervals or statistical significance
Cherry-picking favorable metrics while ignoring unfavorable ones
Claiming causation from correlation without controlled experiments

REQUIRED:

All experiments MUST have a documented hypothesis before execution
All results MUST include sample size, variance, and statistical test used
Negative results MUST be reported with the same rigor as positive results
Baselines MUST be established and compared against for every metric
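
As a rough illustration of the reporting requirements, the sketch below summarizes a sample with its size, variance, and a confidence interval for the mean. It uses a normal approximation for the interval, which is an assumption. For small samples a t-based interval would be more appropriate. The function name and the dictionary keys are hypothetical.

```python
import math
from statistics import NormalDist, mean, variance


def summarize(sample: list[float], confidence: float = 0.95) -> dict[str, object]:
    """Report n, mean, sample variance, and a normal-approximation CI."""
    n = len(sample)
    if n < 2:
        raise ValueError("need at least two observations for a variance")
    m = mean(sample)
    var = variance(sample)  # sample variance (n - 1 denominator)
    z = NormalDist().inv_cdf(0.5 + confidence / 2)
    half_width = z * math.sqrt(var / n)
    return {
        "n": n,
        "mean": m,
        "variance": var,
        "ci_low": m - half_width,
        "ci_high": m + half_width,
        "method": "normal-approximation CI for the mean",
    }


report = summarize([1.0, 2.0, 3.0, 4.0, 5.0])
```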

Scientific method for validating claims with pre-registration, power analysis, statistical rigor, and Bayesian methods. Use when testing hypotheses, running experiments, or validating claims from papers. TRIGGER when: validate, hypothesis, experiment, backtest, evidence, statistical test. DO NOT TRIGGER when: routine coding, config changes, documentation, non-experimental tasks.

19 stars

development

Updated Apr 3, 2026

npx skillsauth add akaszubski/autonomous-dev scientific-validation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 3, 2026, 9:19 AM (234.7s, 1 file scanned)

SKILL.md

name: scientific-validation
description: Scientific method for validating claims with pre-registration, power analysis, statistical rigor, and Bayesian methods. Use when testing hypotheses, running experiments, or validating claims from papers. TRIGGER when: validate, hypothesis, experiment, backtest, evidence, statistical test. DO NOT TRIGGER when: routine coding, config changes, documentation, non-experimental tasks.
allowed-tools: [Read, Grep, Glob, Bash, Write]

Related Skills

akaszubski/testing-guide

development

GenAI-first testing with structural assertions, congruence validation, and tier-based test structure. Use when writing tests, setting up test infrastructure, or validating coverage. TRIGGER when: test, pytest, coverage, TDD, test patterns, congruence, validation. DO NOT TRIGGER when: production code implementation, documentation, config-only changes.

30 · SKILL.md · Updated Apr 3, 2026

akaszubski/content-allocation

development

One topic, one home. Routes content to its canonical store (CLAUDE.md, PROJECT.md, MEMORY.md, docs/, memory/) and audits for duplication. TRIGGER when: auditing CLAUDE.md/PROJECT.md/MEMORY.md sizes, deduplicating docs, applying the content-allocation pattern to a new repo, running /align --content. DO NOT TRIGGER when: implementing features, writing tests, routine code edits, debugging.

28 · SKILL.md · Updated May 28, 2026

akaszubski/prompt-engineering

testing

Prompt engineering patterns for writing agent prompts and skill files — constraint budgets, register shifting, HARD GATE patterns, anti-personas. Use when writing or reviewing agents/*.md or skills/*/SKILL.md. TRIGGER when: agent prompt, skill file, prompt engineering, model-tier compensation, HARD GATE, prompt quality. DO NOT TRIGGER when: user-facing docs, README, CHANGELOG, config files.

21 · SKILL.md · Updated Apr 15, 2026

akaszubski/planning-workflow

testing

7-step planning workflow for pre-implementation design. Enforced by plan_gate hook, critiqued by plan-critic agent. Use when creating plans, design documents, or architecture decisions before implementation. TRIGGER when: plan, planning, /plan, design document, architecture decision. DO NOT TRIGGER when: implementation, coding, testing.

21 · SKILL.md · Updated Apr 15, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/akaszubski/autonomous-dev.git

# Copy into the Claude Code skills folder (global), creating it first if needed
mkdir -p ~/.claude/skills
cp -r autonomous-dev/plugins/autonomous-dev/skills/scientific-validation ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

akaszubski/autonomous-dev

19 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT