casemark/conducting-backtest-validation

Name: conducting-backtest-validation
Author: casemark

skills/capital/conducting-backtest-validation/SKILL.md

npx skillsauth add casemark/skills conducting-backtest-validation

Conducting Backtest Validation

Structures backtesting methodology with out-of-sample testing, cross-validation, and overfitting detection techniques for systematic and factor-based investment strategies.

When To Use

Validating a new trading strategy or alpha signal before live deployment
Auditing an existing backtest for overfitting, look-ahead bias, or survivorship bias
Comparing multiple strategy variants to select a robust candidate
Reviewing third-party backtest results (fund managers, vendors, research papers)
Stress-testing a strategy across regime changes, drawdown periods, or tail events

Inputs To Gather

Strategy specification: signal logic, universe definition, rebalance frequency, position sizing rules, and transaction cost assumptions
Data sources: price/return series, factor exposures, benchmark indices — confirm whether adjusted for survivorship bias and corporate actions
Backtest parameters: start/end dates, in-sample vs. out-of-sample split dates, walk-forward window lengths
Cost model: commissions, slippage estimates, borrowing costs, market impact assumptions [VERIFY against actual execution data if available]
Benchmark(s): relevant index or factor portfolio for performance attribution
Prior test count: number of strategy variations already tested on the same dataset (needed for multiple-testing adjustment)

Workflow

1. Data Integrity Audit

Confirm no look-ahead bias: signals use only information available at decision time
Check for survivorship bias in the universe (delisted securities, index reconstitution)
Verify point-in-time correctness for fundamental data (restatements, reporting lags)
Validate price data for stock splits, dividends, and corporate actions
Flag any gaps, stale prices, or anomalous returns exceeding ±50% in a single day
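The mechanical checks in this step (gaps, stale prices, returns beyond ±50%) can be sketched as below. This is a minimal illustration under stated assumptions, not the skill's own implementation: the function name `audit_prices`, the 5-day gap limit, and the 5-observation stale-price rule are hypothetical choices, and prices are assumed to be positive and already adjusted for corporate actions.

```python
from datetime import date, timedelta


def audit_prices(dates, closes, max_abs_return=0.50, max_gap_days=5, stale_run=5):
    """Return (date, issue) flags for one price series.

    dates  : sorted list of datetime.date
    closes : positive adjusted closes, same length as dates
    The gap and stale-price limits are illustrative defaults, not from the source.
    """
    flags = []
    unchanged = 0  # consecutive observations with an identical close
    for i in range(1, len(closes)):
        prev, curr = closes[i - 1], closes[i]
        # Anomalous move: single-period simple return beyond the limit.
        ret = curr / prev - 1.0
        if abs(ret) > max_abs_return:
            flags.append((dates[i], f"return {ret:+.1%}"))
        # Gap: more calendar days between observations than allowed.
        gap = (dates[i] - dates[i - 1]).days
        if gap > max_gap_days:
            flags.append((dates[i], f"gap of {gap} days"))
        # Stale price: the close has repeated for `stale_run` observations.
        unchanged = unchanged + 1 if curr == prev else 0
        if unchanged == stale_run:
            flags.append((dates[i], f"stale for {stale_run} observations"))
    return flags


start = date(2024, 1, 1)
dates = [start + timedelta(days=d) for d in (0, 1, 2, 3, 4, 5, 6, 20, 21)]
closes = [100.0, 101.0, 101.0, 101.0, 101.0, 101.0, 101.0, 40.0, 41.0]
issues = audit_prices(dates, closes)
for when, issue in issues:
    print(when, issue)
```

Look-ahead, survivorship, and point-in-time checks depend on how the data vendor timestamps records, so they are not covered by this sketch.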

2. In-Sample / Out-of-Sample Design

Split the dataset with a minimum 30% out-of-sample holdout; prefer chronological split over random
Define walk-forward windows: typical choices are 3–5 year in-sample with 1-year forward test steps
For shorter histories, use k-fold combinatorial purged cross-validation (CPCV) with an embargo period equal to the strategy's maximum holding period to prevent leakage
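The three designs above can be expressed as index generators. The sketch below is illustrative only: the function names are hypothetical, and the last function is a plain purged k-fold with a symmetric embargo buffer, a simpler stand-in for full CPCV (which also enumerates combinations of test folds). It assumes observations are equally spaced and that the embargo is given in observations.

```python
def chronological_split(n_obs, oos_fraction=0.30):
    """Return (in_sample, out_of_sample) index ranges; OOS is the latest data."""
    n_oos = round(n_obs * oos_fraction)
    cut = n_obs - n_oos
    return range(0, cut), range(cut, n_obs)


def walk_forward_windows(n_obs, train_len, test_len):
    """Yield (train, test) index ranges, rolling forward by test_len each step."""
    start = 0
    while start + train_len + test_len <= n_obs:
        split = start + train_len
        yield range(start, split), range(split, split + test_len)
        start += test_len


def purged_kfold(n_obs, k, embargo):
    """Yield (train_indices, test_range) over k contiguous test folds.

    Training observations closer than `embargo` periods to either edge of
    the test fold are dropped so overlapping holding periods cannot leak.
    """
    fold = n_obs // k
    for j in range(k):
        lo = j * fold
        hi = n_obs if j == k - 1 else lo + fold
        train = [i for i in range(n_obs) if i < lo - embargo or i >= hi + embargo]
        yield train, range(lo, hi)


in_sample, holdout = chronological_split(100)
# Eight yearly observations, 3-year training window, 1-year test step.
windows = list(walk_forward_windows(n_obs=8, train_len=3, test_len=1))
folds = list(purged_kfold(n_obs=20, k=4, embargo=2))
print(len(in_sample), len(holdout), len(windows), folds[1][0])
```

Setting `embargo` to the strategy's maximum holding period, as the step recommends, means no training label can overlap a test observation.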

3. Overfitting Detection

Multiple-testing adjustment: apply the Deflated Sharpe Ratio (DSR) or Bonferroni/BHY correction based on the total number of strategy trials
Parameter sensitivity: vary key parameters ±20% and check whether Sharpe ratio degrades more than 30% — flag fragile strategies
Minimum Backtest Length (MinBTL): estimate required sample size for statistical significance given observed Sharpe; reject if actual history is shorter [VERIFY formula assumptions against strategy frequency]
Probability of Backtest Overfitting (PBO): run CPCV and compute the share of train/test combinations in which the variant that ranks best in-sample falls below the median of all variants out-of-sample; PBO > 0.40 is a red flag
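As one concrete instance of the multiple-testing adjustment, the Deflated Sharpe Ratio can be sketched as follows. This follows the commonly cited Bailey and López de Prado formulation as I understand it; treat it as a sketch to check against the original paper rather than a reference implementation. The function names are hypothetical, Sharpe ratios are per period (not annualized), `sr_std` is the standard deviation of Sharpe estimates across all trials (an input the analyst must supply), and `n_trials` must be at least 2.

```python
import math
from statistics import NormalDist, mean, pstdev

EULER_GAMMA = 0.5772156649015329
_NORMAL = NormalDist()


def expected_max_sharpe(n_trials, sr_std):
    """Approximate expected best Sharpe among n_trials when true skill is zero."""
    a = _NORMAL.inv_cdf(1.0 - 1.0 / n_trials)
    b = _NORMAL.inv_cdf(1.0 - 1.0 / (n_trials * math.e))
    return sr_std * ((1.0 - EULER_GAMMA) * a + EULER_GAMMA * b)


def deflated_sharpe_ratio(returns, n_trials, sr_std):
    """Probability that the observed Sharpe exceeds the expected-maximum hurdle."""
    n = len(returns)
    mu, sigma = mean(returns), pstdev(returns)
    sr = mu / sigma
    # Sample skewness and (non-excess) kurtosis of the return series.
    skew = sum((r - mu) ** 3 for r in returns) / (n * sigma ** 3)
    kurt = sum((r - mu) ** 4 for r in returns) / (n * sigma ** 4)
    hurdle = expected_max_sharpe(n_trials, sr_std)
    denom = math.sqrt(1.0 - skew * sr + (kurt - 1.0) / 4.0 * sr ** 2)
    return _NORMAL.cdf((sr - hurdle) * math.sqrt(n - 1) / denom)


# Deterministic toy series, purely for illustration.
returns = [0.001 + 0.01 * math.sin(i) for i in range(500)]
dsr_few = deflated_sharpe_ratio(returns, n_trials=10, sr_std=0.05)
dsr_many = deflated_sharpe_ratio(returns, n_trials=100, sr_std=0.05)
print(round(dsr_few, 3), round(dsr_many, 3))
```

The same return series scores lower as the trial count rises, which is why the prior test count is listed among the inputs to gather.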

4. Performance & Risk Decomposition

Report annualized return, volatility, Sharpe ratio, Sortino ratio, max drawdown, and Calmar ratio for both IS and OOS periods
Decompose returns via factor attribution (market, size, value, momentum, quality at minimum) to isolate residual alpha
Examine hit rate, profit factor, and average win/loss ratio at the trade level
Compute turnover and net-of-cost performance; reject strategies where costs consume >50% of gross alpha
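The headline metrics and the cost screen in this step can be computed as sketched below. The conventions are assumptions, not taken from the source: simple periodic returns, geometric annualization, a Sortino target equal to the risk-free rate, and hypothetical function names. Factor attribution and trade-level statistics are outside this sketch.

```python
import math
from statistics import mean, stdev


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def performance_summary(returns, periods_per_year=252, risk_free=0.0):
    """Headline metrics for one period (call once for IS, once for OOS)."""
    rf = risk_free / periods_per_year
    excess = [r - rf for r in returns]
    growth = math.prod(1.0 + r for r in returns)
    ann_return = growth ** (periods_per_year / len(returns)) - 1.0
    scale = math.sqrt(periods_per_year)
    # Downside deviation: root mean square of below-target excess returns.
    downside = math.sqrt(mean(min(e, 0.0) ** 2 for e in excess))
    mdd = max_drawdown(returns)
    return {
        "ann_return": ann_return,
        "ann_vol": stdev(returns) * scale,
        "sharpe": mean(excess) / stdev(excess) * scale,
        "sortino": mean(excess) / downside * scale if downside else math.inf,
        "max_drawdown": mdd,
        "calmar": ann_return / mdd if mdd else math.inf,
    }


def cost_share(gross_alpha, net_alpha):
    """Fraction of gross alpha consumed by costs; above 0.50 means reject."""
    return (gross_alpha - net_alpha) / gross_alpha


summary = performance_summary([0.10, -0.50, 0.20], periods_per_year=3)
share = cost_share(gross_alpha=0.08, net_alpha=0.03)
print(summary["max_drawdown"], share > 0.50)
```

Running `performance_summary` separately on the in-sample and out-of-sample returns gives the two columns needed for the side-by-side comparison.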

5. Regime & Stress Analysis

Segment performance by market regime: rising rates, falling rates, high vol (VIX > 25), low vol, recession (NBER-dated), expansion
Identify maximum drawdown duration and recovery period
Run Monte Carlo reshuffling of trade returns to build confidence intervals around key metrics
Test sensitivity to execution delay (T+0 vs. T+1 vs. T+2 entry)
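A minimal sketch of the Monte Carlo reshuffling step follows. The function names, path count, and percentile method are hypothetical choices. Note one limitation: reshuffling without replacement keeps the total compounded return fixed, so it is informative only for path-dependent metrics such as maximum drawdown; resampling with replacement would be needed to put an interval around the mean return or Sharpe ratio.

```python
import random


def max_drawdown(returns):
    """Largest peak-to-trough loss of the compounded equity curve."""
    equity = peak = 1.0
    worst = 0.0
    for r in returns:
        equity *= 1.0 + r
        peak = max(peak, equity)
        worst = max(worst, 1.0 - equity / peak)
    return worst


def reshuffled_drawdown_interval(trade_returns, n_paths=2000, level=0.95, seed=0):
    """Percentile interval for max drawdown across random trade orderings."""
    rng = random.Random(seed)  # fixed seed so the run can be reproduced
    path = list(trade_returns)
    draws = []
    for _ in range(n_paths):
        rng.shuffle(path)
        draws.append(max_drawdown(path))
    draws.sort()
    tail = (1.0 - level) / 2.0
    lower = draws[int(tail * (n_paths - 1))]
    upper = draws[int((1.0 - tail) * (n_paths - 1))]
    return lower, upper


# Illustrative trade returns only, not real results.
trades = [0.02, -0.01, 0.03, -0.04, 0.01, 0.05, -0.02, -0.03, 0.04, -0.01] * 5
lower, upper = reshuffled_drawdown_interval(trades)
print(f"95% interval for max drawdown: {lower:.1%} to {upper:.1%}")
```

If the drawdown realized in the backtest sits near the lower end of this interval, the historical ordering of trades was unusually favorable.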

6. Replication & Documentation

Record exact signal definitions, universe filters, and rebalance rules so the backtest is fully reproducible
Log software version, random seeds, and data vendor/snapshot date
Archive parameter search space and total trial count for future multiple-testing reference
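One lightweight way to capture these items is a JSON manifest written alongside each backtest run. The sketch below is illustrative: the function name `build_manifest`, every field name, and the example values are hypothetical rather than part of the skill.

```python
import hashlib
import json
import platform


def build_manifest(strategy, rules, seed, data_vendor, snapshot_date,
                   search_space, trial_count):
    """Collect what is needed to rerun the backtest and to adjust for trials."""
    manifest = {
        "strategy": strategy,
        "rules": rules,                    # signal, universe filter, rebalance
        "python_version": platform.python_version(),
        "random_seed": seed,
        "data_vendor": data_vendor,
        "data_snapshot_date": snapshot_date,
        "search_space": search_space,      # every parameter grid that was tried
        "trial_count": trial_count,        # input to later multiple-testing work
    }
    # Fingerprint of the sorted content makes silent edits detectable.
    payload = json.dumps(manifest, sort_keys=True).encode("utf-8")
    manifest["fingerprint"] = hashlib.sha256(payload).hexdigest()
    return manifest


manifest = build_manifest(
    strategy="example-momentum",  # placeholder name
    rules={"signal": "12-1 momentum", "universe": "top 500 by cap", "rebalance": "monthly"},
    seed=42,
    data_vendor="example-vendor",  # placeholder vendor
    snapshot_date="2026-04-25",
    search_space={"lookback_months": [6, 9, 12], "holding_months": [1, 3]},
    trial_count=6,
)
print(json.dumps(manifest, indent=2))
```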

Output

Produce a Backtest Validation Report containing:

Executive summary: strategy description, headline OOS metrics, and pass/fail recommendation
Data quality findings: any biases detected, data gaps, or corrections applied
IS vs. OOS comparison table: side-by-side metrics with statistical significance notes
Overfitting diagnostics: DSR, PBO score, parameter sensitivity heatmap
Factor attribution: gross vs. residual alpha, factor loading stability over time
Regime analysis: performance table segmented by macro regime
Cost impact: gross vs. net Sharpe, breakeven cost threshold
Recommendation: deploy, refine, or reject — with specific conditions or thresholds for promotion to paper trading

Quality Checks

OOS Sharpe ratio is statistically distinguishable from zero at the 95% level (t-stat > 1.96 after multiple-testing adjustment)
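PBO < 0.40 and the Sharpe ratio remains positive after deflating for all trials (it exceeds the expected maximum Sharpe under the null, which is equivalent to DSR > 0.5)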
No single regime drives more than 60% of cumulative OOS profit
Parameter sensitivity analysis shows smooth, not cliff-edge, degradation
Transaction cost assumptions are realistic — cross-check slippage with actual fill data or TCA reports [VERIFY against broker/execution platform data]
Factor exposures are stable and intentional; unintended loadings are flagged
All data sources and methodology steps are documented sufficiently for independent replication
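The numeric checks in this list can be collected into a single gate, sketched below. The assumptions are mine: the t-statistic uses the simple approximation Sharpe × √T for a per-period Sharpe ratio, the multiple-testing adjustment is represented only by the caller raising `t_threshold`, and the function and key names are hypothetical. The qualitative checks (cost realism, factor stability, documentation) still need human review.

```python
import math


def quality_gate(oos_sharpe_per_period, n_obs, pbo, regime_profit,
                 t_threshold=1.96, pbo_limit=0.40, regime_limit=0.60):
    """Return a pass/fail flag per numeric check.

    regime_profit maps a regime label to cumulative OOS profit in that regime.
    """
    # Approximate t-statistic of a per-period Sharpe ratio over n_obs periods.
    t_stat = oos_sharpe_per_period * math.sqrt(n_obs)
    total = sum(regime_profit.values())
    # Share of total profit from the single best regime.
    top_share = max(regime_profit.values()) / total if total > 0 else math.inf
    return {
        "sharpe_significant": t_stat > t_threshold,
        "pbo_acceptable": pbo < pbo_limit,
        "regime_diversified": top_share <= regime_limit,
    }


balanced = quality_gate(
    oos_sharpe_per_period=0.10, n_obs=500, pbo=0.25,
    regime_profit={"expansion": 50.0, "recession": 30.0, "high_vol": 20.0},
)
concentrated = quality_gate(
    oos_sharpe_per_period=0.10, n_obs=500, pbo=0.25,
    regime_profit={"expansion": 90.0, "recession": 10.0},
)
print(all(balanced.values()), all(concentrated.values()))
```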

Structures backtesting methodology with out-of-sample testing, cross-validation, and overfitting detection techniques. Use when validating backtests, detecting overfitting, or ensuring backtest robustness.

14 stars

testing

Updated Apr 25, 2026

npx skillsauth add casemark/skills conducting-backtest-validation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: container and dependency vulnerability scanner. Clean (95%)
Semgrep: static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): open source security scanning. Skipped (50%)
Socket.dev: supply chain security analysis. Skipped (50%)
VirusTotal: multi-engine malware detection. Skipped (50%)
CrowdStrike: advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 25, 2026, 5:04 AM (12.5s, 1 file scanned)

SKILL.md

name: conducting-backtest-validation
language: en
description: Structures backtesting methodology with out-of-sample testing, cross-validation, and overfitting detection techniques. Use when validating backtests, detecting overfitting, or ensuring backtest robustness.
author: casemark

Related Skills

casemark/skills/legal/automated-contract-summary
Category: development
Generates structured executive summaries of contracts using ML: captures key terms, party obligations, risk allocations, and compliance requirements in a standardized format. Optimized for high-volume review where speed and consistency matter. Tags: summarization, agreement, corporate.
21 · SKILL.md · Updated May 26, 2026

casemark/obligation-mapping
Category: tools
Extracts regulatory obligations from dense regulations across jurisdictions. Breaks down multi-level regulations into clear article-level obligations, classifies applicability to a business, and prioritizes by risk level. Use when translating regulations into actionable compliance requirements.
21 · SKILL.md · Updated May 25, 2026

casemark/horizon-scanning
Category: development
Continuously monitors regulatory landscapes for changes relevant to a specific business. Ingests global regulatory updates, filters by relevance, summarizes impact, and produces an actionable change advisory. Use when tracking regulatory developments affecting a particular product or market.
21 · SKILL.md · Updated May 25, 2026

casemark/gap-analysis
Category: testing
Compares an organization's existing compliance controls, policies, and procedures against extracted regulatory obligations to identify coverage gaps. Produces a remediation plan with prioritized actions. Use when assessing compliance maturity or preparing for regulatory audits.
21 · SKILL.md · Updated May 25, 2026

Install manually

# Clone the repo
git clone https://github.com/casemark/skills.git

# Create the Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy into Claude Code skills folder (global)
cp -r skills/skills/capital/conducting-backtest-validation ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: casemark/skills (14 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT