skills/walk-forward-validation/SKILL.md
Walk-forward validation framework for trading strategies and ML models with time-series-aware splits, overfit detection, and regime-aware validation
Walk-forward validation framework for trading strategies and ML models. Standard cross-validation (k-fold, random splits) fails catastrophically for financial time series because it introduces lookahead bias and ignores autocorrelation. This skill covers proper time-series validation techniques including rolling and expanding windows, purged cross-validation, combinatorial purged cross-validation (CPCV), and overfit detection metrics.
Standard k-fold CV assumes data points are independent and identically distributed (IID). Financial time series violate both assumptions:
The train window has a fixed size and slides forward in time. This is preferred when you believe older data is less relevant (common in crypto).
Window 1: [===TRAIN===][=TEST=]
Window 2:         [===TRAIN===][=TEST=]
Window 3:                 [===TRAIN===][=TEST=]
Parameters:
- `train_size`: Number of bars/days in the training window
- `test_size`: Number of bars/days in the test window
- `step_size`: How far to advance between folds (often equals `test_size`)

The train window starts at the beginning and expands forward. This uses all available historical data, which helps when data is scarce.
Window 1: [==TRAIN==][=TEST=]
Window 2: [====TRAIN====][=TEST=]
Window 3: [======TRAIN======][=TEST=]
Parameters:
- `min_train_size`: Minimum training samples before first fold
- `test_size`: Fixed test window size
- `step_size`: How far to advance between folds

| Factor | Rolling | Expanding |
|---|---|---|
| Data recency | Prioritizes recent data | Uses all history |
| Regime changes | Better adapts to new regimes | May dilute recent regime |
| Sample size | Fixed, may be small | Grows over time |
| Crypto preference | Preferred for < 6mo horizons | Better for regime-stable models |
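The two window schemes can be sketched as a single index generator. This is a minimal illustration, not the skill's own engine: the function name `walk_forward_splits` is hypothetical, and for the expanding case `train_size` is assumed to play the role of `min_train_size`.

```python
from typing import Iterator


def walk_forward_splits(
    n_samples: int,
    train_size: int,
    test_size: int,
    step_size: int,
    window_type: str = "rolling",
) -> Iterator[tuple[range, range]]:
    """Yield (train, test) index ranges in chronological order.

    For "rolling", train_size is the fixed window length.
    For "expanding", train_size acts as the minimum training size.
    """
    if window_type not in ("rolling", "expanding"):
        raise ValueError(f"unknown window_type: {window_type!r}")
    train_end = train_size  # exclusive end of the first training window
    while train_end + test_size <= n_samples:
        # Rolling windows drop old data; expanding windows always start at 0.
        train_start = train_end - train_size if window_type == "rolling" else 0
        yield range(train_start, train_end), range(train_end, train_end + test_size)
        train_end += step_size


rolling = list(walk_forward_splits(20, train_size=8, test_size=4, step_size=4))
expanding = list(
    walk_forward_splits(20, train_size=8, test_size=4, step_size=4, window_type="expanding")
)
for train, test in rolling:
    print(f"train {train.start}-{train.stop - 1}, test {test.start}-{test.stop - 1}")
```

With 20 samples, both schemes produce three folds; they differ only in where each training window starts.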
Remove training samples whose labels overlap with the test set's time range. If a label is computed as the 24h forward return starting at time t, any training sample where t + 24h extends into the test period must be purged.
def purge_train_indices(
    train_idx: list[int],
    test_start: int,
    label_horizon: int,
    timestamps: list[int],
) -> list[int]:
    """Remove train samples whose label windows overlap test period."""
    test_start_time = timestamps[test_start]
    return [
        i for i in train_idx
        if timestamps[i] + label_horizon < test_start_time
    ]
Add a buffer gap between the end of training and start of testing to account for serial correlation that purging alone does not eliminate.
[===TRAIN===][--EMBARGO--][=TEST=]
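Purging and the embargo can be applied in one pass by moving the cutoff earlier by the embargo length. The sketch below assumes the embargo is expressed in the same units as the timestamps; the function name `purge_and_embargo` and its parameters are hypothetical.

```python
def purge_and_embargo(
    train_idx: list[int],
    test_start: int,
    timestamps: list[int],
    label_horizon: int,
    embargo: int,
) -> list[int]:
    """Keep train samples whose label window ends before (test start - embargo)."""
    cutoff = timestamps[test_start] - embargo
    return [i for i in train_idx if timestamps[i] + label_horizon < cutoff]


# Hourly bars (timestamps in hours), 24h forward-return labels, 12h embargo.
timestamps = list(range(200))
train_idx = list(range(100))
kept = purge_and_embargo(
    train_idx, test_start=100, timestamps=timestamps, label_horizon=24, embargo=12
)
print(f"kept {len(kept)} of {len(train_idx)} samples; last kept index: {kept[-1]}")
```

Here the last 36 training samples are dropped: 24 by the purge and a further 12 by the embargo.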
Typical embargo sizes:
CPCV (Lopez de Prado, 2018) generates all possible train/test combinations from N groups while maintaining temporal ordering. This produces far more test paths than standard walk-forward, enabling statistical tests for overfitting.
Key properties:
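- The data is split into `N` contiguous groups
- For each choice of `k` test groups, the remaining `N-k` groups form the training set
- This yields `C(N, k)` train/test splits (e.g., `N=6`, `k=2` gives 15 splits), which recombine into `k/N * C(N, k)` backtest paths (5 in that example)

See references/methodology.md for the full CPCV algorithm and formulas.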
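As a rough sketch of the combinatorial step only, the train/test combinations can be enumerated with `itertools.combinations`. The function name `cpcv_splits` is hypothetical, and purging and embargo around each test group are deliberately left out; they would still need to be applied at every train/test boundary.

```python
from itertools import combinations
from math import comb


def cpcv_splits(n_samples: int, n_groups: int, k_test: int):
    """Yield (train_idx, test_idx) for every choice of k_test test groups."""
    # Contiguous, near-equal group boundaries.
    bounds = [g * n_samples // n_groups for g in range(n_groups + 1)]
    groups = [list(range(bounds[g], bounds[g + 1])) for g in range(n_groups)]
    for test_groups in combinations(range(n_groups), k_test):
        test_idx = [i for g in test_groups for i in groups[g]]
        train_idx = [
            i for g in range(n_groups) if g not in test_groups for i in groups[g]
        ]
        yield train_idx, test_idx


splits = list(cpcv_splits(n_samples=60, n_groups=6, k_test=2))
print(len(splits), comb(6, 2))
```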
The observed Sharpe ratio must be adjusted for:
import numpy as np
from scipy.stats import norm


def deflated_sharpe_ratio(
    observed_sr: float,
    num_trials: int,
    backtest_length: int,
    skewness: float = 0.0,
    kurtosis: float = 3.0,
    trials_sr_std: float = 1.0,
) -> float:
    """Compute the probability that the true SR is > 0 after deflating for multiple trials.

    Args:
        observed_sr: Sharpe ratio of the selected strategy, per return
            observation (not annualized).
        num_trials: Number of strategies tested (including discarded ones).
        backtest_length: Number of return observations.
        skewness: Skewness of returns.
        kurtosis: Kurtosis of returns (3.0 for a normal distribution, not excess).
        trials_sr_std: Standard deviation of the Sharpe ratios across all trials;
            the default of 1.0 leaves the expected maximum unscaled.

    Returns:
        Deflated Sharpe ratio, a probability in [0, 1].
    """
    if num_trials < 2:
        raise ValueError("num_trials must be at least 2")
    # Standard error of the SR estimate, corrected for non-normal returns
    sr_std = np.sqrt(
        (1 - skewness * observed_sr + (kurtosis - 1) / 4 * observed_sr**2)
        / (backtest_length - 1)
    )
    # Expected max SR under null (Euler-Mascheroni approximation)
    euler_mascheroni = 0.5772156649
    expected_max_sr = trials_sr_std * (
        (1 - euler_mascheroni) * norm.ppf(1 - 1 / num_trials)
        + euler_mascheroni * norm.ppf(1 - 1 / (num_trials * np.e))
    )
    return float(norm.cdf((observed_sr - expected_max_sr) / sr_std))
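A DSR below 0.95 means the observed Sharpe ratio is not significantly above the maximum expected by chance across the trials tested (at the 5% level), so the performance is likely a product of overfitting.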
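For reference, the deflated Sharpe ratio as formulated by Bailey and Lopez de Prado can be written as follows. The symbols are introduced here for illustration: $\widehat{SR}$ is the non-annualized observed Sharpe ratio, $T$ the number of return observations, $\hat{\gamma}_3$ and $\hat{\gamma}_4$ the skewness and (non-excess) kurtosis of returns, $N$ the number of trials, $V[\{\widehat{SR}_n\}]$ the variance of the Sharpe ratios across trials, $\gamma_E \approx 0.5772$ the Euler-Mascheroni constant, and $\Phi$ the standard normal CDF.

$$
\widehat{DSR} = \Phi\!\left( \frac{\left(\widehat{SR} - SR_0\right)\sqrt{T-1}}{\sqrt{1 - \hat{\gamma}_3\,\widehat{SR} + \frac{\hat{\gamma}_4 - 1}{4}\,\widehat{SR}^2}} \right),
\qquad
SR_0 = \sqrt{V\!\left[\{\widehat{SR}_n\}\right]} \left( (1-\gamma_E)\,\Phi^{-1}\!\left[1 - \frac{1}{N}\right] + \gamma_E\,\Phi^{-1}\!\left[1 - \frac{1}{N e}\right] \right)
$$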
PBO uses CPCV to measure the fraction of backtest paths where the in-sample optimal strategy underperforms the median out-of-sample. A PBO above 0.50 indicates more-likely-than-not overfitting.
See references/overfit_detection.md for complete derivations and implementation details.
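To make the PBO definition concrete, here is a minimal sketch based on combinatorially symmetric splits of a returns matrix. It is an illustration under stated assumptions, not the implementation in `scripts/overfit_detector.py`: the function name, the `n_blocks` parameter, and the use of the per-period Sharpe ratio as the performance metric are all choices made for this example.

```python
from itertools import combinations

import numpy as np


def probability_of_backtest_overfitting(returns: np.ndarray, n_blocks: int = 8) -> float:
    """Estimate PBO from a (T, n_strategies) matrix of per-period returns."""
    n_strategies = returns.shape[1]
    blocks = np.array_split(np.arange(returns.shape[0]), n_blocks)

    def sharpe(rows: np.ndarray) -> np.ndarray:
        sub = returns[rows]
        return sub.mean(axis=0) / sub.std(axis=0, ddof=1)

    overfit_flags = []
    # Every way of choosing half the blocks as in-sample; the rest is out-of-sample.
    for is_blocks in combinations(range(n_blocks), n_blocks // 2):
        is_rows = np.concatenate([blocks[b] for b in is_blocks])
        oos_rows = np.concatenate(
            [blocks[b] for b in range(n_blocks) if b not in is_blocks]
        )
        best = int(np.argmax(sharpe(is_rows)))  # in-sample optimal strategy
        oos_sr = sharpe(oos_rows)
        # Relative out-of-sample rank of the in-sample winner, in (0, 1).
        rel_rank = np.sum(oos_sr <= oos_sr[best]) / (n_strategies + 1)
        overfit_flags.append(rel_rank <= 0.5)  # at or below the OOS median
    return float(np.mean(overfit_flags))


rng = np.random.default_rng(0)
noise = rng.normal(0.0, 0.01, size=(400, 10))   # ten strategies with no edge
with_edge = noise.copy()
with_edge[:, 0] += 0.01                          # strategy 0 gets a large, real edge
pbo_noise = probability_of_backtest_overfitting(noise)
pbo_edge = probability_of_backtest_overfitting(with_edge)
print(f"PBO, pure noise: {pbo_noise:.2f}; PBO, one genuine edge: {pbo_edge:.2f}")
```

When one strategy has a real edge, it wins in-sample and stays near the top out-of-sample, so its PBO falls to zero; with pure noise the in-sample winner has no reason to rank well out-of-sample.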
`min_train_size` may be necessary.

| Strategy Timeframe | Train Window | Test Window | Embargo |
|---|---|---|---|
| Scalping (1-5min) | 3-7 days | 1 day | 2-4 hours |
| Intraday (15min-1h) | 14-30 days | 3-7 days | 12-24 hours |
| Swing (4h-daily) | 30-90 days | 7-14 days | 2-5 days |
| Position (daily-weekly) | 90-180 days | 30 days | 5-10 days |
from walk_forward import WalkForwardValidator, WalkForwardConfig

config = WalkForwardConfig(
    train_size=90,
    test_size=14,
    step_size=14,
    window_type="rolling",
    embargo_size=3,
    purge_horizon=1,
)
validator = WalkForwardValidator(config)

for fold in validator.split(price_data):
    model.fit(fold.train_X, fold.train_y)
    predictions = model.predict(fold.test_X)
    fold.record_performance(predictions, fold.test_y)

results = validator.aggregate_results()
print(f"OOS Sharpe: {results.oos_sharpe:.3f}")
print(f"Train/Test Sharpe ratio: {results.sharpe_ratio_ratio:.2f}")
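How the aggregated figures are computed is not shown here, so the following is only one plausible sketch: pool the out-of-sample returns from all folds into one series, average the per-fold in-sample Sharpe ratios, and report their ratio. The function names, the zero risk-free rate, and the 365-period annualization are assumptions, and the return values are made-up numbers for illustration.

```python
import numpy as np


def annualized_sharpe(returns, periods_per_year: int = 365) -> float:
    """Annualized Sharpe ratio of per-period returns (risk-free rate assumed 0)."""
    r = np.asarray(returns, dtype=float)
    return float(r.mean() / r.std(ddof=1) * np.sqrt(periods_per_year))


def summarize_folds(train_returns, test_returns, periods_per_year: int = 365) -> dict:
    """Compare the average in-sample Sharpe with the pooled out-of-sample Sharpe."""
    is_sharpe = float(
        np.mean([annualized_sharpe(r, periods_per_year) for r in train_returns])
    )
    # Test windows do not overlap when step_size == test_size, so pooling is safe.
    oos_sharpe = annualized_sharpe(np.concatenate(test_returns), periods_per_year)
    return {
        "is_sharpe": is_sharpe,
        "oos_sharpe": oos_sharpe,
        "train_test_ratio": is_sharpe / oos_sharpe,
    }


# Illustrative daily strategy returns for two folds (made-up numbers).
train_returns = [[0.010, -0.002, 0.007, 0.004], [0.006, 0.001, -0.003, 0.009]]
test_returns = [[0.004, -0.006, 0.003], [0.002, 0.005, -0.004]]
summary = summarize_folds(train_returns, test_returns)
print(f"OOS Sharpe: {summary['oos_sharpe']:.3f}")
print(f"Train/Test Sharpe ratio: {summary['train_test_ratio']:.2f}")
```

A train/test ratio well above 1 means in-sample performance is not carrying over out-of-sample, which is the pattern the overfit metrics above are designed to flag.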
- references/methodology.md: Walk-forward theory, window types, purging, embargo, CPCV algorithm with formulas
- references/overfit_detection.md: Deflated Sharpe ratio, probability of backtest overfitting, multiple testing corrections
- references/practical_guide.md: Window size selection for crypto, regime considerations, common validation mistakes
- scripts/walk_forward.py: Walk-forward validation engine with rolling and expanding windows; `--demo` mode with synthetic data
- scripts/overfit_detector.py: Deflated Sharpe ratio and PBO computation; `--demo` mode with synthetic backtest results