skills/feature-engineering/SKILL.md
Feature construction from market data for ML trading models including price, volume, on-chain, and microstructure features
npx skillsauth add agiprolabs/claude-trading-skills feature-engineeringInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Feature engineering is the single highest-leverage activity in building ML trading models. Model selection (XGBoost vs. neural net vs. logistic regression) matters far less than the quality and diversity of input features. A simple model on great features will outperform a complex model on raw prices every time.
This skill covers constructing, validating, and selecting features from market data for use in classification (signal-classification) and regression models targeting crypto/Solana token trading.
Raw OHLCV data is non-stationary, noisy, and high-dimensional. Models trained directly on price series will overfit. Feature engineering transforms raw data into stationary, informative signals that capture distinct aspects of market behavior:
Derived purely from OHLCV price columns. These capture trend, momentum, and volatility from the price series itself.
| Feature | Formula | Lookback |
|---------|---------|----------|
| log_return | ln(close_t / close_{t-1}) | 1 bar |
| abs_return | abs(log_return) | 1 bar |
| return_volatility | std(log_return, N) | 20 bars |
| momentum_N | close_t / close_{t-N} - 1 | 5, 10, 20 |
| acceleration | momentum_5 - momentum_5[5] | 10 bars |
| high_low_range | (high - low) / close | 1 bar |
| close_position | (close - low) / (high - low) | 1 bar |
| gap | open_t / close_{t-1} - 1 | 1 bar |
| rolling_skew | skew(log_return, N) | 20 bars |
| rolling_kurtosis | kurtosis(log_return, N) | 20 bars |
Volume confirms or contradicts price movements. Divergences between price and volume are among the most reliable signals in short-term trading.
| Feature | Formula | Lookback |
|---------|---------|----------|
| volume_ratio | volume_t / mean(volume, N) | 20 bars |
| volume_ma_ratio | sma(volume, 5) / sma(volume, 20) | 20 bars |
| obv_slope | slope(OBV, N) | 10 bars |
| vwap_deviation | (close - VWAP) / VWAP | intraday |
| volume_acceleration | volume_ratio_t - volume_ratio_{t-1} | 21 bars |
| buy_volume_ratio | buy_volume / total_volume | 1 bar |
| dollar_volume | close * volume | 1 bar |
| volume_cv | std(volume, N) / mean(volume, N) | 20 bars |
Standard technical indicators computed via pandas-ta. Use the pandas-ta skill
for full parameter documentation.
| Feature | Source | Lookback |
|---------|--------|----------|
| rsi | RSI(14) | 14 bars |
| macd_histogram | MACD(12,26,9) histogram | 33 bars |
| bb_position | (close - BB_lower) / (BB_upper - BB_lower) | 20 bars |
| bb_width | (BB_upper - BB_lower) / BB_mid | 20 bars |
| atr_ratio | ATR(14) / close | 14 bars |
| adx | ADX(14) | 14 bars |
| stoch_k | Stochastic %K(14,3) | 14 bars |
| cci | CCI(20) | 20 bars |
| mfi | MFI(14) | 14 bars |
| supertrend_direction | Supertrend direction (+1/-1) | 10 bars |
Derived from trade-level data (individual swaps/transactions). Require on-chain or DEX API data.
| Feature | Description |
|---------|-------------|
| trade_count_ratio | Trades this bar / avg trades per bar |
| avg_trade_size | Mean trade size in USD |
| large_trade_pct | % of volume from trades > $10k |
| unique_traders | Count of distinct wallet addresses |
| buy_count_ratio | Buy trades / total trades |
| trade_size_entropy | Shannon entropy of trade size distribution |
Derived from blockchain state changes. Require Helius or Solana RPC data.
| Feature | Description |
|---------|-------------|
| holder_count_change | Change in unique holders over N periods |
| whale_net_flow | Net tokens moved by top-10 holders |
| token_velocity | Transfer volume / circulating supply |
| liquidity_change | Change in DEX liquidity pool TVL |
Capture relationships between the target token and broader market.
| Feature | Description |
|---------|-------------|
| sol_correlation | Rolling correlation with SOL price |
| btc_beta | Rolling beta to BTC returns |
| sector_momentum | Average return of tokens in same sector |
Cyclical encoding of calendar time. Use sin/cos encoding to preserve cyclical continuity (hour 23 is close to hour 0).
import numpy as np
hour_sin = np.sin(2 * np.pi * hour / 24)
hour_cos = np.cos(2 * np.pi * hour / 24)
day_of_week = np.sin(2 * np.pi * day / 7)
Non-stationary features will cause your model to fail on new data. A feature is stationary if its statistical properties (mean, variance) don't change over time.
Use the Augmented Dickey-Fuller (ADF) test:
from scipy.stats import adfuller
result = adfuller(feature_series.dropna())
p_value = result[1]
is_stationary = p_value < 0.05
| Non-Stationary | Stationary Transform | |----------------|---------------------| | Price | Log return | | Volume | Volume ratio (vol / avg vol) | | OBV | OBV slope (regression coefficient) | | Holder count | Holder count change | | RSI | Already stationary (bounded 0-100) | | Dollar volume | Dollar volume / rolling mean |
Rule: If a feature trends upward or downward over time, it is non-stationary. Transform it into a ratio, difference, or rate of change.
After computing features, normalize them so that all features have comparable scales. This is critical for distance-based models (KNN, SVM) and helpful for tree models.
| Method | Formula | When to Use |
|--------|---------|-------------|
| Z-score | (x - mean) / std | Gaussian-like distributions |
| Min-max | (x - min) / (max - min) | Bounded features (RSI, BB position) |
| Rank | rank(x) / len(x) | Heavy-tailed distributions |
Critical: Use rolling statistics for normalization. Never use full-sample mean/std — that introduces lookahead bias.
# CORRECT: rolling z-score
z = (feature - feature.rolling(60).mean()) / feature.rolling(60).std()
# WRONG: full-sample z-score (lookahead bias!)
z = (feature - feature.mean()) / feature.std()
The most dangerous bug in trading ML is lookahead bias — using future information to compute features or targets. Follow these rules absolutely:
.mean() or .std() on the full
series. Always use .rolling(N).mean().close.shift(-N) / close - 1 (future return), not close / close.shift(N) - 1
(past return used as target).t is paired with target row t (where target already
contains the forward shift).train = data[:split_idx], test = data[split_idx:].After computing many features, select the most predictive and least redundant:
from sklearn.feature_selection import VarianceThreshold
selector = VarianceThreshold(threshold=0.01)
X_filtered = selector.fit_transform(X)
Remove features with > 0.9 correlation to another feature (keep the one with higher target correlation):
corr_matrix = X.corr().abs()
upper = corr_matrix.where(np.triu(np.ones(corr_matrix.shape), k=1).astype(bool))
to_drop = [col for col in upper.columns if any(upper[col] > 0.9)]
Train a random forest and rank by importance:
from sklearn.ensemble import RandomForestClassifier
rf = RandomForestClassifier(n_estimators=100, random_state=42)
rf.fit(X_train, y_train)
importances = pd.Series(rf.feature_importances_, index=X.columns).sort_values(ascending=False)
Non-linear alternative to correlation:
from sklearn.feature_selection import mutual_info_classif
mi = mutual_info_classif(X_train, y_train, random_state=42)
mi_scores = pd.Series(mi, index=X.columns).sort_values(ascending=False)
Labels (targets) define what the model learns to predict.
forward_return = close.shift(-N) / close - 1
label = (forward_return > threshold).astype(int) # 1 = up, 0 = not up
Typical thresholds: 1% for 1h bars, 3% for 4h bars, 5% for daily bars.
label = pd.cut(forward_return,
bins=[-np.inf, -threshold, threshold, np.inf],
labels=[0, 1, 2]) # 0=down, 1=flat, 2=up
target = forward_return # Predict exact return magnitude
Binary classification is recommended for initial models — it's simpler and more robust to noise.
pandas-ta: Compute technical indicators that become featuresbirdeye-api: Fetch OHLCV and trade data for feature computationhelius-api: Fetch on-chain data for holder/whale featuressignal-classification: Use engineered features as model inputsregime-detection: Regime labels as features or for regime-conditional modelsohlcv-processing: Clean and resample raw data before feature computationreferences/feature_catalog.md — Complete catalog of ~40 features with formulas,
lookbacks, stationarity status, and interpretation notesreferences/pitfalls.md — Common mistakes in trading feature engineering:
lookahead bias, overfitting, survivorship bias, data snooping, non-stationarityscripts/build_features.py — Compute 25+ features from OHLCV data with
stationarity testing and quality reporting. Supports demo mode with synthetic data
or live data via Birdeye API.scripts/feature_importance.py — Rank features by predictive power using
tree-based importance and permutation importance. Identifies redundant features
via correlation analysis.data-ai
DeFi yield evaluation including fee APR, real vs nominal yield, net APY after costs, and yield sustainability analysis
tools
Real-time Solana transaction and account streaming via Yellowstone gRPC (Geyser plugin)
tools
Large wallet monitoring, accumulation and distribution detection, and smart money signal generation for Solana tokens
tools
Wash sale detection under 2025 US crypto rules with 61-day window monitoring, disallowed loss tracking, and safe re-entry countdown