skills/tabular/threaded-parquet-describe-features/SKILL.md
Parallel-load per-subject parquet time-series files with ThreadPoolExecutor and flatten describe() statistics into tabular feature vectors
npx skillsauth add wenmin-wu/ds-skills tabular-threaded-parquet-describe-features
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When time-series data is stored as one parquet file per subject (common in wearable and sensor competitions), loading the files one by one is slow. Use ThreadPoolExecutor to read the files in parallel, compute df.describe() per subject, and flatten the eight statistics per column (count, mean, std, min, 25%, 50%, 75%, max) into a single feature row. This converts variable-length time series into fixed-width tabular features without a slow sequential loop.
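To illustrate the fixed-width claim, here is a minimal sketch on synthetic data. The helper name `describe_vector` and the toy frames are assumptions made for illustration; they are not part of the skill itself.

```python
import numpy as np
import pandas as pd


def describe_vector(df):
    """Flatten the describe() table of a numeric frame into one 1-D array."""
    # describe() has one row per statistic and one column per numeric feature.
    return df.describe().values.reshape(-1)


# Two hypothetical subjects with very different series lengths.
rng = np.random.default_rng(0)
short = pd.DataFrame({"x": rng.normal(size=50), "y": rng.normal(size=50)})
long = pd.DataFrame({"x": rng.normal(size=5000), "y": rng.normal(size=5000)})

short_vec = describe_vector(short)
long_vec = describe_vector(long)

# Both vectors have 8 statistics * 2 columns = 16 entries,
# so they can be stacked as rows of one feature table.
features = np.vstack([short_vec, long_vec])
```

The vector length depends only on the number of numeric columns, not on the number of rows, which is what lets subjects with different recording lengths share one table.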
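import os
from concurrent.futures import ThreadPoolExecutor

import pandas as pd


def process_subject(subject_dir, base_dir):
    path = os.path.join(base_dir, subject_dir, 'part-0.parquet')
    df = pd.read_parquet(path)
    # 'step' is a monotonic row index and carries no signal.
    df = df.drop(columns=['step'], errors='ignore')
    # describe() returns an (8, n_cols) table; flatten it row-major into one vector.
    stats = df.describe().values.reshape(-1)
    # Directory names look like "<key>=<subject id>"; keep the part after "=".
    sid = subject_dir.split('=')[-1]
    return stats, sid


def load_timeseries_features(base_dir):
    # Keep only subject directories so stray files do not break the load.
    subjects = [d for d in os.listdir(base_dir)
                if os.path.isdir(os.path.join(base_dir, d))]
    # pool.map preserves input order, so stats and ids stay aligned.
    with ThreadPoolExecutor() as pool:
        results = list(pool.map(
            lambda s: process_subject(s, base_dir), subjects))
    stats, ids = zip(*results)
    n_feats = len(stats[0])
    df = pd.DataFrame(list(stats),
                      columns=[f'ts_stat_{i}' for i in range(n_feats)])
    df['id'] = ids
    return df


# `train` is assumed to be an existing DataFrame with a matching 'id' column.
ts_features = load_timeseries_features('data/series_train.parquet/')
train = train.merge(ts_features, on='id', how='left')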
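Because the merge uses how='left', subjects without a time-series file keep their row but get NaN in every ts_stat_i column. A minimal sketch for measuring that coverage follows; the helper name `merge_with_coverage` and the toy frames are hypothetical and stand in for the real `train` and `ts_features`.

```python
import pandas as pd


def merge_with_coverage(train, ts_features, key="id"):
    """Left-merge the features and report the share of rows that found a match."""
    # indicator=True adds a temporary '_merge' column: 'both' or 'left_only'.
    merged = train.merge(ts_features, on=key, how="left", indicator=True)
    coverage = merged["_merge"].eq("both").mean()
    return merged.drop(columns="_merge"), coverage


# Toy data: subject "b" has no time-series features.
toy_train = pd.DataFrame({"id": ["a", "b", "c"], "target": [0, 1, 0]})
toy_ts = pd.DataFrame({"id": ["a", "c"], "ts_stat_0": [1.0, 2.0]})

merged, coverage = merge_with_coverage(toy_train, toy_ts)
```

The subject IDs parsed from directory names are strings, so this sketch assumes the key column in `train` is also a string column; if it is not, casting both sides to `str` before merging is one option.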
- ThreadPoolExecutor.map() reads each parquet file and computes describe().
- Drop the step column: a monotonic index adds no information.
- Features come out as anonymous ts_stat_i columns; map them back via (stat_idx, col_name) if interpretability is needed.
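The (stat_idx, col_name) mapping can be sketched as follows. It relies on `reshape(-1)` flattening the describe() table row by row, so all columns for `count` come first, then all columns for `mean`, and so on. The helper names `describe_feature_names` and `locate_feature` are assumptions for illustration, and the sketch assumes every subject file has the same numeric columns in the same order.

```python
import pandas as pd

# Row order of DataFrame.describe() for numeric columns with default percentiles.
DESCRIBE_STATS = ["count", "mean", "std", "min", "25%", "50%", "75%", "max"]


def describe_feature_names(columns, prefix="ts"):
    """Readable names in the same order as describe().values.reshape(-1)."""
    return [f"{prefix}_{stat}_{col}" for stat in DESCRIBE_STATS for col in columns]


def locate_feature(i, columns):
    """Map a flat ts_stat_i index back to (statistic name, column name)."""
    stat_idx, col_idx = divmod(i, len(columns))
    return DESCRIBE_STATS[stat_idx], columns[col_idx]


# Toy check on a frame with two numeric columns.
toy = pd.DataFrame({"x": [1.0, 2.0, 3.0], "y": [10.0, 20.0, 30.0]})
flat = toy.describe().values.reshape(-1)
named = pd.Series(flat, index=describe_feature_names(list(toy.columns)))
```

With this in place, `locate_feature(3, ["x", "y"])` identifies `ts_stat_3` as the mean of `y`, which helps when reading feature importances.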