skills/tabular/chunked-hdf5-streaming/SKILL.md
Streams large HDF5 files in fixed-size row chunks to compute summary statistics without loading the full dataset into memory.
npx skillsauth add wenmin-wu/ds-skills tabular-chunked-hdf5-streaming
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Genomics and single-cell datasets stored as HDF5 often exceed 10 GB, and loading one fully into a DataFrame can exhaust memory. pd.read_hdf accepts start and stop row offsets, so the file can be read one chunk at a time. Process each chunk independently (row sums, nonzero counts, per-group statistics), keep only the small per-chunk result, then discard the chunk. Peak memory is then set by the chunk size plus the accumulated summaries rather than by the file size, which makes the pattern useful for EDA and feature engineering on competition datasets stored in HDF5 format.
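import pandas as pd
import numpy as np


def stream_hdf5_stats(filepath, chunksize=5000):
    """Compute per-row summary stats from large HDF5 in chunks."""
    summaries = []
    start = 0
    while True:
        # No key is passed, which works only if the file holds a single pandas object.
        chunk = pd.read_hdf(filepath, start=start, stop=start + chunksize)
        if len(chunk) == 0:
            break
        # Keep only three values per row; the wide chunk is dropped on the next iteration.
        summaries.append(pd.DataFrame({
            'row_sum': chunk.sum(axis=1),
            'nonzero_count': (chunk != 0).sum(axis=1),
            'row_mean': chunk.mean(axis=1),
        }, index=chunk.index))
        # A short chunk means the end of the file was reached.
        if len(chunk) < chunksize:
            break
        start += chunksize
    # Note: pd.concat raises ValueError if the file had no rows at all.
    return pd.concat(summaries)
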
stats = stream_hdf5_stats('train_multi_inputs.h5', chunksize=5000)
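The function above keeps one summary row per input row, so its output still grows with the file. When only column-level aggregates are needed, running totals keep memory flat no matter how many rows are read. The following is a minimal sketch of that variant, assuming every column is numeric; `iter_hdf_chunks` and `accumulate_column_stats` are hypothetical helper names for illustration, not pandas functions.

```python
import numpy as np
import pandas as pd


def iter_hdf_chunks(filepath, chunksize=5000):
    """Yield consecutive row chunks from an HDF5 file (hypothetical helper)."""
    start = 0
    while True:
        chunk = pd.read_hdf(filepath, start=start, stop=start + chunksize)
        if len(chunk) == 0:
            break
        yield chunk
        if len(chunk) < chunksize:
            break
        start += chunksize


def accumulate_column_stats(chunks):
    """Fold an iterable of numeric DataFrames into per-column mean and nonzero count."""
    columns = None
    col_sum = None
    col_nonzero = None
    n_rows = 0
    for chunk in chunks:
        values = chunk.to_numpy(dtype=np.float64)
        if col_sum is None:
            # Size the accumulators from the first chunk (assumes all chunks share columns).
            columns = chunk.columns
            col_sum = np.zeros(values.shape[1])
            col_nonzero = np.zeros(values.shape[1], dtype=np.int64)
        col_sum += values.sum(axis=0)
        col_nonzero += (values != 0).sum(axis=0)
        n_rows += values.shape[0]
    if col_sum is None or n_rows == 0:
        raise ValueError("no rows to summarize")
    return pd.DataFrame(
        {"col_mean": col_sum / n_rows, "nonzero_count": col_nonzero},
        index=columns,
    )
```

The two pieces compose as `accumulate_column_stats(iter_hdf_chunks('train_multi_inputs.h5'))`. Taking any iterable of DataFrames keeps the accumulator independent of the file format, so it can be exercised on small in-memory frames before pointing it at a large file.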
Read row ranges with pd.read_hdf(path, start=, stop=)
Combine the per-chunk summaries with pd.concat at the end
Convert chunks to scipy.sparse.csr_matrix if >90% zeros
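The sparse-conversion note can be sketched as below. The 90% cutoff is taken from the note itself; `maybe_to_sparse` is a hypothetical helper name, and the sketch assumes every column in the chunk is numeric.

```python
import numpy as np
import pandas as pd
from scipy import sparse


def maybe_to_sparse(chunk, zero_threshold=0.9):
    """Return a CSR matrix if the chunk is mostly zeros, else a dense array."""
    values = chunk.to_numpy()
    if values.size == 0:
        return values
    # Fraction of entries that are exactly zero.
    zero_fraction = 1.0 - np.count_nonzero(values) / values.size
    if zero_fraction > zero_threshold:
        return sparse.csr_matrix(values)
    return values
```

Both return types drop the DataFrame's row index and column labels, so keep `chunk.index` and `chunk.columns` separately if later steps need them.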