.claude/skills/ml-data-handling/SKILL.md
# ML Data Handling
## When to Load This Skill

Load when working with: pickle, ONNX, Parquet, Feather, HDF5, large datasets, S3/Object Storage, DVC-like versioning, model artifacts, data pipelines.
## Core Principle

Binary ML artifacts (weights, embeddings, datasets) are NEVER committed to Git. They live in object storage (Yandex Cloud Object Storage, which is S3-compatible) or are reproducible via the pipeline. Paths and versions are tracked in code; the actual data is not.
## Directory Convention

```
project-name/
├── data/
│   ├── raw/          # gitignored: original client data, immutable
│   ├── interim/      # gitignored: intermediate transformations
│   ├── processed/    # gitignored: final features ready for training
│   └── .gitkeep      # committed: preserves structure
├── models/
│   ├── weights/      # gitignored: .pt, .onnx, .safetensors
│   └── .gitkeep
└── artifacts/        # gitignored: experiment outputs
```
We do not use DVC because it adds friction. Instead, manifest files are committed to Git.

Each data version has a corresponding `data/manifest.json`:
```json
{
  "version": "1.2.0",
  "created_at": "2026-03-01T10:00:00Z",
  "splits": {
    "train": {
      "path": "s3://bucket/datasets/project/v1.2.0/train.parquet",
      "rows": 45000,
      "sha256": "a3f2..."
    },
    "val": {
      "path": "s3://bucket/datasets/project/v1.2.0/val.parquet",
      "rows": 5000,
      "sha256": "b7c1..."
    }
  },
  "preprocessing": {
    "script": "scripts/preprocess.py",
    "commit": "abc123"
  }
}
```
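A manifest is only useful if downloaded files are actually checked against it. The following is a minimal sketch of such a check, assuming the `sha256` field holds the full hex digest (the values in the example manifest are abbreviated). The helper names `sha256_of_file` and `verify_split` are hypothetical and not part of this skill.

```python
import hashlib
import json
from pathlib import Path


def sha256_of_file(path: Path, chunk_size: int = 1 << 20) -> str:
    """Hash a file in fixed-size chunks so large datasets are never fully in memory."""
    digest = hashlib.sha256()
    with path.open("rb") as f:
        for chunk in iter(lambda: f.read(chunk_size), b""):
            digest.update(chunk)
    return digest.hexdigest()


def verify_split(manifest_path: Path, split: str, local_file: Path) -> bool:
    """Return True if `local_file` matches the sha256 recorded for `split`."""
    manifest = json.loads(manifest_path.read_text(encoding="utf-8"))
    expected = manifest["splits"][split]["sha256"]
    return sha256_of_file(local_file) == expected
```

A typical call would be `verify_split(Path("data/manifest.json"), "train", Path("data/processed/train.parquet"))`, run right after the download and before training starts.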
```python
import asyncio

import boto3
from botocore.config import Config

from src.project_name.core.config import settings


def get_s3_client():
    # Yandex Cloud Object Storage speaks the S3 API, so boto3 works
    # once it is pointed at the YC endpoint.
    return boto3.client(
        "s3",
        endpoint_url="https://storage.yandexcloud.net",
        aws_access_key_id=settings.yc_access_key_id,
        aws_secret_access_key=settings.yc_secret_access_key,
        config=Config(signature_version="s3v4"),
        region_name="ru-central1",
    )


async def download_artifact(s3_key: str, local_path: str) -> None:
    client = get_s3_client()
    # boto3 transfers are blocking; run them in a worker thread so the
    # event loop stays responsive.
    await asyncio.to_thread(
        client.download_file, settings.yc_bucket_name, s3_key, local_path
    )


async def upload_artifact(local_path: str, s3_key: str) -> None:
    client = get_s3_client()
    await asyncio.to_thread(
        client.upload_file, local_path, settings.yc_bucket_name, s3_key
    )
```
Required `.env` keys:

```
YC_ACCESS_KEY_ID=...
YC_SECRET_ACCESS_KEY=...
YC_BUCKET_NAME=...
```
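The manifest records full `s3://` URIs, while `download_artifact` and `upload_artifact` take a bucket-relative key. One way to bridge the two is a small parser like the sketch below; `split_s3_uri` is a hypothetical helper, not something this skill defines.

```python
def split_s3_uri(uri: str) -> tuple[str, str]:
    """Split 's3://bucket/some/key' into ('bucket', 'some/key')."""
    prefix = "s3://"
    if not uri.startswith(prefix):
        raise ValueError(f"Not an s3:// URI: {uri!r}")
    bucket, _, key = uri[len(prefix):].partition("/")
    if not bucket or not key:
        raise ValueError(f"URI must contain both a bucket and a key: {uri!r}")
    return bucket, key


bucket, key = split_s3_uri("s3://bucket/datasets/project/v1.2.0/train.parquet")
```

The resulting `key` can be passed as `s3_key`. Comparing `bucket` against the configured bucket name first is a cheap guard against a manifest that points at a different bucket.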
Pickle can execute arbitrary code while loading, so loading untrusted pickles is a security vulnerability.
```python
import pickle
from pathlib import Path


def safe_load_pickle(path: Path, allowed_classes: set[type]) -> object:
    class RestrictedUnpickler(pickle.Unpickler):
        def find_class(self, module: str, name: str) -> type:
            # Resolve only classes that were explicitly allowlisted.
            for cls in allowed_classes:
                if cls.__module__ == module and cls.__name__ == name:
                    return cls
            raise pickle.UnpicklingError(f"Forbidden class: {module}.{name}")

    with path.open("rb") as f:
        return RestrictedUnpickler(f).load()
```
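To see what the allowlist buys, the sketch below builds a pickle whose loading would call a function, then shows that an allowlisting unpickler refuses it. It is an in-memory illustration, not the loader above: `AllowlistUnpickler`, `restricted_loads`, and `Payload` are hypothetical names. The harmless `os.getcwd` stands in for what would be something like `os.system` in a real attack.

```python
import io
import os
import pickle
from collections import OrderedDict


class AllowlistUnpickler(pickle.Unpickler):
    """In-memory variant of the allowlist idea, for demonstration only."""

    def __init__(self, file, allowed_classes: set[type]) -> None:
        super().__init__(file)
        self._allowed = allowed_classes

    def find_class(self, module: str, name: str) -> type:
        for cls in self._allowed:
            if cls.__module__ == module and cls.__name__ == name:
                return cls
        raise pickle.UnpicklingError(f"Forbidden class: {module}.{name}")


class Payload:
    """Stand-in for a malicious object: plain pickle.loads would call os.getcwd()."""

    def __reduce__(self):
        return (os.getcwd, ())


def restricted_loads(data: bytes, allowed_classes: set[type]) -> object:
    return AllowlistUnpickler(io.BytesIO(data), allowed_classes).load()


# An allowlisted class round-trips normally.
restored = restricted_loads(pickle.dumps(OrderedDict(a=1)), {OrderedDict})

# The payload references a callable that is not allowlisted, so loading is refused
# before anything is executed.
try:
    restricted_loads(pickle.dumps(Payload()), {OrderedDict})
    blocked = False
except pickle.UnpicklingError:
    blocked = True
```

Plain containers and scalars (dict, list, str, int) never go through `find_class`, so they load regardless of the allowlist. The allowlist only governs which importable names a pickle may resolve.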
For model artifacts, prefer ONNX over pickle whenever possible.
```python
import torch
import torch.onnx
from pathlib import Path


def export_to_onnx(
    model: torch.nn.Module,
    dummy_input: torch.Tensor,
    output_path: Path,
    input_names: list[str],
    output_names: list[str],
    dynamic_axes: dict | None = None,
) -> None:
    # Switch to inference mode so dropout and batch norm behave deterministically.
    model.eval()
    torch.onnx.export(
        model,
        dummy_input,
        str(output_path),
        export_params=True,
        opset_version=17,
        do_constant_folding=True,
        input_names=input_names,
        output_names=output_names,
        dynamic_axes=dynamic_axes or {},
    )
```
```python
import numpy as np
import onnxruntime as ort
from pathlib import Path


class OnnxInferenceAdapter:
    def __init__(self, model_path: Path) -> None:
        # Providers are listed in priority order: CUDA first, CPU second.
        self._session = ort.InferenceSession(
            str(model_path),
            providers=["CUDAExecutionProvider", "CPUExecutionProvider"],
        )
        self._input_name = self._session.get_inputs()[0].name

    def predict(self, inputs: np.ndarray) -> np.ndarray:
        # Passing None as the first argument returns all model outputs.
        outputs = self._session.run(None, {self._input_name: inputs})
        return outputs[0]
```
```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq


def read_large_parquet(path: str, columns: list[str] | None = None) -> pd.DataFrame:
    # Selecting columns up front avoids reading unused data from disk.
    return pd.read_parquet(path, columns=columns, engine="pyarrow")


def write_parquet(df: pd.DataFrame, path: str) -> None:
    table = pa.Table.from_pandas(df)
    pq.write_table(table, path, compression="snappy")
```
For very large files (>1GB), use chunked reading:
```python
import pyarrow.parquet as pq


def iter_parquet_chunks(path: str, batch_size: int = 10_000):
    pf = pq.ParquetFile(path)
    # Yield one pandas DataFrame per batch instead of loading the whole file.
    for batch in pf.iter_batches(batch_size=batch_size):
        yield batch.to_pandas()
```
Clients often send data as split archives (`dataset.zip.001`, `.002`, ...). Reassemble them before processing:

```bash
cat dataset.zip.* > dataset.zip
unzip dataset.zip -d data/raw/
```

Then verify integrity and record the result in `data/manifest.json`.
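Where a shell is not available, or when the checksum should be computed in the same pass, the reassembly can also be sketched in Python. This assumes zero-padded numeric suffixes, so lexicographic order equals part order. `reassemble_split_archive` is a hypothetical helper name.

```python
import hashlib
from pathlib import Path


def reassemble_split_archive(parts_dir: Path, archive_name: str, output: Path) -> str:
    """Concatenate `archive_name.001`, `.002`, ... into `output`; return its sha256."""
    # Sorting paths lexicographically matches part order for zero-padded suffixes.
    parts = sorted(parts_dir.glob(f"{archive_name}.[0-9]*"))
    if not parts:
        raise FileNotFoundError(f"No parts found for {archive_name} in {parts_dir}")

    digest = hashlib.sha256()
    with output.open("wb") as out:
        for part in parts:
            with part.open("rb") as src:
                # Stream in 1 MiB chunks so parts never need to fit in memory.
                for chunk in iter(lambda: src.read(1 << 20), b""):
                    out.write(chunk)
                    digest.update(chunk)
    return digest.hexdigest()
```

The returned digest describes the reassembled archive. It can be compared with a checksum supplied by the client, if one exists, before the extracted files are logged in the manifest.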
- `resources/dvc-alternative.md`: why we use manifest files instead of DVC
- `resources/feature-store.md`: feature caching patterns for repeated experiments