skills/data-pipeline-manager/SKILL.md
Manage ML dataset pipelines before training. Use when the user needs to acquire, preprocess, split, or version datasets, design train/val/test protocols, audit data quality, check for train/test contamination, or make data decisions that affect experimental validity and reviewer trust.
npx skillsauth add a-green-hand-jack/ml-research-skills data-pipeline-manager
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Make deliberate, documented data decisions before training. This skill prevents the most common experimental validity failures: undocumented preprocessing, leaky splits, contaminated test sets, and irreproducible data pipelines.
Use this skill when:
Do not use this skill for running experiments — use run-experiment once the pipeline is validated. Do not use this skill to write result sections — use experiment-story-writer after results exist.
Pair this skill with:
- experiment-design-planner to ensure the split protocol matches the experimental design's evaluation criteria
- baseline-selection-audit when dataset choice or protocol affects which baselines are comparable
- research-project-memory to record data decisions as claims, evidence, and risks
- result-diagnosis when surprising results may trace to a data artifact rather than a method failure
- paper-evidence-board when data protocol details need to appear in the paper's methods and evidence slots
- init-python-project when the project's data/ layout and pipeline scripts need to be scaffolded

<installed-skill-dir>/
├── SKILL.md
├── references/
│   ├── split-protocols.md
│   └── contamination-guide.md
└── templates/
    └── pipeline-plan.md
- references/split-protocols.md when designing or auditing split protocols.
- references/contamination-guide.md when handling LLM evaluation, pre-trained model probing, retrieval benchmarks, or any setting where the model may have seen test-adjacent data.
- templates/pipeline-plan.md before writing a pipeline plan.
- memory/claim-board.md and memory/evidence-board.md when data decisions need to be linked to paper claims.
- code/.agent/benchmark-plan.md or equivalent when a benchmark protocol already exists.

Locate:
- data/, datasets/, code/data/, or user-provided path
- data_utils.py, preprocess.py, scripts/prepare_data.sh, or equivalent
- docs/data/, data/README.md, or any dataset description
- code/.agent/benchmark-plan.md, memory/claim-board.md
- paper/sections/experiments.tex or equivalent when present

Record:
If nothing exists, produce a draft pipeline plan and mark uncertain fields as TBD.
Choose based on the user's request or the current gap:
- audit: review an existing pipeline for completeness, reproducibility, and reviewer risk
- split: design or validate the train/val/test split protocol
- quality: run or plan a data quality audit (missing values, label noise, distribution shifts, class imbalance)
- contamination: check for data leakage between splits and pre-training / retrieved data
- versioning: pin dataset version, hashes, download scripts, and processing seeds
- plan: create a full pipeline plan covering all of the above for a new dataset

Default: use audit when a pipeline exists, plan when none exists.
Read references/split-protocols.md.
For every dataset split, record:
- Split: train | val | test | held-out
- Size: N examples (% of total)
- Stratification: class-stratified | time-ordered | random | grouped | domain-separated
- Overlap policy: none | explicit-allowed (state why)
- Leakage controls: [list controls]
- Assignment seed: <integer>
- Purpose: what hypothesis or metric this split answers
- Forbidden use: what must NOT touch this split before final evaluation
Flag any split designed after the method was partially tuned — this is a reviewer-visible risk. Flag any split where the validation set was used as a proxy test set.
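One way to make the assignment seed and the "grouped" stratification option concrete is a hash-based assignment, sketched below. This is a minimal illustration, not the protocol from references/split-protocols.md: the function names, the `group_id` field, and the 10%/10% fractions are all hypothetical. Because every example from the same group hashes to the same bucket, a group cannot straddle two splits.

```python
import hashlib


def assign_split(group_id: str, seed: int,
                 val_frac: float = 0.1, test_frac: float = 0.1) -> str:
    """Deterministically map a group ID to 'train', 'val', or 'test'.

    Hypothetical helper: the seed is mixed into the hash so a different
    seed yields a different (but still reproducible) assignment.
    """
    digest = hashlib.sha256(f"{seed}:{group_id}".encode("utf-8")).digest()
    # Interpret the first 8 bytes as a number in [0, 1).
    bucket = int.from_bytes(digest[:8], "big") / 2**64
    if bucket < test_frac:
        return "test"
    if bucket < test_frac + val_frac:
        return "val"
    return "train"


def split_examples(examples, seed: int, group_key: str = "group_id",
                   val_frac: float = 0.1, test_frac: float = 0.1):
    """Partition examples so that each group lands in exactly one split."""
    splits = {"train": [], "val": [], "test": []}
    for example in examples:
        name = assign_split(str(example[group_key]), seed, val_frac, test_frac)
        splits[name].append(example)
    return splits


if __name__ == "__main__":
    # Toy data: 500 examples spread over 50 groups (e.g. patients or authors).
    data = [{"group_id": f"g{i % 50}", "x": i} for i in range(500)]
    result = split_examples(data, seed=13)
    for name, rows in result.items():
        print(name, len(rows))
```

Split sizes from this approach are only approximately equal to the requested fractions, so the Size field should be filled in from the actual counts after assignment.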
Check:
For each finding, classify as:
- blocker: will cause invalid results or comparison failures
- risk: should be disclosed in the paper's limitations or experiment settings
- note: worth tracking but not blocking

Read references/contamination-guide.md.
For LLM evaluation, retrieval, or pre-trained model probing:
For standard supervised tasks:
For every contamination finding, record the severity and the available mitigation.
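As a first-pass check for leakage between splits, exact duplicates can be found by fingerprinting normalized text. The sketch below is an assumption about how such a check might look, not the procedure from references/contamination-guide.md; `normalize`, `fingerprint`, and `find_overlap` are hypothetical names. It only catches duplicates that match after lowercasing and whitespace collapsing. Paraphrases, near-duplicates, and pre-training exposure need other methods.

```python
import hashlib
import re


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants compare equal."""
    return re.sub(r"\s+", " ", text.strip().lower())


def fingerprint(text: str) -> str:
    """SHA-256 of the normalized text."""
    return hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()


def find_overlap(train_texts, test_texts):
    """Return indices of test examples whose fingerprint also occurs in train."""
    train_prints = {fingerprint(t) for t in train_texts}
    return [i for i, t in enumerate(test_texts) if fingerprint(t) in train_prints]


if __name__ == "__main__":
    train = ["The cat sat.", "Hello world"]
    test = ["  the CAT   sat. ", "Unseen example"]
    print(find_overlap(train, test))  # indices of leaked test examples
```

Each returned index is a candidate contamination finding; severity and mitigation still have to be judged and recorded by hand.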
For reproducible experiments, the pipeline plan must include:
- Dataset name and version: <name>==<version> or commit/hash
- Download source: <url or repo> (verified checksum or hash)
- Raw data hash: sha256=<hash>
- Processing script: <path/to/script> at git commit <sha>
- Processing seed: <integer> for every random step
- Processed data hash: sha256=<hash> of final splits
- Storage location: <path or remote>
- Excluded samples: <rule and count>
- Known issues: <list>
If any of these fields cannot be filled, mark as TODO and add to memory/action-board.md.
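The two hash fields can be filled with the standard library alone. The sketch below streams files through SHA-256 so large datasets do not need to fit in memory. The helper names are hypothetical, and the directory-hash scheme (sorted relative paths plus per-file digests) is one reasonable convention rather than anything this skill prescribes.

```python
import hashlib
from pathlib import Path


def sha256_of_file(path, chunk_size: int = 1 << 20) -> str:
    """Stream a file through SHA-256 and return the hex digest."""
    hasher = hashlib.sha256()
    with open(path, "rb") as handle:
        for chunk in iter(lambda: handle.read(chunk_size), b""):
            hasher.update(chunk)
    return hasher.hexdigest()


def sha256_of_dir(root) -> str:
    """Hash all files under root in sorted order.

    Relative paths are mixed in, so renaming or moving a file changes the
    digest even when the file contents stay the same.
    """
    root = Path(root)
    hasher = hashlib.sha256()
    for path in sorted(root.rglob("*")):
        if path.is_file():
            hasher.update(path.relative_to(root).as_posix().encode("utf-8"))
            hasher.update(sha256_of_file(path).encode("ascii"))
    return hasher.hexdigest()
```

For example, `f"sha256={sha256_of_file('data/raw/train.csv')}"` would produce a value in the format used by the Raw data hash field; the path here is a placeholder.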
Use templates/pipeline-plan.md.
Save to:
code/.agent/data-pipeline-plan.md
If no code/ directory exists, save to:
.agent/data-pipeline-plan.md
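The save-location rule above can be expressed as a small helper. This is an illustrative sketch; `plan_path` is a hypothetical name, and the project root is assumed to be passed in by the caller.

```python
from pathlib import Path


def plan_path(project_root) -> Path:
    """Return where the pipeline plan should be saved for this project."""
    root = Path(project_root)
    if (root / "code").is_dir():
        return root / "code" / ".agent" / "data-pipeline-plan.md"
    # Fall back to a top-level .agent/ directory when there is no code/ dir.
    return root / ".agent" / "data-pipeline-plan.md"
```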
When updating, preserve stable decisions and add a compact change note.
After completing a pipeline plan or audit:
- memory/claim-board.md
- memory/risk-board.md
- memory/action-board.md
- memory/evidence-board.md when they are cited in the paper
- memory/current-status.md when the dataset version or split protocol changes

- experiment-design-planner: align experimental design with the finalized split protocol
- baseline-selection-audit: recheck baseline comparability after split or preprocessing changes
- result-diagnosis: diagnose surprising results that may trace to a data issue
- paper-evidence-board: add dataset protocol details to the paper's evidence slots
- init-python-project: scaffold data/ layout and pipeline scripts

Before finalizing: