.claude/skills/data-prep/SKILL.md
WF4 Data engineering and subset generation. Analyzes dataset format and distribution, generates appropriate training subset strategies by project type (NVS/detection/segmentation, etc.), creates data pipeline scripts, and outputs a statistics report.
Note: WF4 itself must ensure dataset paths are written into CLAUDE.md; do not leave this as a best-effort downstream refresh.
First, read PROJECT_STATE.json to get dataset_name and codebase_path. For the output format, see templates/dataset-stats.md. For language behavior, see ../../shared/language-policy.md. </context>
<instructions>
1. **Parse Input**
Obtain from PROJECT_STATE.json and $ARGUMENTS:
- dataset_name: Dataset name
- dataset_path: Dataset storage path
- subset_strategy: Subset strategy (optional, auto-inferred)
2. **Auto-detect Data Format**
Check file types in dataset_path to determine the project type:
| Format Indicator | Project Type | Recommended Subset Strategy |
|-----------------|-------------|---------------------------|
| transforms_*.json (Blender JSON) | NVS / 3DGS | Downscale resolution / select scenes |
| instances_*.json (COCO) | Object Detection | Stratified sampling 10% |
| images/ + labels/ (YOLO) | Object Detection | Stratified sampling 10% |
| point_cloud/ + images/ | 3D Reconstruction | Downscale resolution / select viewpoints |
| COLMAP sparse/ | SfM / NeRF | Downscale resolution / select scenes |
| Other | Confirm with user | Custom |
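The detection table above could be implemented roughly as follows. This is a minimal sketch, not the skill's actual code: the function name `detect_project_type` is hypothetical, and it assumes that the directory indicators (`images/`, `labels/`, `point_cloud/`, `sparse/`) sit directly under dataset_path while the JSON indicators may be nested.

```python
from pathlib import Path
from typing import Tuple


def detect_project_type(dataset_path: str) -> Tuple[str, str]:
    """Return (project_type, recommended_subset_strategy) for a dataset directory.

    Hypothetical helper that mirrors the format table row by row.
    """
    root = Path(dataset_path)

    # Blender JSON: transforms_train.json, transforms_test.json, ...
    if any(root.rglob("transforms_*.json")):
        return "NVS / 3DGS", "Downscale resolution / select scenes"
    # COCO: annotation files named instances_<split>.json
    if any(root.rglob("instances_*.json")):
        return "Object Detection", "Stratified sampling 10%"
    # YOLO: sibling images/ and labels/ directories
    if (root / "images").is_dir() and (root / "labels").is_dir():
        return "Object Detection", "Stratified sampling 10%"
    if (root / "point_cloud").is_dir() and (root / "images").is_dir():
        return "3D Reconstruction", "Downscale resolution / select viewpoints"
    # COLMAP: sparse/ reconstruction directory
    if (root / "sparse").is_dir():
        return "SfM / NeRF", "Downscale resolution / select scenes"
    # No indicator matched: the table says to confirm with the user
    return "Other", "Custom"
```

The checks run in table order, so a directory matching several indicators resolves to the first matching row. When the result is "Other", the caller would ask the user instead of guessing.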
3. **Analyze Raw Data Distribution**
Generate different statistics depending on data type:
NVS / 3DGS Projects (Blender JSON / COLMAP):
Object Detection Projects (COCO/YOLO):
4. **Generate Subset Strategy**
NVS / 3DGS Subset Strategy (cannot randomly drop views, as it would break reconstruction quality):
configs/subset_config.json
Object Detection Subset Strategy:
configs/subset_indices.json
5. **Generate Data Pipeline Script**
Save the script to src/data/Data_Pipeline_Script.py (or adapt existing data loading code).
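As one possible shape for the subset-selection part of such a script, the object-detection strategy from the format table (stratified sampling 10%) might look like the sketch below. Several things here are assumptions rather than part of the skill: the function names, the use of one stratum label per sample (for example, an image's dominant category), the rule of keeping at least one sample per class, and the `{"indices": [...]}` layout written to configs/subset_indices.json.

```python
import json
import random
from collections import defaultdict
from pathlib import Path


def stratified_subset(labels, fraction=0.10, seed=0):
    """Pick about `fraction` of the sample indices from each class.

    labels[i] is the stratum of sample i (assumed: one label per sample).
    Every class keeps at least one sample so rare classes are not dropped.
    """
    by_class = defaultdict(list)
    for index, label in enumerate(labels):
        by_class[label].append(index)

    rng = random.Random(seed)  # fixed seed keeps the subset reproducible
    selected = []
    for label in sorted(by_class, key=str):
        indices = by_class[label]
        keep = max(1, round(len(indices) * fraction))
        selected.extend(rng.sample(indices, keep))
    return sorted(selected)


def write_subset_indices(indices, path="configs/subset_indices.json"):
    """Persist the selected indices (the JSON layout is an assumed schema)."""
    out = Path(path)
    out.parent.mkdir(parents=True, exist_ok=True)
    out.write_text(json.dumps({"indices": indices}, indent=2))
```

Sampling per class rather than over the whole dataset keeps the class proportions of the subset close to those of the full set. This applies to detection data only; for NVS / 3DGS data, views should not be dropped at random.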
The script must include:
6. **Output Statistics Report**
Write to docs/Dataset_Stats.md, including:
Preserve the template structure, but localize headings and narrative text according to ../../shared/language-policy.md unless a field is explicitly marked English-only.
7. **Update Project State**
Update PROJECT_STATE.json:
- current_stage.status → "completed"
- artifacts.data_pipeline_script → script path
- artifacts.dataset_stats → statistics report path
- dataset_paths → normalized dataset paths
- history → append completion record
8. **Sync CLAUDE.md**
Before WF4 concludes, trigger /init-project update or an equivalent section-safe update
to ensure CLAUDE.md's ### Dataset Paths is consistent with PROJECT_STATE.json.dataset_paths.
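The PROJECT_STATE.json update from the Update Project State step can be sketched as below. It would run before the CLAUDE.md sync, so that the sync reads the final dataset_paths. The key names come from the list in that step; the function name `mark_wf4_completed` and the fields of the history record are assumptions, since the actual schema is not shown here.

```python
import json
from datetime import datetime, timezone
from pathlib import Path


def mark_wf4_completed(state_path, script_path, stats_path, dataset_paths):
    """Record WF4 completion in PROJECT_STATE.json and return the new state."""
    state_file = Path(state_path)
    state = json.loads(state_file.read_text())

    state.setdefault("current_stage", {})["status"] = "completed"
    artifacts = state.setdefault("artifacts", {})
    artifacts["data_pipeline_script"] = script_path
    artifacts["dataset_stats"] = stats_path
    state["dataset_paths"] = dataset_paths

    # Assumed record shape; adapt to whatever the history entries really contain.
    state.setdefault("history", []).append({
        "stage": "WF4",
        "status": "completed",
        "timestamp": datetime.now(timezone.utc).isoformat(),
    })

    state_file.write_text(json.dumps(state, indent=2))
    return state
```

Using `setdefault` leaves any other keys already in the file untouched, so the update only changes the fields this step is responsible for.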