skills/training-data-curation/SKILL.md
Guidelines for creating high-quality datasets for LLM post-training (SFT/DPO/RLHF). Use when preparing data for fine-tuning, evaluating data quality, or designing data collection strategies.
npx skillsauth add sundial-org/skills training-data-curation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Best practices for gathering and preparing training data for LLM fine-tuning.
Quality over quantity. Llama 2 used only 27,540 high-quality SFT examples and outperformed models trained on larger noisy datasets [1]. Focus on clean, diverse, well-formatted data.
Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.
Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.
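Manual inspection is easier with a small, reproducible random sample. The sketch below assumes the data is stored as JSONL (one JSON object per line); the helper name `sample_records` and the fixed seed are illustrative choices, not part of any particular toolkit.

```python
import json
import random


def sample_records(lines, k=20, seed=0):
    """Return up to k randomly chosen parsed JSONL records for manual review."""
    # Skip blank lines so trailing newlines in the file do not break parsing.
    records = [json.loads(line) for line in lines if line.strip()]
    # A fixed seed means the same sample can be pulled again later.
    rng = random.Random(seed)
    return rng.sample(records, min(k, len(records)))


if __name__ == "__main__":
    # Hypothetical in-memory data; in practice pass an open file handle instead.
    jsonl = [
        '{"messages": [{"role": "user", "content": "Hi"}, {"role": "assistant", "content": "Hello."}]}',
        '{"messages": [{"role": "user", "content": "2+2?"}, {"role": "assistant", "content": "4"}]}',
    ]
    for record in sample_records(jsonl, k=1):
        print(json.dumps(record, indent=2))
```

Reading the sampled records in full, rather than skimming aggregate statistics, is what tends to surface formatting problems and off-distribution examples.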
Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:
{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}
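A record in this format can be checked mechanically before training. The following is a minimal sketch, not a reference validator: it assumes roles are limited to `system`, `user`, and `assistant`, that `content` is a plain string, and that an SFT example should end on an assistant turn. The function name `validate_messages` is hypothetical.

```python
# Assumption: only these three roles are expected in the dataset.
VALID_ROLES = {"system", "user", "assistant"}


def validate_messages(record):
    """Return a list of problems found in one chat-format record (empty if none)."""
    messages = record.get("messages") if isinstance(record, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["'messages' must be a non-empty list"]

    problems = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not an object")
            continue
        if msg.get("role") not in VALID_ROLES:
            problems.append(f"message {i}: unexpected role {msg.get('role')!r}")
        content = msg.get("content")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty or non-string content")

    # Assumption: the final turn is the assistant response being trained on.
    last = messages[-1]
    if isinstance(last, dict) and last.get("role") != "assistant":
        problems.append("last message should come from the assistant")
    return problems
```

Returning a list of problems instead of raising on the first one makes it easier to tally which issues are most common across a dataset.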
A system message, when used, takes the same shape:
{"role": "system", "content": "..."}
DPO requires paired comparisons [2]:
{"prompt": "...", "chosen": "...", "rejected": "..."}
`chosen` and `rejected` must respond to the same prompt.
For KTO, pairs aren't required, just binary labels on completions [7]:
{"prompt": "...", "completion": "...", "label": true/false}
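The paired and binary-label formats above are closely related, and a paired record can be split into two binary-labelled ones. The sketch below is one possible way to check pairs and reuse them as binary-labelled data; the function names `validate_pair` and `pair_to_binary` are hypothetical, and the "identical responses" check is an added assumption about what makes a pair uninformative.

```python
def validate_pair(record):
    """Return problems with one preference pair (empty list if it looks usable)."""
    problems = []
    for key in ("prompt", "chosen", "rejected"):
        value = record.get(key)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"missing or empty {key!r}")
    # Assumption: a pair whose two responses are identical carries no signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


def pair_to_binary(record):
    """Split one pair into two binary-labelled completions for the same prompt."""
    return [
        {"prompt": record["prompt"], "completion": record["chosen"], "label": True},
        {"prompt": record["prompt"], "completion": record["rejected"], "label": False},
    ]
```

Because both responses live in a single record under a single `prompt` field, the requirement that they answer the same prompt holds by construction; the check to add in practice is that the pair was not assembled from responses to different prompts upstream.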
RLHF reward modeling needs ranked responses [1]:
{"prompt": "...", "responses": ["best", "second", "worst"]}
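A ranking can be expanded into pairwise comparisons when a trainer expects `chosen`/`rejected` records. This is a sketch under the assumption that `responses` is ordered from best to worst, as in the example above; `ranked_to_pairs` is a hypothetical helper name.

```python
from itertools import combinations


def ranked_to_pairs(record):
    """Expand a best-to-worst ranking into every (chosen, rejected) pair."""
    prompt = record["prompt"]
    responses = record["responses"]
    # combinations() preserves input order, so `better` always outranks `worse`.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(responses, 2)
    ]
```

A ranking of n responses yields n(n-1)/2 pairs, so three ranked responses produce three comparisons.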
Before training, verify:
| Issue | Detection | Fix | Source |
|-------|-----------|-----|--------|
| Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [3] |
| Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [8] |
| Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [4] |
| Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [8] |
| Wrong language | Language detection | fastText classifier, filter to target | [3] |
| Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [8] |
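Several of the checks in the table need only the standard library. The sketch below covers exact-duplicate removal, the trigram and alphabetic-ratio thresholds, the boilerplate keywords, and the word-count minimum. The details are assumptions: trigrams are taken over words, the alphabetic ratio ignores whitespace, duplicates are compared after lowercasing and whitespace normalisation, and all function names are hypothetical. Near-duplicate detection (MinHash) and language detection (fastText) need external libraries and are left out.

```python
import hashlib
import re

# Keywords taken from the table; extend for your own corpus.
BOILERPLATE = re.compile(r"\b(subscribe|cookie policy)\b", re.IGNORECASE)


def dedup_exact(texts):
    """Drop exact duplicates, keeping the first occurrence of each text."""
    seen, kept = set(), []
    for text in texts:
        # Assumption: case and whitespace differences do not make texts distinct.
        normalised = " ".join(text.lower().split())
        key = hashlib.sha256(normalised.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            kept.append(text)
    return kept


def unique_trigram_ratio(text):
    """Fraction of word trigrams that are distinct; 1.0 if there are none."""
    words = text.lower().split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    return len(set(trigrams)) / len(trigrams) if trigrams else 1.0


def alpha_ratio(text):
    """Fraction of non-whitespace characters that are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    return sum(c.isalpha() for c in chars) / len(chars) if chars else 0.0


def quality_flags(text, min_words=100):
    """Return the names of the table's checks that this text fails."""
    flags = []
    if len(text.split()) < min_words:
        flags.append("too_short")
    if unique_trigram_ratio(text) < 0.30:
        flags.append("repetitive")
    if alpha_ratio(text) < 0.50:
        flags.append("low_alpha")
    if BOILERPLATE.search(text):
        flags.append("boilerplate")
    return flags
```

A typical use would be `clean = [t for t in dedup_exact(texts) if not quality_flags(t)]`, with the thresholds tuned after reviewing what each flag actually catches on a sample.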
High quality:
Medium quality:
Use with caution:
| Dataset Size | Use Case | Source |
|--------------|----------|--------|
| 100-1K | Quick experiments, specific behaviors | - |
| 1K-10K | Production SFT, domain adaptation | - |
| 10K-100K | Comprehensive instruction tuning | [1] |
| 1M+ preference pairs | Large-scale RLHF | [1] |
Llama 2 used ~27K SFT examples and 1M+ preference comparisons [1].