Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

sundial-org/training-data-curation

Name: training-data-curation
Author: sundial-org

skills/training-data-curation/SKILL.md

npx skillsauth add sundial-org/skills training-data-curation

Training Data Curation Guidelines

Best practices for gathering and preparing training data for LLM fine-tuning.

Data Quality Principles

Quality over quantity. Llama 2's SFT stage used only 27,540 high-quality annotated examples; its authors report that setting aside millions of noisier third-party examples in favor of this smaller set notably improved results [1]. Focus on clean, diverse, well-formatted data.

Garbage in, garbage out. The model will learn patterns from your data—including errors, biases, and formatting issues. Inspect samples manually before training.

Match the target distribution. Training data should reflect the tasks and style you want the model to perform. If you want formal responses, don't train on casual chat data.

Format Requirements

Supervised Fine-Tuning (SFT)

Use the messages format (OpenAI/Anthropic/Tinker standard) [5]:

{"messages": [{"role": "user", "content": "..."}, {"role": "assistant", "content": "..."}]}

Each sample is a complete conversation
Multi-turn: alternate user/assistant messages
System prompts optional: {"role": "system", "content": "..."}
JSONL format, one sample per line
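
The format rules above can be checked mechanically before training. The following is a minimal sketch, not part of the original skill: the function name `validate_sft_line` is hypothetical, and it assumes that a system prompt, if present, comes first and that the remaining turns start with a user message.

```python
import json

# Roles that take part in the alternating conversation (assumption: user speaks first).
TURN_ROLES = ("user", "assistant")


def validate_sft_line(line: str) -> list:
    """Return a list of problems found in one JSONL line (empty list means OK)."""
    try:
        sample = json.loads(line)
    except json.JSONDecodeError as exc:
        return [f"invalid JSON: {exc.msg}"]

    messages = sample.get("messages") if isinstance(sample, dict) else None
    if not isinstance(messages, list) or not messages:
        return ["missing or empty 'messages' list"]

    problems = []
    roles = []
    for i, msg in enumerate(messages):
        if not isinstance(msg, dict):
            problems.append(f"message {i}: not a JSON object")
            continue
        role, content = msg.get("role"), msg.get("content")
        roles.append(role)
        if role != "system" and role not in TURN_ROLES:
            problems.append(f"message {i}: unexpected role {role!r}")
        if not isinstance(content, str) or not content.strip():
            problems.append(f"message {i}: empty content")

    # Assumption: an optional system prompt is only allowed as the first message.
    if "system" in roles[1:]:
        problems.append("system prompt must be the first message")

    # Multi-turn samples should alternate user/assistant.
    turns = [r for r in roles if r in TURN_ROLES]
    if any(r != TURN_ROLES[i % 2] for i, r in enumerate(turns)):
        problems.append("user/assistant turns do not alternate")
    return problems
```

Running this over every line of a JSONL file and collecting the non-empty results gives a quick schema report before any training job is launched.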

Preference Learning (DPO/ORPO/KTO)

DPO and ORPO require paired comparisons [2]:

{"prompt": "...", "chosen": "...", "rejected": "..."}

chosen and rejected must respond to the same prompt
Quality difference should be clear and consistent
Annotator agreement >70% indicates usable samples [1]

For KTO, pairs aren't required; each completion carries a binary label, true for desirable and false for undesirable [7]:

{"prompt": "...", "completion": "...", "label": true}
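
If data already exists in the paired format, it can be reused for KTO by splitting each pair into two labeled completions. This is a rough sketch under that assumption; `check_pair` and `pair_to_kto` are hypothetical helper names, not functions from any library.

```python
import json


def check_pair(record: dict) -> list:
    """Return problems with one {"prompt", "chosen", "rejected"} record."""
    problems = []
    for field in ("prompt", "chosen", "rejected"):
        value = record.get(field)
        if not isinstance(value, str) or not value.strip():
            problems.append(f"empty field: {field}")
    # A pair whose two sides are identical carries no preference signal.
    if not problems and record["chosen"].strip() == record["rejected"].strip():
        problems.append("chosen and rejected are identical")
    return problems


def pair_to_kto(record: dict) -> list:
    """Split one paired record into two KTO-style records with binary labels."""
    return [
        {"prompt": record["prompt"], "completion": record["chosen"], "label": True},
        {"prompt": record["prompt"], "completion": record["rejected"], "label": False},
    ]


if __name__ == "__main__":
    pair = {"prompt": "Summarize X.", "chosen": "A short summary.", "rejected": "idk"}
    for row in pair_to_kto(pair):
        # json.dumps writes Python True/False as JSON true/false.
        print(json.dumps(row))
```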

Reward Modeling (RLHF)

Needs ranked responses [1]:

{"prompt": "...", "responses": ["best", "second", "worst"]}
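
Many reward-model trainers consume pairwise comparisons rather than full rankings, so a ranked record is often expanded into pairs first. The sketch below assumes the `responses` list is ordered best to worst, as in the example above; `ranked_to_pairs` is a hypothetical name.

```python
from itertools import combinations


def ranked_to_pairs(record: dict) -> list:
    """Expand a ranked record into every (better, worse) comparison.

    Assumes record["responses"] is ordered from best to worst.
    """
    prompt = record["prompt"]
    responses = record["responses"]
    # combinations() keeps input order, so the first element of each
    # tuple is always ranked higher than the second.
    return [
        {"prompt": prompt, "chosen": better, "rejected": worse}
        for better, worse in combinations(responses, 2)
    ]
```

A ranking of n responses yields n(n-1)/2 pairs, so long rankings expand quickly; adjacent-only pairs are a cheaper alternative when that matters.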

Quality Checklist

Before training, verify:

[ ] No duplicates — exact and near-duplicate removal [3]
[ ] No empty fields — all required fields populated
[ ] Consistent format — schema matches throughout
[ ] Appropriate length — not too short (noise) or too long (truncation)
[ ] Clean text — proper encoding, no HTML/boilerplate artifacts [8]
[ ] Manual inspection — reviewed random sample of 50-100 examples
[ ] No PII/sensitive data — unless intentionally included
[ ] License verified — legal to use for training
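
Several checklist items can be partly automated. The following sketch covers exact-duplicate removal, empty-field detection, and drawing a random sample for manual review; the helper names are hypothetical, and near-duplicate detection (for example MinHash) is deliberately not shown.

```python
import hashlib
import json
import random


def sample_key(sample: dict) -> str:
    """Hash a sample in canonical form so key order does not matter."""
    canonical = json.dumps(sample, sort_keys=True, ensure_ascii=False)
    return hashlib.sha256(canonical.encode("utf-8")).hexdigest()


def drop_exact_duplicates(samples: list) -> list:
    """Keep the first occurrence of each sample (exact matches only)."""
    seen, unique = set(), []
    for sample in samples:
        key = sample_key(sample)
        if key not in seen:
            seen.add(key)
            unique.append(sample)
    return unique


def find_empty_fields(sample: dict, required: list) -> list:
    """Return the required fields that are missing, None, or blank."""
    def is_empty(value) -> bool:
        if value is None:
            return True
        if isinstance(value, str):
            return not value.strip()
        if isinstance(value, (list, dict)):
            return len(value) == 0
        return False
    return [field for field in required if is_empty(sample.get(field))]


def review_sample(samples: list, k: int = 50, seed: int = 0) -> list:
    """Draw up to k random samples for manual inspection (seeded for repeatability)."""
    rng = random.Random(seed)
    return rng.sample(samples, min(k, len(samples)))
```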

Common Quality Issues

| Issue | Detection | Fix | Source |
|-------|-----------|-----|--------|
| Duplicates | Hash-based dedup | Remove exact matches, MinHash for near-dupes | [3] |
| Boilerplate | Keyword filter | Remove "subscribe", "cookie policy", etc. | [8] |
| Repetitive text | N-gram analysis | Flag if <30% unique trigrams | [4] |
| Low-quality text | Alpha ratio | Remove if <50% alphabetic characters | [8] |
| Wrong language | Language detection | fastText classifier, filter to target | [3] |
| Too short | Length check | Minimum 3-5 sentences, 100+ words for documents | [8] |
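
Three of the detection heuristics in the table are simple enough to sketch directly. The thresholds and keywords come from the table; everything else is an assumption made for illustration: trigrams are taken over whitespace-separated words, the alphabetic ratio ignores whitespace, and the function names are hypothetical.

```python
BOILERPLATE_KEYWORDS = ("subscribe", "cookie policy")  # from the table; extend as needed


def unique_trigram_ratio(text: str) -> float:
    """Share of word trigrams that are distinct (1.0 if the text is too short)."""
    words = text.split()
    trigrams = [tuple(words[i:i + 3]) for i in range(len(words) - 2)]
    if not trigrams:
        return 1.0
    return len(set(trigrams)) / len(trigrams)


def alpha_ratio(text: str) -> float:
    """Share of non-whitespace characters that are alphabetic."""
    chars = [c for c in text if not c.isspace()]
    if not chars:
        return 0.0
    return sum(c.isalpha() for c in chars) / len(chars)


def quality_flags(text: str) -> list:
    """Return the names of the heuristics that this text trips."""
    flags = []
    lowered = text.lower()
    if any(keyword in lowered for keyword in BOILERPLATE_KEYWORDS):
        flags.append("boilerplate")
    if unique_trigram_ratio(text) < 0.30:
        flags.append("repetitive")
    if alpha_ratio(text) < 0.50:
        flags.append("low_alpha")
    return flags
```

These are coarse filters; it is worth reviewing a sample of what each one rejects before applying it to a whole dataset.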

Data Sources

High quality:

Curated human annotations [1]
Expert-written examples
Filtered high-quality web data [3]

Medium quality:

Synthetic data from stronger models (distillation)
Community Q&A with voting signals
Filtered user-generated content

Use with caution:

Raw web scrapes
Unfiltered synthetic data
Data without clear provenance [6]

Sizing Guidelines

| Dataset Size | Use Case | Source |
|--------------|----------|--------|
| 100-1K | Quick experiments, specific behaviors | — |
| 1K-10K | Production SFT, domain adaptation | — |
| 10K-100K | Comprehensive instruction tuning | [1] |
| 1M+ preference pairs | Large-scale RLHF | [1] |

Llama 2 used ~27K SFT examples and 1M+ preference comparisons [1].

File Format

JSONL — one JSON object per line, human-readable
Parquet — efficient for large datasets, built-in compression [3]
Sharding — split files >500MB into chunks
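
The sharding rule can be applied while writing JSONL. This is a minimal sketch: the function name, the file-naming scheme, and reading "500MB" as 500 × 1024 × 1024 bytes are all assumptions for illustration.

```python
import json
from pathlib import Path


def write_jsonl_shards(samples, out_dir, prefix="train", max_bytes=500 * 1024 * 1024):
    """Write samples as JSONL, starting a new shard before max_bytes is exceeded.

    Returns the list of shard paths. A single sample larger than max_bytes
    still gets written, alone in its own shard.
    """
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    paths, handle, written = [], None, 0
    try:
        for sample in samples:
            line = json.dumps(sample, ensure_ascii=False) + "\n"
            size = len(line.encode("utf-8"))
            if handle is None or (written > 0 and written + size > max_bytes):
                if handle is not None:
                    handle.close()
                path = out / f"{prefix}-{len(paths):05d}.jsonl"
                # newline="\n" keeps the byte count accurate on every platform.
                handle = path.open("w", encoding="utf-8", newline="\n")
                paths.append(path)
                written = 0
            handle.write(line)
            written += size
    finally:
        if handle is not None:
            handle.close()
    return paths
```

Splitting on sample boundaries keeps every shard a valid JSONL file on its own, which a byte-level split would not guarantee.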

References

[1] Llama 2 Paper: Touvron et al. (2023). SFT/RLHF data quality practices, 27K SFT examples, >70% annotator agreement threshold
[2] TRL Library: Hugging Face trainer implementations for SFT, DPO, KTO, ORPO
[3] FineWeb Paper: Penedo et al. (2024). Large-scale filtering: MinHash dedup, language detection, quality classifiers
[4] Data-Juicer: Alibaba's quality filtering toolkit with repetition filters, n-gram analysis
[5] Tinker API: Training API using messages format for SFT, DPO/RLHF support
[6] Data Provenance Initiative: Longpre et al. (2023). Dataset licensing and attribution audit
[7] KTO Paper: Ethayarajh et al. (2024). Binary preference learning without pairs
[8] C4/T5 Paper: Raffel et al. (2020). Foundational filtering: terminal punctuation, min sentences, alpha ratio, boilerplate removal

Guidelines for creating high-quality datasets for LLM post-training (SFT/DPO/RLHF). Use when preparing data for fine-tuning, evaluating data quality, or designing data collection strategies.

148 stars

testing

Updated Apr 15, 2026

npx skillsauth add sundial-org/skills training-data-curation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean, 95%
Semgrep: Static code analysis for vulnerabilities. Clean, 95%
mcp-scan (Snyk): Model Context Protocol security validation. Clean, 95%
Snyk (dep): Open source security scanning. Skipped, 50%
Socket.dev: Supply chain security analysis. Skipped, 50%
VirusTotal: Multi-engine malware detection. Skipped, 50%
CrowdStrike: Advanced threat intelligence. Skipped, 50%
OSV-Scanner: Open Source Vulnerability database check. Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 15, 2026, 5:11 AM (15.4s, 1 file scanned)

SKILL.md

name: training-data-curation
description: Guidelines for creating high-quality datasets for LLM post-training (SFT/DPO/RLHF). Use when preparing data for fine-tuning, evaluating data quality, or designing data collection strategies.

Related Skills

sundial-org/cs448b-visualization
development · Verified · Trusted · Community
Data visualization design based on Stanford CS448B. Use for: (1) choosing chart types, (2) selecting visual encodings, (3) critiquing visualizations, (4) building D3.js visualizations, (5) designing interactions/animations, (6) choosing colors, (7) visualizing networks, (8) visualizing text. Covers Bertin, Mackinlay, Cleveland & McGill.
148 · SKILL.md · Updated Apr 15, 2026

sundial-org/tinker
development · Verified · Trusted · Community
Fine-tune LLMs using the Tinker API. Covers supervised fine-tuning, reinforcement learning, LoRA training, vision-language models, and both high-level Cookbook patterns and low-level API usage.
148 · SKILL.md · Updated Apr 15, 2026

sundial-org/tinker-training-cost
data-ai · Verified · Trusted · Community
Calculate training costs for Tinker fine-tuning jobs. Use when estimating costs for Tinker LLM training, counting tokens in datasets, or comparing Tinker model training prices. Tokenizes datasets using the correct model tokenizer and provides accurate cost estimates.
148 · SKILL.md · Updated Apr 15, 2026

sundial-org/skill
data-ai · Verified · Trusted · Community
Find, install, create, improve, and publish AI agent skills through the Sundial ecosystem. Use when the user wants to find or search for skills, install a skill, create a new skill, improve or evaluate an existing skill, or publish a skill to Sundial Hub. Trigger phrases include "find a skill", "install skill", "create a skill", "make a skill", "improve this skill", "evaluate skill", "publish skill", "push skill", "search for skills".
148 · SKILL.md · Updated Apr 15, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

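# Clone the repo
git clone https://github.com/sundial-org/skills.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it (the first "skills" is the cloned repo, the second is its subfolder)
cp -r skills/skills/training-data-curation ~/.claude/skills/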

Claude Code Skills — official skills path docs.

Repository

sundial-org/skills

148 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT