skills/fine-tuning-dataset-curator/SKILL.md
Prepare high-quality datasets for LLM fine-tuning with filtering, deduplication, augmentation, and RLHF data formatting. Activate on: fine-tuning data, training data curation, RLHF dataset, data quality filtering, SFT dataset. NOT for: model training infrastructure (ai-engineer), prompt engineering without fine-tuning (prompt-engineer).
`npx skillsauth add curiositech/windags-skills fine-tuning-dataset-curator`

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Prepare, filter, deduplicate, and format high-quality datasets for supervised fine-tuning (SFT), RLHF, and DPO training of language models.
Activate on: "fine-tuning dataset", "training data preparation", "RLHF data", "DPO pairs", "SFT data", "data quality filtering", "dedup training data", "curate instruction dataset", "preference data"
NOT for: Model training loop implementation (ai-engineer), prompt optimization without fine-tuning (prompt-engineer), or general ETL pipelines (data-pipeline-engineer)
| Domain | Technologies | Notes |
|--------|-------------|-------|
| Quality Filtering | fasttext classifiers, perplexity scoring, regex rules | Remove noise before it poisons the model |
| Deduplication | MinHash (datasketch), exact hash, SimHash | Near-dedup critical for training stability |
| Augmentation | LLM-generated paraphrases, backtranslation, persona variation | 3-5x dataset size with diversity |
| Format Conversion | chat-ml, Alpaca, ShareGPT, OpenAI JSONL | Match target training framework |
| PII Removal | presidio, regex, spaCy NER | Legal requirement for most training data |
| RLHF/DPO Prep | Preference pair generation, reward model labeling | Chosen/rejected pairs with margin scoring |
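As an illustration of the Format Conversion row, here is a minimal sketch that maps Alpaca-style and ShareGPT-style records onto chat `messages`. The function names and the `SHAREGPT_ROLES` mapping are hypothetical helpers, and the sketch assumes the commonly used field layouts (`instruction`/`input`/`output` for Alpaca, `conversations` with `from`/`value` for ShareGPT); check them against your actual source data.

```python
# Hypothetical helpers; field names assume the common Alpaca and ShareGPT layouts.
SHAREGPT_ROLES = {"human": "user", "gpt": "assistant", "system": "system"}


def alpaca_to_messages(record: dict) -> dict:
    """Convert one Alpaca-style record into a chat-style training example."""
    prompt = record["instruction"]
    # Alpaca records may carry optional extra context in "input".
    if record.get("input"):
        prompt = f"{prompt}\n\n{record['input']}"
    return {
        "messages": [
            {"role": "user", "content": prompt},
            {"role": "assistant", "content": record["output"]},
        ]
    }


def sharegpt_to_messages(record: dict) -> dict:
    """Convert one ShareGPT-style record into a chat-style training example."""
    messages = []
    for turn in record["conversations"]:
        role = SHAREGPT_ROLES.get(turn["from"])
        if role is None:
            # Fail loudly rather than silently mislabeling a speaker.
            raise ValueError(f"unknown ShareGPT speaker: {turn['from']!r}")
        messages.append({"role": role, "content": turn["value"]})
    return {"messages": messages}
```

Raising on an unknown speaker is a deliberate choice here: a silently dropped or mislabeled turn is hard to spot once the data is mixed into a training set.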
Raw Sources ──→ [Extract] ──→ [Filter] ──→ [Dedup] ──→ [Augment] ──→ [Format] ──→ [Validate]
    │               │            │            │            │            │             │
logs, docs      parse to      quality      MinHash     paraphrase    chat-ml      hold-out
APIs, CSVs      instruction/  scoring      near-dedup  via LLM or    JSONL        eval set
                response      remove <                 persona                    distribution
                pairs         threshold                variation                  check
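The final [Validate] stage (hold-out eval set plus a distribution check) could be sketched as follows. This is an assumption-laden illustration, not the skill's own implementation: `split_holdout`, `length_histogram`, the 5% default fraction, and the 50-word bin width are all hypothetical choices.

```python
import hashlib
from collections import Counter


def split_holdout(examples: list[dict], holdout_fraction: float = 0.05):
    """Deterministically split examples into (train, holdout) by hashing the instruction.

    Hashing keeps the split stable across runs, so an instruction never
    migrates between train and eval when the dataset is regenerated.
    """
    train, holdout = [], []
    for ex in examples:
        digest = hashlib.sha256(ex["instruction"].encode("utf-8")).digest()
        bucket = int.from_bytes(digest[:8], "big") / 2**64  # roughly uniform in [0, 1)
        (holdout if bucket < holdout_fraction else train).append(ex)
    return train, holdout


def length_histogram(examples: list[dict], bin_width: int = 50) -> dict[int, int]:
    """Count responses per word-length bin as a quick distribution check."""
    counts = Counter()
    for ex in examples:
        n_words = len(ex["response"].split())
        counts[(n_words // bin_width) * bin_width] += 1
    return dict(sorted(counts.items()))
```

Comparing `length_histogram(train)` with `length_histogram(holdout)` is one cheap way to notice that filtering or augmentation has skewed the data toward very short or very long responses.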
# Quality filtering pipeline
from datasketch import MinHash, MinHashLSH


def curate_sft_dataset(raw_examples: list[dict]) -> list[dict]:
    """Filter, near-deduplicate, and format instruction/response pairs for SFT."""
    # Step 1: Basic quality filters
    filtered = []
    for ex in raw_examples:
        instruction, response = ex["instruction"], ex["response"]
        n_words = len(response.split())
        if n_words < 10:  # Too short
            continue
        if n_words > 2000:  # Too long (likely garbage)
            continue
        if instruction.strip() == "":  # Empty instruction
            continue
        if response.count("\n") > 50:  # Excessive formatting
            continue
        filtered.append(ex)

    # Step 2: Near-deduplication with MinHash (compares responses only)
    lsh = MinHashLSH(threshold=0.8, num_perm=128)
    deduped = []
    for i, ex in enumerate(filtered):
        mh = MinHash(num_perm=128)
        for word in ex["response"].split():
            mh.update(word.encode("utf-8"))
        if not lsh.query(mh):  # No near-duplicate found among kept examples
            lsh.insert(f"doc_{i}", mh)
            deduped.append(ex)

    # Step 3: Format for training (chat-style messages)
    formatted = []
    for ex in deduped:
        formatted.append({
            "messages": [
                {"role": "user", "content": ex["instruction"]},
                {"role": "assistant", "content": ex["response"]},
            ]
        })
    return formatted
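The technology table also lists exact hashing alongside MinHash. A cheap exact-duplicate pass before the near-dedup step might look like the sketch below, followed by a small JSONL serializer for the formatted output. Both `exact_dedup` and `to_jsonl` are hypothetical helpers, and the normalization (lowercasing and collapsing whitespace) is an assumption you may want to loosen or tighten.

```python
import hashlib
import json


def normalize(text: str) -> str:
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())


def exact_dedup(examples: list[dict]) -> list[dict]:
    """Keep the first occurrence of each normalized (instruction, response) pair."""
    seen = set()
    unique = []
    for ex in examples:
        # The NUL separator prevents collisions between different field splits.
        payload = normalize(ex["instruction"]) + "\x00" + normalize(ex["response"])
        key = hashlib.sha256(payload.encode("utf-8")).hexdigest()
        if key not in seen:
            seen.add(key)
            unique.append(ex)
    return unique


def to_jsonl(records: list[dict]) -> str:
    """Serialize records as JSON Lines: one JSON object per line."""
    return "\n".join(json.dumps(r, ensure_ascii=False) for r in records)
```

Running the exact pass first shrinks the input to the MinHash stage, which is the more expensive of the two.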
Instruction ──→ [Generate N responses] ──→ [Score/Rank] ──→ [Select Pairs]
                          │                     │                 │
                temperature=0.8            human eval       chosen: best
                N=4 responses per          or LLM judge     rejected: worst
                instruction                or heuristic     margin > threshold
Output format (DPO):
{
  "prompt": "Explain quantum computing",
  "chosen": "Quantum computing uses qubits...",      # High-quality response
  "rejected": "Quantum computing is computers..."    # Lower-quality response
}
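The [Select Pairs] step can be sketched as below, assuming each candidate response already carries a numeric score from a human rater, an LLM judge, or a heuristic. `select_preference_pair` and the `min_margin` default are hypothetical; the score scale is whatever your scorer produces.

```python
from typing import Optional


def select_preference_pair(
    prompt: str,
    scored_responses: list[tuple[str, float]],
    min_margin: float = 1.0,
) -> Optional[dict]:
    """Pick the best and worst response as a DPO pair if their scores differ enough.

    scored_responses holds (response_text, score) tuples; higher scores are better.
    Returns None when there are fewer than two candidates or the margin is too small.
    """
    if len(scored_responses) < 2:
        return None
    ranked = sorted(scored_responses, key=lambda pair: pair[1], reverse=True)
    best_text, best_score = ranked[0]
    worst_text, worst_score = ranked[-1]
    # Keep the pair only when margin > threshold, as in the diagram above.
    if best_score - worst_score <= min_margin:
        return None
    return {"prompt": prompt, "chosen": best_text, "rejected": worst_text}
```

Dropping low-margin pairs trades dataset size for a cleaner preference signal; prompts whose candidates all score about the same contribute little to a preference objective.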
10 Gold Examples ──→ [LLM Generator] ──→ [Quality Filter] ──→ 500 Examples
                            │                   │
                     "Generate 50        remove duplicates,
                     variations of       score perplexity,
                     this instruction    human spot-check
                     with different      10% sample
                     personas and
                     complexity levels"
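One way to drive the generator and the 10% human spot-check is sketched below. The persona and complexity lists, the prompt wording, and both function names are hypothetical placeholders; the actual LLM call is deliberately left out because it depends on your provider.

```python
import itertools
import random

# Hypothetical values for illustration; replace with personas that fit your domain.
PERSONAS = ["a beginner", "a senior engineer", "a busy manager"]
COMPLEXITY_LEVELS = ["simple", "intermediate", "advanced"]


def build_variation_prompts(gold_instruction: str) -> list[str]:
    """Build one generator prompt per (persona, complexity) combination."""
    prompts = []
    for persona, level in itertools.product(PERSONAS, COMPLEXITY_LEVELS):
        prompts.append(
            f"Rewrite the following instruction as if asked by {persona}, "
            f"at a {level} complexity level. Keep the underlying task the same.\n\n"
            f"Instruction: {gold_instruction}"
        )
    return prompts


def spot_check_sample(examples: list, fraction: float = 0.10, seed: int = 0) -> list:
    """Draw a reproducible random sample for human review (at least one item)."""
    if not examples:
        return []
    k = max(1, round(len(examples) * fraction))
    return random.Random(seed).sample(examples, k)
```

A fixed seed keeps the review sample reproducible, so two reviewers looking at the same dataset version see the same items.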