skills/dataset-curator/SKILL.md
Use this skill when designing, cleaning, deduplicating, or documenting datasets for model training and evaluation including schema design, class imbalance handling, and train/val/test splits. Not for running model training or hyperparameter tuning. Not for real-time data pipeline engineering.
npx skillsauth add nickcrew/claude-cortex dataset-curator
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill covers the full lifecycle of dataset creation and curation for machine learning and LLM tasks. It addresses dataset schema design, data collection strategies, quality filtering, deduplication, class imbalance mitigation, stratified train/val/test splits, annotation guideline writing, and dataset card documentation. Good datasets are the foundation of reliable models — this skill helps teams avoid the most common data quality pitfalls that lead to poor generalization, evaluation leakage, and biased models.
| Task | Approach |
|------|----------|
| Define dataset schema | List fields, types, required vs optional, allowed values, and examples |
| Remove duplicates | Hash-based exact dedup + MinHash/LSH for near-duplicate detection |
| Fix class imbalance | Oversample minority (SMOTE) or undersample majority; adjust loss weights |
| Create train/val/test splits | Stratified split by label; ensure no overlap of entities across splits |
| Document the dataset | Write a dataset card with provenance, schema, statistics, and limitations |
| Validate annotation quality | Compute inter-annotator agreement (Cohen's kappa or Krippendorff's alpha) |
| Handle missing values | Decide per-field: impute, drop row, or add "unknown" category |
| Detect label noise | Use confident learning (cleanlab) or cross-validation outlier detection |
Define the task and schema — Before collecting any data, write the schema: every field name, data type, allowed values, and whether it is required. For classification datasets, enumerate all valid labels and their definitions. Ambiguous schemas cause inconsistent annotations and training failures.
Establish collection strategy — Determine the data source: human-annotated, LLM-generated, web-scraped, synthetic, or a combination. Document collection date, source URLs, licenses, and any sampling decisions. Ensure the collection covers the full input distribution the model will encounter in production.
Write annotation guidelines — Create a guideline document for labelers that defines every label, provides positive and negative examples for each, and includes decision rules for edge cases. Pilot the guidelines with 2–3 annotators on a sample of 50 items and iterate before full annotation begins.
Run quality filtering — Remove items that are too short, too long, contain encoding errors, are in the wrong language, or fail domain-specific quality checks. Log how many items were removed at each filter step and why. Preserve a raw snapshot before filtering.
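The filtering step can be sketched as a chain of named checks that records how many items each check removed. This is a minimal illustration, not a complete filter set: the thresholds, filter names, and the `run_filters` helper are assumptions, and language or domain checks are left out because they need external libraries.

```python
from collections import Counter

# Hypothetical thresholds; tune them per dataset.
MIN_CHARS = 10
MAX_CHARS = 2000

def too_short(text):
    return len(text.strip()) < MIN_CHARS

def too_long(text):
    return len(text) > MAX_CHARS

def has_encoding_errors(text):
    # U+FFFD is the replacement character left behind by a failed decode.
    return "\ufffd" in text

# Ordered list of (name, predicate); an item is charged to the first filter it fails.
FILTERS = [
    ("too_short", too_short),
    ("too_long", too_long),
    ("encoding_error", has_encoding_errors),
]

def run_filters(items):
    """Return (kept_items, removal_counts) without modifying the input list."""
    kept, removed = [], Counter()
    for text in items:
        for name, fails in FILTERS:
            if fails(text):
                removed[name] += 1
                break
        else:
            kept.append(text)
    return kept, removed
```

Because the input list is never mutated, the raw snapshot stays available, and `removed` can be written straight into the filtering log.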
Deduplicate the dataset — Apply exact deduplication first (hash the text or key fields). Then apply near-duplicate detection using MinHash + LSH (e.g., datasketch library) or sentence embedding cosine similarity. Aim to remove items with >80% overlap. Keep the highest-quality copy when deduplicating.
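As a rough sketch of the two passes, the code below hashes normalized text for exact deduplication and then uses a brute-force word-set Jaccard comparison as a stand-in for MinHash + LSH. The function names are hypothetical, the pairwise pass is O(n²) and only suitable for small datasets, and it keeps the first copy seen rather than the highest-quality one.

```python
import hashlib

def normalize(text):
    """Lowercase and collapse whitespace so trivial variants hash identically."""
    return " ".join(text.lower().split())

def exact_dedup(texts):
    """Keep the first occurrence of each normalized text, preserving order."""
    seen, kept = set(), []
    for text in texts:
        digest = hashlib.sha256(normalize(text).encode("utf-8")).hexdigest()
        if digest not in seen:
            seen.add(digest)
            kept.append(text)
    return kept

def jaccard(a, b):
    """Jaccard similarity between the word sets of two texts."""
    sa, sb = set(normalize(a).split()), set(normalize(b).split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)

def near_dedup(texts, threshold=0.8):
    """Brute-force pass: drop a text if it is too similar to one already kept."""
    kept = []
    for text in texts:
        if all(jaccard(text, k) <= threshold for k in kept):
            kept.append(text)
    return kept
```

To keep the highest-quality copy instead of the first, one option is to sort the items by a quality score before calling `near_dedup`.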
Assess and address class imbalance — Compute class distribution. If any class has less than 5% of the majority class count, consider: (a) collecting more data for minority classes, (b) oversampling with augmentation, (c) applying class weights in the loss function, or (d) using stratified sampling. Document the chosen approach and its rationale.
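A small sketch of the distribution check, assuming labels are available as a plain list. The `class_report` helper is hypothetical; it applies the 5% rule from this step and derives inverse-frequency weights normalized so the majority class has weight 1.0.

```python
from collections import Counter

def class_report(labels, minority_ratio=0.05):
    """Return (counts, weights, flagged) for a list of class labels.

    weights: inverse-frequency weights relative to the majority class.
    flagged: labels whose count is below minority_ratio * majority count.
    """
    counts = Counter(labels)
    majority = max(counts.values())
    weights = {label: majority / n for label, n in counts.items()}
    flagged = sorted(
        label for label, n in counts.items() if n < minority_ratio * majority
    )
    return counts, weights, flagged
```

The resulting `weights` dictionary can be passed to whatever class-weight parameter the training framework exposes.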
Create stratified splits — Split data into train/val/test sets preserving class distribution in each split. Use an 80/10/10 or 70/15/15 ratio as a starting point. For datasets with identifiable entities (users, documents, companies), ensure the same entity never appears in multiple splits to prevent leakage. Test set must remain untouched until final evaluation.
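The following is a minimal standard-library sketch of a stratified split, plus a check for entity leakage between two splits. Both helper names are assumptions. The split does not enforce entity disjointness on its own; the check only reports overlaps so they can be fixed, for example by splitting at the entity level instead.

```python
import random
from collections import defaultdict

def stratified_split(labels, ratios=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists with per-label proportions preserved."""
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)
    rng = random.Random(seed)  # fixed seed keeps the split reproducible
    train, val, test = [], [], []
    for label in sorted(by_label):
        idxs = by_label[label]
        rng.shuffle(idxs)
        n_train = int(len(idxs) * ratios[0])
        n_val = int(len(idxs) * ratios[1])
        train += idxs[:n_train]
        val += idxs[n_train:n_train + n_val]
        test += idxs[n_train + n_val:]  # remainder goes to the test split
    return train, val, test

def entity_overlap(entities, split_a, split_b):
    """Entities present in both splits; a non-empty result signals leakage."""
    return {entities[i] for i in split_a} & {entities[i] for i in split_b}
```

Running `entity_overlap` on every pair of splits before training is a cheap way to catch leakage early.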
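Measure annotation quality — For human-labeled data, compute inter-annotator agreement on a sample (at least 10% of the dataset labeled by 2+ annotators). Use Cohen's kappa when exactly two annotators label each item and Krippendorff's alpha when there are more. Agreement above 0.7 is acceptable; below 0.6 indicates guideline issues, and values in between warrant a review of the most frequently confused labels. Resolve disagreements through adjudication, not random selection.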
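For the two-annotator case, Cohen's kappa can be computed directly from the two label lists. This is a minimal sketch of the standard formula, kappa = (p_o - p_e) / (1 - p_e), where p_o is observed agreement and p_e is the agreement expected by chance. The function name is an assumption.

```python
from collections import Counter

def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators who labeled the same items in the same order."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("need two equal-length, non-empty label lists")
    n = len(labels_a)
    # Observed agreement: fraction of items where both annotators chose the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement: product of each annotator's marginal label frequencies.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[label] * freq_b[label] for label in freq_a) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used one identical label throughout
    return (observed - expected) / (1 - expected)
```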
Audit for biases and coverage gaps — Analyze the dataset across demographic attributes, time periods, domains, and edge cases. Identify over-represented and under-represented slices. Document known limitations in the dataset card. If possible, compare distribution to real production data.
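One way to sketch the comparison against production data is to compute each slice's share in both distributions and flag slices that are under-represented in the dataset. The helper names and the `min_ratio` threshold of 0.5 are assumptions chosen for illustration.

```python
from collections import Counter

def slice_shares(values):
    """Map each slice value to its share of the total (empty input gives {})."""
    counts = Counter(values)
    total = sum(counts.values())
    return {k: n / total for k, n in counts.items()}

def coverage_gaps(dataset_values, production_values, min_ratio=0.5):
    """Slices whose dataset share is below min_ratio times their production share."""
    ds = slice_shares(dataset_values)
    prod = slice_shares(production_values)
    return sorted(s for s, p in prod.items() if ds.get(s, 0.0) < min_ratio * p)
```

The same function works for any categorical slice attribute, such as production line, time period, or data source.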
Write the dataset card — Document: dataset name, version, task, schema, collection methodology, source licenses, size (rows and tokens), class distribution, split sizes, known limitations and biases, and citation. Use the Hugging Face dataset card format for compatibility with the ecosystem.
Input: A support team wants to train a model to classify customer emails into 8 intent categories: billing, refund, technical-issue, account-access, product-question, cancellation, complaint, and other.
Output:
Schema definition:
{
  "id": "string (UUID)",
  "text": "string (customer email body, min 10 chars, max 2000 chars)",
  "label": "enum: [billing, refund, technical-issue, account-access, product-question, cancellation, complaint, other]",
  "confidence": "float 0.0–1.0 (annotator confidence)",
  "annotator_id": "string",
  "annotation_date": "ISO 8601 date",
  "source": "enum: [zendesk, email-archive, synthetic]"
}
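A schema like this is easier to enforce when records are checked programmatically at ingestion. The sketch below validates one record against the fields listed above using only the standard library; `validate_record` and its error messages are hypothetical, and a production pipeline might prefer a schema library instead.

```python
import uuid
from datetime import date

LABELS = {
    "billing", "refund", "technical-issue", "account-access",
    "product-question", "cancellation", "complaint", "other",
}
SOURCES = {"zendesk", "email-archive", "synthetic"}

def validate_record(rec):
    """Return a list of schema violations; an empty list means the record is valid."""
    errors = []
    try:
        uuid.UUID(str(rec.get("id")))
    except ValueError:
        errors.append("id is not a UUID")
    text = rec.get("text")
    if not isinstance(text, str) or not 10 <= len(text) <= 2000:
        errors.append("text must be a string of 10-2000 chars")
    if rec.get("label") not in LABELS:
        errors.append("label not in allowed set")
    conf = rec.get("confidence")
    if isinstance(conf, bool) or not isinstance(conf, (int, float)) or not 0.0 <= conf <= 1.0:
        errors.append("confidence must be a number in [0.0, 1.0]")
    if not isinstance(rec.get("annotator_id"), str):
        errors.append("annotator_id must be a string")
    try:
        date.fromisoformat(str(rec.get("annotation_date")))
    except ValueError:
        errors.append("annotation_date is not an ISO 8601 date")
    if rec.get("source") not in SOURCES:
        errors.append("source not in allowed set")
    return errors
```

Returning every violation at once, rather than failing on the first, makes it easier to summarize which fields cause the most rejected records.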
Collection plan:
- PII redaction: replace names with [NAME], emails with [EMAIL], account numbers with [ACCT]

Annotation guidelines excerpt:
Label: "refund"
Definition: Customer is requesting money back for a purchase.
Positive examples:
- "I'd like a refund for my order from last week"
- "Please return the charge to my credit card"
Negative examples (do NOT label as refund):
- "I want to exchange my item" → label as product-question
- "Why was I charged twice?" → label as billing
Edge case: "I want to cancel and get my money back" → label as cancellation (intent is to cancel)
Deduplication approach:
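from datasketch import MinHash, MinHashLSH

def get_minhash(text, num_perm=128):
    """Build a MinHash signature from the lowercased words of `text`."""
    m = MinHash(num_perm=num_perm)
    for word in text.lower().split():
        m.update(word.encode('utf8'))
    return m

lsh = MinHashLSH(threshold=0.8, num_perm=128)
kept = {}  # `emails` below is an assumed dict of item id -> text
for item_id, text in emails.items():
    sig = get_minhash(text)
    if not lsh.query(sig):  # no near-duplicate kept so far: keep this one
        lsh.insert(item_id, sig)
        kept[item_id] = text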
Split strategy:
Total: 5,200 items (after dedup and filtering)
Train: 4,160 (80%) — stratified by label
Val: 520 (10%) — stratified by label
Test: 520 (10%) — stratified by label; set aside immediately, not used for development
Input: A manufacturing team wants to train a binary classifier to detect product defects from camera images. They have 10,000 raw images, but only ~400 are defective.
Output:
Schema definition:
image_id: string (factory_line + timestamp)
image_path: string (relative path to image file)
label: enum: [defective, non-defective]
defect_type: enum: [scratch, dent, discoloration, missing-component, none] (sub-type for defective)
image_width: int (pixels)
image_height: int (pixels)
capture_date: ISO 8601 date
production_line: string
annotated_by: string (annotator ID or "automated")
Quality filtering rules:
Class imbalance strategy:
Original: 9,600 non-defective, 400 defective (24:1 imbalance)
Option 1 — Data augmentation for minority class:
Apply: horizontal flip, ±15° rotation, brightness ±20%, add Gaussian noise
Target: 2,000 defective images (5× augmentation)
Result: 9,600 non-defective, 2,000 defective (4.8:1 ratio) — more manageable
Option 2 — Class weighting (simpler, use if augmentation is not feasible):
class_weight = {0: 1.0, 1: 24.0} # inverse frequency weighting
Apply in model training loss function
Recommendation: Use both — augment to 2,000 AND apply 4.8:1 class weight
Dataset card excerpt:
Dataset Name: Manufacturing Defect Detection v1.2
Task: Binary image classification (defective / non-defective)
Size: 11,600 images (9,600 non-defective, 2,000 defective after augmentation)
Source: Factory line cameras, Line A and Line B, 2024-01 to 2024-06
License: Internal use only (proprietary)
Known Limitations:
- Only covers Lines A and B; Line C has different lighting conditions
- Defective samples over-represent scratches (60% of defects)
- No samples from night shift (different ambient light)
Split: Train 80% / Val 10% / Test 10% (stratified by label and production line)
- Use cleanlab to automatically detect likely mislabeled examples in existing datasets
- The Hugging Face datasets library handles streaming, caching, and map operations efficiently for large datasets
- Add a data_source field to every item; it's invaluable when debugging distribution shift
- Mark augmented samples with an augmented: true flag for traceability