skills/dataset-curator/SKILL.md
Curate and clean training datasets for high-quality machine learning
The Dataset Curator skill guides you through the critical process of preparing high-quality training data for machine learning models. Data quality is frequently the biggest lever on model performance, yet teams often underinvest in it. This skill helps you systematically clean, validate, augment, and maintain datasets that lead to better models.
From initial collection to ongoing maintenance, this skill covers deduplication, label quality assessment, bias detection, augmentation strategies, and version control. It applies best practices from production ML systems to ensure your datasets are not just clean, but strategically optimized for your learning objectives.
Whether you are building a classifier, fine-tuning an LLM, or training a custom model, this skill ensures your data foundation is solid.
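```python
def assess_quality(dataset):
    # find_duplicates, compute_entropy, detect_outliers, and
    # estimate_label_noise are placeholders for your own helpers
    return {
        "size": len(dataset),
        "duplicate_rate": find_duplicates(dataset).ratio,
        # Per-column fraction of missing cells if dataset is a pandas DataFrame
        "missing_rate": dataset.isnull().mean(),
        "label_balance": compute_entropy(dataset.labels),
        "outlier_rate": detect_outliers(dataset).ratio,
        "estimated_label_noise": estimate_label_noise(dataset),
    }
```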
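As one possible concrete reading of the metrics above, the sketch below computes a few of them for a list of `(text, label)` records using only the standard library. The record layout, the name `assess_quality_simple`, the exact-match duplicate rule, and the normalized-entropy balance score are all assumptions made for illustration; they are not part of the skill itself.

```python
import math
from collections import Counter

def assess_quality_simple(records):
    """Basic quality metrics for a list of (text, label) pairs.

    Assumptions: a value is missing if the text is empty/None or the label
    is None; duplicates are exact matches after strip() and lower().
    """
    n = len(records)
    if n == 0:
        return {"size": 0, "duplicate_rate": 0.0,
                "missing_rate": 0.0, "label_balance": 0.0}

    # Missing rate: records with an empty text or no label
    missing = sum(1 for text, label in records if not text or label is None)

    # Duplicate rate: extra copies of an already-seen normalized text
    texts = [text.strip().lower() for text, _ in records if text]
    duplicates = len(texts) - len(set(texts))

    # Label balance: entropy normalized to [0, 1]; 1.0 means uniform classes
    counts = Counter(label for _, label in records if label is not None)
    total = sum(counts.values())
    entropy = -sum((c / total) * math.log2(c / total) for c in counts.values())
    max_entropy = math.log2(len(counts)) if len(counts) > 1 else 1.0

    return {
        "size": n,
        "duplicate_rate": duplicates / n,
        "missing_rate": missing / n,
        "label_balance": entropy / max_entropy,
    }

# Toy data: one near-identical text, one empty text, one missing label
sample = [("a", "x"), ("A ", "x"), ("b", "y"), ("", "y"), ("c", None)]
report = assess_quality_simple(sample)
```

Outlier rate and label-noise estimation are left out here because they depend on the data type and on a trained model.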
| Action | Command/Trigger |
|--------|-----------------|
| Assess quality | "Check quality of this dataset" |
| Find duplicates | "Find duplicates in dataset" |
| Clean labels | "Fix mislabeled data" |
| Balance classes | "Handle class imbalance" |
| Augment data | "Augment dataset for [task]" |
| Version dataset | "Set up dataset versioning" |
- Profile Before Processing: Understand your data before changing it
- Preserve Provenance: Track every transformation
- Prioritize Label Quality: Garbage labels in, garbage model out
- Test Cleaning Impact: Measure the effect of cleaning on model performance
- Stratify Splits Carefully: Maintain the label distribution across train/val/test
- Document Everything: Future you will thank present you
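One way to apply the stratified-split practice listed above is to split each class separately, so that every split keeps roughly the same label proportions as the full dataset. The following is a minimal standard-library sketch; the function name `stratified_split`, the 80/10/10 default, and the fixed seed are illustrative assumptions.

```python
import random
from collections import defaultdict

def stratified_split(labels, fractions=(0.8, 0.1, 0.1), seed=0):
    """Return (train, val, test) index lists that preserve label proportions."""
    rng = random.Random(seed)

    # Group example indices by label
    by_label = defaultdict(list)
    for idx, label in enumerate(labels):
        by_label[label].append(idx)

    train, val, test = [], [], []
    for label in sorted(by_label, key=str):
        indices = by_label[label]
        rng.shuffle(indices)
        # Split each class on its own; the remainder goes to test
        n_train = round(len(indices) * fractions[0])
        n_val = round(len(indices) * fractions[1])
        train.extend(indices[:n_train])
        val.extend(indices[n_train:n_train + n_val])
        test.extend(indices[n_train + n_val:])
    return train, val, test

# Imbalanced toy labels: 80 of class "a", 20 of class "b"
labels = ["a"] * 80 + ["b"] * 20
train, val, test = stratified_split(labels)
```

Very small classes can end up with no validation or test examples under this rounding rule, so check per-class counts after splitting.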
Identify and fix mislabeled examples:
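```python
from cleanlab.filter import find_label_issues
from sklearn.model_selection import cross_val_predict

# Get out-of-sample predicted probabilities via cross-validation.
# Probabilities from a model scored on its own training data are
# overconfident and tend to hide label errors.
# (model is any scikit-learn compatible classifier you have defined)
pred_probs = cross_val_predict(
    model, X_train, y_train, cv=5, method="predict_proba"
)

# Find likely mislabeled examples, most suspicious first
issues = find_label_issues(
    labels=y_train,
    pred_probs=pred_probs,
    return_indices_ranked_by="self_confidence",
)

# Review and correct top issues
# (review_and_correct is a placeholder for your own review workflow)
for idx in issues[:100]:
    review_and_correct(X_train[idx], y_train[idx])
```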
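To make the ranking step less opaque, here is a simplified sketch of the idea behind a self-confidence score: take the probability the model assigns to each example's given label and review the lowest scores first. This shows only the ranking idea and is not a reimplementation of cleanlab, which does more work before flagging an example. The function name and toy data are hypothetical.

```python
def rank_by_self_confidence(labels, pred_probs):
    """Return example indices sorted from least to most confident.

    labels: list of integer class ids (the labels as given in the dataset).
    pred_probs: list of per-class probability lists, ideally out-of-sample.
    """
    # Self-confidence: probability the model assigns to the given label
    self_confidence = [probs[label] for label, probs in zip(labels, pred_probs)]
    return sorted(range(len(labels)), key=lambda i: self_confidence[i])

labels = [0, 1, 0, 1]
pred_probs = [
    [0.9, 0.1],  # labeled 0, model agrees
    [0.2, 0.8],  # labeled 1, model agrees
    [0.1, 0.9],  # labeled 0, model strongly prefers 1: suspicious
    [0.4, 0.6],  # labeled 1, model only weakly agrees
]
ranked = rank_by_self_confidence(labels, pred_probs)
```

Here the third example (index 2) comes first in `ranked`, since the model gives its given label only 0.1 probability.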
Remove near-duplicates using embeddings:
```python
def deduplicate_semantic(texts, threshold=0.95):
    # embed, cluster_by_similarity, and select_best are placeholders for
    # your embedding model, clustering routine, and selection rule
    embeddings = embed(texts)
    clusters = cluster_by_similarity(embeddings, threshold)

    # Keep one representative per cluster
    deduplicated = []
    for cluster in clusters:
        representative = select_best(cluster)  # longest, most recent, etc.
        deduplicated.append(representative)

    return deduplicated
```
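The sketch below is a small runnable version of the same idea, under two loudly stated simplifications: token-count vectors stand in for learned embeddings, and a greedy "keep it unless it is too similar to something already kept" pass stands in for clustering. The names `deduplicate_near`, `_vectorize`, and `_cosine`, and the 0.9 threshold, are illustrative assumptions.

```python
import math
import re
from collections import Counter

def _vectorize(text):
    """Token-count vector; a crude stand-in for a learned embedding."""
    return Counter(re.findall(r"\w+", text.lower()))

def _cosine(a, b):
    """Cosine similarity between two token-count vectors."""
    dot = sum(count * b[token] for token, count in a.items())
    norm_a = math.sqrt(sum(c * c for c in a.values()))
    norm_b = math.sqrt(sum(c * c for c in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0

def deduplicate_near(texts, threshold=0.9):
    """Greedy near-duplicate removal that keeps the first text of each group."""
    kept, kept_vectors = [], []
    for text in texts:
        vec = _vectorize(text)
        # Keep the text only if it is below the threshold for every kept text
        if all(_cosine(vec, other) < threshold for other in kept_vectors):
            kept.append(text)
            kept_vectors.append(vec)
    return kept

texts = [
    "The cat sat on the mat.",
    "the cat sat on the mat",            # differs only in case and punctuation
    "A completely different sentence.",
]
unique = deduplicate_near(texts)
```

This pass compares every text against every kept text, so it is quadratic in the worst case; larger datasets usually need an approximate nearest-neighbor index or hashing scheme instead.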
Prioritize labeling effort:
```python
def active_learning_loop(model, unlabeled_pool, labeled_set, budget):
    # model, select_top_k, and human_label are placeholders for your own
    # estimator, selection helper, and annotation step.
    # Stop when the budget is reached or the pool is exhausted.
    while len(labeled_set) < budget and len(unlabeled_pool) > 0:
        # Train on current labeled data
        model.fit(labeled_set)

        # Score unlabeled by uncertainty
        uncertainties = model.uncertainty(unlabeled_pool)

        # Select most uncertain for labeling
        to_label = select_top_k(unlabeled_pool, uncertainties, k=10)
        labels = human_label(to_label)

        # Update sets
        labeled_set.add(to_label, labels)
        unlabeled_pool.remove(to_label)

    return labeled_set
```
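The selection step in the loop above can be sketched with predictive entropy as the uncertainty score, which is one common choice among several (margin and least-confidence sampling are others). The function names and toy probabilities below are illustrative assumptions, and the probabilities would normally come from your current model.

```python
import math

def entropy(probs):
    """Shannon entropy of a probability distribution, in nats."""
    return -sum(p * math.log(p) for p in probs if p > 0)

def select_most_uncertain(pred_probs, k):
    """Return indices of the k examples with the highest predictive entropy."""
    scores = [entropy(p) for p in pred_probs]
    ranked = sorted(range(len(pred_probs)), key=lambda i: scores[i], reverse=True)
    return ranked[:k]

# Predicted class probabilities for four unlabeled examples
pool_probs = [
    [0.98, 0.02],  # confident
    [0.55, 0.45],  # uncertain
    [0.50, 0.50],  # most uncertain
    [0.80, 0.20],
]
to_label = select_most_uncertain(pool_probs, k=2)
```

With these numbers the two examples closest to a 50/50 prediction (indices 2 and 1) are selected for labeling.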
Find problematic subgroups:
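```python
def find_weak_slices(model, data, features, threshold=0.05):
    # generate_slices and evaluate are placeholders for your own helpers;
    # the default threshold is illustrative, so tune it for your metric.
    # Baseline: performance on the full dataset
    overall_performance = evaluate(model, data)

    # Evaluate on all slices
    slices = generate_slices(data, features)
    weak_slices = []
    for slice_name, slice_data in slices:
        performance = evaluate(model, slice_data)
        if performance < overall_performance - threshold:
            weak_slices.append({
                "slice": slice_name,
                "size": len(slice_data),
                "performance": performance,
            })

    # Worst-performing slices first
    return sorted(weak_slices, key=lambda x: x["performance"])
```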
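For a runnable illustration of the same pattern, the sketch below assumes predictions have already been scored, so each row is a dict carrying the slicing feature and a boolean `correct` field, and accuracy is the performance metric. The row layout, the name `find_weak_slices_simple`, and the default threshold are assumptions for this example only.

```python
from collections import defaultdict

def find_weak_slices_simple(rows, feature, threshold=0.1, min_size=1):
    """Find feature values whose accuracy trails overall accuracy by more than threshold."""
    overall = sum(r["correct"] for r in rows) / len(rows)

    # Group correctness outcomes by the value of the slicing feature
    groups = defaultdict(list)
    for r in rows:
        groups[r[feature]].append(r["correct"])

    weak = []
    for value, outcomes in groups.items():
        accuracy = sum(outcomes) / len(outcomes)
        if len(outcomes) >= min_size and accuracy < overall - threshold:
            weak.append({
                "slice": f"{feature}={value}",
                "size": len(outcomes),
                "performance": accuracy,
            })
    # Worst-performing slices first
    return sorted(weak, key=lambda s: s["performance"])

# Toy results: en 9/10 correct, de 2/5 correct, fr 4/5 correct
rows = (
    [{"lang": "en", "correct": True}] * 9 + [{"lang": "en", "correct": False}]
    + [{"lang": "de", "correct": True}] * 2 + [{"lang": "de", "correct": False}] * 3
    + [{"lang": "fr", "correct": True}] * 4 + [{"lang": "fr", "correct": False}]
)
weak = find_weak_slices_simple(rows, "lang")
```

Small slices give noisy accuracy estimates, so in practice it helps to raise `min_size` before acting on a flagged slice.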