claude-project/skills/data-engineering/data-quality-standards/SKILL.md
Data quality validation rules, quality metrics, and acceptance criteria for Somali dialect classifier datasets. Covers duplicate detection, language filtering, quality scoring, and validation protocols. Auto-invokes when discussing data quality, validation, cleaning, or quality guardrails for this project.
npx skillsauth add ilyasibrahim/claude-agents-coordination data-quality-standards
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Language Purity:
Duplicate Rate:
Label Confidence:
Text Quality Score:
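def is_valid_utf8(text):
    # Assumed helper (undefined in the original): rejects lone surrogates
    try:
        text.encode('utf-8')
        return True
    except UnicodeEncodeError:
        return False

def basic_validation(record):
    text = record.get('text') or ''  # missing or null text counts as empty
    checks = {
        'has_text': bool(text.strip()),
        'has_label': record.get('label') in ['Northern', 'Southern', 'Central'],
        'valid_length': 10 <= len(text) <= 5000,
        'valid_encoding': is_valid_utf8(text)
    }
    return all(checks.values()), checks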
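from langdetect import DetectorFactory, detect
from langdetect.lang_detect_exception import LangDetectException

DetectorFactory.seed = 0  # langdetect is non-deterministic unless seeded

def validate_language(text):
    try:
        return detect(text) == 'so'  # ISO 639-1 code for Somali
    except LangDetectException:
        # Raised when the text has no detectable features (e.g. empty or digits only)
        return False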
from difflib import SequenceMatcher

def is_near_duplicate(text1, text2, threshold=0.95):
    # ratio() returns a similarity in [0, 1]; 1.0 means identical strings
    similarity = SequenceMatcher(None, text1, text2).ratio()
    return similarity >= threshold
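The pairwise check above only compares two texts. A minimal sketch of applying it across a whole list follows; `drop_near_duplicates` is a hypothetical helper that is not part of the original skill, and the sample strings are made up for illustration. It compares each candidate against every text already kept, so the cost grows quadratically and it is only practical for small batches.

```python
from difflib import SequenceMatcher


def is_near_duplicate(text1, text2, threshold=0.95):
    # Same pairwise check as above, repeated so the sketch is self-contained
    return SequenceMatcher(None, text1, text2).ratio() >= threshold


def drop_near_duplicates(texts, threshold=0.95):
    """Keep the first occurrence of each text; drop later near-duplicates.

    Hypothetical helper: every candidate is compared against all kept
    texts, so runtime is O(n^2) in the number of records.
    """
    kept = []
    for text in texts:
        if not any(is_near_duplicate(text, seen, threshold) for seen in kept):
            kept.append(text)
    return kept


# Illustrative sample strings, not taken from the project dataset
samples = [
    "Waxaan tagay suuqa shalay galab.",
    "Waxaan tagay suuqa shalay galab!",  # differs only in the final character
    "Maanta roob badan ayaa da'ay.",
]
unique = drop_near_duplicates(samples)
```

For larger datasets, a hashing-based approach would likely be needed instead of exhaustive pairwise comparison; the original does not say which method the project uses at scale.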
def compute_quality_score(text):
    score = 0
    # Length appropriateness (1-3 points)
    if 50 <= len(text) <= 1000:
        score += 3
    elif 20 <= len(text) < 50 or 1000 < len(text) <= 3000:
        score += 2
    else:
        score += 1
    # Vocabulary richness (1-3 points; 0 if the text has no words)
    words = text.split()
    if words:
        vocab_ratio = len(set(words)) / len(words)
        if vocab_ratio > 0.7:
            score += 3
        elif vocab_ratio > 0.5:
            score += 2
        else:
            score += 1
    # No HTML/formatting artifacts (0 or 2 points)
    if not ('<' in text or '>' in text or '{' in text):
        score += 2
    # At least one period, a rough proxy for a full sentence (0 or 2 points)
    if text.count('.') >= 1:
        score += 2
    return min(score, 10)  # Components sum to at most 10; cap kept as a guard
For Training Set:
For Validation/Test Sets:
Remove if:
Flag for review if:
Accept if:
Dataset-Level:
Per-Source:
Per-Dialect:
Example Report:
Dataset Quality Report - 2025-11-06
Total Records: 10,000
Passing Validation: 9,200 (92%)
Average Quality Score: 7.8/10
Duplicates Removed: 600 (6%)
Language Purity: 98.5% Somali
Per-Source Quality:
- Wikipedia: 8.5/10 (3,000 records)
- BBC Somali: 8.2/10 (2,500 records)
- Social Media: 6.9/10 (4,500 records, 30% rejected)
Per-Dialect Distribution:
- Northern: 5,500 (59.8%)
- Southern: 2,200 (23.9%)
- Central: 1,500 (16.3%)
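In the example report, the per-dialect percentages are shares of the 9,200 records that passed validation, not of the 10,000 total. A minimal sketch of that arithmetic is below; `dialect_distribution` is a hypothetical helper, and the label list is built only to reproduce the counts shown in the report.

```python
from collections import Counter


def dialect_distribution(labels):
    """Return {dialect: (count, percent)}, percent rounded to one decimal.

    Hypothetical helper: percentages are relative to the number of labels
    passed in, i.e. the records that survived validation.
    """
    counts = Counter(labels)
    total = sum(counts.values())
    return {
        dialect: (count, round(100 * count / total, 1))
        for dialect, count in counts.most_common()
    }


# Counts copied from the example report above
labels = ["Northern"] * 5500 + ["Southern"] * 2200 + ["Central"] * 1500
distribution = dialect_distribution(labels)
for dialect, (count, percent) in distribution.items():
    print(f"- {dialect}: {count:,} ({percent}%)")
```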
This skill auto-invokes when you mention:
Version: 1.0.0
Last Updated: 2025-11-06
Project: Somali Dialect Classifier