claude-project/skills/machine-learning/lrl-nlp-techniques/SKILL.md
Low-resource NLP techniques specific to Somali language processing. Covers data scarcity strategies, cross-lingual transfer, morphological analysis, data augmentation for Somali, semi-supervised learning, and evaluation considerations for low-resource contexts. Auto-invokes when working on Somali NLP, low-resource language challenges, dialect classification, or language-specific modeling decisions.
Language: Somali (Cushitic branch of the Afroasiatic family)
Task: Dialect classification (Northern, Southern, Central)
Challenge: Limited labeled training data
Approach: Low-resource NLP techniques + transfer learning
Approach: Leverage high-resource languages with linguistic similarity
For Somali:
Implementation:
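# Start with a multilingual model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'xlm-roberta-base',  # Pre-trained on 100 languages
    num_labels=3         # Northern, Southern, Central
)
# Fine-tune on Somali data
# (trainer is a transformers.Trainer wrapping this model; see the Code Template later in this document)
trainer.train()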
Techniques for Somali:
Back-Translation:
Synonym Replacement:
Character-Level Noise:
Example:
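# Simple augmentation (placeholder)
def augment_somali_text(text):
    # Preserve meaning, add variation (back-translation, synonym replacement, or character-level noise)
    varied_text = text  # placeholder: returns the input unchanged until a technique is plugged in
    return varied_text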
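Character-level noise, one of the techniques listed above, can be sketched with the standard library alone. This is a minimal illustration, not the project's implementation: the function name `add_char_noise`, the `swap_prob` parameter, and the rule of keeping the first and last character of each word fixed are all assumptions made for the example.

```python
import random


def add_char_noise(text, swap_prob=0.1, seed=None):
    """Swap two adjacent interior characters in some words to mimic spelling variation."""
    rng = random.Random(seed)
    noisy_words = []
    for word in text.split():
        # Only perturb words longer than 3 characters, each with probability swap_prob.
        if len(word) > 3 and rng.random() < swap_prob:
            # Pick an interior position so the first and last characters stay fixed.
            i = rng.randrange(1, len(word) - 2)
            chars = list(word)
            chars[i], chars[i + 1] = chars[i + 1], chars[i]
            word = "".join(chars)
        noisy_words.append(word)
    return " ".join(noisy_words)


# Example: perturb every eligible word (hypothetical input sentence).
print(add_char_noise("subax wanaagsan", swap_prob=1.0, seed=42))
```

Noise of this kind changes spelling but not the dialect label, so the augmented sentence can reuse the label of the original. Keep `swap_prob` low if spelling differences are themselves a dialect cue.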
Approach: Use large unlabeled Somali corpus + small labeled set
Techniques:
For This Project:
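As a generic illustration of combining a large unlabeled corpus with a small labeled set, pseudo-labeling keeps only the unlabeled examples that the current model predicts with high confidence and adds them to the training set. The sketch below shows just the selection step. The function name `select_pseudo_labels`, the 0.9 threshold, and the probability values are hypothetical, and nothing here implies this is the technique the project uses.

```python
def select_pseudo_labels(probabilities, threshold=0.9):
    """Return (index, predicted_label) pairs for confidently predicted examples.

    probabilities: one list of class probabilities per unlabeled example.
    """
    selected = []
    for idx, probs in enumerate(probabilities):
        # Predicted class is the one with the highest probability.
        label = max(range(len(probs)), key=lambda k: probs[k])
        if probs[label] >= threshold:
            selected.append((idx, label))
    return selected


# Hypothetical model outputs for three unlabeled sentences
# (class order: Northern, Southern, Central).
unlabeled_probs = [
    [0.95, 0.03, 0.02],
    [0.40, 0.35, 0.25],
    [0.05, 0.02, 0.93],
]
confident = select_pseudo_labels(unlabeled_probs, threshold=0.9)
print(confident)
```

A high threshold trades coverage for label quality. With an imbalanced starting set, it is worth checking that the selected pseudo-labels are not almost entirely from the majority dialect.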
Agglutinative Structure:
Grammatical Gender:
Verb Conjugation:
Tokenization Strategy:
# Tokenizer selection for Somali
from transformers import AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
# XLM-R uses SentencePiece (subword tokenization)
# Good for morphologically rich languages
Classification Strategy:
Standard Metrics:
Low-Resource Specific:
Example:
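# Detailed evaluation
from sklearn.metrics import classification_report, confusion_matrix

# y_true and y_pred are assumed to hold integer labels 0, 1, 2 in the same order as target_names
report = classification_report(y_true, y_pred,
                               target_names=['Northern', 'Southern', 'Central'])
cm = confusion_matrix(y_true, y_pred)  # rows: true dialect, columns: predicted dialect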
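When one dialect dominates the data, overall accuracy can hide weak performance on the smaller classes, so a macro-averaged F1 is a useful companion number. The sketch below derives it from a confusion matrix laid out as scikit-learn does (rows are true labels, columns are predictions). The helper names and the counts in `cm_example` are hypothetical.

```python
def per_class_f1(cm):
    """F1 per class from a square confusion matrix (rows = true, columns = predicted)."""
    n = len(cm)
    scores = []
    for k in range(n):
        tp = cm[k][k]
        fn = sum(cm[k]) - tp                        # true class k, predicted as something else
        fp = sum(cm[r][k] for r in range(n)) - tp   # predicted k, actually something else
        denom = 2 * tp + fp + fn
        scores.append(2 * tp / denom if denom else 0.0)
    return scores


def macro_f1(cm):
    """Unweighted mean of per-class F1, so every dialect counts equally."""
    scores = per_class_f1(cm)
    return sum(scores) / len(scores)


def accuracy(cm):
    """Fraction of all examples on the diagonal."""
    total = sum(sum(row) for row in cm)
    return sum(cm[k][k] for k in range(len(cm))) / total


# Hypothetical counts (class order: Northern, Southern, Central).
cm_example = [
    [50, 3, 2],
    [4, 10, 1],
    [3, 2, 5],
]
print(accuracy(cm_example), macro_f1(cm_example))
```

With these hypothetical counts, accuracy is 0.8125 while macro-F1 is about 0.705, because the two smaller classes score noticeably lower than the majority class.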
Challenge: Limited data means train/val/test splits are small
Approach:
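A minimal sketch of one option for small datasets, stratified k-fold cross-validation: every example is used for testing exactly once, and each fold keeps roughly the same dialect proportions as the full set. The function `stratified_kfold_indices` and its parameters are hypothetical names for this illustration. In practice scikit-learn's `StratifiedKFold` does the same job.

```python
import random
from collections import defaultdict


def stratified_kfold_indices(labels, n_splits=5, seed=0):
    """Yield (train_idx, test_idx) pairs with each class spread evenly across folds."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)

    folds = [[] for _ in range(n_splits)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class's examples round-robin so every fold gets its share.
        for position, idx in enumerate(indices):
            folds[position % n_splits].append(idx)

    for k in range(n_splits):
        test_idx = sorted(folds[k])
        train_idx = sorted(i for j, fold in enumerate(folds) if j != k for i in fold)
        yield train_idx, test_idx


# Hypothetical label distribution with Northern overrepresented.
labels = ["Northern"] * 10 + ["Southern"] * 5 + ["Central"] * 5
for train_idx, test_idx in stratified_kfold_indices(labels, n_splits=5):
    print(len(train_idx), len(test_idx))
```

Reporting the mean and spread of a metric across folds gives a steadier estimate than a single small test split.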
Option 1: Fine-Tuned Multilingual Transformer
Option 2: Character-Level CNN
Option 3: Hybrid Approach
Recommendation for this project: Start with XLM-R (proven success on low-resource languages)
High-Quality:
Noisy but Useful:
Consider:
Given Limited Resources:
Challenge: Northern dialect likely overrepresented
Solutions:
Example:
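# Weighted loss
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

# y_train is assumed to hold integer class ids (0..2), so weights line up with the model's output indices
class_weights = compute_class_weight('balanced',
                                     classes=np.unique(y_train),
                                     y=y_train)
# Use in training; cast to float32 because compute_class_weight returns float64
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor(class_weights, dtype=torch.float))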
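To make the `'balanced'` setting concrete, the weight for each class is `n_samples / (n_classes * class_count)`, so rarer dialects get proportionally larger weights. The sketch below reproduces that formula in plain Python. The function name and the 70/20/10 label distribution are hypothetical.

```python
from collections import Counter


def balanced_class_weights(labels):
    """Weight each class by n_samples / (n_classes * class_count)."""
    counts = Counter(labels)
    n_samples = len(labels)
    n_classes = len(counts)
    # Sort by class so the order is deterministic.
    return {cls: n_samples / (n_classes * count) for cls, count in sorted(counts.items())}


# Hypothetical training labels with Northern overrepresented.
y_train_example = ["Northern"] * 70 + ["Southern"] * 20 + ["Central"] * 10
weights = balanced_class_weights(y_train_example)
for dialect, weight in weights.items():
    print(dialect, round(weight, 3))
```

Note that `np.unique` returns classes in sorted order, so when labels are strings the weights follow alphabetical order rather than the Northern, Southern, Central order used elsewhere. Encoding labels as integer ids before computing weights avoids a mismatch with the loss function.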
Code Template:
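from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer

# 1. Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('xlm-roberta-base', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')

# 2. Prepare Somali dataset
# prepare_dataset is a helper not defined here; it is expected to tokenize the text and attach labels.
# val_dataset and test_dataset are assumed to be built the same way from the validation and test splits.
train_dataset = prepare_dataset(somali_train_data, tokenizer)

# 3. Fine-tune
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics  # function not defined here; must return a dict of metric names to values
)
trainer.train()

# 4. Evaluate
results = trainer.evaluate(test_dataset)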
This skill auto-invokes when you mention:
Version: 1.0.0
Last Updated: 2025-11-06
Project: Somali Dialect Classifier