ilyasibrahim/lrl-nlp-techniques

Name: lrl-nlp-techniques
Author: ilyasibrahim

claude-project/skills/machine-learning/lrl-nlp-techniques/SKILL.md

npx skillsauth add ilyasibrahim/claude-agents-coordination lrl-nlp-techniques

Low-Resource NLP Techniques for Somali

Project Context

Language: Somali (Cushitic language family)
Task: Dialect classification (Northern, Southern, Central)
Challenge: Limited labeled training data
Approach: Low-resource NLP techniques + transfer learning

Data Scarcity Strategies

1. Cross-Lingual Transfer

Approach: Leverage high-resource languages with linguistic similarity

For Somali:

Use multilingual models (mBERT, XLM-R) pre-trained on 100+ languages
Fine-tune on limited Somali data
Arabic transfer (geographic/cultural proximity)
Afro-Asiatic language family knowledge transfer

Implementation:

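# Start with a multilingual model
from transformers import AutoModelForSequenceClassification

model = AutoModelForSequenceClassification.from_pretrained(
    'xlm-roberta-base',  # Pre-trained on 100 languages
    num_labels=3  # Northern, Southern, Central
)

# Fine-tune on Somali data (trainer is a transformers.Trainer built around this model)
trainer.train()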

2. Data Augmentation

Techniques for Somali:

Back-Translation:

Somali → English → Somali (introduces variation)
Use with caution (may introduce artifacts)

Synonym Replacement:

Replace words with Somali synonyms
Maintain grammatical structure

Character-Level Noise:

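Perturb spelling, e.g. vowel length or doubled consonants (standard Somali Latin orthography has no diacritics to add or remove)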
Simulate OCR errors (if data source is scanned)

Example:

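# Simple augmentation (character-level noise as a placeholder strategy)
import random

def augment_somali_text(text, p=0.05):
    # Preserve meaning, add variation: drop each letter with probability p
    varied_text = ''.join(c for c in text if not c.isalpha() or random.random() >= p)
    return varied_text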

3. Semi-Supervised Learning

Approach: Use large unlabeled Somali corpus + small labeled set

Techniques:

Self-training: Train on labeled → predict on unlabeled → add confident predictions
Co-training: Train multiple models, use agreement
Pseudo-labeling: Label unlabeled data with existing model

For This Project:

Leverage web-scraped Somali text (Wikipedia, news, social media)
Use dialect classifier to pseudo-label unlabeled text
Iteratively improve with high-confidence predictions
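
The self-training loop described above can be sketched without committing to a particular model. This is a minimal illustration, not the project's implementation: `self_train`, its `fit` callback, and the default threshold of 0.9 are all assumptions made for the example. `fit` is expected to train on (text, label) pairs and return a function that maps a text to class probabilities.

```python
from typing import Callable, List, Sequence, Tuple

Example = Tuple[str, int]  # (text, dialect label id)


def self_train(
    fit: Callable[[List[Example]], Callable[[str], Sequence[float]]],
    labeled: List[Example],
    unlabeled: List[str],
    threshold: float = 0.9,  # assumed cutoff; tune on a validation set
    max_rounds: int = 3,
) -> Tuple[List[Example], List[str]]:
    """Grow the labeled set with high-confidence pseudo-labels.

    Returns the enlarged labeled set and the texts that never reached
    the confidence threshold.
    """
    labeled = list(labeled)
    pool = list(unlabeled)
    for _ in range(max_rounds):
        if not pool:
            break
        predict_proba = fit(labeled)  # retrain on everything labeled so far
        confident, remaining = [], []
        for text in pool:
            probs = predict_proba(text)
            best = max(range(len(probs)), key=probs.__getitem__)
            if probs[best] >= threshold:
                confident.append((text, best))
            else:
                remaining.append(text)
        if not confident:  # nothing new to learn from; stop early
            break
        labeled.extend(confident)
        pool = remaining
    return labeled, pool
```

Keeping the pseudo-labeled examples separate from the human-labeled ones (for instance by tagging them) makes it easier to exclude them from validation and test splits later.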

Morphological Considerations

Somali Language Characteristics

Agglutinative Structure:

Words formed by adding affixes to roots
Example: buug (book) → buuggaan (this book)

Grammatical Gender:

Masculine/Feminine affects word forms
Important for proper parsing

Verb Conjugation:

Complex tense/aspect system
Affects sentence structure classification

Tokenization Strategy:

Use subword tokenization (BPE, WordPiece)
Captures morphological patterns
Better for low-resource scenarios

# Tokenizer selection for Somali
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')
# XLM-R uses SentencePiece (subword tokenization)
# Good for morphologically rich languages

Dialect-Specific Considerations

Northern Dialect (Standard Somali)

Most represented in written text
Official/formal language basis
More training data available

Southern Dialect (Af-Maay)

Significant linguistic differences
Less written representation
May require targeted data collection

Central Dialect

Intermediate characteristics
Mixed features from North/South
Potentially harder to classify

Classification Strategy:

Focus on dialectal markers (vocabulary, phonology represented in text)
Use character n-grams (capture phonetic patterns)
Leverage morphological differences
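
As a rough illustration of the character n-gram idea, the sketch below counts overlapping character n-grams in a text. The function name `char_ngrams` and the 2 to 4 range are assumptions chosen for the example; the padding spaces let word-initial and word-final patterns (where many affixes sit) show up as their own features.

```python
from collections import Counter


def char_ngrams(text: str, n_min: int = 2, n_max: int = 4) -> Counter:
    """Count overlapping character n-grams for n in [n_min, n_max]."""
    # Pad with spaces so prefixes and suffixes become distinct n-grams.
    padded = f" {text.lower().strip()} "
    counts: Counter = Counter()
    for n in range(n_min, n_max + 1):
        for i in range(len(padded) - n + 1):
            counts[padded[i:i + n]] += 1
    return counts


# Bigrams of "buug": " b", "bu", "uu", "ug", "g "
features = char_ngrams("buug", 2, 2)
```

These counts can feed a simple linear classifier as a baseline before moving to a transformer.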

Evaluation in Low-Resource Context

Metrics

Standard Metrics:

Accuracy, Precision, Recall, F1-score

Low-Resource Specific:

Per-class performance (some dialects may be underrepresented)
Confusion matrix analysis (which dialects are confusable?)
Performance vs. training set size curves

Example:

# Detailed evaluation
from sklearn.metrics import classification_report, confusion_matrix

report = classification_report(y_true, y_pred,
                               target_names=['Northern', 'Southern', 'Central'])
cm = confusion_matrix(y_true, y_pred)

Cross-Validation Strategy

Challenge: Limited data means train/val/test splits are small

Approach:

k-fold cross-validation (k=5 or k=10)
Stratified splits (maintain class balance)
Report mean ± std dev across folds
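
A minimal sketch of stratified fold assignment and the mean ± std summary, using only the standard library. The helper names `stratified_folds` and `summarize` are hypothetical; in practice scikit-learn's `StratifiedKFold` covers the same ground.

```python
import random
import statistics
from collections import defaultdict
from typing import Hashable, List, Sequence, Tuple


def stratified_folds(labels: Sequence[Hashable], k: int = 5, seed: int = 0) -> List[List[int]]:
    """Assign example indices to k folds, keeping class proportions similar."""
    rng = random.Random(seed)
    by_class = defaultdict(list)
    for idx, label in enumerate(labels):
        by_class[label].append(idx)
    folds: List[List[int]] = [[] for _ in range(k)]
    for indices in by_class.values():
        rng.shuffle(indices)
        # Deal each class out round-robin so every fold gets a share of it.
        for pos, idx in enumerate(indices):
            folds[pos % k].append(idx)
    return folds


def summarize(scores: Sequence[float]) -> Tuple[float, float]:
    """Return (mean, sample standard deviation) across fold scores."""
    return statistics.mean(scores), statistics.stdev(scores)
```

Each fold serves once as the validation set while the remaining folds form the training set; the per-fold scores then go through `summarize`.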

Recommended Model Architectures

For Dialect Classification

Option 1: Fine-Tuned Multilingual Transformer

XLM-R or mBERT
Pre-trained on many languages
Fine-tune final layers on Somali

Option 2: Character-Level CNN

Good for morphologically rich languages
Captures sub-word patterns
Less data-hungry than full transformers

Option 3: Hybrid Approach

Character-level features + word embeddings
Captures both local and global patterns

Recommendation for this project: Start with XLM-R (proven success on low-resource languages)

Data Collection Best Practices

Sources for Somali Text

High-Quality:

Somali Wikipedia
Official government documents
News websites (e.g., BBC Somali)
Academic publications

Noisy but Useful:

Social media (Twitter, Facebook)
Forums and discussion boards
User-generated content

Consider:

Geographic metadata (helps with dialect labeling)
Source reliability
Copyright/usage rights

Labeling Strategy

Given Limited Resources:

Focus on high-confidence examples
Use native speakers for validation
Create clear labeling guidelines
Inter-annotator agreement checks
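
One common way to run an inter-annotator agreement check for two annotators is Cohen's kappa, which corrects raw agreement for the agreement expected by chance. The sketch below is a plain-Python illustration (the function name is an assumption, and the source does not say which agreement measure the project uses).

```python
from collections import Counter
from typing import Hashable, Sequence


def cohens_kappa(labels_a: Sequence[Hashable], labels_b: Sequence[Hashable]) -> float:
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("Both annotators must label the same non-empty item list")
    n = len(labels_a)
    # Observed agreement: fraction of items given the same label.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Chance agreement from each annotator's label distribution.
    freq_a, freq_b = Counter(labels_a), Counter(labels_b)
    expected = sum(freq_a[c] * freq_b[c] for c in freq_a) / (n * n)
    if expected == 1.0:  # both annotators used a single identical label
        return 1.0
    return (observed - expected) / (1 - expected)
```

Low kappa on a pilot batch is usually a sign that the labeling guidelines need tightening before more data is annotated.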

Handling Class Imbalance

Challenge: Northern dialect likely overrepresented

Solutions:

Weighted loss function (penalize majority class less)
Oversampling minority classes
Data augmentation for underrepresented dialects
Stratified sampling

Example:

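# Weighted loss
import numpy as np
import torch
import torch.nn as nn
from sklearn.utils.class_weight import compute_class_weight

class_weights = compute_class_weight('balanced',
                                     classes=np.unique(y_train),
                                     y=y_train)

# Use in training (cast to float32 so the weights match the model's logits)
loss_fn = nn.CrossEntropyLoss(weight=torch.tensor(class_weights, dtype=torch.float))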
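
To make the 'balanced' setting concrete: scikit-learn computes each class weight as n_samples / (n_classes * class_count), so rarer classes receive larger weights. The plain-Python sketch below reproduces that formula; the function name and the toy label counts are made up for illustration.

```python
from collections import Counter
from typing import Dict, Hashable, Sequence


def balanced_class_weights(labels: Sequence[Hashable]) -> Dict[Hashable, float]:
    """weight[c] = n_samples / (n_classes * count[c])."""
    counts = Counter(labels)
    n_samples, n_classes = len(labels), len(counts)
    return {c: n_samples / (n_classes * k) for c, k in counts.items()}


# Hypothetical imbalanced label set: Northern dominates.
toy_labels = ["Northern"] * 8 + ["Southern"] + ["Central"]
weights = balanced_class_weights(toy_labels)
# Northern gets a weight below 1, the two rare dialects a weight above 1.
```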

Transfer Learning Pipeline

Recommended Workflow

1. Pre-training: Start with XLM-R (already done)
2. Language Adaptation: (Optional) Further pre-train on large Somali corpus
3. Task Fine-Tuning: Fine-tune on labeled dialect data
4. Evaluation: Test on held-out set
5. Iteration: Augment data, adjust hyperparameters

Code Template:

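from transformers import AutoModelForSequenceClassification, AutoTokenizer, Trainer

# 1. Load pre-trained model
model = AutoModelForSequenceClassification.from_pretrained('xlm-roberta-base', num_labels=3)
tokenizer = AutoTokenizer.from_pretrained('xlm-roberta-base')

# 2. Prepare Somali datasets
# prepare_dataset is a project helper (not part of transformers);
# somali_val_data and somali_test_data are placeholder names for the held-out splits
train_dataset = prepare_dataset(somali_train_data, tokenizer)
val_dataset = prepare_dataset(somali_val_data, tokenizer)
test_dataset = prepare_dataset(somali_test_data, tokenizer)

# 3. Fine-tune
# compute_metrics is user-defined: it takes an EvalPrediction and returns a dict of metrics
trainer = Trainer(
    model=model,
    train_dataset=train_dataset,
    eval_dataset=val_dataset,
    compute_metrics=compute_metrics
)
trainer.train()

# 4. Evaluate on the held-out test set
results = trainer.evaluate(test_dataset)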

Common Pitfalls

❌ Avoid

Overfitting: Very easy with limited data. Use regularization, dropout, early stopping.
Data Leakage: Ensure train/val/test splits don't overlap (especially with augmented data)
Inappropriate Baselines: Don't compare to high-resource benchmarks
Ignoring Linguistic Structure: Somali morphology matters—use appropriate tokenization
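
The data-leakage point is easiest to get right by splitting first and augmenting only the training portion, so no augmented variant of a test sentence can end up in training. A minimal sketch, assuming examples are (text, label) pairs and `augment` is any text-to-text function; the name `split_then_augment` is hypothetical.

```python
import random
from typing import Callable, List, Tuple

Example = Tuple[str, int]  # (text, dialect label id)


def split_then_augment(
    examples: List[Example],
    augment: Callable[[str], str],
    test_fraction: float = 0.2,
    seed: int = 0,
) -> Tuple[List[Example], List[Example]]:
    """Hold out a test set first, then augment only the training examples."""
    rng = random.Random(seed)
    shuffled = list(examples)
    rng.shuffle(shuffled)
    n_test = max(1, int(len(shuffled) * test_fraction))
    test, train = shuffled[:n_test], shuffled[n_test:]
    # Augmented copies inherit the label of their source sentence.
    augmented = [(augment(text), label) for text, label in train]
    return train + augmented, test
```

The same ordering applies to cross-validation: augment inside each training fold, never before the folds are formed.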

✅ Do

Start Simple: Baseline with logistic regression + TF-IDF before deep models
Use Pre-Trained Models: Leverage multilingual transformers
Validate with Native Speakers: Especially for edge cases
Document Data Sources: Maintain provenance for reproducibility
Report Confidence Intervals: Acknowledge uncertainty in low-resource setting
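
One simple way to report a confidence interval on a small test set is a percentile bootstrap over the per-example correctness. The sketch below is an illustration under that assumption; the function name, the 1000 resamples, and the 95% level are example choices rather than project settings.

```python
import random
from typing import Hashable, Sequence, Tuple


def bootstrap_accuracy_ci(
    y_true: Sequence[Hashable],
    y_pred: Sequence[Hashable],
    n_resamples: int = 1000,
    alpha: float = 0.05,
    seed: int = 0,
) -> Tuple[float, float, float]:
    """Return (accuracy, lower, upper) using a percentile bootstrap."""
    rng = random.Random(seed)
    n = len(y_true)
    correct = [int(t == p) for t, p in zip(y_true, y_pred)]
    scores = []
    for _ in range(n_resamples):
        # Resample test examples with replacement and recompute accuracy.
        sample = [correct[rng.randrange(n)] for _ in range(n)]
        scores.append(sum(sample) / n)
    scores.sort()
    lower = scores[int((alpha / 2) * n_resamples)]
    upper = scores[min(n_resamples - 1, int((1 - alpha / 2) * n_resamples))]
    return sum(correct) / n, lower, upper
```

With only a few dozen test examples the interval will be wide, which is itself useful information when comparing models.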

When This Skill Activates

This skill auto-invokes when you mention:

Somali language, Somali NLP, Somali dialect
Low-resource NLP, data scarcity, limited data
Dialect classification, dialect detection
Cross-lingual transfer, multilingual models
Morphological analysis, agglutinative languages
Data augmentation for NLP
XLM-R, mBERT, multilingual transformers
Semi-supervised learning, pseudo-labeling

References

Somali Wikipedia: https://so.wikipedia.org
BBC Somali: News source for text data
XLM-R Paper: Conneau et al., 2019 (unsupervised cross-lingual representation learning)
Low-Resource NLP Survey: Hedderich et al., 2021

Version: 1.0.0
Last Updated: 2025-11-06
Project: Somali Dialect Classifier

Low-resource NLP techniques specific to Somali language processing. Covers data scarcity strategies, cross-lingual transfer, morphological analysis, data augmentation for Somali, semi-supervised learning, and evaluation considerations for low-resource contexts. Auto-invokes when working on Somali NLP, low-resource language challenges, dialect classification, or language-specific modeling decisions.

52 stars

testing

Updated Apr 21, 2026

npx skillsauth add ilyasibrahim/claude-agents-coordination lrl-nlp-techniques

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners in report

Clean (95%): Trivy, container and dependency vulnerability scanner
Clean (95%): Semgrep, static code analysis for vulnerabilities
Clean (95%): mcp-scan (Snyk), Model Context Protocol security validation
Skipped (50%): Snyk (dep), open source security scanning
Skipped (50%): Socket.dev, supply chain security analysis
Skipped (50%): VirusTotal, multi-engine malware detection
Skipped (50%): CrowdStrike, advanced threat intelligence
Skipped (50%): OSV-Scanner, Open Source Vulnerability database check
Skipped (50%): OWASP Dep-Check

Last scanned: Apr 30, 2026, 1:45 PM (40.4s, 1 file scanned)

SKILL.md

name: lrl-nlp-techniques
description: Low-resource NLP techniques specific to Somali language processing. Covers data scarcity strategies, cross-lingual transfer, morphological analysis, data augmentation for Somali, semi-supervised learning, and evaluation considerations for low-resource contexts. Auto-invokes when working on Somali NLP, low-resource language challenges, dialect classification, or language-specific modeling decisions.
allowed-tools: Read, Grep

Install manually

# Clone the repo
git clone https://github.com/ilyasibrahim/claude-agents-coordination.git

# Copy into Claude Code skills folder (global)
cp -r claude-agents-coordination/claude-project/skills/machine-learning/lrl-nlp-techniques ~/.claude/skills/

Repository

ilyasibrahim/claude-agents-coordination

52 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT