claude-project/skills/machine-learning/model-evaluation-framework/SKILL.md
Model evaluation metrics, testing protocols, and performance assessment for Somali dialect classification. Covers accuracy, F1-score, confusion matrix analysis, per-dialect performance, and evaluation best practices for multi-class classification tasks.
npx skillsauth add ilyasibrahim/claude-agents-coordination model-evaluation-frameworkInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Accuracy:
Macro F1-Score:
Weighted F1-Score:
Precision (per dialect):
Recall (per dialect):
F1-Score (per dialect):
from sklearn.metrics import (
accuracy_score,
precision_recall_fscore_support,
classification_report,
confusion_matrix
)
def evaluate_model(y_true, y_pred, dialect_names):
"""Comprehensive model evaluation"""
# Overall metrics
accuracy = accuracy_score(y_true, y_pred)
# Per-class metrics
precision, recall, f1, support = precision_recall_fscore_support(
y_true, y_pred, average=None, labels=range(len(dialect_names))
)
# Macro averages
macro_f1 = f1.mean()
# Detailed report
report = classification_report(
y_true, y_pred,
target_names=dialect_names,
digits=4
)
# Confusion matrix
cm = confusion_matrix(y_true, y_pred)
return {
'accuracy': accuracy,
'macro_f1': macro_f1,
'per_class': {
dialect_names[i]: {
'precision': precision[i],
'recall': recall[i],
'f1': f1[i],
'support': support[i]
}
for i in range(len(dialect_names))
},
'report': report,
'confusion_matrix': cm
}
Example Confusion Matrix:
Predicted
N S C
Actual N [[450 20 10]
S [ 30 380 15]
C [ 15 25 255]]
Analysis:
import matplotlib.pyplot as plt
import seaborn as sns
def plot_confusion_matrix(cm, dialect_names):
"""Visualize confusion matrix"""
plt.figure(figsize=(8, 6))
sns.heatmap(
cm,
annot=True,
fmt='d',
cmap='Blues',
xticklabels=dialect_names,
yticklabels=dialect_names
)
plt.ylabel('Actual')
plt.xlabel('Predicted')
plt.title('Dialect Classification Confusion Matrix')
plt.tight_layout()
plt.savefig('confusion_matrix.png', dpi=300)
from sklearn.model_selection import StratifiedKFold
import numpy as np
def cross_validate_model(model, X, y, n_folds=5):
"""Stratified k-fold cross-validation"""
skf = StratifiedKFold(n_splits=n_folds, shuffle=True, random_state=42)
fold_scores = []
for fold, (train_idx, val_idx) in enumerate(skf.split(X, y)):
X_train, X_val = X[train_idx], X[val_idx]
y_train, y_val = y[train_idx], y[val_idx]
# Train model
model.fit(X_train, y_train)
# Evaluate
y_pred = model.predict(X_val)
accuracy = accuracy_score(y_val, y_pred)
f1 = f1_score(y_val, y_pred, average='macro')
fold_scores.append({'accuracy': accuracy, 'f1': f1})
print(f"Fold {fold + 1}: Accuracy={accuracy:.4f}, F1={f1:.4f}")
# Report mean and std
acc_mean = np.mean([s['accuracy'] for s in fold_scores])
acc_std = np.std([s['accuracy'] for s in fold_scores])
f1_mean = np.mean([s['f1'] for s in fold_scores])
f1_std = np.std([s['f1'] for s in fold_scores])
print(f"\nCross-Validation Results:")
print(f"Accuracy: {acc_mean:.4f} ± {acc_std:.4f}")
print(f"Macro F1: {f1_mean:.4f} ± {f1_std:.4f}")
return fold_scores
For Production Deployment:
For Research/Experimental:
def analyze_errors(X, y_true, y_pred, dialect_names, n_examples=10):
"""Identify and display misclassified examples"""
errors = []
for i in range(len(y_true)):
if y_true[i] != y_pred[i]:
errors.append({
'text': X[i],
'true_label': dialect_names[y_true[i]],
'pred_label': dialect_names[y_pred[i]]
})
# Sample random errors for review
import random
sample_errors = random.sample(errors, min(n_examples, len(errors)))
print(f"\nError Analysis ({len(errors)} total errors):\n")
for idx, error in enumerate(sample_errors):
print(f"Example {idx + 1}:")
print(f" Text: {error['text'][:100]}...")
print(f" True: {error['true_label']}, Predicted: {error['pred_label']}\n")
return errors
# Majority class baseline
def majority_baseline(y_train, y_test):
"""Predict most frequent class"""
from collections import Counter
majority_class = Counter(y_train).most_common(1)[0][0]
y_pred = [majority_class] * len(y_test)
return accuracy_score(y_test, y_pred)
# Random baseline
def random_baseline(y_train, y_test):
"""Random predictions based on train distribution"""
from collections import Counter
class_dist = Counter(y_train)
classes, counts = zip(*class_dist.items())
probs = [c / sum(counts) for c in counts]
y_pred = np.random.choice(classes, size=len(y_test), p=probs)
return accuracy_score(y_test, y_pred)
Report Format:
Baseline Results:
- Random: 33.5% accuracy
- Majority: 58.2% accuracy
- Our Model: 87.3% accuracy ✓ (significantly better)
# Model Evaluation Report
**Model:** XLM-R Fine-Tuned for Somali Dialect Classification
**Date:** 2025-11-06
**Test Set Size:** 1,500 examples
## Overall Performance
| Metric | Value |
|--------|-------|
| Accuracy | 87.3% |
| Macro F1 | 0.852 |
| Weighted F1 | 0.871 |
## Per-Dialect Performance
| Dialect | Precision | Recall | F1-Score | Support |
|---------|-----------|--------|----------|---------|
| Northern | 0.91 | 0.94 | 0.92 | 750 |
| Southern | 0.85 | 0.82 | 0.83 | 450 |
| Central | 0.81 | 0.79 | 0.80 | 300 |
## Confusion Matrix
[Insert visualization]
## Error Analysis
- Northern-Southern confusion: 3.2% of cases
- Most errors occur with short texts (<50 words)
- Informal/social media text more challenging
## Comparison to Baseline
- Majority class baseline: 50.0%
- Our model: 87.3%
- **Improvement: +37.3 percentage points**
## Recommendations
- Collect more Southern and Central dialect training data
- Investigate short text performance
- Consider ensemble approaches for difficult cases
This skill auto-invokes when you mention:
Version: 1.0.0 Last Updated: 2025-11-06 Project: Somali Dialect Classifier
documentation
Voice, tone, and content guidelines for data/ML dashboards. Covers microcopy, error messages, success states, and data presentation language. Auto-invokes on copy, messaging, content, labels, error messages keywords.
development
Unified design system for data/ML dashboards. Quick reference for brand vs data color decisions, component patterns, typography, spacing. Auto-invokes on styling, CSS, design, colors, UI, visualization keywords. Tiered loading - core always, philosophy/implementation on-demand.
development
Coordination protocol for main Claude Code agent. Explicit user invocation required ("mobilize agents", "coordinate", "check registry"). Provides agent orchestration, registry management, and handoff protocols. Subagents never access this - main agent provides context in task prompts.
testing
MLOps best practices for model versioning, experiment tracking, deployment, monitoring, and retraining workflows. Covers reproducibility, CI/CD for ML, model registry, and production ML system design.