skills/model-evaluator/SKILL.md
Evaluate and compare ML model performance with rigorous testing methodologies
npx skillsauth add jmsktm/claude-settings Model Evaluator
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
The Model Evaluator skill helps you rigorously assess and compare machine learning model performance across multiple dimensions. It guides you through selecting appropriate metrics, designing evaluation protocols, avoiding common statistical pitfalls, and making data-driven decisions about model selection.
Proper model evaluation goes beyond accuracy scores. This skill covers evaluation across the full spectrum: predictive performance, computational efficiency, robustness, fairness, calibration, and production readiness. It helps you answer not just "which model is best?" but "which model is best for my specific use case and constraints?"
Whether you are comparing LLMs, classifiers, or custom models, this skill ensures your evaluation methodology is sound and your conclusions are reliable.
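
    class ModelEvaluator:
        """Score one or more models on a shared test set with a fixed metric list."""

        def __init__(self, test_data, metrics):
            self.test_data = test_data  # expected to expose .inputs and .labels
            self.metrics = metrics      # each metric exposes .name and .compute()
            self.results = {}           # model_name -> {metric_name: score}

        def evaluate(self, model, model_name):
            predictions = model.predict(self.test_data.inputs)
            scores = {}
            for metric in self.metrics:
                scores[metric.name] = metric.compute(
                    predictions,
                    self.test_data.labels,
                )
            self.results[model_name] = scores
            return scores

        def compare(self):
            # statistical_comparison is a placeholder for a significance test;
            # it is not defined here.
            return statistical_comparison(self.results)
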
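The evaluator above only assumes that each metric exposes a `name` and a `compute(predictions, labels)` method. A minimal sketch of objects that satisfy that interface follows; the `Metric` wrapper, `accuracy`, and `positive_recall` are illustrative names, not part of the skill, and the data is made up for the example.

```python
from dataclasses import dataclass
from typing import Callable, Sequence


@dataclass(frozen=True)
class Metric:
    """A named scoring function: compute(predictions, labels) -> float."""
    name: str
    fn: Callable[[Sequence, Sequence], float]

    def compute(self, predictions: Sequence, labels: Sequence) -> float:
        if len(predictions) != len(labels):
            raise ValueError("predictions and labels must have the same length")
        if len(labels) == 0:
            raise ValueError("cannot score an empty test set")
        return self.fn(predictions, labels)


def accuracy(predictions, labels):
    """Fraction of predictions that exactly match the label."""
    return sum(p == y for p, y in zip(predictions, labels)) / len(labels)


def positive_recall(predictions, labels, positive=1):
    """Share of truly positive examples that the model flagged as positive."""
    on_positives = [p for p, y in zip(predictions, labels) if y == positive]
    if not on_positives:
        return 0.0
    return sum(p == positive for p in on_positives) / len(on_positives)


# Hypothetical toy data: 8 binary predictions and their labels.
METRICS = [Metric("accuracy", accuracy), Metric("recall", positive_recall)]
preds = [1, 0, 1, 1, 0, 0, 0, 0]
labels = [1, 0, 0, 1, 1, 0, 0, 0]
scores = {m.name: m.compute(preds, labels) for m in METRICS}
# accuracy is 0.75 (6 of 8 correct); recall is 2/3 (2 of 3 positives found)
```

The two numbers differ on the same predictions, which is why reporting a single metric can hide a weakness on the class you care about.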
| Action | Command/Trigger |
|--------|-----------------|
| Design evaluation | "How should I evaluate [model type]" |
| Choose metrics | "What metrics for [task type]" |
| Compare models | "Compare these models: [list]" |
| LLM evaluation | "Evaluate LLM performance" |
| Statistical testing | "Is this difference significant" |
| Bias evaluation | "Check model for bias" |

- Use Multiple Metrics: No single metric tells the whole story
- Test on Realistic Data: Evaluation data should match production
- Account for Variance: Models and data have randomness
- Consider All Costs: Performance isn't just accuracy
- Test Robustness: How does the model handle adversity?
- Evaluate Fairly: Ensure fair comparison across models

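One way to account for variance when two models are scored on the same test set is a paired bootstrap over per-example results. The sketch below assumes you have a 0/1 correctness flag per example for each model, in the same order; `paired_bootstrap` and the toy results are hypothetical and only illustrate the idea.

```python
import random


def paired_bootstrap(correct_a, correct_b, n_resamples=2000, seed=0):
    """Bootstrap the accuracy difference (A - B) over shared test examples.

    correct_a / correct_b: 0/1 flags, one per test example, in the same order.
    Returns (observed_diff, (ci_low, ci_high)) with a 95% percentile interval.
    """
    if len(correct_a) != len(correct_b) or len(correct_a) == 0:
        raise ValueError("need two non-empty, equally long result lists")
    rng = random.Random(seed)
    n = len(correct_a)
    # Resampling per-example differences keeps the pairing intact.
    deltas = [a - b for a, b in zip(correct_a, correct_b)]
    observed = sum(deltas) / n
    diffs = []
    for _ in range(n_resamples):
        sample = rng.choices(deltas, k=n)  # draw n examples with replacement
        diffs.append(sum(sample) / n)
    diffs.sort()
    low = diffs[int(0.025 * n_resamples)]
    high = diffs[int(0.975 * n_resamples) - 1]
    return observed, (low, high)


# Hypothetical results on 100 shared examples: A gets 90 right, B gets 70 right.
model_a = [1] * 90 + [0] * 10
model_b = [1] * 70 + [0] * 30
observed, (ci_low, ci_high) = paired_bootstrap(model_a, model_b)
```

If the interval excludes zero, the observed gap is unlikely to be an artifact of which examples happened to land in the test set. It says nothing about variance from training seeds, which needs repeated training runs.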
Score models across multiple axes:

    def multi_dim_evaluate(model, test_data):
        # The helper functions called below are placeholders, not defined here.
        return {
            "accuracy": compute_accuracy(model, test_data),
            "latency_p50": measure_latency(model, test_data, percentile=50),
            "latency_p99": measure_latency(model, test_data, percentile=99),
            "memory_mb": measure_memory(model),
            "cost_per_1k": compute_cost(model, n=1000),
            "robustness": adversarial_accuracy(model, test_data),
            "fairness": demographic_parity(model, test_data),
        }

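The helpers in the dictionary above are left undefined. As a rough sketch of what two of them might look like, the code below times a prediction callable and reports a nearest-rank percentile, and computes a demographic parity gap from predictions and group labels. The function names and signatures here are assumptions chosen for the example and differ from the placeholders above.

```python
import math
import time


def percentile(values, pct):
    """Nearest-rank percentile of a non-empty list (pct between 0 and 100)."""
    ordered = sorted(values)
    rank = max(1, math.ceil(len(ordered) * pct / 100))
    return ordered[rank - 1]


def measure_latency_ms(predict, inputs, pct=50, warmup=3):
    """Time predict(x) once per input and return the requested percentile in ms."""
    for x in inputs[:warmup]:
        predict(x)  # warm-up calls so first-call overhead does not skew timings
    timings = []
    for x in inputs:
        start = time.perf_counter()
        predict(x)
        timings.append((time.perf_counter() - start) * 1000.0)
    return percentile(timings, pct)


def demographic_parity_gap(predictions, groups, positive=1):
    """Largest difference in positive-prediction rate between any two groups."""
    totals, positives = {}, {}
    for pred, group in zip(predictions, groups):
        totals[group] = totals.get(group, 0) + 1
        positives[group] = positives.get(group, 0) + (pred == positive)
    rates = [positives[g] / totals[g] for g in totals]
    return max(rates) - min(rates)


# Hypothetical usage with a stand-in "model" and made-up group labels.
p50 = measure_latency_ms(lambda x: x * 2, list(range(50)), pct=50)
p99 = measure_latency_ms(lambda x: x * 2, list(range(50)), pct=99)
gap = demographic_parity_gap([1, 1, 0, 0, 1, 0], ["a", "a", "a", "b", "b", "b"])
# group "a" is predicted positive 2/3 of the time, group "b" 1/3, so gap is 1/3
```

Reporting p50 and p99 together matters because tail latency often drives user-visible behavior even when the median looks fine. A parity gap of zero means every group receives positive predictions at the same rate; it is only one of several fairness definitions.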
Use LLMs to evaluate LLM outputs:
Prompt template:
"Rate the following response on a scale of 1-5 for:
- Accuracy: Is the information correct?
- Helpfulness: Does it address the user's need?
- Clarity: Is it easy to understand?
Question: {question}
Response: {response}
Ground truth (if available): {ground_truth}
Provide scores and brief justification."
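The template above can be filled in and its reply parsed with a few lines of code. This is a minimal sketch: `judge` stands for any callable that takes a prompt string and returns the judge model's text reply, which is an assumption, since the skill does not name a specific LLM client. The parser also assumes the judge answers in a `Criterion: N` form.

```python
import re

JUDGE_TEMPLATE = """Rate the following response on a scale of 1-5 for:
- Accuracy: Is the information correct?
- Helpfulness: Does it address the user's need?
- Clarity: Is it easy to understand?

Question: {question}
Response: {response}
Ground truth (if available): {ground_truth}

Provide scores and brief justification."""

CRITERIA = ("Accuracy", "Helpfulness", "Clarity")


def build_judge_prompt(question, response, ground_truth=None):
    """Fill the template; a missing ground truth is stated explicitly."""
    return JUDGE_TEMPLATE.format(
        question=question,
        response=response,
        ground_truth=ground_truth if ground_truth is not None else "not available",
    )


def parse_scores(judge_reply):
    """Extract 'Criterion: N' scores (1-5); None for any criterion not found."""
    scores = {}
    for name in CRITERIA:
        match = re.search(rf"{name}\s*[:=]\s*([1-5])\b", judge_reply, re.IGNORECASE)
        scores[name.lower()] = int(match.group(1)) if match else None
    return scores


def judge_response(judge, question, response, ground_truth=None):
    """Run one judgment. `judge` is a hypothetical callable: prompt -> reply text."""
    reply = judge(build_judge_prompt(question, response, ground_truth))
    return parse_scores(reply)


# Stand-in judge that returns a canned reply, used here instead of a real model.
fake_judge = lambda prompt: "Accuracy: 4\nHelpfulness: 5\nClarity: 3/5 (a bit terse)"
result = judge_response(fake_judge, "What is 2 + 2?", "4", ground_truth="4")
# result == {"accuracy": 4, "helpfulness": 5, "clarity": 3}
```

Returning `None` for a missing or out-of-range score, rather than guessing, makes malformed judge replies visible so they can be retried or excluded from the aggregate.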
For production evaluation:

    import random


    class ABTest:
        def __init__(self, model_a, model_b, traffic_split=0.5):
            self.models = {"A": model_a, "B": model_b}
            self.split = traffic_split  # fraction of traffic routed to variant A
            self.results = {"A": [], "B": []}

        def serve(self, request):
            # Randomly assign each request to a variant.
            variant = "A" if random.random() < self.split else "B"
            response = self.models[variant].predict(request)
            return response, variant

        def record_outcome(self, variant, success):
            self.results[variant].append(success)

        def compute_significance(self):
            # statistical_test is a placeholder, not defined here; supply a test
            # suited to the outcome type (e.g. a two-proportion z-test).
            return statistical_test(self.results["A"], self.results["B"])

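The `statistical_test` call above is left undefined. When outcomes are binary successes, one common choice is a two-proportion z-test; the sketch below is one possible stand-in, not the skill's own implementation. It relies on the normal approximation, so it assumes each variant has a reasonably large number of recorded outcomes.

```python
import math


def two_proportion_z_test(outcomes_a, outcomes_b):
    """Two-sided z-test for a difference in success rate between two variants.

    outcomes_a / outcomes_b: sequences of booleans or 0/1 flags.
    Returns (z, p_value).
    """
    n_a, n_b = len(outcomes_a), len(outcomes_b)
    if n_a == 0 or n_b == 0:
        raise ValueError("both variants need at least one recorded outcome")
    successes_a, successes_b = sum(outcomes_a), sum(outcomes_b)
    p_a = successes_a / n_a
    p_b = successes_b / n_b
    # Pooled rate under the null hypothesis that both variants perform the same.
    pooled = (successes_a + successes_b) / (n_a + n_b)
    std_err = math.sqrt(pooled * (1 - pooled) * (1 / n_a + 1 / n_b))
    if std_err == 0:
        return 0.0, 1.0  # all outcomes identical: no evidence of a difference
    z = (p_a - p_b) / std_err
    # Two-sided p-value: 2 * (1 - Phi(|z|)) equals erfc(|z| / sqrt(2)).
    p_value = math.erfc(abs(z) / math.sqrt(2))
    return z, p_value


# Hypothetical traffic: variant A converts 120 of 1000, variant B 100 of 1000.
variant_a = [1] * 120 + [0] * 880
variant_b = [1] * 100 + [0] * 900
z, p_value = two_proportion_z_test(variant_a, variant_b)
```

In this made-up example a two-point lift on 1000 requests per arm gives a p-value of roughly 0.15, so it would not clear a conventional 0.05 threshold. Deciding the sample size before the test starts avoids the inflated false-positive rate that comes from checking repeatedly and stopping at the first significant result.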
Ensure predicted probabilities are meaningful:
- Expected Calibration Error (ECE)
- Reliability diagrams
- Brier score decomposition
- Temperature scaling for recalibration
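The first and third items in the list above can be computed directly from predictions. The following is a minimal sketch using equal-width confidence bins for ECE and the plain (undecomposed) Brier score; the function names, the bin count, and the toy data are assumptions chosen for illustration.

```python
def expected_calibration_error(confidences, correct, n_bins=10):
    """ECE: bin-size-weighted average of |accuracy - mean confidence| per bin.

    confidences: predicted probability of the chosen class, each in [0, 1].
    correct: 0/1 flags saying whether the chosen class was right.
    """
    if len(confidences) != len(correct) or len(confidences) == 0:
        raise ValueError("need two non-empty, equally long sequences")
    bins = [[] for _ in range(n_bins)]
    for conf, hit in zip(confidences, correct):
        index = min(int(conf * n_bins), n_bins - 1)  # conf == 1.0 joins the last bin
        bins[index].append((conf, hit))
    total = len(confidences)
    ece = 0.0
    for members in bins:
        if not members:
            continue  # empty bins contribute nothing
        avg_conf = sum(c for c, _ in members) / len(members)
        accuracy = sum(h for _, h in members) / len(members)
        ece += len(members) / total * abs(accuracy - avg_conf)
    return ece


def brier_score(probabilities, outcomes):
    """Mean squared error between predicted probability and the 0/1 outcome."""
    if len(probabilities) != len(outcomes) or len(outcomes) == 0:
        raise ValueError("need two non-empty, equally long sequences")
    return sum((p - y) ** 2 for p, y in zip(probabilities, outcomes)) / len(outcomes)


# Hypothetical predictions: an overconfident group at 0.95 and a fair one at 0.55.
confidences = [0.95] * 4 + [0.55] * 4
correct = [1, 1, 1, 0] + [1, 0, 1, 0]
ece = expected_calibration_error(confidences, correct)
# The 0.95 bin is right 75% of the time (gap 0.20); the 0.55 bin 50% (gap 0.05).
# Each bin holds half the data, so ECE is 0.5 * 0.20 + 0.5 * 0.05 = 0.125.
```

ECE depends on the number of bins, so comparisons between models should use the same binning. A reliability diagram plots the same per-bin accuracy against per-bin confidence, which shows where the miscalibration sits rather than only how large it is.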