areas/software/mlops/skills/model-evaluation/SKILL.md
# Skill: Model Evaluation
## When to load

When evaluating a trained model, comparing versions, or performing fairness analysis.
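## Threshold Selection

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def select_optimal_threshold(y_true, y_prob, business_objective: str):
    """
    business_objective:
    - 'max_f1': balanced precision/recall
    - 'high_precision': minimize false positives (fraud)
    - 'high_recall': minimize false negatives (screening)
    """
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
    if business_objective == 'max_f1':
        # precisions/recalls have one more entry than thresholds; drop the
        # final (precision=1, recall=0) point so the indices line up.
        f1_scores = 2 * (precisions * recalls) / (precisions + recalls + 1e-8)
        return thresholds[np.argmax(f1_scores[:-1])]
    # Only 'max_f1' is implemented in this excerpt.
    raise NotImplementedError(f"Objective {business_objective!r} is not handled here")
```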
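The docstring lists `'high_precision'` and `'high_recall'`, while the code shown covers only `'max_f1'`. One way the other two objectives might be handled is sketched below. The helper name `threshold_for_target` and the `target` parameter are assumptions for illustration, not part of the original skill: the idea is to require a minimum precision (or recall) and, among thresholds that meet it, keep the one with the best value of the other metric.

```python
import numpy as np
from sklearn.metrics import precision_recall_curve


def threshold_for_target(y_true, y_prob, objective: str, target: float = 0.9):
    """Hypothetical helper: choose a threshold that meets a precision or recall target."""
    precisions, recalls, thresholds = precision_recall_curve(y_true, y_prob)
    # Drop the final (precision=1, recall=0) point, which has no threshold.
    precisions, recalls = precisions[:-1], recalls[:-1]

    if objective == "high_precision":
        # Meet the precision target, then keep as much recall as possible.
        meets_target, tiebreak = precisions >= target, recalls
    elif objective == "high_recall":
        # Meet the recall target, then keep as much precision as possible.
        meets_target, tiebreak = recalls >= target, precisions
    else:
        raise ValueError(f"Unsupported objective: {objective!r}")

    if not meets_target.any():
        raise ValueError(f"No threshold reaches {objective} target {target}")

    # Mask out thresholds that miss the target, then take the best remaining one.
    best = np.argmax(np.where(meets_target, tiebreak, -np.inf))
    return float(thresholds[best])
```

The `target` default of 0.9 is arbitrary; in practice it would come from the business cost of a false positive versus a false negative.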
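## Fairness Analysis

```python
import logging

from sklearn.metrics import recall_score

logger = logging.getLogger(__name__)


def evaluate_fairness(y_true, y_pred, sensitive_attribute):
    # Inputs are assumed to be pandas Series sharing one index, so the
    # boolean masks below select the same rows in each.
    groups = sensitive_attribute.unique()
    results = {g: {
        "n": (sensitive_attribute == g).sum(),
        "positive_rate": y_pred[sensitive_attribute == g].mean(),
        # True positive rate: recall computed on this group's rows only.
        "tpr": recall_score(y_true[sensitive_attribute == g], y_pred[sensitive_attribute == g]),
    } for g in groups}
    # Demographic parity difference: largest gap in positive prediction rate.
    pos_rates = [r["positive_rate"] for r in results.values()]
    dp_diff = max(pos_rates) - min(pos_rates)
    if dp_diff > 0.1:
        logger.warning(f"Demographic parity difference {dp_diff:.3f} exceeds 0.1 threshold")
    return results, dp_diff
```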
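`evaluate_fairness` records a per-group `tpr` but only summarizes the positive-rate gap. As a rough sketch, the same per-group results could also be reduced to a true-positive-rate gap (often called the equal opportunity difference). The helper name `equal_opportunity_difference` and the sample values are assumptions for illustration, not part of the original skill.

```python
def equal_opportunity_difference(results: dict) -> float:
    """Hypothetical helper: largest gap in true positive rate across groups.

    `results` is expected to have the per-group shape produced by
    evaluate_fairness: {group: {"n": ..., "positive_rate": ..., "tpr": ...}}.
    """
    tprs = [r["tpr"] for r in results.values()]
    return max(tprs) - min(tprs)


# Made-up per-group metrics, only to show the expected input shape.
example_results = {
    "group_a": {"n": 120, "positive_rate": 0.30, "tpr": 0.80},
    "group_b": {"n": 95, "positive_rate": 0.25, "tpr": 0.65},
}
eo_diff = equal_opportunity_difference(example_results)
```

A model can pass the demographic parity check and still have a large TPR gap, so reporting both gives a fuller picture than either alone.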