
harsh040506/model-evaluation

Name: model-evaluation
Author: harsh040506

engineering/ai-ml-engineering/skills/model-evaluation/SKILL.md

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library model-evaluation


ML Model Evaluation

Comprehensive, rigorous model evaluation from offline metrics to production monitoring.

The Evaluation Pyramid

          Production metrics (online)
              A/B test / shadow deployment
           ──────────────────────────────
          Error analysis + slice analysis
              Calibration + robustness
       ──────────────────────────────────────
              Core task metrics
          (accuracy, F1, RMSE, BLEU...)

Work bottom-up. Core metrics are a prerequisite, not a destination.

Metric Reference by Task

Classification

| Metric | Formula | Use when |
|--------|---------|----------|
| Accuracy | correct / total | Balanced classes, simple baseline |
| Precision | TP / (TP + FP) | False positives are costly (spam, fraud) |
| Recall | TP / (TP + FN) | False negatives are costly (cancer screening, fraud detection) |
| F1 | 2 × (P × R) / (P + R) | Imbalanced classes, balance P and R |
| AUROC | Area under ROC curve | Compare models regardless of threshold |
| AUPRC | Area under PR curve | Heavily imbalanced datasets (rare event detection) |

Never use accuracy alone on imbalanced datasets. A model that always predicts "negative" on a 99:1 dataset has 99% accuracy but is useless.
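To make the 99:1 example concrete, here is a minimal sketch in plain Python. The dataset is synthetic and the helper name `binary_metrics` is illustrative only; it is not part of the skill or of scikit-learn.

```python
def binary_metrics(y_true, y_pred):
    """Accuracy, precision, recall and F1 from binary labels (1 = positive)."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    tn = sum(1 for t, p in pairs if t == 0 and p == 0)
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    recall = tp / (tp + fn) if (tp + fn) else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {
        "accuracy": (tp + tn) / len(pairs),
        "precision": precision,
        "recall": recall,
        "f1": f1,
    }

# Synthetic 99:1 dataset: 990 negatives and 10 positives.
y_true = [0] * 990 + [1] * 10
# A degenerate "model" that always predicts the negative class.
always_negative = [0] * 1000

metrics = binary_metrics(y_true, always_negative)
print(metrics)
```

Accuracy comes out at 0.99 while recall and F1 are both 0, which is why the table recommends F1 or AUPRC for imbalanced data.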

from sklearn.metrics import (
    classification_report, confusion_matrix, roc_auc_score, 
    average_precision_score, f1_score
)

# Complete classification report
print(classification_report(y_true, y_pred, target_names=class_names))

# AUROC (for binary)
auroc = roc_auc_score(y_true, y_prob[:, 1])

# AUPRC (better for imbalanced)
auprc = average_precision_score(y_true, y_prob[:, 1])

print(f"AUROC: {auroc:.4f}")
print(f"AUPRC: {auprc:.4f}")

Regression

| Metric | Formula | Interpretation |
|--------|---------|----------------|
| MAE | mean(\|y - ŷ\|) | Average absolute error in target units |
| RMSE | √mean((y - ŷ)²) | Penalizes large errors more than MAE |
| R² | 1 - SS_res/SS_tot | Fraction of variance explained (1.0 = perfect) |
| MAPE | mean(\|y - ŷ\| / \|y\|) | Relative error (don't use when y can be near zero) |

from sklearn.metrics import mean_absolute_error, mean_squared_error, r2_score
import numpy as np

mae = mean_absolute_error(y_true, y_pred)
rmse = np.sqrt(mean_squared_error(y_true, y_pred))
r2 = r2_score(y_true, y_pred)

print(f"MAE: {mae:.4f}")
print(f"RMSE: {rmse:.4f}")
print(f"R²: {r2:.4f}")

# Residual analysis (critical — aggregate metrics can hide problems)
residuals = y_true - y_pred
# Should be centered at zero with no systematic pattern
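The residual check above can be turned into a few numbers. The following is a minimal sketch, assuming NumPy arrays of targets and predictions; `residual_summary` is a hypothetical helper name and the data is synthetic, built so that the model under-predicts by a constant 2.0.

```python
import numpy as np

def residual_summary(y_true, y_pred):
    """Summarise residuals (y_true - y_pred) to expose systematic error."""
    y_true = np.asarray(y_true, dtype=float)
    y_pred = np.asarray(y_pred, dtype=float)
    residuals = y_true - y_pred
    return {
        # A mean far from zero indicates consistent over- or under-prediction.
        "mean_residual": float(residuals.mean()),
        "std_residual": float(residuals.std()),
        # A strong correlation suggests the error depends on the predicted value.
        "corr_with_prediction": float(np.corrcoef(y_pred, residuals)[0, 1]),
    }

# Synthetic example: predictions are biased low by 2.0 plus small noise.
rng = np.random.default_rng(0)
y_true = rng.normal(10.0, 3.0, size=500)
y_pred = y_true - 2.0 + rng.normal(0.0, 0.5, size=500)

summary = residual_summary(y_true, y_pred)
print(summary)
```

A mean residual near 2.0 here flags the bias even though RMSE alone would only report "an error of about 2".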

Text Generation

| Metric | What it measures | Limitation |
|--------|------------------|------------|
| BLEU | N-gram overlap with reference | Poor for paraphrase, over-penalizes length |
| ROUGE-L | Longest common subsequence | Misses semantic similarity |
| BERTScore | Semantic similarity using BERT embeddings | Slower, requires model |
| Human eval | Actual quality judgment | Expensive, but the closest to ground truth |

from evaluate import load

bleu = load("bleu")
rouge = load("rouge")
bertscore = load("bertscore")

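# BLEU (n-gram overlap with the references)
bleu_result = bleu.compute(predictions=predictions, references=references)
print(f"BLEU: {bleu_result['bleu']:.4f}")

# ROUGE (ROUGE-L is based on the longest common subsequence)
rouge_result = rouge.compute(predictions=predictions, references=references)
print(f"ROUGE-L: {rouge_result['rougeL']:.4f}")

# BERTScore (semantic similarity); returns one score per prediction
bert_result = bertscore.compute(predictions=predictions, references=references, lang="en")
print(f"BERTScore F1: {sum(bert_result['f1'])/len(bert_result['f1']):.4f}")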

Human evaluation of 50–100 samples is essential for generation tasks. Automated metrics miss fluency, relevance, factual accuracy, and tone.
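To see why n-gram overlap metrics struggle with paraphrase, the sketch below computes clipped unigram precision, which is the overlap component of BLEU without the brevity penalty or higher-order n-grams. It is a simplified illustration in plain Python, not the `evaluate` implementation, and the function names and sentences are made up for the example.

```python
from collections import Counter

def ngrams(tokens, n):
    """All contiguous n-grams of a token list, as tuples."""
    return [tuple(tokens[i:i + n]) for i in range(len(tokens) - n + 1)]

def clipped_precision(prediction, reference, n=1):
    """Share of predicted n-grams found in the reference, with counts clipped."""
    pred_counts = Counter(ngrams(prediction.split(), n))
    ref_counts = Counter(ngrams(reference.split(), n))
    total = sum(pred_counts.values())
    if total == 0:
        return 0.0
    # Each n-gram is credited at most as often as it occurs in the reference.
    overlap = sum(min(count, ref_counts[gram]) for gram, count in pred_counts.items())
    return overlap / total

reference = "the cat sat on the mat"
exact = clipped_precision("the cat sat on the mat", reference)
paraphrase = clipped_precision("a feline rested upon the rug", reference)

print(f"exact match: {exact:.3f}")
print(f"paraphrase:  {paraphrase:.3f}")
```

The paraphrase keeps the meaning but shares only the word "the" with the reference, so its score collapses. That gap is what BERTScore and human review are meant to cover.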

Slice Analysis

Why it matters: A model with 92% overall accuracy can have 60% accuracy on an important subgroup. Aggregate metrics hide this.
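A quick arithmetic check shows how this happens. The counts below are hypothetical, chosen only to reproduce the 92% and 60% figures.

```python
# Hypothetical test set: 1,000 examples, 50 of which belong to the subgroup.
n_total, n_subgroup = 1000, 50
correct_total = 920     # 92% overall accuracy
correct_subgroup = 30   # 60% accuracy on the subgroup

n_rest = n_total - n_subgroup
correct_rest = correct_total - correct_subgroup

overall_acc = correct_total / n_total
subgroup_acc = correct_subgroup / n_subgroup
rest_acc = correct_rest / n_rest

print(f"overall:  {overall_acc:.3f}")
print(f"subgroup: {subgroup_acc:.3f}")
print(f"rest:     {rest_acc:.3f}")
```

Because the subgroup is only 5% of the data, its 20 errors move the overall number by just two points.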

What to slice on

Data attributes: text length, category, time period, region, data source
Demographic groups (if applicable): age range, gender, language, country
Prediction confidence: high confidence, uncertain (0.45–0.55 prob), low confidence
Error types: false positives vs. false negatives separately

import pandas as pd
from sklearn.metrics import f1_score

def slice_analysis(df: pd.DataFrame, prediction_col: str, label_col: str, 
                   slice_col: str) -> pd.DataFrame:
    """Compute F1 per slice of a categorical column."""
    results = []
    
    # Overall
    overall_f1 = f1_score(df[label_col], df[prediction_col], average='macro')
    results.append({"slice": "OVERALL", "n": len(df), "f1": overall_f1})
    
    # Per slice
    for slice_val, group in df.groupby(slice_col):
        slice_f1 = f1_score(group[label_col], group[prediction_col], average='macro')
        results.append({
            "slice": f"{slice_col}={slice_val}",
            "n": len(group),
            "f1": slice_f1,
            "delta": slice_f1 - overall_f1,
        })
    
    result_df = pd.DataFrame(results).sort_values("f1")
    return result_df

# Flag slices whose macro-F1 is more than 0.05 below the overall score
analysis = slice_analysis(test_df, "prediction", "label", "category")
at_risk = analysis[analysis["delta"] < -0.05]
print("Underperforming slices:")
print(at_risk)
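Slicing by prediction confidence can be sketched in a similar way. The example below is an assumption-laden illustration: it treats `y_prob` as the positive-class probability of a binary classifier, uses a 0.5 decision threshold, and takes the 0.45–0.55 band from the list above as the "uncertain" region. The helper name `accuracy_by_confidence` and the data are hypothetical.

```python
import numpy as np

def accuracy_by_confidence(y_true, y_prob, low=0.45, high=0.55):
    """Accuracy inside vs. outside the uncertain probability band [low, high]."""
    y_true = np.asarray(y_true)
    y_prob = np.asarray(y_prob, dtype=float)
    y_pred = (y_prob >= 0.5).astype(int)
    correct = y_pred == y_true
    uncertain = (y_prob >= low) & (y_prob <= high)

    report = {}
    for name, mask in (("uncertain", uncertain), ("confident", ~uncertain)):
        n = int(mask.sum())
        accuracy = float(correct[mask].mean()) if n else float("nan")
        report[name] = {"n": n, "accuracy": accuracy}
    return report

# Tiny made-up example: four predictions near 0.5, four far from it.
y_true = [1, 0, 1, 1, 0, 0, 1, 0]
y_prob = [0.95, 0.05, 0.52, 0.48, 0.51, 0.10, 0.90, 0.47]

report = accuracy_by_confidence(y_true, y_prob)
print(report)
```

If accuracy in the uncertain band is close to chance, those predictions may be better routed to a fallback or to human review than acted on directly.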

Calibration

A calibrated model produces probability estimates that match empirical frequencies.

Overconfident: Model says 95% confidence, but is right only 80% of the time.
Underconfident: Model says 50% confidence on cases it's actually right 80% of the time.

import matplotlib.pyplot as plt
from sklearn.calibration import calibration_curve
from sklearn.metrics import brier_score_loss

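# Reliability diagram. y_prob here is the 1-D positive-class probability
# (e.g. y_prob[:, 1] from predict_proba). The `normalize` argument is no longer
# accepted by calibration_curve in recent scikit-learn releases, so it is omitted.
fraction_of_positives, mean_predicted_value = calibration_curve(
    y_true, y_prob, n_bins=10
)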

plt.figure(figsize=(8, 6))
plt.plot(mean_predicted_value, fraction_of_positives, 's-', label='Our model')
plt.plot([0, 1], [0, 1], 'k--', label='Perfectly calibrated')
plt.xlabel('Mean predicted probability')
plt.ylabel('Fraction actually positive')
plt.title('Calibration Curve (Reliability Diagram)')
plt.legend()
plt.tight_layout()
plt.savefig('calibration.png')

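# Brier score: mean squared error of the predicted probabilities (lower is better, 0 = perfect).
# It reflects calibration but is not the same quantity as Expected Calibration Error (ECE).
brier = brier_score_loss(y_true, y_prob)
print(f"Brier score: {brier:.4f}")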
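Expected Calibration Error can be computed directly from binned predictions. The following is a minimal sketch for the binary case, assuming equal-width bins over the positive-class probability (the same quantities a reliability diagram plots). The function name `expected_calibration_error` is illustrative; scikit-learn does not ship a function by that name.

```python
import numpy as np

def expected_calibration_error(y_true, y_prob, n_bins=10):
    """Bin-size-weighted mean of |observed positive rate - mean predicted probability|."""
    y_true = np.asarray(y_true, dtype=float)
    y_prob = np.asarray(y_prob, dtype=float)
    edges = np.linspace(0.0, 1.0, n_bins + 1)
    # Use interior edges only, so a probability of exactly 1.0 lands in the last bin.
    bin_ids = np.digitize(y_prob, edges[1:-1])

    ece = 0.0
    for b in range(n_bins):
        mask = bin_ids == b
        if not mask.any():
            continue  # empty bins contribute nothing
        gap = abs(y_true[mask].mean() - y_prob[mask].mean())
        ece += mask.mean() * gap  # mask.mean() is the share of samples in this bin
    return float(ece)

# The overconfident case described above: 95% predicted, 80% actually positive.
y_prob = [0.95] * 10
y_true = [1] * 8 + [0] * 2

ece = expected_calibration_error(y_true, y_prob)
print(f"ECE: {ece:.4f}")
```

For this toy input the result should be roughly 0.15, the gap between stated confidence and observed frequency. ECE is sensitive to the number of bins, so it helps to report `n_bins` alongside the value.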

To improve calibration:

Platt scaling (logistic regression on model outputs)
Temperature scaling (scale logits before softmax)
Isotonic regression (non-parametric, more flexible)
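As an illustration of the second option, here is a minimal temperature-scaling sketch in NumPy. It assumes you have held-out validation logits and integer labels, and it picks the temperature by a simple grid search over validation negative log-likelihood, a stand-in for the gradient-based fitting that is more common in practice. All function names and the synthetic data are assumptions made for this example.

```python
import numpy as np

def softmax(logits, temperature=1.0):
    """Row-wise softmax of logits / temperature, shifted for numerical stability."""
    scaled = np.asarray(logits, dtype=float) / temperature
    scaled = scaled - scaled.max(axis=1, keepdims=True)
    exp = np.exp(scaled)
    return exp / exp.sum(axis=1, keepdims=True)

def negative_log_likelihood(logits, labels, temperature=1.0):
    """Mean negative log-probability assigned to the true class."""
    probs = softmax(logits, temperature)
    picked = probs[np.arange(len(labels)), labels]
    return float(-np.mean(np.log(picked + 1e-12)))

def fit_temperature(logits, labels, grid=None):
    """Return the grid temperature with the lowest validation NLL."""
    grid = np.linspace(0.5, 5.0, 91) if grid is None else np.asarray(grid, dtype=float)
    losses = [negative_log_likelihood(logits, labels, t) for t in grid]
    return float(grid[int(np.argmin(losses))])

# Synthetic validation set: labels follow sigmoid(true_logit), but the "model"
# emits logits three times too large, so it is overconfident.
rng = np.random.default_rng(0)
true_logit = rng.normal(0.0, 1.5, size=2000)
labels = (rng.random(2000) < 1.0 / (1.0 + np.exp(-true_logit))).astype(int)
val_logits = np.column_stack([np.zeros(2000), 3.0 * true_logit])

temperature = fit_temperature(val_logits, labels)
nll_before = negative_log_likelihood(val_logits, labels, 1.0)
nll_after = negative_log_likelihood(val_logits, labels, temperature)
print(f"T = {temperature:.2f}, NLL {nll_before:.4f} -> {nll_after:.4f}")
```

Dividing logits by a single positive constant does not change the predicted class, so accuracy is unaffected; only the confidence values move. Fit the temperature on validation data, never on the test set used for reporting.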

Fairness Evaluation

For models affecting people, always evaluate fairness metrics across protected groups.

Key Metrics

| Metric | Definition | When to use |
|--------|------------|-------------|
| Demographic parity | Similar prediction rates across groups | Hiring, loan approval |
| Equal opportunity | Similar TPR across groups | High-stakes decisions |
| Predictive parity | Similar PPV across groups | Risk scoring |
| Individual fairness | Similar people → similar predictions | Always |

def fairness_analysis(df: pd.DataFrame, prediction_col: str, label_col: str, 
                      group_col: str) -> pd.DataFrame:
    """Per-group rates for binary (0/1) predictions and labels. Requires pandas as pd."""
    results = []
    for group, data in df.groupby(group_col):
        tp = ((data[prediction_col] == 1) & (data[label_col] == 1)).sum()
        fp = ((data[prediction_col] == 1) & (data[label_col] == 0)).sum()
        fn = ((data[prediction_col] == 0) & (data[label_col] == 1)).sum()
        tn = ((data[prediction_col] == 0) & (data[label_col] == 0)).sum()
        
        results.append({
            "group": group,
            "n": len(data),
            "positive_rate": (tp + fp) / len(data),          # Demographic parity
            "tpr": tp / (tp + fn) if (tp + fn) > 0 else None, # Equal opportunity
            "fpr": fp / (fp + tn) if (fp + tn) > 0 else None,
            "ppv": tp / (tp + fp) if (tp + fp) > 0 else None, # Predictive parity
        })
    
    return pd.DataFrame(results)

Report any metric gap > 5 percentage points as a finding requiring investigation.
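The 5-point rule can be applied mechanically to a per-group table. The sketch below assumes a DataFrame with the column names produced by `fairness_analysis` above; the helper `fairness_gaps`, the group labels, and the numbers are hypothetical.

```python
import pandas as pd

def fairness_gaps(per_group: pd.DataFrame,
                  metrics=("positive_rate", "tpr", "fpr", "ppv"),
                  max_gap: float = 0.05) -> pd.DataFrame:
    """Largest between-group gap per metric, flagged when it exceeds max_gap."""
    rows = []
    for metric in metrics:
        # Undefined rates (None/NaN) are ignored rather than treated as zero.
        values = pd.to_numeric(per_group[metric]).dropna()
        gap = float(values.max() - values.min()) if len(values) > 1 else 0.0
        rows.append({"metric": metric, "gap": gap, "flagged": gap > max_gap})
    return pd.DataFrame(rows)

# Made-up per-group results for two groups.
per_group = pd.DataFrame([
    {"group": "A", "positive_rate": 0.30, "tpr": 0.80, "fpr": 0.10, "ppv": 0.75},
    {"group": "B", "positive_rate": 0.22, "tpr": 0.78, "fpr": 0.09, "ppv": 0.74},
])

gaps = fairness_gaps(per_group)
print(gaps)
```

Here only `positive_rate` is flagged (a gap of about 8 points). With small groups the rates are noisy, so it is worth checking group sizes before treating a flagged gap as real.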

Production Monitoring

Offline metrics don't guarantee production performance. Monitor:

Data Drift

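import pandas as pd
from scipy.stats import ks_2samp

def detect_drift(training_data: pd.Series, production_data: pd.Series, 
                 threshold: float = 0.05) -> dict:
    """Two-sample Kolmogorov-Smirnov test for distribution shift in one numeric feature.

    `threshold` is the significance level applied to the p-value.
    """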
    statistic, p_value = ks_2samp(training_data, production_data)
    return {
        "drift_detected": p_value < threshold,
        "ks_statistic": statistic,
        "p_value": p_value,
    }
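To show what the KS statistic measures, here is a NumPy-only sketch that computes it as the largest vertical gap between the two empirical CDFs. It is for illustration and does not compute a p-value; for real use, prefer `scipy.stats.ks_2samp` as above. The function name `ks_statistic` and the synthetic samples are assumptions.

```python
import numpy as np

def ks_statistic(sample_a, sample_b):
    """Largest absolute difference between the empirical CDFs of two samples."""
    a = np.sort(np.asarray(sample_a, dtype=float))
    b = np.sort(np.asarray(sample_b, dtype=float))
    # The gap can only change at observed points, so evaluate both CDFs there.
    grid = np.concatenate([a, b])
    cdf_a = np.searchsorted(a, grid, side="right") / len(a)
    cdf_b = np.searchsorted(b, grid, side="right") / len(b)
    return float(np.max(np.abs(cdf_a - cdf_b)))

rng = np.random.default_rng(0)
training = rng.normal(0.0, 1.0, size=1000)
production_same = rng.normal(0.0, 1.0, size=1000)     # same distribution
production_shifted = rng.normal(0.5, 1.0, size=1000)  # mean shifted by 0.5

stat_same = ks_statistic(training, production_same)
stat_shifted = ks_statistic(training, production_shifted)
print(f"same: {stat_same:.3f}, shifted: {stat_shifted:.3f}")
```

With very large production samples, even tiny and harmless shifts yield small p-values, so it can help to look at the statistic itself as an effect size alongside the p-value.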

Model Performance Monitoring

Track these metrics in production (with delayed labels when available):

Prediction distribution — is the model predicting more positives than expected?
Confidence distribution — are scores shifting toward uncertain range?
Error rate (when labels arrive) — compare to offline baseline
Latency P99 — performance regression from model serving layer
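The first two checks need no labels and can run on every batch of scores. Below is a minimal sketch in plain Python, assuming binary-classifier scores in [0, 1], a 0.5 decision threshold, and an uncertain band of 0.45–0.55. The helper name `monitoring_snapshot`, the baseline rate, and the scores are all hypothetical.

```python
def monitoring_snapshot(scores, baseline_positive_rate, threshold=0.5, band=(0.45, 0.55)):
    """Label-free health metrics for one window of production scores."""
    if not scores:
        raise ValueError("scores must be non-empty")
    n = len(scores)
    positive_rate = sum(s >= threshold for s in scores) / n
    uncertain_share = sum(band[0] <= s <= band[1] for s in scores) / n
    return {
        "positive_rate": positive_rate,
        # Positive values mean more positives than the offline baseline expected.
        "positive_rate_shift": positive_rate - baseline_positive_rate,
        "uncertain_share": uncertain_share,
    }

# Made-up window of ten production scores and an assumed offline baseline of 40% positives.
window = [0.9, 0.8, 0.52, 0.49, 0.1, 0.7, 0.55, 0.2, 0.95, 0.6]
snapshot = monitoring_snapshot(window, baseline_positive_rate=0.4)
print(snapshot)
```

What counts as an alarming shift depends on window size and traffic volume, so any alert threshold on these values is something to tune rather than a fixed rule.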

When to Retrain

| Signal | Action |
|--------|--------|
| Data drift detected (KS test fails) | Evaluate impact; retrain if metrics degrade |
| Offline metrics drop > 2% | Immediate retraining trigger |
| Concept drift (world has changed) | Schedule retrain |
| New labeled data available | Scheduled retraining (weekly/monthly) |
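The table can be encoded as a small decision helper. This is one possible reading rather than a prescribed policy: it assumes the 2% threshold is an absolute drop of 0.02 in the tracked metric, and the function name `retrain_actions` and its parameters are hypothetical.

```python
def retrain_actions(drift_detected, metric_drop, concept_drift,
                    new_labels_available, max_drop=0.02):
    """Map monitoring signals to the actions in the table, most urgent first."""
    actions = []
    if metric_drop > max_drop:
        actions.append("retrain immediately")
    if drift_detected:
        actions.append("evaluate impact of data drift; retrain if metrics degrade")
    if concept_drift:
        actions.append("schedule retrain")
    if new_labels_available:
        actions.append("include in scheduled retraining")
    return actions or ["no action"]

# Example: drift was detected and the metric fell by 3 points.
print(retrain_actions(drift_detected=True, metric_drop=0.03,
                      concept_drift=False, new_labels_available=False))
```

Returning every matching action, not just the first, keeps the output useful when several signals fire at once.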

Deeper Reference

For complete evaluation framework implementations and metrics reference tables, see:

references/evaluation-frameworks.md — scikit-learn, HuggingFace Evaluate, and RAGAS evaluation pipelines with slice analysis and fairness auditing
references/metrics-reference.md — decision guide for metric selection, threshold recommendations, and statistical significance testing for A/B experiments


This skill should be used when the user asks about "evaluate a model", "model evaluation", "evaluation metrics", "accuracy", "precision", "recall", "F1 score", "AUROC", "AUC-ROC", "confusion matrix", "BLEU", "ROUGE", "BERTScore", "model bias", "fairness metrics", "slice analysis", "error analysis", "calibration", "reliability diagram", "expected calibration error", "A/B test model", "shadow deployment", "champion-challenger", "offline evaluation", "online evaluation", "benchmark", "regression metrics", "MAE", "RMSE", "R squared", "model performance", "why is my model bad", or "production model monitoring". Also trigger for "how do I know if my model is good", "my model is worse in production than offline", or "how to measure fairness".

2 stars

development

Updated Apr 5, 2026

$ install --global

skillsauth

npx skillsauth add harsh040506/claude-code-unified-skill-plugin-library model-evaluation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Status | Scanner | Description | Percentage |
|--------|---------|-------------|------------|
| Clean | Trivy | Container and dependency vulnerability scanner | 95% |
| Clean | Semgrep | Static code analysis for vulnerabilities | 95% |
| Clean | mcp-scan (Snyk) | Model Context Protocol security validation | 95% |
| Skipped | Snyk (dep) | Open source security scanning | 50% |
| Skipped | Socket.dev | Supply chain security analysis | 50% |
| Skipped | VirusTotal | Multi-engine malware detection | 50% |
| Skipped | CrowdStrike | Advanced threat intelligence | 50% |
| Skipped | OSV-Scanner | Open Source Vulnerability database check | 50% |
| Skipped | OWASP Dep-Check | | 50% |

Last scanned: Apr 5, 2026, 5:10 PM · 5.0s · 3 files scanned




Install manually


# Clone the repo
git clone https://github.com/harsh040506/claude-code-unified-skill-plugin-library.git

# Copy into Claude Code skills folder (global)
cp -r claude-code-unified-skill-plugin-library/engineering/ai-ml-engineering/skills/model-evaluation ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

harsh040506/claude-code-unified-skill-plugin-library

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT