Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

kienbui1995/model-evaluation

Name: model-evaluation
Author: kienbui1995

skills/model-evaluation/SKILL.md

npx skillsauth add kienbui1995/magic-powers model-evaluation

Model Evaluation

When to Use

Use when assessing whether a model is good enough to deploy, treats all user groups fairly, and won't degrade in production.

Core Jobs

1. Choose the Right Metrics

Match metric to problem type:

Binary classification: AUC-ROC, F1, precision, recall (choose based on cost of FP vs FN)
Multi-class: macro/weighted F1, confusion matrix
Regression: RMSE, MAE, MAPE (use MAE when outliers shouldn't dominate)
Ranking: NDCG, MRR
Generation (LLM): BLEU/ROUGE (weak), human eval, LLM-as-judge
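
As a minimal sketch of two of the metric families listed above, the following computes precision, recall, and F1 from 0/1 labels, and MAE and RMSE for a regression. The function names and toy data are illustrative assumptions, not part of the skill; in practice a library such as scikit-learn would normally be used.

```python
import math

def binary_metrics(y_true, y_pred):
    """Precision, recall, and F1 for 0/1 labels, where 1 is the positive class."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(1 for t, p in pairs if t == 1 and p == 1)
    fp = sum(1 for t, p in pairs if t == 0 and p == 1)
    fn = sum(1 for t, p in pairs if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    denom = precision + recall
    f1 = 2 * precision * recall / denom if denom else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}

def regression_errors(y_true, y_pred):
    """MAE and RMSE; RMSE squares the errors, so one large miss weighs more."""
    errors = [p - t for t, p in zip(y_true, y_pred)]
    mae = sum(abs(e) for e in errors) / len(errors)
    rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
    return {"mae": mae, "rmse": rmse}

# Toy data (hypothetical): 2 true positives, 1 false positive, 1 false negative.
clf = binary_metrics([1, 1, 1, 0, 0, 0, 0, 0], [1, 1, 0, 1, 0, 0, 0, 0])

# Toy data (hypothetical): three small errors and one outlier of 20.
reg = regression_errors([10, 10, 10, 10], [11, 9, 10, 30])
```

In the regression example the single outlier pushes RMSE to roughly twice the MAE, which is why MAE is the better choice when outliers shouldn't dominate.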

The business metric matters more than the ML metric: always connect model performance to a business outcome.
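
One way to tie a classifier to a business outcome is to price false positives and false negatives and pick the decision threshold with the lowest total cost. The sketch below assumes per-error costs are known; the function names, costs, and scores are hypothetical.

```python
def total_cost(y_true, scores, threshold, cost_fp, cost_fn):
    """Sum the cost of every false positive and false negative at a threshold."""
    cost = 0.0
    for t, s in zip(y_true, scores):
        pred = 1 if s >= threshold else 0
        if pred == 1 and t == 0:
            cost += cost_fp
        elif pred == 0 and t == 1:
            cost += cost_fn
    return cost

def best_threshold(y_true, scores, thresholds, cost_fp, cost_fn):
    """Return the candidate threshold with the lowest total cost."""
    return min(
        thresholds,
        key=lambda th: total_cost(y_true, scores, th, cost_fp, cost_fn),
    )

# Hypothetical validation labels and model scores.
y_true = [0, 0, 1, 1]
scores = [0.2, 0.6, 0.4, 0.9]
candidates = [0.3, 0.5, 0.7]

# A missed positive is 10x as costly as a false alarm: a low threshold wins.
fn_expensive = best_threshold(y_true, scores, candidates, cost_fp=1, cost_fn=10)

# A false alarm is 10x as costly as a miss: a high threshold wins.
fp_expensive = best_threshold(y_true, scores, candidates, cost_fp=10, cost_fn=1)
```

The same model and data yield different operating points depending on the cost ratio, which is the practical meaning of choosing a metric based on the cost of FP vs FN.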

2. Evaluate Across Segments

Don't report only aggregate metrics. Slice by:

User demographics (age, region, language)
Data subgroups (product category, request type)
Time (recent vs older data — look for drift)
Edge cases (short inputs, rare labels)
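
A slice analysis can be sketched as a group-by over evaluation records. The record fields (`label`, `prediction`, `region`), the helper names, and the `max_gap` tolerance below are assumptions chosen for illustration.

```python
from collections import defaultdict

def accuracy_by_slice(records, slice_key):
    """Accuracy per distinct value of records[i][slice_key]."""
    totals = defaultdict(int)
    correct = defaultdict(int)
    for r in records:
        key = r[slice_key]
        totals[key] += 1
        correct[key] += int(r["label"] == r["prediction"])
    return {k: correct[k] / totals[k] for k in totals}

def underperforming_slices(per_slice, overall, max_gap):
    """Slices whose accuracy trails the aggregate by more than max_gap."""
    return sorted(k for k, acc in per_slice.items() if overall - acc > max_gap)

# Hypothetical evaluation records sliced by region.
records = [
    {"region": "en", "label": 1, "prediction": 1},
    {"region": "en", "label": 0, "prediction": 0},
    {"region": "en", "label": 1, "prediction": 1},
    {"region": "en", "label": 0, "prediction": 0},
    {"region": "vi", "label": 1, "prediction": 0},
    {"region": "vi", "label": 0, "prediction": 0},
]

overall = sum(r["label"] == r["prediction"] for r in records) / len(records)
per_region = accuracy_by_slice(records, "region")
flagged = underperforming_slices(per_region, overall, max_gap=0.1)
```

Here the aggregate accuracy is about 0.83, which hides the fact that the `vi` slice sits at 0.5. Small slices give noisy estimates, so a real report would also show the sample count for each slice.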

3. Bias and Fairness Checks

Equal opportunity: is the true positive rate (TPR) equal across groups?
Demographic parity: is the positive prediction rate equal across groups?
Tools: Fairlearn, IBM AI Fairness 360
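
The two checks above reduce to per-group rates and the gap between the best and worst group. The following plain-Python sketch uses hypothetical helper names and toy data; Fairlearn and AI Fairness 360 provide maintained implementations of comparable metrics and are preferable for real assessments.

```python
def group_rates(y_true, y_pred, groups):
    """Per-group selection rate (share predicted positive) and TPR."""
    counts = {}
    for t, p, g in zip(y_true, y_pred, groups):
        c = counts.setdefault(g, {"n": 0, "pred_pos": 0, "pos": 0, "tp": 0})
        c["n"] += 1
        c["pred_pos"] += int(p == 1)
        c["pos"] += int(t == 1)
        c["tp"] += int(t == 1 and p == 1)
    return {
        g: {
            "selection_rate": c["pred_pos"] / c["n"],
            # TPR is undefined for a group with no actual positives.
            "tpr": c["tp"] / c["pos"] if c["pos"] else None,
        }
        for g, c in counts.items()
    }

def gap(rates, metric):
    """Largest difference in a metric between any two groups."""
    values = [r[metric] for r in rates.values() if r[metric] is not None]
    return max(values) - min(values)

# Hypothetical labels and predictions for two groups, "a" and "b".
y_true = [1, 1, 0, 0, 1, 1, 0, 0]
y_pred = [1, 1, 1, 0, 1, 0, 0, 0]
groups = ["a"] * 4 + ["b"] * 4

rates = group_rates(y_true, y_pred, groups)
parity_gap = gap(rates, "selection_rate")   # demographic parity gap
opportunity_gap = gap(rates, "tpr")         # equal opportunity gap
```

A gap of 0 means the groups are treated identically on that criterion. What counts as an acceptable gap is a policy decision that this sketch does not settle.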

4. Pre-Production Validation

[ ] Performance on holdout test set
[ ] Performance on recent data (last 30 days)
[ ] Latency at P50/P95/P99 (meets SLA?)
[ ] Memory footprint (fits in serving environment?)
[ ] Slice analysis (no group significantly underperforms)
[ ] Shadow mode test (run alongside current system)
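
The latency item in the checklist can be automated with a percentile check against SLA limits. This sketch uses the nearest-rank percentile definition; the SLA values, helper names, and sample latencies are hypothetical.

```python
def percentile(values, pct):
    """Nearest-rank percentile for an integer pct in 1..100."""
    ordered = sorted(values)
    # Integer ceiling of pct * n / 100, avoiding floating-point rounding.
    rank = (pct * len(ordered) + 99) // 100
    return ordered[max(rank, 1) - 1]

def latency_report(latencies_ms, sla_ms):
    """P50/P95/P99 plus a flag for whether every SLA limit is met."""
    report = {f"p{p}": percentile(latencies_ms, p) for p in (50, 95, 99)}
    report["meets_sla"] = all(report[k] <= limit for k, limit in sla_ms.items())
    return report

# Hypothetical measurements: 100 requests taking 1..100 ms.
latencies = list(range(1, 101))

ok = latency_report(latencies, {"p50": 60, "p95": 100, "p99": 120})
too_slow = latency_report(latencies, {"p50": 60, "p95": 100, "p99": 90})
```

Measurements should come from an environment that resembles production serving (same hardware, batch size, and concurrency), otherwise the percentiles say little about the real SLA.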

Key Outputs

Evaluation report with chosen metrics and rationale
Slice analysis (breakdown by key segments)
Bias/fairness assessment
Pre-production validation checklist

Anti-Patterns

Optimizing for accuracy on imbalanced datasets
Never slicing results by subgroup
Declaring a model "ready" without latency testing
Using the test set for model selection (leakage)
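
The first anti-pattern is easy to demonstrate: on a dataset with 5% positives, a model that always predicts the negative class scores 95% accuracy while finding no positives at all. The data and helper names below are made up for illustration.

```python
def accuracy(y_true, y_pred):
    """Share of predictions that match the label."""
    return sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)

def recall_for(y_true, y_pred, cls):
    """Share of examples with true class cls that were predicted as cls."""
    preds = [p for t, p in zip(y_true, y_pred) if t == cls]
    return sum(p == cls for p in preds) / len(preds)

def balanced_accuracy(y_true, y_pred):
    """Mean of per-class recall, so every class counts equally."""
    classes = sorted(set(y_true))
    return sum(recall_for(y_true, y_pred, c) for c in classes) / len(classes)

# Hypothetical imbalanced dataset: 5 positives, 95 negatives.
y_true = [1] * 5 + [0] * 95
# Degenerate "model" that always predicts the majority class.
y_pred = [0] * 100

acc = accuracy(y_true, y_pred)            # 0.95
pos_recall = recall_for(y_true, y_pred, 1)  # 0.0
bal_acc = balanced_accuracy(y_true, y_pred)  # 0.5
```

Balanced accuracy of 0.5 is what a coin flip would achieve on two classes, which exposes the baseline that plain accuracy hides.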

Use when selecting evaluation metrics, detecting bias, or validating model readiness for production

data-ai

Updated Apr 24, 2026

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

npx skillsauth add kienbui1995/magic-powers model-evaluation

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners in report (status, with the percentage shown for each scanner):

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 24, 2026, 3:59 AM; 365.4s; 1 file scanned

Install manually

# Clone the repo
git clone https://github.com/kienbui1995/magic-powers.git

# Create the global Claude Code skills folder if it does not exist, then copy the skill into it
mkdir -p ~/.claude/skills
cp -r magic-powers/skills/model-evaluation ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

kienbui1995/magic-powers

Compatible with

Claude Code, OpenAI Codex CLI, ChatGPT