Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

a5c-ai/eval-harness

Name: eval-harness
Author: a5c-ai

library/methodologies/everything-claude-code/skills/eval-harness/SKILL.md

npx skillsauth add a5c-ai/babysitter eval-harness

Eval Harness

Overview

Evaluation harness methodology adapted from the Everything Claude Code project. Provides structured frameworks for benchmarking agent performance, testing skill quality, and running regression suites.

Evaluation Types

1. Agent Performance Benchmark

Define test cases with known-correct outputs
Run agent against each test case
Score: accuracy, completeness, relevance
Compare against baseline performance
Track performance over time
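The benchmark loop above can be sketched as follows. This is a minimal illustration, not the skill's own implementation: `BenchmarkCase`, `run_benchmark`, and the stand-in agent are hypothetical names, and exact string match stands in for the accuracy, completeness, and relevance scoring that the skill leaves unspecified.

```python
from dataclasses import dataclass
from typing import Callable


@dataclass
class BenchmarkCase:
    """One test case with a known-correct output (hypothetical structure)."""
    prompt: str
    expected: str


def run_benchmark(agent: Callable[[str], str], cases: list[BenchmarkCase]) -> float:
    """Return the percentage of cases where the agent output equals the expected output."""
    if not cases:
        raise ValueError("benchmark needs at least one test case")
    passed = sum(1 for c in cases if agent(c.prompt).strip() == c.expected.strip())
    return 100.0 * passed / len(cases)


def delta_vs_baseline(current: float, baseline: float) -> float:
    """Positive means the agent improved on the stored baseline."""
    return current - baseline


def upper_agent(prompt: str) -> str:
    """Stand-in agent used only for this example."""
    return prompt.upper()


cases = [
    BenchmarkCase("abc", "ABC"),
    BenchmarkCase("hi", "HI"),
    BenchmarkCase("x", "y"),  # deliberately failing case
]
score = run_benchmark(upper_agent, cases)
delta = delta_vs_baseline(score, baseline=50.0)  # baseline value is made up
```

Appending each run's `score` to a dated log would cover the "track performance over time" step.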

2. Skill Quality Testing

Verify skill instructions produce expected outcomes
Test edge cases and boundary conditions
Measure consistency across multiple runs
Check for harmful or incorrect outputs
Validate against ground truth

3. Regression Suite

Collection of previously-passing test cases
Run after any agent/skill modification
Flag regressions with before/after comparison
Maintain pass rate threshold (>= 95%)
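A regression check along these lines might look like the sketch below. The result format (a mapping from case name to pass/fail) and the function names are assumptions for illustration; only the 95% threshold comes from the skill text.

```python
PASS_RATE_THRESHOLD = 95.0  # from the skill text: pass rate must stay >= 95%


def find_regressions(before: dict[str, bool], after: dict[str, bool]) -> list[str]:
    """Names of cases that passed before and now fail or are missing."""
    return sorted(
        name for name, passed in before.items()
        if passed and not after.get(name, False)
    )


def pass_rate(results: dict[str, bool]) -> float:
    """Percentage of cases that passed."""
    if not results:
        raise ValueError("no results to score")
    return 100.0 * sum(results.values()) / len(results)


# Hypothetical before/after results for a four-case suite.
before = {"case-1": True, "case-2": True, "case-3": True, "case-4": True}
after = {"case-1": True, "case-2": False, "case-3": True, "case-4": True}

regressions = find_regressions(before, after)
rate = pass_rate(after)
suite_ok = rate >= PASS_RATE_THRESHOLD and not regressions
```

Here one regression drops the pass rate to 75%, so `suite_ok` is false and `regressions` names the case to inspect.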

4. Process Verification

End-to-end process execution with known inputs
Verify each phase produces expected outputs
Check task ordering and dependency satisfaction
Measure total execution time
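The ordering and timing checks can be illustrated with a small sketch. The task names, the `depends_on` mapping, and both helper functions are hypothetical; the skill does not prescribe a data model for processes.

```python
import time
from typing import Callable


def check_ordering(executed: list[str], depends_on: dict[str, set[str]]) -> list[str]:
    """Return one message per task that ran before any of its dependencies."""
    seen: set[str] = set()
    violations: list[str] = []
    for task in executed:
        missing = depends_on.get(task, set()) - seen
        if missing:
            violations.append(f"{task} ran before {', '.join(sorted(missing))}")
        seen.add(task)
    return violations


def timed(process: Callable[[], None]) -> float:
    """Run a process and return its wall-clock duration in seconds."""
    start = time.perf_counter()
    process()
    return time.perf_counter() - start


# Hypothetical three-phase process: plan -> implement -> test.
depends_on = {"implement": {"plan"}, "test": {"implement"}}
good_run = check_ordering(["plan", "implement", "test"], depends_on)
bad_run = check_ordering(["plan", "test", "implement"], depends_on)
elapsed = timed(lambda: None)
```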

Quality Scoring

Accuracy Score (0-100)

Correctness of output vs expected
Partial credit for partially correct outputs
Penalty for hallucinated or fabricated content
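One possible reading of these rules, assuming outputs can be broken into discrete items (facts, fields, or claims): partial credit is the share of expected items present, and each item not in the expected set costs a fixed penalty. The item-set representation and the 10-point penalty are assumptions, not values from the skill.

```python
def accuracy_score(
    expected: set[str],
    produced: set[str],
    penalty_per_fabrication: float = 10.0,  # assumed penalty size
) -> float:
    """Score 0-100: partial credit for matched items, minus a penalty per fabricated item."""
    if not expected:
        raise ValueError("expected items must not be empty")
    correct = expected & produced
    fabricated = produced - expected
    raw = 100.0 * len(correct) / len(expected) - penalty_per_fabrication * len(fabricated)
    return max(0.0, min(100.0, raw))  # clamp to the 0-100 range


# Three of four expected items found, plus one fabricated item "z".
example = accuracy_score({"a", "b", "c", "d"}, {"a", "b", "c", "z"})
```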

Completeness Score (0-100)

Coverage of required output elements
Missing sections flagged and scored
Bonus for useful additional context
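A rough sketch of a coverage check, assuming required elements can be detected by a case-insensitive substring match on their names. The `bonus` parameter is a placeholder for the "useful additional context" bonus, whose size the skill does not define.

```python
def completeness_score(
    output: str,
    required_sections: list[str],
    bonus: float = 0.0,  # assumed: caller decides any bonus for extra context
) -> tuple[float, list[str]]:
    """Return (score 0-100, list of missing sections)."""
    if not required_sections:
        raise ValueError("at least one required section is needed")
    text = output.lower()
    missing = [s for s in required_sections if s.lower() not in text]
    covered = len(required_sections) - len(missing)
    score = min(100.0, 100.0 * covered / len(required_sections) + bonus)
    return score, missing


# Hypothetical report that omits its "Recommendations" section.
report = "Summary\nAll checks ran.\n\nFindings\nTwo flaky cases."
score, missing = completeness_score(report, ["Summary", "Findings", "Recommendations"])
```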

Consistency Score (0-100)

Run same input 3 times
Compare outputs for semantic similarity
Flag inconsistencies
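The comparison step could be sketched as below. Note the substitution: `difflib.SequenceMatcher` measures character-level similarity, not semantic similarity, and is used here only as a dependency-free stand-in; an embedding-based comparison would be closer to what the skill describes. The 0.8 flagging floor is an assumption.

```python
from difflib import SequenceMatcher
from itertools import combinations
from statistics import mean

SIMILARITY_FLOOR = 0.8  # assumed cutoff below which a pair is flagged


def consistency_score(outputs: list[str]) -> tuple[float, list[tuple[int, int]]]:
    """Return (score 0-100, index pairs of runs that disagree)."""
    if len(outputs) < 2:
        raise ValueError("need at least two outputs to compare")
    ratios = {
        (i, j): SequenceMatcher(None, outputs[i], outputs[j]).ratio()
        for i, j in combinations(range(len(outputs)), 2)
    }
    flagged = [pair for pair, ratio in ratios.items() if ratio < SIMILARITY_FLOOR]
    return 100.0 * mean(ratios.values()), flagged


# Three runs of the same input, as the skill text specifies.
stable_score, stable_flags = consistency_score(["the cat sat"] * 3)
mixed_score, mixed_flags = consistency_score(["yes", "yes", "no"])
```

In the second call the third run disagrees with the other two, so both pairs involving it are flagged.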

Composite Score

(accuracy * 0.4 + completeness * 0.3 + consistency * 0.3)
Threshold: 80 to pass
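As a worked example of the formula, scores of 90, 80, and 70 give 90 * 0.4 + 80 * 0.3 + 70 * 0.3 = 36 + 24 + 21 = 81, which passes. The sketch below encodes the same weights; treating the threshold as inclusive (a score of exactly 80 passes) is an assumption, since the text does not say.

```python
WEIGHTS = {"accuracy": 0.4, "completeness": 0.3, "consistency": 0.3}
PASS_THRESHOLD = 80.0


def composite_score(accuracy: float, completeness: float, consistency: float) -> float:
    """Weighted sum of the three 0-100 component scores."""
    parts = {"accuracy": accuracy, "completeness": completeness, "consistency": consistency}
    for name, value in parts.items():
        if not 0.0 <= value <= 100.0:
            raise ValueError(f"{name} must be between 0 and 100, got {value}")
    return sum(WEIGHTS[name] * value for name, value in parts.items())


def passes(score: float) -> bool:
    """Assumes the threshold is inclusive."""
    return score >= PASS_THRESHOLD


total = composite_score(accuracy=90, completeness=80, consistency=70)
```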

When to Use

After creating new agents or skills
After modifying existing agents or skills
Periodic quality audits
Before promoting skills to production

Agents Used

Used by process-level evaluation orchestrators
No specific agent dependency (evaluates other agents)

Evaluation harness for testing agent and skill quality through structured benchmarks, regression tests, and quality scoring.

510 stars

testing

Updated Mar 31, 2026

npx skillsauth add a5c-ai/babysitter eval-harness

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 1, 2026, 11:38 AM | 55.6s | 2 files scanned

SKILL.md

name: eval-harness
description: Evaluation harness for testing agent and skill quality through structured benchmarks, regression tests, and quality scoring.
allowed-tools: Read, Write, Edit, Bash, Grep, Glob

Related Skills

a5c-ai/model-card-generator

development

Verified, Trusted, Community

Model documentation skill for generating model cards following Google's model card framework.

680 | SKILL.md | Updated Apr 28, 2026

a5c-ai/mlflow-experiment-tracker

development

Verified, Trusted, Community

MLflow integration skill for experiment tracking, model registry, and artifact management. Enables LLMs to log experiments, compare runs, manage model lifecycle, and retrieve artifacts through the MLflow API.

680 | SKILL.md | Updated Apr 28, 2026

a5c-ai/lime-explainer

data-ai

Verified, Trusted, Community

LIME-based local explanation skill for individual predictions across tabular, text, and image data.

680 | SKILL.md | Updated Apr 28, 2026

a5c-ai/kubeflow-pipeline-executor

devops

Verified, Trusted, Community

Kubeflow Pipelines skill for ML workflow orchestration, component management, and Kubernetes-native ML.

680 | SKILL.md | Updated Apr 28, 2026

Install manually

# Clone the repo
git clone https://github.com/a5c-ai/babysitter.git

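# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r babysitter/library/methodologies/everything-claude-code/skills/eval-harness ~/.claude/skills/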

For the skills path, see the official Claude Code Skills documentation.

Repository

a5c-ai/babysitter

510 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT