Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

hidai25/run-eval

Name: run-eval
Author: hidai25

skills/run-eval/SKILL.md

npx skillsauth add hidai25/eval-view run-eval

Run Eval

Use this skill after making changes to an AI agent (prompt edits, model swaps, tool changes, code refactors) to verify nothing broke.

What this does

EvalView compares current agent behavior against saved golden baselines. It runs your test cases, evaluates the outputs, and reports a diff status for each test:

PASSED — behavior matches the baseline
OUTPUT_CHANGED — output shifted but may be intentional
TOOLS_CHANGED — different tools were called
REGRESSION — score dropped significantly (blocking failure)
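
The four statuses above map naturally onto a small triage table. The sketch below is a hypothetical illustration, not part of EvalView's API: the names Severity, STATUS_SEVERITY, and triage are assumptions introduced here, and only the status strings come from the list above. It treats OUTPUT_CHANGED and TOOLS_CHANGED as warnings and REGRESSION as blocking, as the skill text describes.

```python
from enum import IntEnum


class Severity(IntEnum):
    """Hypothetical severity levels, ordered from least to most severe."""
    OK = 0
    WARNING = 1
    BLOCKING = 2


# Hypothetical mapping; only the status strings are taken from the skill text.
STATUS_SEVERITY = {
    "PASSED": Severity.OK,
    "OUTPUT_CHANGED": Severity.WARNING,
    "TOOLS_CHANGED": Severity.WARNING,
    "REGRESSION": Severity.BLOCKING,
}


def triage(statuses):
    """Return the most severe Severity among the per-test diff statuses.

    An empty iterable yields Severity.OK; an unknown status raises KeyError.
    """
    worst = Severity.OK
    for status in statuses:
        worst = max(worst, STATUS_SEVERITY[status])
    return worst
```

With this mapping, a run containing one TOOLS_CHANGED test and no REGRESSION would surface as a warning for the user to review rather than as a failure.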

Steps

1. Locate the test directory. Look for tests/evalview/ in the project. If it exists, use that. Otherwise check for a tests/ directory with .yaml test files.
2. Run a regression check using the run_check MCP tool:
   - If checking all tests: call run_check with the detected test_path
   - If checking a specific test: also pass the test parameter with the test name
3. Interpret results:
   - If all tests pass, confirm to the user that no regressions were found
   - If REGRESSION is reported, show the diff (score delta, tool changes, output similarity) and offer to help fix it
   - If OUTPUT_CHANGED or TOOLS_CHANGED, flag it as a warning; the user should decide if the change is intentional
4. If changes are intentional, offer to update the baseline by calling run_snapshot with an explanatory notes parameter.
5. Generate a visual report (optional) by calling generate_visual_report for a detailed HTML breakdown of traces, diffs, scores, and timelines.
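
The directory lookup in the first step can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not EvalView's own logic: the function name locate_test_dir is hypothetical, and searching tests/ recursively for .yaml files is a guess, since the skill text does not say whether nested files count.

```python
from pathlib import Path
from typing import Optional


def locate_test_dir(project_root: str) -> Optional[Path]:
    """Hypothetical helper: pick the directory to pass as test_path.

    Prefers <root>/tests/evalview/ when it exists; otherwise falls back to
    <root>/tests/ if it contains at least one .yaml file. Returns None when
    neither is found.
    """
    root = Path(project_root)

    preferred = root / "tests" / "evalview"
    if preferred.is_dir():
        return preferred

    fallback = root / "tests"
    # Assumption: .yaml files in subdirectories of tests/ also qualify.
    if fallback.is_dir() and any(fallback.rglob("*.yaml")):
        return fallback

    return None
```

Returning None instead of raising leaves it to the caller to decide what to tell the user when no test directory is found.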

CLI equivalent

evalview check tests/evalview/
evalview check tests/evalview/ --test "my-test"
evalview snapshot tests/evalview/ --notes "updated after prompt refactor"
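
When scripting these commands, it can help to build the argument list in one place. The sketch below only assembles argv lists from the subcommands and flags shown above (check, snapshot, --test, --notes); the helper names check_command and snapshot_command are hypothetical, and nothing here is part of EvalView itself.

```python
from typing import List, Optional


def check_command(test_path: str, test: Optional[str] = None) -> List[str]:
    """Build the argv for `evalview check`, optionally limited to one test."""
    cmd = ["evalview", "check", test_path]
    if test is not None:
        cmd += ["--test", test]
    return cmd


def snapshot_command(test_path: str, notes: str) -> List[str]:
    """Build the argv for `evalview snapshot` with an explanatory note."""
    return ["evalview", "snapshot", test_path, "--notes", notes]
```

The resulting lists can be passed to subprocess.run. Passing a list rather than a shell string avoids quoting problems when the test name or note contains spaces. How evalview signals a regression through its exit code is not stated here, so check its documentation before relying on the return code.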

Tips

Use run_check frequently — it calls the Python API directly with no subprocess overhead.
A score delta near zero with TOOLS_CHANGED often means the agent found an equivalent path.
Always snapshot after confirming intentional changes so future checks compare against the new baseline.

Run EvalView regression checks against golden baselines to detect regressions in AI agent behavior after code, prompt, or model changes.

82 stars

development

Updated Apr 14, 2026

npx skillsauth add hidai25/eval-view run-eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%).
Semgrep: Static code analysis for vulnerabilities. Clean (95%).
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%).
Snyk (dep): Open source security scanning. Skipped (50%).
Socket.dev: Supply chain security analysis. Skipped (50%).
VirusTotal: Multi-engine malware detection. Skipped (50%).
CrowdStrike: Advanced threat intelligence. Skipped (50%).
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%).
OWASP Dep-Check: Skipped (50%).

Last scanned: Apr 14, 2026, 3:00 AM · 7.4s · 1 file scanned

SKILL.md

name: run-eval
description: Run EvalView regression checks against golden baselines to detect regressions in AI agent behavior after code, prompt, or model changes.

Related Skills

hidai25/watch
testing
Verified · Trusted · Community
Start EvalView watch mode to automatically re-run regression checks whenever project files change.
82 · SKILL.md · Updated Apr 14, 2026

hidai25/generate-tests
testing
Verified · Trusted · Community
Generate EvalView test cases, either from a SKILL.md file using LLM-powered generation or by capturing real agent interactions through a proxy.
82 · SKILL.md · Updated Apr 14, 2026

hidai25/code-reviewer
development
Verified · Trusted · Community
A skill that helps review code for best practices, bugs, and security issues.
80 · SKILL.md · Updated Apr 5, 2026

hidai25/hello-world
tools
Verified · Trusted · Community
A simple skill that creates a greeting file.
80 · SKILL.md · Updated Apr 5, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/hidai25/eval-view.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r eval-view/skills/run-eval ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

hidai25/eval-view

82 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT