Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

neekware/artifacts/bundle/skills/engineering/eval

Name: artifacts/bundle/skills/engineering/eval
Author: neekware

testing

Updated Apr 24, 2026

npx skillsauth add neekware/ehayeskills artifacts/bundle/skills/engineering/eval

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Clean: Trivy (Container and dependency vulnerability scanner), 95%
- Clean: Semgrep (Static code analysis for vulnerabilities), 95%
- Clean: mcp-scan (Snyk) (Model Context Protocol security validation), 95%
- Skipped: Snyk (dep) (Open source security scanning), 50%
- Skipped: Socket.dev (Supply chain security analysis), 50%
- Skipped: VirusTotal (Multi-engine malware detection), 50%
- Skipped: CrowdStrike (Advanced threat intelligence), 50%
- Skipped: OSV-Scanner (Open Source Vulnerability database check), 50%
- Skipped: OWASP Dep-Check, 50%

Last scanned: Apr 24, 2026, 4:00 PM (13.4s, 1 file scanned)

SKILL.md

/hub:eval — Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

Usage

/hub:eval                           # Eval latest session using configured criteria
/hub:eval 20260317-143022           # Eval specific session
/hub:eval --judge                   # Force LLM judge mode (ignore metric config)

What It Does

Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}

Output:

RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)
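
The ranking logic behind a table like the one above can be sketched as follows. This is a minimal illustration, not the actual result_ranker.py, whose internals are not shown on this page. The function name `rank_results`, the `baseline` parameter, and the direction values `"lower"` and `"higher"` are assumptions made for the example; the 180 ms baseline is only inferred from the deltas in the sample output.

```python
from typing import Dict, List, Tuple


def rank_results(
    metrics: Dict[str, float],
    baseline: float,
    direction: str = "lower",
) -> List[Tuple[int, str, float, float]]:
    """Rank agents by metric and return (rank, agent, metric, delta) rows.

    Hypothetical helper: direction "lower" means smaller values win,
    "higher" means larger values win. Delta is metric minus baseline.
    """
    if direction not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {direction!r}")
    ordered = sorted(
        metrics.items(),
        key=lambda item: item[1],
        reverse=(direction == "higher"),
    )
    return [
        (rank, agent, value, value - baseline)
        for rank, (agent, value) in enumerate(ordered, start=1)
    ]


# Values from the sample output; the baseline is inferred, not documented.
rows = rank_results({"agent-1": 165, "agent-2": 142, "agent-3": 190}, baseline=180)
for rank, agent, value, delta in rows:
    print(f"{rank}     {agent}     {value}ms       {delta:+g}ms")
print(f"Winner: {rows[0][1]} ({rows[0][2]}ms)")
```

Passing direction="higher" reverses the ordering, which would suit metrics such as throughput where larger values are better.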

LLM Judge Mode (no eval command, or --judge flag)

For each agent:

1. Get the diff: git diff {base_branch}...{agent_branch}
2. Read the agent's result post from .agenthub/board/results/agent-{i}-result.md
3. Compare all diffs and rank by:
- Correctness — Does it solve the task?
- Simplicity — Fewer lines changed is better (when equal correctness)
- Quality — Clean execution, good structure, no regressions

Present rankings with justification.

Example LLM judge output for a content task:

RANK  AGENT    VERDICT                               WORD COUNT
1     agent-1  Strong narrative, clear CTA            1480
2     agent-3  Good data points, weak intro           1520
3     agent-2  Generic tone, no differentiation       1350

Winner: agent-1 (strongest narrative arc and call-to-action)
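
Gathering the inputs for the judge can be sketched roughly as below. The helper names (`diff_command`, `result_post_path`, `lines_changed`) are hypothetical and not part of the skill; only the git command form and the result post path come from the description above. The line count is a rough proxy for the Simplicity criterion and will miscount removed lines whose content itself begins with "--".

```python
from pathlib import Path
from typing import List


def diff_command(base_branch: str, agent_branch: str) -> List[str]:
    """Build the three-dot git diff command for one agent branch."""
    return ["git", "diff", f"{base_branch}...{agent_branch}"]


def result_post_path(agent_index: int, root: str = ".") -> Path:
    """Location of the agent's result post relative to the repo root."""
    return Path(root) / ".agenthub" / "board" / "results" / f"agent-{agent_index}-result.md"


def lines_changed(diff_text: str) -> int:
    """Count added and removed lines in a unified diff, skipping file headers."""
    count = 0
    for line in diff_text.splitlines():
        if line.startswith(("+++", "---")):
            continue  # file header lines, not content changes
        if line.startswith(("+", "-")):
            count += 1
    return count


# A small hand-written diff used only to exercise lines_changed.
sample_diff = "\n".join([
    "diff --git a/app.py b/app.py",
    "--- a/app.py",
    "+++ b/app.py",
    "@@ -1,3 +1,3 @@",
    " import time",
    "-time.sleep(1)",
    "+time.sleep(0.1)",
    ' print("done")',
])
print(diff_command("main", "agent-1"))
print(result_post_path(1).as_posix())
print(lines_changed(sample_diff))
```

In a real repository the command list could be passed to subprocess.run with capture_output=True and text=True, and the resulting stdout fed to lines_changed. Correctness and Quality still need the judge's reading of the diff; a line count cannot stand in for them.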

Hybrid Mode

1. Run metric evaluation first.
2. If the top agents are within 10% of each other, use the LLM judge to break ties.
3. Present both metric and qualitative rankings.
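
The tie-break condition can be sketched as a small check. The skill does not say how "within 10%" is measured, so this example assumes it is relative to the best metric value; `close_contenders`, `needs_judge`, and the `tolerance` parameter are hypothetical names.

```python
from typing import Dict, List


def close_contenders(
    metrics: Dict[str, float],
    direction: str = "lower",
    tolerance: float = 0.10,
) -> List[str]:
    """Return agents whose metric is within `tolerance` of the best one.

    Assumption: the 10% band is measured relative to the best metric value.
    The result is ordered best first.
    """
    if not metrics:
        return []
    best = min(metrics.values()) if direction == "lower" else max(metrics.values())
    threshold = abs(best) * tolerance
    return sorted(
        (agent for agent, value in metrics.items() if abs(value - best) <= threshold),
        key=lambda agent: metrics[agent],
        reverse=(direction == "higher"),
    )


def needs_judge(
    metrics: Dict[str, float],
    direction: str = "lower",
    tolerance: float = 0.10,
) -> bool:
    """True when more than one agent sits inside the tolerance band."""
    return len(close_contenders(metrics, direction, tolerance)) > 1


# Sample metrics from the metric-mode table: agent-2 leads by more than 10%.
clear_win = {"agent-1": 165, "agent-2": 142, "agent-3": 190}
# A made-up closer race where the top two are within 10% of the best.
close_race = {"agent-1": 150, "agent-2": 142, "agent-3": 190}

print(needs_judge(clear_win), close_contenders(clear_win))
print(needs_judge(close_race), close_contenders(close_race))
```

Only the agents returned by close_contenders would need to go to the judge; the rest keep their metric rank.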

After Eval

Update session state:

python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating

Tell the user:
- Ranked results with winner highlighted
- Next step: /hub:merge to merge the winner
- Or /hub:merge {session-id} --agent {winner} to be explicit

Creator: Engineering
License: MIT
Source Repo: neekware/ehaye-skills
Source Bucket: engineering
Original Path: engineering/agenthub/skills/eval

Install manually

# Clone the repo
git clone https://github.com/neekware/ehayeskills.git

# Copy into Claude Code skills folder (global)
cp -r ehayeskills/artifacts/bundle/skills/engineering/eval ~/.claude/skills/

Repository

neekware/ehayeskills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
