Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

bsweet101/eval

Name: eval
Author: bsweet101

.claude/skills/agenthub/skills/eval/SKILL.md

npx skillsauth add bsweet101/buckstop-rebrand eval

/hub:eval — Evaluate Agent Results

Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.

Usage

/hub:eval                           # Eval latest session using configured criteria
/hub:eval 20260317-143022           # Eval specific session
/hub:eval --judge                   # Force LLM judge mode (ignore metric config)
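The three invocations above, together with the mode headings that follow, imply a simple selection rule: judge mode when `--judge` is passed or no eval command is configured, metric mode otherwise. A minimal sketch of that rule, assuming hypothetical helper names (`EvalRequest`, `parse_eval_args`, `choose_mode`) that do not appear in the skill itself; how hybrid mode is triggered is not specified in the source, so it is left out:

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class EvalRequest:
    session_id: Optional[str]  # None means "use the latest session"
    force_judge: bool          # True when --judge was passed


def parse_eval_args(args: list[str]) -> EvalRequest:
    """Parse the tokens that follow /hub:eval (hypothetical helper)."""
    force_judge = "--judge" in args
    positional = [a for a in args if not a.startswith("--")]
    session_id = positional[0] if positional else None
    return EvalRequest(session_id=session_id, force_judge=force_judge)


def choose_mode(request: EvalRequest, eval_cmd: Optional[str]) -> str:
    """Return "judge" when forced or when no eval command is configured."""
    if request.force_judge or not eval_cmd:
        return "judge"
    return "metric"
```

Keeping the parsing separate from the mode decision means the decision can be checked without a real session on disk.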

What It Does

Metric Mode (eval command configured)

Run the evaluation command in each agent's worktree:

python {skill_path}/scripts/result_ranker.py \
  --session {session-id} \
  --eval-cmd "{eval_cmd}" \
  --metric {metric} --direction {direction}

Output:

RANK  AGENT       METRIC      DELTA      FILES
1     agent-2     142ms       -38ms      2
2     agent-1     165ms       -15ms      3
3     agent-3     190ms       +10ms      1

Winner: agent-2 (142ms)
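The internals of result_ranker.py are not shown here, so the following is only a sketch of how a ranking like the one above could be computed. It assumes that DELTA is the metric minus a shared baseline (180ms is inferred from the sample rows) and that `--direction` distinguishes "lower is better" from "higher is better"; the names `Row`, `rank_agents`, and the values `"lower"`/`"higher"` are hypothetical.

```python
from typing import NamedTuple


class Row(NamedTuple):
    rank: int
    agent: str
    metric: float
    delta: float  # metric minus baseline (assumed meaning of DELTA)


def rank_agents(metrics: dict[str, float], baseline: float,
                direction: str = "lower") -> list[Row]:
    """Sort agents best-first and attach rank and delta."""
    if direction not in ("lower", "higher"):
        raise ValueError("direction must be 'lower' or 'higher'")
    ordered = sorted(metrics.items(), key=lambda kv: kv[1],
                     reverse=(direction == "higher"))
    return [Row(i, agent, value, value - baseline)
            for i, (agent, value) in enumerate(ordered, start=1)]


# Sample values taken from the output table above; the baseline is inferred.
rows = rank_agents({"agent-1": 165, "agent-2": 142, "agent-3": 190}, baseline=180)
for row in rows:
    print(f"{row.rank:<5} {row.agent:<11} {row.metric:g}ms  {row.delta:+g}ms")
print(f"Winner: {rows[0].agent} ({rows[0].metric:g}ms)")
```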

LLM Judge Mode (no eval command, or --judge flag)

For each agent:

1. Get the diff: git diff {base_branch}...{agent_branch}
2. Read the agent's result post from .agenthub/board/results/agent-{i}-result.md
3. Compare all diffs and rank by:
   - Correctness: Does it solve the task?
   - Simplicity: Fewer lines changed is better when correctness is equal
   - Quality: Clean execution, good structure, no regressions

Present rankings with justification.
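The input-gathering part of judge mode can be sketched as below. This is an illustration under assumptions, not the skill's implementation: the helper names (`diff_command`, `read_diff`, `result_post_path`, `build_judge_prompt`) are hypothetical, and the prompt wording is invented for the example. Only the `git diff {base_branch}...{agent_branch}` form, the result-post path, and the three criteria come from the text above.

```python
import subprocess
from pathlib import Path

CRITERIA = ("Correctness", "Simplicity", "Quality")


def diff_command(base_branch: str, agent_branch: str) -> list[str]:
    """Three-dot diff: changes on agent_branch since it diverged from base_branch."""
    return ["git", "diff", f"{base_branch}...{agent_branch}"]


def read_diff(base_branch: str, agent_branch: str) -> str:
    """Run git and return the diff text (requires a git repository)."""
    done = subprocess.run(diff_command(base_branch, agent_branch),
                          capture_output=True, text=True, check=True)
    return done.stdout


def result_post_path(agent_index: int, root: Path = Path(".")) -> Path:
    """Location of an agent's result post under the session board."""
    return root / ".agenthub" / "board" / "results" / f"agent-{agent_index}-result.md"


def build_judge_prompt(entries: dict[str, tuple[str, str]]) -> str:
    """Combine each agent's (diff, result post) pair into one prompt."""
    parts = ["Rank the agents by: " + ", ".join(CRITERIA) + "."]
    for agent, (diff, post) in entries.items():
        parts.append(f"## {agent}\nResult post:\n{post}\nDiff:\n{diff}")
    return "\n\n".join(parts)
```

Building the command as a list rather than a shell string avoids quoting problems with unusual branch names.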

Example LLM judge output for a content task:

RANK  AGENT    VERDICT                               WORD COUNT
1     agent-1  Strong narrative, clear CTA            1480
2     agent-3  Good data points, weak intro           1520
3     agent-2  Generic tone, no differentiation       1350

Winner: agent-1 (strongest narrative arc and call-to-action)

Hybrid Mode

1. Run metric evaluation first
2. If top agents are within 10% of each other, use LLM judge to break ties
3. Present both metric and qualitative rankings
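The 10% tie rule leaves room for interpretation. The sketch below assumes it means "relative difference from the best metric value", and the function name `tied_with_best` is hypothetical. If more than one agent is returned, the LLM judge would be used to break the tie among them.

```python
def tied_with_best(ranked: list[tuple[str, float]],
                   tolerance: float = 0.10) -> list[str]:
    """Given (agent, metric) pairs sorted best-first, return the agents whose
    metric lies within `tolerance` (relative) of the best value.

    Assumption: when the best value is 0, the tolerance is applied as an
    absolute difference, since a relative difference is undefined there.
    """
    if not ranked:
        return []
    best_value = ranked[0][1]
    scale = abs(best_value) or 1.0
    return [agent for agent, value in ranked
            if abs(value - best_value) / scale <= tolerance]


# With the sample metric output, agent-1 is about 16% behind agent-2,
# so only the winner is returned and no tie-break is needed.
contenders = tied_with_best([("agent-2", 142), ("agent-1", 165), ("agent-3", 190)])
needs_judge = len(contenders) > 1
```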

After Eval

Update session state:

python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating

Tell the user:
- Ranked results with winner highlighted
- Next step: /hub:merge to merge the winner
- Or /hub:merge {session-id} --agent {winner} to be explicit

Evaluate and rank agent results by metric or LLM judge for an AgentHub session.

Category: data-ai
Updated: Apr 16, 2026

Install this skill globally with one command:

npx skillsauth add bsweet101/buckstop-rebrand eval

Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy (container and dependency vulnerability scanner): Clean, 95%
- Semgrep (static code analysis for vulnerabilities): Clean, 95%
- mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
- Snyk (dep) (open source security scanning): Skipped, 50%
- Socket.dev (supply chain security analysis): Skipped, 50%
- VirusTotal (multi-engine malware detection): Skipped, 50%
- CrowdStrike (advanced threat intelligence): Skipped, 50%
- OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
- OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 16, 2026, 4:39 PM (13.5s, 1 file scanned)

SKILL.md

name: eval
description: Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
command: /hub:eval

Related Skills

bsweet101/database-designer

data-ai

Use when the user asks to design database schemas, plan data migrations, optimize queries, choose between SQL and NoSQL, or model data relationships.

SKILL.md, updated Apr 17, 2026

bsweet101/customer-success-manager

tools

Monitors customer health, predicts churn risk, and identifies expansion opportunities using weighted scoring models for SaaS customer success. Use when analyzing customer accounts, reviewing retention metrics, scoring at-risk customers, or when the user mentions churn, customer health scores, upsell opportunities, expansion revenue, retention analysis, or customer analytics. Runs three Python CLI tools to produce deterministic health scores, churn risk tiers, and prioritized expansion recommendations across Enterprise, Mid-Market, and SMB segments.

SKILL.md, updated Apr 17, 2026

bsweet101/culture-architect

development

Build, measure, and evolve company culture as operational behavior — not wall posters. Covers mission/vision/values workshops, values-to-behaviors translation, culture code creation, culture health assessment, and cultural rituals by stage. Use when building company values, assessing culture health, designing cultural rituals, creating culture codes, handling culture clashes, or when user mentions culture, values, culture debt, founder culture, or culture code.

SKILL.md, updated Apr 17, 2026

bsweet101/cto-advisor

testing

Technical leadership guidance for engineering teams, architecture decisions, and technology strategy. Use when assessing technical debt, scaling engineering teams, evaluating technologies, making architecture decisions, establishing engineering metrics, or when user mentions CTO, tech debt, technical debt, team scaling, architecture decisions, technology evaluation, engineering metrics, DORA metrics, or technology strategy.

SKILL.md, updated Apr 17, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/bsweet101/buckstop-rebrand.git

# Make sure the global Claude Code skills folder exists
mkdir -p ~/.claude/skills

# Copy the skill into the Claude Code skills folder (global)
cp -r buckstop-rebrand/.claude/skills/agenthub/skills/eval ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository: bsweet101/buckstop-rebrand

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT