SKILLS/agenthub/skills/eval/SKILL.md
Evaluate and rank agent results by metric or LLM judge for an AgentHub session.
npx skillsauth add lioryehuda1-ui/C_GAME eval
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Rank all agent results for a session. Supports metric-based evaluation (run a command), LLM judge (compare diffs), or hybrid.
/hub:eval # Eval latest session using configured criteria
/hub:eval 20260317-143022 # Eval specific session
/hub:eval --judge # Force LLM judge mode (ignore metric config)
Run the evaluation command in each agent's worktree:
python {skill_path}/scripts/result_ranker.py \
--session {session-id} \
--eval-cmd "{eval_cmd}" \
--metric {metric} --direction {direction}
Output:
RANK  AGENT    METRIC  DELTA  FILES
1     agent-2  142ms   -38ms  2
2     agent-1  165ms   -15ms  3
3     agent-3  190ms   +10ms  1
Winner: agent-2 (142ms)
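The ranking step above can be sketched in a few lines of Python. This is a minimal illustration, not the actual contents of result_ranker.py: the function name rank_results, the baseline argument, the "lower"/"higher" direction values, and the input dictionary are all assumptions made for the example. The baseline of 180ms is inferred from the sample deltas (142 + 38 = 165 + 15 = 190 - 10 = 180).

```python
from typing import NamedTuple


class Ranked(NamedTuple):
    rank: int
    agent: str
    metric: float
    delta: float  # metric minus baseline; negative means below baseline


def rank_results(metrics: dict[str, float], baseline: float,
                 direction: str = "lower") -> list[Ranked]:
    """Rank agents by metric value.

    direction is a hypothetical flag value: "lower" means smaller
    metrics win (e.g. latency), "higher" means larger metrics win.
    """
    if direction not in ("lower", "higher"):
        raise ValueError(f"unknown direction: {direction!r}")
    # Sort ascending for "lower", descending for "higher".
    ordered = sorted(metrics.items(), key=lambda kv: kv[1],
                     reverse=(direction == "higher"))
    return [Ranked(i, agent, value, value - baseline)
            for i, (agent, value) in enumerate(ordered, start=1)]


# Sample values taken from the output table; the baseline is inferred.
ranking = rank_results({"agent-1": 165, "agent-2": 142, "agent-3": 190},
                       baseline=180, direction="lower")
for r in ranking:
    print(f"{r.rank}  {r.agent}  {r.metric:g}ms  {r.delta:+g}ms")
print(f"Winner: {ranking[0].agent} ({ranking[0].metric:g}ms)")
```

With the sample values, the sketch selects agent-2 as the winner, matching the table. The FILES column is omitted here because it would come from the diff rather than from the metric.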
For each agent:
git diff {base_branch}...{agent_branch}
.agenthub/board/results/agent-{i}-result.md
Present rankings with justification.
Example LLM judge output for a content task:
RANK  AGENT    VERDICT                           WORD COUNT
1     agent-1  Strong narrative, clear CTA       1480
2     agent-3  Good data points, weak intro      1520
3     agent-2  Generic tone, no differentiation  1350
Winner: agent-1 (strongest narrative arc and call-to-action)
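As a rough sketch of how the per-agent inputs for the judge might be gathered, the snippet below builds the git diff command and the result-file path for each agent. The helper name judge_inputs, the branch names, and the 1-based agent numbering are assumptions for illustration; only the git diff three-dot form and the .agenthub/board/results/agent-{i}-result.md path pattern come from the text above.

```python
from pathlib import Path


def judge_inputs(base_branch: str, agent_branches: list[str],
                 board_dir: str = ".agenthub/board/results") -> list[dict]:
    """Build the diff command and result-file path for each agent.

    Agents are numbered from 1 in the order their branches are given
    (an assumption; the real numbering scheme may differ).
    """
    inputs = []
    for i, branch in enumerate(agent_branches, start=1):
        inputs.append({
            "agent": f"agent-{i}",
            # Three-dot range: changes on the agent branch since it
            # diverged from the base branch.
            "diff_cmd": ["git", "diff", f"{base_branch}...{branch}"],
            "result_file": Path(board_dir) / f"agent-{i}-result.md",
        })
    return inputs


# Hypothetical branch names, used only for this example.
inputs = judge_inputs("main", ["hub/agent-1", "hub/agent-2", "hub/agent-3"])
for item in inputs:
    print(item["agent"], " ".join(item["diff_cmd"]), item["result_file"].as_posix())
```

Each diff_cmd list can be passed to subprocess.run(cmd, capture_output=True, text=True) inside the repository to capture the diff text, which would then be given to the judge together with the contents of the result file.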
python {skill_path}/scripts/session_manager.py --update {session-id} --state evaluating
/hub:merge to merge the winner
/hub:merge {session-id} --agent {winner} to be explicit