Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

tavva/eval-design

Name: eval-design
Author: tavva

plugins/eval-designer/skills/eval-design/SKILL.md

npx skillsauth add tavva/ben-claude-plugins eval-design

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Eval Design

Guide users through designing production-quality LLM evaluations. Output: structured spec a coding agent can implement with Langfuse.

Announce at start: "I'm using the eval-design skill to help design your evaluation."

When to Use

digraph when_to_use {
  "User question" [shape=box];
  "About measuring LLM output quality?" [shape=diamond];
  "Use this skill" [shape=box, style=filled, fillcolor=lightgreen];
  "Not this skill" [shape=box];

  "User question" -> "About measuring LLM output quality?";
  "About measuring LLM output quality?" -> "Use this skill" [label="yes"];
  "About measuring LLM output quality?" -> "Not this skill" [label="no"];
}

Use for:

Designing new evaluations
Choosing between eval approaches
Creating scoring rubrics or judge prompts
Building golden datasets
Questions like "how do I know if my LLM is working?"

Not for:

Langfuse setup, tracing, or observability (use the official langfuse skill)
Implementing evals (output spec to coding agent)
Generic testing that isn't LLM-specific

The Process

digraph eval_design_flow {
  rankdir=TB;

  understand [label="Understand\nthe System" shape=box];
  failures [label="Identify\nFailure Modes" shape=box];
  match [label="Match Eval Type\nto Problem" shape=box];
  design [label="Design\nthe Eval" shape=box];
  output [label="Output\nSpec" shape=box];

  understand -> failures -> match -> design -> output;
}

Ask questions ONE AT A TIME. Adapt depth based on user's experience level.

Phase 1: Understand the System

Ask about:

What does the LLM application do? (summarisation, Q&A, agent, chat)
What's already instrumented in Langfuse? (traces, spans, generations)
What does a "good" output look like for users?

Phase 2: Identify Failure Modes

Key questions:

What failures have you seen in production?
Have you done error analysis on real traces?
Which failures have the highest business impact?

Critical: Ground eval design in ACTUAL failures, not hypothetical concerns. If user hasn't analysed real failures, recommend starting there.

Phase 3: Match Eval Type to Problem

| Problem Type | Recommended Eval | |--------------|------------------| | Format validation (JSON, dates) | Code-based | | Factual accuracy with known answers | Code-based + match | | Tone, helpfulness, coherence | LLM-as-judge | | Safety, toxicity | LLM-as-judge | | RAG retrieval quality | LLM-as-judge per chunk | | High-stakes decisions | Human → LLM-judge | | Novel/unclear failure modes | Human annotation first |

Consult references/eval-types.md for detailed guidance on each type.

Phase 4: Design the Eval

For each eval, define:

Success criteria - prefer binary PASS/FAIL over Likert scales
Rubric or logic - specific, unambiguous criteria
Dataset requirements - happy path + edge cases + adversarial
Integration - when/where to run

Use references/judge-prompts.md for LLM-as-judge prompt templates.

Output Format

Output a structured spec using the template in references/document-template.md.

The spec should be complete enough that a coding agent can implement it using the Langfuse SDK without further clarification.

Principles for Great Evals

Start from Error Analysis

Examine real failures, not hypothetical concerns. Generic metrics ("hallucination score") without grounding in actual user complaints are noise.

Questions to surface:

"What would a human expert check when reviewing this output?"
"If this eval passes but users are unhappy, what did we miss?"

Match Tool to Task

Don't over-engineer:

Code-based evals are faster, cheaper, more deterministic
Only use LLM-as-judge when judgment is genuinely subjective
Human annotation for calibration, not production eval

Binary PASS/FAIL Over Likert

Rating 1-5 creates noise and inconsistency. Binary forces clarity:

What exactly makes something PASS?
What exactly makes something FAIL?

Cover the Right Cases

Dataset should include:

Happy path: Common, expected inputs
Edge cases: Boundaries, unusual but valid inputs
Adversarial: Inputs designed to break the system

Calibrate LLM Judges

Before deploying LLM-as-judge:

Create human-labelled examples (20-30)
Run judge on same examples
Measure agreement (target >85%)
Adjust prompt until calibrated

Red Flags to Watch For

Push back when you see:

Jumping to LLM-as-judge when code-based would suffice
Likert scales instead of binary
Generic metrics without connection to actual failures
Missing adversarial cases
No plan for calibration

Anti-Patterns

"Vibes-based" evaluation: Testing a few examples and shipping if it "looks good" → Systematic dataset with coverage of failure modes

Evaluating the model, not the product: Generic benchmarks that don't reflect real use → Evals grounded in actual user complaints and failures

Over-engineering: Complex eval pipelines for simple checks → Simplest eval that catches the failure mode

Handoff to Implementation

After outputting the spec:

"This eval design is ready for implementation. A coding agent can use the official langfuse skill to implement this using the Langfuse SDK."

Do NOT attempt to implement the eval. The skill outputs design specs only.

Quick Reference

| User Says | Do This | |-----------|---------| | "How do I know if my LLM is working?" | Start Phase 1 | | "I need to evaluate [specific thing]" | Jump to Phase 2, focus on that failure mode | | "Should I use LLM-as-judge?" | Ask about failure mode → recommend based on type | | "Help me write a judge prompt" | Use references/judge-prompts.md templates | | "What should my dataset look like?" | Cover happy path + edge + adversarial |

Additional Resources

Reference Files

references/document-template.md - Output spec template with guidance
references/eval-types.md - Detailed eval type selection and patterns
references/judge-prompts.md - Ready-to-use LLM-as-judge prompts

tavva/eval-design

plugins/eval-designer/skills/eval-design/SKILL.md

This skill should be used when the user asks "what evals can we create", "how do I evaluate this", "design an eval", "create evals for", "how do I know if my LLM is working", "measure quality", or mentions evals, evaluation, scoring rubrics, golden datasets, LLM-as-judge, quality metrics, or judge prompts.

development

Updated May 15, 2026

$ install --global

skillsauth

npx skillsauth add tavva/ben-claude-plugins eval-design

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 15, 2026, 4:11 AM36.1s4 files scanned

SKILL.md

name:: eval-design
description:: >

Eval Design

Guide users through designing production-quality LLM evaluations. Output: structured spec a coding agent can implement with Langfuse.

Announce at start: "I'm using the eval-design skill to help design your evaluation."

When to Use

digraph when_to_use {
  "User question" [shape=box];
  "About measuring LLM output quality?" [shape=diamond];
  "Use this skill" [shape=box, style=filled, fillcolor=lightgreen];
  "Not this skill" [shape=box];

  "User question" -> "About measuring LLM output quality?";
  "About measuring LLM output quality?" -> "Use this skill" [label="yes"];
  "About measuring LLM output quality?" -> "Not this skill" [label="no"];
}

Use for:

Designing new evaluations
Choosing between eval approaches
Creating scoring rubrics or judge prompts
Building golden datasets
Questions like "how do I know if my LLM is working?"

Not for:

Langfuse setup, tracing, or observability (use the official langfuse skill)
Implementing evals (output spec to coding agent)
Generic testing that isn't LLM-specific

The Process

digraph eval_design_flow {
  rankdir=TB;

  understand [label="Understand\nthe System" shape=box];
  failures [label="Identify\nFailure Modes" shape=box];
  match [label="Match Eval Type\nto Problem" shape=box];
  design [label="Design\nthe Eval" shape=box];
  output [label="Output\nSpec" shape=box];

  understand -> failures -> match -> design -> output;
}

Ask questions ONE AT A TIME. Adapt depth based on user's experience level.

Phase 1: Understand the System

Ask about:

What does the LLM application do? (summarisation, Q&A, agent, chat)
What's already instrumented in Langfuse? (traces, spans, generations)
What does a "good" output look like for users?

Phase 2: Identify Failure Modes

Key questions:

What failures have you seen in production?
Have you done error analysis on real traces?
Which failures have the highest business impact?

Critical: Ground eval design in ACTUAL failures, not hypothetical concerns. If user hasn't analysed real failures, recommend starting there.

Phase 3: Match Eval Type to Problem

Consult references/eval-types.md for detailed guidance on each type.

Phase 4: Design the Eval

For each eval, define:

Success criteria - prefer binary PASS/FAIL over Likert scales
Rubric or logic - specific, unambiguous criteria
Dataset requirements - happy path + edge cases + adversarial
Integration - when/where to run

Use references/judge-prompts.md for LLM-as-judge prompt templates.

Output Format

Output a structured spec using the template in references/document-template.md.

The spec should be complete enough that a coding agent can implement it using the Langfuse SDK without further clarification.

Principles for Great Evals

Start from Error Analysis

Examine real failures, not hypothetical concerns. Generic metrics ("hallucination score") without grounding in actual user complaints are noise.

Questions to surface:

"What would a human expert check when reviewing this output?"
"If this eval passes but users are unhappy, what did we miss?"

Match Tool to Task

Don't over-engineer:

Code-based evals are faster, cheaper, more deterministic
Only use LLM-as-judge when judgment is genuinely subjective
Human annotation for calibration, not production eval

Binary PASS/FAIL Over Likert

Rating 1-5 creates noise and inconsistency. Binary forces clarity:

What exactly makes something PASS?
What exactly makes something FAIL?

Cover the Right Cases

Dataset should include:

Happy path: Common, expected inputs
Edge cases: Boundaries, unusual but valid inputs
Adversarial: Inputs designed to break the system

Calibrate LLM Judges

Before deploying LLM-as-judge:

Create human-labelled examples (20-30)
Run judge on same examples
Measure agreement (target >85%)
Adjust prompt until calibrated

Red Flags to Watch For

Push back when you see:

Jumping to LLM-as-judge when code-based would suffice
Likert scales instead of binary
Generic metrics without connection to actual failures
Missing adversarial cases
No plan for calibration

Anti-Patterns

"Vibes-based" evaluation: Testing a few examples and shipping if it "looks good" → Systematic dataset with coverage of failure modes

Evaluating the model, not the product: Generic benchmarks that don't reflect real use → Evals grounded in actual user complaints and failures

Over-engineering: Complex eval pipelines for simple checks → Simplest eval that catches the failure mode

Handoff to Implementation

After outputting the spec:

"This eval design is ready for implementation. A coding agent can use the official langfuse skill to implement this using the Langfuse SDK."

Do NOT attempt to implement the eval. The skill outputs design specs only.

Quick Reference

Additional Resources

Reference Files

references/document-template.md - Output spec template with guidance
references/eval-types.md - Detailed eval type selection and patterns
references/judge-prompts.md - Ready-to-use LLM-as-judge prompts

Related Skills

tavva/sprite-cli

tools

VerifiedTrustedCommunity

This skill should be used when the user asks to "create a sprite", "run in sprite", "execute in sprite", "sprite exec", "open sprite console", "list sprites", "destroy sprite", "create checkpoint", "restore checkpoint", "proxy through sprite", or mentions Sprite, isolated environments, or persistent microVMs. Also triggers on requests to manage sprite authentication, checkpoints, or port forwarding.

SKILL.mdUpdated Apr 15, 2026

tavva/browser-automation

tools

VerifiedTrustedCommunity

This skill should be used when working on frontend code, debugging UI issues, verifying visual changes, scraping web pages, testing web features, or inspecting page state. Also triggers on "open browser", "take screenshot", "navigate to URL", "scrape website", "extract page content", "check accessibility", or any web automation task. Use proactively during frontend development to verify changes visually.

SKILL.mdUpdated Apr 15, 2026

tavva/browser-automation

tavva/resend-cli

tools

VerifiedTrustedCommunity

This skill should be used when the user asks to "send an email", "send email via Resend", "list emails", "check email status", "cancel scheduled email", "manage domains", "add domain", "verify domain DNS", "create API key", "list API keys", "manage templates", "create email template", or mentions Resend, transactional email, or email delivery. Also triggers on requests to configure Resend, check domain verification, or manage email infrastructure.

SKILL.mdUpdated Apr 15, 2026

tavva/readme

documentation

VerifiedTrustedCommunity

This skill should be used when the user asks to "create a README", "write a README", "generate a README", "improve my README", "make my README better", "README best practices", or mentions needing project documentation. Provides guidance for creating excellent READMEs following patterns from awesome-readme.

SKILL.mdUpdated Apr 15, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/tavva/ben-claude-plugins.git

# Copy into Claude Code skills folder (global)
cp -r ben-claude-plugins/plugins/eval-designer/skills/eval-design ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

tavva/ben-claude-plugins

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT