plugins/eval-designer/skills/eval-design/SKILL.md
This skill should be used when the user asks "what evals can we create", "how do I evaluate this", "design an eval", "create evals for", "how do I know if my LLM is working", "measure quality", or mentions evals, evaluation, scoring rubrics, golden datasets, LLM-as-judge, quality metrics, or judge prompts.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add tavva/ben-claude-plugins eval-design
Guide users through designing production-quality LLM evaluations. Output: structured spec a coding agent can implement with Langfuse.
Announce at start: "I'm using the eval-design skill to help design your evaluation."
digraph when_to_use {
"User question" [shape=box];
"About measuring LLM output quality?" [shape=diamond];
"Use this skill" [shape=box, style=filled, fillcolor=lightgreen];
"Not this skill" [shape=box];
"User question" -> "About measuring LLM output quality?";
"About measuring LLM output quality?" -> "Use this skill" [label="yes"];
"About measuring LLM output quality?" -> "Not this skill" [label="no"];
}
Use for:
Not for:
- Implementing evals (use the langfuse skill)
digraph eval_design_flow {
rankdir=TB;
understand [label="Understand\nthe System" shape=box];
failures [label="Identify\nFailure Modes" shape=box];
match [label="Match Eval Type\nto Problem" shape=box];
design [label="Design\nthe Eval" shape=box];
output [label="Output\nSpec" shape=box];
understand -> failures -> match -> design -> output;
}
Ask questions ONE AT A TIME. Adapt depth based on user's experience level.
Ask about:
Key questions:
Critical: Ground eval design in ACTUAL failures, not hypothetical concerns. If user hasn't analysed real failures, recommend starting there.
| Problem Type | Recommended Eval |
|--------------|------------------|
| Format validation (JSON, dates) | Code-based |
| Factual accuracy with known answers | Code-based + match |
| Tone, helpfulness, coherence | LLM-as-judge |
| Safety, toxicity | LLM-as-judge |
| RAG retrieval quality | LLM-as-judge per chunk |
| High-stakes decisions | Human → LLM-judge |
| Novel/unclear failure modes | Human annotation first |
Consult references/eval-types.md for detailed guidance on each type.
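As an illustration of the "Code-based" row for format validation, here is a minimal sketch in Python. The required keys (`answer`, `date`) and the function name are assumptions made for the example, not part of the skill or of any SDK; a real check would use whatever schema the system under test is expected to emit.

```python
import json
from datetime import date

# Hypothetical schema for illustration: the output must be a JSON object
# with an "answer" string and an ISO-8601 "date" (YYYY-MM-DD).
REQUIRED_KEYS = ("answer", "date")


def check_json_format(output: str) -> bool:
    """Return True only if the output is a JSON object with the required keys
    and a parseable ISO date. The result is a binary pass/fail."""
    try:
        data = json.loads(output)
    except json.JSONDecodeError:
        return False
    if not isinstance(data, dict):
        return False
    if any(key not in data for key in REQUIRED_KEYS):
        return False
    try:
        date.fromisoformat(data["date"])
    except (TypeError, ValueError):
        # TypeError: the value is not a string; ValueError: not an ISO date.
        return False
    return True
```

A check like this is deterministic and cheap to run on every output, which is why format problems rarely need an LLM judge.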
For each eval, define:
Use references/judge-prompts.md for LLM-as-judge prompt templates.
Output a structured spec using the template in references/document-template.md.
The spec should be complete enough that a coding agent can implement it using the Langfuse SDK without further clarification.
Examine real failures, not hypothetical concerns. Generic metrics ("hallucination score") without grounding in actual user complaints are noise.
Questions to surface:
Don't over-engineer:
Rating 1-5 creates noise and inconsistency. Binary forces clarity:
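A minimal sketch of binary scoring, assuming a hypothetical judge whose reply begins with `PASS` or `FAIL` followed by a short reason. The reply format and the helper names are illustrative assumptions, not something the skill prescribes.

```python
def parse_verdict(judge_reply: str) -> bool:
    """Map a judge reply to pass/fail. Anything that is not clearly
    PASS or FAIL raises instead of being guessed at."""
    words = judge_reply.strip().split()
    token = words[0].upper().rstrip(":.,") if words else ""
    if token == "PASS":
        return True
    if token == "FAIL":
        return False
    raise ValueError(f"Unparseable judge verdict: {judge_reply!r}")


def pass_rate(judge_replies: list[str]) -> float:
    """Fraction of replies that passed; requires at least one reply."""
    if not judge_replies:
        raise ValueError("No judge replies to score")
    verdicts = [parse_verdict(reply) for reply in judge_replies]
    return sum(verdicts) / len(verdicts)
```

Rejecting replies such as "4/5" outright, rather than mapping them to a threshold, keeps the metric a plain pass rate that is easy to compare across runs.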
Dataset should include:
Before deploying LLM-as-judge:
Push back when you see:
"Vibes-based" evaluation: Testing a few examples and shipping if it "looks good" → Systematic dataset with coverage of failure modes
Evaluating the model, not the product: Generic benchmarks that don't reflect real use → Evals grounded in actual user complaints and failures
Over-engineering: Complex eval pipelines for simple checks → Simplest eval that catches the failure mode
After outputting the spec:
"This eval design is ready for implementation. A coding agent can use the official langfuse skill to implement this using the Langfuse SDK."
Do NOT attempt to implement the eval. The skill outputs design specs only.
| User Says | Do This |
|-----------|---------|
| "How do I know if my LLM is working?" | Start Phase 1 |
| "I need to evaluate [specific thing]" | Jump to Phase 2, focus on that failure mode |
| "Should I use LLM-as-judge?" | Ask about failure mode → recommend based on type |
| "Help me write a judge prompt" | Use references/judge-prompts.md templates |
| "What should my dataset look like?" | Cover happy path + edge + adversarial |
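The dataset guidance in the last row (happy path, edge, adversarial) can be sketched as a small coverage check. The `EvalCase` structure, the category labels, and the function name are assumptions made for this example; the actual spec format comes from references/document-template.md.

```python
from collections import Counter
from dataclasses import dataclass

# Category labels assumed for illustration, mirroring the table row above.
REQUIRED_CATEGORIES = ("happy_path", "edge", "adversarial")


@dataclass(frozen=True)
class EvalCase:
    input: str      # what is sent to the system under test
    expected: str   # reference answer or expected behaviour
    category: str   # one of REQUIRED_CATEGORIES


def missing_categories(cases: list[EvalCase]) -> list[str]:
    """Return the required categories that have no cases in the dataset."""
    counts = Counter(case.category for case in cases)
    return [name for name in REQUIRED_CATEGORIES if counts[name] == 0]
```

For example, a dataset holding only happy-path and edge cases would report `adversarial` as missing, which is a prompt to collect real adversarial inputs rather than to invent hypothetical ones.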
- references/document-template.md - Output spec template with guidance
- references/eval-types.md - Detailed eval type selection and patterns
- references/judge-prompts.md - Ready-to-use LLM-as-judge prompts