grey-haven-plugins/core/skills/evaluation/SKILL.md
Evaluate LLM outputs with multi-dimensional rubrics, handle non-determinism, and implement LLM-as-judge patterns. Essential for production LLM systems. Use when testing prompts, validating outputs, comparing models, or when user mentions 'evaluation', 'testing LLM', 'rubric', 'LLM-as-judge', 'output quality', 'prompt testing', or 'model comparison'.
npx skillsauth add greyhaven-ai/claude-code-config evaluationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.
Research shows 95% of output variance comes from just two sources:
Temperature, model version, and other factors account for only 5%.
Implication: Focus evaluation on prompt quality, not model tweaking.
examples/)reference/)templates/)checklists/)Don't use single scores. Break down evaluation into dimensions:
| Dimension | Weight | Criteria | |-----------|--------|----------| | Accuracy | 30% | Factually correct, no hallucinations | | Completeness | 25% | Addresses all requirements | | Clarity | 20% | Well-organized, easy to understand | | Conciseness | 15% | No unnecessary content | | Format | 10% | Follows specified structure |
LLMs are non-deterministic. Handle with:
Strategy 1: Multiple Runs
- Run same prompt 3-5 times
- Report mean and variance
- Flag high-variance cases
Strategy 2: Seed Control
- Set temperature=0 for reproducibility
- Document seed for debugging
- Accept some variation is normal
Strategy 3: Statistical Significance
- Use paired comparisons
- Require 70%+ win rate for "better"
- Report confidence intervals
Use a judge LLM to evaluate outputs:
┌─────────────┐ ┌─────────────┐ ┌─────────────┐
│ Prompt │────▶│ Test LLM │────▶│ Output │
└─────────────┘ └─────────────┘ └─────────────┘
│
▼
┌─────────────┐ ┌─────────────┐
│ Rubric │────▶│ Judge LLM │
└─────────────┘ └─────────────┘
│
▼
┌─────────────┐
│ Score │
└─────────────┘
Best Practice: Use stronger model as judge (Opus judges Sonnet).
Structure test cases with:
interface TestCase {
id: string
input: string // User message or context
expectedBehavior: string // What output should do
rubric: RubricItem[] // Evaluation criteria
groundTruth?: string // Optional gold standard
metadata: {
category: string
difficulty: 'easy' | 'medium' | 'hard'
createdAt: string
}
}
rubric:
dimensions:
- name: accuracy
weight: 0.3
criteria:
5: "Completely accurate, no errors"
4: "Minor errors, doesn't affect correctness"
3: "Some errors, partially correct"
2: "Significant errors, mostly incorrect"
1: "Completely incorrect or hallucinated"
test_cases:
- id: "code-gen-001"
input: "Write a function to reverse a string"
expected_behavior: "Returns working reverse function"
ground_truth: |
function reverse(s: string): string {
return s.split('').reverse().join('')
}
# Run test suite
python evaluate.py --suite code-generation --runs 3
# Output
# ┌─────────────────────────────────────────────┐
# │ Test Suite: code-generation │
# │ Total: 50 | Pass: 47 | Fail: 3 │
# │ Accuracy: 94% (±2.1%) │
# │ Avg Score: 4.2/5.0 │
# └─────────────────────────────────────────────┘
Look for:
1. Write test cases (expected behavior)
2. Run baseline evaluation
3. Modify prompt/implementation
4. Run evaluation again
5. Compare: new scores ≥ baseline?
acquire → prepare → process → parse → render → EVALUATE
│
┌───────┴───────┐
│ Compare to │
│ ground truth │
│ or rubric │
└───────────────┘
Current prompt → Evaluate → Score: 3.2
Apply principles → Improve prompt
New prompt → Evaluate → Score: 4.1 ✓
prompt-engineering - Improve prompts based on evaluationtesting-strategy - Overall testing approachesllm-project-development - Pipeline with evaluation stage# Design your rubric
cat templates/rubric-template.yaml
# Create test cases
cat templates/test-case-template.yaml
# Learn LLM-as-judge
cat reference/llm-as-judge-guide.md
# Run evaluation checklist
cat checklists/evaluation-setup-checklist.md
Skill Version: 1.0 Key Finding: 95% variance from prompts (80%) + sampling (15%) Last Updated: 2025-01-15
development
Grey Haven's comprehensive testing strategy - Vitest unit/integration/e2e for TypeScript, pytest markers for Python, >80% coverage requirement, fixture patterns, and Doppler for test environments. Use when writing tests, setting up test infrastructure, running tests, debugging test failures, improving coverage, configuring CI/CD, or when user mentions 'test', 'testing', 'pytest', 'vitest', 'coverage', 'TDD', 'test-driven development', 'unit test', 'integration test', 'e2e', 'end-to-end', 'test fixtures', 'mocking', 'test setup', 'CI testing'.
development
Comprehensive test suite generation with unit tests, integration tests, edge cases, and error handling. Use when generating tests for existing code, improving coverage, or creating systematic test suites. Triggers: 'generate tests', 'add tests', 'test coverage', 'write tests for', 'create test suite'.
development
Specialized testing for React applications using TanStack ecosystem (Query, Router, Table, Form) with Vite and Vitest. Use when testing React + TanStack apps, mocking server state, testing router, or validating query behavior. Triggers: 'TanStack testing', 'React Query testing', 'test TanStack', 'mock query', 'router test'.
development
Apply Grey Haven's TanStack ecosystem patterns - Router file-based routing, Query data fetching with staleTime, and Start server functions. Use when building React applications with TanStack Start. Triggers: 'TanStack', 'TanStack Start', 'TanStack Query', 'TanStack Router', 'React Query', 'file-based routing', 'server functions'.