Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

greyhaven-ai/evaluation

Name: evaluation
Author: greyhaven-ai

grey-haven-plugins/core/skills/evaluation/SKILL.md

npx skillsauth add greyhaven-ai/claude-code-config evaluation

Evaluation Skill

Evaluate LLM outputs systematically with rubrics, handle non-determinism, and implement LLM-as-judge patterns.

Core Insight: The 95% Variance Finding

Research shows 95% of output variance comes from just two sources:

- 80% from prompt tokens (wording, structure, examples)
- 15% from random seed/sampling

Temperature, model version, and other factors account for only 5%.

Implication: Focus evaluation on prompt quality, not model tweaking.

What's Included

Examples (`examples/`)

- Prompt comparison - A/B testing prompts with rubrics
- Model evaluation - Comparing outputs across models
- Regression testing - Detecting output degradation

Reference Guides (`reference/`)

- Rubric design - Multi-dimensional evaluation criteria
- LLM-as-judge - Using LLMs to evaluate LLM outputs
- Statistical methods - Handling non-determinism

Templates (`templates/`)

- Rubric templates - Ready-to-use evaluation criteria
- Judge prompts - LLM-as-judge prompt templates
- Test case format - Structured test case templates

Checklists (`checklists/`)

- Evaluation setup - Before running evaluations
- Rubric validation - Ensuring rubric quality

Key Concepts

1. Multi-Dimensional Rubrics

Don't rely on a single overall score. Break the evaluation down into weighted dimensions:

| Dimension | Weight | Criteria |
|-----------|--------|----------|
| Accuracy | 30% | Factually correct, no hallucinations |
| Completeness | 25% | Addresses all requirements |
| Clarity | 20% | Well-organized, easy to understand |
| Conciseness | 15% | No unnecessary content |
| Format | 10% | Follows specified structure |
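One way to turn the weighted dimensions into a single number is a weighted sum of per-dimension scores. The sketch below assumes each dimension is scored on a 1-5 scale (as in the workflow rubric later in this document); the `weighted_score` helper and the example scores are hypothetical, not part of the skill.

```python
# Weights copied from the rubric table; they sum to 1.0.
WEIGHTS = {
    "accuracy": 0.30,
    "completeness": 0.25,
    "clarity": 0.20,
    "conciseness": 0.15,
    "format": 0.10,
}


def weighted_score(scores: dict[str, float]) -> float:
    """Combine per-dimension scores (assumed 1-5) into one weighted total."""
    missing = set(WEIGHTS) - set(scores)
    if missing:
        raise ValueError(f"missing dimensions: {sorted(missing)}")
    return sum(WEIGHTS[name] * scores[name] for name in WEIGHTS)


# Hypothetical scores for a single output.
example = {"accuracy": 5, "completeness": 4, "clarity": 4, "conciseness": 3, "format": 5}
print(round(weighted_score(example), 2))
```

Keeping the per-dimension scores alongside the total preserves the point of the rubric: the total is convenient for comparisons, but the individual dimensions show where an output falls short.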

2. Handling Non-Determinism

LLMs are non-deterministic. Handle with:

Strategy 1: Multiple Runs
- Run the same prompt 3-5 times
- Report mean and variance
- Flag high-variance cases

Strategy 2: Seed Control
- Set temperature=0 for reproducibility
- Document seed for debugging
- Accept that some variation is normal

Strategy 3: Statistical Significance
- Use paired comparisons
- Require a 70%+ win rate before calling a variant "better"
- Report confidence intervals
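Strategies 1 and 3 can be sketched with the standard library alone. The helper names (`summarize_runs`, `paired_counts`, `win_rate_with_interval`), the variance threshold of 0.5, and the choice of a Wilson score interval are assumptions made for illustration; the skill itself does not prescribe them.

```python
import math
from statistics import mean, variance


def summarize_runs(scores: list[float], variance_threshold: float = 0.5) -> dict:
    """Mean, sample variance, and a high-variance flag for repeated runs."""
    if len(scores) < 2:
        raise ValueError("need at least two runs to estimate variance")
    var = variance(scores)
    return {"mean": mean(scores), "variance": var, "high_variance": var > variance_threshold}


def paired_counts(a_scores: list[float], b_scores: list[float]) -> tuple[int, int]:
    """Count test cases where B beats A and where A beats B (ties are dropped)."""
    wins = sum(b > a for a, b in zip(a_scores, b_scores))
    losses = sum(b < a for a, b in zip(a_scores, b_scores))
    return wins, losses


def win_rate_with_interval(wins: int, losses: int, z: float = 1.96) -> tuple[float, float, float]:
    """Win rate of B over A with a Wilson score interval (z=1.96 is roughly 95%)."""
    n = wins + losses
    if n == 0:
        raise ValueError("no decisive comparisons")
    p = wins / n
    denom = 1 + z**2 / n
    center = (p + z**2 / (2 * n)) / denom
    half = z * math.sqrt(p * (1 - p) / n + z**2 / (4 * n**2)) / denom
    return p, center - half, center + half


rate, low, high = win_rate_with_interval(wins=14, losses=6)
print(f"win rate {rate:.0%}, interval {low:.2f}-{high:.2f}")
```

With 14 wins out of 20 decisive comparisons the win rate reaches 70%, yet the lower bound of the interval is still below 0.5. This is why the interval is worth reporting next to the win rate: a small sample can clear the threshold without being conclusive.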

3. LLM-as-Judge Pattern

Use a judge LLM to evaluate outputs:

┌─────────────┐     ┌─────────────┐     ┌─────────────┐
│   Prompt    │────▶│  Test LLM   │────▶│   Output    │
└─────────────┘     └─────────────┘     └─────────────┘
                                               │
                                               ▼
                    ┌─────────────┐     ┌─────────────┐
                    │   Rubric    │────▶│ Judge LLM   │
                    └─────────────┘     └─────────────┘
                                               │
                                               ▼
                                        ┌─────────────┐
                                        │   Score     │
                                        └─────────────┘

Best Practice: Use a stronger model as the judge (for example, Opus judging Sonnet).
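The judge step in the diagram can be sketched as a function that fills a prompt template, calls the judge model, and parses a score from the reply. Everything here is illustrative: the `JUDGE_TEMPLATE` wording, the `SCORE:` reply format, and the `call_judge` callable are assumptions, and `fake_judge` stands in for a real model client so the example runs offline.

```python
import re
from typing import Callable

# Hypothetical judge prompt; the skill ships its own templates under templates/.
JUDGE_TEMPLATE = """You are grading a model output against a rubric.

Rubric:
{rubric}

Output to grade:
{output}

Reply with a line of the form "SCORE: <1-5>" followed by a short justification."""


def judge(output: str, rubric: str, call_judge: Callable[[str], str]) -> int:
    """Ask the judge model for a 1-5 score and parse it from the reply."""
    reply = call_judge(JUDGE_TEMPLATE.format(rubric=rubric, output=output))
    match = re.search(r"SCORE:\s*([1-5])\b", reply)
    if match is None:
        raise ValueError(f"judge reply had no parsable score: {reply!r}")
    return int(match.group(1))


def fake_judge(prompt: str) -> str:
    """Stand-in for a real judge model call; always returns the same reply."""
    return "SCORE: 4\nMostly correct, one minor omission."


print(judge("def reverse(s): return s[::-1]", "Accuracy, 1-5", fake_judge))
```

Passing the model call in as a callable keeps the parsing logic testable without network access and makes it easy to swap the judge model.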

4. Test Case Design

Structure test cases with:

interface TestCase {
  id: string
  input: string              // User message or context
  expectedBehavior: string   // What output should do
  rubric: RubricItem[]       // Evaluation criteria
  groundTruth?: string       // Optional gold standard
  metadata: {
    category: string
    difficulty: 'easy' | 'medium' | 'hard'
    createdAt: string
  }
}

Evaluation Workflow

Step 1: Define Rubric

rubric:
  dimensions:
    - name: accuracy
      weight: 0.3
      criteria:
        5: "Completely accurate, no errors"
        4: "Minor errors that don't affect correctness"
        3: "Some errors, partially correct"
        2: "Significant errors, mostly incorrect"
        1: "Completely incorrect or hallucinated"
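Before running anything, it can help to check the rubric mechanically. The sketch below assumes the rubric has already been loaded into a dictionary with the same shape as the YAML above, that weights are meant to sum to 1.0 (as the percentages in the rubric table do), and that every dimension defines levels 1 through 5. The `validate_rubric` helper is hypothetical.

```python
import math


def validate_rubric(rubric: dict) -> list[str]:
    """Return a list of human-readable problems; an empty list means the rubric passed."""
    problems = []
    dimensions = rubric.get("dimensions", [])
    total = sum(d.get("weight", 0) for d in dimensions)
    if not math.isclose(total, 1.0, abs_tol=1e-9):
        problems.append(f"weights sum to {total:.2f}, expected 1.00")
    for d in dimensions:
        missing = {1, 2, 3, 4, 5} - set(d.get("criteria", {}))
        if missing:
            problems.append(f"{d.get('name')}: missing levels {sorted(missing)}")
    return problems


# The excerpt above defines only the accuracy dimension, so the weight check fails.
partial = {
    "dimensions": [
        {
            "name": "accuracy",
            "weight": 0.3,
            "criteria": {5: "...", 4: "...", 3: "...", 2: "...", 1: "..."},
        }
    ]
}
print(validate_rubric(partial))
```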

Step 2: Create Test Cases

test_cases:
  - id: "code-gen-001"
    input: "Write a function to reverse a string"
    expected_behavior: "Returns working reverse function"
    ground_truth: |
      function reverse(s: string): string {
        return s.split('').reverse().join('')
      }

Step 3: Run Evaluation

# Run test suite
python evaluate.py --suite code-generation --runs 3

# Output
# ┌─────────────────────────────────────────────┐
# │ Test Suite: code-generation                 │
# │ Total: 50 | Pass: 47 | Fail: 3              │
# │ Accuracy: 94% (±2.1%)                       │
# │ Avg Score: 4.2/5.0                          │
# └─────────────────────────────────────────────┘
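The source does not say how `evaluate.py` derives the ± figure. One plausible reading, sketched below, is the mean pass rate across runs with its sample standard deviation. The `summarize_suite` helper and the per-run pass counts are hypothetical.

```python
from statistics import mean, stdev


def summarize_suite(run_pass_counts: list[int], total: int) -> tuple[float, float]:
    """Mean pass rate across runs and its sample standard deviation, both in percent."""
    rates = [100 * passed / total for passed in run_pass_counts]
    spread = stdev(rates) if len(rates) > 1 else 0.0
    return mean(rates), spread


# Hypothetical pass counts for three runs of a 50-case suite.
avg, spread = summarize_suite([47, 46, 48], total=50)
print(f"Accuracy: {avg:.0f}% (±{spread:.1f}%)")
```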

Step 4: Analyze Results

Look for:

- Low-scoring dimensions - Target for improvement
- High-variance cases - Prompt needs clarification
- Regression from baseline - Investigate changes

Grey Haven Integration

With TDD Workflow

1. Write test cases (expected behavior)
2. Run baseline evaluation
3. Modify prompt/implementation
4. Run evaluation again
5. Compare: new scores ≥ baseline?
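The comparison in step 5 can be automated as a small gate over per-dimension scores. This is a sketch under assumptions: `find_regressions`, the `tolerance` parameter, and the example scores are hypothetical, and both runs are assumed to report the same dimensions.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.0,
) -> dict[str, float]:
    """Return the score change for every dimension that dropped by more than `tolerance`."""
    regressions = {}
    for name, base in baseline.items():
        delta = current[name] - base
        if delta < -tolerance:
            regressions[name] = delta
    return regressions


# Hypothetical scores before and after a prompt change.
baseline = {"accuracy": 4.2, "clarity": 3.9}
current = {"accuracy": 4.4, "clarity": 3.5}
regressed = find_regressions(baseline, current)
print(sorted(regressed))
```

A nonzero `tolerance` avoids failing the gate on run-to-run noise; the variance measured across repeated runs is a reasonable guide for choosing it.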

With Pipeline Architecture

acquire → prepare → process → parse → render → EVALUATE
                                                  │
                                          ┌───────┴───────┐
                                          │ Compare to    │
                                          │ ground truth  │
                                          │ or rubric     │
                                          └───────────────┘

With Prompt Engineering

Current prompt → Evaluate → Score: 3.2
Apply principles → Improve prompt
New prompt → Evaluate → Score: 4.1 ✓

Use This Skill When

- Testing new prompts before production
- Comparing prompt variations (A/B testing)
- Validating model outputs meet quality bar
- Detecting regressions after changes
- Building evaluation datasets
- Implementing automated quality gates

Related Skills

- prompt-engineering - Improve prompts based on evaluation
- testing-strategy - Overall testing approaches
- llm-project-development - Pipeline with evaluation stage

Quick Start

# Design your rubric
cat templates/rubric-template.yaml

# Create test cases
cat templates/test-case-template.yaml

# Learn LLM-as-judge
cat reference/llm-as-judge-guide.md

# Run evaluation checklist
cat checklists/evaluation-setup-checklist.md

Skill Version: 1.0
Key Finding: 95% of variance from prompts (80%) + sampling (15%)
Last Updated: 2025-01-15

Evaluate LLM outputs with multi-dimensional rubrics, handle non-determinism, and implement LLM-as-judge patterns. Essential for production LLM systems. Use when testing prompts, validating outputs, comparing models, or when user mentions 'evaluation', 'testing LLM', 'rubric', 'LLM-as-judge', 'output quality', 'prompt testing', or 'model comparison'.

23 stars

testing

Updated Apr 18, 2026

$ install --global

skillsauth

npx skillsauth add greyhaven-ai/claude-code-config evaluation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

- Trivy (container and dependency vulnerability scanner): Clean, 95%
- Semgrep (static code analysis for vulnerabilities): Clean, 95%
- mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
- Snyk (dep) (open source security scanning): Skipped, 50%
- Socket.dev (supply chain security analysis): Skipped, 50%
- VirusTotal (multi-engine malware detection): Skipped, 50%
- CrowdStrike (advanced threat intelligence): Skipped, 50%
- OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
- OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 18, 2026, 7:32 AM (25.6s, 6 files scanned)

SKILL.md

name: evaluation
description: Evaluate LLM outputs with multi-dimensional rubrics, handle non-determinism, and implement LLM-as-judge patterns. Essential for production LLM systems. Use when testing prompts, validating outputs, comparing models, or when user mentions 'evaluation', 'testing LLM', 'rubric', 'LLM-as-judge', 'output quality', 'prompt testing', or 'model comparison'.
# v2.0.43: Skills to auto-load for evaluation work
# v2.0.74: Tools for evaluation work

Related Skills

greyhaven-ai/testing-strategy

development

Verified · Trusted · Community

Grey Haven's comprehensive testing strategy - Vitest unit/integration/e2e for TypeScript, pytest markers for Python, >80% coverage requirement, fixture patterns, and Doppler for test environments. Use when writing tests, setting up test infrastructure, running tests, debugging test failures, improving coverage, configuring CI/CD, or when user mentions 'test', 'testing', 'pytest', 'vitest', 'coverage', 'TDD', 'test-driven development', 'unit test', 'integration test', 'e2e', 'end-to-end', 'test fixtures', 'mocking', 'test setup', 'CI testing'.

23 · SKILL.md · Updated Apr 5, 2026

greyhaven-ai/test-generation

development

Verified · Trusted · Community

Comprehensive test suite generation with unit tests, integration tests, edge cases, and error handling. Use when generating tests for existing code, improving coverage, or creating systematic test suites. Triggers: 'generate tests', 'add tests', 'test coverage', 'write tests for', 'create test suite'.

23 · SKILL.md · Updated Apr 5, 2026

greyhaven-ai/react-tanstack-testing

development

Verified · Trusted · Community

Specialized testing for React applications using TanStack ecosystem (Query, Router, Table, Form) with Vite and Vitest. Use when testing React + TanStack apps, mocking server state, testing router, or validating query behavior. Triggers: 'TanStack testing', 'React Query testing', 'test TanStack', 'mock query', 'router test'.

23 · SKILL.md · Updated Apr 5, 2026

greyhaven-ai/tanstack-patterns

development

Verified · Trusted · Community

Apply Grey Haven's TanStack ecosystem patterns - Router file-based routing, Query data fetching with staleTime, and Start server functions. Use when building React applications with TanStack Start. Triggers: 'TanStack', 'TanStack Start', 'TanStack Query', 'TanStack Router', 'React Query', 'file-based routing', 'server functions'.

23 · SKILL.md · Updated Apr 5, 2026

Install manually

# Clone the repo
git clone https://github.com/greyhaven-ai/claude-code-config.git

# Copy into Claude Code skills folder (global)
cp -r claude-code-config/grey-haven-plugins/core/skills/evaluation ~/.claude/skills/

See the official Claude Code Skills documentation for the skills path.

Repository

greyhaven-ai/claude-code-config

23 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
