promptfoo-evaluation/SKILL.md
Configures and runs LLM evaluation using the Promptfoo framework. Use when setting up prompt testing, creating evaluation configs (promptfooconfig.yaml), writing Python custom assertions, implementing llm-rubric for LLM-as-judge, or managing few-shot examples in prompts. Triggers on keywords like "promptfoo", "eval", "LLM evaluation", "prompt testing", or "model comparison".
npx skillsauth add fernandezbaptiste/claude-code-skills promptfoo-evaluation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill provides guidance for configuring and running LLM evaluations using Promptfoo, an open-source CLI tool for testing and comparing LLM outputs.
# Initialize a new evaluation project
npx promptfoo@latest init
# Run evaluation
npx promptfoo@latest eval
# View results in browser
npx promptfoo@latest view
A typical Promptfoo project structure:
project/
├── promptfooconfig.yaml    # Main configuration
├── prompts/
│   ├── system.md           # System prompt
│   └── chat.json           # Chat format prompt
├── tests/
│   └── cases.yaml          # Test cases
└── scripts/
    └── metrics.py          # Custom Python assertions
# yaml-language-server: $schema=https://promptfoo.dev/config-schema.json
description: "My LLM Evaluation"
# Prompts to test
prompts:
  - file://prompts/system.md
  - file://prompts/chat.json
# Models to compare
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    label: Claude-4.5-Sonnet
  - id: openai:gpt-4.1
    label: GPT-4.1
# Test cases
tests: file://tests/cases.yaml
# Default assertions for all tests
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:custom_assert
    - type: llm-rubric
      value: |
        Evaluate the response quality on a 0-1 scale.
      threshold: 0.7
# Output path
outputPath: results/eval-results.json
You are a helpful assistant.
Task: {{task}}
Context: {{context}}
[
{"role": "system", "content": "{{system_prompt}}"},
{"role": "user", "content": "{{user_input}}"}
]
Embed examples directly in the prompt, or use the chat format with assistant messages:
[
{"role": "system", "content": "{{system_prompt}}"},
{"role": "user", "content": "Example input: {{example_input}}"},
{"role": "assistant", "content": "{{example_output}}"},
{"role": "user", "content": "Now process: {{actual_input}}"}
]
- description: "Test case 1"
  vars:
    system_prompt: file://prompts/system.md
    user_input: "Hello world"
    # Load content from files
    context: file://data/context.txt
  assert:
    - type: contains
      value: "expected text"
    - type: python
      value: file://scripts/metrics.py:custom_check
      threshold: 0.8
Create a Python file for custom assertions (e.g., scripts/metrics.py):
def get_assert(output: str, context: dict) -> dict:
    """Default assertion function."""
    vars_dict = context.get('vars', {})
    # Access test variables
    expected = vars_dict.get('expected', '')
    found = expected in output
    # Return result; score and reason now reflect whether the check passed
    return {
        "pass": found,
        "score": 1.0 if found else 0.0,
        "reason": "Contains expected content" if found else "Expected content missing",
        "named_scores": {"relevance": 0.9}
    }


def custom_check(output: str, context: dict) -> dict:
    """Custom named assertion."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500
    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}"
    }
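Because assertion functions are plain Python, they can be exercised locally before Promptfoo ever calls them. The sketch below repeats `custom_check` so it runs standalone and feeds it a hand-built `context`. The `fake_context` dict and the sample outputs are illustrative assumptions; the context Promptfoo actually passes may carry more keys than `vars`.

```python
def custom_check(output: str, context: dict) -> dict:
    """Same word-count assertion as above, repeated so this sketch runs standalone."""
    word_count = len(output.split())
    passed = 100 <= word_count <= 500
    return {
        "pass": passed,
        "score": min(1.0, word_count / 300),
        "reason": f"Word count: {word_count}",
    }


# Hypothetical stand-in for the context argument; only 'vars' is modeled here.
fake_context = {"vars": {"user_input": "Hello world"}}

short_result = custom_check("too short", fake_context)
long_result = custom_check("word " * 150, fake_context)

print(short_result["pass"], short_result["reason"])  # False Word count: 2
print(long_result["pass"], long_result["score"])     # True 0.5
```

Running a file like this directly gives quick feedback on thresholds without spending API tokens.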
Key points:
- Default function name is get_assert
- Specify a function with file://path.py:function_name
- Return bool, float (score), or dict with pass/score/reason
- Access test variables via context['vars']

assert:
  - type: llm-rubric
    value: |
      Evaluate the response based on:
      1. Accuracy of information
      2. Clarity of explanation
      3. Completeness
      Score 0.0-1.0 where 0.7+ is passing.
    threshold: 0.7
    provider: openai:gpt-4.1 # Optional: override grader model
Best practices:
- Use threshold to set minimum passing score

| Type | Usage | Example |
|------|-------|---------|
| contains | Check substring | value: "hello" |
| icontains | Case-insensitive | value: "HELLO" |
| equals | Exact match | value: "42" |
| regex | Pattern match | value: "\\d{4}" |
| python | Custom logic | value: file://script.py |
| llm-rubric | LLM grading | value: "Is professional" |
| latency | Response time | threshold: 1000 |
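To make the table concrete, the deterministic assertion types can be pictured as simple predicates. The Python below is only an approximation for intuition: Promptfoo implements these checks itself, so details such as regex dialect or whitespace handling may differ, and all function names here are hypothetical.

```python
import re


def contains(output: str, value: str) -> bool:
    """Substring check, case-sensitive."""
    return value in output


def icontains(output: str, value: str) -> bool:
    """Substring check, ignoring case."""
    return value.lower() in output.lower()


def equals(output: str, value: str) -> bool:
    """Exact match (this sketch does no trimming or normalization)."""
    return output == value


def regex(output: str, pattern: str) -> bool:
    """Pattern found anywhere in the output (Python regex syntax assumed)."""
    return re.search(pattern, output) is not None


def latency(elapsed_ms: float, threshold_ms: float) -> bool:
    """Response time at or under the threshold, in milliseconds."""
    return elapsed_ms <= threshold_ms


print(icontains("hello world", "HELLO"))  # True
print(regex("Founded in 2024", r"\d{4}"))  # True
```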
All paths are relative to config file location:
# Load file content as variable
vars:
  content: file://data/input.txt
# Load prompt from file
prompts:
  - file://prompts/main.md
# Load test cases from file
tests: file://tests/cases.yaml
# Load Python assertion
assert:
  - type: python
    value: file://scripts/check.py:validate
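The path rule above can be sketched as a small resolver, which is handy for checking references before a run. `resolve_file_ref` is a hypothetical helper, not part of Promptfoo, and it assumes a `:function_name` suffix only appears after a `.py` file.

```python
from pathlib import Path
from typing import Optional, Tuple


def resolve_file_ref(ref: str, config_path: str) -> Tuple[Path, Optional[str]]:
    """Split 'file://path[:function]' and resolve the path against the config's directory."""
    prefix = "file://"
    if not ref.startswith(prefix):
        raise ValueError(f"Not a file reference: {ref}")
    target = ref[len(prefix):]
    function_name = None
    # Assumption: a trailing ':name' after a .py file names the assertion function.
    if ".py:" in target:
        target, function_name = target.rsplit(":", 1)
    base_dir = Path(config_path).parent
    return base_dir / target, function_name


path, func = resolve_file_ref("file://scripts/check.py:validate", "project/promptfooconfig.yaml")
print(path.as_posix(), func)  # project/scripts/check.py validate
```

Calling `path.exists()` on each resolved reference is one way to catch missing files before starting an evaluation.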
# Basic run
npx promptfoo@latest eval
# With specific config
npx promptfoo@latest eval --config path/to/config.yaml
# Output to file
npx promptfoo@latest eval --output results.json
# Filter tests
npx promptfoo@latest eval --filter-metadata category=math
# View results
npx promptfoo@latest view
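When evaluations are launched from a script or CI job, the commands above can be assembled programmatically. This is a minimal sketch using only the `--config` and `--output` flags shown above; `build_eval_command` and `run_eval` are hypothetical helper names, and `run_eval` assumes `npx` is available on the PATH.

```python
import subprocess
from typing import List, Optional


def build_eval_command(config: Optional[str] = None, output: Optional[str] = None) -> List[str]:
    """Assemble the eval command as an argument list (no shell quoting needed)."""
    cmd = ["npx", "promptfoo@latest", "eval"]
    if config:
        cmd += ["--config", config]
    if output:
        cmd += ["--output", output]
    return cmd


def run_eval(config: Optional[str] = None, output: Optional[str] = None) -> int:
    """Run the evaluation and return the process exit code."""
    completed = subprocess.run(build_eval_command(config, output), check=False)
    return completed.returncode


command = build_eval_command("path/to/config.yaml", "results.json")
print(" ".join(command))
# npx promptfoo@latest eval --config path/to/config.yaml --output results.json
```

Passing the command as a list avoids shell interpolation of paths that contain spaces.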
Python not found:
export PROMPTFOO_PYTHON=python3
Large outputs truncated:
Outputs over 30000 characters are truncated. Use head_limit in assertions.
File not found errors:
Ensure paths are relative to promptfooconfig.yaml location.
Use the echo provider to preview rendered prompts without making API calls:
# promptfooconfig-preview.yaml
providers:
  - echo # Returns prompt as output, no API calls
tests:
  - vars:
      input: "test content"
Use cases:
# Run preview mode
npx promptfoo@latest eval --config promptfooconfig-preview.yaml
Cost: Free - no API tokens consumed.
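What the echo provider returns is the prompt after variable substitution. As a rough mental model only, the substitution step can be sketched in a few lines of Python. Promptfoo renders templates with Nunjucks, which supports filters, loops, and conditionals that this hypothetical `render_preview` helper does not handle.

```python
import re


def render_preview(template: str, variables: dict) -> str:
    """Replace {{name}} placeholders; unknown names are left untouched."""

    def substitute(match: re.Match) -> str:
        name = match.group(1)
        return str(variables[name]) if name in variables else match.group(0)

    return re.sub(r"\{\{\s*(\w+)\s*\}\}", substitute, template)


template = "You are a helpful assistant.\nTask: {{task}}\nContext: {{context}}"
print(render_preview(template, {"task": "Summarize", "context": "test content"}))
```

Leaving unknown placeholders visible makes a misspelled variable name easy to spot in the preview.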
For complex few-shot learning with full examples, alternate user and assistant messages for each example (the second pair is optional) and end with the actual input. JSON does not allow comments, so keep annotations out of the prompt file:
[
  {"role": "system", "content": "{{system_prompt}}"},
  {"role": "user", "content": "Task: {{example_input_1}}"},
  {"role": "assistant", "content": "{{example_output_1}}"},
  {"role": "user", "content": "Task: {{example_input_2}}"},
  {"role": "assistant", "content": "{{example_output_2}}"},
  {"role": "user", "content": "Task: {{actual_input}}"}
]
Test case configuration:
tests:
  - vars:
      system_prompt: file://prompts/system.md
      # Few-shot examples
      example_input_1: file://data/examples/input1.txt
      example_output_1: file://data/examples/output1.txt
      example_input_2: file://data/examples/input2.txt
      example_output_2: file://data/examples/output2.txt
      # Actual test
      actual_input: file://data/test1.txt
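If the number of few-shot examples changes often, the chat prompt file can be generated instead of edited by hand. The sketch below builds the same message sequence with placeholders left in place for Promptfoo to fill from `vars`; `build_few_shot_messages` is a hypothetical helper, and the variable names simply mirror the test case above.

```python
import json
from typing import Dict, List, Tuple


def build_few_shot_messages(
    system_prompt: str,
    examples: List[Tuple[str, str]],
    actual_input: str,
) -> List[Dict[str, str]]:
    """System message, then one user/assistant pair per example, then the real input."""
    messages = [{"role": "system", "content": system_prompt}]
    for example_input, example_output in examples:
        messages.append({"role": "user", "content": f"Task: {example_input}"})
        messages.append({"role": "assistant", "content": example_output})
    messages.append({"role": "user", "content": f"Task: {actual_input}"})
    return messages


messages = build_few_shot_messages(
    "{{system_prompt}}",
    [
        ("{{example_input_1}}", "{{example_output_1}}"),
        ("{{example_input_2}}", "{{example_output_2}}"),
    ],
    "{{actual_input}}",
)
# Write this JSON to the chat prompt file referenced from the config.
print(json.dumps(messages, indent=2))
```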
Best practices:
For Chinese/long-form content evaluations (10k+ characters):
Configuration:
providers:
  - id: anthropic:messages:claude-sonnet-4-5-20250929
    config:
      max_tokens: 8192 # Increase for long outputs
defaultTest:
  assert:
    - type: python
      value: file://scripts/metrics.py:check_length
Python assertion for text metrics:
import re


def strip_tags(text: str) -> str:
    """Remove HTML tags for pure text."""
    return re.sub(r'<[^>]+>', '', text)


def check_length(output: str, context: dict) -> dict:
    """Check output length constraints."""
    raw_input = context['vars'].get('raw_input', '')
    input_len = len(strip_tags(raw_input))
    output_len = len(strip_tags(output))
    reduction_ratio = 1 - (output_len / input_len) if input_len > 0 else 0
    return {
        "pass": 0.7 <= reduction_ratio <= 0.9,
        "score": reduction_ratio,
        "reason": f"Reduction: {reduction_ratio:.1%} (target: 70-90%)",
        "named_scores": {
            "input_length": input_len,
            "output_length": output_len,
            "reduction_ratio": reduction_ratio
        }
    }
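A worked example of the reduction-ratio arithmetic used above, with made-up inputs: a 1000-character input cut down to 200 characters gives 1 - 200/1000 = 0.8, which falls inside the 70-90% target. The `reduction_ratio` function is a standalone restatement for illustration, and the sample strings are assumptions.

```python
import re

TAG_PATTERN = re.compile(r"<[^>]+>")


def reduction_ratio(raw_input: str, output: str) -> float:
    """Fraction of tag-stripped input characters removed in the output."""
    input_len = len(TAG_PATTERN.sub("", raw_input))
    output_len = len(TAG_PATTERN.sub("", output))
    return 1 - (output_len / input_len) if input_len > 0 else 0.0


raw_input = "<p>" + "字" * 1000 + "</p>"  # 1000 visible characters
output = "<p>" + "字" * 200 + "</p>"      # 200 visible characters

ratio = reduction_ratio(raw_input, output)
print(f"{ratio:.1%}")       # 80.0%
print(0.7 <= ratio <= 0.9)  # True
```

Python's `len` counts characters rather than bytes, so each Chinese character contributes one unit to the lengths compared here.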
Project: Chinese short-video content curation from long transcripts
Structure:
tiaogaoren/
├── promptfooconfig.yaml           # Production config
├── promptfooconfig-preview.yaml   # Preview config (echo provider)
├── prompts/
│   ├── tiaogaoren-prompt.json     # Chat format with few-shot
│   └── v4/system-v4.md            # System prompt
├── tests/cases.yaml               # 3 test samples
├── scripts/metrics.py             # Custom metrics (reduction ratio, etc.)
├── data/                          # 5 samples (2 few-shot, 3 eval)
└── results/
See: /Users/tiansheng/Workspace/prompts/tiaogaoren/ for full implementation.
For detailed API reference and advanced patterns, see references/promptfoo_api.md.