skills/ai-model-evaluation/SKILL.md
Systematically assess the performance, accuracy, and safety of LLM outputs using quantitative metrics and "LLM-as-a-Judge" patterns to ensure production readiness. Use when: deploying any LLM application to production; comparing different models (e.g., GPT-4o vs. Claude 3.5 Sonnet) or prompt versions; detecting regressions after updating prompts or RAG knowledge bases.
npx skillsauth add jyjeanne/ai-setup-forge ai-model-evaluation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Create a tests.json file containing inputs and expected outputs.
[
  {
    "input": "What is the return policy?",
    "expected": "You can return items within 30 days.",
    "context": "Our policy allows returns for 30 days from purchase date."
  }
]
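Before running a batch, it can help to check that every entry in tests.json has the expected shape. The following is a minimal sketch, not part of any tool mentioned here: the `TestCase` interface and the `parseTestCases` helper are hypothetical names, and treating `context` as optional is an assumption.

```typescript
// Shape of one entry in tests.json, mirroring the example above.
// Assumption: "context" is optional and only needed for RAG-style tests.
interface TestCase {
  input: string;
  expected: string;
  context?: string;
}

// Parse raw JSON text and reject malformed entries early (hypothetical helper).
function parseTestCases(raw: string): TestCase[] {
  const data: unknown = JSON.parse(raw);
  if (!Array.isArray(data)) {
    throw new Error("tests.json must contain a JSON array");
  }
  return data.map((item: unknown, i: number): TestCase => {
    if (typeof item !== "object" || item === null) {
      throw new Error(`Test case ${i} is not an object`);
    }
    const { input, expected, context } = item as Record<string, unknown>;
    if (typeof input !== "string" || typeof expected !== "string") {
      throw new Error(`Test case ${i} needs string "input" and "expected" fields`);
    }
    if (context !== undefined && typeof context !== "string") {
      throw new Error(`Test case ${i} has a non-string "context"`);
    }
    return typeof context === "string" ? { input, expected, context } : { input, expected };
  });
}
```

In a Node script, the raw text would typically come from reading tests.json from disk before calling `parseTestCases`.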
Promptfoo is a popular tool for running batch evaluations.
# Install
npm install -g promptfoo
# Initialize
promptfoo init
Configure promptfooconfig.yaml:
prompts:
  - "Answer this question using the context: {{context}}. Question: {{input}}"
providers:
  - openai:gpt-4o
tests:
  - vars:
      input: "What is the return policy?"
      context: "30-day return policy applies."
    assert:
      - type: icontains
        value: "30 days"
      - type: llm-rubric
        value: "Does not mention unrelated topics"
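The `icontains` assertion is a case-insensitive substring check, so exact wording still matters. The sketch below uses a hypothetical stand-in function (not promptfoo's own code) to show how an answer that echoes the context's "30-day" phrasing would miss the expected value "30 days".

```typescript
// Hypothetical stand-in for a case-insensitive "contains" check.
function icontains(output: string, value: string): boolean {
  return output.toLowerCase().includes(value.toLowerCase());
}

console.log(icontains("Returns are accepted within 30 Days.", "30 days")); // true
console.log(icontains("We have a 30-day return policy.", "30 days"));      // false
```

If both phrasings are acceptable, a looser assertion value or the `llm-rubric` check may be a better fit than a literal substring.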
For RAG systems, evaluate the three-way relationship: Question, Context, and Answer.
// Conceptual sketch: DeepEval is a Python framework, so the metric calls are left as comments here.
async function evaluateRag(query: string, retrievalContext: string, output: string) {
  // 1. Faithfulness: Is the answer grounded in the context?
  // 2. Answer Relevance: Does it answer the query?
  // 3. Context Precision: Was the retrieved context relevant?
}
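To make the three relationships concrete, here is a rough sketch that scores each one with plain token overlap. This is a swapped-in proxy, not how DeepEval or similar tools compute these metrics (they typically rely on LLM-based judgments); the names `tokenize`, `overlap`, and `scoreRagTriad` are hypothetical.

```typescript
// Lowercase alphanumeric word tokens; punctuation is dropped.
function tokenize(text: string): Set<string> {
  return new Set<string>(text.toLowerCase().match(/[a-z0-9]+/g) ?? []);
}

// Fraction of tokens in `a` that also appear in `b` (0 when `a` is empty).
function overlap(a: Set<string>, b: Set<string>): number {
  if (a.size === 0) return 0;
  let shared = 0;
  a.forEach((token) => {
    if (b.has(token)) shared++;
  });
  return shared / a.size;
}

interface RagScores {
  faithfulness: number;
  answerRelevance: number;
  contextPrecision: number;
}

// Crude lexical proxies for the three checks listed above.
function scoreRagTriad(query: string, retrievalContext: string, output: string): RagScores {
  const q = tokenize(query);
  const c = tokenize(retrievalContext);
  const o = tokenize(output);
  return {
    faithfulness: overlap(o, c),     // share of answer tokens found in the context
    answerRelevance: overlap(q, o),  // share of query tokens found in the answer
    contextPrecision: overlap(c, q), // share of context tokens found in the query
  };
}
```

Lexical overlap misses paraphrases ("return" vs. "returns"), which is one reason to hand these judgments to a model once the pipeline shape is settled.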
Use a stronger model to grade your target model.
async function gradeOutput(question: string, answer: string, reference: string) {
  const graderPrompt = `
    You are an impartial judge. Grade the student's answer based on the reference.
    Question: ${question}
    Reference: ${reference}
    Student Answer: ${answer}
    Provide a score from 1-10 and a brief explanation.
    Output JSON: { "score": number, "explanation": string }
  `;
  // Call GPT-4 with JSON mode enabled
}
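The grader above stops before the model call. One way to complete it without committing to a specific SDK is to inject the call as a function and validate the JSON reply. This is a sketch under assumptions: `JudgeFn`, `parseGrade`, and `gradeWithJudge` are hypothetical names, and the stub judge stands in for a real API client.

```typescript
interface Grade {
  score: number;
  explanation: string;
}

// Hypothetical signature: takes a prompt, resolves to the judge model's raw text reply.
type JudgeFn = (prompt: string) => Promise<string>;

// Validate the judge's reply instead of trusting it blindly.
function parseGrade(raw: string): Grade {
  const data: unknown = JSON.parse(raw);
  if (typeof data !== "object" || data === null) {
    throw new Error("Judge reply is not a JSON object");
  }
  const { score, explanation } = data as Record<string, unknown>;
  if (typeof score !== "number" || score < 1 || score > 10) {
    throw new Error("Judge score must be a number from 1 to 10");
  }
  if (typeof explanation !== "string") {
    throw new Error("Judge explanation must be a string");
  }
  return { score, explanation };
}

async function gradeWithJudge(
  callJudge: JudgeFn,
  question: string,
  answer: string,
  reference: string,
): Promise<Grade> {
  const prompt = [
    "You are an impartial judge. Grade the student's answer based on the reference.",
    `Question: ${question}`,
    `Reference: ${reference}`,
    `Student Answer: ${answer}`,
    "Provide a score from 1-10 and a brief explanation.",
    'Output JSON: { "score": number, "explanation": string }',
  ].join("\n");
  return parseGrade(await callJudge(prompt));
}

// Usage with a stub judge; replace the stub with a real model call.
const stubJudge: JudgeFn = async () => '{ "score": 8, "explanation": "Matches the reference." }';
gradeWithJudge(
  stubJudge,
  "What is the return policy?",
  "You can return items within 30 days.",
  "You can return items within 30 days.",
).then((grade) => console.log(grade.score, grade.explanation));
```

Injecting the judge keeps the grading logic testable offline and makes it easy to swap the grading model later.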
Integrate evaluation into your GitHub Actions to prevent regressions.
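# .github/workflows/ai-eval.yml
name: ai-eval
# Assumption: run on pull requests; adjust the trigger to your workflow.
on: [pull_request]
jobs:
  evaluate:
    runs-on: ubuntu-latest
    steps:
      - uses: actions/checkout@v4
      - run: npx promptfoo eval
        env:
          # Assumption: the provider key is stored as a repository secret with this name.
          OPENAI_API_KEY: ${{ secrets.OPENAI_API_KEY }}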
Use gpt-4o-mini for simpler checks.
The result is a detailed report (HTML/JSON) showing pass/fail status, accuracy percentages, and regression analysis.
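The accuracy and regression figures in such a report can be derived from per-test pass/fail results. The sketch below assumes a simplified, hypothetical result shape (`EvalResult` with an `id` and a `pass` flag), not the output format of any particular tool; `summarize` is likewise an illustrative name.

```typescript
// Hypothetical, simplified result shape: one record per test case.
interface EvalResult {
  id: string;
  pass: boolean;
}

interface Summary {
  passRate: number;      // fraction of current tests that passed
  regressions: string[]; // ids that passed in the baseline but fail now
}

// Compare the current run against a baseline run, matching tests by id.
function summarize(baseline: EvalResult[], current: EvalResult[]): Summary {
  const passedBefore = new Set<string>(baseline.filter((r) => r.pass).map((r) => r.id));
  const passedNow = current.filter((r) => r.pass).length;
  return {
    passRate: current.length === 0 ? 0 : passedNow / current.length,
    regressions: current.filter((r) => !r.pass && passedBefore.has(r.id)).map((r) => r.id),
  };
}

const baselineRun: EvalResult[] = [
  { id: "a", pass: true },
  { id: "b", pass: true },
  { id: "c", pass: false },
];
const currentRun: EvalResult[] = [
  { id: "a", pass: true },
  { id: "b", pass: false },
  { id: "c", pass: false },
  { id: "d", pass: true },
];
console.log(summarize(baselineRun, currentRun)); // passRate 0.5, regressions ["b"]
```

A CI step could fail the build whenever `regressions` is non-empty or `passRate` drops below an agreed threshold.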