ai-evaluation/SKILL.md
Use when setting up quality assurance for AI features — defining evaluation criteria, measuring output quality, using AI-as-judge, monitoring production AI, detecting drift, and building user feedback loops
Do not use this skill when the task is unrelated to ai-evaluation or would be better handled by a more specific companion skill.

Read SKILL.md first, then load only the referenced deep-dive files that are necessary for the task.

Evaluation is the biggest bottleneck to successful AI deployment. Define evaluation criteria BEFORE building. Without evaluation, you cannot know whether your AI feature is working, degrading, or harming users.

Core principle: Evaluation-driven development. Like TDD for AI — define what "good" means first, then build.

| Dimension | What to Measure | Method |
|---|---|---|
| Format | Is output valid JSON/schema? Correct length? | Automated rules |
| Factual accuracy | Does output match the provided context? | AI-as-judge or RAG citation check |
| Safety | Toxic, harmful, or brand-risk content? | Classifier or AI-as-judge |
| Instruction-following | Did it follow format/tone/language rules? | Automated + AI-as-judge |
| Relevance | Does output address the user's question? | AI-as-judge |
| Cost | Tokens per request; cost per feature | Logged automatically |
| Latency | Time to first token; total response time | Logged automatically |

1. Define criteria before building
2. Create golden test set (20–50 examples with expected outputs)
3. Run automated format checks on every new model/prompt version
4. Run AI-as-judge for quality checks
5. Compare against previous version — only deploy if metrics hold or improve
6. Monitor production: track live metrics + user feedback
7. Retrain or reprompt when drift is detected
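
Step 5 in the workflow above ("only deploy if metrics hold or improve") can be expressed as a small gate. The following is a minimal sketch, not part of the skill itself: the function name `findRegressions`, the metric names, the sample values, and the 2% tolerance are all illustrative assumptions.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical deploy gate: returns the metrics on which the candidate version
 * regressed beyond the tolerance. An empty list means "metrics hold or improve".
 *
 * @param array<string, float> $baseline      metric => value for the current version
 * @param array<string, float> $candidate     metric => value for the new version
 * @param string[]             $lowerIsBetter metrics where a smaller value is better
 * @param float                $tolerance     allowed relative regression (0.02 = 2%)
 * @return string[] names of regressed metrics
 */
function findRegressions(
    array $baseline,
    array $candidate,
    array $lowerIsBetter = [],
    float $tolerance = 0.02
): array {
    $regressed = [];
    foreach ($baseline as $metric => $old) {
        if (!array_key_exists($metric, $candidate)) {
            $regressed[] = $metric; // a metric that was not measured counts as a failure
            continue;
        }
        $new = $candidate[$metric];
        $allowed = abs($old) * $tolerance;
        $worse = in_array($metric, $lowerIsBetter, true)
            ? $new > $old + $allowed
            : $new < $old - $allowed;
        if ($worse) {
            $regressed[] = $metric;
        }
    }
    return $regressed;
}

// Made-up numbers for illustration only.
$baseline  = ['format_pass_rate' => 0.98, 'avg_judge_score' => 4.2, 'p95_latency_ms' => 2100.0];
$candidate = ['format_pass_rate' => 0.99, 'avg_judge_score' => 3.9, 'p95_latency_ms' => 2050.0];

$regressions = findRegressions($baseline, $candidate, ['p95_latency_ms']);
echo $regressions === [] ? "deploy\n" : 'block: ' . implode(', ', $regressions) . "\n";
```

With these sample values the judge score drops by more than 2%, so the gate reports `avg_judge_score` and the deployment would be blocked. Treating a missing metric as a regression is a deliberate choice here: a version that was not measured should not pass by default.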

```sql
CREATE TABLE ai_eval_cases (
    id INT AUTO_INCREMENT PRIMARY KEY,
    feature_name VARCHAR(100),   -- 'invoice_analysis', 'sales_report'
    input TEXT NOT NULL,         -- the user query or document
    expected_output TEXT,        -- ideal output (or key elements of it)
    eval_criteria JSON,          -- {"format": "json", "must_contain": ["total", "vendor"]}
    created_by INT,
    created_at TIMESTAMP DEFAULT NOW()
);
```

Build test cases from:
Run these on every deployment:
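
```php
// Minimal result holder. The original snippet references EvalResult without
// defining it, so this definition is an assumption about its shape.
final class EvalResult {
    public function __construct(
        public int $score,
        public array $issues,
    ) {}
}

class AiEvaluator {
    /**
     * Rule-based format checks. The score is additive and not normalized,
     * so compare scores only between runs that use the same criteria.
     */
    public function evaluateFormat(string $output, array $criteria): EvalResult {
        $score = 0;
        $issues = [];

        // JSON validity
        if (($criteria['format'] ?? null) === 'json') {
            $decoded = json_decode($output, true);
            if (json_last_error() !== JSON_ERROR_NONE) {
                $issues[] = 'Invalid JSON';
            } else {
                $score += 25;
                // Required keys. Valid JSON can be a scalar (e.g. "42"), so guard
                // with is_array() before calling array_key_exists().
                foreach ($criteria['required_keys'] ?? [] as $key) {
                    if (!is_array($decoded) || !array_key_exists($key, $decoded)) {
                        $issues[] = "Missing key: $key";
                    } else {
                        $score += 10;
                    }
                }
            }
        }

        // Length constraints
        if (isset($criteria['max_words'])) {
            $wordCount = str_word_count($output);
            if ($wordCount > $criteria['max_words']) {
                $issues[] = "Too long: {$wordCount} words (max {$criteria['max_words']})";
            } else {
                $score += 15;
            }
        }

        // Must-contain terms (case-insensitive substring match)
        foreach ($criteria['must_contain'] ?? [] as $term) {
            if (stripos($output, $term) === false) {
                $issues[] = "Missing expected term: $term";
            } else {
                $score += 10;
            }
        }

        return new EvalResult($score, $issues);
    }
}
```
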
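
To connect the format checks to the golden test set, the loop below is one possible way to run every stored case and report a pass rate. It is a sketch under assumptions: `goldenSetPassRate` is a hypothetical name, the rows are shaped like `ai_eval_cases`, and both the model call and the checker are stubbed with closures so the example runs on its own.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical runner: applies a check to every golden case and returns the pass rate.
 *
 * @param array<int, array{input: string, eval_criteria: string}> $cases rows shaped like ai_eval_cases
 * @param callable $generate fn(string $input): string, produces the AI output (stubbed below)
 * @param callable $check    fn(string $output, array $criteria): string[], empty list means pass
 */
function goldenSetPassRate(array $cases, callable $generate, callable $check): float
{
    if ($cases === []) {
        throw new InvalidArgumentException('Golden test set is empty');
    }
    $passed = 0;
    foreach ($cases as $case) {
        // eval_criteria is stored as JSON text; fall back to no criteria if it is empty or invalid
        $criteria = json_decode($case['eval_criteria'], true) ?? [];
        $issues = $check($generate($case['input']), $criteria);
        if ($issues === []) {
            $passed++;
        }
    }
    return $passed / count($cases);
}

// Stubs standing in for the real model call and the real format checker.
$generate = static fn(string $input): string => '{"total": 120.5, "vendor": "Acme"}';
$check = static function (string $output, array $criteria): array {
    $issues = [];
    foreach ($criteria['must_contain'] ?? [] as $term) {
        if (stripos($output, $term) === false) {
            $issues[] = "Missing expected term: $term";
        }
    }
    return $issues;
};

// Illustrative cases, not real data.
$cases = [
    ['input' => 'Invoice A', 'eval_criteria' => '{"format": "json", "must_contain": ["total", "vendor"]}'],
    ['input' => 'Invoice B', 'eval_criteria' => '{"format": "json", "must_contain": ["total", "due_date"]}'],
];

$passRate = goldenSetPassRate($cases, $generate, $check);
echo $passRate, "\n";
```

The stubbed output lacks `due_date`, so the second case fails and the pass rate is 0.5. In a real run the closures would be replaced by the actual model call and the evaluator's format check.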
Use a strong model to evaluate your AI feature's outputs. A strong judge is reliable for quality, relevance, and tone.
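
```php
/**
 * Scores an AI response with a judge model.
 * callLLM() is the application's own LLM client wrapper and is not defined here.
 */
function judgeAiOutput(string $input, string $output, string $criteria): array {
    $judgePrompt = <<<PROMPT
You are evaluating the quality of an AI assistant's response.
Evaluation criteria:
{$criteria}
User input:
---
{$input}
---
AI response to evaluate:
---
{$output}
---
Score the response on each criterion from 1–5 (5 = excellent).
Explain your reasoning briefly, then give an overall score (1–5).
Format your response as JSON:
{
  "relevance": {"score": X, "reason": "..."},
  "accuracy": {"score": X, "reason": "..."},
  "tone": {"score": X, "reason": "..."},
  "overall": X
}
PROMPT;

    $reply = callLLM('gpt-4o', $judgePrompt, temperature: 0.1);

    // The wrapper may return raw JSON text or an already decoded array; accept both.
    $scores = is_string($reply) ? json_decode($reply, true) : $reply;
    if (!is_array($scores)) {
        throw new RuntimeException('Judge reply was not valid JSON');
    }
    return $scores;
}
```
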
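
Because the judge is itself a model, its reply can be malformed or out of range. The sketch below shows one way to validate a reply against the JSON shape requested in the judge prompt before storing a score. The function name `parseJudgeReply` is hypothetical, and rejecting the whole reply on any invalid field is an assumption, not something the skill prescribes.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical validator for the judge's JSON reply. Returns the decoded scores,
 * or null when the reply is unusable (invalid JSON, missing keys, scores outside 1-5).
 *
 * @param string[] $criteria per-criterion keys expected in the reply
 */
function parseJudgeReply(string $reply, array $criteria = ['relevance', 'accuracy', 'tone']): ?array
{
    $data = json_decode($reply, true);
    if (!is_array($data)) {
        return null;
    }

    // A usable score is a number between 1 and 5 inclusive.
    $inRange = static fn($value): bool =>
        (is_int($value) || is_float($value)) && $value >= 1 && $value <= 5;

    foreach ($criteria as $name) {
        if (!isset($data[$name]['score']) || !$inRange($data[$name]['score'])) {
            return null;
        }
    }
    if (!isset($data['overall']) || !$inRange($data['overall'])) {
        return null;
    }
    return $data;
}

$reply = '{"relevance": {"score": 4, "reason": "on topic"},'
       . ' "accuracy": {"score": 5, "reason": "matches context"},'
       . ' "tone": {"score": 3, "reason": "too formal"},'
       . ' "overall": 4}';

$scores = parseJudgeReply($reply);
echo $scores === null ? "discard reply\n" : "overall: {$scores['overall']}\n";
```

Returning null instead of throwing lets the caller decide whether to retry the judge call or simply skip the sample.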
AI-judge best practices:

```sql
CREATE TABLE ai_quality_metrics (
    id BIGINT AUTO_INCREMENT PRIMARY KEY,
    tenant_id INT NOT NULL,
    feature_name VARCHAR(100),
    prompt_version VARCHAR(10),
    model VARCHAR(50),
    format_valid BOOLEAN,
    latency_ms INT,
    tokens_in INT,
    tokens_out INT,
    judge_score DECIMAL(3,2),    -- 1.00–5.00 from AI judge (async)
    user_rating TINYINT,         -- 1–5 from explicit feedback
    thumbs_up BOOLEAN,           -- quick user feedback
    created_at TIMESTAMP DEFAULT NOW(),
    INDEX idx_feature_date (feature_name, created_at),
    INDEX idx_tenant_date (tenant_id, created_at)
);
```


```php
$alerts = [
    'format_failure_rate' => 0.05,  // Alert if > 5% of responses fail format
    'thumbs_down_rate'    => 0.15,  // Alert if > 15% negative feedback
    'p99_latency_ms'      => 8000,  // Alert if p99 latency > 8 seconds
    'cost_per_request'    => 0.05,  // Alert if avg cost > $0.05 per request
];
```

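
The thresholds above only matter once something compares them with observed values. Here is a minimal sketch of that comparison; `checkAlerts` and the observed numbers are hypothetical, and how the observed metrics are computed (for example from `ai_quality_metrics`) is left out.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: compares observed metrics with alert thresholds and
 * returns one message per metric that exceeds its threshold.
 *
 * @param array<string, int|float> $thresholds metric => upper limit
 * @param array<string, int|float> $observed   metric => measured value
 * @return string[]
 */
function checkAlerts(array $thresholds, array $observed): array
{
    $triggered = [];
    foreach ($thresholds as $metric => $limit) {
        // Metrics that were not measured are skipped rather than treated as alerts.
        if (isset($observed[$metric]) && $observed[$metric] > $limit) {
            $triggered[] = sprintf('%s: %s exceeds threshold %s', $metric, $observed[$metric], $limit);
        }
    }
    return $triggered;
}

$alerts = [
    'format_failure_rate' => 0.05,
    'thumbs_down_rate'    => 0.15,
    'p99_latency_ms'      => 8000,
    'cost_per_request'    => 0.05,
];

// Made-up measurements for illustration only.
$observed = [
    'format_failure_rate' => 0.08,
    'thumbs_down_rate'    => 0.10,
    'p99_latency_ms'      => 9500,
    'cost_per_request'    => 0.02,
];

$triggered = checkAlerts($alerts, $observed);
foreach ($triggered as $message) {
    echo $message, "\n";
}
```

With these sample values, two alerts fire: the format failure rate and the p99 latency.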
Drift = your AI feature is silently getting worse. Causes:

```sql
-- Weekly average quality score: watch for a downward trend
SELECT
    YEARWEEK(created_at) AS week,
    feature_name,
    AVG(judge_score) AS avg_quality,
    -- Rows without feedback (thumbs_up IS NULL) are excluded from the rate
    -- instead of being counted as positive.
    AVG(CASE WHEN thumbs_up IS NULL THEN NULL
             WHEN thumbs_up = FALSE THEN 1
             ELSE 0 END) AS negative_rate,
    AVG(latency_ms) AS avg_latency
FROM ai_quality_metrics
WHERE tenant_id = ? AND created_at > DATE_SUB(NOW(), INTERVAL 8 WEEK)
GROUP BY week, feature_name
ORDER BY week;
```

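
Once the weekly averages are available, a downward trend can be flagged in code rather than by eye. The sketch below compares the most recent weeks with the earlier ones. The function name `detectDrift`, the two-week window, and the 0.3-point drop are illustrative assumptions and should be tuned per feature.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical drift check over weekly average judge scores (oldest first).
 * Flags drift when the mean of the most recent weeks falls below the mean of
 * the earlier weeks by more than $maxDrop.
 *
 * @param float[] $weeklyScores weekly avg_quality values, oldest first
 */
function detectDrift(array $weeklyScores, int $recentWeeks = 2, float $maxDrop = 0.3): bool
{
    if ($recentWeeks < 1) {
        throw new InvalidArgumentException('recentWeeks must be at least 1');
    }
    if (count($weeklyScores) <= $recentWeeks) {
        return false; // not enough history to form a baseline
    }

    $recent   = array_slice($weeklyScores, -$recentWeeks);
    $baseline = array_slice($weeklyScores, 0, -$recentWeeks);

    $recentAvg   = array_sum($recent) / count($recent);
    $baselineAvg = array_sum($baseline) / count($baseline);

    return ($baselineAvg - $recentAvg) > $maxDrop;
}

// Made-up weekly averages for illustration only.
$declining = [4.3, 4.2, 4.3, 4.2, 4.1, 4.2, 3.8, 3.6];
$stable    = [4.2, 4.1, 4.2, 4.1];

var_dump(detectDrift($declining)); // bool(true)
var_dump(detectDrift($stable));    // bool(false)
```

Comparing window means rather than single weeks makes the check less sensitive to one noisy week, at the cost of reacting a little later.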
Act when:

| Signal | Type | Strength |
|---|---|---|
| Thumbs up / down | Explicit | Medium |
| Star rating | Explicit | Medium |
| "That's wrong" in chat | Implicit | High |
| User edits output | Implicit | Very high |
| Early generation stop | Implicit | Medium |
| Rephrases same question | Implicit | High |
| Regenerates response | Implicit | Medium |

Collect user edits as preference data: original output = rejected, edited version = preferred.
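
A user edit can be captured as a preference record with very little code. The following is a minimal sketch; the function name `buildPreferencePair` and the array keys are hypothetical, and where the record is stored is left open.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical helper: turns a user edit into a preference record.
 * Returns null when the user left the output unchanged, since that carries no signal.
 */
function buildPreferencePair(string $prompt, string $originalOutput, string $editedOutput): ?array
{
    // Ignore whitespace-only differences.
    if (trim($originalOutput) === trim($editedOutput)) {
        return null;
    }
    return [
        'prompt'    => $prompt,
        'rejected'  => $originalOutput, // what the model produced
        'preferred' => $editedOutput,   // what the user changed it to
    ];
}

$pair = buildPreferencePair(
    'Summarise this invoice',
    'The invoice total is 120.',
    'The invoice total is 120.50, due on 1 March.'
);
echo json_encode($pair, JSON_PRETTY_PRINT), "\n";
```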

| Phase | What to Evaluate | How |
|---|---|---|
| Pre-deploy | New prompt version vs old | A/B on golden test set |
| Pre-deploy | New model vs old | Same test set, compare scores |
| Post-deploy | Production quality | Sample 5% of requests → AI judge |
| Post-deploy | User satisfaction | Feedback collection |
| Ongoing | Drift detection | Weekly metric trend |

Never deploy a new prompt or model without running the golden test set first.
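
For the post-deploy step that samples 5% of requests for the AI judge, one option is a deterministic sampler keyed on the request ID, so the same request always gets the same decision. This is a sketch under assumptions: `shouldSampleForJudge` is a hypothetical name, and hashing with `md5` into 10,000 buckets is just one reasonable choice.

```php
<?php
declare(strict_types=1);

/**
 * Hypothetical deterministic sampler: hashes the request ID into one of 10,000
 * buckets and selects the request when its bucket falls below the sample rate.
 */
function shouldSampleForJudge(string $requestId, float $rate = 0.05): bool
{
    // 7 hex digits fit comfortably in an int on both 32-bit and 64-bit PHP builds.
    $bucket = hexdec(substr(md5($requestId), 0, 7)) % 10000;
    return $bucket < (int) round($rate * 10000);
}

// Count how many of 10,000 synthetic request IDs would be sent to the judge.
$sampled = 0;
for ($i = 0; $i < 10000; $i++) {
    if (shouldSampleForJudge("req-$i")) {
        $sampled++;
    }
}
echo "sampled $sampled of 10000 requests\n";
```

Hash-based sampling avoids storing a per-request random decision and keeps retries of the same request consistent. The sampled share will be close to, but not exactly, the configured rate.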
Chip Huyen — AI Engineering (2025) Ch.3–4,10; Chip Huyen — Designing ML Systems (2022) Ch.8