skills/llm-evaluation-harness/SKILL.md
Build automated LLM evaluation pipelines with benchmarks, regression tests, RAGAS, and human eval workflows. Activate on: LLM evaluation, benchmark testing, eval pipeline, RAGAS, model regression tests. NOT for: traditional software testing (testing-expert), model training (ai-engineer).
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add curiositech/windags-skills llm-evaluation-harness
Build automated evaluation pipelines for LLM applications with benchmarks, regression tests, RAG evaluation (RAGAS), and human eval workflows.
Activate on: "evaluate LLM", "benchmark model", "regression test AI", "RAGAS evaluation", "eval pipeline", "LLM quality metrics", "compare model versions", "human evaluation workflow", "test AI responses"
NOT for: Traditional unit/integration testing (testing-expert), model training loops (ai-engineer), or prompt writing (prompt-engineer)
| Domain | Technologies | Notes |
|--------|-------------|-------|
| RAG Evaluation | RAGAS, DeepEval, custom | Faithfulness, answer relevance, context precision |
| LLM-as-Judge | Claude, GPT-4o, Llama 3.1 as evaluators | Rubric-based scoring with calibration |
| Exact Match | Regex, JSON schema validation, string match | For structured outputs: classification, extraction |
| Human Eval | Argilla, Label Studio, custom UI | Gold-standard quality, expensive, slow |
| Benchmarks | MMLU, HumanEval, custom domain benchmarks | Standardized comparison across models |
| CI Integration | GitHub Actions, pytest, Vitest | Eval-as-tests with pass/fail thresholds |
Eval Dataset (N test cases)
│
├──→ [Exact Match] ──→ Precision/Recall/F1 (for structured outputs)
│
├──→ [LLM-as-Judge] ──→ Rubric scores 1-5 per dimension
│ │
│ └── Calibrate: run judge on 20 pre-scored examples first
│
├──→ [RAGAS] ──→ Faithfulness, Answer Relevance, Context Precision
│ │
│ └── For RAG systems only; measures retrieval + generation quality
│
└──→ [Human Eval] ──→ Gold-standard labels (sample 10-20%)
│
└── Use for calibrating LLM-as-judge, not as primary method
All results ──→ [Score Aggregation] ──→ [Trend Tracker] ──→ [CI Gate]
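The exact-match branch above reduces to counting matches against reference labels. A minimal sketch, assuming single-label classification outputs compared by exact string equality; the function name `exact_match_prf` and the `positive_label` parameter are illustrative and not part of this skill:

```python
def exact_match_prf(predictions: list[str], references: list[str],
                    positive_label: str) -> dict[str, float]:
    """Precision, recall, and F1 for one label using exact string match."""
    if len(predictions) != len(references):
        raise ValueError("predictions and references must have the same length")

    pairs = list(zip(predictions, references))
    tp = sum(p == positive_label and r == positive_label for p, r in pairs)
    fp = sum(p == positive_label and r != positive_label for p, r in pairs)
    fn = sum(p != positive_label and r == positive_label for p, r in pairs)

    # Guard against empty denominators when a label never appears.
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = (2 * precision * recall / (precision + recall)
          if precision + recall else 0.0)
    return {"precision": precision, "recall": recall, "f1": f1}
```

For extraction tasks the same counts can be taken over extracted fields instead of whole labels.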
# LLM-as-judge evaluation
# Note: llm_call is assumed to be your own async LLM client wrapper that
# returns the judge's raw JSON string; it is not defined in this snippet.
import json

JUDGE_RUBRIC = """
Score the following response on a scale of 1-5 for each dimension:
- **Correctness** (1-5): Is the information factually accurate?
- **Completeness** (1-5): Does it address all parts of the question?
- **Clarity** (1-5): Is it well-organized and easy to understand?

Question: {question}
Expected: {expected}
Response: {response}

Return JSON: {{"correctness": N, "completeness": N, "clarity": N, "reasoning": "..."}}
"""

async def evaluate_with_judge(test_cases: list[dict], model_output_fn) -> dict:
    results = []
    for case in test_cases:
        # Generate the response under test, then ask the judge to score it
        response = await model_output_fn(case["question"])
        judge_prompt = JUDGE_RUBRIC.format(
            question=case["question"],
            expected=case["expected"],
            response=response,
        )
        scores = await llm_call(judge_prompt, model="claude-sonnet-4-20250514", temperature=0)
        results.append(json.loads(scores))

    # Aggregate: mean score per dimension across all test cases
    return {
        dim: sum(r[dim] for r in results) / len(results)
        for dim in ["correctness", "completeness", "clarity"]
    }
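The calibration step in the decision tree (run the judge on about 20 pre-scored examples first) can be checked with a simple agreement report. This is a sketch under stated assumptions: the function name `calibration_report` and the `max_mae` threshold of 0.5 points are hypothetical choices, not values prescribed by this skill.

```python
def calibration_report(judge_scores: list[int], human_scores: list[int],
                       max_mae: float = 0.5) -> dict:
    """Compare judge scores with human scores on the same pre-scored examples."""
    if not judge_scores or len(judge_scores) != len(human_scores):
        raise ValueError("need two non-empty score lists of equal length")

    diffs = [abs(j - h) for j, h in zip(judge_scores, human_scores)]
    n = len(diffs)
    mae = sum(diffs) / n
    return {
        "mae": mae,                                        # mean absolute error in rubric points
        "exact_agreement": sum(d == 0 for d in diffs) / n,  # judge matched the human exactly
        "within_one": sum(d <= 1 for d in diffs) / n,       # judge off by at most one point
        "calibrated": mae <= max_mae,                       # assumed acceptance threshold
    }
```

If the report shows poor agreement, tightening the rubric wording and re-running on the same pre-scored set is one way to iterate before trusting the judge on the full dataset.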
Test Case: (question, ground_truth, retrieved_contexts)
│
├──→ Faithfulness: Is the answer supported by retrieved contexts?
│ Score = (claims supported by context) / (total claims in answer)
│
├──→ Answer Relevance: Does the answer address the question?
│ Score = cosine_sim(question, generated_questions_from_answer)
│
├──→ Context Precision: Are relevant contexts ranked higher?
│ Score = weighted precision of relevant contexts in top-k
│
└──→ Context Recall: Were all ground-truth facts retrievable?
Score = (ground_truth_claims in contexts) / (total ground_truth_claims)
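The ratio-style scores above are straightforward once each claim or context has a verdict. The sketch below assumes those verdicts already exist as booleans (in RAGAS they come from an LLM); `ratio_score` and `context_precision_at_k` are illustrative names, and the rank-weighted form shown (mean of precision@k over the relevant positions) is one common reading of "weighted precision", not a copy of the library's implementation.

```python
def ratio_score(verdicts: list[bool]) -> float:
    """Fraction of True verdicts, e.g. supported claims / total claims."""
    return sum(verdicts) / len(verdicts) if verdicts else 0.0


def context_precision_at_k(relevance: list[bool]) -> float:
    """Mean of precision@k taken at each rank k that holds a relevant context.

    `relevance[i]` says whether the context retrieved at rank i+1 is relevant.
    Relevant contexts ranked earlier yield a higher score.
    """
    hits = 0
    weighted = 0.0
    for k, is_relevant in enumerate(relevance, start=1):
        if is_relevant:
            hits += 1
            weighted += hits / k  # precision@k at this relevant position
    return weighted / hits if hits else 0.0
```

Faithfulness and context recall both use `ratio_score`: pass the per-claim "supported by context" verdicts for the former and the per-ground-truth-claim "found in contexts" verdicts for the latter.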
# RAGAS evaluation
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": expected_answers,
})

result = evaluate(eval_dataset, metrics=[
    faithfulness, answer_relevancy, context_precision
])
print(result)  # {'faithfulness': 0.87, 'answer_relevancy': 0.92, ...}
On PR / prompt change:
│
▼
[Run eval suite] ──→ scores
│
▼
[Compare to baseline]
├── Score >= baseline - tolerance (2%) ──→ PASS (merge allowed)
└── Score < baseline - tolerance ──→ FAIL (block merge)
│
└── Report: which test cases regressed, by how much
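The gate above can be sketched as a comparison of per-test-case scores against a stored baseline. Assumptions in this sketch: scores are on a 0-1 scale, the 2% tolerance is treated as an absolute margin of 0.02 on the mean score, and the function name `regression_gate` and its return shape are hypothetical.

```python
def regression_gate(baseline: dict[str, float], current: dict[str, float],
                    tolerance: float = 0.02) -> dict:
    """Pass if the mean score stays within `tolerance` of the baseline mean.

    Both arguments map test-case id -> score; only shared ids are compared.
    """
    shared = sorted(baseline.keys() & current.keys())
    if not shared:
        raise ValueError("baseline and current share no test cases")

    base_mean = sum(baseline[c] for c in shared) / len(shared)
    cur_mean = sum(current[c] for c in shared) / len(shared)

    # Report every case that got worse and by how much, even if the gate passes.
    regressed = {
        c: round(baseline[c] - current[c], 4)
        for c in shared
        if current[c] < baseline[c]
    }
    return {
        "passed": cur_mean >= base_mean - tolerance,
        "baseline_mean": base_mean,
        "current_mean": cur_mean,
        "regressed_cases": regressed,
    }
```

In CI, a failing gate would typically be surfaced by exiting non-zero or failing a pytest assertion on `result["passed"]`, with `regressed_cases` printed in the job log.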