curiositech/llm-evaluation-harness

Name: llm-evaluation-harness
Author: curiositech

skills/llm-evaluation-harness/SKILL.md

npx skillsauth add curiositech/windags-skills llm-evaluation-harness

LLM Evaluation Harness

Build automated evaluation pipelines for LLM applications with benchmarks, regression tests, RAG evaluation (RAGAS), and human eval workflows.

Activation Triggers

Activate on: "evaluate LLM", "benchmark model", "regression test AI", "RAGAS evaluation", "eval pipeline", "LLM quality metrics", "compare model versions", "human evaluation workflow", "test AI responses"

NOT for: Traditional unit/integration testing (testing-expert), model training loops (ai-engineer), or prompt writing (prompt-engineer)

Quick Start

1. Define eval dimensions: correctness, faithfulness, relevance, coherence, safety. Pick the 2-3 that matter most for your use case.
2. Build eval dataset: 50-200 curated test cases with expected outputs or rubrics. Include edge cases and adversarial inputs.
3. Choose eval methods: LLM-as-judge for scalable scoring, exact match for structured outputs, RAGAS for RAG systems, human eval for nuance.
4. Automate in CI: run evals on every prompt change, model upgrade, or pipeline modification. Fail the build if scores regress.
5. Track trends: store eval results over time. A 2% quality drop per release compounds into a drop of roughly 18% over 10 releases (1 - 0.98^10 ≈ 0.18).
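A quick check of how small per-release drops accumulate: a 2% relative drop per release leaves 0.98^10 ≈ 0.817 of the original quality after 10 releases, a loss of about 18% (a little under the additive estimate of 20%). The sketch below is only illustrative arithmetic; the function name `cumulative_drop` is hypothetical and not part of the skill.

```python
def cumulative_drop(per_release_drop: float, releases: int) -> float:
    """Fraction of quality lost after `releases` releases, where each release
    loses `per_release_drop` of whatever quality remained before it."""
    remaining = (1.0 - per_release_drop) ** releases
    return 1.0 - remaining


# 2% relative drop per release, 10 releases
print(f"{cumulative_drop(0.02, 10):.1%}")  # prints 18.3%
```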

Core Capabilities

| Domain | Technologies | Notes |
|--------|-------------|-------|
| RAG Evaluation | RAGAS, DeepEval, custom | Faithfulness, answer relevance, context precision |
| LLM-as-Judge | Claude, GPT-4o, Llama 3.1 as evaluators | Rubric-based scoring with calibration |
| Exact Match | Regex, JSON schema validation, string match | For structured outputs: classification, extraction |
| Human Eval | Argilla, Label Studio, custom UI | Gold-standard quality, expensive, slow |
| Benchmarks | MMLU, HumanEval, custom domain benchmarks | Standardized comparison across models |
| CI Integration | GitHub Actions, pytest, Vitest | Eval-as-tests with pass/fail thresholds |

Architecture Patterns

Pattern 1: Multi-Method Evaluation Pipeline

Eval Dataset (N test cases)
    │
    ├──→ [Exact Match] ──→ Precision/Recall/F1 (for structured outputs)
    │
    ├──→ [LLM-as-Judge] ──→ Rubric scores 1-5 per dimension
    │        │
    │        └── Calibrate: run judge on 20 pre-scored examples first
    │
    ├──→ [RAGAS] ──→ Faithfulness, Answer Relevance, Context Precision
    │        │
    │        └── For RAG systems only; measures retrieval + generation quality
    │
    └──→ [Human Eval] ──→ Gold-standard labels (sample 10-20%)
             │
             └── Use for calibrating LLM-as-judge, not as primary method

All results ──→ [Score Aggregation] ──→ [Trend Tracker] ──→ [CI Gate]

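# LLM-as-judge evaluation
import json

JUDGE_RUBRIC = """
Score the following response on a scale of 1-5 for each dimension:

- **Correctness** (1-5): Is the information factually accurate?
- **Completeness** (1-5): Does it address all parts of the question?
- **Clarity** (1-5): Is it well-organized and easy to understand?

Question: {question}
Expected: {expected}
Response: {response}

Return JSON: {{"correctness": N, "completeness": N, "clarity": N, "reasoning": "..."}}
"""

DIMENSIONS = ["correctness", "completeness", "clarity"]


async def llm_call(prompt: str, model: str, temperature: float = 0.0) -> str:
    """Placeholder, not a real library call: wire this to your LLM client.

    It must return the judge's raw text response.
    """
    raise NotImplementedError("connect llm_call to your LLM provider")


async def evaluate_with_judge(test_cases: list[dict], model_output_fn) -> dict:
    """Score every test case with the judge and return the mean per dimension."""
    if not test_cases:
        raise ValueError("test_cases must not be empty")

    results = []
    for case in test_cases:
        # Generate the response under test
        response = await model_output_fn(case["question"])
        judge_prompt = JUDGE_RUBRIC.format(
            question=case["question"],
            expected=case["expected"],
            response=response
        )
        # temperature=0 keeps judge scoring as repeatable as the provider allows
        scores = await llm_call(judge_prompt, model="claude-sonnet-4-20250514", temperature=0)
        # Assumes the judge returns bare JSON; json.loads raises if extra text is added
        results.append(json.loads(scores))

    # Aggregate: mean score per dimension across all test cases
    return {
        dim: sum(r[dim] for r in results) / len(results)
        for dim in DIMENSIONS
    }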
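
The exact-match branch of the pipeline diagram above reports precision, recall, and F1 for structured outputs. A minimal sketch for a single class label follows, assuming predictions and expected labels arrive as parallel lists of strings; the function name `exact_match_f1` and the sample labels are hypothetical.

```python
def exact_match_f1(predicted: list[str], expected: list[str], positive: str) -> dict:
    """Precision, recall, and F1 for one label, using exact string comparison."""
    if len(predicted) != len(expected):
        raise ValueError("predicted and expected must have the same length")

    pairs = list(zip(predicted, expected))
    tp = sum(p == positive and e == positive for p, e in pairs)  # true positives
    fp = sum(p == positive and e != positive for p, e in pairs)  # false positives
    fn = sum(p != positive and e == positive for p, e in pairs)  # false negatives

    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    f1 = 2 * precision * recall / (precision + recall) if precision + recall else 0.0
    return {"precision": precision, "recall": recall, "f1": f1}


# Illustrative classification outputs (made-up data)
predicted = ["spam", "ham", "spam", "spam"]
expected = ["spam", "spam", "ham", "spam"]
scores = exact_match_f1(predicted, expected, positive="spam")
print(scores)
```

For multi-class outputs, one option is to call this once per label and average the results; whether to macro- or micro-average depends on how imbalanced the labels are.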

Pattern 2: RAG Evaluation with RAGAS

Test Case: (question, ground_truth, retrieved_contexts)
    │
    ├──→ Faithfulness: Is the answer supported by retrieved contexts?
    │    Score = (claims supported by context) / (total claims in answer)
    │
    ├──→ Answer Relevance: Does the answer address the question?
    │    Score = cosine_sim(question, generated_questions_from_answer)
    │
    ├──→ Context Precision: Are relevant contexts ranked higher?
    │    Score = weighted precision of relevant contexts in top-k
    │
    └──→ Context Recall: Were all ground-truth facts retrievable?
         Score = (ground_truth_claims in contexts) / (total ground_truth_claims)

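# RAGAS evaluation
# Assumes the ragas 0.1.x API and column names; newer releases may use
# different column names, so check the docs for your installed version.
# evaluate() calls an LLM and an embedding model, so provider credentials
# must be configured before running.
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy, context_precision
from datasets import Dataset

# Inputs are parallel lists with one entry per test case:
#   questions: list[str], generated_answers: list[str],
#   retrieved_contexts: list[list[str]], expected_answers: list[str]
eval_dataset = Dataset.from_dict({
    "question": questions,
    "answer": generated_answers,
    "contexts": retrieved_contexts,
    "ground_truth": expected_answers,
})

result = evaluate(eval_dataset, metrics=[
    faithfulness, answer_relevancy, context_precision
])
print(result)  # e.g. {'faithfulness': 0.87, 'answer_relevancy': 0.92, ...} (illustrative values)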
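
The ratio-style scores in the diagram above (faithfulness and context recall) reduce to simple fractions once each claim has a supported/unsupported verdict. The sketch below shows only that final arithmetic; it is not how the RAGAS library is implemented, where claim extraction and verification are done by an LLM. The verdict lists are made-up inputs and `supported_ratio` is a hypothetical helper.

```python
def supported_ratio(verdicts: list[bool]) -> float:
    """Fraction of claims marked as supported; 0.0 when there are no claims."""
    if not verdicts:
        return 0.0
    return sum(verdicts) / len(verdicts)


# Faithfulness: 3 of the 4 claims in the answer are supported by the contexts
faithfulness_score = supported_ratio([True, True, False, True])

# Context recall: 1 of the 2 ground-truth claims appears in the contexts
context_recall_score = supported_ratio([True, False])

print(faithfulness_score, context_recall_score)  # prints 0.75 0.5
```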

Pattern 3: CI Regression Gate

On PR / prompt change:
    │
    ▼
[Run eval suite] ──→ scores
    │
    ▼
[Compare to baseline]
    ├── Score >= baseline - tolerance (2%) ──→ PASS (merge allowed)
    └── Score < baseline - tolerance        ──→ FAIL (block merge)
                                                  │
                                                  └── Report: which test cases regressed, by how much
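
The gate above can be sketched in a few lines. This version assumes the 2% tolerance is relative to the baseline score (the diagram does not say whether it is relative or absolute) and reports regressions per dimension rather than per test case. The names `find_regressions` and `ci_gate`, and the sample scores, are hypothetical.

```python
def find_regressions(
    baseline: dict[str, float],
    current: dict[str, float],
    tolerance: float = 0.02,
) -> dict[str, float]:
    """Return each dimension whose score fell below baseline * (1 - tolerance),
    mapped to the size of the drop."""
    regressions = {}
    for dim, base in baseline.items():
        floor = base * (1.0 - tolerance)  # assumption: tolerance is relative
        score = current[dim]
        if score < floor:
            regressions[dim] = base - score
    return regressions


def ci_gate(baseline: dict[str, float], current: dict[str, float], tolerance: float = 0.02) -> int:
    """Print a regression report and return a process exit code (0 = pass, 1 = fail)."""
    regressions = find_regressions(baseline, current, tolerance)
    for dim, drop in regressions.items():
        print(f"REGRESSION {dim}: {baseline[dim]:.2f} -> {current[dim]:.2f} (-{drop:.2f})")
    return 1 if regressions else 0


# Illustrative scores on a 1-5 scale (made-up data)
baseline = {"correctness": 4.50, "clarity": 4.20}
current = {"correctness": 4.20, "clarity": 4.15}
exit_code = ci_gate(baseline, current)
# In a CI job, pass exit_code to sys.exit() so a regression blocks the merge.
```

Here correctness falls below its floor of 4.41 and fails the gate, while clarity stays above its floor of about 4.12 and passes.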

Anti-Patterns

Evaluating without a baseline — A score of 4.2/5 means nothing without knowing the previous score was 4.5. Always track baselines and trends.
LLM-as-judge without calibration — Judges have biases (verbosity preference, position bias). Calibrate on 20+ pre-scored examples and check inter-rater agreement.
Too few test cases — 10 test cases produce noisy metrics. Target 50+ for reliable averages, 200+ for statistical confidence on sub-dimensions.
Evaluating only happy paths — Include adversarial inputs, edge cases, ambiguous questions, and out-of-scope queries. The model should fail gracefully.
Manual eval as the only method — Human evaluation is expensive and slow. Use it to calibrate automated methods, then run automated evals in CI.
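
For the calibration step, judge scores can be compared against human scores on the same pre-scored examples. The text does not name an agreement statistic; the sketch below uses Cohen's kappa as one reasonable choice, which is an assumption. For ordinal 1-5 scales a weighted kappa or a rank correlation may fit better. The score lists are made-up data.

```python
from collections import Counter


def cohens_kappa(judge: list[int], human: list[int]) -> float:
    """Cohen's kappa: agreement between two raters, corrected for chance."""
    if len(judge) != len(human) or not judge:
        raise ValueError("score lists must be non-empty and the same length")

    n = len(judge)
    # Observed agreement: fraction of items given identical scores
    observed = sum(j == h for j, h in zip(judge, human)) / n

    # Chance agreement: from each rater's marginal score distribution
    judge_counts = Counter(judge)
    human_counts = Counter(human)
    expected = sum(judge_counts[c] * human_counts[c] for c in judge_counts) / (n * n)

    if expected == 1.0:  # both raters used one identical score throughout
        return 1.0
    return (observed - expected) / (1.0 - expected)


# Eight pre-scored examples on a 1-5 scale (made-up data)
judge_scores = [5, 4, 4, 3, 5, 2, 4, 5]
human_scores = [5, 4, 3, 3, 5, 2, 4, 4]
kappa = cohens_kappa(judge_scores, human_scores)
print(f"kappa = {kappa:.2f}")  # prints kappa = 0.65
```

A result like this would suggest refining the rubric or the judge prompt and re-running calibration before trusting the judge in CI.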

Quality Checklist

[ ] Eval dataset has 50+ test cases with expected outputs or rubrics
[ ] Multiple eval dimensions defined (correctness, completeness, safety, etc.)
[ ] LLM-as-judge calibrated against human scores (inter-rater agreement > 0.8)
[ ] Baseline scores established and tracked over time
[ ] Regression threshold defined (e.g., fail if any dimension drops > 2%)
[ ] Edge cases and adversarial inputs included in eval dataset (minimum 20%)
[ ] Eval runs automated in CI on every prompt/model/pipeline change
[ ] Results stored with timestamps for trend analysis
[ ] Human eval used for calibration, not as sole evaluation method
[ ] RAGAS metrics used for RAG systems (faithfulness, relevance, precision)
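
Some checklist items can be verified automatically before an eval run. The sketch below checks dataset size, the presence of an expected output or rubric, and the share of edge and adversarial cases. The test-case schema (`expected`, `rubric`, and `kind` keys) and the function name `check_eval_dataset` are assumptions made for illustration.

```python
def check_eval_dataset(
    cases: list[dict],
    min_cases: int = 50,
    min_edge_fraction: float = 0.20,
) -> list[str]:
    """Return a list of checklist violations; an empty list means all checks pass."""
    problems = []

    if len(cases) < min_cases:
        problems.append(f"only {len(cases)} test cases; need at least {min_cases}")

    # Every case needs something to score against
    unscorable = [c for c in cases if not (c.get("expected") or c.get("rubric"))]
    if unscorable:
        problems.append(f"{len(unscorable)} cases lack an expected output or rubric")

    # Hypothetical 'kind' field tags each case as normal, edge, or adversarial
    edge = sum(1 for c in cases if c.get("kind") in {"edge", "adversarial"})
    if cases and edge / len(cases) < min_edge_fraction:
        problems.append(
            f"edge/adversarial cases are {edge / len(cases):.0%} of the dataset; "
            f"need at least {min_edge_fraction:.0%}"
        )
    return problems


# 50 normal cases plus 10 adversarial ones (made-up data)
cases = [{"question": f"q{i}", "expected": "a", "kind": "normal"} for i in range(50)]
cases += [{"question": f"e{i}", "rubric": "declines politely", "kind": "adversarial"} for i in range(10)]
problems = check_eval_dataset(cases)
print(problems)
```

With this sample data the size and scorability checks pass, but the adversarial share is 10 of 60 cases, so the edge-case check is reported.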

Build automated LLM evaluation pipelines with benchmarks, regression tests, RAGAS, and human eval workflows. Activate on: LLM evaluation, benchmark testing, eval pipeline, RAGAS, model regression tests. NOT for: traditional software testing (testing-expert), model training (ai-engineer).

development

Updated Apr 4, 2026

npx skillsauth add curiositech/windags-skills llm-evaluation-harness

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 4, 2026, 2:15 PM (4.3s, 1 file scanned)

SKILL.md

license:: Apache-2.0
name:: llm-evaluation-harness
description:: Build automated LLM evaluation pipelines with benchmarks, regression tests, RAGAS, and human eval workflows. Activate on: LLM evaluation, benchmark testing, eval pipeline, RAGAS, model regression tests. NOT for: traditional software testing (testing-expert), model training (ai-engineer).
allowed-tools:: Read,Write,Edit,Bash(python:*,pip:*,npm:*,npx:*)
category:: AI & Machine Learning
- skill:: fine-tuning-dataset-curator
reason:: Eval sets curated alongside training data measure fine-tune effectiveness

Related Skills

curiositech/revisiting-interview-data-analysing-turn

data-ai · Verified · Trusted · Community

license: Apache-2.0 NOT for unrelated tasks outside this domain.

8 · SKILL.md · Updated Jul 19, 2026

curiositech/redis-patterns-expert

development · Verified · Trusted · Community

Use when designing caching strategies (cache-aside, write-through, write-behind), implementing distributed locks, building rate limiters, leaderboards, real-time streams (XADD/consumer groups), pub/sub, or tuning eviction policies. Triggers: thundering-herd on cache miss, dogpile on key expiry, Redlock vs SET-NX-PX choice, sliding-window rate limiter, hot-key on a single cluster slot, big-key blowup, MULTI/EXEC across slots, KEYS in production. NOT for Redis Cluster operations/admin (different domain), embedded KV (SQLite, leveldb), in-process LRU caches, or Memcached.

8 · SKILL.md · Updated Jul 19, 2026

curiositech/react-server-components-boundary

tools · Verified · Trusted · Community

Drawing the `'use client'` boundary correctly in React Server Components apps (Next.js App Router, RSC frameworks): leaf-pushing, slot composition, serialization rules, and environment poisoning prevention. Grounded in react.dev and Next.js 16 docs.

8 · SKILL.md · Updated Jul 19, 2026

curiositech/rate-limiting-strategy

development · Verified · Trusted · Community

Use when designing rate limiting for an API, choosing between token bucket / sliding window / leaky bucket / fixed window, implementing it in Redis, deciding edge (Cloudflare/Upstash) vs origin enforcement, sizing per-user vs per-IP vs per-endpoint quotas, returning the right 429 response with Retry-After, or fixing the boundary-burst bug in fixed-window limiters. Triggers: 429 too many requests, INCR + EXPIRE, ZADD + ZREMRANGEBYSCORE + ZCARD, X-RateLimit-Remaining header, Cloudflare WAF rate limiting rules, Upstash @upstash/ratelimit, leaky bucket shaping vs policing, distributed rate limiter consistency. NOT for DDoS mitigation specifically (different scale), CAPTCHA / bot management, full WAF design, or per-user quota billing.

8 · SKILL.md · Updated Jul 19, 2026

Install manually

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global)
cp -r windags-skills/skills/llm-evaluation-harness ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

curiositech/windags-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT