skills/advanced-evaluation/SKILL.md
This skill should be used when the user asks to "implement LLM-as-judge", "compare model outputs", "create evaluation rubrics", "mitigate evaluation bias", or mentions direct scoring, pairwise compari
npx skillsauth add ranbot-ai/awesome-skills advanced-evaluationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill covers production-grade techniques for evaluating LLM outputs using LLMs as judges. It synthesizes research from academic papers, industry practices, and practical implementation experience into actionable patterns for building reliable evaluation systems.
Key insight: LLM-as-a-Judge is not a single technique but a family of approaches, each suited to different evaluation contexts. Choosing the right approach and mitigating known biases is the core competency this skill develops.
Activate this skill when:
Evaluation approaches fall into two primary categories with distinct reliability profiles:
Direct Scoring: A single LLM rates one response on a defined scale.
Pairwise Comparison: An LLM compares two responses and selects the better one.
Research from the MT-Bench paper (Zheng et al., 2023) establishes that pairwise comparison achieves higher agreement with human judges than direct scoring for preference-based evaluation, while direct scoring remains appropriate for objective criteria with clear ground truth.
LLM judges exhibit systematic biases that must be actively mitigated:
Position Bias: First-position responses receive preferential treatment in pairwise comparison. Mitigation: Evaluate twice with swapped positions, use majority vote or consistency check.
Length Bias: Longer responses are rated higher regardless of quality. Mitigation: Explicit prompting to ignore length, length-normalized scoring.
Self-Enhancement Bias: Models rate their own outputs higher. Mitigation: Use different models for generation and evaluation, or acknowledge limitation.
Verbosity Bias: Detailed explanations receive higher scores even when unnecessary. Mitigation: Criteria-specific rubrics that penalize irrelevant detail.
Authority Bias: Confident, authoritative tone rated higher regardless of accuracy. Mitigation: Require evidence citation, fact-checking layer.
Choose metrics based on the evaluation task structure:
| Task Type | Primary Metrics | Secondary Metrics | |-----------|-----------------|-------------------| | Binary classification (pass/fail) | Recall, Precision, F1 | Cohen's κ | | Ordinal scale (1-5 rating) | Spearman's ρ, Kendall's τ | Cohen's κ (weighted) | | Pairwise preference | Agreement rate, Position consistency | Confidence calibration | | Multi-label | Macro-F1, Micro-F1 | Per-label precision/recall |
The critical insight: High absolute agreement matters less than systematic disagreement patterns. A judge that consistently disagrees with humans on specific criteria is more problematic than one with random noise.
Direct scoring requires three components: clear criteria, a calibrated scale, and structured output format.
Criteria Definition Pattern:
Criterion: [Name]
Description: [What this criterion measures]
Weight: [Relative importance, 0-1]
Scale Calibration:
Prompt Structure for Direct Scoring:
You are an expert evaluator assessing response quality.
## Task
Evaluate the following response against each criterion.
## Original Prompt
{prompt}
## Response to Evaluate
{response}
## Criteria
{for each criterion: name, description, weight}
## Instructions
For each criterion:
1. Find specific evidence in the response
2. Score according to the rubric (1-{max} scale)
3. Justify your score with evidence
4. Suggest one specific improvement
## Output Format
Respond with structured JSON containing scores, justifications, and summary.
Chain-of-Thought Requirement: All scoring prompts must require justification before the score. Research shows this improves reliability by 15-25% compared to score-first approaches.
Pairwise comparison is inherently more reliable for preference-
testing
Fix SEO indexing issues, crawl budget problems, and Search Console coverage errors for Next.js apps. Covers canonical tags, noindex audits, sitemap health, static rendering, and internal linking.
data-ai
Analyze AI disruption pressure across a business, map competitive exposure, and produce a 90-day defensive action plan.
tools
--- name: longbridge description: 125+ agent skills for Longbridge Securities — real-time quotes, charts, fundamentals, portfolio analysis, options, and more for HK/US/A-share/SG markets. Trilingual: Simplified Chinese, Traditional category: AI & Agents source: antigravity tags: [api, mcp, claude, ai, agent, security, cro] url: https://github.com/sickn33/antigravity-awesome-skills/tree/main/skills/longbridge --- # Longbridge ## Overview Longbridge is the official skill collection for Longbr
tools
Design, debug, and harden GitHub Actions CI/CD workflows, including reusable workflows, matrix builds, self-hosted runners, OIDC authentication, caching, environments, secrets, and release automation.