skills/skillxiv-v0.0.2-claude-opus-4.6/agent-as-a-judge-evaluation-framework/SKILL.md
Transition from simple LLM-based evaluation to agentic judges that employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory. Survey of sophisticated evaluation paradigms for complex, specialized, and multi-step assessment tasks across diverse domains.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ADu2021/skillXiv agent-as-a-judge-evaluation-framework
Traditional LLM-as-a-Judge evaluation suffers from inherent limitations as assessments become increasingly complex, specialized, and multi-step: models exhibit biases, perform only shallow single-pass reasoning, and cannot verify assessments against real-world observations. For example, code correctness evaluation requires execution verification; mathematical proof assessment benefits from formal verification tools; and nuanced evaluations require consultation with domain experts. These limitations compound when evaluation decisions require persistent context or adaptive strategy adjustment.
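To make the execution-verification point concrete, here is a minimal sketch of a judge-side check that runs a submission against assertion-style test cases instead of asking a model whether the code looks correct. The function name `run_test_cases`, its return shape, and the use of a child Python interpreter are assumptions for illustration, not part of the surveyed framework. A child process gives isolation from the judge's own state but is not a security sandbox.

```python
import subprocess
import sys


def run_test_cases(code, test_cases, timeout=5.0):
    """Run `code` followed by each test snippet in a fresh interpreter.

    Hypothetical helper: returns a dict with the pass count, the total,
    and a list of (test_index, last_stderr_line) for each failing test.
    """
    failures = []
    for index, test in enumerate(test_cases):
        program = code + "\n" + test + "\n"
        try:
            proc = subprocess.run(
                [sys.executable, "-c", program],
                capture_output=True,
                text=True,
                timeout=timeout,
            )
            ok = proc.returncode == 0
            stderr = proc.stderr.strip()
            detail = stderr.splitlines()[-1] if stderr else ""
        except subprocess.TimeoutExpired:
            # A hanging submission counts as a failed test.
            ok, detail = False, "timeout"
        if not ok:
            failures.append((index, detail))
    return {
        "passed": len(test_cases) - len(failures),
        "total": len(test_cases),
        "failures": failures,
    }


if __name__ == "__main__":
    submission = "def add(a, b):\n    return a + b"
    tests = ["assert add(1, 2) == 3", "assert add(1, 1) == 3"]
    print(run_test_cases(submission, tests))
```

The resulting pass/fail record is an observation the judge can cite in its reasoning trace, which a single-pass LLM verdict cannot provide.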
Evolve from static LLM evaluation to autonomous agents employing structured planning, tool integration, multi-agent collaboration, and persistent memory.
class AgentAsAJudge:
    # parse_evaluation_plan, format_expert_opinions, and parse_synthesis are
    # helper functions assumed to be defined elsewhere. synthesize_results is
    # likewise referenced but not shown here.

    def __init__(self, llm_backbone, tools):
        self.llm = llm_backbone
        self.tools = tools  # Code executor, search, theorem prover, etc.
        self.evaluation_history = {}  # Persistent memory, keyed by submission id
        self.domain_experts = {}  # Maps expert domain -> reviewer (looked up by key below)

    def evaluate_complex_submission(self, submission, rubric, context=None):
        """Multi-step agentic evaluation with tool verification"""
        # Step 1: Planning - decompose evaluation into subtasks
        evaluation_plan = self.create_evaluation_plan(submission, rubric)
        # Output: List of specialized subtasks (code correctness, efficiency, style)

        # Step 2: Distributed Task Execution
        subtask_results = {}
        for subtask in evaluation_plan.subtasks:
            if subtask.requires_code_execution:
                # Delegate to code executor tool
                result = self.tools.execute_code(submission.code, subtask.test_cases)
                subtask_results[subtask.id] = result
            elif subtask.requires_search:
                # Delegate to search tool
                result = self.tools.search(subtask.query)
                subtask_results[subtask.id] = result
            elif subtask.requires_expert_review:
                # Route to domain expert (human-in-the-loop)
                result = self.domain_experts[subtask.expert_domain].review(submission)
                subtask_results[subtask.id] = result

        # Step 3: Multi-Agent Consensus (if multiple experts)
        if len(self.domain_experts) > 1:
            consensus_rating = self.aggregate_expert_opinions(subtask_results)
        else:
            consensus_rating = self.synthesize_results(subtask_results)

        # Step 4: Persistent Memory Update
        self.evaluation_history[submission.id] = {
            "plan": evaluation_plan,
            "results": subtask_results,
            "reasoning_trace": consensus_rating.trace,
            "final_score": consensus_rating.score,
            "confidence": consensus_rating.confidence,
        }
        return consensus_rating

    def create_evaluation_plan(self, submission, rubric):
        """Planning agent designs evaluation workflow"""
        prompt = f"""
        Given submission:
        {submission}
        And evaluation rubric:
        {rubric}
        Design an evaluation plan with steps:
        - What aspects require code execution verification?
        - What aspects need external tool use (search, calculation)?
        - What aspects require domain expert input?
        - In what order should evaluations proceed?
        """
        plan_text = self.llm.generate(prompt)
        return parse_evaluation_plan(plan_text)

    def aggregate_expert_opinions(self, expert_results):
        """Multi-agent debate mechanism"""
        debate_prompt = f"""
        Multiple experts have evaluated a submission:
        {format_expert_opinions(expert_results)}
        Please synthesize their opinions into a unified assessment.
        Highlight agreements and discuss disagreements.
        """
        synthesis = self.llm.generate(debate_prompt)
        return parse_synthesis(synthesis)
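The class calls three helpers it never defines: `parse_evaluation_plan`, `format_expert_opinions`, and `parse_synthesis`. One possible shape for them is sketched below. It assumes the LLM is instructed to answer in JSON, which the prompts shown do not currently request, and the `Subtask`, `EvaluationPlan`, and `Rating` dataclasses are hypothetical containers inferred from the attributes the class reads (`subtasks`, `requires_code_execution`, `score`, `confidence`, `trace`, and so on).

```python
import json
from dataclasses import dataclass, field


@dataclass
class Subtask:
    # Field names mirror the attributes accessed in evaluate_complex_submission.
    id: str
    requires_code_execution: bool = False
    requires_search: bool = False
    requires_expert_review: bool = False
    query: str = ""
    expert_domain: str = ""
    test_cases: list = field(default_factory=list)


@dataclass
class EvaluationPlan:
    subtasks: list


@dataclass
class Rating:
    score: float
    confidence: float
    trace: str = ""


def parse_evaluation_plan(plan_text):
    """Assumption: plan_text is a JSON list of subtask objects."""
    items = json.loads(plan_text)
    return EvaluationPlan(subtasks=[Subtask(**item) for item in items])


def format_expert_opinions(expert_results):
    """Render one bullet per subtask id, sorted for a stable prompt."""
    return "\n".join(
        f"- {subtask_id}: {result}"
        for subtask_id, result in sorted(expert_results.items())
    )


def parse_synthesis(synthesis):
    """Assumption: synthesis is a JSON object with score, confidence, trace."""
    data = json.loads(synthesis)
    return Rating(
        score=float(data["score"]),
        confidence=float(data["confidence"]),
        trace=data.get("trace", ""),
    )


if __name__ == "__main__":
    plan = parse_evaluation_plan(
        '[{"id": "t1", "requires_code_execution": true, '
        '"test_cases": ["assert f(1) == 2"]}]'
    )
    print(plan.subtasks[0])
    print(format_expert_opinions({"t1": "3/3 tests passed"}))
    print(parse_synthesis('{"score": 0.9, "confidence": 0.7, "trace": "ok"}'))
```

Structured output keeps parsing deterministic. With free-text LLM responses, these helpers would need more tolerant parsing and explicit handling of malformed replies.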
Stage 1: Procedural
Stage 2: Reactive
Stage 3: Self-Evolving
1. Multi-Agent Collaboration
2. Planning
3. Tool Integration
4. Memory & Personalization
5. Optimization Paradigms
Implemented in:
Challenge: Computational Expense
Challenge: Inference Latency
Challenge: Safety Risks from Tool Access
Challenge: Privacy Concerns with Persistent Memory
The full paper provides a comprehensive taxonomy of 50+ published approaches across these dimensions and application domains.