ADu2021/agent-as-a-judge-evaluation-framework

Name: agent-as-a-judge-evaluation-framework
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/agent-as-a-judge-evaluation-framework/SKILL.md

npx skillsauth add ADu2021/skillXiv agent-as-a-judge-evaluation-framework

When to Use This Skill

Complex multi-step evaluation requiring decomposition into subtasks
Scenarios needing verification beyond model inference (code execution, theorem proving)
Assessments where tool access and external data are crucial
Domains requiring expert involvement or specialized knowledge
Evaluation where context persistence across decisions improves judgment

When NOT to Use This Skill

Simple classification tasks (LLM-as-Judge is sufficient)
Real-time latency-critical evaluation scenarios
Evaluations where tool access creates safety or security risks
Scenarios with limited computational budget for multi-agent systems

Problem Summary

Traditional LLM-as-a-Judge evaluation suffers from inherent limitations as assessments become increasingly complex, specialized, and multi-step: models exhibit biases, perform only shallow single-pass reasoning, and cannot verify assessments against real-world observations. For example, code correctness evaluation requires execution verification; mathematical proof assessment benefits from formal verification tools; and nuanced evaluations require consultation with domain experts. These limitations compound when evaluation decisions require persistent context or adaptive strategy adjustment.

Solution: Agent-as-a-Judge Framework

Evolve from static LLM evaluation to autonomous agents employing structured planning, tool integration, multi-agent collaboration, and persistent memory.

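# The helpers parse_evaluation_plan, format_expert_opinions, and parse_synthesis
# are assumed to be defined elsewhere; they are not shown in this sketch.
class AgentAsAJudge:
    def __init__(self, llm_backbone, tools, domain_experts=None):
        self.llm = llm_backbone
        self.tools = tools  # Code executor, search, theorem prover, etc.
        self.evaluation_history = {}
        # Maps a domain name to a reviewer object that exposes review(submission)
        self.domain_experts = domain_experts or {}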

    def evaluate_complex_submission(self, submission, rubric, context=None):
        """Multi-step agentic evaluation with tool verification"""

        # Step 1: Planning - decompose evaluation into subtasks
        evaluation_plan = self.create_evaluation_plan(submission, rubric)
        # Output: List of specialized subtasks (code correctness, efficiency, style)

        # Step 2: Distributed Task Execution
        subtask_results = {}
        for subtask in evaluation_plan.subtasks:
            if subtask.requires_code_execution:
                # Delegate to code executor tool
                result = self.tools.execute_code(submission.code, subtask.test_cases)
                subtask_results[subtask.id] = result
            elif subtask.requires_search:
                # Delegate to search tool
                result = self.tools.search(subtask.query)
                subtask_results[subtask.id] = result
            elif subtask.requires_expert_review:
                # Route to domain expert (human-in-the-loop)
                result = self.domain_experts[subtask.expert_domain].review(submission)
                subtask_results[subtask.id] = result

        # Step 3: Multi-Agent Consensus (if multiple experts)
        if len(self.domain_experts) > 1:
            consensus_rating = self.aggregate_expert_opinions(subtask_results)
        else:
            consensus_rating = self.synthesize_results(subtask_results)

        # Step 4: Persistent Memory Update
        self.evaluation_history[submission.id] = {
            "plan": evaluation_plan,
            "results": subtask_results,
            "reasoning_trace": consensus_rating.trace,
            "final_score": consensus_rating.score,
            "confidence": consensus_rating.confidence
        }

        return consensus_rating

    def create_evaluation_plan(self, submission, rubric):
        """Planning agent designs evaluation workflow"""
        prompt = f"""
        Given submission:
        {submission}

        And evaluation rubric:
        {rubric}

        Design an evaluation plan with steps:
        - What aspects require code execution verification?
        - What aspects need external tool use (search, calculation)?
        - What aspects require domain expert input?
        - In what order should evaluations proceed?
        """
        plan_text = self.llm.generate(prompt)
        return parse_evaluation_plan(plan_text)

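    def aggregate_expert_opinions(self, expert_results):
        """Multi-agent debate mechanism"""
        debate_prompt = f"""
        Multiple experts have evaluated a submission:
        {format_expert_opinions(expert_results)}

        Please synthesize their opinions into a unified assessment.
        Highlight agreements and discuss disagreements.
        """
        synthesis = self.llm.generate(debate_prompt)
        return parse_synthesis(synthesis)

    def synthesize_results(self, subtask_results):
        """Single-judge synthesis, used when at most one expert is registered.

        Assumed implementation: the original calls this method without defining it.
        """
        synthesis_prompt = f"""
        The following subtask results were collected for a submission:
        {subtask_results}

        Please combine them into a single assessment with a score and a confidence.
        """
        synthesis = self.llm.generate(synthesis_prompt)
        return parse_synthesis(synthesis)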

Three Developmental Stages

Stage 1: Procedural

Fixed workflows with predetermined decision rules
Example: Template-based rubric following
Limitation: Cannot adapt to submission-specific characteristics
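
As a rough illustration of the procedural stage, the sketch below applies a fixed rubric in a fixed order and returns a weighted score. The `RubricItem` type, the `RUBRIC` contents, and the substring checks are hypothetical and chosen only for the example; they are not taken from the paper.

```python
from dataclasses import dataclass
from typing import Callable, List


@dataclass(frozen=True)
class RubricItem:
    name: str
    weight: int
    check: Callable[[str], bool]  # fixed, predetermined decision rule


# Hypothetical rubric for a Python source submission (illustrative checks only).
RUBRIC: List[RubricItem] = [
    RubricItem("defines_function", 5, lambda src: "def " in src),
    RubricItem("has_docstring", 3, lambda src: '"""' in src),
    RubricItem("short_lines", 2,
               lambda src: all(len(line) <= 100 for line in src.splitlines())),
]


def procedural_score(submission: str, rubric: List[RubricItem] = RUBRIC) -> float:
    """Apply every rubric item, regardless of the submission, and return the weighted score."""
    total = sum(item.weight for item in rubric)
    earned = sum(item.weight for item in rubric if item.check(submission))
    return earned / total
```

Every submission goes through the same checks, which is exactly the limitation noted above: nothing in the workflow reacts to what the submission actually contains.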

Stage 2: Reactive

Adaptive routing based on intermediate feedback
Tool invocation triggered by observed issues
Example: Code evaluation → runs tests → if failures, analyzes error messages
Improvement: Responds to evidence but still within pre-defined pathways
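
The reactive pattern (run tests, then analyze error messages only if something fails) might look like the following sketch. The function names and the two failure categories are assumptions made for illustration.

```python
def run_tests(func, test_cases):
    """Run func on (args, expected) pairs and collect failure messages."""
    failures = []
    for args, expected in test_cases:
        try:
            actual = func(*args)
            if actual != expected:
                failures.append(
                    f"wrong answer for {args}: got {actual!r}, expected {expected!r}"
                )
        except Exception as exc:
            failures.append(f"{type(exc).__name__} for {args}: {exc}")
    return failures


def analyze_failures(failures):
    """Follow-up step, invoked only when tests fail: bucket failures by kind."""
    kinds = {"wrong answer": 0, "exception": 0}
    for message in failures:
        key = "wrong answer" if message.startswith("wrong answer") else "exception"
        kinds[key] += 1
    return kinds


def reactive_evaluate(func, test_cases):
    """Route to failure analysis only when the evidence calls for it."""
    failures = run_tests(func, test_cases)
    if not failures:
        return {"verdict": "pass", "analysis": None}
    return {"verdict": "fail", "analysis": analyze_failures(failures)}
```

The routing is adaptive, but the set of possible paths (pass, or fail plus analysis) is still fixed in advance.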

Stage 3: Self-Evolving

Autonomous refinement of evaluation rubrics during operation
Updates criteria based on new submission types
Example: Discovers new code anti-patterns → adds to evaluation criteria
Most sophisticated: Continuous improvement of the evaluation process
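
A minimal sketch of a rubric that grows during operation is shown below. The `EvolvingRubric` class is hypothetical, and its substring matching is a deliberately crude stand-in for whatever detector a real system would use. In practice the newly discovered criterion would come from the judge itself; here the caller supplies it.

```python
class EvolvingRubric:
    def __init__(self, patterns):
        # criterion name -> substring that signals the anti-pattern
        self.patterns = dict(patterns)

    def evaluate(self, source):
        """Return the names of all known anti-patterns found in the source."""
        return sorted(name for name, needle in self.patterns.items() if needle in source)

    def learn(self, name, needle):
        """Add a newly discovered criterion; return True if the rubric changed."""
        if name in self.patterns:
            return False
        self.patterns[name] = needle
        return True
```

After `learn` is called, later submissions are checked against the extended criteria without any change to the evaluation code.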

Five Key Methodological Dimensions

1. Multi-Agent Collaboration

Collective consensus mechanisms (voting, debate)
Task specialization (separate agents for different aspects)
Expert diversity reduces individual model biases
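
One simple consensus mechanism is majority voting, sketched below. The function name and the use of the agreement ratio as a rough confidence signal are assumptions for illustration; debate-based schemes would need an LLM in the loop.

```python
from collections import Counter


def majority_vote(verdicts):
    """Return (winning verdict, agreement ratio).

    Ties are resolved in favor of the verdict that appeared first.
    """
    if not verdicts:
        raise ValueError("need at least one verdict")
    winner, votes = Counter(verdicts).most_common(1)[0]
    return winner, votes / len(verdicts)
```

For example, `majority_vote(["pass", "fail", "pass"])` picks `"pass"` with an agreement ratio of two thirds.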

2. Planning

Workflow orchestration (execution order matters)
Rubric discovery (adaptively refine evaluation criteria)
Strategy adaptation (route to appropriate tools/experts)
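
Because execution order matters, a planner can express subtasks as a dependency graph and derive a valid order from it. The sketch below uses `graphlib.TopologicalSorter` from the Python standard library (3.9+); the subtask names are hypothetical.

```python
from graphlib import TopologicalSorter


def order_subtasks(dependencies):
    """Return a valid execution order.

    `dependencies` maps each subtask to the set of subtasks that must finish first.
    A cyclic plan raises graphlib.CycleError.
    """
    return list(TopologicalSorter(dependencies).static_order())


# Hypothetical plan: tests run after static checks, efficiency review after tests.
plan = {
    "run_tests": {"static_checks"},
    "efficiency_review": {"run_tests"},
    "style_review": set(),
}
order = order_subtasks(plan)
```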

3. Tool Integration

Code execution (verify correctness)
Theorem provers (validate mathematical proofs)
Search engines (fact-checking, evidence gathering)
Specialized calculators and validators
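
As an illustration of code execution as a verification tool, the sketch below runs a snippet in a separate Python interpreter with a timeout and captures its output. The function name and return shape are assumptions. A separate process alone is not a security sandbox, so untrusted code would need stronger isolation.

```python
import subprocess
import sys


def run_python_snippet(code, timeout_s=5.0):
    """Run code in a fresh Python process and report whether it exited cleanly."""
    try:
        proc = subprocess.run(
            [sys.executable, "-c", code],
            capture_output=True,
            text=True,
            timeout=timeout_s,
        )
    except subprocess.TimeoutExpired:
        return {"ok": False, "stdout": "", "stderr": "timeout"}
    return {"ok": proc.returncode == 0, "stdout": proc.stdout, "stderr": proc.stderr}
```

The judge can then compare `stdout` against expected output instead of relying on the model's belief about what the code does.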

4. Memory & Personalization

Intermediate state tracking (evaluation history)
User context persistence (remember prior interactions)
Pattern learning (identify common issues)
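
Evaluation history and pattern learning can be approximated with a small in-memory store, as in the sketch below. The `EvaluationMemory` class and its issue labels are hypothetical; a real deployment would likely persist records to a database.

```python
from collections import Counter


class EvaluationMemory:
    def __init__(self):
        self.records = {}  # submission id -> list of issue labels
        self.issue_counts = Counter()

    def record(self, submission_id, issues):
        """Store the issues found for a submission, replacing any earlier entry."""
        previous = self.records.get(submission_id, [])
        self.issue_counts.subtract(previous)  # keep counts consistent on re-evaluation
        self.records[submission_id] = list(issues)
        self.issue_counts.update(issues)

    def common_issues(self, n=3):
        """Return up to n (issue, count) pairs, most frequent first."""
        return [(issue, count) for issue, count in self.issue_counts.most_common(n) if count > 0]
```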

5. Optimization Paradigms

Training-time (fine-tune judges on feedback)
Inference-time (adaptive evaluation strategy selection)
Hybrid (continuous learning + strategic optimization)

Application Domains

Agent-as-a-Judge approaches have been applied in:

Mathematics: Proof verification, solution correctness
Code Analysis: Execution correctness, efficiency, style
Fact-Checking: Multi-source verification, claim validation
Conversation Quality: Turn-level assessment, coherence evaluation
Medicine: Diagnosis assessment, treatment plan evaluation
Law: Case analysis, precedent relevance
Finance: Risk assessment, decision reasoning
Education: Student understanding, learning progress

Key Challenges & Mitigation

Challenge: Computational Expense

Mitigation: Cache evaluation results, use faster models for initial screening
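
Both mitigations can be combined in a cascade: a cheap judge screens every submission, a stronger judge is called only when the cheap score is uncertain, and results are cached by content hash. The class name, the thresholds, and the judge callables in this sketch are assumptions.

```python
import hashlib


class CascadeJudge:
    def __init__(self, cheap_judge, strong_judge, low=0.2, high=0.8):
        self.cheap = cheap_judge    # fast, inexpensive scorer: str -> float in [0, 1]
        self.strong = strong_judge  # slow, expensive scorer with the same signature
        self.low, self.high = low, high
        self.cache = {}
        self.strong_calls = 0

    def score(self, submission):
        key = hashlib.sha256(submission.encode("utf-8")).hexdigest()
        if key in self.cache:
            return self.cache[key]
        result = self.cheap(submission)
        if self.low < result < self.high:  # uncertain band: escalate
            self.strong_calls += 1
            result = self.strong(submission)
        self.cache[key] = result
        return result
```

The width of the uncertain band controls the cost and accuracy trade-off: a wider band sends more submissions to the expensive judge.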

Challenge: Inference Latency

Mitigation: Parallelize independent subtask evaluation
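
Independent subtasks can be submitted to a thread pool, which suits I/O-bound work such as LLM or search API calls. The sketch below uses `concurrent.futures.ThreadPoolExecutor`; the function name and the dictionary-of-callables interface are assumptions.

```python
from concurrent.futures import ThreadPoolExecutor


def run_subtasks_parallel(subtasks, max_workers=4):
    """Run independent subtasks concurrently.

    `subtasks` maps a name to a zero-argument callable; the result maps each
    name to that callable's return value. An exception raised inside a subtask
    is re-raised here when its result is collected.
    """
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        futures = {name: pool.submit(fn) for name, fn in subtasks.items()}
        return {name: future.result() for name, future in futures.items()}
```

Subtasks that depend on each other still need to run in order, so this applies only within a group of mutually independent subtasks.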

Challenge: Safety Risks from Tool Access

Mitigation: Sandboxed execution, permission-based tool access
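
Permission-based tool access can be as simple as an allowlist checked before every call. The `ToolRegistry` class below is a hypothetical sketch; it restricts which tools a judge may invoke but does not by itself sandbox what an allowed tool does.

```python
class ToolRegistry:
    def __init__(self, tools, allowed):
        self._tools = dict(tools)           # tool name -> callable
        self._allowed = frozenset(allowed)  # names this judge may invoke

    def call(self, name, *args, **kwargs):
        """Invoke a tool only if it exists and is on the allowlist."""
        if name not in self._tools:
            raise KeyError(f"unknown tool: {name}")
        if name not in self._allowed:
            raise PermissionError(f"tool not permitted: {name}")
        return self._tools[name](*args, **kwargs)
```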

Challenge: Privacy Concerns with Persistent Memory

Mitigation: Anonymization, access controls, retention policies
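
The sketch below illustrates two of these mitigations: user identifiers are replaced with keyed hashes before storage, and records older than a retention window are purged. The `PrivateHistory` class is hypothetical. Keyed hashing is pseudonymization rather than full anonymization, and access controls are out of scope here.

```python
import hashlib
import hmac
import time


class PrivateHistory:
    def __init__(self, secret_key: bytes, retention_s: float):
        self._key = secret_key
        self._retention_s = retention_s
        self._records = []  # (timestamp, pseudonym, score)

    def _pseudonym(self, user_id: str) -> str:
        return hmac.new(self._key, user_id.encode("utf-8"), hashlib.sha256).hexdigest()

    def add(self, user_id, score, now=None):
        now = time.time() if now is None else now
        self._records.append((now, self._pseudonym(user_id), score))

    def purge(self, now=None):
        """Drop records older than the retention window; return how many were removed."""
        now = time.time() if now is None else now
        cutoff = now - self._retention_s
        before = len(self._records)
        self._records = [r for r in self._records if r[0] >= cutoff]
        return before - len(self._records)

    def scores_for(self, user_id):
        pseudonym = self._pseudonym(user_id)
        return [score for _, p, score in self._records if p == pseudonym]
```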

Implementation Recommendations

Start with Procedural Stage: Template-based evaluation is the baseline
Integrate Tools Gradually: Add code execution, then search, then specialized tools
Implement Multi-Agent Review: Deploy domain experts for high-stakes decisions
Build Memory Infrastructure: Log evaluation decisions for pattern analysis
Add Self-Evolution Loop: Periodically review and refine rubrics

Survey Coverage

The full paper provides a comprehensive taxonomy of 50+ published approaches across these dimensions and application domains.

ADu2021/agent-as-a-judge-evaluation-framework

Transition from simple LLM-based evaluation to agentic judges that employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory. Survey of sophisticated evaluation paradigms for complex, specialized, and multi-step assessment tasks across diverse domains.

2 stars

tools

Updated Apr 16, 2026

npx skillsauth add ADu2021/skillXiv agent-as-a-judge-evaluation-framework

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 16, 2026, 3:02 PM (4.3s, 1 file scanned)

SKILL.md

name: agent-as-a-judge-evaluation-framework
title: Agent-as-a-Judge
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2601.05111
keywords: [Evaluation Systems, Agent Design, LLM Evaluation, Multi-Agent Systems]
description: Transition from simple LLM-based evaluation to agentic judges that employ planning, tool-augmented verification, multi-agent collaboration, and persistent memory. Survey of sophisticated evaluation paradigms for complex, specialized, and multi-step assessment tasks across diverse domains.

Related Skills

ADu2021/flow-map-trajectory-tilting
Category: testing
Verified, Trusted, Community
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
2 stars, SKILL.md, updated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts
Category: testing
Verified, Trusted, Community
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
2 stars, SKILL.md, updated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning
Category: data-ai
Verified, Trusted, Community
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
2 stars, SKILL.md, updated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution
Category: devops
Verified, Trusted, Community
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures, achieving 23% improvement on reasoning without gradient-based parameter updates or external training.
2 stars, SKILL.md, updated Apr 17, 2026

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Install manually

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global); create the folder first if it does not exist
mkdir -p ~/.claude/skills
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/agent-as-a-judge-evaluation-framework ~/.claude/skills/

Repository: ADu2021/skillXiv (2 stars)

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT