skills/auto-arena/SKILL.md
Automatically evaluate and compare multiple AI models or agents without pre-existing test data. Generates test queries from a task description, collects responses from all target endpoints, auto-generates evaluation rubrics, runs pairwise comparisons via a judge model, and produces win-rate rankings with reports and charts. Supports checkpoint resume, incremental endpoint addition, and judge model hot-swap. Use when the user asks to compare, benchmark, or rank multiple models or agents on a custom task, or run an arena-style evaluation.
npx skillsauth add agentscope-ai/openjudge auto-arena
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
End-to-end automated model comparison using the OpenJudge AutoArenaPipeline.
# Install OpenJudge
pip install py-openjudge
# Extra dependency for auto_arena (chart generation)
pip install matplotlib
| Info | Required? | Notes |
|------|-----------|-------|
| Task description | Yes | What the models/agents should do (set in config YAML) |
| Target endpoints | Yes | At least 2 OpenAI-compatible endpoints to compare |
| Judge endpoint | Yes | Strong model for pairwise evaluation (e.g. gpt-4, qwen-max) |
| API keys | Yes | Env vars: OPENAI_API_KEY, DASHSCOPE_API_KEY, etc. |
| Number of queries | No | Default: 20 |
| Seed queries | No | Example queries to guide generation style |
| System prompts | No | Per-endpoint system prompts |
| Output directory | No | Default: ./evaluation_results |
| Report language | No | "zh" (default) or "en" |
# Run evaluation
python -m cookbooks.auto_arena --config config.yaml --save
# Use pre-generated queries
python -m cookbooks.auto_arena --config config.yaml \
--queries_file queries.json --save
# Start fresh, ignore checkpoint
python -m cookbooks.auto_arena --config config.yaml --fresh --save
# Re-run only pairwise evaluation with new judge model
# (keeps queries, responses, and rubrics)
python -m cookbooks.auto_arena --config config.yaml --rerun-judge --save
import asyncio

from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline


async def main():
    # Build the pipeline from the YAML configuration file.
    pipeline = AutoArenaPipeline.from_config("config.yaml")
    result = await pipeline.evaluate()

    print(f"Best model: {result.best_pipeline}")
    # Print each model with its win rate in ranked order.
    for rank, (model, win_rate) in enumerate(result.rankings, 1):
        print(f"{rank}. {model}: {win_rate:.1%}")


asyncio.run(main())
import asyncio

from cookbooks.auto_arena.auto_arena_pipeline import AutoArenaPipeline
from cookbooks.auto_arena.schema import OpenAIEndpoint


async def main():
    # Configure the pipeline in code instead of YAML.
    # The "sk-..." values are placeholders for real API keys.
    pipeline = AutoArenaPipeline(
        task_description="Customer service chatbot for e-commerce",
        # Endpoints to compare (at least two).
        target_endpoints={
            "gpt4": OpenAIEndpoint(
                base_url="https://api.openai.com/v1",
                api_key="sk-...",
                model="gpt-4",
            ),
            "qwen": OpenAIEndpoint(
                base_url="https://dashscope.aliyuncs.com/compatible-mode/v1",
                api_key="sk-...",
                model="qwen-max",
            ),
        },
        # Judge model used for the pairwise comparisons.
        judge_endpoint=OpenAIEndpoint(
            base_url="https://api.openai.com/v1",
            api_key="sk-...",
            model="gpt-4",
        ),
        num_queries=20,
    )
    result = await pipeline.evaluate()
    print(f"Best: {result.best_pipeline}")


asyncio.run(main())
| Flag | Default | Description |
|------|---------|-------------|
| --config | — | Path to YAML configuration file (required) |
| --output_dir | config value | Override output directory |
| --queries_file | — | Path to pre-generated queries JSON (skip generation) |
| --save | False | Save results to file |
| --fresh | False | Start fresh, ignore checkpoint |
| --rerun-judge | False | Re-run pairwise evaluation only (keep queries/responses/rubrics) |
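When the CLI is driven from a script, the flags in the table above can be assembled programmatically. The following is a minimal sketch, not part of OpenJudge: `build_arena_command` is a hypothetical helper that only produces an argument list from the documented flags, and it assumes the flags can be combined freely.

```python
import sys


def build_arena_command(config, output_dir=None, queries_file=None,
                        save=True, fresh=False, rerun_judge=False):
    """Return an argv list for `python -m cookbooks.auto_arena` (hypothetical helper)."""
    cmd = [sys.executable, "-m", "cookbooks.auto_arena", "--config", config]
    if output_dir:
        cmd += ["--output_dir", output_dir]
    if queries_file:
        cmd += ["--queries_file", queries_file]
    if save:
        cmd.append("--save")
    if fresh:
        cmd.append("--fresh")
    if rerun_judge:
        cmd.append("--rerun-judge")
    return cmd


# Example: re-run only the judge step and save the results.
command = build_arena_command("config.yaml", rerun_judge=True)
print(" ".join(command[1:]))
```

The resulting list can be handed to `subprocess.run(command, check=True)` to launch the evaluation.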
task:
  description: "Academic GPT assistant for research and writing tasks"
target_endpoints:
  model_v1:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-4"
  model_v2:
    base_url: "https://api.openai.com/v1"
    api_key: "${OPENAI_API_KEY}"
    model: "gpt-3.5-turbo"
judge_endpoint:
  base_url: "https://api.openai.com/v1"
  api_key: "${OPENAI_API_KEY}"
  model: "gpt-4"
| Field | Required | Description |
|-------|----------|-------------|
| description | Yes | Clear description of the task models will be tested on |
| scenario | No | Usage scenario for additional context |
| Field | Default | Description |
|-------|---------|-------------|
| base_url | — | API base URL (required) |
| api_key | — | API key, supports ${ENV_VAR} (required) |
| model | — | Model name (required) |
| system_prompt | — | System prompt for this endpoint |
| extra_params | — | Extra API params (e.g. temperature, max_tokens) |
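The `${ENV_VAR}` support noted for api_key can be pictured as a simple placeholder substitution. This is a rough sketch under assumptions, not OpenJudge's actual loader: `expand_env` is a hypothetical function, and raising on an unset variable is an illustrative choice.

```python
import os
import re

# Matches ${NAME} placeholders such as ${OPENAI_API_KEY}.
_ENV_PATTERN = re.compile(r"\$\{([A-Za-z_][A-Za-z0-9_]*)\}")


def expand_env(value, env=None):
    """Replace each ${NAME} in value with env[NAME]; raise KeyError if NAME is unset."""
    env = os.environ if env is None else env

    def _substitute(match):
        name = match.group(1)
        if name not in env:
            raise KeyError(f"environment variable {name} is not set")
        return env[name]

    return _ENV_PATTERN.sub(_substitute, value)


endpoint = {
    "base_url": "https://api.openai.com/v1",
    "api_key": "${OPENAI_API_KEY}",
    "model": "gpt-4",
}
# A stand-in environment keeps the example independent of real credentials.
fake_env = {"OPENAI_API_KEY": "sk-test"}
resolved = {key: expand_env(val, fake_env) for key, val in endpoint.items()}
print(resolved["api_key"])  # sk-test
```

Keeping keys in environment variables means the YAML file can be committed without exposing credentials.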
The judge endpoint (judge_endpoint) takes the same fields as target_endpoints.<name>. Use a strong model (e.g. gpt-4, qwen-max) with a low temperature (~0.1) for consistent judgments.
| Field | Default | Description |
|-------|---------|-------------|
| num_queries | 20 | Total number of queries to generate |
| seed_queries | — | Example queries to guide generation |
| categories | — | Query categories with weights for stratified generation |
| endpoint | judge endpoint | Custom endpoint for query generation |
| queries_per_call | 10 | Queries generated per API call (1–50) |
| num_parallel_batches | 3 | Parallel generation batches |
| temperature | 0.9 | Sampling temperature (0.0–2.0) |
| top_p | 0.95 | Top-p sampling (0.0–1.0) |
| max_similarity | 0.85 | Dedup similarity threshold (0.0–1.0) |
| enable_evolution | false | Enable Evol-Instruct complexity evolution |
| evolution_rounds | 1 | Evolution rounds (0–3) |
| complexity_levels | ["constraints", "reasoning", "edge_cases"] | Evolution strategies |
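To illustrate what the max_similarity threshold does, here is a minimal deduplication sketch. It makes two assumptions: the similarity measure is the standard library's `difflib.SequenceMatcher` ratio, whereas the measure the pipeline actually uses is not stated here, and a query counts as a duplicate when its similarity to an already kept query reaches the threshold. `dedup_queries` is a hypothetical name.

```python
from difflib import SequenceMatcher


def dedup_queries(queries, max_similarity=0.85):
    """Keep a query only if it stays below the threshold against every kept query."""
    kept = []
    for query in queries:
        is_duplicate = any(
            SequenceMatcher(None, query.lower(), other.lower()).ratio() >= max_similarity
            for other in kept
        )
        if not is_duplicate:
            kept.append(query)
    return kept


queries = [
    "How do I return a damaged item?",
    "How do I return a damaged item",       # near-duplicate of the first query
    "What payment methods do you accept?",
]
unique_queries = dedup_queries(queries)
print(unique_queries)
```

Lowering max_similarity removes more near-duplicates at the cost of discarding queries that merely share phrasing.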
| Field | Default | Description |
|-------|---------|-------------|
| max_concurrency | 10 | Max concurrent API requests |
| timeout | 60 | Request timeout in seconds |
| retry_times | 3 | Retry attempts for failed requests |
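The three settings above interact in a way that a short asyncio sketch makes concrete. This is an assumption-laden illustration, not the pipeline's code: `call_with_limits` is a hypothetical helper, and it treats retry_times as the number of retries after the first attempt, which may differ from the actual semantics.

```python
import asyncio


async def call_with_limits(make_request, semaphore, timeout=60, retry_times=3):
    """Run make_request() under a concurrency cap, with a timeout and retries."""
    last_error = None
    for _ in range(retry_times + 1):  # first attempt plus retry_times retries
        try:
            async with semaphore:  # at most max_concurrency requests in flight
                return await asyncio.wait_for(make_request(), timeout=timeout)
        except Exception as exc:  # includes timeouts raised by wait_for
            last_error = exc
    raise last_error


async def demo():
    semaphore = asyncio.Semaphore(10)  # plays the role of max_concurrency
    attempts = {"count": 0}

    async def flaky_request():
        # Fails twice, then succeeds, to exercise the retry path.
        attempts["count"] += 1
        if attempts["count"] < 3:
            raise ConnectionError("transient failure")
        return "ok"

    result = await call_with_limits(flaky_request, semaphore, timeout=1, retry_times=3)
    return result, attempts["count"]


demo_result = asyncio.run(demo())
print(demo_result)  # ('ok', 3)
```

A production client would normally add a backoff delay between attempts; it is omitted here to keep the sketch short.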
| Field | Default | Description |
|-------|---------|-------------|
| output_dir | ./evaluation_results | Output directory |
| save_queries | true | Save generated queries |
| save_responses | true | Save model responses |
| save_details | true | Save detailed results |
| Field | Default | Description |
|-------|---------|-------------|
| enabled | false | Enable Markdown report generation |
| language | "zh" | Report language: "zh" or "en" |
| include_examples | 3 | Examples per section (1–10) |
| chart.enabled | true | Generate win-rate chart |
| chart.orientation | "horizontal" | "horizontal" or "vertical" |
| chart.show_values | true | Show values on bars |
| chart.highlight_best | true | Highlight best model |
| chart.matrix_enabled | false | Generate win-rate matrix heatmap |
| chart.format | "png" | Chart format: "png", "svg", or "pdf" |
Win rate: percentage of pairwise comparisons a model wins. Each pair is evaluated in both orders (original + swapped) so that a judge's preference for one position cancels out.
Rankings example:
1. gpt4_baseline [################----] 80.0%
2. qwen_candidate [############--------] 60.0%
3. llama_finetuned [##########----------] 50.0%
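The bar layout in the rankings example can be reproduced with a few lines of formatting code. This sketch assumes a 20-character bar and a list of (model, win_rate) pairs; `format_rankings` is a hypothetical helper, not an OpenJudge function.

```python
def format_rankings(rankings, bar_width=20):
    """Render (model, win_rate) pairs as text bars, one line per model."""
    lines = []
    for rank, (model, win_rate) in enumerate(rankings, 1):
        filled = round(win_rate * bar_width)
        bar = "#" * filled + "-" * (bar_width - filled)
        lines.append(f"{rank}. {model} [{bar}] {win_rate:.1%}")
    return lines


rankings = [
    ("gpt4_baseline", 0.8),
    ("qwen_candidate", 0.6),
    ("llama_finetuned", 0.5),
]
ranking_lines = format_rankings(rankings)
print("\n".join(ranking_lines))
```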
Win matrix: win_matrix[A][B] = how often model A beats model B across all queries.
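A small worked example shows how the win matrix and win rates relate. It rests on assumptions: judgments are hypothetical (first_model, second_model, winner) tuples, ties are not modeled, and win_matrix entries are expressed as fractions. The stored format in evaluation_results.json may differ.

```python
from collections import defaultdict


def build_win_matrix(judgments):
    """Return win_matrix[a][b] = fraction of a-vs-b comparisons won by a."""
    wins = defaultdict(lambda: defaultdict(int))
    totals = defaultdict(lambda: defaultdict(int))
    for first, second, winner in judgments:
        loser = second if winner == first else first
        wins[winner][loser] += 1
        totals[first][second] += 1
        totals[second][first] += 1
    models = sorted(totals)
    return {
        a: {b: wins[a][b] / totals[a][b] for b in models if b != a and totals[a][b]}
        for a in models
    }


def win_rates(judgments):
    """Return (model, win_rate) pairs sorted from best to worst."""
    wins = defaultdict(int)
    games = defaultdict(int)
    for first, second, winner in judgments:
        games[first] += 1
        games[second] += 1
        wins[winner] += 1
    rates = [(model, wins[model] / games[model]) for model in games]
    return sorted(rates, key=lambda item: item[1], reverse=True)


# Two queries, each judged in both presentation orders.
judgments = [
    ("A", "B", "A"), ("B", "A", "A"),  # query 1: A wins in either order
    ("A", "B", "A"), ("B", "A", "B"),  # query 2: the first-listed model wins
]
matrix = build_win_matrix(judgments)
rates = win_rates(judgments)
print(matrix)  # {'A': {'B': 0.75}, 'B': {'A': 0.25}}
print(rates)   # [('A', 0.75), ('B', 0.25)]
```

In query 2 the judge simply favors whichever response is shown first. Because both orders are counted, that preference contributes one win to each model and does not shift the ranking.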
The pipeline saves progress after each step. Interrupted runs resume automatically:
- --fresh: ignore checkpoint, start from scratch
- --rerun-judge: re-run only the pairwise evaluation step (useful when switching judge models); keeps queries, responses, and rubrics intact

evaluation_results/
├── evaluation_results.json # Rankings, win rates, win matrix
├── evaluation_report.md # Detailed Markdown report (if enabled)
├── win_rate_chart.png # Win-rate bar chart (if enabled)
├── win_rate_matrix.png # Matrix heatmap (if matrix_enabled)
├── queries.json # Generated test queries
├── responses.json # All model responses
├── rubrics.json # Generated evaluation rubrics
├── comparison_details.json # Pairwise comparison details
└── checkpoint.json # Pipeline checkpoint
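The resume behavior described earlier can be sketched as a lookup of which steps still need to run. Everything about the checkpoint layout here is an assumption for illustration: the `completed_steps` key, the step names, and the `steps_to_run` helper are hypothetical and do not describe the real contents of checkpoint.json.

```python
import json
import tempfile
from pathlib import Path

# Hypothetical step names, in pipeline order.
STEPS = ["queries", "responses", "rubrics", "evaluation"]


def steps_to_run(checkpoint_path, fresh=False, rerun_judge=False):
    """Return the steps that still need to run, given a checkpoint file."""
    path = Path(checkpoint_path)
    done = []
    if path.exists() and not fresh:  # fresh ignores any saved progress
        done = json.loads(path.read_text())["completed_steps"]
    if rerun_judge:  # keep everything except the pairwise evaluation
        done = [step for step in done if step != "evaluation"]
    return [step for step in STEPS if step not in done]


with tempfile.TemporaryDirectory() as tmp:
    checkpoint = Path(tmp) / "checkpoint.json"

    # An interrupted run that finished the first two steps.
    checkpoint.write_text(json.dumps({"completed_steps": ["queries", "responses"]}))
    resumed = steps_to_run(checkpoint)
    from_scratch = steps_to_run(checkpoint, fresh=True)

    # A finished run whose judge model is being swapped.
    checkpoint.write_text(json.dumps({"completed_steps": STEPS}))
    judge_only = steps_to_run(checkpoint, rerun_judge=True)

print(resumed)       # ['rubrics', 'evaluation']
print(from_scratch)  # ['queries', 'responses', 'rubrics', 'evaluation']
print(judge_only)    # ['evaluation']
```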
| Model prefix | Environment variable |
|-------------|---------------------|
| gpt-*, o1-*, o3-* | OPENAI_API_KEY |
| claude-* | ANTHROPIC_API_KEY |
| qwen-*, dashscope/* | DASHSCOPE_API_KEY |
| deepseek-* | DEEPSEEK_API_KEY |
| Custom endpoint | set api_key + base_url in config |
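The mapping in the table above can be expressed as a pattern lookup, which is handy for checking that the right variable is set before a run. This is a sketch only: `env_var_for_model` is a hypothetical helper, and glob-style matching via `fnmatchcase` is an assumption about how the prefixes are meant to be read.

```python
from fnmatch import fnmatchcase

# Patterns copied from the table above, checked in order.
KEY_ENV_BY_PATTERN = [
    ("gpt-*", "OPENAI_API_KEY"),
    ("o1-*", "OPENAI_API_KEY"),
    ("o3-*", "OPENAI_API_KEY"),
    ("claude-*", "ANTHROPIC_API_KEY"),
    ("qwen-*", "DASHSCOPE_API_KEY"),
    ("dashscope/*", "DASHSCOPE_API_KEY"),
    ("deepseek-*", "DEEPSEEK_API_KEY"),
]


def env_var_for_model(model):
    """Return the environment variable name for a model, or None for custom endpoints."""
    for pattern, env_var in KEY_ENV_BY_PATTERN:
        if fnmatchcase(model, pattern):
            return env_var
    return None  # custom endpoint: set api_key and base_url in the config


for name in ["gpt-4", "qwen-max", "my-local-model"]:
    print(name, "->", env_var_for_model(name))
```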