skills/skillxiv-v0.0.2-claude-opus-4.6/answer-matching-evaluation/SKILL.md
Evaluate language models using open-ended answer generation and semantic matching instead of multiple choice, eliminating test-taking shortcuts and achieving near-perfect alignment with human judgment.
npx skillsauth add ADu2021/skillXiv answer-matching-evaluationInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Multiple choice benchmarks like MMLU have become standard for evaluating large language models, but they contain a critical vulnerability: answers can often be selected without fully comprehending the question. This occurs because language models can learn statistical shortcuts—patterns in option formulations, letter distributions, or option lengths—that correlate with correct answers. Answer matching eliminates this vulnerability by asking models to generate free-form responses and comparing them semantically against reference answers, capturing genuine understanding rather than test-taking ability.
The shift from multiple choice to answer matching represents a fundamental change in evaluation methodology. Instead of constraining models to select from given options, the approach asks "what is your answer?" and then uses another language model to judge whether the generated response matches the reference. Human annotation studies show this method achieves near-perfect alignment with human grading, while multiple choice approaches often diverge significantly from human judgments.
Answer matching operates on three principles: generate freely, compare semantically, and judge holistically. Models produce unrestricted responses to questions without seeing answer options. These responses are then compared against reference answers using a language model as judge, which evaluates whether the generated answer is semantically equivalent to the reference, regardless of exact wording. This separates genuine understanding from pattern exploitation.
The evaluation framework is model-agnostic—any LLM can serve as the judge, making it scalable and adaptable to different settings. The judge must assess meaning equivalence, accounting for synonyms, rephrasing, and different but correct formulations. This requires semantic reasoning rather than exact matching, which naturally aligns with how humans grade open-ended questions.
The answer matching system comprises four components:
The architecture is deliberately simple to maximize reproducibility and minimize judge bias. The judge's role is binary semantic equivalence determination, not complex reasoning or partial credit assignment.
Begin by preparing evaluation questions and reference answers without multiple choice options:
import json
from typing import Dict, List
def prepare_open_ended_dataset(multiple_choice_data: List[Dict]) -> List[Dict]:
"""
Convert multiple choice data to open-ended format.
Extracts questions and gold standard answers, removing multiple choice
options to prevent models from exploiting letter patterns or shortcuts.
"""
open_ended_dataset = []
for item in multiple_choice_data:
# Extract question and map answer option to text
question = item['question']
correct_option = item['correct_option'] # 'A', 'B', 'C', or 'D'
options = item['options']
# Get the reference answer text
option_index = ord(correct_option) - ord('A')
reference_answer = options[option_index]
open_ended_dataset.append({
'question': question,
'reference_answer': reference_answer,
'domain': item.get('domain', 'general')
})
return open_ended_dataset
def format_for_model_inference(question: str) -> str:
"""
Format question for free-form generation without options.
Uses a clear, open-ended prompt that encourages direct answers
without giving away reference answer format or length.
"""
prompt = f"""Answer the following question directly and concisely:
Question: {question}
Answer:"""
return prompt
Now implement the judging mechanism:
import re
from transformers import AutoTokenizer, AutoModelForCausalLM
import torch
class AnswerMatcher:
"""
Judge semantic equivalence between model responses and reference answers.
Uses an LLM to evaluate whether a generated answer matches the reference
answer semantically, accounting for rephrasing, synonyms, and minor variations.
"""
def __init__(self, judge_model="meta-llama/Llama-2-70b-chat-hf"):
self.tokenizer = AutoTokenizer.from_pretrained(judge_model)
self.model = AutoModelForCausalLM.from_pretrained(
judge_model,
torch_dtype=torch.float16,
device_map="auto"
)
self.device = next(self.model.parameters()).device
def build_judge_prompt(self, question: str, generated: str, reference: str) -> str:
"""
Construct judge prompt for semantic equivalence evaluation.
Provides context to judge about the question, both responses, and
asks for a binary equivalence judgment with brief reasoning.
"""
judge_prompt = f"""You are evaluating whether a model's generated answer is equivalent to a reference answer for a question.
Question: {question}
Reference Answer: {reference}
Generated Answer: {generated}
Determine if the generated answer is semantically equivalent to the reference answer. The answer can be rephrased or worded differently, but must convey the same core meaning and factual content.
Respond ONLY with:
EQUIVALENT - if the answers are semantically equivalent
NOT_EQUIVALENT - if the answers differ significantly in meaning or facts
Your judgment:"""
return judge_prompt
def judge_answer(self, question: str, generated: str, reference: str) -> bool:
"""
Return True if generated answer matches reference semantically.
Uses LLM judgment with structured output parsing to determine
whether a free-form response conveys the correct answer.
"""
prompt = self.build_judge_prompt(question, generated, reference)
# Tokenize and generate
inputs = self.tokenizer(prompt, return_tensors="pt").to(self.device)
with torch.no_grad():
outputs = self.model.generate(
**inputs,
max_new_tokens=50,
temperature=0.2,
top_p=0.9,
do_sample=False
)
# Parse judgment from output
response = self.tokenizer.decode(outputs[0], skip_special_tokens=True)
response_lower = response.lower()
if "equivalent" in response_lower and "not" not in response_lower:
return True
elif "not_equivalent" in response_lower or "not equivalent" in response_lower:
return False
else:
# Default to stricter judgment if unclear
return False
def evaluate_model(self, model, dataset: List[Dict], batch_size: int = 8) -> Dict:
"""
Evaluate a model using answer matching across a dataset.
Generates responses for all questions, judges each against reference
answers, and returns accuracy metrics broken down by domain.
"""
total_correct = 0
total_evaluated = 0
domain_scores = {}
for item in dataset:
question = item['question']
reference = item['reference_answer']
domain = item.get('domain', 'general')
# Generate model response
prompt = format_for_model_inference(question)
generated = model.generate(prompt) # Assumed method exists
# Judge equivalence
is_correct = self.judge_answer(question, generated, reference)
total_correct += is_correct
total_evaluated += 1
# Track by domain
if domain not in domain_scores:
domain_scores[domain] = {'correct': 0, 'total': 0}
domain_scores[domain]['correct'] += is_correct
domain_scores[domain]['total'] += 1
# Compute accuracy
accuracy = total_correct / total_evaluated
domain_accuracy = {d: s['correct'] / s['total'] for d, s in domain_scores.items()}
return {
'overall_accuracy': accuracy,
'total_evaluated': total_evaluated,
'domain_accuracy': domain_accuracy
}
Implement comparison with traditional multiple choice evaluation:
def compare_evaluation_methods(model, dataset_mc: List[Dict], dataset_oe: List[Dict]) -> Dict:
"""
Compare multiple choice vs answer matching evaluation on same content.
Tests model under both evaluation paradigms to quantify shortcut
exploitation and show ranking differences between methods.
"""
# Multiple choice evaluation
mc_correct = 0
for item in dataset_mc:
# Model selects from options - susceptible to shortcuts
question_with_options = format_mc_question(item)
selection = model.select_option(question_with_options)
correct_option = item['correct_option']
if selection == correct_option:
mc_correct += 1
# Answer matching evaluation
matcher = AnswerMatcher()
oe_results = matcher.evaluate_model(model, dataset_oe)
# Compare agreement with human annotations
comparison = {
'multiple_choice_accuracy': mc_correct / len(dataset_mc),
'answer_matching_accuracy': oe_results['overall_accuracy'],
'human_agreement_mc': compute_human_agreement(dataset_mc, 'multiple_choice'),
'human_agreement_am': compute_human_agreement(dataset_oe, 'answer_matching'),
'ranking_shift': compute_model_ranking_difference()
}
return comparison
Hyperparameter Table:
| Parameter | Default | Range | Notes | |-----------|---------|-------|-------| | Judge model size | 70B | 7B to 405B | Larger judges more accurate but slower; 70B+ recommended | | Judge temperature | 0.2 | 0.0 to 0.5 | Lower = more consistent judgments; avoid >0.5 for evaluation | | Response length limit | 1024 tokens | 128 to 2048 | Prevent gaming via verbose answers; adjust per domain | | Judge batch size | 1 | 1-32 | Larger batches faster but use more memory | | Equivalence threshold | 0.5 | N/A | Binary decision; no partial credit by default |
When to Use:
When NOT to Use:
Common Pitfalls:
Authors (2025). Answer Matching Outperforms Multiple Choice for Language Model Evaluation. arXiv preprint arXiv:2507.02856. https://arxiv.org/abs/2507.02856
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.