skills/cross-lingual-stability-judges-under/SKILL.md
Detect and fix cross-lingual evaluation instabilities in LLM-as-a-judge pipelines. Use when: 'audit my multilingual eval pipeline', 'check if my LLM judge is stable across languages', 'set up cross-lingual evaluation', 'calibrate judge scoring for non-English languages', 'diagnose ranking inversions in multilingual benchmarks', 'build controlled generation tests for eval reliability'.
npx skillsauth add ndpvt-web/arxiv-claude-skills cross-lingual-stability-judges-under
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to diagnose, measure, and fix evaluation instability when LLM judges score outputs across multiple languages. Based on Chung & Freienthal (2026), it implements a controlled generation protocol that isolates measurement noise from genuine performance differences: by generating semantically identical content across languages with fixed parameters, you can detect which evaluation dimensions (surface metrics vs. pragmatic judgments) break down in cross-lingual transfer. The core insight is that coherence and instruction-following scores from zero-shot LLM judges exhibit rank inversions and near-zero correlations across morphologically rich languages, while lexical diversity and semantic similarity remain stable.
Controlled generation as a diagnostic probe. The central method is to eliminate variation from the thing being evaluated so that any score differences must come from the evaluator. You generate synthetic dialogues using identical prompts, temperature, system instructions, and topic parameters across every target language. When generation conditions are held constant, a reliable judge should produce stable model rankings regardless of language. Ranking inversions under these conditions are direct evidence of evaluator failure, not model failure.
Two-tier metric taxonomy. The paper distinguishes surface-level metrics (lexical diversity via type-token ratio, surface similarity via character n-gram overlap, semantic similarity via multilingual embeddings) from pragmatic metrics (coherence, instruction-following, grammaticality, readability, fluency). The surface metrics maintain cross-language ranking stability. The pragmatic metrics -- which require discourse-level understanding -- exhibit rank inversions and near-zero Kendall tau correlations across language pairs. This means you cannot trust zero-shot LLM judges for discourse-level assessment in morphologically rich languages without language-specific calibration.
Language-specific calibration against human baselines. The fix is not to abandon LLM judges but to collect a small targeted set of human annotations per language (the paper uses Estonian native speaker annotations as reference) and calibrate judge scores against that ground truth. Measure Kendall tau and Spearman rho between judge rankings and human rankings per evaluation dimension. Dimensions where correlation drops below an acceptable threshold (e.g., tau < 0.3) need either prompt adaptation, few-shot examples in the target language, or replacement with a language-specific metric.
Define the evaluation dimensions. Separate your metrics into surface-level (lexical diversity, n-gram overlap, embedding similarity) and pragmatic (coherence, instruction-following, fluency, grammaticality). Label each dimension explicitly -- surface metrics will likely transfer; pragmatic metrics need validation.
Build the controlled generation protocol. Create a generation config that fixes all parameters: system prompt, topic/domain, temperature, max tokens, few-shot examples, and any structural constraints. The config must be identical across all target languages -- only the language instruction changes. Use a structured schema:
generation_config = {
    "domain": "customer_support",
    "industry": "telecommunications",
    "problem_type": "billing_dispute",
    "channel": "chat",
    "temperature": 0.7,
    "max_tokens": 512,
    "models": ["gpt-4.1-mini", "claude-sonnet-4-20250514"],
    "languages": ["et", "fi", "hu"],  # target languages
    "num_samples": 50,  # per model per language
}
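One way to enforce the "only the language instruction changes" rule is to render every prompt from a single template. The sketch below is illustrative only: `LANGUAGE_NAMES`, `build_prompt`, and `build_requests` are hypothetical helpers, and the prompt wording is an assumption rather than the paper's actual prompt.

```python
# Hypothetical helpers: the template wording and function names are assumptions.
LANGUAGE_NAMES = {"et": "Estonian", "fi": "Finnish", "hu": "Hungarian"}

generation_config = {
    "domain": "customer_support",
    "industry": "telecommunications",
    "problem_type": "billing_dispute",
    "channel": "chat",
    "temperature": 0.7,
    "max_tokens": 512,
    "models": ["gpt-4.1-mini", "claude-sonnet-4-20250514"],
    "languages": ["et", "fi", "hu"],
}


def build_prompt(config, lang):
    """Render the shared task description plus a language-only suffix."""
    shared = (
        f"Write a {config['channel']} dialogue between a customer and a support agent. "
        f"Domain: {config['domain']}. Industry: {config['industry']}. "
        f"Problem type: {config['problem_type']}."
    )
    return f"{shared} Write the whole dialogue in {LANGUAGE_NAMES[lang]}."


def build_requests(config):
    """One request per (model, language) cell; sampling parameters never vary."""
    return [
        {
            "model": model,
            "lang": lang,
            "prompt": build_prompt(config, lang),
            "temperature": config["temperature"],
            "max_tokens": config["max_tokens"],
        }
        for model in config["models"]
        for lang in config["languages"]
    ]


request_grid = build_requests(generation_config)
```

Because the language name is the only interpolated difference, any two prompts for the same config can be diffed to confirm that nothing else drifted between languages.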
Generate synthetic parallel data. For each (model, language) pair, generate N dialogues using the fixed config. Store outputs in JSONL with metadata tracking the generation parameters. Use LiteLLM or a similar abstraction layer to switch providers without changing the evaluation interface:
python -m src.conversation_generator \
    --provider openai -m gpt-4.1-mini \
    --lang et fi hu --samples 50
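For the storage side, a minimal sketch of a JSONL record that carries its generation metadata might look like the following. The field names and the `make_record`, `to_jsonl`, and `from_jsonl` helpers are assumptions for illustration, not the schema used by the paper's code.

```python
import json


def make_record(model, lang, sample_id, text, config):
    """Bundle one generated dialogue with the parameters that produced it."""
    return {
        "model": model,
        "lang": lang,
        "sample_id": sample_id,
        "text": text,
        "generation": {
            "domain": config["domain"],
            "temperature": config["temperature"],
            "max_tokens": config["max_tokens"],
        },
    }


def to_jsonl(records):
    """Serialize one JSON object per line; ensure_ascii=False keeps ä, õ, ő readable."""
    return "".join(json.dumps(r, ensure_ascii=False) + "\n" for r in records)


def from_jsonl(payload):
    """Parse JSONL text back into dicts, splitting on newlines only."""
    return [json.loads(line) for line in payload.split("\n") if line.strip()]


config = {"domain": "customer_support", "temperature": 0.7, "max_tokens": 512}
records = [
    make_record("gpt-4.1-mini", "et", 0, "Tere päevast! Kuidas saan aidata?", config),
    make_record("gpt-4.1-mini", "fi", 0, "Hei! Kuinka voin auttaa?", config),
]
payload = to_jsonl(records)
```

Writing `payload` to a file opened with `encoding="utf-8"` gives a JSONL file in which every row can be traced back to the exact parameters that generated it.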
Run surface-level metrics. Compute lexical diversity (type-token ratio, hapax legomena ratio), surface similarity (character/word n-gram overlap between parallel outputs), and semantic similarity (cosine similarity between multilingual embeddings such as LaBSE or multilingual-e5). Record per-model scores grouped by language.
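The lexical and surface metrics can be sketched with the standard library alone. The version below makes several simplifying assumptions that are not from the paper: whitespace tokenization, a hapax ratio defined over total tokens, and a multiset Jaccard overlap on character trigrams. Embedding-based similarity is left out because it needs an external model.

```python
from collections import Counter


def tokenize(text):
    """Assumption: lowercase whitespace tokenization, no lemmatization."""
    return text.lower().split()


def type_token_ratio(text):
    tokens = tokenize(text)
    return len(set(tokens)) / len(tokens) if tokens else 0.0


def hapax_ratio(text):
    """Share of tokens whose word form occurs exactly once."""
    counts = Counter(tokenize(text))
    total = sum(counts.values())
    return sum(1 for c in counts.values() if c == 1) / total if total else 0.0


def char_ngrams(text, n=3):
    normalized = " ".join(text.lower().split())
    return Counter(normalized[i:i + n] for i in range(len(normalized) - n + 1))


def char_ngram_overlap(a, b, n=3):
    """Multiset Jaccard overlap of character n-grams, in [0, 1]."""
    grams_a, grams_b = char_ngrams(a, n), char_ngrams(b, n)
    shared = sum((grams_a & grams_b).values())
    total = sum((grams_a | grams_b).values())
    return shared / total if total else 0.0
```

Type-token ratio is sensitive to text length, so comparing it across outputs is most meaningful when lengths are similar, which the fixed `max_tokens` setting helps with but does not guarantee.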
Run LLM-as-a-judge scoring. Score each generated dialogue on pragmatic dimensions using a structured rubric. Use explicit 0-N scales with anchor descriptions per level. Score grammaticality (0-4), readability (0-4), content coherence (0-3), and fluency (0-3). Capture both numeric scores and free-text explanations:
python -m src.llm_judge \
    --provider openai -m gpt-4.1 \
    --input data/conversations.json \
    --output results/judge_scores.jsonl
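To make the rubric explicit in code, a sketch like the one below can build the judge prompt and validate the reply against each scale. The scales come from the step above; the JSON reply format and the `build_judge_prompt` and `parse_judge_reply` helpers are assumptions, and the per-level anchor descriptions are omitted for brevity.

```python
import json

# Scales follow the rubric in the text; the reply format is an assumption.
RUBRIC = {
    "grammaticality": (0, 4),
    "readability": (0, 4),
    "coherence": (0, 3),
    "fluency": (0, 3),
}


def build_judge_prompt(dialogue, language_name):
    """Assemble a rubric prompt that asks for a score and explanation per dimension."""
    lines = [f"You are rating a dialogue written in {language_name}."]
    for dim, (lo, hi) in RUBRIC.items():
        lines.append(f"- {dim}: integer from {lo} to {hi}")
    lines.append('Reply with JSON: {"<dimension>": {"score": int, "explanation": str}}')
    lines.append("Dialogue:")
    lines.append(dialogue)
    return "\n".join(lines)


def parse_judge_reply(reply):
    """Return {dimension: {score, explanation}}; reject out-of-scale scores."""
    data = json.loads(reply)
    parsed = {}
    for dim, (lo, hi) in RUBRIC.items():
        entry = data[dim]  # a missing dimension raises KeyError
        score = entry["score"]
        if isinstance(score, bool) or not isinstance(score, int) or not lo <= score <= hi:
            raise ValueError(f"{dim}: score {score!r} is outside {lo}-{hi}")
        parsed[dim] = {"score": score, "explanation": str(entry.get("explanation", ""))}
    return parsed
```

Rejecting malformed replies at parse time keeps silent scale violations (for example a coherence score of 4 on a 0-3 scale) out of the ranking statistics.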
Compute ranking stability. For each evaluation dimension, rank the generator models within each language, then compute pairwise Kendall tau and Spearman rho across all language pairs. Use bootstrap confidence intervals (2000+ iterations) and permutation tests (5000+ iterations) for significance:
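from scipy.stats import kendalltau, spearmanr

# For each dimension, compare model rankings between language pairs.
# ranks_estonian and ranks_finnish must list the same models in the same order.
tau, p_value = kendalltau(ranks_estonian, ranks_finnish)
rho, rho_p_value = spearmanr(ranks_estonian, ranks_finnish)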
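The bootstrap and permutation steps can be sketched as follows. To stay dependency-free, the block uses a small pure-Python Kendall tau-b as a stand-in for `scipy.stats.kendalltau`; resampling dialogues within each (model, language) cell is an assumption about the resampling unit, and the toy scores are made up for illustration.

```python
import math
import random
from itertools import combinations
from statistics import mean


def kendall_tau_b(x, y):
    """Kendall tau-b over paired values; NaN if either side is fully tied."""
    conc = disc = tied_x = tied_y = 0
    for i, j in combinations(range(len(x)), 2):
        dx, dy = x[i] - x[j], y[i] - y[j]
        tied_x += dx == 0
        tied_y += dy == 0
        if dx * dy > 0:
            conc += 1
        elif dx * dy < 0:
            disc += 1
    n0 = len(x) * (len(x) - 1) // 2
    denom = math.sqrt((n0 - tied_x) * (n0 - tied_y))
    return (conc - disc) / denom if denom else float("nan")


def bootstrap_tau_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile CI for tau; scores_* map model -> per-dialogue scores."""
    rng = random.Random(seed)
    models = sorted(scores_a)
    taus = []
    for _ in range(n_boot):
        means_a = [mean(rng.choices(scores_a[m], k=len(scores_a[m]))) for m in models]
        means_b = [mean(rng.choices(scores_b[m], k=len(scores_b[m]))) for m in models]
        tau = kendall_tau_b(means_a, means_b)
        if not math.isnan(tau):
            taus.append(tau)
    taus.sort()
    return (taus[int((alpha / 2) * (len(taus) - 1))],
            taus[int((1 - alpha / 2) * (len(taus) - 1))])


def permutation_p_value(means_a, means_b, n_perm=5000, seed=0):
    """Two-sided p-value from shuffling model labels in one language."""
    rng = random.Random(seed)
    observed = kendall_tau_b(means_a, means_b)
    shuffled = list(means_b)
    hits = 0
    for _ in range(n_perm):
        rng.shuffle(shuffled)
        if abs(kendall_tau_b(means_a, shuffled)) >= abs(observed) - 1e-12:
            hits += 1
    return observed, (hits + 1) / (n_perm + 1)


# Toy scores (illustrative only): three models, four dialogues each.
et = {"m1": [3, 3, 2, 3], "m2": [2, 1, 2, 2], "m3": [0, 1, 1, 0]}
fi = {"m1": [3, 2, 3, 3], "m2": [1, 2, 1, 2], "m3": [1, 0, 0, 1]}
models = sorted(et)
observed, p = permutation_p_value([mean(et[m]) for m in models],
                                  [mean(fi[m]) for m in models])
ci_low, ci_high = bootstrap_tau_ci(et, fi)
```

With only a handful of generator models the permutation distribution has very few distinct values, so p-values cannot get small; the bootstrap interval is usually the more informative of the two in that setting.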
Identify unstable dimensions. Flag any dimension where tau < 0.3 or where rank inversions occur (model A > model B in language X, but model B > model A in language Y). These dimensions are unreliable for cross-lingual comparison without calibration.
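Rank inversions can be detected directly from per-language mean scores. The following is a minimal sketch; `find_inversions` is a hypothetical helper, and the input layout (`{language: {model: mean_score}}`) is an assumption.

```python
from itertools import combinations


def find_inversions(means):
    """List (model_a, model_b, lang_x, lang_y) where the pair's order flips.

    means: {language: {model: mean score on one dimension}}
    """
    inversions = []
    for lang_x, lang_y in combinations(sorted(means), 2):
        shared = sorted(set(means[lang_x]) & set(means[lang_y]))
        for a, b in combinations(shared, 2):
            gap_x = means[lang_x][a] - means[lang_x][b]
            gap_y = means[lang_y][a] - means[lang_y][b]
            if gap_x * gap_y < 0:  # opposite signs: A beats B in one language only
                inversions.append((a, b, lang_x, lang_y))
    return inversions


coherence_means = {
    "fi": {"Model A": 2.4, "Model B": 2.1},
    "et": {"Model A": 1.9, "Model B": 2.3},
}
flips = find_inversions(coherence_means)
```

Ties (a gap of exactly zero) are not counted as inversions here; whether a near-zero gap should count is a judgment call that a confidence interval on the gap can inform.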
Collect targeted human annotations. For each unstable dimension, collect 30-50 human judgments per target language from native speakers. This is the minimum viable calibration set. Compute inter-annotator agreement (Krippendorff's alpha) to verify annotation quality.
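For the agreement check, a compact sketch of Krippendorff's alpha is shown below. It uses the interval (squared-difference) distance, which is an assumption made here for simplicity; for ordinal rubric scores the ordinal distance is the more standard choice, and a maintained statistics package is preferable for real analyses.

```python
def krippendorff_alpha_interval(units):
    """Krippendorff's alpha with squared-difference distance.

    units: one list of ratings per annotated item; items may have
    different numbers of ratings, and items with fewer than two are dropped.
    """
    units = [u for u in units if len(u) >= 2]
    values = [v for u in units for v in u]
    n = len(values)
    if n < 2:
        raise ValueError("need at least two pairable ratings")
    # Observed disagreement: ordered within-item pairs, weighted by 1 / (m_u - 1).
    observed = sum(
        sum((a - b) ** 2 for a in u for b in u) / (len(u) - 1) for u in units
    ) / n
    # Expected disagreement: all ordered pairs of ratings, regardless of item.
    expected = sum((a - b) ** 2 for a in values for b in values) / (n * (n - 1))
    if expected == 0:
        return float("nan")  # every rating is identical; alpha is undefined
    return 1 - observed / expected


perfect = krippendorff_alpha_interval([[1, 1], [2, 2], [3, 3]])
opposed = krippendorff_alpha_interval([[0, 3], [3, 0], [0, 3]])
```

Values near 1 indicate strong agreement, values near 0 indicate agreement no better than chance, and negative values indicate systematic disagreement between annotators.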
Calibrate or replace unstable metrics. For each flagged dimension, compute the correlation between judge scores and human scores on the annotated set. If the judge already tracks the human rankings well (tau > 0.5), apply a linear calibration offset per language. If correlation is lower, either (a) add language-specific few-shot examples to the judge prompt, (b) switch to a judge model with stronger multilingual training, or (c) replace the LLM judge dimension with a specialized metric for that language.
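A per-language offset can be sketched as the mean difference between human and judge scores on the annotated items. The helper names and the clamping to the rubric scale below are assumptions; a fitted slope and intercept would be a natural extension if the judge also compresses or stretches the scale.

```python
from statistics import mean


def fit_offsets(judge, human):
    """Per-language additive offset: mean(human) - mean(judge).

    judge / human: {language: scores for the same items, in the same order}
    """
    return {lang: mean(human[lang]) - mean(judge[lang]) for lang in judge}


def apply_offset(score, lang, offsets, scale=(0, 3)):
    """Shift a judge score by its language offset, clamped to the rubric scale."""
    lo, hi = scale
    return min(hi, max(lo, score + offsets[lang]))


# Toy annotations (illustrative only): the judge under-scores Estonian by one point.
judge_scores = {"et": [1, 2, 3], "fi": [2, 2, 2]}
human_scores = {"et": [2, 3, 4], "fi": [2, 2, 2]}
offsets = fit_offsets(judge_scores, human_scores)
calibrated = apply_offset(2, "et", offsets, scale=(0, 4))
```

A constant offset leaves the ranking of models within a language unchanged; what it corrects is the comparability of absolute scores across languages, which is why it only makes sense once judge-human correlation is already acceptable.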
Document and version the stability report. Output a per-dimension, per-language-pair stability matrix showing tau values, confidence intervals, and calibration status. This report accompanies the evaluation results and flags which scores are trustworthy.
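A plain-text stability matrix can be rendered from a nested dictionary of tau values. The sketch below covers only the tau column and the instability flag; confidence intervals and calibration status would need extra fields. The `unstable_dimensions` and `render_report` helpers and the input layout are assumptions, and the sample values are a subset of the illustrative numbers in Example 1.

```python
def unstable_dimensions(matrix, threshold=0.3):
    """Dimensions whose tau falls below the threshold for at least one language pair."""
    return [dim for dim, row in matrix.items() if min(row.values()) < threshold]


def render_report(matrix, threshold=0.3):
    """Render {dimension: {"xx-yy": tau}} as an aligned text table."""
    pairs = sorted({pair for row in matrix.values() for pair in row})
    flagged = set(unstable_dimensions(matrix, threshold))
    lines = ["Dimension".ljust(24) + "".join(pair.rjust(8) for pair in pairs)]
    for dim, row in matrix.items():
        cells = "".join(
            f"{row[pair]:8.2f}" if pair in row else " " * 8 for pair in pairs
        )
        lines.append(dim.ljust(24) + cells + ("  [!]" if dim in flagged else ""))
    return "\n".join(lines)


matrix = {
    "Lexical diversity": {"en-fi": 0.87, "en-et": 0.82},
    "Coherence": {"en-fi": 0.34, "en-et": 0.28},
}
report = render_report(matrix)
```

Storing the underlying `matrix` alongside the rendered text (for example as JSON under version control) makes it possible to diff stability between evaluation runs.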
Example 1: Auditing a multilingual chatbot evaluation pipeline
User: "I'm evaluating three chatbot models across 4 languages (English, Finnish, Estonian, Hungarian) using GPT-4 as a judge. The rankings keep changing between languages. Is my judge broken?"
Approach:
Output:
Stability Report
=================
Dimension              | en-fi | en-et | en-hu | fi-et | fi-hu | et-hu
-----------------------|-------|-------|-------|-------|-------|------
Lexical diversity      | 0.87  | 0.82  | 0.85  | 0.91  | 0.83  | 0.88
Semantic similarity    | 0.79  | 0.75  | 0.81  | 0.84  | 0.77  | 0.80
Coherence              | 0.34  | 0.28  | 0.41  | 0.62  | 0.19  | 0.22 [!]
Instruction-following  | 0.21  | 0.15  | 0.29  | 0.45  | 0.11  | 0.18 [!]
[!] = Unstable: tau < 0.3 for at least one language pair; calibration required before cross-lingual comparison
Example 2: Building a controlled generation test suite from scratch
User: "I want to test whether my custom LLM judge is reliable for scoring Estonian and Hungarian outputs. How do I set this up?"
Approach:
Output:
# controlled_gen_config.py
CONFIG = {
    "domain": "tech_support",
    "problem_types": ["connectivity", "billing", "account_access"],
    "channel": "live_chat",
    "temperature": 0.7,
    "max_tokens": 512,
    "generator_models": [
        "gpt-4.1-mini",
        "claude-sonnet-4-20250514",
        "mistral-large-latest",
    ],
    "languages": ["en", "et", "hu"],
    "samples_per_cell": 50,
    "judge_model": "custom-judge-v2",
    "judge_dimensions": {
        "grammaticality": {"scale": [0, 4], "type": "surface-adjacent"},
        "readability": {"scale": [0, 4], "type": "surface-adjacent"},
        "coherence": {"scale": [0, 3], "type": "pragmatic"},
        "fluency": {"scale": [0, 3], "type": "pragmatic"},
    },
}
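Before spending generation budget, it can help to expand the config into its design grid and sanity-check the judge dimensions. The sketch below assumes that a "cell" is one (model, language, problem type) combination, which the config does not state explicitly; `design_cells` and `validate_dimensions` are hypothetical helpers, and only the keys they need are repeated here.

```python
from itertools import product

# Trimmed copy of the config above, limited to the keys used in this sketch.
CONFIG = {
    "problem_types": ["connectivity", "billing", "account_access"],
    "generator_models": ["gpt-4.1-mini", "claude-sonnet-4-20250514", "mistral-large-latest"],
    "languages": ["en", "et", "hu"],
    "samples_per_cell": 50,
    "judge_dimensions": {
        "grammaticality": {"scale": [0, 4], "type": "surface-adjacent"},
        "readability": {"scale": [0, 4], "type": "surface-adjacent"},
        "coherence": {"scale": [0, 3], "type": "pragmatic"},
        "fluency": {"scale": [0, 3], "type": "pragmatic"},
    },
}


def design_cells(config):
    """Assumption: one cell per (model, language, problem type) combination."""
    return list(
        product(config["generator_models"], config["languages"], config["problem_types"])
    )


def validate_dimensions(config):
    """Check each judge dimension has an increasing scale and a known type."""
    allowed_types = {"surface-adjacent", "pragmatic"}
    for name, spec in config["judge_dimensions"].items():
        lo, hi = spec["scale"]
        if not lo < hi:
            raise ValueError(f"{name}: scale {spec['scale']} is not increasing")
        if spec["type"] not in allowed_types:
            raise ValueError(f"{name}: unknown type {spec['type']!r}")


validate_dimensions(CONFIG)
cells = design_cells(CONFIG)
total_generations = len(cells) * CONFIG["samples_per_cell"]
```

Under this reading of "cell", the grid has 27 cells and 1,350 generations in total, and every generation is then scored on four judge dimensions.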
Example 3: Diagnosing a specific ranking inversion
User: "Model A scores higher than Model B in Finnish coherence, but Model B scores higher in Estonian coherence. Same judge, same prompt. What's going on?"
Approach:
Output:
Inversion Diagnosis: Model A vs Model B, Coherence
===================================================
Language  | Model A mean | Model B mean | Gap CI (95%)  | Judge-Human tau
----------|--------------|--------------|---------------|------------------
Finnish   | 2.4          | 2.1          | [0.1, 0.5]    | 0.15 [UNRELIABLE]
Estonian  | 1.9          | 2.3          | [-0.6, -0.2]  | 0.55 [OK]
Diagnosis: Judge scoring is unreliable for Finnish coherence (judge-human tau = 0.15).
The ranking inversion reflects judge instability, not a real performance difference.
Action: Calibrate Finnish coherence with native speaker annotations + few-shot examples.
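The columns in a diagnosis like this can be produced with a percentile bootstrap on the score gap plus a reliability check on judge-human agreement. The sketch below is one possible implementation under stated assumptions: the helper names are hypothetical, the 0.3 threshold is the one used earlier in this document, and the CI and tau values passed in at the end are the illustrative numbers from the table above.

```python
import random
from statistics import mean


def gap_ci(scores_a, scores_b, n_boot=2000, alpha=0.05, seed=0):
    """Percentile bootstrap CI for mean(scores_a) - mean(scores_b)."""
    rng = random.Random(seed)
    gaps = sorted(
        mean(rng.choices(scores_a, k=len(scores_a)))
        - mean(rng.choices(scores_b, k=len(scores_b)))
        for _ in range(n_boot)
    )
    return gaps[int((alpha / 2) * (n_boot - 1))], gaps[int((1 - alpha / 2) * (n_boot - 1))]


def is_clear_inversion(ci_x, ci_y):
    """True when both gap CIs exclude zero and point in opposite directions."""
    return (ci_x[0] > 0 and ci_y[1] < 0) or (ci_x[1] < 0 and ci_y[0] > 0)


def unreliable_languages(judge_human_tau, threshold=0.3):
    """Languages where the judge's agreement with humans is below the threshold."""
    return sorted(lang for lang, tau in judge_human_tau.items() if tau < threshold)


inversion = is_clear_inversion((0.1, 0.5), (-0.6, -0.2))
suspect = unreliable_languages({"fi": 0.15, "et": 0.55})
```

An inversion whose gap intervals both exclude zero, combined with a low judge-human tau in one of the languages, points at the judge rather than the models as the source of the flip.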
Chung, I., & Freienthal, L. (2026). Cross-Lingual Stability of LLM Judges Under Controlled Generation: Evidence from Finno-Ugric Languages. First Workshop on Multilingual Multicultural Evaluation (MME), co-located with EACL 2026. arXiv:2602.02287
Key takeaway from the paper: surface metrics transfer cross-lingually but pragmatic judgments (coherence, instruction-following) do not -- use the controlled generation protocol to detect which dimensions of your evaluation pipeline are unreliable before trusting cross-lingual rankings.