skills/compar-ia-french-governments/SKILL.md
Build multilingual LLM evaluation arenas and preference data collection pipelines modeled on France's compar:IA platform. Collects human preference pairs for RLHF/DPO training in non-English languages using blind pairwise comparison, Bradley-Terry ranking, and privacy-preserving filtering. Trigger phrases: 'build an LLM arena', 'collect preference data for DPO', 'create a chatbot comparison platform', 'multilingual RLHF data pipeline', 'pairwise LLM evaluation system', 'French language model leaderboard'.
npx skillsauth add ndpvt-web/arxiv-claude-skills compar-ia-french-governments
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and implement LLM evaluation arenas that collect human preference data for non-English languages, following the architecture pioneered by the French government's compar:IA platform. The core technique is a blind pairwise comparison interface backed by a FastAPI + SvelteKit stack, where users submit unconstrained prompts, receive anonymized side-by-side model responses, and cast preference votes that feed directly into RLHF/DPO training pipelines. The platform uses Bradley-Terry ranking models to aggregate pairwise judgments into a leaderboard, and applies conservative full-conversation exclusion (rather than span-level redaction) for privacy filtering.
Blind Pairwise Comparison with Two-Level Feedback. The compar:IA approach collects preference data through a three-step flow: (1) the user submits an unconstrained free-form prompt, (2) two randomly selected anonymous models generate side-by-side responses, and (3) the user provides feedback at two granularities — message-level reactions (per-turn thumbs up/down) and conversation-level votes (preferred model A, model B, or tie). Only after voting does the platform reveal model identities and metadata. This dual-granularity feedback yields richer training signal than single-vote systems: conversation-level votes produce standard DPO preference pairs, while message-level reactions enable turn-level reward modeling.
Bradley-Terry Ranking from Noisy Crowdsourced Data. Rather than raw win-rate tallies, the leaderboard uses the Bradley-Terry statistical model to aggregate pairwise votes into consistent global rankings. This handles intransitive outcomes (model A beats B, B beats C, but C beats A) and accounts for uneven matchup frequencies. The method treats each vote as evidence for a latent "strength" parameter per model, then fits these parameters by maximum likelihood. This is the same approach used by Chatbot Arena (LMSYS), applied here to French-language evaluation with 250,000+ votes across 104 models.
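For reference, the standard Bradley-Terry formulation can be written as below. The notation is introduced here for illustration and is not taken from the source: \theta_i is the latent strength of model i, and w_{ij} is the number of votes in which model i was preferred over model j.

```latex
P(i \succ j) = \frac{e^{\theta_i}}{e^{\theta_i} + e^{\theta_j}},
\qquad
\ell(\theta) = \sum_{i \neq j} w_{ij}
  \left[ \theta_i - \log\left( e^{\theta_i} + e^{\theta_j} \right) \right]
```

Maximizing \ell gives the strength estimates. Only differences \theta_i - \theta_j affect the likelihood, so the overall level has to be pinned down separately, for example with a sum-to-zero constraint or a small regularizer.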
Conservative Privacy Filtering via Full-Conversation Exclusion. Instead of attempting span-level PII anonymization (which risks incomplete masking and semantic distortion), compar:IA uses an LLM-based classifier to flag conversations likely containing personal data, then excludes the entire conversation and all associated votes. This removes approximately 5% of data but avoids the residual re-identification risks inherent in token-level redaction. This is a key architectural decision: accept modest data loss for strong privacy guarantees.
Create three complementary datasets with clear separation of concerns:
# conversations schema
conversation = {
    "conversation_id": "uuid",
    "timestamp": "ISO-8601",
    "language": "fr",  # detected via langdetect or fasttext
    "turns": [
        {
            "role": "user",
            "content": "Explique-moi la photosynthèse",
            "turn_index": 0,
        },
        {
            "role": "assistant_a",
            "content": "La photosynthèse est le processus...",
            "model_id": "hidden_until_vote",
            "turn_index": 1,
        },
        {
            "role": "assistant_b",
            "content": "C'est un mécanisme biologique...",
            "model_id": "hidden_until_vote",
            "turn_index": 1,
        },
    ],
}

# votes schema (conversation-level preference)
vote = {
    "vote_id": "uuid",
    "conversation_id": "uuid",
    "winner": "assistant_a",  # one of "assistant_a", "assistant_b", "tie"
    "timestamp": "ISO-8601",
}

# reactions schema (message-level feedback)
reaction = {
    "reaction_id": "uuid",
    "conversation_id": "uuid",
    "turn_index": 1,
    "target": "assistant_a",  # or "assistant_b"
    "value": "positive",  # or "negative"
    "timestamp": "ISO-8601",
}
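The allowed values for a field such as winner can also be enforced in code. The following is a minimal sketch using only the standard library; the Example 1 layout mentions Pydantic models, so treat this as a dependency-free stand-in. The names Winner, Vote, and validate_vote are hypothetical.

```python
from typing import Literal, TypedDict, get_args

# The three outcomes allowed by the votes schema.
Winner = Literal["assistant_a", "assistant_b", "tie"]


class Vote(TypedDict):
    vote_id: str
    conversation_id: str
    winner: Winner
    timestamp: str


def validate_vote(raw: dict) -> Vote:
    """Check required keys and the winner value; raise ValueError otherwise."""
    missing = set(Vote.__required_keys__) - set(raw)
    if missing:
        raise ValueError(f"missing fields: {sorted(missing)}")
    if raw["winner"] not in get_args(Winner):
        raise ValueError(f"invalid winner: {raw['winner']!r}")
    return Vote(
        vote_id=raw["vote_id"],
        conversation_id=raw["conversation_id"],
        winner=raw["winner"],
        timestamp=raw["timestamp"],
    )
```

Rejecting malformed votes at ingestion keeps the later ranking and export steps free of defensive checks.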
Implement these core endpoints:
POST /conversations: accepts a user prompt, selects two models via weighted random sampling, fans out inference requests in parallel, returns anonymized responses
POST /conversations/{id}/votes: records conversation-level preference (A, B, or tie) and reveals model identities
POST /conversations/{id}/reactions: records message-level thumbs up/down per turn
GET /leaderboard: returns current Bradley-Terry rankings
Route inference through a provider abstraction layer that supports OpenRouter, HuggingFace Inference Providers, and direct API calls, so models can be added or swapped without code changes.
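The parallel fan-out behind POST /conversations might look roughly like this. It is a sketch under stated assumptions, not the platform's implementation: stub_generate is a hypothetical stand-in for a real provider call, and the public/hidden split is just one way to keep model identities on the server until a vote arrives.

```python
import asyncio


async def stub_generate(model_id: str, prompt: str) -> str:
    """Hypothetical stand-in for a provider call (OpenRouter, HF, direct API)."""
    await asyncio.sleep(0)  # yield control, as a network request would
    return f"stub reply to: {prompt}"


async def create_conversation(prompt: str, model_a: str, model_b: str,
                              generate=stub_generate) -> dict:
    """Send one prompt to two models concurrently and separate what the
    client may see from what stays server-side until the vote."""
    reply_a, reply_b = await asyncio.gather(
        generate(model_a, prompt),
        generate(model_b, prompt),
    )
    return {
        # Returned to the browser: responses keyed by anonymous slot only.
        "public": {"assistant_a": reply_a, "assistant_b": reply_b},
        # Stored server-side and revealed only after the vote is recorded.
        "hidden": {"assistant_a": model_a, "assistant_b": model_b},
    }
```

Passing the generate callable as a parameter keeps the endpoint logic independent of any one provider, which fits the abstraction layer described above.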
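Build a split-pane chat interface where the two anonymized responses appear side by side, users can react to individual messages and cast a conversation-level vote, and model identities are revealed only after the vote.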
import random

def select_model_pair(models: list, elo_ratings: dict) -> tuple:
    """Select two models for comparison, favoring similar-strength pairings
    to maximize information gain for Bradley-Terry fitting."""
    if len(models) < 2:
        raise ValueError("need at least two models to form a pair")
    # Weight pairings by proximity in current Elo ratings
    # to get more discriminative comparisons
    weights = []
    pairs = []
    for i, m1 in enumerate(models):
        for m2 in models[i + 1:]:
            diff = abs(elo_ratings.get(m1, 1200) - elo_ratings.get(m2, 1200))
            weights.append(1.0 / (1.0 + diff / 400.0))
            pairs.append((m1, m2))
    chosen = random.choices(pairs, weights=weights, k=1)[0]
    # Randomize left/right assignment to prevent position bias
    return chosen if random.random() > 0.5 else (chosen[1], chosen[0])
import numpy as np
from scipy.optimize import minimize

def fit_bradley_terry(matchups: list[dict], model_names: list[str]) -> dict:
    """Fit Bradley-Terry model from pairwise votes.
    matchups: [{"winner": "modelA", "loser": "modelB"}, ...]
    Returns dict of model_name -> Elo-scale rating.
    """
    n = len(model_names)
    idx = {name: i for i, name in enumerate(model_names)}

    def neg_log_likelihood(params):
        nll = 0.0
        for m in matchups:
            wi = idx[m["winner"]]
            li = idx[m["loser"]]
            nll -= params[wi] - np.logaddexp(params[wi], params[li])
        # L2 regularization to prevent divergence (applied once, after the loop)
        nll += 0.01 * np.sum(params ** 2)
        return nll

    result = minimize(neg_log_likelihood, np.zeros(n), method="L-BFGS-B")
    strengths = result.x
    # Convert natural-log strengths to Elo-scale ratings (centered at 1200).
    # Elo uses base-10 odds over 400 points, so the factor is 400 / ln(10).
    elo = 1200 + (400 / np.log(10)) * (strengths - strengths.mean())
    return {name: float(elo[i]) for i, name in enumerate(model_names)}
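The votes schema records a slot (assistant_a, assistant_b, or tie), while fit_bradley_terry expects model names. One possible bridge is sketched below. The assignments lookup is hypothetical, and dropping ties is an assumption made here for simplicity; the source does not say how ties enter the ranking.

```python
def votes_to_matchups(votes: list[dict],
                      assignments: dict[str, dict[str, str]]) -> list[dict]:
    """Translate slot-based votes into winner/loser pairs of model names.

    assignments maps conversation_id -> {"assistant_a": model, "assistant_b": model}
    (a hypothetical lookup built from the revealed model identities).
    Ties are skipped, which is one simple option among several.
    """
    matchups = []
    for vote in votes:
        if vote["winner"] == "tie":
            continue
        slots = assignments[vote["conversation_id"]]
        loser_slot = "assistant_b" if vote["winner"] == "assistant_a" else "assistant_a"
        matchups.append({
            "winner": slots[vote["winner"]],
            "loser": slots[loser_slot],
        })
    return matchups
```

The resulting list has the shape described in the fit_bradley_terry docstring and can be passed to it directly.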
Run an LLM-based PII classifier on every conversation before dataset export. Use full-conversation exclusion, not token-level redaction:
async def filter_conversation(conv: dict, classifier_model: str) -> bool:
    """Returns True if conversation is safe to publish, False if it
    likely contains PII and should be excluded entirely."""
    full_text = " ".join(t["content"] for t in conv["turns"])
    prompt = (
        "Does the following conversation contain personal information "
        "(names, addresses, phone numbers, emails, ID numbers, medical "
        "details, or any data that could identify a real person)? "
        "Answer ONLY 'yes' or 'no'.\n\n" + full_text
    )
    # call_llm is not defined in this document; it stands for whatever async
    # client wrapper the project uses to query the classifier model.
    response = await call_llm(classifier_model, prompt)
    # Fail closed: anything other than a clear "no" excludes the conversation.
    return response.strip().lower().rstrip(".") == "no"
Expect approximately 5% exclusion rate. Provide a public reporting form so users can flag conversations that slipped through or were incorrectly removed.
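Once conversations have been flagged, the exclusion itself is a simple cascade across the three datasets. A minimal sketch follows; the function name apply_exclusions is hypothetical. The source states that associated votes are removed, and dropping reactions as well is an assumption made here.

```python
def apply_exclusions(conversations: list[dict], votes: list[dict],
                     reactions: list[dict], flagged_ids) -> tuple:
    """Drop flagged conversations together with every vote and reaction
    that references them, leaving all other rows untouched."""
    flagged = set(flagged_ids)

    def keep(rows: list[dict]) -> list[dict]:
        return [r for r in rows if r["conversation_id"] not in flagged]

    return keep(conversations), keep(votes), keep(reactions)
```

Filtering all three tables by the same ID set ensures that no vote or reaction is published for a conversation that was withheld.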
Classify each prompt by language (using fasttext lid.176.bin) and topic category. The compar:IA taxonomy uses 15 categories: Natural Science & Technology (17.7%), Education (14.3%), Business & Economics (10.1%), and others. Apply classification post-hoc for dataset analysis — do not constrain user input.
Use the ecologits Python library to estimate energy consumption per inference call based on token count, model architecture, and GPU manufacturing amortization. Display per-conversation carbon estimates to users after they vote.
Convert conversation-level votes into preference pairs formatted for DPO training:
def export_dpo_pairs(conversations: list, votes: list) -> list[dict]:
    """Convert arena votes to DPO training format."""
    # get_conversation, extract_user_turns and extract_assistant_turns are
    # helper functions that this document does not define.
    pairs = []
    for vote in votes:
        if vote["winner"] == "tie":
            continue  # ties are excluded from DPO pairs
        conv = get_conversation(vote["conversation_id"], conversations)
        prompt = extract_user_turns(conv)
        chosen = extract_assistant_turns(conv, vote["winner"])
        rejected = extract_assistant_turns(
            conv, "assistant_b" if vote["winner"] == "assistant_a" else "assistant_a"
        )
        pairs.append({
            "prompt": prompt,
            "chosen": chosen,
            "rejected": rejected,
            "language": conv["language"],
        })
    return pairs
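The three helpers called by export_dpo_pairs are not defined in the source. One plausible set of definitions is sketched below, assuming the conversation schema shown earlier. Joining multi-turn content with newlines is an assumption made here; a real pipeline might keep turns as a list or apply a chat template instead.

```python
def get_conversation(conversation_id: str, conversations: list[dict]) -> dict:
    """Linear lookup by ID; raises KeyError if the conversation is absent."""
    for conv in conversations:
        if conv["conversation_id"] == conversation_id:
            return conv
    raise KeyError(conversation_id)


def extract_user_turns(conv: dict) -> str:
    """Concatenate all user messages in their stored order."""
    return "\n".join(t["content"] for t in conv["turns"] if t["role"] == "user")


def extract_assistant_turns(conv: dict, role: str) -> str:
    """Concatenate all messages from one slot ("assistant_a" or "assistant_b")."""
    return "\n".join(t["content"] for t in conv["turns"] if t["role"] == role)
```

For large exports, building a dict keyed by conversation_id once would avoid the repeated linear scan in get_conversation.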
Release three dataset splits (conversations, votes, reactions) on HuggingFace with Etalab 2.0 or CC-BY-4.0 licensing. Gate raw data for research-only access. Add a restriction that proprietary model responses may be used for analysis and evaluation but not for training.
Example 1: Building a French LLM Arena
User: "I want to build a platform like Chatbot Arena but focused on French. Help me set up the backend."
Approach:
Implement the /conversations, /votes, /reactions, and /leaderboard endpoints.
Output:
project/
  app/
    main.py              # FastAPI app with CORS, routes
    routers/
      conversations.py   # prompt submission, parallel inference
      votes.py           # preference recording, model reveal
      leaderboard.py     # Bradley-Terry rankings
    services/
      inference.py       # provider abstraction (OpenRouter, HF, direct)
      pairing.py         # model selection with Elo-proximity weighting
      ranking.py         # Bradley-Terry MLE fitting
      privacy.py         # LLM-based PII filter, full-conversation exclusion
    models/
      schemas.py         # Pydantic models for all three datasets
    db/
      migrations/        # Alembic migrations for PostgreSQL
Example 2: Converting Arena Votes to DPO Training Data
User: "I have 50,000 pairwise preference votes from my LLM arena. How do I convert them into a DPO dataset?"
Approach:
Join each vote to its conversation on conversation_id, skip ties, and emit one record per vote in the form {"prompt": ..., "chosen": ..., "rejected": ...}.
Output:
{
  "prompt": "Quels sont les avantages et inconvénients du nucléaire en France ?",
  "chosen": "L'énergie nucléaire représente environ 70% de la production électrique française. Parmi les avantages : faibles émissions de CO2, indépendance énergétique, coût de production stable...",
  "rejected": "Le nucléaire c'est bien parce que ça produit beaucoup d'énergie. Les inconvénients c'est les déchets.",
  "language": "fr"
}
Example 3: Adding a New Language to an Existing Arena
User: "Our arena works for French. We want to add Swedish and Lithuanian support as compar:IA did with their European expansion."
Approach:
Use fasttext lid.176.bin to auto-tag each conversation with its detected language.
Output: The leaderboard now shows separate rankings per language, and the exported HuggingFace datasets include a language column for filtering.
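Separate rankings per language can be produced by bucketing matchups before fitting. A minimal sketch, assuming each matchup dict carries a language tag copied from its conversation; the function name and the "und" fallback (the ISO 639 code for undetermined) are choices made here for illustration.

```python
from collections import defaultdict


def group_matchups_by_language(matchups: list[dict]) -> dict[str, list[dict]]:
    """Bucket matchups by their 'language' tag so each bucket can be
    ranked on its own."""
    buckets: dict[str, list[dict]] = defaultdict(list)
    for m in matchups:
        # Untagged rows go into an explicit "und" bucket rather than being lost.
        buckets[m.get("language", "und")].append(m)
    return dict(buckets)
```

Each bucket can then be passed to a ranking routine such as fit_bradley_terry to obtain one leaderboard per language. Languages with few votes will produce noisy ratings, so a minimum vote count per bucket is worth considering.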
Paper: compar:IA: The French Government's LLM arena to collect French-language human prompts and preference data — Termignon et al., 2026. Focus on Section 3 (platform architecture and data schema), Section 4 (Bradley-Terry ranking methodology), and Section 5 (privacy filtering pipeline) for implementation details.