skills/benchmarking-uncertainty-calibration-long-form/SKILL.md
Implement uncertainty quantification and calibration assessment for LLM-generated long-form answers. Apply answer-frequency consistency, verbalized confidence elicitation, token-level analysis, and multi-metric calibration benchmarking based on the UQ framework from Müller et al. (2026).
Trigger phrases:
- "measure how confident the model is in this answer"
- "calibrate uncertainty on these QA results"
- "benchmark uncertainty quantification for my LLM pipeline"
- "which uncertainty method should I use for scientific QA"
- "detect unreliable LLM answers"
- "evaluate calibration of model confidence scores"
npx skillsauth add ndpvt-web/arxiv-claude-skills benchmarking-uncertainty-calibration-long-form
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design, implement, and evaluate uncertainty quantification (UQ) pipelines for LLM-generated long-form answers. It applies the findings from Müller et al. (2026), which benchmarked UQ methods across 20 LLMs and 685,000 responses on scientific QA tasks. The core actionable insight: answer frequency (consistency across multiple sampled generations) yields the most reliable calibration, while verbalized confidence is systematically biased and token-level probabilities are degraded by instruction tuning. This skill teaches how to build systems that surface trustworthy uncertainty estimates and avoid common calibration measurement pitfalls.
The problem. LLMs produce answers with no built-in reliability signal. Practitioners need uncertainty estimates to decide when to trust model output. Four main approaches exist: (1) token-level probabilities (the softmax confidence on generated tokens), (2) verbalized confidence (asking the model "how confident are you?"), (3) answer frequency (sampling N responses and measuring consistency), and (4) claim-conditioned probability (CCP, computing entailment-vs-contradiction token ratios). The paper finds these methods are not equally reliable, and the standard way of measuring them (ECE alone) is misleading.
What works. Answer frequency — generating 10+ responses to the same prompt and computing the proportion of semantically equivalent answers — provides the best-calibrated uncertainty signal. It is robust to the probability mass polarization that instruction tuning induces (where models collapse nearly all softmax mass onto a single token, destroying the information in probability distributions). Verbalized confidence, by contrast, is systematically overconfident and poorly correlated with correctness. Token-level methods (including P(True) and CCP) degrade as models become more heavily fine-tuned.
How to measure calibration correctly. Expected Calibration Error (ECE) alone is insufficient — it collapses when confidence scores cluster in a narrow range, making poorly calibrated models appear well-calibrated. Always pair ECE with AUROC (discrimination ability), Brier score (proper scoring rule), and visual calibration plots. Evaluate on domain-specific data: factual retrieval tasks show different calibration profiles than multi-step reasoning tasks like GSM8K or GPQA.
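The ECE pitfall described above can be reproduced on synthetic data. The sketch below is an illustration under stated assumptions, not a result from the paper: the arrays `flat` and `sharp` are made up, and `ece` and `auroc` are small NumPy helpers written for this example. A scorer that assigns every answer the base accuracy gets a near-zero ECE while carrying no information about which answers are right.

```python
import numpy as np


def ece(conf, correct, n_bins=10):
    """Equal-width-bin Expected Calibration Error."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    # Map each score to a bin index; clip so conf == 1.0 lands in the last bin.
    idx = np.clip((conf * n_bins).astype(int), 0, n_bins - 1)
    total = 0.0
    for b in range(n_bins):
        mask = idx == b
        if mask.any():
            total += mask.sum() * abs(correct[mask].mean() - conf[mask].mean())
    return total / len(conf)


def auroc(conf, correct):
    """Chance that a correct answer outranks an incorrect one (ties count 0.5)."""
    conf = np.asarray(conf, dtype=float)
    correct = np.asarray(correct).astype(bool)
    diff = conf[correct][:, None] - conf[~correct][None, :]
    return float(((diff > 0) + 0.5 * (diff == 0)).mean())


# Synthetic labels: 70 correct answers, 30 incorrect ones.
correct = np.array([1] * 70 + [0] * 30)
flat = np.full(100, 0.7)                  # same confidence for every answer
sharp = np.where(correct == 1, 0.9, 0.2)  # confidence that tracks correctness

print("flat :", ece(flat, correct), auroc(flat, correct))
print("sharp:", ece(sharp, correct), auroc(sharp, correct))
```

On this toy data the uninformative `flat` scores win on ECE and the informative `sharp` scores win on AUROC, which is why the two metrics are reported together.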
Define the QA task and correctness criterion. Determine whether answers are multiple-choice (compare to ground-truth label), arithmetic (extract and compare numerical result), or open-ended (require NLI-based semantic matching). This choice determines how you compute the binary correctness signal needed for calibration.
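A minimal sketch of the correctness criterion for the two closed-form cases, assuming responses end with a phrase such as "the answer is (C)" or a final number. The regular expressions and the helper names `extract_choice`, `extract_number`, and `is_correct` are illustrative assumptions, not part of the source framework; open-ended answers still need NLI-based matching and are not handled here.

```python
import math
import re
from typing import Optional


def extract_choice(text: str) -> Optional[str]:
    """Return the last option letter (A-E) that follows the word 'answer'."""
    hits = re.findall(r"[Aa]nswer(?:\s+is)?\s*:?\s*\(?([A-E])\)?(?![A-Za-z])", text)
    return hits[-1] if hits else None


def extract_number(text: str) -> Optional[float]:
    """Return the last number in the text, ignoring thousands separators."""
    hits = re.findall(r"-?\d[\d,]*(?:\.\d+)?", text)
    return float(hits[-1].replace(",", "")) if hits else None


def is_correct(task: str, response: str, gold: str) -> bool:
    """Binary correctness signal for 'mcqa' and 'arithmetic' tasks."""
    if task == "mcqa":
        return extract_choice(response) == gold
    if task == "arithmetic":
        pred = extract_number(response)
        return pred is not None and math.isclose(pred, float(gold), rel_tol=1e-6)
    raise ValueError(f"unsupported task type: {task}")


print(is_correct("mcqa", "Gas expands when heated, so the answer is (C).", "C"))
print(is_correct("arithmetic", "3 + 4 = 7, so the final answer is 7.", "7"))
```

Unparseable responses return `None` from the extractors and are scored as incorrect, which keeps the correctness array binary.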
Subsample and structure the evaluation set. Select 200-500 questions from your dataset. For each question, prepare a prompt template that elicits long-form reasoning (use Chain-of-Thought or APriCoT-style counterfactual prompting for MCQA).
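One way to make the subsample reproducible is to draw it with a fixed seed and attach the prompt template at the same time. This is a sketch under assumptions: the record fields (`id`, `question`, `gold`), the template wording, and the function name `build_eval_set` are hypothetical, and the template is a plain Chain-of-Thought prompt rather than an APriCoT one.

```python
import random

# Hypothetical Chain-of-Thought template; adapt the wording to your task.
COT_TEMPLATE = (
    "Question: {question}\n"
    "Think through the problem step by step, then finish with a line of the "
    "form 'Final answer: <answer>'."
)


def build_eval_set(questions: list[dict], k: int = 300, seed: int = 0) -> list[dict]:
    """Draw up to k questions with a fixed seed and render their prompts."""
    rng = random.Random(seed)
    chosen = rng.sample(questions, min(k, len(questions)))
    return [
        {
            "id": q["id"],
            "prompt": COT_TEMPLATE.format(question=q["question"]),
            "gold": q["gold"],
        }
        for q in chosen
    ]


pool = [{"id": i, "question": f"What is {i} + 1?", "gold": str(i + 1)} for i in range(1000)]
eval_set = build_eval_set(pool, k=300)
print(len(eval_set), eval_set[0]["prompt"].splitlines()[0])
```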
Generate multiple responses per question. For each question, sample N=10 completions at temperature 0.7-1.0. Store each response with its full token-level log-probabilities if the API exposes them (OpenAI, Mistral, and vLLM-served models do). This gives you the raw material for all UQ methods.
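The sampling step can be kept independent of any particular provider by passing the model call in as a function. In the sketch below, `generate` is a hypothetical callable that returns the response text and, when available, its per-token log-probabilities; `fake_generate` is a stand-in so the example runs without an API key. Neither mirrors a real client library.

```python
from typing import Callable, Optional

# Hypothetical signature: (prompt, temperature) -> (text, token_logprobs or None)
GenerateFn = Callable[[str, float], tuple[str, Optional[list[float]]]]


def sample_responses(
    prompt: str, generate: GenerateFn, n: int = 10, temperature: float = 0.8
) -> list[dict]:
    """Draw n independent completions and keep the raw material for every UQ method."""
    records = []
    for i in range(n):
        text, token_logprobs = generate(prompt, temperature)
        records.append({"sample": i, "text": text, "token_logprobs": token_logprobs})
    return records


def fake_generate(prompt: str, temperature: float) -> tuple[str, Optional[list[float]]]:
    """Stand-in for a real model call; replace with your own client wrapper."""
    return "Final answer: B", [-0.05, -0.10, -0.02]


records = sample_responses("Question: ...", fake_generate)
print(len(records), records[0]["text"])
```

Storing `token_logprobs` as `None` when the API does not expose them lets the later token-level step skip those records explicitly.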
Compute answer-frequency uncertainty. For each question, cluster the N responses by semantic equivalence. For MCQA, extract the selected option letter. For arithmetic QA, extract the final numerical answer. For open-ended QA, use an NLI model (e.g., DeBERTa-v3-large-mnli) to group responses that mutually entail each other. The confidence score is the size of the largest cluster divided by N: confidence = count_of_most_common_cluster / N.
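For MCQA and arithmetic QA the clustering reduces to counting extracted answers. The sketch below assumes the answers have already been extracted as strings, with `None` marking a response that could not be parsed. Counting unparseable responses in the denominator is a design choice made for this example, not something the source specifies.

```python
from collections import Counter
from typing import Optional


def answer_frequency(extracted: list[Optional[str]]) -> tuple[Optional[str], float]:
    """Return (majority_answer, confidence) for one question's N sampled answers."""
    n = len(extracted)
    if n == 0:
        raise ValueError("need at least one sampled response")
    # Unparseable responses (None) never form a cluster but still count toward N.
    counts = Counter(a for a in extracted if a is not None)
    if not counts:
        return None, 0.0
    answer, count = counts.most_common(1)[0]
    return answer, count / n


samples = ["B", "B", "C", "B", None, "B", "A", "B", "B", "B"]
print(answer_frequency(samples))
```

Here seven of the ten samples agree on "B", so the confidence is 0.7.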
Compute verbalized confidence (for comparison). After generating the answer, issue a follow-up prompt: "On a scale from 0.0 to 1.0, what is the probability that your answer above is correct? Respond with only a decimal number." Parse the returned number. Note: this method is included for benchmarking, not as a recommended production signal.
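Models do not always follow the "only a decimal number" instruction, so the parsing step benefits from being defensive. The sketch below is one possible parser, with assumed behavior: it takes the first number in the reply, converts a percentage to a fraction, and returns `None` for anything outside [0, 1] instead of clamping it.

```python
import re
from typing import Optional


def parse_verbalized_confidence(reply: str) -> Optional[float]:
    """Extract a probability in [0, 1] from a free-text reply, or None if absent."""
    match = re.search(r"(\d+(?:\.\d+)?|\.\d+)\s*(%)?", reply)
    if match is None:
        return None
    value = float(match.group(1))
    if match.group(2):  # "85%" -> 0.85
        value /= 100.0
    # Out-of-range values are treated as parse failures rather than clamped.
    return value if 0.0 <= value <= 1.0 else None


for reply in ["0.85", "I'd say 0.7.", "85%", "no idea", "7"]:
    print(repr(reply), "->", parse_verbalized_confidence(reply))
```

Tracking how often the parser returns `None` is worthwhile, since a high failure rate is itself a sign that the elicitation prompt needs adjusting.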
Compute token-level confidence (if logprobs available). For MCQA, use the P(True) approach: prompt the model with its own answer and ask it to classify as "(A) True" or "(B) False"; extract the softmax probability assigned to the "True" token. For arithmetic, take the mean log-probability of the tokens in the final numerical answer.
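Both token-level scores can be computed from raw log-probabilities with a few lines of arithmetic. Two assumptions in this sketch go beyond the text above: `p_true` renormalizes over just the "True" and "False" option tokens, and `mean_logprob_confidence` exponentiates the mean log-probability so the score lands in (0, 1]. The source framework may define either quantity differently.

```python
import math


def p_true(logprob_true: float, logprob_false: float) -> float:
    """Two-way softmax over the log-probs of the 'True' and 'False' option tokens."""
    m = max(logprob_true, logprob_false)  # subtract the max for numerical stability
    e_true = math.exp(logprob_true - m)
    e_false = math.exp(logprob_false - m)
    return e_true / (e_true + e_false)


def mean_logprob_confidence(token_logprobs: list[float]) -> float:
    """exp(mean log-prob), i.e. the geometric mean of the answer tokens' probabilities."""
    if not token_logprobs:
        raise ValueError("need at least one token log-probability")
    return math.exp(sum(token_logprobs) / len(token_logprobs))


print(p_true(math.log(0.6), math.log(0.2)))
print(mean_logprob_confidence([math.log(0.5), math.log(0.5)]))
```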
Score correctness for every response. Compare each response to the ground-truth answer. Produce a binary array y_correct[i] for each question. For answer-frequency, correctness is whether the majority-cluster answer matches ground truth.
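Compute calibration metrics, never ECE alone. Bin confidence scores into 10-15 equal-width bins. Compute ECE from the bins, compute AUROC and the Brier score from the raw scores, and draw a calibration plot.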
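For reference, these are the standard textbook definitions of the three scalar metrics, written for $n$ scored answers with confidence $c_i \in [0, 1]$, correctness $y_i \in \{0, 1\}$, and bins $S_1, \dots, S_B$. The notation is chosen for this note and is not taken from the paper.

```latex
\mathrm{ECE} = \sum_{b=1}^{B} \frac{|S_b|}{n}
  \left| \frac{1}{|S_b|} \sum_{i \in S_b} y_i - \frac{1}{|S_b|} \sum_{i \in S_b} c_i \right|

\mathrm{Brier} = \frac{1}{n} \sum_{i=1}^{n} (c_i - y_i)^2

\mathrm{AUROC} = \Pr\left(c_i > c_j \mid y_i = 1,\ y_j = 0\right)
  + \tfrac{1}{2} \Pr\left(c_i = c_j \mid y_i = 1,\ y_j = 0\right)
```

ECE depends on the binning, whereas the Brier score and AUROC are computed directly from the unbinned scores.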
Diagnose failure modes. Check for probability mass polarization: if >90% of token-level confidence scores fall in the [0.95, 1.0] bin, token-level methods are unreliable for this model. Check for verbalized overconfidence: if mean verbalized confidence exceeds accuracy by >15 percentage points, the model is systematically overconfident. Check for ECE-accuracy coupling: if ECE is low but AUROC is also low (~0.5), the ECE is misleadingly optimistic.
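The three checks above can be bundled into one helper that returns the names of the failure modes it finds. The 0.95, 90%, and 15-point cutoffs come from the text. The cutoffs for "low ECE" (below 0.05) and "AUROC near 0.5" (within 0.05) are assumptions made for this sketch, as is the function name `diagnose`.

```python
import numpy as np


def diagnose(token_conf, verbal_conf, correct, ece: float, auroc: float) -> list[str]:
    """Return the names of the failure modes triggered by the given statistics."""
    token_conf = np.asarray(token_conf, dtype=float)
    verbal_conf = np.asarray(verbal_conf, dtype=float)
    correct = np.asarray(correct, dtype=float)
    flags = []
    # Polarization: more than 90% of token-level scores sit in [0.95, 1.0].
    if np.mean(token_conf >= 0.95) > 0.90:
        flags.append("probability_mass_polarization")
    # Overconfidence: mean verbalized confidence exceeds accuracy by > 15 points.
    if verbal_conf.mean() - correct.mean() > 0.15:
        flags.append("verbalized_overconfidence")
    # Misleading ECE: low ECE with near-chance AUROC (both cutoffs are assumptions).
    if ece < 0.05 and abs(auroc - 0.5) < 0.05:
        flags.append("misleading_ece")
    return flags


flags = diagnose(
    token_conf=[0.99] * 95 + [0.50] * 5,
    verbal_conf=[0.9] * 100,
    correct=[1] * 60 + [0] * 40,
    ece=0.02,
    auroc=0.52,
)
print(flags)
```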
Select the best UQ method for your deployment. Rank methods by AUROC (discrimination) first, then by Brier score (calibration + sharpness). Use the winning method as the production uncertainty signal for selective prediction or human-escalation thresholds.
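The ranking rule and the selective-prediction use case can be sketched as follows. The metric values in `reports` are made-up placeholders, not results from the paper, and both function names are hypothetical. The ranking is read literally as "AUROC first, Brier score only to break ties".

```python
def rank_methods(reports: dict[str, dict[str, float]]) -> list[str]:
    """Order methods by AUROC (higher is better), then Brier score (lower is better)."""
    return sorted(reports, key=lambda name: (-reports[name]["auroc"], reports[name]["brier"]))


def selective_accuracy(conf: list[float], correct: list[int], threshold: float):
    """Coverage and accuracy when only answers with conf >= threshold are shown."""
    kept = [y for c, y in zip(conf, correct) if c >= threshold]
    coverage = len(kept) / len(conf)
    accuracy = sum(kept) / len(kept) if kept else float("nan")
    return coverage, accuracy


# Placeholder numbers for illustration only.
reports = {
    "verbalized": {"auroc": 0.58, "brier": 0.30},
    "frequency": {"auroc": 0.81, "brier": 0.17},
    "token": {"auroc": 0.81, "brier": 0.21},
}
print(rank_methods(reports))
print(selective_accuracy([0.9, 0.8, 0.4, 0.3], [1, 1, 0, 1], threshold=0.7))
```

Sweeping `threshold` on a held-out set and reading off the coverage/accuracy pairs is one way to choose the human-escalation cutoff.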
Example 1: Building a confidence filter for a science tutoring chatbot
User: "I'm building a science QA chatbot. I want to only show answers the model is confident about and route uncertain ones to human tutors. How should I measure confidence?"
Approach:
confidence = size_of_largest_cluster / 10.
Output:
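```python
from transformers import pipeline

# Model ID as given in the source; confirm it resolves on the Hugging Face Hub,
# or substitute another MNLI-finetuned checkpoint.
nli = pipeline("text-classification", model="microsoft/deberta-v3-large-mnli")


def entails(premise: str, hypothesis: str, min_score: float = 0.7) -> bool:
    """True if the NLI model labels premise -> hypothesis as entailment."""
    result = nli({"text": premise, "text_pair": hypothesis}, top_k=1)
    return result[0]["label"].upper() == "ENTAILMENT" and result[0]["score"] > min_score


def compute_answer_frequency(responses: list[str]) -> tuple[str, float]:
    """Cluster responses by semantic equivalence, return (best_answer, confidence)."""
    clusters: list[list[str]] = []
    for resp in responses:
        for cluster in clusters:
            # Require entailment in both directions (mutual entailment).
            if entails(cluster[0], resp) and entails(resp, cluster[0]):
                cluster.append(resp)
                break
        else:
            clusters.append([resp])
    largest = max(clusters, key=len)
    # Divide by the number of responses actually received.
    return largest[0], len(largest) / len(responses)


# Usage in pipeline. `llm`, `question`, `show_to_user`, and `escalate_to_human`
# are placeholders for your own client and application hooks.
responses = [llm.generate(question, temperature=0.8) for _ in range(10)]
best_answer, confidence = compute_answer_frequency(responses)
if confidence >= 0.7:
    show_to_user(best_answer, confidence)
else:
    escalate_to_human(question)
```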
Example 2: Auditing verbalized confidence for systematic bias
User: "Our LLM pipeline asks the model to rate its own confidence 0-1. Is that reliable?"
Approach:
Output:
```python
import numpy as np
from sklearn.metrics import roc_auc_score, brier_score_loss


def calibration_report(confidences: np.ndarray, correctness: np.ndarray, n_bins: int = 10):
    """Compute ECE, Brier, AUROC and print calibration diagnostics."""
    bin_edges = np.linspace(0, 1, n_bins + 1)
    ece = 0.0
    print(f"{'Bin':>12} {'Count':>6} {'Acc':>6} {'Conf':>6} {'|Gap|':>6}")
    for i, (lo, hi) in enumerate(zip(bin_edges[:-1], bin_edges[1:])):
        # Bins are [lo, hi), except the last one, which is closed so that
        # confidence == 1.0 is not silently dropped.
        upper = confidences <= hi if i == n_bins - 1 else confidences < hi
        mask = (confidences >= lo) & upper
        if mask.sum() == 0:
            continue
        bin_acc = correctness[mask].mean()
        bin_conf = confidences[mask].mean()
        gap = abs(bin_acc - bin_conf)
        ece += gap * mask.sum()
        print(f"  [{lo:.1f},{hi:.1f}) {mask.sum():>6} {bin_acc:>6.3f} {bin_conf:>6.3f} {gap:>6.3f}")
    ece /= len(confidences)
    brier = brier_score_loss(correctness, confidences)
    # roc_auc_score raises ValueError if correctness contains only one class.
    auroc = roc_auc_score(correctness, confidences)
    print(f"\nECE:   {ece:.4f}")
    print(f"Brier: {brier:.4f}")
    print(f"AUROC: {auroc:.4f}")
    if auroc < 0.55:
        print("WARNING: AUROC near chance: confidence scores have no discriminative power.")
    if confidences.mean() - correctness.mean() > 0.15:
        print("WARNING: Systematic overconfidence detected (mean conf >> mean accuracy).")
    return {"ece": ece, "brier": brier, "auroc": auroc}


# Compare verbalized vs answer-frequency. The three inputs are assumed to be
# aligned 1-D NumPy arrays that you have already computed.
verb_report = calibration_report(verbalized_confs, correct_labels)
freq_report = calibration_report(frequency_confs, correct_labels)
```
Example 3: Detecting probability mass polarization in a fine-tuned model
User: "I fine-tuned Llama-3 for chemistry QA. Can I trust the logprob-based confidence?"
Approach:
Output:
```python
def detect_polarization(
    max_token_probs: list[float],
    threshold: float = 0.95,
    polarized_fraction: float = 0.85,
) -> bool:
    """Check if token-level probs are polarized (unreliable for UQ).

    The default cutoff of 0.85 is more conservative than the 90% rule of
    thumb used in the failure-mode diagnosis step; pass 0.90 to match it.
    """
    if not max_token_probs:
        raise ValueError("max_token_probs must not be empty")
    fraction_above = sum(1 for p in max_token_probs if p > threshold) / len(max_token_probs)
    print(f"Fraction of max-token probs > {threshold}: {fraction_above:.2%}")
    if fraction_above > polarized_fraction:
        print("POLARIZED: Token-level confidence is unreliable for this model.")
        print("Recommendation: Use answer-frequency (sample N=10) instead.")
        return True
    print("Token-level confidence may be usable. Validate with calibration metrics.")
    return False
```
Müller, P., Popovič, N., Färber, M., & Steinbach, P. (2026). Benchmarking Uncertainty Calibration in Large Language Model Long-Form Question Answering. arXiv:2602.00279v1. https://arxiv.org/abs/2602.00279v1
Key takeaway: Answer frequency (consistency across sampled generations) is the most reliable UQ method for instruction-tuned LLMs; verbalized confidence and token-level probabilities are systematically compromised. Never evaluate calibration with ECE alone. Open-source framework: https://github.com/muelphil/llm-uncertainty-bench.