skills/evaluating-they-not-know/SKILL.md
Build statistically efficient LLM evaluation pipelines that combine direct accuracy with pairwise comparison signals as control variates. Use when the user asks to 'evaluate LLM accuracy on a benchmark', 'rank models with small sample sizes', 'reduce variance in LLM evaluation', 'build a model comparison pipeline', 'get tighter confidence intervals for model performance', or 'statistically compare reasoning models'.
npx skillsauth add ndpvt-web/arxiv-claude-skills evaluating-they-not-know
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to build evaluation pipelines that go beyond naive accuracy averaging when benchmarking LLMs. The core technique augments standard correctness labels with pairwise comparison signals (the target model judges which of two auxiliary reasoning chains is better) and uses those comparisons as statistical control variates via an efficient influence function (EIF) estimator. Whenever the comparisons are informative about correctness, this yields tighter confidence intervals and more reliable model rankings, especially on small or difficult benchmarks where raw accuracy estimates are noisy.
The problem: Evaluating LLM accuracy on a benchmark with N problems gives you a sample mean with variance ~p(1-p)/N. On small benchmarks (AIME has 30 problems, GPQA Diamond has 198), this yields wide confidence intervals and unstable rankings. Running more samples is expensive.
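To make the scale of the problem concrete, the sketch below computes the normal-approximation (Wald) 95% half-width of the naive estimator at the two benchmark sizes mentioned above. The function name `naive_half_width` and the worst-case choice p = 0.5 are illustrative assumptions, not taken from the paper.

```python
import math


def naive_half_width(p: float, n: int, z: float = 1.96) -> float:
    """Half-width of the Wald interval p +/- z * sqrt(p * (1 - p) / n)."""
    return z * math.sqrt(p * (1.0 - p) / n)


# Benchmark sizes quoted in the text; p = 0.5 is an assumed worst case.
for name, n in [("AIME", 30), ("GPQA Diamond", 198)]:
    hw = naive_half_width(0.5, n)
    print(f"{name}: N={n}, naive 95% CI is roughly +/- {100 * hw:.1f} pp")
```

At p = 0.5 this gives roughly +/- 18 percentage points for N=30 and +/- 7 for N=198, which is wide enough to blur the gap between most pairs of comparable models.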
The insight: Even when a model cannot solve a problem, it can often reliably judge which of two candidate solutions is better. These pairwise comparison signals are correlated with correctness and carry exploitable information. The paper treats them as control variates: auxiliary random variables whose expectation is known or can be estimated, so that subtracting their centered value from the estimator reduces variance without introducing bias.
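The generic control-variate idea can be illustrated with a toy simulation. This is a minimal sketch under simplifying assumptions that are not from the paper: the control signal C is a fair coin with known mean 0.5, the outcome Y agrees with C 80% of the time, and the coefficient beta = 0.6 is the analytically optimal value Cov(Y, C) / Var(C) for that setup. In the actual method the centering term is estimated rather than known.

```python
import random
import statistics


def simulate(n_problems=30, n_reps=2000, agree=0.8, beta=0.6, seed=0):
    """Compare the variance of a plain mean with a control-variate-adjusted mean."""
    rng = random.Random(seed)
    naive, adjusted = [], []
    for _ in range(n_reps):
        ys, cs = [], []
        for _ in range(n_problems):
            c = 1 if rng.random() < 0.5 else 0        # control signal, E[C] = 0.5
            y = c if rng.random() < agree else 1 - c  # outcome correlated with C
            ys.append(y)
            cs.append(c)
        y_bar = statistics.fmean(ys)
        c_bar = statistics.fmean(cs)
        naive.append(y_bar)
        # Subtract the centered control variate; its mean is zero, so no bias is added.
        adjusted.append(y_bar - beta * (c_bar - 0.5))
    return statistics.variance(naive), statistics.variance(adjusted)


var_naive, var_cv = simulate()
print(f"variance of naive mean:    {var_naive:.5f}")
print(f"variance of adjusted mean: {var_cv:.5f}")
```

In this toy setup the adjusted estimator has theoretical variance 0.16/N against 0.25/N for the plain mean, and both target the same value, E[Y] = 0.5.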
The estimator: For each problem X with model answer Y and ground truth G, generate auxiliary reasoning chains W1, W2 from helper models and collect the target model's preference V between them; write Z = (W1, W2, V) for the comparison data. Fit a regression function tau(X, Z) = E[phi(Y, G) | X, Z], the expected correctness given the problem and the comparisons, using cross-fitting. The one-step estimator is: theta_hat = (1/N) * sum_i [ m_hat(X_i) + phi(Y_i, G_i) - tau_hat(X_i, Z_i) ], where m_hat(X) estimates the marginal expectation E[tau(X, Z) | X] by averaging tau_hat over additional auxiliary samples. This achieves the semiparametric efficiency bound -- no regular estimator can have lower asymptotic variance with the same data structure.
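The following worked equations sketch why the correction term leaves the estimate unbiased while lowering its variance. They assume the idealized case where tau and m are the true conditional expectations rather than cross-fitted estimates; the paper's own derivation (Proposition 3.1) handles the estimated versions.

```latex
\begin{aligned}
\hat\theta &= \frac{1}{N}\sum_{i=1}^{N}\Big[\phi(Y_i,G_i) - \big(\tau(X_i,Z_i) - m(X_i)\big)\Big],
\qquad m(X) = \mathbb{E}[\tau(X,Z)\mid X],\\
\mathbb{E}\big[\tau(X,Z) - m(X)\big] &= 0
\quad\Rightarrow\quad \mathbb{E}[\hat\theta] = \mathbb{E}[\phi(Y,G)] = \theta,\\
\operatorname{Cov}\big(\phi,\ \tau - m\big) &= \operatorname{Var}(\tau - m)
\quad\Rightarrow\quad
\operatorname{Var}\big(\phi - (\tau - m)\big) = \operatorname{Var}(\phi) - \operatorname{Var}(\tau - m).
\end{aligned}
```

The per-problem variance therefore drops by Var(tau - m), which is the part of the comparison signal's predictive power that is not already explained by the problem X alone.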
1. Define the evaluation target. Specify the benchmark (problem set), the metric phi (typically exact-match correctness: phi(y, g) = 1 if y == g else 0), and the set of target models to rank.
2. Sample the evaluation subset. Select N problems from the benchmark. For small benchmarks use all problems; for large ones, sample N = 50-100. Record each problem input X_i and ground truth G_i.
3. Collect target model outputs. For each problem X_i, query the target model to get answer Y_i and score correctness phi(Y_i, G_i). The mean of these scores is the naive baseline estimator.
4. Generate auxiliary reasoning chains. For each problem X_i, generate M+1 pairs of candidate solutions (W1_{i,j}, W2_{i,j}), j = 1..M+1, from auxiliary models. Use two distinct auxiliary models for robustness (e.g., one strong closed-source, one open-source). M = 10 is sufficient for the Monte Carlo integration step.
5. Collect pairwise preferences. For each problem X_i, present each auxiliary pair to the target model, ask which solution is better, and record V_{i,j} in {0, 1}. The first pair's preference V_{i,1} is the comparison signal used in the influence score; the remaining M preferences are needed for the Monte Carlo average in step 7.
6. Fit the outcome regression tau via cross-fitting. Partition the data into K = 5 folds. For each fold k, fit tau_hat on the remaining 4 folds, where tau(X, Z) predicts correctness from the problem features and comparison signal. In small-sample regimes (N <= 30), use an LLM as the regressor via in-context learning instead of training a model.
7. Compute the marginal regression m_hat. For each problem X_i, average tau_hat over the remaining M auxiliary samples: m_hat(X_i) = (1/M) * sum_{j=2}^{M+1} tau_hat(W1_{i,j}, W2_{i,j}, V_{i,j}, X_i). This integrates out the auxiliary randomness.
8. Calculate per-instance influence scores. For each instance i in fold k, using the tau_hat fitted without fold k: psi_i = m_hat(X_i) + phi(Y_i, G_i) - tau_hat(W1_{i,1}, W2_{i,1}, V_{i,1}, X_i).
9. Average influence scores to get the one-step estimate. theta_hat_onestep = (1/N) * sum_i psi_i. Compute the standard error as se = std(psi_i) / sqrt(N) and construct a 95% CI as theta_hat_onestep +/- 1.96 * se.
10. Rank models and report. Repeat steps 3-9 for each target model. Rank by theta_hat_onestep. Report rankings with confidence intervals. Flag pairs whose CIs overlap as statistically indistinguishable.
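The influence-score and standard-error steps above can be sketched in plain Python once m_hat, phi, and tau_hat have been computed per problem. The function name `one_step_ci` and the toy numbers are hypothetical; the sketch uses the sample standard deviation (N - 1 in the denominator), which is one reasonable reading of std(psi_i).

```python
import math
import statistics


def one_step_ci(m_hat, phi, tau_hat, z=1.96):
    """Return (estimate, standard error, CI) from per-problem quantities."""
    # psi_i = m_hat(X_i) + phi(Y_i, G_i) - tau_hat(X_i, Z_i)
    psi = [m + p - t for m, p, t in zip(m_hat, phi, tau_hat)]
    theta = statistics.fmean(psi)
    se = statistics.stdev(psi) / math.sqrt(len(psi))
    return theta, se, (theta - z * se, theta + z * se)


# Hypothetical values for four problems, for illustration only.
m_hat = [0.5, 0.5, 0.5, 0.5]
phi = [1, 0, 1, 0]
tau_hat = [0.8, 0.2, 0.7, 0.3]

theta, se, ci = one_step_ci(m_hat, phi, tau_hat)
naive_se = statistics.stdev(phi) / math.sqrt(len(phi))
print(f"one-step: {theta:.3f} (se {se:.3f}), naive se {naive_se:.3f}")
```

Because tau_hat tracks correctness in this toy input, the influence scores vary less than the raw 0/1 labels and the standard error shrinks accordingly.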
Example 1: Ranking 5 models on GPQA Diamond with N=50
User: "I have 5 models and want to rank them on GPQA Diamond, but I can only afford 50 problems. How do I get reliable rankings?"
Approach:
Output:
Model Rankings (GPQA Diamond, N=50, 95% CI):
1. Model-A: 42.0% [36.1%, 47.9%] (naive: 40.0% [26.5%, 53.5%])
2. Model-B: 38.5% [33.0%, 44.0%] (naive: 38.0% [24.7%, 51.3%])
3. Model-C: 34.2% [28.8%, 39.6%] (naive: 36.0% [22.9%, 49.1%])
4. Model-D: 28.7% [23.5%, 33.9%] (naive: 30.0% [17.3%, 42.7%])
5. Model-E: 22.1% [17.2%, 27.0%] (naive: 22.0% [10.7%, 33.3%])
Variance reduction: CIs are ~55% narrower than naive binomial intervals.
Pairs with overlapping CIs (A/B, A/C, B/C, B/D, C/D, D/E) are statistically indistinguishable at alpha=0.05.
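The overlap rule from the final workflow step can be checked mechanically. The sketch below applies it to the one-step intervals printed in this example; the helper name `overlapping_pairs` is hypothetical, and interval overlap is only a conservative heuristic, not a formal test of the difference between two models.

```python
from itertools import combinations

# One-step 95% CIs (in percent) copied from the example output above.
intervals = {
    "Model-A": (36.1, 47.9),
    "Model-B": (33.0, 44.0),
    "Model-C": (28.8, 39.6),
    "Model-D": (23.5, 33.9),
    "Model-E": (17.2, 27.0),
}


def overlapping_pairs(cis):
    """Return every pair of models whose confidence intervals intersect."""
    pairs = []
    for (a, (lo_a, hi_a)), (b, (lo_b, hi_b)) in combinations(cis.items(), 2):
        if lo_a <= hi_b and lo_b <= hi_a:
            pairs.append((a, b))
    return pairs


for a, b in overlapping_pairs(intervals):
    print(f"{a} and {b}: CIs overlap, flag as statistically indistinguishable")
```

A paired comparison on the per-problem influence scores would usually be sharper than this check, since intervals can overlap even when the difference between two models is significant.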
Example 2: Evaluating on AIME 2025 (N=15, extreme small-sample)
User: "I need to compare two reasoning models on AIME 2025. Only 15 problems available."
Approach:
Output:
Model Comparison (AIME 2025, N=15):
Model-X: 33.3% [22.8%, 43.8%] (naive: 33.3% [9.5%, 57.2%])
Model-Y: 26.7% [17.1%, 36.3%] (naive: 26.7% [4.3%, 49.1%])
One-step CI width: ~20pp vs naive ~48pp.
Conclusion: Model-X preferred but difference not significant (p=0.31).
Example 3: Python implementation skeleton
User: "Write me code for the one-step estimator."
import numpy as np
from sklearn.model_selection import KFold
from sklearn.linear_model import LogisticRegression


def one_step_estimator(
    correctness: np.ndarray,          # phi(Y_i, G_i), shape (N,)
    comparison_features: np.ndarray,  # Z_i features, shape (N, M+1, d)
    n_folds: int = 5,
    n_mc_samples: int = 10,
) -> dict:
    """
    Semiparametric one-step estimator with comparison control variates.

    comparison_features[:, 0, :]  = first auxiliary sample (used in influence)
    comparison_features[:, 1:, :] = remaining M samples (used for m_hat)
    """
    N = len(correctness)
    if comparison_features.shape[1] < n_mc_samples + 1:
        raise ValueError("comparison_features needs n_mc_samples + 1 auxiliary samples")
    psi = np.zeros(N)
    kf = KFold(n_splits=n_folds, shuffle=True, random_state=42)

    for train_idx, test_idx in kf.split(correctness):
        # Fit tau on the training folds only (cross-fitting)
        X_train = comparison_features[train_idx, 0, :]
        y_train = correctness[train_idx]
        if len(np.unique(y_train)) < 2:
            # Degenerate fold: LogisticRegression needs both classes,
            # so fall back to a constant prediction
            predict = lambda X, c=float(y_train[0]): np.full(len(X), c)
        else:
            reg = LogisticRegression().fit(X_train, y_train)
            predict = lambda X, reg=reg: reg.predict_proba(X)[:, 1]

        # Compute tau_hat on the held-out fold (first auxiliary sample)
        tau_hat = predict(comparison_features[test_idx, 0, :])

        # Compute m_hat via Monte Carlo over the remaining M samples
        m_hat = np.zeros(len(test_idx))
        for j in range(1, n_mc_samples + 1):
            m_hat += predict(comparison_features[test_idx, j, :])
        m_hat /= n_mc_samples

        # Influence scores: psi_i = m_hat + phi - tau_hat
        psi[test_idx] = m_hat + correctness[test_idx] - tau_hat

    theta_hat = np.mean(psi)
    se = np.std(psi, ddof=1) / np.sqrt(N)
    ci_lower = theta_hat - 1.96 * se
    ci_upper = theta_hat + 1.96 * se

    return {
        "estimate": theta_hat,
        "std_error": se,
        "ci_95": (ci_lower, ci_upper),
        "naive_estimate": np.mean(correctness),
        "naive_se": np.std(correctness, ddof=1) / np.sqrt(N),
    }
Dong, Z., Zhang, Z., Zhou, Y., Jin, C., & Wu, R. (2026). Evaluating LLMs When They Do Not Know the Answer: Statistical Evaluation of Mathematical Reasoning via Comparative Signals. arXiv:2602.03061. Key sections: Proposition 3.1 (EIF derivation), Algorithm 1 (one-step estimator pseudocode), Theorem 4.1 (variance reduction guarantee), Table 1 (experimental results on GPQA/AIME/GSM8K).