skills/llm/agreement-confidence-ensemble/SKILL.md
Ensemble multi-model LLM predictions using a weighted combination of average probability, cross-model agreement ratio, and max confidence
npx skillsauth add wenmin-wu/ds-skills llm-agreement-confidence-ensemble

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When ensembling multiple LLMs that output class probabilities, simple averaging ignores model agreement patterns. Score each candidate class as a weighted blend of: (1) mean probability across models, (2) fraction of models that ranked it in top-K (agreement), and (3) maximum single-model confidence. Agreement acts as a voting signal that breaks ties between similarly-scored classes.
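Written out, the blend described above can be sketched as the formula below. The notation is an assumption introduced here for illustration: $M$ is the number of models, $p_m(c \mid x_i)$ is model $m$'s probability for class $c$ on sample $x_i$, and $\mathrm{TopK}_m(x_i)$ is the set of the $K$ classes that model $m$ ranks highest for that sample.

$$
s_i(c) = w_{\text{avg}} \cdot \frac{1}{M} \sum_{m=1}^{M} p_m(c \mid x_i)
\;+\; w_{\text{agree}} \cdot \frac{1}{M} \sum_{m=1}^{M} \mathbf{1}\!\left[c \in \mathrm{TopK}_m(x_i)\right]
\;+\; w_{\text{conf}} \cdot \max_{m} \, p_m(c \mid x_i)
$$

All three terms lie in $[0, 1]$, so when the weights sum to 1 the score $s_i(c)$ also stays in $[0, 1]$.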
```python
import numpy as np


def agreement_ensemble(model_probs, class_names, weights=(0.6, 0.3, 0.1), top_k=3):
    """Ensemble predictions from multiple models.

    Args:
        model_probs: list of (n_samples, n_classes) arrays
        class_names: list of class name strings
        weights: (avg_prob_weight, agreement_weight, max_conf_weight)
        top_k: number of top classes per model for agreement counting

    Returns:
        list of top-K predicted class names per sample
    """
    n_models = len(model_probs)
    n_samples = model_probs[0].shape[0]
    w_avg, w_agree, w_conf = weights
    results = []
    for i in range(n_samples):
        scores = {}
        for c, name in enumerate(class_names):
            # Mean probability of class c across models
            avg_prob = np.mean([mp[i, c] for mp in model_probs])
            # Number of models that place class c in their top_k
            votes = sum(1 for mp in model_probs if c in np.argsort(-mp[i])[:top_k])
            # Highest single-model probability for class c
            max_conf = max(mp[i, c] for mp in model_probs)
            scores[name] = w_avg * avg_prob + w_agree * (votes / n_models) + w_conf * max_conf
        ranked = sorted(scores, key=scores.get, reverse=True)
        results.append(ranked[:top_k])
    return results
```
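As a rough illustration of how the agreement term can change the outcome, here is a minimal vectorized sketch of the same scoring rule on a toy input. The helper name `agreement_scores` and the probability values are hypothetical, chosen only for this example. The sketch returns the raw score matrix rather than ranked class names, and it is not guaranteed to order exact ties the same way as the loop-based version.

```python
import numpy as np


def agreement_scores(model_probs, weights=(0.6, 0.3, 0.1), top_k=3):
    """Hypothetical helper: blended scores as an (n_samples, n_classes) array."""
    probs = np.stack(model_probs)  # (n_models, n_samples, n_classes)
    w_avg, w_agree, w_conf = weights

    avg_prob = probs.mean(axis=0)  # mean probability across models
    max_conf = probs.max(axis=0)   # highest single-model probability

    # Mark, for each model and sample, which classes fall in that model's top_k.
    top_idx = np.argsort(-probs, axis=2)[:, :, :top_k]
    in_top = np.zeros(probs.shape, dtype=bool)
    np.put_along_axis(in_top, top_idx, True, axis=2)
    agreement = in_top.mean(axis=0)  # fraction of models voting for each class

    return w_avg * avg_prob + w_agree * agreement + w_conf * max_conf


# Toy input: three models, one sample, three classes (made-up numbers).
model_probs = [
    np.array([[0.62, 0.38, 0.00]]),  # model 1 prefers class 0
    np.array([[0.35, 0.45, 0.20]]),  # model 2 prefers class 1
    np.array([[0.35, 0.45, 0.20]]),  # model 3 prefers class 1
]

mean_only = np.mean(model_probs, axis=0)
scores = agreement_scores(model_probs, top_k=1)

# Plain averaging favors class 0; the blended score favors class 1,
# because two of the three models rank class 1 first.
print(mean_only.argmax(axis=1), scores.argmax(axis=1))
```

With `top_k=1` the agreement term reduces to a majority vote on each model's first choice, which is what lets it override a small lead in mean probability here.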