skills/nlp/pairwise-ranking-validation/SKILL.md
Evaluates ranking models by computing the fraction of preference pairs where the model correctly scores the preferred item higher.
npx skillsauth add wenmin-wu/ds-skills nlp-pairwise-ranking-validation

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
When the task is ranking rather than classification, standard metrics such as accuracy or AUC don't apply directly. Pairwise ranking accuracy measures the fraction of (item_a, item_b) pairs in which the model assigns the higher score to the preferred item. For toxicity ranking, given pairs of (less_toxic, more_toxic) texts, check whether score(more_toxic) > score(less_toxic). When a competition is scored on pair agreement, this metric mirrors the leaderboard evaluation, which makes it the right validation signal for margin-ranking or regression-based ranking models.
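Written out as a formula, the definition above can be sketched as follows. The symbols are notation introduced here for illustration, not taken from the skill: $N$ is the number of validation pairs, $s(\cdot)$ is the model's score, and $x_i^{-}$ and $x_i^{+}$ are the less and more toxic texts of pair $i$.

$$
\mathrm{acc} = \frac{1}{N} \sum_{i=1}^{N} \mathbb{1}\left[\, s(x_i^{+}) > s(x_i^{-}) \,\right]
$$

A model that assigns continuous scores at random lands near 0.5 in expectation, so 0.5 is the baseline to beat.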
```python
import numpy as np


def pairwise_accuracy(model, vectorizer, val_df):
    """Fraction of pairs correctly ordered by model scores."""
    X_less = vectorizer.transform(val_df['less_toxic'])
    X_more = vectorizer.transform(val_df['more_toxic'])
    score_less = model.predict(X_less)
    score_more = model.predict(X_more)
    return (score_less < score_more).mean()
```
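```python
# For transformer models
import torch


def pairwise_accuracy_nn(model, val_loader):
    """Fraction of pairs where the model scores the more toxic text higher."""
    model.eval()  # disable dropout so validation scores are deterministic
    scores_less, scores_more = [], []
    # Assumes the batch tensors are already on the same device as the model.
    with torch.no_grad():  # gradients are not needed for evaluation
        for batch in val_loader:
            s_less = model(batch['less_ids'], batch['less_mask'])
            s_more = model(batch['more_ids'], batch['more_mask'])
            scores_less.append(s_less.cpu())
            scores_more.append(s_more.cpu())
    less = torch.cat(scores_less)
    more = torch.cat(scores_more)
    return (less < more).float().mean().item()
```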
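```python
# ridge_model, tfidf, and val_df are assumed to be defined already:
# a fitted model, its fitted vectorizer, and a DataFrame of validation pairs.
acc = pairwise_accuracy(ridge_model, tfidf, val_df)
print(f"Pairwise accuracy: {acc:.4f}")
```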
Tie-aware variant for NumPy score arrays, giving tied scores half credit: `(less < more).mean() + 0.5 * (less == more).mean()`
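To make the tie-handling expression concrete, here is a minimal sketch that works on precomputed scores, so it is independent of how the scores were produced. The function name `pairwise_accuracy_from_scores` and the `tie_credit` parameter are hypothetical, introduced only for this example, and the toy scores are made up.

```python
import numpy as np


def pairwise_accuracy_from_scores(score_less, score_more, tie_credit=0.5):
    """Pairwise accuracy from precomputed scores, with partial credit for ties.

    `tie_credit` is a hypothetical parameter for this sketch: 0.0 reproduces
    the strict comparison, 0.5 counts each tie as half correct.
    """
    less = np.asarray(score_less, dtype=float)
    more = np.asarray(score_more, dtype=float)
    if less.shape != more.shape:
        raise ValueError("score arrays must have the same shape")
    correct = (less < more).mean()  # pairs ordered correctly
    ties = (less == more).mean()    # pairs with identical scores
    return float(correct + tie_credit * ties)


# Toy scores for four pairs: two ordered correctly, one tie, one inverted.
less = [0.1, 0.4, 0.5, 0.9]
more = [0.3, 0.4, 0.2, 1.0]
strict = pairwise_accuracy_from_scores(less, more, tie_credit=0.0)
tie_aware = pairwise_accuracy_from_scores(less, more)
print(strict, tie_aware)
```

Half credit matters mostly when scores are discrete or heavily rounded. Under the strict comparison a model that outputs the same score for every text gets 0.0, while with half credit it gets 0.5, which matches the chance baseline. With PyTorch tensors, the boolean comparisons need `.float()` before `.mean()`.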