skills/cv/quadratic-weighted-kappa-callback/SKILL.md
Custom training callback that computes Quadratic Weighted Kappa on validation data each epoch and checkpoints the best model.
npx skillsauth add wenmin-wu/ds-skills cv-quadratic-weighted-kappa-callback
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Quadratic Weighted Kappa (QWK) measures agreement between predicted and true ordinal labels, penalizing each disagreement in proportion to the square of the distance between the two grades. It is a standard metric for medical grading tasks (diabetic retinopathy, pathology staging), but it is not available as a built-in training metric in core Keras or PyTorch. This callback computes QWK at the end of each epoch using sklearn, tracks the best score, and saves the model only when QWK improves, replacing loss-based checkpointing with metric-based checkpointing. The code assumes a regression-style model whose continuous output is rounded to one of five grades (0 to 4).
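To make the quadratic penalty concrete, here is a minimal dependency-free sketch of the metric. The function name `quadratic_weighted_kappa` and the toy labels are illustrative assumptions, not part of the skill. It should agree with sklearn's `cohen_kappa_score(..., weights='quadratic')` when every grade appears in the data or when the `labels` argument lists all grades, since sklearn derives the weights from the label set it sees.

```python
def quadratic_weighted_kappa(y_true, y_pred, num_classes=5):
    """QWK = 1 - sum(w * O) / sum(w * E), with w[i][j] = (i - j) ** 2.

    O is the observed confusion matrix and E the matrix expected if the
    two label sets were independent. The usual 1 / (num_classes - 1) ** 2
    scaling of w cancels in the ratio, so it is omitted here.
    """
    n = len(y_true)
    observed = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        observed[t][p] += 1
    true_hist = [sum(row) for row in observed]
    pred_hist = [sum(observed[i][j] for i in range(num_classes))
                 for j in range(num_classes)]
    num = 0.0
    den = 0.0
    for i in range(num_classes):
        for j in range(num_classes):
            w = (i - j) ** 2
            num += w * observed[i][j]
            den += w * true_hist[i] * pred_hist[j] / n
    # Undefined (division by zero) if both inputs hold one identical grade.
    return 1.0 - num / den


y_true = [0, 1, 2, 3, 4, 2]
near_miss = [0, 1, 2, 3, 3, 2]  # the grade-4 case predicted as 3
far_miss = [0, 1, 2, 3, 0, 2]   # the same case predicted as 0

print(f"perfect:   {quadratic_weighted_kappa(y_true, y_true):.3f}")     # 1.000
print(f"near miss: {quadratic_weighted_kappa(y_true, near_miss):.3f}")  # 0.941
print(f"far miss:  {quadratic_weighted_kappa(y_true, far_miss):.3f}")   # 0.200
```

Both wrong predictions contain exactly one error, yet the four-grade miss costs far more than the one-grade miss, which is the behavior that makes QWK suitable for ordinal grading.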
import numpy as np
import torch
from sklearn.metrics import cohen_kappa_score

# PyTorch version
def compute_qwk(model, val_loader, device):
    """Return QWK on the validation set for a regression-style grade model."""
    model.eval()
    all_preds, all_labels = [], []
    with torch.no_grad():
        for images, labels in val_loader:
            preds = model(images.to(device)).cpu().numpy()
            # Round continuous outputs to the nearest grade and clamp to 0..4.
            all_preds.extend(np.rint(preds).clip(0, 4).astype(int).flatten())
            all_labels.extend(labels.cpu().numpy().astype(int).flatten())
    return cohen_kappa_score(all_labels, all_preds, weights='quadratic')

import tensorflow as tf

# Keras version
class QWKCallback(tf.keras.callbacks.Callback):
    def __init__(self, val_data, save_path='best_model.h5'):
        super().__init__()
        self.val_data = val_data  # (X_val, y_val) tuple
        self.save_path = save_path
        self.best_kappa = -1  # kappa ranges from -1 to 1

    def on_epoch_end(self, epoch, logs=None):
        X_val, y_val = self.val_data
        # Round continuous outputs to the nearest grade and clamp to 0..4.
        y_pred = np.rint(self.model.predict(X_val, verbose=0)).astype(int).clip(0, 4)
        kappa = cohen_kappa_score(np.asarray(y_val).flatten(), y_pred.flatten(), weights='quadratic')
        print(f" - val_kappa: {kappa:.4f}")
        # Checkpoint only when the validation QWK improves.
        if kappa > self.best_kappa:
            self.best_kappa = kappa
            self.model.save(self.save_path)
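The PyTorch function only returns a score, so the best-score tracking that the Keras callback performs has to live in the training loop. The following is one possible sketch of that bookkeeping, kept framework-agnostic so it runs without PyTorch. `BestKappaTracker`, `save_fn`, and `min_delta` are hypothetical names introduced for illustration.

```python
from typing import Callable, Optional


class BestKappaTracker:
    """Remember the best QWK seen so far and call save_fn on improvement."""

    def __init__(self, save_fn: Callable[[], None], min_delta: float = 0.0):
        self.save_fn = save_fn      # hypothetical hook that writes a checkpoint
        self.min_delta = min_delta  # required margin before a score counts as better
        self.best_kappa = -1.0      # same starting value as the Keras callback
        self.best_epoch: Optional[int] = None

    def update(self, epoch: int, kappa: float) -> bool:
        """Return True (and save) only when kappa beats the best score."""
        if kappa > self.best_kappa + self.min_delta:
            self.best_kappa = kappa
            self.best_epoch = epoch
            self.save_fn()
            return True
        return False


# Simulated per-epoch validation scores; a real loop would call compute_qwk.
saved_epochs = []
current = {"epoch": 0}
tracker = BestKappaTracker(lambda: saved_epochs.append(current["epoch"]))
for epoch, kappa in enumerate([0.50, 0.62, 0.58, 0.71]):
    current["epoch"] = epoch
    tracker.update(epoch, kappa)

print(saved_epochs, tracker.best_kappa)
```

In an actual PyTorch loop, `save_fn` could be something like `lambda: torch.save(model.state_dict(), "best_model.pt")`, with `tracker.update(epoch, compute_qwk(model, val_loader, device))` called once per epoch. The file name here is an assumption.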
Compute the metric with cohen_kappa_score(weights='quadratic').
Round regression outputs with np.rint for nearest-integer grades, or use optimized thresholds for better QWK.
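The optimized-threshold alternative can be sketched as follows. This is an assumption-laden illustration rather than the skill's own method: it uses a simple grid-based coordinate search, chosen to avoid extra dependencies, and the helper names (`qwk`, `apply_thresholds`, `fit_thresholds`) and toy scores are hypothetical. The search starts from the cut points 0.5, 1.5, 2.5, 3.5, which roughly reproduce `np.rint` with clipping, and moves one cut point at a time whenever that raises QWK.

```python
from bisect import bisect_right


def qwk(y_true, y_pred, num_classes=5):
    """Compact quadratic weighted kappa on integer grades."""
    n = len(y_true)
    obs = [[0] * num_classes for _ in range(num_classes)]
    for t, p in zip(y_true, y_pred):
        obs[t][p] += 1
    th = [sum(row) for row in obs]
    ph = [sum(obs[i][j] for i in range(num_classes)) for j in range(num_classes)]
    num = sum((i - j) ** 2 * obs[i][j]
              for i in range(num_classes) for j in range(num_classes))
    den = sum((i - j) ** 2 * th[i] * ph[j] / n
              for i in range(num_classes) for j in range(num_classes))
    return 1.0 - num / den if den else 0.0  # 0.0 for the degenerate case is a choice


def apply_thresholds(scores, thresholds):
    """Grade = number of cut points that are <= the score."""
    return [bisect_right(thresholds, s) for s in scores]


def fit_thresholds(scores, y_true, num_classes=5, step=0.05, sweeps=3):
    """Greedy coordinate search over cut points, keeping them sorted."""
    thresholds = [k + 0.5 for k in range(num_classes - 1)]
    best = qwk(y_true, apply_thresholds(scores, thresholds), num_classes)
    for _ in range(sweeps):
        for idx in range(len(thresholds)):
            lower = thresholds[idx - 1] if idx > 0 else min(scores)
            upper = thresholds[idx + 1] if idx + 1 < len(thresholds) else max(scores)
            candidate = lower + step
            while candidate < upper:
                trial = thresholds[:idx] + [candidate] + thresholds[idx + 1:]
                score = qwk(y_true, apply_thresholds(scores, trial), num_classes)
                if score > best:  # accept strict improvements only
                    best, thresholds = score, trial
                candidate += step
    return thresholds, best


# Toy validation set: the model's outputs are compressed toward low grades.
y_true = [0, 0, 1, 1, 2, 2, 3, 3, 4, 4]
scores = [0.1, 0.3, 0.6, 0.9, 1.2, 1.4, 1.9, 2.2, 2.6, 3.0]

baseline_qwk = qwk(y_true, apply_thresholds(scores, [0.5, 1.5, 2.5, 3.5]))
cuts, tuned_qwk = fit_thresholds(scores, y_true)
print(f"rounding: {baseline_qwk:.3f}  tuned: {tuned_qwk:.3f}  cuts: {cuts}")
```

Because the cut points are fitted to the validation predictions, the tuned score is optimistic on that same data. Fitting on one split and reporting on another is the safer design.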