skills/cv/per-class-soft-f1-threshold-fitting/SKILL.md
Optimize per-class decision thresholds for macro-F1 by replacing the non-differentiable hard threshold with a sigmoid-sharpened soft-F1 surrogate and fitting the per-class threshold vector via least-squares — averaged over multiple random validation splits to suppress overfitting on rare classes
npx skillsauth add wenmin-wu/ds-skills cv-per-class-soft-f1-threshold-fitting
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Picking one global sigmoid threshold (e.g. 0.5) for multi-label classification leaves macro-F1 points on the table because rare classes need lower thresholds and easy classes need higher ones. A naive per-class grid search overfits when the validation set has only a handful of positives for a class. The trick: replace the hard step (p > th) with the sigmoid surrogate σ(d·(p − th)), plug it into the F1 formula to get a differentiable per-class soft-F1, and fit the threshold vector with scipy.optimize.leastsq plus a small L2 penalty. Then repeat the fit on 10 random 50% subsamples of the validation predictions and average the resulting thresholds to denoise. This was the per-class threshold trick that pushed top HPA (Human Protein Atlas) notebooks from 0.40 to 0.46 on the leaderboard (LB).
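Written out, the surrogate and the least-squares objective look roughly as follows. The indices i (sample) and c (class), the class count C, and the symbol ε are notation introduced here for illustration; ε stands for the small constant (1e-6) that the code adds to the denominator, and wd is the weight-decay factor.

```latex
\tilde{p}_{ic} = \sigma\bigl(d\,(p_{ic} - \mathrm{th}_c)\bigr),
\qquad
\mathrm{F1}^{\text{soft}}_c
  = \frac{2 \sum_i \tilde{p}_{ic}\, y_{ic}}
         {\sum_i \tilde{p}_{ic} + \sum_i y_{ic} + \varepsilon}

\min_{\mathrm{th}}\;
  \sum_{c=1}^{C} \bigl(\mathrm{F1}^{\text{soft}}_c - 1\bigr)^2
  \;+\; \mathrm{wd}^2 \sum_{c=1}^{C} \mathrm{th}_c^2
```

Because leastsq squares each residual, a penalty residual of wd·th_c contributes wd²·th_c² to the objective, so with wd = 1e-5 the penalty is a very light nudge toward the starting point rather than a strong regularizer.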
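import numpy as np
from scipy import optimize as opt
from sklearn.model_selection import train_test_split


def sigmoid_np(x):
    return 1.0 / (1.0 + np.exp(-x))


def F1_soft(preds, targs, th=0.0, d=25.0):
    # Soft predictions: sigmoid of the sharpened distance to the threshold.
    p = sigmoid_np(d * (preds - th))
    # Per-class soft F1 = 2*TP / (sum of predictions + sum of targets).
    return 2.0 * (p * targs).sum(0) / ((p + targs).sum(0) + 1e-6)


def fit_thresholds(preds, targs, n_classes, wd=1e-5):
    # Start at 0 for every class (equivalent to probability 0.5 if preds are logits).
    params = np.zeros(n_classes)

    def err(p):
        # Residuals: each class's soft-F1 shortfall from 1.0, plus L2 penalty terms.
        return np.concatenate((F1_soft(preds, targs, p) - 1.0, wd * p), axis=None)

    p, _ = opt.leastsq(err, params)
    return p


# Assumed inputs: val_pred is (n_samples, n_classes) validation scores,
# val_y is the matching 0/1 label matrix.
n_classes = val_y.shape[1]

# Average the fitted thresholds across 10 random 50% subsamples of the validation set.
n_splits = 10
th = np.zeros(n_classes)
for i in range(n_splits):
    xt, xv, yt, yv = train_test_split(val_pred, val_y, test_size=0.5, random_state=i)
    th += fit_thresholds(xt, yt, n_classes)
th /= n_splits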
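To see the pieces working end to end, here is a minimal self-contained sketch on synthetic data. Everything about the data is hypothetical (three classes, made-up prevalences, logit-like scores), and two details differ from the snippet above as deliberate simplifications: the sigmoid is written in its tanh form to avoid overflow, and the random 50% subsamples come from a NumPy permutation instead of train_test_split. The helper names soft_f1 and macro_f1 are illustrative, not part of the skill.

```python
import numpy as np
from scipy import optimize as opt


def soft_f1(scores, targets, th, d=25.0):
    # tanh form of the sigmoid: same value as 1 / (1 + exp(-x)), no overflow.
    p = 0.5 * (1.0 + np.tanh(0.5 * d * (scores - th)))
    return 2.0 * (p * targets).sum(0) / ((p + targets).sum(0) + 1e-6)


def fit_thresholds(scores, targets, wd=1e-5):
    def residuals(th):
        # Soft-F1 shortfall per class, followed by the L2 penalty terms.
        return np.concatenate((soft_f1(scores, targets, th) - 1.0, wd * th))

    th, _ = opt.leastsq(residuals, np.zeros(scores.shape[1]))
    return th


def macro_f1(scores, targets, th):
    pred = (scores > th).astype(int)  # th broadcasts across rows
    tp = (pred * targets).sum(0)
    return float(np.mean(2.0 * tp / (pred.sum(0) + targets.sum(0) + 1e-6)))


# Synthetic data (hypothetical): 3 classes with very different prevalence.
rng = np.random.default_rng(0)
n, prevalence = 4000, np.array([0.5, 0.1, 0.02])
y = (rng.random((n, 3)) < prevalence).astype(float)
logits = 2.0 * (2.0 * y - 1.0) + rng.normal(0.0, 1.5, size=(n, 3))

val_s, val_y = logits[:2000], y[:2000]
test_s, test_y = logits[2000:], y[2000:]

# Fit on 10 random halves of the validation set and average the thresholds.
n_splits = 10
th = np.zeros(3)
for seed in range(n_splits):
    idx = np.random.default_rng(seed).permutation(len(val_s))[: len(val_s) // 2]
    th += fit_thresholds(val_s[idx], val_y[idx])
th /= n_splits

print("thresholds:", np.round(th, 3))
print("macro-F1 @ 0:     ", round(macro_f1(test_s, test_y, 0.0), 4))
print("macro-F1 @ fitted:", round(macro_f1(test_s, test_y, th), 4))
```

Comparing the two printed macro-F1 values on the held-out half gives a quick sanity check on whether the fitted per-class thresholds help for a given model; on real data the held-out scores should come from a split that was not used for fitting.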
- F1_soft with sharpness d ≈ 20–30: too low and the surrogate isn't tight, too high and gradients vanish
- leastsq residual that drives soft-F1 to 1.0, with small L2 weight decay
- pred_binary = (test_score > th).astype(int)

- d: 25 is the sweet spot, high enough to track hard-F1, low enough to keep gradients non-zero.
- leastsq not gradient descent: F1 surface has many flats; least-squares with small wd converges in milliseconds and is reproducible.
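The trade-off behind the sharpness d can be probed with a small NumPy-only sketch. The data is hypothetical (one class, 20% positives, logit-like scores), and the 1e-3 cutoff for calling a sample "active" is an arbitrary choice made here for illustration. It reports two things per value of d: how far soft-F1 sits from hard-F1 at a fixed threshold, and what fraction of samples still have a non-negligible sigmoid slope, which is one way to read "gradients vanish".

```python
import numpy as np


def soft_probs(scores, th, d):
    # tanh form of the sigmoid: same value as 1 / (1 + exp(-x)), no overflow.
    return 0.5 * (1.0 + np.tanh(0.5 * d * (scores - th)))


def f1(pred, targets):
    return 2.0 * (pred * targets).sum() / (pred.sum() + targets.sum() + 1e-6)


# Hypothetical single-class data: 20% positives, noisy logit-like scores.
rng = np.random.default_rng(0)
y = (rng.random(4000) < 0.2).astype(float)
scores = 2.0 * (2.0 * y - 1.0) + rng.normal(0.0, 1.5, size=4000)
th = 0.0

hard = f1((scores > th).astype(float), y)
gaps, active = {}, {}
for d in (1.0, 25.0, 1000.0):
    p = soft_probs(scores, th, d)
    gaps[d] = abs(f1(p, y) - hard)
    # Fraction of samples whose sigmoid slope p * (1 - p) is non-negligible,
    # i.e. samples that still contribute gradient with respect to th.
    active[d] = float(np.mean(p * (1.0 - p) > 1e-3))
    print(f"d={d:>6}: |soft - hard| = {gaps[d]:.4f}, active samples = {active[d]:.3f}")
```

As d grows, the set of samples close enough to the threshold to carry gradient can only shrink, while the surrogate moves closer to the hard F1; a mid-range d such as 25 sits between the two extremes.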