skills/nlp/span-overlap-f1-metric/SKILL.md
Evaluates NER span predictions using bidirectional word-index overlap (>=50% both ways) to compute micro-F1 over predicted vs ground-truth spans.
npx skillsauth add wenmin-wu/ds-skills nlp-span-overlap-f1-metric
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Standard NER metrics require exact span matches, which is too strict for discourse-level or paragraph-level span prediction. Span overlap F1 relaxes this: a predicted span is a true positive if it overlaps >=50% of the ground truth AND the ground truth overlaps >=50% of the prediction (bidirectional). This penalizes both under-segmentation and over-segmentation while tolerating minor boundary errors.
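As a minimal sketch of the bidirectional criterion for a single pair of spans, the check can be written in plain Python. The helper names `overlap_ratios` and `is_match` are hypothetical and chosen only for illustration; spans are assumed to be space-separated word-index strings, and both are assumed to be non-empty.

```python
from typing import Tuple


def overlap_ratios(gt_indices: str, pred_indices: str) -> Tuple[float, float]:
    """Return (share of ground truth covered, share of prediction covered)."""
    gt = set(gt_indices.split())
    pred = set(pred_indices.split())
    inter = len(gt & pred)
    return inter / len(gt), inter / len(pred)


def is_match(gt_indices: str, pred_indices: str, threshold: float = 0.5) -> bool:
    """A pair counts as a match only if BOTH ratios reach the threshold."""
    gt_cov, pred_cov = overlap_ratios(gt_indices, pred_indices)
    return gt_cov >= threshold and pred_cov >= threshold


# Boundary shifted by one word: 4/5 overlap in both directions, so it matches.
shifted = is_match("3 4 5 6 7", "4 5 6 7 8")

# Prediction covers only 2 of 5 ground-truth words (0.4), so it fails
# even though the prediction lies entirely inside the ground truth.
too_short = is_match("3 4 5 6 7", "3 4")

# Prediction swallows the ground truth: only 3 of its 10 words overlap (0.3).
too_long = is_match("3 4 5", "0 1 2 3 4 5 6 7 8 9")

print(shifted, too_short, too_long)
```

The second and third cases show why the check runs in both directions: a one-sided test would accept either a fragment of the true span or a prediction that engulfs it.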
import pandas as pd


def span_overlap_f1(gt_df, pred_df):
    """Compute micro-F1 using bidirectional word overlap.

    gt_df/pred_df: columns [id, class, predictionstring]
    predictionstring: space-separated word indices (e.g., "3 4 5 6 7")
    """
    # Pair every ground-truth span with every prediction of the same id and class
    joined = gt_df.merge(pred_df, on=["id", "class"], suffixes=("_gt", "_pred"))
    if joined.empty:
        # No shared (id, class) pairs, so nothing can match
        return 0.0

    def calc_overlap(row):
        s_pred = set(row["predictionstring_pred"].split())
        s_gt = set(row["predictionstring_gt"].split())
        inter = len(s_gt & s_pred)
        # Share of the ground truth covered, share of the prediction covered
        return inter / len(s_gt), inter / len(s_pred)

    joined[["overlap_gt", "overlap_pred"]] = joined.apply(
        calc_overlap, axis=1, result_type="expand"
    )
    joined["tp"] = (joined["overlap_gt"] >= 0.5) & (joined["overlap_pred"] >= 0.5)

    # Greedy matching: best overlap first, one-to-one.
    # id and class are part of the keys so that identical index strings
    # in different documents are not collapsed into a single match.
    tp_ids = (joined[joined["tp"]]
              .sort_values("overlap_gt", ascending=False)
              .drop_duplicates(["id", "class", "predictionstring_gt"])
              .drop_duplicates(["id", "class", "predictionstring_pred"]))

    TP = len(tp_ids)
    FP = len(pred_df) - TP  # predictions left unmatched
    FN = len(gt_df) - TP    # ground-truth spans left unmatched
    return TP / (TP + 0.5 * (FP + FN)) if (TP + FP + FN) > 0 else 0.0
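The final expression, TP / (TP + 0.5 * (FP + FN)), is the usual F1 score written in terms of counts. The sketch below works through one set of made-up counts and checks it against the precision/recall form; `micro_f1` is a hypothetical helper name and the numbers are illustrative only.

```python
def micro_f1(tp: int, n_pred: int, n_gt: int) -> float:
    """F1 from the number of matches and the total prediction / ground-truth counts."""
    fp = n_pred - tp  # predictions without a matching ground-truth span
    fn = n_gt - tp    # ground-truth spans without a matching prediction
    denom = tp + 0.5 * (fp + fn)
    return tp / denom if denom > 0 else 0.0


# Illustrative counts: 3 matches out of 4 predictions and 5 ground-truth spans.
f1 = micro_f1(tp=3, n_pred=4, n_gt=5)

# The same value via precision and recall.
precision = 3 / 4
recall = 3 / 5
f1_from_pr = 2 * precision * recall / (precision + recall)

print(round(f1, 4), round(f1_from_pr, 4))
```

Because the counts are summed over all documents and classes before the ratio is taken, this is a micro-average: classes with many spans weigh more than rare ones.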