skills/nlp/tfidf-weighted-word-match/SKILL.md
Computes word overlap ratio between two texts weighted by inverse corpus frequency, giving rare shared words more importance than common ones.
npx skillsauth add wenmin-wu/ds-skills nlp-tfidf-weighted-word-match
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Simple word overlap (Jaccard) treats all words equally: "the" counts as much as "tensorflow". Weighting each shared word by inverse corpus frequency (IDF-style) makes rare shared terms contribute more to the similarity score. The result is a single float feature that helps separate duplicate from non-duplicate question pairs and carries over to similar text-matching tasks.
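As a compact restatement of the idea, the score can be written as a weighted version of the Jaccard ratio. The symbols here are notation introduced for illustration: $c(w)$ is the corpus count of word $w$, $s$ is the smoothing constant, and $Q_1$, $Q_2$ are the sets of non-stopword tokens in the two texts.

$$
\omega(w) = \frac{1}{c(w) + s},
\qquad
\text{score}(Q_1, Q_2) = \frac{\sum_{w \in Q_1 \cap Q_2} \omega(w)}{\sum_{w \in Q_1 \cup Q_2} \omega(w)}
$$

With every $\omega(w)$ equal, this reduces to plain Jaccard overlap; the weighting only changes the result when the shared and unshared words differ in corpus frequency.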
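from collections import Counter

import numpy as np
from nltk.corpus import stopwords  # needs the NLTK "stopwords" corpus downloaded

# Build IDF-style weights from the corpus.
# `all_questions` is assumed to be an iterable of question strings.
all_words = " ".join(all_questions).lower().split()
counts = Counter(all_words)
# Words seen fewer than twice are left out, so they get weight 0 below.
weights = {w: 1 / (c + 10000) for w, c in counts.items() if c >= 2}
stops = set(stopwords.words("english"))


def tfidf_word_match(row):
    # Lowercase, split on whitespace, and drop stopwords for each question.
    q1 = {w for w in str(row["q1"]).lower().split() if w not in stops}
    q2 = {w for w in str(row["q2"]).lower().split() if w not in stops}
    if not q1 or not q2:
        return 0.0
    shared = [weights.get(w, 0) for w in q1 & q2]
    total = [weights.get(w, 0) for w in q1 | q2]
    # The small constant avoids division by zero when every word has weight 0.
    return np.sum(shared) / (np.sum(total) + 1e-8)


# `df` is assumed to be a pandas DataFrame with text columns "q1" and "q2".
df["tfidf_word_match"] = df.apply(tfidf_word_match, axis=1)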
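To see the weighting at work without pandas or NLTK, here is a minimal standalone sketch of the same idea. The toy corpus, the hand-written `STOPS` set, the function name `weighted_match`, and the smoothing value of 1 are all assumptions chosen for illustration; they are not part of the skill itself.

```python
from collections import Counter

# Hypothetical toy corpus; a real corpus would hold many thousands of questions.
corpus = [
    "how do i install tensorflow",
    "how do i install numpy",
    "how do i install pandas",
    "how do i install git",
    "why is tensorflow slow",
]

# Hand-written stand-in for a full stopword list (assumption).
STOPS = {"how", "do", "i", "why", "is"}
SMOOTHING = 1.0  # scaled down from 10000 to suit a five-sentence corpus

counts = Counter(" ".join(corpus).lower().split())
# As in the snippet above, words seen fewer than twice get no weight.
weights = {w: 1.0 / (c + SMOOTHING) for w, c in counts.items() if c >= 2}


def weighted_match(a: str, b: str) -> float:
    """Weighted mass of the shared words divided by that of all words."""
    wa = {w for w in a.lower().split() if w not in STOPS}
    wb = {w for w in b.lower().split() if w not in STOPS}
    if not wa or not wb:
        return 0.0
    shared = sum(weights.get(w, 0.0) for w in wa & wb)
    total = sum(weights.get(w, 0.0) for w in wa | wb)
    return shared / total if total > 0 else 0.0


# Both pairs share one of three distinct words, so plain Jaccard is 1/3 for each.
common_pair = weighted_match("how do i install tensorflow", "how do i install pandas")
rare_pair = weighted_match("how do i install tensorflow", "why is tensorflow slow")
print(f"shares 'install':    {common_pair:.3f}")
print(f"shares 'tensorflow': {rare_pair:.3f}")
```

In this toy setup the pair sharing the rarer word "tensorflow" scores about 0.625, while the pair sharing the frequent word "install" scores about 0.375, even though unweighted overlap cannot tell them apart. Words that occur only once in the corpus ("pandas", "slow") carry zero weight and so do not affect either score.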
1 / (count + smoothing) per word
eps=10000 (the smoothing constant) prevents rare words from dominating; tune it to corpus size
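The effect of the smoothing constant can be checked numerically. The sketch below compares the weight of a rare word with that of a common word under several smoothing values; the counts 5 and 50,000 and the helper names are hypothetical, chosen only to make the trend visible.

```python
def weight(count: int, smoothing: float) -> float:
    """IDF-style weight: 1 / (count + smoothing)."""
    return 1.0 / (count + smoothing)


RARE_COUNT, COMMON_COUNT = 5, 50_000  # hypothetical corpus counts


def weight_ratio(smoothing: float) -> float:
    """How many times heavier the rare word is than the common one."""
    return weight(RARE_COUNT, smoothing) / weight(COMMON_COUNT, smoothing)


for smoothing in (0, 100, 10_000, 1_000_000):
    print(f"smoothing={smoothing:>9,}: rare/common ratio = {weight_ratio(smoothing):,.2f}")
```

With no smoothing the rare word outweighs the common one by a factor of 10,000; at 10000 the factor drops to roughly 6, and with very large smoothing the weights become nearly uniform and the score approaches plain Jaccard. That is the trade-off to keep in mind when tuning the constant to corpus size.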