skills/nlp/tfidf-pair-difference-encoding/SKILL.md
Encodes text pairs by computing the absolute difference of their TF-IDF vectors, collapsing a pair into a single fixed-length feature vector.
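For text-pair tasks (duplicate detection, semantic similarity, paraphrase identification), the two texts must be represented as a single feature vector. Interleave both text columns into one corpus, fit a single TF-IDF vectorizer on it so that both sides share one vocabulary and one set of IDF weights, then take the absolute element-wise difference between each pair's vectors. The result highlights the terms that differ between the two texts: a term that carries the same weight in both vectors cancels to zero, while a term that appears in only one text keeps its full TF-IDF weight. With scikit-learn's default L2 row normalization, a shared term's weight also depends on the rest of each text, so cancellation is exact only when the two weights match.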
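To make the cancellation concrete, here is a minimal standard-library sketch. The `tfidf_vectors` helper is hypothetical and hand-rolled for illustration: it uses raw term counts times a smoothed IDF and applies no row normalization, so it is a simplification and not scikit-learn's exact weighting. The example pair is made up.

```python
import math
from collections import Counter


def tfidf_vectors(docs):
    """Simplified TF-IDF: raw term count * smoothed IDF, no normalization."""
    tokenized = [doc.lower().split() for doc in docs]
    vocab = sorted({tok for toks in tokenized for tok in toks})
    n_docs = len(docs)
    # Document frequency: number of documents that contain each term
    doc_freq = Counter(tok for toks in tokenized for tok in set(toks))
    idf = {t: math.log((1 + n_docs) / (1 + doc_freq[t])) + 1.0 for t in vocab}
    vectors = []
    for toks in tokenized:
        counts = Counter(toks)  # missing terms count as 0
        vectors.append([counts[t] * idf[t] for t in vocab])
    return vocab, vectors


# One made-up pair, interleaved into a single corpus (text 1, then text 2)
pairs = [("how do i learn python", "how do i learn java")]
corpus = [text for pair in pairs for text in pair]

vocab, vecs = tfidf_vectors(corpus)

# Absolute element-wise difference between the two vectors of the pair
diff = [abs(a - b) for a, b in zip(vecs[0], vecs[1])]

# Only the terms that differ between the two texts survive
nonzero_terms = [t for t, d in zip(vocab, diff) if d > 0]
print(nonzero_terms)
```

The shared words have identical weights in both vectors and drop out, leaving only the terms unique to one side of the pair.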
from sklearn.feature_extraction.text import TfidfVectorizer
import numpy as np
# Interleave q1 and q2 into a single corpus
corpus = []
for _, row in df.iterrows():
    corpus.append(str(row["question1"]))
    corpus.append(str(row["question2"]))
tfidf = TfidfVectorizer(max_features=256)
vectors = tfidf.fit_transform(corpus)
# Absolute difference: q1 vectors are even indices, q2 are odd
X_diff = np.abs(vectors[0::2] - vectors[1::2])
- Fit `TfidfVectorizer` on the combined corpus.
- Use `|q1 - q2|` element-wise as the pair encoding.
- `q1 * q2` captures co-occurrence, not just difference.
- The result is a sparse matrix; call `.toarray()` for dense output.
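The difference and product encodings can be used together. The following is a minimal sketch, assuming scikit-learn and SciPy are installed; the two toy lists `q1` and `q2` are hypothetical stand-ins for the `question1` and `question2` columns, and stacking the two encodings side by side is one possible design, not something the original snippet prescribes.

```python
from scipy import sparse
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy pairs standing in for the question1/question2 columns
q1 = ["how do i learn python", "what is the capital of france"]
q2 = ["how do i learn java", "what is the capital of france"]

# Interleave so q1 rows sit at even indices and q2 rows at odd indices
corpus = [text for pair in zip(q1, q2) for text in pair]

tfidf = TfidfVectorizer(max_features=256)
vectors = tfidf.fit_transform(corpus)  # sparse CSR, shape (2 * n_pairs, n_terms)
n_terms = len(tfidf.vocabulary_)

v1, v2 = vectors[0::2], vectors[1::2]

X_diff = abs(v1 - v2)     # large where a term appears in only one text
X_prod = v1.multiply(v2)  # non-zero only where a term appears in both texts

# Stack both encodings side by side; the result stays sparse
X = sparse.hstack([X_diff, X_prod]).tocsr()

# Densify only when the downstream model requires a dense array
X_dense = X.toarray()
print(X_dense.shape)
```

Building the corpus with `zip` avoids a row-by-row `iterrows()` loop while preserving the even/odd layout that the slicing relies on. For an identical pair, the difference half of the row is all zeros and the product half carries the signal.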