skills/nlp/tfidf-translation-memory/SKILL.md
TF-IDF similarity retrieval from a translation memory with SequenceMatcher reranking as a fallback or ensemble component
npx skillsauth add wenmin-wu/ds-skills nlp-tfidf-translation-memory
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
For repetitive or formulaic text, retrieve the closest source segment from a translation memory using TF-IDF similarity, then return its paired target. The retriever combines character n-gram and word n-gram TF-IDF scores to shortlist candidates, then reranks the shortlist with SequenceMatcher. It works as a standalone system or as an ensemble component alongside neural MT.
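from difflib import SequenceMatcher

import numpy as np
from sklearn.feature_extraction.text import TfidfVectorizer


class TranslationMemory:
    """Return the target paired with the stored source closest to a query."""

    def __init__(self, sources, targets):
        # Materialize as lists so both can be indexed by position.
        self.sources = list(sources)
        self.targets = list(targets)
        # Character n-grams tolerate spelling variation; word n-grams capture phrasing.
        self.char_vec = TfidfVectorizer(analyzer="char_wb", ngram_range=(3, 6))
        self.word_vec = TfidfVectorizer(analyzer="word", ngram_range=(1, 2))
        self.Xc = self.char_vec.fit_transform(self.sources)
        self.Xw = self.word_vec.fit_transform(self.sources)

    def retrieve(self, query, top_k=5, min_score=0.3):
        # TfidfVectorizer L2-normalizes rows by default, so these dot
        # products are cosine similarities against every stored source.
        sc = (self.char_vec.transform([query]) @ self.Xc.T).toarray()[0]
        sw = (self.word_vec.transform([query]) @ self.Xw.T).toarray()[0]
        combined = 0.6 * sc + 0.4 * sw
        top_idx = np.argsort(-combined)[:top_k]
        # Rerank the TF-IDF shortlist with SequenceMatcher
        best_i, best_s = top_idx[0], -1.0
        for idx in top_idx:
            s = SequenceMatcher(None, query, self.sources[idx]).ratio()
            if s > best_s:
                best_i, best_s = idx, s
        # min_score is checked against the combined TF-IDF score of the
        # reranked winner, not against its SequenceMatcher ratio.
        return self.targets[best_i] if combined[best_i] > min_score else None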
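
Because `retrieve` returns `None` when no stored source clears `min_score`, the memory can sit in front of a neural MT system and answer only when it has a confident match. The following is a minimal sketch of that fallback pattern, not part of the skill itself: `translate_with_fallback`, `tm_lookup`, `mt_translate`, and the toy stand-ins are hypothetical names introduced for illustration, and the MT call is a placeholder.

```python
from typing import Callable, Optional, Tuple


def translate_with_fallback(
    query: str,
    tm_lookup: Callable[[str], Optional[str]],
    mt_translate: Callable[[str], str],
) -> Tuple[str, str]:
    """Return (translation, origin), preferring the translation memory.

    tm_lookup is any callable that returns None on a miss, for example
    the retrieve method of a TranslationMemory instance (assumption).
    """
    hit = tm_lookup(query)
    if hit is not None:
        return hit, "tm"
    # No confident match in the memory: defer to the MT system.
    return mt_translate(query), "mt"


# Toy stand-ins so the sketch runs without a fitted memory or a real MT model.
toy_memory = {"Press the red button.": "Appuyez sur le bouton rouge."}


def fake_mt(text: str) -> str:
    # Placeholder for a neural MT call (hypothetical).
    return "<MT:" + text + ">"


seen = translate_with_fallback("Press the red button.", toy_memory.get, fake_mt)
unseen = translate_with_fallback("Open the door.", toy_memory.get, fake_mt)
```

Returning the origin alongside the translation makes it easy to log how often the memory answers, which helps when tuning `min_score`.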