skills/nlp/word2vec-spell-correction/SKILL.md
Uses Word2Vec vocabulary rank as a word frequency proxy for Norvig-style spell correction, avoiding the need for a separate frequency corpus.
npx skillsauth add wenmin-wu/ds-skills nlp-word2vec-spell-correction
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Norvig's spell checker needs word frequencies to pick the most likely correction. If you already have Word2Vec embeddings (e.g., Google News 300d), the vocabulary is stored in descending order of corpus frequency, so a word's rank is a monotone proxy for its frequency: the lower the rank, the more frequent the word. Use the negative rank as a stand-in for the "probability" when selecting the best candidate among edit-distance neighbors. Only the ordering of candidates matters when taking the maximum, so no separate word frequency file is needed.
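import gensim

# Load pretrained vectors; the vocabulary is stored most-frequent first.
model = gensim.models.KeyedVectors.load_word2vec_format(
    "GoogleNews-vectors-negative300.bin.gz", binary=True)

# Build rank-based "probability" lookup (rank 0 = most frequent word)
w_rank = {word: i for i, word in enumerate(model.index_to_key)}

def P(word):
    # Higher score = more frequent; unknown words score below every known word
    return -w_rank.get(word, len(w_rank))

def edits1(word):
    # All strings one edit away from `word` (lowercase ASCII letters only)
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return set(
        [a + b[1:] for a, b in splits if b] +                        # deletes
        [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1] +  # transposes
        [a + c + b[1:] for a, b in splits if b for c in letters] +    # replaces
        [a + c + b for a, b in splits for c in letters]               # inserts
    )

def known(words):
    # Keep only the words present in the Word2Vec vocabulary
    return {w for w in words if w in w_rank}

def correction(word):
    # Prefer the word itself, then known one-edit neighbors, else leave it unchanged
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=P)

# Assumes `df` is an existing pandas DataFrame with a string column "text";
# the expression returns a new Series of corrected text.
df["text"].apply(lambda t: " ".join(correction(w) for w in t.split()))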
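To try the ranking logic without downloading the Google News vectors, the following is a minimal sketch that swaps the real vocabulary for a small hand-written list. The `toy_vocab` list and its ordering are assumptions made up for illustration; in real use `w_rank` would be built from `model.index_to_key` as above.

```python
# Hypothetical stand-in for model.index_to_key, assumed to be ordered
# from most frequent to least frequent.
toy_vocab = ["the", "of", "and", "spelling", "correct", "word", "world"]
w_rank = {word: i for i, word in enumerate(toy_vocab)}

def P(word):
    # Negative rank: more frequent words get a higher score.
    return -w_rank.get(word, len(w_rank))

def edits1(word):
    # All strings one delete, transpose, replace, or insert away from `word`.
    letters = "abcdefghijklmnopqrstuvwxyz"
    splits = [(word[:i], word[i:]) for i in range(len(word) + 1)]
    return set(
        [a + b[1:] for a, b in splits if b] +
        [a + b[1] + b[0] + b[2:] for a, b in splits if len(b) > 1] +
        [a + c + b[1:] for a, b in splits if b for c in letters] +
        [a + c + b for a, b in splits for c in letters]
    )

def known(words):
    return {w for w in words if w in w_rank}

def correction(word):
    candidates = known([word]) or known(edits1(word)) or [word]
    return max(candidates, key=P)

print(correction("speling"))  # spelling (only known neighbor)
print(correction("worl"))     # word ("word" outranks "world")
print(correction("xyzzy"))    # xyzzy (no known neighbor, left unchanged)
```

"worl" is one edit away from both "word" and "world", so the vocabulary rank alone decides the result, which is the role a frequency table plays in Norvig's original.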