skills/nlp/embedding-coverage-analysis/SKILL.md
Measure pretrained embedding coverage over dataset vocab and return OOV words sorted by frequency for targeted preprocessing
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add wenmin-wu/ds-skills nlp-embedding-coverage-analysis
Before training with pretrained embeddings, measure how much of your dataset vocabulary actually has a pretrained vector. Every out-of-vocabulary (OOV) word falls back to a zero or randomly initialized vector, so low coverage means much of the real signal in the text is lost. This diagnostic function prints vocab coverage (the share of unique words with a vector) and text coverage (the same share weighted by word frequency), and returns the OOV words sorted by frequency, so you can target preprocessing at the OOV words that hurt most.
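import operator

def check_coverage(vocab, embeddings_index):
    """Print vocab and text coverage; return OOV words sorted by frequency.

    vocab maps word -> count; embeddings_index maps word -> vector.
    """
    covered = {}        # words that have a pretrained vector
    oov = {}            # words without one, mapped to their counts
    covered_tokens = 0  # frequency-weighted count of covered words
    oov_tokens = 0      # frequency-weighted count of OOV words
    for word in vocab:
        if word in embeddings_index:
            covered[word] = embeddings_index[word]
            covered_tokens += vocab[word]
        else:
            oov[word] = vocab[word]
            oov_tokens += vocab[word]
    # Note: both ratios assume a non-empty vocab with non-zero counts.
    print(f'Found embeddings for {len(covered) / len(vocab):.2%} of vocab')
    print(f'Found embeddings for {covered_tokens / (covered_tokens + oov_tokens):.2%} of all text')
    return sorted(oov.items(), key=operator.itemgetter(1), reverse=True)
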
# Usage
vocab = build_vocab(train_texts) # {word: count}
oov = check_coverage(vocab, embeddings_index)
print(oov[:30]) # top 30 most-frequent OOV words
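
The usage snippet calls `build_vocab` without defining it. A minimal sketch of such a helper follows, assuming plain whitespace tokenization and an iterable of strings as input; the source does not show its own implementation, so the tokenizer choice and the toy `train_texts` here are illustrative assumptions only.

```python
from collections import Counter

def build_vocab(texts):
    """Count whitespace-separated tokens across an iterable of strings.

    Hypothetical helper: the original build_vocab is not shown, and a real
    pipeline may tokenize differently (punctuation splitting, casing, etc.).
    """
    vocab = Counter()
    for text in texts:
        vocab.update(text.split())
    return vocab

# Toy corpus, made up for illustration.
train_texts = ["the cat sat", "the dog sat", "teh cat"]
vocab = build_vocab(train_texts)
```

The result is a `Counter`, which behaves like the `{word: count}` dict that `check_coverage` expects.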
Run check_coverage to get the vocab % and text % figures.
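
To illustrate how these two figures guide targeted preprocessing, the sketch below measures coverage before and after one candidate fix. It assumes, purely for illustration, that the embedding keys are lowercase and that lowercasing is therefore a sensible fix; the `coverage` helper, the toy `vocab`, and the toy `embeddings_index` are hypothetical stand-ins, not part of the original skill.

```python
from collections import Counter

def coverage(vocab, embeddings_index):
    """Return (vocab coverage, text coverage) as fractions in [0, 1]."""
    known_types = sum(1 for w in vocab if w in embeddings_index)
    known_tokens = sum(c for w, c in vocab.items() if w in embeddings_index)
    total_tokens = sum(vocab.values())
    return known_types / len(vocab), known_tokens / total_tokens

# Toy data: only the keys of embeddings_index matter for coverage.
embeddings_index = {"the": None, "cat": None, "sat": None}
vocab = Counter({"The": 3, "the": 5, "cat": 2, "Cat": 1, "sat": 2, "zzz": 1})

before = coverage(vocab, embeddings_index)

# Candidate fix: merge counts of words that differ only by case.
lowered = Counter()
for word, count in vocab.items():
    lowered[word.lower()] += count

after = coverage(lowered, embeddings_index)

print(f"vocab coverage: {before[0]:.2%} -> {after[0]:.2%}")
print(f"text coverage:  {before[1]:.2%} -> {after[1]:.2%}")
```

Comparing the two figures after each preprocessing step shows whether the step actually recovered frequent OOV words or only touched rare ones.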