skills/nlp/embedding-aware-punct-normalization/SKILL.md
Use pretrained embedding vocab to decide which punctuation to keep, split, or remove before tokenization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add wenmin-wu/ds-skills nlp-embedding-aware-punct-normalization
Pretrained embeddings differ in which punctuation they cover. GoogleNews has a vector for & but not for ?, while GloVe has vectors for both ! and ?. Instead of applying uniform punctuation rules, query the embedding vocab and decide for each character: (1) keep it as its own token if it has an embedding, (2) replace it with a space if it acts as a word boundary, (3) remove it if it has no vector. This routinely pushes text coverage from ~90% to >99%.
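The three-way decision above can be sketched as a plain membership test against the vocab. This is a minimal illustration, not the skill's own code: `classify_punct`, `toy_vocab`, and the choice to check the vocab before the boundary characters are all assumptions, and `vocab` can be any container that supports `in` (a set, a dict, or a loaded embedding index).

```python
import string

def classify_punct(vocab, candidates=string.punctuation, boundary_chars="/-'"):
    """Sort punctuation characters into keep / split / remove buckets.

    Assumption: a character with its own vector is kept even if it is
    also listed as a boundary character.
    """
    keep, split, remove = [], [], []
    for ch in candidates:
        if ch in vocab:
            keep.append(ch)      # has an embedding: keep as a token
        elif ch in boundary_chars:
            split.append(ch)     # acts as a word boundary: replace with a space
        else:
            remove.append(ch)    # no vector, no boundary role: drop
    return {"keep": keep, "split": split, "remove": remove}

# Hypothetical toy vocab standing in for a real embedding index
toy_vocab = {"&", "research", "development"}
rules = classify_punct(toy_vocab)
print(rules["keep"])   # ['&']
```

With a real embedding, the buckets should be inspected by hand before use, since a stray vector for a rare symbol does not always mean the symbol is worth keeping.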
def clean_text(x):
    x = str(x)
    # Split punctuation the embedding treats as word boundaries
    # (so 'foo/bar' becomes two tokens 'foo bar')
    for punct in "/-'":
        x = x.replace(punct, ' ')
    # Keep tokens known to have embeddings (surround with spaces so tokenizer splits them)
    for punct in '&':
        x = x.replace(punct, f' {punct} ')
    # Remove punctuation with no embedding coverage
    strip_chars = '?!.,"#$%\'()*+-/:;<=>@[\\]^_`{|}~' + '\u201c\u201d\u2018\u2019'
    for punct in strip_chars:
        x = x.replace(punct, '')
    return x
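To check whether cleaning actually helps, coverage can be measured before and after. The sketch below assumes "text coverage" means the share of whitespace-separated token occurrences that are found in the vocab. `text_coverage`, `toy_clean`, and `toy_vocab` are hypothetical names, and `toy_clean` is a tiny stand-in for the cleaning function so the block runs on its own.

```python
from collections import Counter

def text_coverage(texts, vocab):
    """Fraction of token occurrences (whitespace split) present in vocab."""
    counts = Counter(tok for text in texts for tok in text.split())
    total = sum(counts.values())
    if total == 0:
        return 0.0
    covered = sum(n for tok, n in counts.items() if tok in vocab)
    return covered / total

def toy_clean(x):
    # Stand-in cleaner: split on '/', keep '&' as a token, drop '?'
    return x.replace("/", " ").replace("&", " & ").replace("?", "")

# Hypothetical toy vocab and corpus
toy_vocab = {"R", "&", "D", "is", "fun", "foo", "bar", "works"}
raw = ["R&D is fun?", "foo/bar works"]
cleaned = [toy_clean(t) for t in raw]

print(text_coverage(raw, toy_vocab))      # 0.4
print(text_coverage(cleaned, toy_vocab))  # 1.0
```

Sorting the uncovered tokens by count is a natural next step, since the most frequent misses usually point to the next punctuation rule to add.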
Apply clean_text before tokenization.
Build a separate clean_text per embedding, or use the intersection.
Stripping all punctuation drops meaningful tokens (for example, & in R&D). Embedding-aware rules preserve it.
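One way to get per-embedding rules without hand-writing each function is to derive the cleaner from the vocab. This is a sketch under assumptions: `make_cleaner` and the toy vocabs are hypothetical, "the intersection" is read here as the intersection of the two vocabularies, and unlike the function above this version also collapses repeated whitespace.

```python
import string

def make_cleaner(vocab, boundary_chars="/-'", candidates=string.punctuation):
    """Build a cleaning function from whatever punctuation `vocab` covers."""
    keep = {ch for ch in candidates if ch in vocab}
    split = set(boundary_chars) - keep
    remove = set(candidates) - keep - split

    table = {}
    for ch in keep:
        table[ord(ch)] = f" {ch} "   # isolate as its own token
    for ch in split:
        table[ord(ch)] = " "         # treat as a word boundary
    for ch in remove:
        table[ord(ch)] = None        # delete

    def clean(x):
        # Translate in one pass, then collapse runs of whitespace
        return " ".join(str(x).translate(table).split())

    return clean

# Hypothetical toy vocabs; real ones would come from the loaded embeddings
w2v_like = {"&"}
glove_like = {"&", "!", "?"}

clean_w2v = make_cleaner(w2v_like)
clean_glove = make_cleaner(glove_like)
clean_shared = make_cleaner(w2v_like & glove_like)  # rules safe for both

print(clean_w2v("R&D rocks!"))     # R & D rocks
print(clean_glove("R&D rocks!"))   # R & D rocks !
print(clean_shared("foo/bar"))     # foo bar
```

A cleaner built from the intersection only keeps punctuation that both embeddings cover, which is the conservative choice when one preprocessed text feeds several embedding matrices.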