skills/nlp/cp1252-utf8-encoding-normalization/SKILL.md
Resolves mixed cp1252/utf-8 encoding artifacts in text via round-trip encode/decode with custom error handlers and unidecode normalization.
npx skillsauth add wenmin-wu/ds-skills nlp-cp1252-utf8-encoding-normalization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
User-generated text (essays, reviews, web scrapes) often contains mojibake: garbled characters produced when bytes written in one encoding are decoded as another (cp1252 bytes interpreted as utf-8, or vice versa). Fix this with a round-trip, raw_unicode_escape → utf-8 → cp1252 → utf-8, using custom error handlers so that bytes or characters that do not fit the target codec are converted with the other codec instead of raising an error. Finish with unidecode to transliterate any remaining non-ASCII characters to ASCII.
import codecs

from text_unidecode import unidecode


def replace_encoding_with_utf8(error):
    # Encode error: emit the unencodable characters as utf-8 bytes instead.
    return (error.object[error.start:error.end].encode("utf-8"), error.end)


def replace_decoding_with_cp1252(error):
    # Decode error: interpret the undecodable bytes as cp1252 instead.
    return (error.object[error.start:error.end].decode("cp1252"), error.end)


# Register both handlers so they can be referenced by name via errors="...".
codecs.register_error("replace_encoding_with_utf8", replace_encoding_with_utf8)
codecs.register_error("replace_decoding_with_cp1252", replace_decoding_with_cp1252)


def resolve_encodings_and_normalize(text: str) -> str:
    text = (
        text.encode("raw_unicode_escape")
        .decode("utf-8", errors="replace_decoding_with_cp1252")
        .encode("cp1252", errors="replace_encoding_with_utf8")
        .decode("utf-8", errors="replace_decoding_with_cp1252")
    )
    # Transliterate whatever non-ASCII characters remain to ASCII.
    return unidecode(text)
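To make the round trip easier to follow, the sketch below records the value after each of the four stages for a small mojibake sample. It uses only the standard library and leaves out the final unidecode step. `trace_round_trip` is a hypothetical helper introduced here for illustration, and the two handlers are re-declared only so the block runs on its own.

```python
import codecs


# Same two handlers as in the skill, re-declared so this sketch is self-contained.
def replace_encoding_with_utf8(error):
    return (error.object[error.start:error.end].encode("utf-8"), error.end)


def replace_decoding_with_cp1252(error):
    return (error.object[error.start:error.end].decode("cp1252"), error.end)


codecs.register_error("replace_encoding_with_utf8", replace_encoding_with_utf8)
codecs.register_error("replace_decoding_with_cp1252", replace_decoding_with_cp1252)


def trace_round_trip(text):
    """Hypothetical helper: return the value after each stage of the round trip."""
    s1 = text.encode("raw_unicode_escape")  # str -> bytes
    s2 = s1.decode("utf-8", errors="replace_decoding_with_cp1252")  # bytes -> str
    s3 = s2.encode("cp1252", errors="replace_encoding_with_utf8")  # str -> bytes
    s4 = s3.decode("utf-8", errors="replace_decoding_with_cp1252")  # bytes -> str
    return [s1, s2, s3, s4]


# Build a mojibake sample: utf-8 bytes of "café" wrongly decoded as cp1252.
mojibake = "caf\u00e9".encode("utf-8").decode("cp1252")  # "cafÃ©"
stages = trace_round_trip(mojibake)

# Text that was decoded correctly in the first place passes through unchanged.
clean = trace_round_trip("caf\u00e9")

# Code points above U+00FF become literal \uXXXX escapes in the first stage.
escaped = trace_round_trip("\u2019")[-1]
```

For the mojibake sample, the first stage yields the bytes `b"caf\xc3\xa9"`, which the utf-8 decode in the second stage turns back into "café". The last line shows a limitation worth checking on your own data: because raw_unicode_escape writes code points above U+00FF as literal `\uXXXX` sequences, a correctly decoded right single quote (U+2019) comes out as the six-character text `\u2019`.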
Apply resolve_encodings_and_normalize() to all text columns before tokenization.
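One way to do this for tabular data is sketched below, assuming the text lives in a pandas DataFrame. `normalize_text_columns` is a hypothetical helper, not part of the skill. It takes the normalizer as an argument, so in practice you would pass `resolve_encodings_and_normalize`. The demo passes `str.strip` purely as a stand-in so the block runs without text_unidecode installed, and the column names are made up.

```python
import pandas as pd


def normalize_text_columns(df, columns, normalize):
    """Hypothetical helper: return a copy of df with `normalize` applied to `columns`."""
    out = df.copy()
    for col in columns:
        # Only transform real strings; missing values are passed through untouched.
        out[col] = out[col].map(lambda v: normalize(v) if isinstance(v, str) else v)
    return out


# Illustrative data; "essay" and "score" are assumed column names.
df = pd.DataFrame({"essay": ["  caf\u00e9 ", None], "score": [3, 4]})

# Stand-in normalizer; replace str.strip with resolve_encodings_and_normalize.
cleaned = normalize_text_columns(df, ["essay"], str.strip)
```

Listing the columns explicitly, rather than inferring them from dtypes, keeps numeric and categorical columns out of the normalizer and avoids depending on how a given pandas version labels string columns.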