skills/cv/generative-output-numeric-cleaning/SKILL.md
Clean noisy numeric strings from generative model output by removing invalid characters, fixing malformed floats, and handling multiple decimal points
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add wenmin-wu/ds-skills cv-generative-output-numeric-cleaning
Generative vision-language models (Donut, Pix2Struct, Florence) often produce noisy numeric strings: extra spaces, stray characters, multiple decimal points, or malformed scientific notation. A robust cleaning pipeline strips invalid characters, fixes structural issues (multiple dots, multiple minus signs), and falls back to zero for unparseable values, so downstream numeric code does not raise on bad input.
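import re

def clean_numeric(str_list):
    """Convert noisy numeric strings to int/float, using 0 for unparseable values."""
    result = []
    for s in str_list:
        # Drop all whitespace, including spaces inside the number.
        s = re.sub(r"\s", "", s)
        # A "." selects float; anything else is parsed as int.
        dtype = float if "." in s else int
        # Fast path: the string is already a valid number.
        try:
            result.append(dtype(s))
            continue
        except ValueError:
            pass
        # Keep only digits, ".", "-", and the exponent markers e/E.
        s = re.sub(r"[^0-9.\-eE]", "", s)
        if not s:
            result.append(0)
            continue
        # Multiple dots: keep the first, drop the rest.
        if s.count(".") > 1:
            parts = s.split(".")
            s = parts[0] + "." + "".join(parts[1:])
        # Multiple minus signs: keep a single leading one.
        if s.count("-") > 1:
            s = "-" + s.replace("-", "")
        # Final attempt; fall back to 0 if the string is still invalid.
        try:
            result.append(dtype(s))
        except ValueError:
            result.append(0)
    return result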
raw = ["12.5", "1,234.5", " -3..2 ", "abc", "1.2e3"]
clean_numeric(raw) # [12.5, 1234.5, -3.2, 0, 1200.0]
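The two structural repairs in the function above can also be looked at in isolation. The sketch below pulls them into two hypothetical helpers (`collapse_dots` and `collapse_minus` are illustrative names, not part of the skill) and shows what each does to a string whose invalid characters have already been stripped.

```python
def collapse_dots(s: str) -> str:
    """Keep the first '.' and drop any later ones (hypothetical helper)."""
    head, sep, tail = s.partition(".")
    return head + sep + tail.replace(".", "")


def collapse_minus(s: str) -> str:
    """If there are several '-' signs, keep one leading sign (hypothetical helper)."""
    if s.count("-") > 1:
        return "-" + s.replace("-", "")
    return s


print(collapse_dots("-3..2"))     # -3.2
print(collapse_dots("1.2.3"))     # 1.23
print(collapse_minus("--4.5"))    # -4.5
print(collapse_minus("-1.2e-3"))  # -1.2e3
```

The last line shows a side effect worth knowing about: a negative exponent loses its sign if the string reaches the minus repair. In the function above that only happens when the fast-path parse has already failed, for example because of a stray trailing character.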
Try a direct int() or float() parse first: a fast path for clean values.
If that fails, keep only digits and the characters (., -, e, E).
re.sub removes commas naturally in the stripping step.
Keep e/E characters for values like 1.2e3.
Use the presence of "." to choose int vs float.
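One consequence of choosing the type from "." alone is that an exponent-only string such as `1e3` is sent to `int()`, fails both parse attempts, and comes back as 0. Below is a minimal sketch of one way to handle that, assuming it is acceptable to return a float whenever an exponent marker survives cleaning. `clean_one` is a hypothetical single-string helper, not part of the skill as published, and it omits the multiple-minus repair to stay short.

```python
import re


def clean_one(s: str):
    """Clean one string; hypothetical variant that treats e/E as a float marker."""
    s = re.sub(r"\s", "", s)
    # Strip everything except digits, '.', '-', and the exponent markers.
    s = re.sub(r"[^0-9.\-eE]", "", s)
    # Multiple dots: keep the first, drop the rest.
    if s.count(".") > 1:
        head, _, tail = s.partition(".")
        s = head + "." + tail.replace(".", "")
    # Assumption: choose float if a dot OR an exponent marker is present.
    dtype = float if any(c in s for c in ".eE") else int
    try:
        return dtype(s)
    except ValueError:
        return 0


print([clean_one(x) for x in ["1e3", "1,234", "2.5E-2", "abc"]])
# [1000.0, 1234, 0.025, 0]
```

The trade-off is that strings containing a stray letter e, such as a leftover unit or label, are now routed to `float()`. They still fall back to 0 when the parse fails.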