skills/nlp/translation-regex-postprocessing/SKILL.md
Multi-rule regex pipeline to clean seq2seq translation outputs — deduplicate phrases, fix punctuation, remove artifacts
npx skillsauth add wenmin-wu/ds-skills nlp-translation-regex-postprocessing
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Seq2seq translation models often emit artifacts: repeated phrases, leaked prompt text, trailing fragments, and inconsistent punctuation. A cascade of regex rules, applied in a fixed order, fixes these systematically. Apply it after decoding and before MBR or final submission.
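The ordering in the last sentence (decode, clean, then select) can be sketched as follows. This is a minimal illustration, assuming "MBR" means minimum Bayes risk selection over several decoded candidates. The names `strip_prompt`, `token_f1`, and `select_after_cleaning` are hypothetical, the cleaner is a one-rule stand-in for the full pipeline, and the token-overlap utility is a toy substitute for a real translation metric.

```python
import re
from collections import Counter

# Hypothetical one-rule cleaner standing in for the full regex pipeline.
_PROMPT = re.compile(r'(?i)^translate \w+ to \w+:\s*')


def strip_prompt(text):
    """Remove a leaked 'translate X to Y:' prefix."""
    return _PROMPT.sub('', text.strip())


def token_f1(a, b):
    """Token-overlap F1; a toy utility, not a real translation metric."""
    ta, tb = Counter(a.split()), Counter(b.split())
    overlap = sum((ta & tb).values())
    if overlap == 0:
        return 0.0
    precision = overlap / sum(ta.values())
    recall = overlap / sum(tb.values())
    return 2 * precision * recall / (precision + recall)


def select_after_cleaning(candidates, clean=strip_prompt, utility=token_f1):
    """Clean each decoded candidate, then return the cleaned candidate with
    the highest total utility against the other cleaned candidates."""
    cleaned = [c for c in (clean(x) for x in candidates) if c]
    if not cleaned:
        return ""

    def total_utility(i):
        return sum(utility(cleaned[i], other)
                   for j, other in enumerate(cleaned) if j != i)

    best = max(range(len(cleaned)), key=total_utility)
    return cleaned[best]


if __name__ == "__main__":
    decoded = [
        "translate English to German: das ist gut.",  # leaked prompt prefix
        "das ist gut.",
        "das war gut.",
    ]
    print(select_after_cleaning(decoded))
```

Cleaning first means candidates that differ only by an artifact collapse to the same string before they are compared, so the artifact does not lower a candidate's agreement with its peers.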
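import re

# Rules run top to bottom; each is a (compiled pattern, replacement) pair.
RULES = [
    # Remove leaked prompt prefix, e.g. "translate English to German: "
    (re.compile(r'(?i)^translate \w+ to \w+:\s*'), ''),
    # Deduplicate immediately repeated phrases (2-4 word spans)
    (re.compile(r'\b(\w+(?:\s+\w+){1,3})\s+\1\b'), r'\1'),
    # Collapse a single word repeated three or more times in a row
    (re.compile(r'\b(\w+)(\s+\1){2,}\b'), r'\1'),
    # Remove a trailing short fragment (final 1-3 character word with no
    # closing punctuation after it)
    (re.compile(r'\s+\w{1,3}$'), ''),
    # Normalize multiple spaces
    (re.compile(r'\s{2,}'), ' '),
    # Fix space before punctuation
    (re.compile(r'\s+([.,;:!?])'), r'\1'),
]


def postprocess_translation(text):
    """Apply every rule in order, then make sure the text ends with punctuation."""
    text = text.strip()
    for pattern, repl in RULES:
        text = pattern.sub(repl, text)
    # Append a period unless the text already ends a sentence or a quote
    if text and text[-1] not in '.!?"':
        text += '.'
    return text.strip()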
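Because the rules run in sequence, it can be hard to tell which one altered a given output. The sketch below logs every rule that changes the text. It is an illustration only: `trace_rules` and the rule names are hypothetical labels added here, the patterns are copied from the rule list above, and the final period-appending step of `postprocess_translation` is left out.

```python
import re

# Named copies of the six rules above; the names are hypothetical labels
# added for tracing and are not part of the original pipeline.
NAMED_RULES = [
    ("strip_prompt", re.compile(r'(?i)^translate \w+ to \w+:\s*'), ''),
    ("dedupe_phrase", re.compile(r'\b(\w+(?:\s+\w+){1,3})\s+\1\b'), r'\1'),
    ("collapse_word", re.compile(r'\b(\w+)(\s+\1){2,}\b'), r'\1'),
    ("trailing_fragment", re.compile(r'\s+\w{1,3}$'), ''),
    ("collapse_spaces", re.compile(r'\s{2,}'), ' '),
    ("space_before_punct", re.compile(r'\s+([.,;:!?])'), r'\1'),
]


def trace_rules(text, rules=NAMED_RULES):
    """Apply the rules in order; return the final text and a log of
    (rule_name, before, after) for every rule that changed the text."""
    text = text.strip()
    log = []
    for name, pattern, repl in rules:
        new_text = pattern.sub(repl, text)
        if new_text != text:
            log.append((name, text, new_text))
        text = new_text
    return text, log


if __name__ == "__main__":
    samples = [
        "translate English to German: das ist das ist gut  .",
        "I love you",  # no closing punctuation, so the last word is at risk
    ]
    for sample in samples:
        cleaned, log = trace_rules(sample)
        print(repr(sample), "->", repr(cleaned))
        for name, before, after in log:
            print(f"  {name}: {before!r} -> {after!r}")
```

In this sketch the second sample loses its final word to the trailing-fragment rule, since any unpunctuated ending of one to three characters matches it. Running the trace over a sample of real outputs is one way to check that rule before relying on it.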