skills/nlp/sentence-alignment-augmentation/SKILL.md
Split multi-sentence parallel pairs into aligned sentence pairs to expand training data for seq2seq models
npx skillsauth add wenmin-wu/ds-skills nlp-sentence-alignment-augmentation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
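Many parallel corpora contain document-level pairs in which several sentences are joined into a single example. When the source and target contain the same number of segments, split the pair into sentence-level pairs; pairs whose counts differ are kept unchanged. This yields more training examples and helps the model learn finer-grained alignments. In the code here, the source is split on line breaks and the target on sentence-ending punctuation.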
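```python
import re
import pandas as pd

def sentence_align(df, src_col='source', tgt_col='target'):
    """Split multi-sentence pairs into sentence pairs when segment counts match."""
    aligned = []
    for _, row in df.iterrows():
        src, tgt = str(row[src_col]), str(row[tgt_col])
        # Target: split after sentence-ending punctuation followed by whitespace.
        tgt_sents = [s.strip() for s in re.split(r'(?<=[.!?])\s+', tgt) if s.strip()]
        # Source: one segment per non-empty line.
        src_lines = [s.strip() for s in src.split('\n') if s.strip()]
        if len(tgt_sents) > 1 and len(tgt_sents) == len(src_lines):
            for s, t in zip(src_lines, tgt_sents):
                # Skip fragments of three characters or fewer.
                if len(s) > 3 and len(t) > 3:
                    aligned.append({src_col: s, tgt_col: t})
        else:
            # Counts differ (or single sentence): keep the original pair as is.
            aligned.append({src_col: src, tgt_col: tgt})
    return pd.DataFrame(aligned)
```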
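To see the count-matching rule in isolation, here is a minimal sketch on a single pair, without pandas. The helper name `split_pair`, its `min_len` parameter, and the toy German/English strings are illustrative assumptions and not part of the skill itself.

```python
import re

def split_pair(src, tgt, min_len=3):
    """Hypothetical helper: apply the count-matching rule to one (src, tgt) pair."""
    # Target segments: split after ., ! or ? followed by whitespace.
    tgt_sents = [s.strip() for s in re.split(r'(?<=[.!?])\s+', tgt) if s.strip()]
    # Source segments: one per non-empty line.
    src_lines = [s.strip() for s in src.split('\n') if s.strip()]
    if len(tgt_sents) > 1 and len(tgt_sents) == len(src_lines):
        # Counts match: pair segments by position, dropping very short fragments.
        return [(s, t) for s, t in zip(src_lines, tgt_sents)
                if len(s) > min_len and len(t) > min_len]
    # Counts differ: fall back to the original, unsplit pair.
    return [(src, tgt)]

# Two source lines and two target sentences: the pair is split.
matched = split_pair("Guten Morgen.\nWie geht es dir?", "Good morning. How are you?")

# The source has no line break, so counts differ and the pair stays whole.
unmatched = split_pair("Guten Morgen. Wie geht es dir?", "Good morning. How are you?")
```

Note that alignment here is purely positional: equal counts do not guarantee that the i-th source segment actually translates the i-th target sentence, so spot-checking a sample of the split pairs is worthwhile.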