skills/nlp/train-short-infer-long-sequence/SKILL.md
Trains a transformer at shorter sequence length for speed, then runs inference at a longer sequence length to capture more context, exploiting position embedding generalization.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

    npx skillsauth add wenmin-wu/ds-skills nlp-train-short-infer-long-sequence

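Self-attention cost grows quadratically with sequence length, while the rest of a transformer layer grows linearly. Training at 1024 tokens instead of 2048 therefore cuts per-sequence compute by somewhere between 2x and 4x, depending on how much of the total the attention term accounts for. At inference time, longer contexts often improve predictions, especially for NER, where an entity's meaning can depend on surrounding sentences. This technique trains at a short sequence length (e.g., 1024) and then infers at a longer one (e.g., 2048). It relies on relative position information: transformers with relative position embeddings (DeBERTa, RoPE-based models) tend to generalize to lengths not seen in training, though how well varies by model, so check validation metrics at the inference length. The result: training-time savings combined with the accuracy benefit of longer context at inference.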
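To make the cost argument concrete, the following is a rough back-of-the-envelope sketch, not a benchmark. It counts only matrix-multiply FLOPs in one encoder layer and assumes a hidden width of 768 and a 4x feed-forward expansion; the function name `layer_flops` is hypothetical, and the estimate ignores softmax, normalization, and any model-specific attention terms.

```python
def layer_flops(seq_len: int, hidden: int, ffn_mult: int = 4) -> float:
    """Rough forward FLOPs for one encoder layer on one sequence.

    Counts matrix multiplies only (2 FLOPs per multiply-add); ignores
    softmax, layer norm, and model-specific extra attention terms.
    """
    # Q, K, V and output projections: four (n, d) x (d, d) matmuls
    projections = 2 * seq_len * 4 * hidden * hidden
    # Feed-forward: (n, d) x (d, m*d) and (n, m*d) x (m*d, d)
    ffn = 2 * seq_len * 2 * ffn_mult * hidden * hidden
    # Attention scores (QK^T) and the weighted sum over V: both scale with n^2
    attention = 2 * 2 * seq_len * seq_len * hidden
    return float(projections + ffn + attention)


HIDDEN = 768  # assumed hidden width, for illustration only

short_cost = layer_flops(1024, HIDDEN)
long_cost = layer_flops(2048, HIDDEN)

per_sequence_ratio = long_cost / short_cost  # cost of one 2048 sequence vs one 1024 sequence
per_token_ratio = per_sequence_ratio / 2     # same comparison per token processed

print(f"per-sequence cost ratio (2048 vs 1024): {per_sequence_ratio:.2f}")
print(f"per-token cost ratio (2048 vs 1024): {per_token_ratio:.2f}")
```

Under these assumptions the per-sequence ratio comes out at about 2.4, between the 2x linear limit and the 4x purely quadratic limit. Measured speedups also depend on memory use, batch size, and kernel efficiency, so treat the number as an order-of-magnitude guide.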

    from transformers import AutoTokenizer

    tokenizer = AutoTokenizer.from_pretrained("microsoft/deberta-v3-base")

    TRAIN_MAX_LEN = 1024  # shorter for fast training
    INFER_MAX_LEN = 2048  # longer for better context at inference

    # Placeholder inputs; replace with your own lists of documents
    train_texts = ["Alice joined Acme Corp in Paris."]
    test_texts = ["Bob moved to Berlin to work for Initech."]

    # Training: tokenize at the short length
    train_encodings = tokenizer(
        train_texts, max_length=TRAIN_MAX_LEN,
        truncation=True, padding='max_length',
        return_offsets_mapping=True
    )

    # Inference: tokenize at the longer length
    test_encodings = tokenizer(
        test_texts, max_length=INFER_MAX_LEN,
        truncation=True, padding='max_length',
        return_offsets_mapping=True
    )
    # The same trained model handles both lengths; no retraining needed

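One intuition for why relative position schemes can cope with longer inputs is sketched below. It uses a deliberately simplified clipping rule, not DeBERTa's actual bucketing function, and the clipping distance of 512 is an assumption chosen for illustration. The names `clipped_relative_index` and `indices_used` are hypothetical.

```python
from typing import Set


def clipped_relative_index(query_pos: int, key_pos: int, max_distance: int) -> int:
    """Map a (query, key) position pair to a relative-position table index.

    Simplified clipping scheme: offsets beyond +/- max_distance share the
    boundary index. This is an illustration, not any library's exact rule.
    """
    offset = query_pos - key_pos
    clipped = max(-max_distance, min(max_distance, offset))
    return clipped + max_distance  # shift into [0, 2 * max_distance]


def indices_used(seq_len: int, max_distance: int) -> Set[int]:
    """All table indices touched by a sequence of the given length."""
    # The index depends only on the offset, so pairing every position with
    # position 0 (in both roles) covers every offset in [-(n - 1), n - 1].
    used = {clipped_relative_index(q, 0, max_distance) for q in range(seq_len)}
    used |= {clipped_relative_index(0, k, max_distance) for k in range(seq_len)}
    return used


MAX_DISTANCE = 512  # assumed clipping distance, for illustration only

train_indices = indices_used(1024, MAX_DISTANCE)
infer_indices = indices_used(2048, MAX_DISTANCE)

# True when the longer sequence needs no table entry unseen during training
print(infer_indices <= train_indices)
```

With a clipped scheme like this, doubling the sequence length reuses table entries that were already trained. A model with learned absolute position embeddings has no such property, and whether a given checkpoint actually holds up at the longer length still needs to be checked on validation data.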
TRAIN_MAX_LEN: max_length for training tokenization (e.g., 1024)
INFER_MAX_LEN: max_length for inference tokenization (e.g., 2048)
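One possible way to choose the two lengths is to check how many documents each candidate `max_length` would truncate. The sketch below is a minimal illustration: the token counts are made up, and `truncated_fraction` is a hypothetical helper, not part of any library. In practice the counts would come from tokenizing your own corpus without truncation.

```python
from typing import Dict, Sequence


def truncated_fraction(token_counts: Sequence[int], max_length: int) -> float:
    """Fraction of documents whose token count exceeds max_length."""
    if not token_counts:
        raise ValueError("token_counts must not be empty")
    return sum(count > max_length for count in token_counts) / len(token_counts)


# Hypothetical per-document token counts (special tokens included)
token_counts = [310, 640, 980, 1200, 1500, 1900, 2300, 760]

# Truncation rate for each candidate max_length
report: Dict[int, float] = {
    max_len: truncated_fraction(token_counts, max_len)
    for max_len in (512, 1024, 2048)
}

for max_len, fraction in report.items():
    print(f"max_length={max_len}: {fraction:.0%} of documents truncated")
```

A large gap between the truncation rates at the training and inference lengths suggests that the longer inference window recovers a meaningful amount of context.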