skills/nlp/adafactor-label-smoothing-seq2seq/SKILL.md
Use the Adafactor optimizer with label smoothing for seq2seq fine-tuning: Adafactor saves optimizer memory, and label smoothing regularizes overconfident predictions
npx skillsauth add wenmin-wu/ds-skills nlp-adafactor-label-smoothing-seq2seq
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Adafactor uses ~3x less memory than Adam by factorizing second-moment estimates: for each weight matrix it keeps per-row and per-column statistics instead of one value per parameter. Label smoothing (0.1–0.2) complements this by regularizing the model against overconfident token predictions. The combination is well suited to fine-tuning large seq2seq models on limited GPU memory.
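To make the factorization concrete, here is a minimal sketch that counts optimizer state values for a single 2-D weight matrix. The helper names and the layer shape are hypothetical, and the Adafactor count assumes no first-moment (momentum) buffer.

```python
def adam_state_size(shape):
    """Adam keeps two full-size tensors (first and second moment) per weight."""
    rows, cols = shape
    return 2 * rows * cols


def adafactor_state_size(shape):
    """Factored second moment: one running average per row and per column.

    Assumes a 2-D weight and no first-moment (momentum) buffer.
    """
    rows, cols = shape
    return rows + cols


# Hypothetical layer shape, chosen only for illustration.
shape = (4096, 1024)
adam = adam_state_size(shape)
adafactor = adafactor_state_size(shape)
print(f"Adam state values:      {adam:,}")
print(f"Adafactor state values: {adafactor:,}")
```

The per-matrix optimizer state shrinks far more than the overall memory footprint does, because weights, gradients, and activations are unaffected and 1-D parameters such as biases are not factored.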
from transformers import Seq2SeqTrainingArguments, Seq2SeqTrainer

# model, tokenizer, train_ds, val_ds and compute_metrics are assumed to be defined earlier.
args = Seq2SeqTrainingArguments(
    output_dir="./output",
    optim="adafactor",
    label_smoothing_factor=0.15,
    learning_rate=3e-4,
    fp16=False,  # use FP32 for byte-level models (ByT5) to avoid NaN
    per_device_train_batch_size=2,
    gradient_accumulation_steps=8,  # effective batch = 2 * 8 = 16 per device
    num_train_epochs=10,
    weight_decay=0.01,
    predict_with_generate=True,
    save_strategy="epoch",
    eval_strategy="epoch",  # named `evaluation_strategy` in older transformers releases
    save_total_limit=2,
    load_best_model_at_end=True,
    metric_for_best_model="geo_mean",  # compute_metrics must return a "geo_mean" key
)
trainer = Seq2SeqTrainer(
    model=model,
    args=args,
    tokenizer=tokenizer,
    train_dataset=train_ds,
    eval_dataset=val_ds,
    compute_metrics=compute_metrics,
)
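To show what a setting like `label_smoothing_factor=0.15` does to the loss, the sketch below implements the standard uniform-mixing form of label smoothing for a single token in plain Python. The function name and the toy logits are illustrative assumptions. Treat it as an explanation of the idea, not as the Trainer's exact code path.

```python
import math


def smoothed_cross_entropy(logits, target, epsilon):
    """Cross-entropy against a target mixed with a uniform distribution.

    loss = (1 - epsilon) * NLL(target) + epsilon * mean over vocab of -log p
    """
    # Numerically stable log-softmax.
    peak = max(logits)
    log_norm = peak + math.log(sum(math.exp(x - peak) for x in logits))
    log_probs = [x - log_norm for x in logits]

    nll = -log_probs[target]
    uniform = -sum(log_probs) / len(log_probs)
    return (1.0 - epsilon) * nll + epsilon * uniform


# Toy 4-token vocabulary; token 0 is the correct target in both cases.
confident = [10.0, 0.0, 0.0, 0.0]  # near-certain prediction
moderate = [2.0, 0.0, 0.0, 0.0]    # correct but less extreme

for name, logits in [("confident", confident), ("moderate", moderate)]:
    plain = smoothed_cross_entropy(logits, target=0, epsilon=0.0)
    smooth = smoothed_cross_entropy(logits, target=0, epsilon=0.15)
    print(f"{name}: plain={plain:.4f} smoothed={smooth:.4f}")
```

With `epsilon=0.0` the near-certain prediction has almost zero loss. With smoothing it is penalized more than the moderate one, and that penalty is what discourages overconfident logits.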