skills/nlp/in-task-pretraining/SKILL.md
Further pretrains a transformer with masked language modeling on the target task's own text before fine-tuning.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add wenmin-wu/ds-skills nlp-in-task-pretraining
Before fine-tuning a pretrained transformer on your target task, run additional masked language modeling (MLM) on the task's own text corpus. No labels are needed for this step, only the raw text. It adapts the model's representations to the domain and vocabulary of your data and often improves downstream performance, with RMSE typically dropping by about 0.5-2% on regression tasks.
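To make the MLM objective concrete, here is a minimal pure-Python sketch of the masking step, assuming the standard BERT-style rule (of the selected tokens, 80% become the mask token, 10% become a random token, 10% stay unchanged). The function `mask_tokens`, its parameters, and the token ids in the example are illustrative assumptions, not part of the original skill.

```python
import random


def mask_tokens(token_ids, mask_id, vocab_size, special_ids=(),
                mlm_probability=0.15, seed=None):
    """Return (inputs, labels) for one sequence under BERT-style MLM masking.

    labels hold the original token at selected positions and -100 elsewhere;
    -100 is the index that PyTorch's cross-entropy loss ignores by default.
    """
    rng = random.Random(seed)
    inputs = list(token_ids)
    labels = [-100] * len(inputs)
    for i, tok in enumerate(token_ids):
        # Never mask special tokens; select other tokens with mlm_probability
        if tok in special_ids or rng.random() >= mlm_probability:
            continue
        labels[i] = tok  # the model must predict the original token here
        roll = rng.random()
        if roll < 0.8:
            inputs[i] = mask_id                    # 80%: replace with mask token
        elif roll < 0.9:
            inputs[i] = rng.randrange(vocab_size)  # 10%: replace with random token
        # remaining 10%: leave the token unchanged
    return inputs, labels


# Illustrative ids only: 0 and 2 stand in for the start/end special tokens
original = [0, 713, 16, 10, 1296, 3645, 2]
inputs, labels = mask_tokens(
    original, mask_id=50264, vocab_size=50265,
    special_ids=(0, 2), mlm_probability=0.15, seed=0,
)
print(inputs)
print(labels)
```

In the Hugging Face pipeline this step is handled per batch by `DataCollatorForLanguageModeling`, so the masks differ from epoch to epoch.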
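from transformers import (
    AutoModelForMaskedLM, AutoTokenizer,
    DataCollatorForLanguageModeling, TrainingArguments, Trainer,
)

tokenizer = AutoTokenizer.from_pretrained("roberta-base")
model = AutoModelForMaskedLM.from_pretrained("roberta-base")

# Tokenize task text. `dataset` is assumed to be a datasets.Dataset
# with a "text" column containing the task's raw text.
def tokenize(examples):
    return tokenizer(examples["text"], truncation=True, max_length=256)

# Drop the original columns so only tokenizer outputs reach the collator
dataset = dataset.map(tokenize, batched=True, remove_columns=dataset.column_names)

# Randomly select 15% of tokens in each batch as MLM prediction targets
collator = DataCollatorForLanguageModeling(tokenizer=tokenizer, mlm_probability=0.15)

args = TrainingArguments(
    output_dir="./itpt", num_train_epochs=5,
    per_device_train_batch_size=16, learning_rate=5e-5,
)
trainer = Trainer(model=model, args=args, train_dataset=dataset, data_collator=collator)
trainer.train()

# Save the adapted weights and the tokenizer together for the fine-tuning step
model.save_pretrained("./itpt-roberta")
tokenizer.save_pretrained("./itpt-roberta")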
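The saved checkpoint is then the starting point for fine-tuning. The sketch below shows one way that hand-off could look, assuming a regression target (hence `num_labels=1`). So that it runs offline, it uses a tiny randomly initialised RoBERTa built from a hypothetical config as a stand-in for the real checkpoint; none of these sizes come from the original skill.

```python
import tempfile

import torch
from transformers import (
    AutoModelForSequenceClassification, RobertaConfig, RobertaForMaskedLM,
)

# Hypothetical tiny config: a randomly initialised stand-in for the
# further-pretrained checkpoint, so nothing has to be downloaded.
config = RobertaConfig(
    vocab_size=100, hidden_size=32, num_hidden_layers=2,
    num_attention_heads=2, intermediate_size=64, max_position_embeddings=66,
)
mlm_model = RobertaForMaskedLM(config)

with tempfile.TemporaryDirectory() as ckpt_dir:
    mlm_model.save_pretrained(ckpt_dir)  # plays the role of "./itpt-roberta"
    # Reload the encoder under a task head; num_labels=1 gives a
    # single-output head suitable for regression. The MLM head is
    # discarded and the new head is freshly initialised.
    reg_model = AutoModelForSequenceClassification.from_pretrained(
        ckpt_dir, num_labels=1
    )

# The encoder weights carry over unchanged from the MLM checkpoint
same_encoder = torch.equal(
    mlm_model.roberta.embeddings.word_embeddings.weight,
    reg_model.roberta.embeddings.word_embeddings.weight,
)

# One forward pass on dummy token ids to check the output shape
reg_model.eval()
with torch.no_grad():
    logits = reg_model(input_ids=torch.tensor([[0, 5, 6, 2]])).logits
print(same_encoder, tuple(logits.shape))
```

With the real checkpoint, the same call would point at the saved directory, for example `AutoModelForSequenceClassification.from_pretrained("./itpt-roberta", num_labels=1)`. The library warns that the task head's weights are newly initialised, which is expected because only the encoder was trained during MLM.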