skills/nlp/encoder-decoder-lr-split/SKILL.md
Use separate learning rates for pretrained backbone (low) and randomly initialized classification head (high)
npx skillsauth add wenmin-wu/ds-skills nlp-encoder-decoder-lr-split
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Pretrained transformer backbones need a lower learning rate to avoid catastrophic forgetting, while the randomly initialized classification head needs a higher learning rate to converge quickly. Split the optimizer's parameter groups into encoder (the backbone) and decoder (the head), each with its own learning rate.
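import torch

def get_optimizer_params(model, encoder_lr=2e-5, decoder_lr=1e-3, weight_decay=0.01):
    # Biases and LayerNorm parameters are excluded from weight decay.
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    return [
        # Backbone weights: low LR, with weight decay.
        {"params": [p for n, p in model.backbone.named_parameters()
                    if not any(nd in n for nd in no_decay)],
         "lr": encoder_lr, "weight_decay": weight_decay},
        # Backbone biases and LayerNorm parameters: low LR, no weight decay.
        {"params": [p for n, p in model.backbone.named_parameters()
                    if any(nd in n for nd in no_decay)],
         "lr": encoder_lr, "weight_decay": 0.0},
        # Everything outside the backbone (the head): high LR, no weight decay.
        {"params": [p for n, p in model.named_parameters()
                    if "backbone" not in n],
         "lr": decoder_lr, "weight_decay": 0.0},
    ]

# `model` is assumed to expose its pretrained encoder as `model.backbone`.
optimizer = torch.optim.AdamW(get_optimizer_params(model))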
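Because the groups above are selected purely by parameter name, it can be worth checking which group each name lands in before training. The following is a minimal, torch-free sketch of that check, not part of the skill itself: `summarize_groups` is a hypothetical helper, the parameter names are made-up examples in the style of a BERT-like model, and the filters are assumed to mirror the three comprehensions in `get_optimizer_params`.

```python
from collections import defaultdict

# Same substrings as `no_decay` in get_optimizer_params.
NO_DECAY = ("bias", "LayerNorm.bias", "LayerNorm.weight")


def summarize_groups(param_names):
    """Map each full parameter name to the group the name filters would pick.

    Hypothetical helper: it only inspects names, so it runs without torch.
    """
    groups = defaultdict(list)
    for name in param_names:
        if name.startswith("backbone."):
            # model.backbone.named_parameters() yields names without the prefix.
            inner = name[len("backbone."):]
            if any(nd in inner for nd in NO_DECAY):
                key = "encoder_no_decay"
            else:
                key = "encoder_decay"
        elif "backbone" not in name:
            key = "head"
        else:
            # Matched by none of the three filters: never reaches the optimizer.
            key = "unassigned"
        groups[key].append(name)
    return dict(groups)


# Made-up names for illustration; real names depend on the model.
names = [
    "backbone.embeddings.word_embeddings.weight",
    "backbone.encoder.layer.0.attention.self.query.weight",
    "backbone.encoder.layer.0.attention.self.query.bias",
    "backbone.encoder.layer.0.attention.output.LayerNorm.weight",
    "backbone.encoder.layer.0.attention.output.LayerNorm.bias",
    "fc.weight",
    "fc.bias",
]
groups = summarize_groups(names)
for key, members in groups.items():
    print(key, len(members))
```

With a real model, the names could be taken from `[n for n, _ in model.named_parameters()]`. An `unassigned` entry would indicate a parameter outside `model.backbone` whose name still contains the substring `backbone`; such a parameter is skipped by all three filters and so would not be updated.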