skills/nlp/layerwise-lr-decay/SKILL.md
Applies different learning rates to the transformer encoder and the task-specific head, with no weight decay on bias and LayerNorm parameters.
npx skillsauth add wenmin-wu/ds-skills nlp-layerwise-lr-decay
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Pretrained transformer layers need gentle updates to preserve what they learned during pretraining, while the randomly initialized task head has to learn from scratch and benefits from much larger steps. Use a lower learning rate for the encoder (default 2e-5) and a higher one for the task head (default 1e-3, passed as decoder_lr). Exclude bias and LayerNorm parameters from weight decay to prevent underfitting.
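def get_optimizer_params(model, encoder_lr=2e-5, decoder_lr=1e-3, weight_decay=0.01):
    # Assumes the pretrained encoder is stored as `model.model`; every other
    # top-level submodule is treated as part of the task head.
    no_decay = ["bias", "LayerNorm.bias", "LayerNorm.weight"]
    params = [
        # Encoder weights: low learning rate, with weight decay.
        {"params": [p for n, p in model.model.named_parameters()
                    if not any(nd in n for nd in no_decay)],
         "lr": encoder_lr, "weight_decay": weight_decay},
        # Encoder biases and LayerNorm parameters: low learning rate, no weight decay.
        {"params": [p for n, p in model.model.named_parameters()
                    if any(nd in n for nd in no_decay)],
         "lr": encoder_lr, "weight_decay": 0.0},
        # Task head: high learning rate, no weight decay. Matching on the "model."
        # prefix keeps head parameters whose names merely contain "model".
        {"params": [p for n, p in model.named_parameters()
                    if not n.startswith("model.")],
         "lr": decoder_lr, "weight_decay": 0.0},
    ]
    return params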
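Because the grouping depends only on parameter names, it can be checked without loading a model. The snippet below is a minimal sketch of that check, not part of the original skill: `assign_group` is a hypothetical helper, the parameter names are invented to resemble BERT-style output from `model.named_parameters()`, and the encoder is assumed to live under the `model.` prefix.

```python
from collections import Counter

# Same name fragments the optimizer helper uses to skip weight decay.
NO_DECAY = ("bias", "LayerNorm.bias", "LayerNorm.weight")


def assign_group(name, encoder_prefix="model."):
    """Return the parameter group a name falls into (hypothetical helper)."""
    if not name.startswith(encoder_prefix):
        return "head"  # high learning rate, no weight decay
    if any(nd in name for nd in NO_DECAY):
        return "encoder_no_decay"  # low learning rate, no weight decay
    return "encoder_decay"  # low learning rate, with weight decay


# Invented names for illustration; real names depend on the architecture.
names = [
    "model.embeddings.word_embeddings.weight",
    "model.embeddings.LayerNorm.weight",
    "model.encoder.layer.0.attention.self.query.weight",
    "model.encoder.layer.0.attention.self.query.bias",
    "model.encoder.layer.0.output.LayerNorm.bias",
    "fc.weight",
    "fc.bias",
]

groups = {name: assign_group(name) for name in names}
counts = Counter(groups.values())

for name, group in groups.items():
    print(f"{group:17s} {name}")
```

Each name lands in exactly one group, so no parameter is optimized twice or left out. In training, the list returned by `get_optimizer_params(model)` would typically be passed straight to an optimizer such as `torch.optim.AdamW`, which accepts per-group `lr` and `weight_decay` values.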