skills/nlp/weight-decay-bias-exclusion/SKILL.md
Exclude bias terms and LayerNorm weights from weight decay to prevent regularization from distorting normalization layers
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add wenmin-wu/ds-skills nlp-weight-decay-bias-exclusion
Weight decay (L2 regularization) should apply only to weight matrices, not to bias terms or LayerNorm parameters. Decaying biases pulls them toward zero unnecessarily, and decaying LayerNorm weights pulls the learned scale toward zero and distorts it. The standard practice when fine-tuning transformers is to split parameters into two groups: one that uses the chosen weight decay and one whose decay is set to zero.
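To make the pull toward zero concrete, here is a minimal arithmetic sketch. It assumes the decoupled form of weight decay used by AdamW, in which each step multiplies a decayed parameter by (1 - lr * weight_decay), and it assumes a zero gradient and a constant learning rate so that decay is the only force acting. The helper name decayed_value and the exaggerated learning rate are illustrative choices, not part of the skill.

```python
def decayed_value(initial, lr, weight_decay, steps):
    """Value of a parameter after `steps` updates driven only by weight decay.

    Assumes a constant learning rate and a zero gradient, so each step
    just multiplies the parameter by (1 - lr * weight_decay).
    """
    return initial * (1.0 - lr * weight_decay) ** steps


# A LayerNorm gain starts at 1.0. The learning rate is deliberately
# exaggerated (0.1 instead of 2e-5) so the drift is easy to see.
gain_with_decay = decayed_value(1.0, lr=0.1, weight_decay=0.01, steps=1000)
gain_no_decay = decayed_value(1.0, lr=0.1, weight_decay=0.0, steps=1000)

print(f"with decay:    {gain_with_decay:.4f}")
print(f"without decay: {gain_no_decay:.4f}")
```

In a real run the gradient pushes back, so a decayed gain settles at a compromise rather than at zero. Setting the decay to zero for that group removes the pull entirely.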
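from torch.optim import AdamW


def get_optimizer_grouped_parameters(model, lr=2e-5, weight_decay=0.01):
    """Split parameters into decay and no-decay groups."""
    # Substrings matched against parameter names; 'bias' also covers 'LayerNorm.bias'.
    no_decay = ['bias', 'LayerNorm.bias', 'LayerNorm.weight']
    return [
        {
            # Weight matrices: decayed.
            'params': [p for n, p in model.named_parameters()
                       if not any(nd in n for nd in no_decay)],
            'weight_decay': weight_decay,
            'lr': lr,  # without this, AdamW falls back to its default lr
        },
        {
            # Biases and LayerNorm parameters: not decayed.
            'params': [p for n, p in model.named_parameters()
                       if any(nd in n for nd in no_decay)],
            'weight_decay': 0.0,
            'lr': lr,
        },
    ]


# model is assumed to be an already constructed torch.nn.Module.
optimizer = AdamW(
    get_optimizer_grouped_parameters(model, lr=2e-5, weight_decay=0.01)
)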
any(nd in n ...) matches substrings of parameter names, so verify that the naming convention matches your model.
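One way to run that check is to apply the same substring test to the names alone and inspect what ends up in the decayed group. The sketch below is pure Python. The helper split_names and both name lists are illustrative assumptions; in practice the names would come from [n for n, _ in model.named_parameters()].

```python
NO_DECAY = ('bias', 'LayerNorm.bias', 'LayerNorm.weight')


def split_names(param_names, no_decay=NO_DECAY):
    """Partition names with the same substring test used for the optimizer groups."""
    decay, skip = [], []
    for name in param_names:
        if any(nd in name for nd in no_decay):
            skip.append(name)
        else:
            decay.append(name)
    return decay, skip


# Illustrative names only: one model that calls its norm layers "LayerNorm"
# and one that calls them "ln_1".
bert_style = [
    'encoder.layer.0.output.dense.weight',
    'encoder.layer.0.output.dense.bias',
    'encoder.layer.0.output.LayerNorm.weight',
    'encoder.layer.0.output.LayerNorm.bias',
]
other_style = [
    'h.0.mlp.c_fc.weight',
    'h.0.mlp.c_fc.bias',
    'h.0.ln_1.weight',  # norm gain whose name lacks "LayerNorm"
    'h.0.ln_1.bias',
]

bert_decay, bert_skip = split_names(bert_style)
other_decay, other_skip = split_names(other_style)

print('BERT-style decayed: ', bert_decay)
print('other-style decayed:', other_decay)
```

In the second list the norm gain h.0.ln_1.weight lands in the decayed group, because nothing in no_decay matches it. Adding a substring that fits the model's own naming (here, hypothetically, 'ln_') to no_decay would move it to the no-decay group.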