skills/nlp/gradient-checkpointing/SKILL.md
Trades compute for memory by recomputing intermediate activations during backprop instead of storing them, reducing activation memory from O(n) to O(sqrt(n)) in the number of layers n.
npx skillsauth add wenmin-wu/ds-skills nlp-gradient-checkpointing
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
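Training a large transformer normally keeps every intermediate activation in memory until the backward pass, which can consume a large share of GPU memory. Gradient checkpointing keeps only a subset of activations (the checkpoints) during the forward pass and recomputes the rest on the fly during the backward pass. With checkpoints placed roughly every sqrt(n_layers) layers, activation memory drops from O(n_layers) to O(sqrt(n_layers)), at the cost of about 30% slower training because of the extra forward computation. This makes it essential for fine-tuning large models on limited GPU memory.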
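To make the O(sqrt(n_layers)) figure concrete, here is a small pure-Python sketch of a simplified cost model. It assumes one checkpoint is kept per segment and that one segment's activations are recomputed and held at a time during the backward pass. The function name `peak_stored_activations` and the cost model itself are illustrative assumptions, not measurements from any library.

```python
import math

def peak_stored_activations(n_layers: int, segment_size: int) -> int:
    """Approximate number of layer activations held at once during backward.

    Simplified model (an assumption for illustration): one checkpoint is kept
    at the start of every segment, and the activations of the segment being
    backpropagated are recomputed and held temporarily.
    """
    n_checkpoints = math.ceil(n_layers / segment_size)
    return n_checkpoints + segment_size

n_layers = 64
no_checkpointing = n_layers  # without checkpointing, every activation is stored

# Search all segment sizes for the one that minimizes peak storage.
best_k = min(
    range(1, n_layers + 1),
    key=lambda k: peak_stored_activations(n_layers, k),
)
print(best_k, peak_stored_activations(n_layers, best_k), no_checkpointing)
# prints: 8 16 64
```

Under this model the best segment size for 64 layers is 8, which is sqrt(64), and peak storage falls from 64 activations to 16. Real savings depend on the architecture and on where the framework places its checkpoints.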
from transformers import AutoModel
model = AutoModel.from_pretrained("microsoft/deberta-v3-base")
model.gradient_checkpointing_enable()
# Now train as usual — memory usage drops significantly
# Pair with mixed precision for maximum memory savings
- Call model.gradient_checkpointing_enable() before training
- Pair with mixed precision (torch.cuda.amp) for further savings
- Use torch.utils.checkpoint.checkpoint() for custom models
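For custom models, a minimal sketch of wrapping blocks in `torch.utils.checkpoint.checkpoint()` might look like the following. The `CheckpointedMLP` class, its sizes, and the `use_checkpoint` flag are hypothetical names chosen for illustration, and the `use_reentrant=False` argument assumes a reasonably recent PyTorch release. The sketch runs on CPU and checks that checkpointed and plain runs produce the same gradients.

```python
import torch
import torch.nn as nn
from torch.utils.checkpoint import checkpoint

class CheckpointedMLP(nn.Module):
    """Hypothetical toy model whose blocks are recomputed during backward."""

    def __init__(self, dim: int = 32, n_blocks: int = 4, use_checkpoint: bool = True):
        super().__init__()
        self.blocks = nn.ModuleList(
            [nn.Sequential(nn.Linear(dim, dim), nn.GELU()) for _ in range(n_blocks)]
        )
        self.use_checkpoint = use_checkpoint

    def forward(self, x):
        for block in self.blocks:
            if self.use_checkpoint and self.training:
                # Do not store this block's internal activations;
                # they are recomputed when backward reaches the block.
                x = checkpoint(block, x, use_reentrant=False)
            else:
                x = block(x)
        return x

torch.manual_seed(0)
model = CheckpointedMLP()
model.train()
x = torch.randn(8, 32)

# Backward pass with checkpointing enabled.
model(x).sum().backward()
grad_ckpt = model.blocks[0][0].weight.grad.clone()

# Same backward pass with checkpointing disabled, for comparison.
model.zero_grad()
model.use_checkpoint = False
model(x).sum().backward()
grad_plain = model.blocks[0][0].weight.grad.clone()

grads_match = torch.allclose(grad_ckpt, grad_plain, atol=1e-6)
print(grads_match)
```

Checkpointing changes only what is kept in memory, not the computed gradients, so the two runs should agree up to floating-point tolerance.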