skills/llm/4bit-nf4-double-quantization/SKILL.md
Load LLMs in 4-bit NF4 precision, with optional double quantization, via BitsAndBytes to cut weight memory by roughly 3.5-4x relative to fp16 while largely preserving inference quality
npx skillsauth add wenmin-wu/ds-skills llm-4bit-nf4-double-quantization

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
NormalFloat4 (NF4) quantization maps weights to a 4-bit data type whose 16 levels are chosen for normally distributed neural network weights. Double quantization further compresses the quantization constants themselves, that is, the per-block scaling factors stored alongside the 4-bit codes. Together they reduce a 7B model from ~14GB (fp16) to ~4GB, fitting on a single consumer GPU. Quality loss is typically small for inference tasks like scoring, generation, and classification.
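To make the mapping concrete, here is a minimal pure-Python sketch of blockwise absmax quantization onto an NF4-style codebook. It is an illustration, not the bitsandbytes kernel: the level values are the NF4 codes as listed in the QLoRA paper, rounded to four decimals, and the helper names `quantize_block` and `dequantize_block` are hypothetical. Real implementations work on blocks of 64 weights and pack two 4-bit codes per byte; the tiny block below is only for readability.

```python
# NF4 code values as listed in the QLoRA paper, rounded to 4 decimals (assumption:
# rounding is for readability; the library stores them at full float precision).
NF4_LEVELS = [
    -1.0, -0.6962, -0.5251, -0.3949, -0.2844, -0.1848, -0.0911, 0.0,
    0.0796, 0.1609, 0.2461, 0.3379, 0.4407, 0.5626, 0.7230, 1.0,
]


def quantize_block(weights):
    """Return (absmax, codes): one scale per block and one 4-bit index per weight."""
    # The block's largest magnitude is the "quantization constant".
    absmax = max(abs(w) for w in weights) or 1.0  # fall back to 1.0 for an all-zero block
    codes = []
    for w in weights:
        x = w / absmax  # normalize into [-1, 1]
        # Pick the index of the nearest codebook level.
        codes.append(min(range(len(NF4_LEVELS)), key=lambda i: abs(NF4_LEVELS[i] - x)))
    return absmax, codes


def dequantize_block(absmax, codes):
    """Map 4-bit indices back to approximate weights."""
    return [NF4_LEVELS[c] * absmax for c in codes]


block = [0.02, -0.11, 0.30, -0.45, 0.07, 0.0]  # toy weights, hypothetical values
absmax, codes = quantize_block(block)
restored = dequantize_block(absmax, codes)
```

Each weight is stored as an index in the range 0 to 15, and the only full-precision value kept per block is `absmax`. Double quantization targets exactly those per-block `absmax` values.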
```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer, BitsAndBytesConfig

# 4-bit NF4 weights, bfloat16 compute, and double quantization of the scaling constants
bnb_config = BitsAndBytesConfig(
    load_in_4bit=True,
    bnb_4bit_quant_type="nf4",
    bnb_4bit_compute_dtype=torch.bfloat16,
    bnb_4bit_use_double_quant=True,
)

# Weights are quantized while loading; device_map="auto" places layers on available devices
model = AutoModelForCausalLM.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1",
    quantization_config=bnb_config,
    device_map="auto",
    torch_dtype=torch.bfloat16,
)
tokenizer = AutoTokenizer.from_pretrained(
    "mistralai/Mistral-7B-Instruct-v0.1")
```
- `BitsAndBytesConfig` with NF4 quant type (`bnb_4bit_quant_type="nf4"`) and bfloat16 compute dtype
- `bnb_4bit_use_double_quant=True` for additional memory savings (~0.4GB on 7B)
- `device_map="auto"` for automatic GPU placement
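The memory figures above can be sanity-checked with a back-of-the-envelope calculation. The sketch below assumes the block sizes described in the QLoRA paper (64 weights per quantization block, 256 constants per second-level block, fp32 constants quantized to 8 bits); the function names are hypothetical, and the estimate covers quantized weights only, ignoring embeddings, any layers left in 16-bit, activations, and the KV cache.

```python
def bits_per_param(double_quant, block_size=64, dq_block_size=256):
    """Approximate storage cost per quantized weight, in bits."""
    weight_bits = 4.0
    if double_quant:
        # 8-bit first-level constants, plus one fp32 second-level constant
        # shared by dq_block_size first-level constants.
        const_bits = 8.0 / block_size + 32.0 / (block_size * dq_block_size)
    else:
        # One fp32 constant per block of weights.
        const_bits = 32.0 / block_size
    return weight_bits + const_bits


def weight_gb(n_params, bits):
    """Convert a parameter count and bits-per-parameter into gigabytes (1e9 bytes)."""
    return n_params * bits / 8 / 1e9


n_params = 7e9  # nominal 7B model; actual parameter counts vary by checkpoint
fp16_gb = weight_gb(n_params, 16)
nf4_gb = weight_gb(n_params, bits_per_param(double_quant=False))
nf4_dq_gb = weight_gb(n_params, bits_per_param(double_quant=True))

print(f"fp16: {fp16_gb:.2f} GB")
print(f"NF4: {nf4_gb:.2f} GB")
print(f"NF4 + double quant: {nf4_dq_gb:.2f} GB")
print(f"saved by double quant: {nf4_gb - nf4_dq_gb:.2f} GB")
```

Under these assumptions double quantization saves a few tenths of a gigabyte on a 7B model, the same order of magnitude as the figure quoted in the list. The measured footprint of a loaded model will usually be somewhat higher than the 4-bit estimate because not every tensor is quantized.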