skills/llm/vllm-lora-adapter-inference/SKILL.md
Serve a quantized base LLM with a hot-swappable LoRA adapter under vLLM, enabling prefix caching and tensor parallelism so a single fine-tuned adapter runs at production throughput without merging weights
npx skillsauth add wenmin-wu/ds-skills llm-vllm-lora-adapter-inference
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
After LoRA-fine-tuning a base model, the naive deploy is to merge the adapter into the base and ship the merged checkpoint. That kills two things you actually want on Kaggle: (1) the GPTQ quantization on the base, which merging breaks, and (2) the ability to A/B multiple adapters against the same loaded base. vLLM solves both: load the quantized base once with enable_lora=True, then attach the adapter per-request with LoRARequest. Prefix caching reuses the system-prompt KV across all rows in a batch, which on a 4k-context classification task is usually a 3–5x throughput win.
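Prefix caching can only reuse KV blocks when the shared text sits at the very start of every prompt and is identical across rows. A minimal sketch of building such prompts follows; `SYSTEM_PROMPT`, `build_prompts`, and the prompt layout are hypothetical and not part of the original skill. For an instruct model you would more likely render the chat template, keeping the system message first.

```python
# Hypothetical system prompt; any fixed instruction text works the same way.
SYSTEM_PROMPT = "You are a strict classifier. Answer Yes or No.\n\n"


def build_prompts(rows, system_prompt=SYSTEM_PROMPT):
    """Put the identical system prompt first so every row shares one prefix."""
    return [f"{system_prompt}Text: {row}\nAnswer:" for row in rows]


prompts = build_prompts(["first example row", "second example row"])
```

Anything that varies per row (IDs, timestamps, the text to classify) belongs after the shared prefix; a varying token early in the prompt ends the reusable prefix at that point.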
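from vllm import LLM, SamplingParams
from vllm.lora.request import LoRARequest

# BASE_MODEL, LORA_PATH, and prompts are supplied by the caller.
llm = LLM(
    model=BASE_MODEL,              # e.g. Qwen2.5-7B-Instruct-GPTQ-Int4
    quantization='gptq',           # must match the base checkpoint's format
    tensor_parallel_size=2,        # number of GPUs to shard the base across
    enable_lora=True,
    max_lora_rank=64,              # must be >= the adapter's training rank
    enable_prefix_caching=True,    # reuse KV for the shared system prompt
    max_model_len=4096,
    gpu_memory_utilization=0.92,
)

# Greedy decoding, a few tokens, top-20 logprobs per generated token.
sampling = SamplingParams(temperature=0, max_tokens=4, logprobs=20)

# LoRARequest(name, unique int id, adapter path)
lora = LoRARequest('default', 1, LORA_PATH)
outputs = llm.generate(prompts, sampling, lora_request=lora)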
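Because the call above requests `logprobs=20`, the first generated position carries enough information to score classes without sampling. The helper below is a sketch, not the skill author's code: `class_probs` and the `"Yes"`/`"No"` labels are assumptions, and it expects a plain mapping from token string to log-probability.

```python
import math


def class_probs(top_logprobs, labels=("Yes", "No")):
    """Renormalize the label tokens' log-probabilities into class probabilities.

    top_logprobs maps token string -> log-probability at the first generated
    position. Labels missing from the top-k get probability 0.
    """
    scores = {lab: top_logprobs.get(lab, float("-inf")) for lab in labels}
    peak = max(scores.values())
    if peak == float("-inf"):
        # No label appeared in the top-k: fall back to a uniform guess.
        return {lab: 1.0 / len(labels) for lab in labels}
    # Subtract the peak before exponentiating for numerical stability.
    weights = {lab: math.exp(score - peak) for lab, score in scores.items()}
    total = sum(weights.values())
    return {lab: weight / total for lab, weight in weights.items()}


# With vLLM, the mapping for prompt i could be built roughly as:
#   first = outputs[i].outputs[0].logprobs[0]   # dict: token_id -> Logprob
#   top = {lp.decoded_token: lp.logprob for lp in first.values()}
#   probs = class_probs(top)
```

Tokenizers often emit label variants such as a leading space or different casing, so in practice you may need to normalize `decoded_token` or sum over variants before renormalizing.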
1. Load the base with enable_lora=True and a quantization= value that matches the checkpoint ('gptq' above).
2. Set max_lora_rank to match your training rank (the default of 16 is too small for a rank-64 adapter).
3. Build LoRARequest(name, int_id, path); the int_id must be unique per adapter.
4. Pass lora_request= on each generate call (or per-prompt as a list).
5. Set enable_prefix_caching=True so repeated system prompts are KV-cached across the batch.
6. Set tensor_parallel_size=N for the visible GPU count; vLLM shards the base weights, and the adapter is replicated.

- max_lora_rank must be ≥ training rank: vLLM does not infer the rank from the adapter. Recent vLLM releases reject an adapter whose rank exceeds max_lora_rank with an error at load time rather than truncating it, so set the value explicitly.
- max_model_len=4096, not 32k: classification prompts rarely need long context, and a smaller max_model_len frees KV cache for a larger batch.
- temperature=0 + tiny max_tokens: for binary scoring you want one token, chosen deterministically; combine with logprobs to get class probabilities without sampling.

Related skill: multi-gpu-process-isolated-vllm
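Since the rank has to be set by hand, one way to avoid a mismatch is to read it from the adapter directory before constructing the engine. The sketch below assumes the adapter was saved by PEFT, which writes an `adapter_config.json` containing the rank under the key `r`; `adapter_rank` and `check_max_lora_rank` are hypothetical helper names.

```python
import json
from pathlib import Path


def adapter_rank(lora_path):
    """Return the LoRA rank `r` recorded in a PEFT adapter_config.json."""
    config_file = Path(lora_path) / "adapter_config.json"
    config = json.loads(config_file.read_text())
    return int(config["r"])


def check_max_lora_rank(lora_path, max_lora_rank):
    """Fail early if the engine's max_lora_rank is below the adapter's rank."""
    rank = adapter_rank(lora_path)
    if rank > max_lora_rank:
        raise ValueError(
            f"adapter rank {rank} exceeds max_lora_rank={max_lora_rank}"
        )
    return rank


# Possible usage before building the engine:
#   check_max_lora_rank(LORA_PATH, 64)
```

Running the check before `LLM(...)` is constructed surfaces a mismatch before the base weights are loaded.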