skills/llm/threaded-multi-gpu-inference/SKILL.md
Run multiple LLM inference jobs in parallel using Python threads, each pinned to a separate GPU with staggered starts
npx skillsauth add wenmin-wu/ds-skills llm-threaded-multi-gpu-inference
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
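When you have several LLMs to run and several GPUs available, run them in parallel with Python threads, pinning each model to its own GPU. Threads are sufficient here largely because PyTorch releases the GIL during most native operations, so the per-GPU workloads can overlap. Stagger thread starts by ~10 seconds so the models do not all allocate memory at the same moment. Each thread writes its results into a shared structure keyed by model name (a dict in the example; the same pattern works for DataFrames). With N models of similar cost on N GPUs, wall-clock time drops by up to a factor of N.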
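As a rough check on the expected speedup, the sketch below compares sequential and staggered-parallel wall-clock time. It assumes the jobs do not slow each other down (no contention on CPU-side tokenization or disk I/O), which is optimistic; `estimate_wall_clock` and the job durations are hypothetical and only illustrate the arithmetic.

```python
def estimate_wall_clock(job_seconds, stagger_seconds=10.0):
    """Return (sequential, parallel) wall-clock estimates in seconds.

    Job i starts at i * stagger_seconds and runs for job_seconds[i],
    so the parallel run ends when the last job finishes.
    """
    sequential = sum(job_seconds)
    parallel = max(i * stagger_seconds + t for i, t in enumerate(job_seconds))
    return sequential, parallel


# Hypothetical durations: two models taking 600 s and 480 s on their own GPUs.
seq, par = estimate_wall_clock([600.0, 480.0])
print(f"sequential={seq:.0f}s parallel={par:.0f}s speedup={seq / par:.2f}x")
```

The parallel run is bounded by the slowest job plus its start offset, so the full factor of N is only approached when all jobs take about the same time.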
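import threading
import time

import torch

# NOTE: load_model, load_tokenizer, create_batches and test_data are not defined
# in this snippet. Supply your own loaders, batching helper and input texts.


def run_inference_on_gpu(model_path, gpu_id, data, output_dict, name):
    """Run LLM inference on a specific GPU."""
    device = f"cuda:{gpu_id}"
    model = load_model(model_path, device=device)
    tokenizer = load_tokenizer(model_path)
    results = []
    for batch in create_batches(data, batch_size=8):
        inputs = tokenizer(batch, return_tensors="pt", padding=True,
                           truncation=True).to(device)
        with torch.no_grad():
            outputs = model(**inputs)
        # Move logits to the CPU right away so they do not pile up on the GPU.
        results.append(outputs.logits.cpu())
    # torch.cat needs matching shapes after dim 0 (e.g. classification logits).
    # Per-token logits with varying padded lengths must be reduced or padded first.
    # Each thread writes to its own key, so no lock is needed.
    output_dict[name] = torch.cat(results)


# Launch parallel inference
results = {}
gpu_assignments = [
    ("path/to/gemma", 0, "gemma"),
    ("path/to/qwen", 1, "qwen"),
]
threads = []
for model_path, gpu_id, name in gpu_assignments:
    t = threading.Thread(
        target=run_inference_on_gpu,
        args=(model_path, gpu_id, test_data, results, name)
    )
    threads.append(t)
    t.start()
    time.sleep(10)  # stagger to avoid OOM
for t in threads:
    t.join()
# results["gemma"], results["qwen"] now available
# (a key is missing if its thread raised an exception)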
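
The snippet above leaves `create_batches` undefined, and an exception raised inside a thread is lost, leaving its key missing from `results`. A minimal sketch of one way to fill both gaps, using only the standard library: `create_batches` here is an assumed implementation of the helper, and `run_staggered` is a hypothetical launcher that pauses only between thread starts and records each job's exception instead of dropping it.

```python
import threading
import time


def create_batches(data, batch_size):
    """Yield consecutive slices of `data` with at most `batch_size` items."""
    for start in range(0, len(data), batch_size):
        yield data[start:start + batch_size]


def run_staggered(jobs, worker, stagger_seconds=10.0):
    """Run worker(*args) for each (name, args) job in its own thread.

    Returns (results, errors), two dicts keyed by job name.
    """
    results, errors = {}, {}

    def _wrapped(name, args):
        try:
            results[name] = worker(*args)
        except Exception as exc:  # keep the failure instead of losing it
            errors[name] = exc

    threads = []
    for i, (name, args) in enumerate(jobs):
        if i > 0:
            time.sleep(stagger_seconds)  # pause between starts only
        t = threading.Thread(target=_wrapped, args=(name, args), name=name)
        t.start()
        threads.append(t)

    for t in threads:
        t.join()
    return results, errors
```

To use it with the inference function, the worker would return its concatenated logits rather than writing into a shared dict, for example `run_staggered([("gemma", ("path/to/gemma", 0, test_data))], my_worker)`, where `my_worker` is your own adaptation of `run_inference_on_gpu`. Check `errors` after the call before reading `results`.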