skills/gpu-optimizer/SKILL.md
GPU optimization for consumer NVIDIA GPUs (8-24GB VRAM) covering mixed precision, gradient checkpointing, XGBoost GPU, CuPy/cuDF migration, and torch.compile. Triggers on: "optimize GPU training", "speed up CUDA", "reduce OOM", "migrate NumPy to CuPy", "manage GPU memory", "benchmark PyTorch".
npx skillsauth add mathews-tom/armory gpu-optimizer
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Expert GPU optimization for consumer GPUs with 8–24GB VRAM. Evidence-based patterns only.
Fill in your hardware before applying optimizations:
| Property | Your Value |
| ----------------- | ------------------------------------------------ |
| GPU model | (e.g., RTX 4080 Mobile, RTX 3090, RTX 4090) |
| VRAM | (e.g., 12GB, 16GB, 24GB) |
| CUDA version | (nvidia-smi → top-right) |
| TDP / power limit | (laptop vs desktop affects sustained throughput) |
| Driver version | (nvidia-smi → top-left) |
Key constraint: VRAM capacity determines which strategies apply. Patterns below are annotated with minimum VRAM requirements where relevant.
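Most of the table can be filled in programmatically. The following is a minimal sketch, assuming `nvidia-smi` is on the PATH and supports the `--query-gpu` fields listed in `QUERY_FIELDS`. The helper names (`parse_gpu_row`, `query_gpus`) and the sample row are hypothetical, and GPU names containing commas would need more careful parsing.

```python
import subprocess

# Fields requested from nvidia-smi (assumed to be supported by the installed driver)
QUERY_FIELDS = ["name", "memory.total", "driver_version", "power.limit"]


def parse_gpu_row(row: str) -> dict:
    """Turn one CSV row of nvidia-smi output into a {field: value} dict."""
    values = [v.strip() for v in row.split(",")]
    return dict(zip(QUERY_FIELDS, values))


def query_gpus() -> list:
    """Return one dict per visible GPU, or an empty list if nvidia-smi is unavailable."""
    cmd = [
        "nvidia-smi",
        "--query-gpu=" + ",".join(QUERY_FIELDS),
        "--format=csv,noheader",
    ]
    try:
        result = subprocess.run(cmd, capture_output=True, text=True, check=True)
    except (FileNotFoundError, subprocess.CalledProcessError):
        return []
    return [parse_gpu_row(line) for line in result.stdout.splitlines() if line.strip()]


# Hypothetical sample row, used only to illustrate the parsing step
sample_row = "NVIDIA GeForce RTX 3090, 24576 MiB, 535.104.05, 350.00 W"
sample_gpu = parse_gpu_row(sample_row)

if __name__ == "__main__":
    for gpu in query_gpus():
        print(gpu)
```

The CUDA version is not queried here; it is still easiest to read from the `nvidia-smi` header as the table suggests.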
DMatrix vs QuantileDMatrix:
import numpy as np
import xgboost as xgb
# GPU-optimized: QuantileDMatrix is 1.8x faster
dtrain = xgb.QuantileDMatrix(X_train.astype(np.float32))
# Validation data should reuse the training quantile cuts via ref=
dval = xgb.QuantileDMatrix(X_val.astype(np.float32), ref=dtrain)
# Standard: DMatrix (use for inference only)
dtest = xgb.DMatrix(X_test.astype(np.float32))
Critical Parameters:
params = {
    'tree_method': 'hist',       # Histogram algorithm; runs on the GPU when device is CUDA
    'device': 'cuda:0',          # Explicit GPU device (XGBoost 2.0+)
    'max_bin': 256,              # Higher bins = better splits (VRAM permitting)
    'grow_policy': 'depthwise',  # Alternative: 'lossguide' (leaf-wise growth)
}
# No 'predictor' entry: with XGBoost 2.0+ the 'device' setting also selects GPU inference
# Training with explicit device
model = xgb.train(params, dtrain, num_boost_round=100)
GPU Verification (fail-fast):
def verify_gpu():
    """Verify XGBoost GPU availability. Raises if unavailable."""
    import subprocess
    try:
        result = subprocess.run(["nvidia-smi"], capture_output=True, text=True)
        if result.returncode != 0:
            raise RuntimeError("nvidia-smi failed - no GPU available")
    except FileNotFoundError:
        raise RuntimeError("nvidia-smi not found - no GPU available")
    build_info = xgb.build_info()
    if not build_info.get("USE_CUDA"):
        raise RuntimeError("XGBoost not compiled with CUDA support")
Memory Management:
# Single-pass training (reuse QuantileDMatrix across slots)
dtrain = xgb.QuantileDMatrix(X_train.astype(np.float32))
models = []  # Keep one model per slot instead of overwriting
for slot_idx in range(num_slots):
    dtrain.set_label(y_train[:, slot_idx])  # Reuse matrix
    model = xgb.train(params, dtrain, num_boost_round=100)
    models.append(model)
BF16 (preferred) vs FP16:
import torch
from torch.amp import autocast, GradScaler
# Auto-detect best precision
if torch.cuda.is_bf16_supported():
    amp_dtype = torch.bfloat16  # Ampere+ GPUs support BF16
else:
    amp_dtype = torch.float16
# GradScaler is only needed for FP16
scaler = GradScaler('cuda') if amp_dtype == torch.float16 else None
# Training step
optimizer.zero_grad(set_to_none=True)
with autocast('cuda', dtype=amp_dtype):
    output = model(input_ids, attention_mask)
    loss = criterion(output, targets)
# Backward with scaling (FP16 only)
if scaler:
    scaler.scale(loss).backward()
    scaler.step(optimizer)
    scaler.update()
else:
    loss.backward()
    optimizer.step()
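Why BF16 > FP16: BF16 keeps the same 8-bit exponent range as FP32, so gradients rarely underflow and no loss scaling is needed; FP16 has a narrower range and relies on GradScaler, as in the code above.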
Gradient Checkpointing:
# Saves ~40% VRAM, adds ~20% compute time
# Hugging Face Transformers models:
model.gradient_checkpointing_enable()
# PEFT-wrapped models (the underlying model sits at base_model.model):
model.base_model.model.gradient_checkpointing_enable()
VRAM Monitoring:
import torch
torch.cuda.reset_peak_memory_stats()
# ... training ...
peak_vram_gb = torch.cuda.max_memory_allocated() / 1024**3
print(f"Peak VRAM: {peak_vram_gb:.2f} GB")
# Clear cache between experiments
torch.cuda.empty_cache()
Gradient Accumulation:
# Simulate larger batch size without OOM
grad_accum_steps = max(1, target_batch_size // actual_batch_size)
optimizer.zero_grad()
for i, batch in enumerate(dataloader):
    loss = model(batch) / grad_accum_steps  # Scale so accumulated gradients average
    loss.backward()
    if (i + 1) % grad_accum_steps == 0:
        optimizer.step()
        optimizer.zero_grad()
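The division by `grad_accum_steps` is what makes the accumulated gradient match a single large batch. The following is a small worked sketch in plain Python (no PyTorch), assuming a one-parameter linear model with mean-squared-error loss and equal-sized micro-batches. The data and the helper `grad_mse` are made up for illustration.

```python
import math


def grad_mse(w, xs, ys):
    """Gradient of mean((w * x - y) ** 2) with respect to w."""
    n = len(xs)
    return sum(2 * (w * x - y) * x for x, y in zip(xs, ys)) / n


# Hypothetical toy data: 8 samples
xs = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
ys = [2.0, 4.1, 5.9, 8.2, 9.8, 12.1, 14.0, 16.2]
w = 0.5

# Gradient from one full batch of 8
full_grad = grad_mse(w, xs, ys)

# Same gradient accumulated over micro-batches of 2, each scaled by 1 / steps
micro_batch = 2
steps = len(xs) // micro_batch
accum_grad = 0.0
for s in range(steps):
    lo, hi = s * micro_batch, (s + 1) * micro_batch
    accum_grad += grad_mse(w, xs[lo:hi], ys[lo:hi]) / steps

print(math.isclose(full_grad, accum_grad, rel_tol=1e-9))
```

The equivalence holds exactly only when every micro-batch has the same size and the loss is a mean over samples; layers with batch-dependent statistics (such as batch norm) behave differently.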
DoE for VRAM Optimization:
EXPERIMENTS = [
{"batch_size": 2, "seq_len": 128, "grad_ckpt": True, "amp": "bf16"},
{"batch_size": 4, "seq_len": 256, "grad_ckpt": True, "amp": "bf16"},
{"batch_size": 8, "seq_len": 512, "grad_ckpt": False, "amp": "bf16"},
{"batch_size": 16, "seq_len": 256, "grad_ckpt": False, "amp": "bf16"},
]
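One way to run such a grid is a small harness that records which configurations fit and which run out of memory. This is a sketch under assumptions: `run_fn` is a hypothetical callable that trains briefly with a config and returns peak VRAM in GB, and OOM is detected by the "out of memory" text in a `RuntimeError`. The `fake_run` stand-in below exists only so the example runs without a GPU; its numbers are not real measurements.

```python
EXPERIMENTS = [
    {"batch_size": 2, "seq_len": 128, "grad_ckpt": True, "amp": "bf16"},
    {"batch_size": 4, "seq_len": 256, "grad_ckpt": True, "amp": "bf16"},
    {"batch_size": 8, "seq_len": 512, "grad_ckpt": False, "amp": "bf16"},
    {"batch_size": 16, "seq_len": 256, "grad_ckpt": False, "amp": "bf16"},
]


def run_experiments(experiments, run_fn):
    """Run each config, recording peak VRAM or an OOM status."""
    results = []
    for cfg in experiments:
        try:
            peak_gb = run_fn(cfg)
            results.append({**cfg, "status": "ok", "peak_vram_gb": peak_gb})
        except RuntimeError as err:
            # Only swallow out-of-memory failures; re-raise anything else
            if "out of memory" not in str(err).lower():
                raise
            results.append({**cfg, "status": "oom", "peak_vram_gb": None})
    return results


def fake_run(cfg):
    """Stand-in for a real training run (NOT a VRAM model, illustration only)."""
    tokens = cfg["batch_size"] * cfg["seq_len"]
    if tokens > 2048:
        raise RuntimeError("CUDA out of memory (simulated)")
    return tokens / 1024


results = run_experiments(EXPERIMENTS, fake_run)
for row in results:
    print(row["batch_size"], row["seq_len"], row["status"])
```

In a real run, `run_fn` would reset peak memory stats, train for a few steps, and return `torch.cuda.max_memory_allocated() / 1024**3`, as in the VRAM monitoring snippet.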
Tensor Lookups (not Python loops):
# Slow: Python loop
for i, token_id in enumerate(input_ids):
    type_id = token_to_type[token_id]
    embeddings[i] = type_embeddings[type_id]
# Fast: Vectorized
type_ids = token_to_type[input_ids] # Broadcast lookup
embeddings = type_embeddings[type_ids] # Single GPU kernel
Registered Buffers (persistent GPU data):
import torch
import torch.nn as nn
class Model(nn.Module):
    def __init__(self, vocab_size):
        super().__init__()
        # Build lookup tensors once
        type_ids = torch.zeros(vocab_size, dtype=torch.long)
        self.register_buffer('_type_ids', type_ids)  # Moves to the GPU with the module
    def forward(self, input_ids):
        return self._type_ids[input_ids]  # Vectorized lookup
Batch Operations:
# Slow: Per-sample processing
outputs = [model(x.unsqueeze(0)) for x in batch]
# Fast: Batched
outputs = model(batch) # Single forward pass
When to Use CuPy:
Migration Pattern:
import cupy as cp
import numpy as np
# NumPy (CPU)
x = np.random.randn(10000, 1000)
y = np.dot(x, x.T)
# CuPy (GPU) - SAME API
x_gpu = cp.random.randn(10000, 1000)
y_gpu = cp.dot(x_gpu, x_gpu.T)
# Transfer back if needed
y_cpu = cp.asnumpy(y_gpu)
Interop with PyTorch:
# CuPy → PyTorch (zero-copy)
x_cupy = cp.random.randn(1000, 1000)
x_torch = torch.as_tensor(x_cupy, device='cuda')
# PyTorch → CuPy (zero-copy)
x_torch = torch.randn(1000, 1000, device='cuda')
x_cupy = cp.asarray(x_torch)
Install:
uv pip install cupy-cuda12x # For CUDA 12.x
When to Use cuDF:
Migration Pattern:
import cudf
import pandas as pd
# Pandas (CPU)
df = pd.read_csv('large.csv')
grouped = df.groupby('category')['value'].mean()
# cuDF (GPU) - SAME API
df_gpu = cudf.read_csv('large.csv')
grouped_gpu = df_gpu.groupby('category')['value'].mean()
# Transfer back
grouped_cpu = grouped_gpu.to_pandas()
XGBoost Integration:
import cudf
import xgboost as xgb
# Load data on GPU
df = cudf.read_csv('train.csv')
X = df[feature_cols]
y = df['target']
# Create QuantileDMatrix directly from cuDF (no CPU copy)
dtrain = xgb.QuantileDMatrix(X, label=y)
Install:
# cuDF wheel from the RAPIDS suite (CUDA 12.x)
uv pip install cudf-cu12 --extra-index-url=https://pypi.nvidia.com
Fused Optimizer:
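import inspect
# Check availability (the 'fused' argument exists only in newer PyTorch releases)
use_fused = (
    torch.cuda.is_available()
    and "fused" in inspect.signature(torch.optim.AdamW).parameters
)
# Pass 'fused' only when supported, so older releases do not raise TypeError
extra_args = {"fused": True} if use_fused else {}
optimizer = torch.optim.AdamW(
    model.parameters(),
    lr=1e-3,
    **extra_args,  # Single GPU kernel (2-3x faster)
)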
Torch Compile:
# PyTorch 2.0+ compile
if hasattr(torch, "compile"):
    model = torch.compile(model, mode="reduce-overhead")
cuDNN Benchmarking:
# Auto-tune kernels (slower startup, faster training)
torch.backends.cudnn.benchmark = True
# For determinism, turn auto-tuning off instead
torch.backends.cudnn.benchmark = False
torch.backends.cudnn.deterministic = True
Weighted Slot Loss:
class WeightedSlotLoss(nn.Module):
    def __init__(self, slot_weights):
        super().__init__()
        # Register as a buffer so the weights follow the module to the GPU
        self.register_buffer('slot_weights', torch.tensor(slot_weights, dtype=torch.float32))
    def forward(self, logits_list, targets):
        weighted_losses = []
        for i, logits in enumerate(logits_list):
            loss = F.cross_entropy(logits, targets[:, i])
            weighted_losses.append(loss * self.slot_weights[i])
        return torch.stack(weighted_losses).sum() / self.slot_weights.sum()
Focal Loss (hard example mining):
class FocalLoss(nn.Module):
    def __init__(self, gamma=2.0):
        super().__init__()
        self.gamma = gamma
    def forward(self, logits, targets):
        ce_loss = F.cross_entropy(logits, targets, reduction='none')
        pt = torch.exp(-ce_loss)
        focal_loss = ((1 - pt) ** self.gamma) * ce_loss
        return focal_loss.mean()
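To see why the `(1 - pt) ** gamma` factor acts as hard example mining, the per-sample term can be evaluated by hand. The following is a plain-Python sketch of the same formula for a single sample; the probabilities 0.9 and 0.1 and the helper name `focal_term` are chosen only for illustration.

```python
import math


def focal_term(p_t, gamma=2.0):
    """Per-sample focal loss, where p_t is the predicted probability of the true class."""
    ce = -math.log(p_t)  # Cross-entropy for this sample
    return (1.0 - p_t) ** gamma * ce


easy = focal_term(0.9)  # Confidently correct sample
hard = focal_term(0.1)  # Badly misclassified sample

print(f"easy={easy:.6f} hard={hard:.4f} ratio={hard / easy:.0f}")
```

With gamma=2 the easy sample's loss is scaled by 0.01 while the hard sample keeps 0.81 of its cross-entropy, so the hard sample dominates the batch mean far more than under plain cross-entropy. Setting gamma=0 recovers ordinary cross-entropy.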
Position Embedding Cache:
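class Model(nn.Module):
    def __init__(self):
        super().__init__()
        # self.pos_embed (e.g., an nn.Embedding) is assumed to be defined here
        self._pos_cache = {}  # {(seq_len, device): positions}
    def forward(self, x):
        T = x.size(1)
        key = (T, x.device)  # Include the device so cached tensors are never on the wrong one
        if key not in self._pos_cache:
            self._pos_cache[key] = torch.arange(T, device=x.device)
            # Limit cache size: evict the oldest entry
            if len(self._pos_cache) > 10:
                self._pos_cache.pop(next(iter(self._pos_cache)))
        return self.pos_embed(self._pos_cache[key])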
Attention Mask Cache:
def _create_causal_mask(self, T, device):
    key = (T, device)
    if key not in self._mask_cache:
        # Build the mask directly on the target device (no CPU round trip)
        mask = torch.triu(torch.ones(T, T, device=device), diagonal=1).bool()
        self._mask_cache[key] = mask
    return self._mask_cache[key]
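Both caches above need a size bound so that many distinct sequence lengths do not pin GPU memory. The following is a minimal sketch of a reusable bounded cache with least-recently-used eviction, written in plain Python. The class name `BoundedCache` and its methods are hypothetical; in a model, `factory` would build the `torch.arange` or mask tensor.

```python
from collections import OrderedDict


class BoundedCache:
    """Small LRU cache: evicts the least recently used entry when full."""

    def __init__(self, max_entries=10):
        self.max_entries = max_entries
        self._store = OrderedDict()

    def get_or_create(self, key, factory):
        """Return the cached value for key, building it with factory() on a miss."""
        if key in self._store:
            self._store.move_to_end(key)  # Mark as most recently used
            return self._store[key]
        value = factory()
        self._store[key] = value
        if len(self._store) > self.max_entries:
            self._store.popitem(last=False)  # Drop the least recently used entry
        return value

    def __contains__(self, key):
        return key in self._store

    def __len__(self):
        return len(self._store)


# Usage with stand-in values (lists instead of tensors)
cache = BoundedCache(max_entries=2)
cache.get_or_create(128, lambda: list(range(128)))
cache.get_or_create(256, lambda: list(range(256)))
cache.get_or_create(128, lambda: list(range(128)))  # Hit: 128 becomes most recent
cache.get_or_create(512, lambda: list(range(512)))  # Evicts 256
print(128 in cache, 256 in cache, 512 in cache)
```

Unlike the first-inserted eviction used above, LRU keeps sequence lengths that recur often, which tends to matter when a few lengths dominate the workload.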
Check GPU Utilization:
watch -n 1 nvidia-smi # Monitor in real-time
Profile PyTorch:
from torch.profiler import profile, ProfilerActivity
with profile(
    activities=[ProfilerActivity.CPU, ProfilerActivity.CUDA],
    with_stack=True,
) as prof:
    model(batch)
print(prof.key_averages().table(sort_by="cuda_time_total"))
Bottleneck Detection:
# torch.utils.bottleneck is run as a module from the shell
python -m torch.utils.bottleneck script.py
- QuantileDMatrix, set device='cuda:0'

Avoid:
- .cpu() in training loop (kills GPU pipeline)
- torch.cuda.synchronize() unnecessarily (breaks async)

Documentation:

- nvidia-smi first to confirm the GPU is visible; if the command fails, verify driver installation with sudo nvidia-smi or reinstall drivers before proceeding.
- RuntimeError: XGBoost not compiled with CUDA support: install the CUDA build via uv pip install xgboost from a CUDA-enabled environment, or build from source with -DUSE_CUDA=ON.
- ImportError or version mismatch): verify CUDA toolkit version with nvcc --version and install the matching CuPy wheel (e.g., cupy-cuda12x for CUDA 12.x).
- --extra-index-url=https://pypi.nvidia.com) and match the cudf-cu12 suffix to your CUDA major version.
- torch.compile produces incorrect results or crashes: skip the torch.compile call and run the eager model to isolate the problem; it is known to fail on some custom ops, so fall back to eager mode for those layers.

Each optimization recommendation includes a before/after code pair showing the original pattern and the GPU-optimized equivalent. Performance gain estimates are provided as ranges (e.g., "1.8x faster", "~40% VRAM reduction") based on typical consumer GPU benchmarks; actual gains depend on workload and hardware. Where a change introduces a trade-off (e.g., gradient checkpointing adds compute time), the trade-off is stated explicitly inline.
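The CuPy wheel-matching step in the notes above can be sketched as a small helper that reads the CUDA major version from `nvcc --version` output. This is an illustration under assumptions: the function name `cupy_wheel_for` is hypothetical, the sample text only mimics the "release X.Y" line that nvcc prints, and the sketch maps only CUDA 11.x and 12.x.

```python
import re


def cupy_wheel_for(nvcc_output: str) -> str:
    """Suggest a CuPy wheel name from the text printed by `nvcc --version`."""
    match = re.search(r"release (\d+)\.(\d+)", nvcc_output)
    if match is None:
        raise ValueError("could not find a CUDA release in the nvcc output")
    major = int(match.group(1))
    if major not in (11, 12):
        # Only the 11.x and 12.x wheel names are covered by this sketch
        raise ValueError(f"no wheel mapping in this sketch for CUDA {major}.x")
    return f"cupy-cuda{major}x"


# Hypothetical sample of the relevant nvcc output line
sample_output = "Cuda compilation tools, release 12.4, V12.4.131"
wheel = cupy_wheel_for(sample_output)
print(wheel)
```

In practice the input would come from `subprocess.run(["nvcc", "--version"], capture_output=True, text=True).stdout`.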