skills/cv/tensor-core-aligned-padding/SKILL.md
Pad batch sequence lengths to multiples of 8 for efficient tensor core utilization on GPUs, with -100 masking for label padding
npx skillsauth add wenmin-wu/ds-skills cv-tensor-core-aligned-padding
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
NVIDIA tensor cores operate most efficiently on dimensions that are multiples of 8 (FP16) or 16 (INT8). When batching variable-length sequences, pad every sequence to the batch's maximum length rounded up to the next multiple of 8, rather than to the maximum length itself. Padding positions in the labels are set to -100, the default ignore_index of PyTorch's CrossEntropyLoss, so they contribute nothing to the loss. Together, these can yield up to a 15% throughput improvement with no effect on accuracy.
import torch


def collate_aligned(samples, pad_token_id, align=8):
    # Longest sequence in the batch, rounded up to the next multiple of `align`.
    max_len = max(len(s["input_ids"]) for s in samples)
    if max_len % align != 0:
        max_len = (max_len // align + 1) * align

    # Right-pad every sequence to the aligned length.
    input_ids = []
    for s in samples:
        padded = s["input_ids"] + [pad_token_id] * (max_len - len(s["input_ids"]))
        input_ids.append(padded)
    input_ids = torch.tensor(input_ids)

    # Assumes pad_token_id never occurs as a real token; any such token
    # would also be masked out of the loss and the attention mask.
    labels = input_ids.clone()
    labels[labels == pad_token_id] = -100  # ignore padding in loss
    attention_mask = (input_ids != pad_token_id).long()

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }
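
The collate function above finds padding by comparing token values against pad_token_id, which would also mask genuine tokens if the pad id doubles as a real token (for example, when a tokenizer reuses its end-of-sequence token for padding). The following is a minimal sketch of one way to avoid that, deriving the attention mask and labels from each sequence's original length instead. The helper name `pad_batch_by_length` and its `ignore_index` parameter are hypothetical, and plain lists are used so the logic is easy to inspect.

```python
def pad_batch_by_length(sequences, pad_token_id, align=8, ignore_index=-100):
    """Pad token-id lists to a shared, align-rounded length (hypothetical helper)."""
    longest = max(len(seq) for seq in sequences)
    # Ceiling division: round `longest` up to the next multiple of `align`.
    target = -(-longest // align) * align

    input_ids, attention_mask, labels = [], [], []
    for seq in sequences:
        n_pad = target - len(seq)
        input_ids.append(list(seq) + [pad_token_id] * n_pad)
        # Mask and labels come from the original length, not from token values,
        # so a real token equal to pad_token_id is still attended to and scored.
        attention_mask.append([1] * len(seq) + [0] * n_pad)
        labels.append(list(seq) + [ignore_index] * n_pad)

    return {
        "input_ids": input_ids,
        "attention_mask": attention_mask,
        "labels": labels,
    }


# The first sequence ends in a real token whose id equals pad_token_id (0).
batch = pad_batch_by_length(
    [[5, 6, 0], [7, 8, 9, 10, 11, 12, 13, 14, 15]], pad_token_id=0
)
print(len(batch["input_ids"][0]))  # longest length 9 is rounded up to 16
print(batch["labels"][0][:4])      # the real 0 is kept; padding becomes -100
```

Passing each returned list through torch.tensor should give a dictionary with the same keys and shapes as collate_aligned.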