skills/cv/chunked-gpu-similarity-search/SKILL.md
Compute pairwise cosine similarity on GPU in fixed-size chunks to avoid OOM, transferring only threshold-passing results to CPU
npx skillsauth add wenmin-wu/ds-skills cv-chunked-gpu-similarity-search
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Computing a full N×N similarity matrix on GPU runs out of memory for large N (>50K items). Process it in chunks instead: multiply the full embedding matrix by a chunk of rows, threshold on GPU, then transfer only the sparse results to CPU. This reduces the peak VRAM used for similarity scores from O(N²) to O(N×chunk_size); the (N, D) embedding matrix itself still stays resident on the GPU.
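To make the O(N²) versus O(N×chunk_size) difference concrete, the sketch below estimates the size of the similarity buffer in both cases. It assumes float32 scores (4 bytes each); the helper name `sim_buffer_gib` and the dataset size of 100,000 are illustrative and not part of the skill.

```python
def sim_buffer_gib(n_rows, n_cols, bytes_per_value=4):
    """Size in GiB of an (n_rows, n_cols) similarity buffer (float32 assumed)."""
    return n_rows * n_cols * bytes_per_value / 1024**3


N = 100_000        # illustrative dataset size (assumption)
chunk_size = 4096  # default chunk size used in this skill

full = sim_buffer_gib(N, N)              # full N x N matrix
chunked = sim_buffer_gib(chunk_size, N)  # one (chunk, N) block

print(f"full:    {full:.1f} GiB")     # full:    37.3 GiB
print(f"chunked: {chunked:.2f} GiB")  # chunked: 1.53 GiB
```

Halving `chunk_size` halves the block size, so it can be tuned to whatever VRAM is left after the embeddings are loaded.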
import numpy as np
import torch


def chunked_cosine_matches(embeddings, threshold=0.95, chunk_size=4096):
    """Find matches via chunked cosine similarity on GPU.

    Args:
        embeddings: (N, D) L2-normalized numpy array
        threshold: cosine similarity cutoff
        chunk_size: rows per GPU batch
    Returns:
        list of matched index arrays per query
    """
    N = len(embeddings)
    emb_gpu = torch.from_numpy(embeddings).cuda()
    n_chunks = (N + chunk_size - 1) // chunk_size
    all_matches = [None] * N
    for j in range(n_chunks):
        a = j * chunk_size
        b = min(a + chunk_size, N)
        sims = torch.matmul(emb_gpu, emb_gpu[a:b].T).T  # (chunk, N)
        sims_cpu = sims.cpu().numpy()
        for k in range(b - a):
            idx = np.where(sims_cpu[k] > threshold)[0]
            all_matches[a + k] = idx
    return all_matches


# Usage: embeddings must be L2-normalized
from sklearn.preprocessing import normalize

embeddings = normalize(raw_embeddings)
matches = chunked_cosine_matches(embeddings, threshold=0.9)
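Tip: the loop above copies each full (chunk, N) similarity block to the host before thresholding. Applying torch.where on the GPU before calling .cpu() transfers only the matching indices, which saves transfer time when matches are sparse.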
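The tip above can be sketched as follows: threshold each block on the device and copy back only the indices of the matches. This is a minimal sketch, not the skill's own code. The function name `chunked_cosine_matches_sparse` is hypothetical, it uses `torch.nonzero(..., as_tuple=True)` (equivalent to single-argument `torch.where`), and it assumes a fallback to CPU when no CUDA device is available so that it can be tried anywhere.

```python
import numpy as np
import torch


def chunked_cosine_matches_sparse(embeddings, threshold=0.95, chunk_size=4096):
    """Variant of chunked_cosine_matches that thresholds on the device (hypothetical)."""
    device = "cuda" if torch.cuda.is_available() else "cpu"  # assumption: CPU fallback
    emb = torch.from_numpy(embeddings).to(device)
    N = emb.shape[0]
    all_matches = []
    for a in range(0, N, chunk_size):
        b = min(a + chunk_size, N)
        sims = emb[a:b] @ emb.T  # (chunk, N)
        # Indices of entries above the threshold, returned in row-major order.
        rows, cols = torch.nonzero(sims > threshold, as_tuple=True)
        # Only the match indices are copied to the host, not the dense block.
        rows, cols = rows.cpu().numpy(), cols.cpu().numpy()
        # rows is sorted, so per-row counts give the split points for cols.
        counts = np.bincount(rows, minlength=b - a)
        all_matches.extend(np.split(cols, np.cumsum(counts)[:-1]))
    return all_matches


# Toy check: rows 0 and 1 are near-duplicates, row 2 is orthogonal to row 0.
toy = np.array([[1.0, 0.0], [0.999, 0.04], [0.0, 1.0]], dtype=np.float32)
toy /= np.linalg.norm(toy, axis=1, keepdims=True)
matches = chunked_cosine_matches_sparse(toy, threshold=0.9, chunk_size=2)
print([m.tolist() for m in matches])  # [[0, 1], [0, 1], [2]]
```

Each row matches itself, so for thresholds comfortably below 1 every returned array contains at least the query's own index. The benefit over the dense transfer depends on how sparse the matches are: with a low threshold and many hits, the index arrays can approach the size of the block itself.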