skills/cv/knn-distance-threshold-matching/SKILL.md
KNN-based retrieval with grid-searched distance threshold to convert embedding neighbors into match predictions
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add wenmin-wu/ds-skills cv-knn-distance-threshold-matching
After extracting embeddings (image or text), find candidate matches via K-nearest neighbors, then apply a distance threshold to decide which neighbors are true matches. Grid-search the threshold on a validation set using F1 score. Optionally filter outliers using z-score on the distance distribution per query.
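The optional z-score filter mentioned above could look roughly like the sketch below. The source does not say exactly which distances the z-score is computed over, so this is one interpretation: for each query, keep the neighbors under the threshold, then drop any whose distance is unusually large relative to that query's remaining candidates. The function name `zscore_filter` and the `max_z` parameter are hypothetical, and the inputs are assumed to be the `distances` and `indices` arrays returned by `kneighbors`.

```python
import numpy as np


def zscore_filter(distances, indices, threshold, max_z=1.5):
    """Threshold neighbors, then drop per-query distance outliers.

    Hypothetical helper: `max_z` is an assumed cutoff, not a value from the source.
    """
    kept = []
    for dist_row, idx_row in zip(distances, indices):
        # Step 1: plain distance threshold, as in the basic approach
        mask = dist_row < threshold
        d, idx = dist_row[mask], idx_row[mask]
        # Step 2: z-score only makes sense with a few candidates and nonzero spread
        if len(d) > 2 and d.std() > 0:
            z = (d - d.mean()) / d.std()
            idx = idx[z < max_z]  # one-sided: only unusually far neighbors are dropped
        kept.append(idx)
    return kept
```

Because the filter is one-sided, close neighbors are never removed; only candidates that sit far out in the tail of a query's own distance distribution are discarded. `max_z` would need tuning on the validation set alongside the threshold.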
import numpy as np
from sklearn.neighbors import NearestNeighbors  # or cuml.neighbors for GPU


def find_matches(embeddings, ids, threshold, k=50):
    """Find matching items via KNN + distance threshold.

    Args:
        embeddings: (N, D) normalized feature vectors
        ids: array of item identifiers
        threshold: max cosine distance to consider a match
        k: number of neighbors to retrieve

    Returns:
        list of matched ID arrays per query
    """
    ids = np.asarray(ids)  # the boolean-mask indexing below needs an array, not a list
    knn = NearestNeighbors(n_neighbors=min(k, len(embeddings)), metric='cosine')
    knn.fit(embeddings)
    # Querying with the fitted set normally returns each item among its own
    # neighbors (distance ~0), so a query typically matches at least itself.
    distances, indices = knn.kneighbors(embeddings)
    predictions = []
    for i in range(len(embeddings)):
        mask = distances[i] < threshold
        predictions.append(ids[indices[i][mask]])
    return predictions
# Grid search the threshold on the validation set, keeping the best F1.
# val_embeddings, val_ids, val_labels and compute_f1 must be supplied by the caller.
best_score, best_thresh = 0, 0
for thresh in np.arange(0.1, 1.0, 0.05):
    preds = find_matches(val_embeddings, val_ids, thresh)
    score = compute_f1(val_labels, preds)
    if score > best_score:
        best_score, best_thresh = score, thresh
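The grid search calls `compute_f1`, which the snippet does not define. A minimal sketch of one plausible definition follows, assuming `val_labels` holds the set of true matching IDs for each query and that the metric is the per-query F1 averaged over all queries. Both the signature and the averaging scheme are assumptions, not something the source specifies.

```python
import numpy as np


def compute_f1(labels, preds):
    """Mean per-query F1 between true and predicted match sets (assumed metric)."""
    scores = []
    for true_ids, pred_ids in zip(labels, preds):
        true_set, pred_set = set(true_ids), set(pred_ids)
        if not true_set and not pred_set:
            scores.append(1.0)  # nothing to find and nothing predicted
            continue
        overlap = len(true_set & pred_set)
        # F1 = 2 * |intersection| / (|true| + |predicted|)
        scores.append(2 * overlap / (len(true_set) + len(pred_set)))
    return float(np.mean(scores))
```

Averaging per query rather than pooling all pairs gives every query equal weight regardless of how many matches it has, which may or may not suit a given dataset.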