skills/tabular/haversine-knn-candidate-generation/SKILL.md
Generates geographically proximate candidate pairs for entity matching using KNN with haversine distance, optionally partitioned by country.
npx skillsauth add wenmin-wu/ds-skills tabular-haversine-knn-candidate-generation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Entity matching at scale requires a candidate generation step, because comparing all N^2 pairs is infeasible. For location-based entities (POIs, stores, addresses), KNN with haversine distance finds the K geographically closest candidates per record. Partitioning by country or region shrinks the search space further and prevents cross-continent false matches, at the cost of never pairing two records that fall in different partitions (for example, near a border or with a mislabeled country). The resulting candidate pairs are then scored by a downstream classifier.
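For reference, the haversine distance that the KNN step relies on can be written out directly. The sketch below is a plain-Python illustration, not part of the skill itself; the function name `haversine_km` is hypothetical, and the radius of 6371 km is the same mean-Earth-radius factor the skill uses to convert to kilometres.

```python
import math

EARTH_RADIUS_KM = 6371.0  # mean Earth radius; same conversion factor as in the skill


def haversine_km(lat1, lon1, lat2, lon2):
    """Great-circle distance in km between two points given in degrees."""
    phi1, phi2 = math.radians(lat1), math.radians(lat2)
    dphi = phi2 - phi1
    dlam = math.radians(lon2 - lon1)
    # Haversine formula: a is the squared half-chord length on the unit sphere
    a = math.sin(dphi / 2) ** 2 + math.cos(phi1) * math.cos(phi2) * math.sin(dlam / 2) ** 2
    return 2 * EARTH_RADIUS_KM * math.asin(math.sqrt(a))


# A quarter of the equator: (0°, 0°) to (0°, 90°)
quarter_equator_km = haversine_km(0.0, 0.0, 0.0, 90.0)
```

scikit-learn's `haversine` metric returns this quantity on the unit sphere (in radians), which is why coordinates are converted with `np.deg2rad` before fitting and distances are multiplied by the Earth radius afterwards.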
import numpy as np
import pandas as pd
from sklearn.neighbors import NearestNeighbors


def generate_geo_candidates(df, n_neighbors=20, partition_col="country"):
    """Generate candidate pairs using haversine KNN per partition."""
    all_candidates = []
    for _, group_df in df.groupby(partition_col):
        group_df = group_df.reset_index(drop=True)
        ids = group_df["id"].to_numpy()
        # The haversine metric expects [latitude, longitude] in radians
        coords = np.deg2rad(group_df[["latitude", "longitude"]].values)
        knn = NearestNeighbors(
            n_neighbors=min(len(group_df), n_neighbors),
            metric="haversine", n_jobs=-1
        )
        knn.fit(coords)
        dists, indices = knn.kneighbors(coords)
        for i in range(len(group_df)):
            for j in range(len(indices[i])):
                # Skip the self-match explicitly: with duplicate coordinates
                # the record itself is not guaranteed to sit at position 0
                if indices[i][j] == i:
                    continue
                all_candidates.append({
                    "id": ids[i],
                    "match_id": ids[indices[i][j]],
                    "geo_dist": dists[i][j] * 6371,  # radians -> km
                    "neighbor_rank": j,
                })
    return pd.DataFrame(all_candidates)


candidates = generate_geo_candidates(df, n_neighbors=20)
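Because every record queries its own neighbors, the same pair can show up in both directions (A to B and B to A). Whether that matters depends on the downstream classifier; if it treats pairs as unordered, one possible post-processing step is to keep a single row per pair. The sketch below is an assumption rather than part of the skill: `dedupe_symmetric_pairs` is a hypothetical helper, it works on plain dict rows (such as those from `candidates.to_dict("records")`), and it assumes the ids are mutually comparable, for example all strings.

```python
def dedupe_symmetric_pairs(records):
    """Keep one row per unordered (id, match_id) pair, preferring the smaller geo_dist."""
    best = {}
    for rec in records:
        # Canonical key: the two ids in sorted order, so (a, b) and (b, a) collide
        key = tuple(sorted((rec["id"], rec["match_id"])))
        if key not in best or rec["geo_dist"] < best[key]["geo_dist"]:
            best[key] = rec
    return list(best.values())


# Hypothetical rows shaped like the output of generate_geo_candidates
rows = [
    {"id": "a", "match_id": "b", "geo_dist": 0.12, "neighbor_rank": 1},
    {"id": "b", "match_id": "a", "geo_dist": 0.12, "neighbor_rank": 1},
    {"id": "a", "match_id": "c", "geo_dist": 3.40, "neighbor_rank": 2},
]
unique_rows = dedupe_symmetric_pairs(rows)
```

Note that collapsing a pair discards the `neighbor_rank` from one direction, so this step is only appropriate when the classifier does not use rank as a per-direction feature.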