skills/nlp/multi-metric-string-similarity/SKILL.md
Computes multiple complementary string similarity scores (Gestalt, Levenshtein, Jaro-Winkler, LCS) per field pair as features for entity matching classifiers.
Install this skill globally with one command: `npx skillsauth add wenmin-wu/ds-skills nlp-multi-metric-string-similarity`. Works with Claude Code, Cursor, and Windsurf.
No single string similarity metric captures every kind of text variation. Gestalt pattern matching (`difflib.SequenceMatcher`) scores the matching blocks two strings share, so it tolerates inserted or dropped chunks; Levenshtein counts single-character edits; Jaro-Winkler rewards matching prefixes (good for names); and LCS measures the longest shared subsequence. Computing these per field pair, plus length-normalized variants, gives a downstream classifier rich signals for entity matching, deduplication, and record linkage.
```python
import difflib

import Levenshtein  # third-party package
import pandas as pd


def string_similarity_features(s1, s2):
    """Compute multiple similarity metrics for a string pair."""
    # Missing values (None, NaN, empty string) get a -1 sentinel.
    if not isinstance(s1, str) or not isinstance(s2, str) or not s1 or not s2:
        return {"gestalt": -1, "levenshtein": -1, "jaro_winkler": -1,
                "norm_levenshtein": -1}
    gestalt = difflib.SequenceMatcher(None, s1, s2).ratio()
    leven = Levenshtein.distance(s1, s2)  # raw edit count
    jaro = Levenshtein.jaro_winkler(s1, s2)
    max_len = max(len(s1), len(s2))  # > 0 here, both strings are non-empty
    return {
        "gestalt": gestalt,
        "levenshtein": leven,
        "jaro_winkler": jaro,
        "norm_levenshtein": leven / max_len,  # 0 = identical, 1 = nothing shared
    }


# Apply to each field pair in candidate matches.
# Assumes df has columns such as name_1 / name_2 for every field.
for field in ["name", "address", "city"]:
    feats = df.apply(
        lambda r: string_similarity_features(r[f"{field}_1"], r[f"{field}_2"]),
        axis=1, result_type="expand"
    )
    feats.columns = [f"{field}_{c}" for c in feats.columns]
    df = pd.concat([df, feats], axis=1)
```
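The overview lists LCS among the metrics, but the function above does not return an LCS score. Here is a minimal sketch of how one could be added, assuming LCS means longest common subsequence (not substring). `lcs_length` and `lcs_features` are hypothetical helper names, not part of the skill, and the -1 sentinel mirrors the convention used above.

```python
def lcs_length(s1, s2):
    """Length of the longest common subsequence, computed row by row."""
    prev = [0] * (len(s2) + 1)  # DP row for the previous character of s1
    for ch1 in s1:
        curr = [0]
        for j, ch2 in enumerate(s2, start=1):
            if ch1 == ch2:
                # Extend the best subsequence that ends before both characters.
                curr.append(prev[j - 1] + 1)
            else:
                # Otherwise carry over the better of "skip ch1" or "skip ch2".
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def lcs_features(s1, s2):
    """Hypothetical helper: raw and length-normalized LCS for a string pair."""
    # Same sentinel convention as above for missing or empty values.
    if not isinstance(s1, str) or not isinstance(s2, str) or not s1 or not s2:
        return {"lcs": -1, "norm_lcs": -1}
    lcs = lcs_length(s1, s2)
    return {"lcs": lcs, "norm_lcs": lcs / max(len(s1), len(s2))}
```

One way to use it would be to merge the result into the dict returned by `string_similarity_features`, for example `{**feats, **lcs_features(s1, s2)}`, so the per-field loop picks up the extra columns. The pure-Python loop costs time proportional to `len(s1) * len(s2)`, which is fine for short fields such as names but may be slow on long text.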