skills/nlp/jaccard-dedup-prediction-filter/SKILL.md
Deduplicates extracted entity predictions by filtering out candidates whose Jaccard word-overlap with already-accepted labels exceeds a threshold.
npx skillsauth add wenmin-wu/ds-skills nlp-jaccard-dedup-prediction-filter
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
NER and entity extraction pipelines often produce near-duplicate predictions ("National Health Survey" vs "National Health Survey Data"). Greedy Jaccard deduplication sorts candidates by string length, longest first, and accepts each one only if its word-level Jaccard similarity with every previously accepted label is strictly below a threshold. This removes redundant extractions while preserving distinct entities.
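As a quick check of the example pair above, the word-level Jaccard score can be worked out by hand. This is only an illustrative sketch; the variable names are not part of the skill.

```python
# Worked example: word-level Jaccard for the two near-duplicates above.
a = set("National Health Survey".lower().split())
b = set("National Health Survey Data".lower().split())

shared = a & b      # words present in both strings: 3 of them
combined = a | b    # words present in either string: 4 of them

score = len(shared) / len(combined)
print(score)  # 0.75
```

The pair scores exactly 3/4, so with a threshold of 0.75 and a strict "below threshold" test, the two strings count as duplicates.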
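def jaccard_similarity(s1, s2):
    # Compare case-insensitive sets of whitespace-separated words.
    words1 = set(s1.lower().split())
    words2 = set(s2.lower().split())
    intersection = len(words1 & words2)
    union = len(words1 | words2)
    # Two empty strings have an empty union; treat them as dissimilar.
    return intersection / union if union > 0 else 0.0

def dedup_entities(entities, threshold=0.75):
    filtered = []
    # Longest strings first, so the most complete variant is kept.
    for entity in sorted(entities, key=len, reverse=True):
        # Accept only if strictly below the threshold against every kept label.
        if not filtered or all(
            jaccard_similarity(entity, kept) < threshold
            for kept in filtered
        ):
            filtered.append(entity)
    return filtered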
# Usage:
raw_predictions = ["National Health Survey", "National Health Survey Data", "Census Bureau"]
clean = dedup_entities(raw_predictions, threshold=0.75)
# -> ["National Health Survey Data", "Census Bureau"]
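Because the comparison is a strict `<`, a pair that scores exactly at the threshold is treated as a duplicate. The sketch below restates the two functions compactly so it runs on its own, then compares two threshold values on the same input. The values 0.75 and 0.8 are chosen purely for illustration, not as recommended settings.

```python
def jaccard_similarity(s1, s2):
    w1, w2 = set(s1.lower().split()), set(s2.lower().split())
    union = w1 | w2
    return len(w1 & w2) / len(union) if union else 0.0

def dedup_entities(entities, threshold=0.75):
    kept = []
    for entity in sorted(entities, key=len, reverse=True):
        # all() over an empty list is True, so the first candidate is always kept.
        if all(jaccard_similarity(entity, k) < threshold for k in kept):
            kept.append(entity)
    return kept

preds = ["National Health Survey", "National Health Survey Data", "Census Bureau"]

# The near-duplicate pair scores exactly 0.75, which is not below 0.75.
strict = dedup_entities(preds, threshold=0.75)
# Raising the threshold to 0.8 lets the shorter variant through.
loose = dedup_entities(preds, threshold=0.8)

print(strict)  # ['National Health Survey Data', 'Census Bureau']
print(loose)   # ['National Health Survey Data', 'National Health Survey', 'Census Bureau']
```

Results come back in length order (longest first), not in the original input order, so callers that need the original ordering would have to restore it themselves.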