skills/cv/phash-duplicate-grouping/SKILL.md
Group near-duplicate images by perceptual hash (pHash) as a zero-cost baseline signal for product or image matching
npx skillsauth add wenmin-wu/ds-skills cv-phash-duplicate-grouping
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Perceptual hashing (pHash) produces a compact fingerprint that stays the same, or differs in only a few bits, for visually similar images despite changes in resolution or minor edits. Grouping items that share an identical hash finds near-duplicates without any model inference, although images whose hashes differ by even one bit land in different groups. Use it as a cheap baseline, or combine it with learned embeddings for higher recall.
import pandas as pd
def phash_group_matches(df, hash_col='image_phash', id_col='posting_id'):
    """Group items by perceptual hash.

    Args:
        df: DataFrame with hash and ID columns
        hash_col: column containing perceptual hash strings
        id_col: column containing item identifiers

    Returns:
        Series mapping each item to its hash-group matches
    """
    # Collect the IDs that share each hash, then look up that list for every row
    hash_groups = df.groupby(hash_col)[id_col].agg(list).to_dict()
    return df[hash_col].map(hash_groups)
# Usage
df['phash_matches'] = phash_group_matches(df)
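To make the grouping behaviour concrete, here is a small sketch on a toy DataFrame. The IDs and hash strings are invented for illustration and do not come from any real dataset; the two grouping lines repeat the logic of `phash_group_matches` so the snippet runs on its own.

```python
import pandas as pd

# Toy data: these IDs and hash strings are made up for illustration.
df = pd.DataFrame({
    "posting_id": ["a", "b", "c", "d"],
    "image_phash": ["ffee0011", "ffee0011", "12345678", "ffee0011"],
})

# Same logic as phash_group_matches: collect IDs per hash, then map back per row.
hash_groups = df.groupby("image_phash")["posting_id"].agg(list).to_dict()
df["phash_matches"] = df["image_phash"].map(hash_groups)

# Each row's match list includes the row's own ID.
matches = dict(zip(df["posting_id"], df["phash_matches"]))

# Group sizes give a quick view of how many items have at least one duplicate.
group_sizes = df["phash_matches"].str.len()
```

Note that every item matches itself, so a group of size 1 means no duplicate was found for that item.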
# Compute the hash column first if it is not provided (run this before the usage above)
from PIL import Image
import imagehash
df['image_phash'] = df['image_path'].apply(
    lambda p: str(imagehash.phash(Image.open(p)))
)
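Exact-hash grouping misses images whose hashes differ by a bit or two. One possible extension, not part of the original skill, is to link items whose hashes lie within a small Hamming distance. The sketch below assumes equal-length hex hash strings (the form `str(imagehash.phash(...))` yields); the names `hamming_distance` and `near_duplicate_groups` and the threshold `max_distance=4` are illustrative assumptions, and the pairwise loop is quadratic, so it suits only modest dataset sizes.

```python
from itertools import combinations


def hamming_distance(hash_a: str, hash_b: str) -> int:
    """Number of differing bits between two equal-length hex hash strings."""
    if len(hash_a) != len(hash_b):
        raise ValueError("hashes must have the same length")
    return bin(int(hash_a, 16) ^ int(hash_b, 16)).count("1")


def near_duplicate_groups(hashes: dict, max_distance: int = 4) -> dict:
    """Map each ID to the sorted IDs in its near-duplicate cluster.

    Pairs within max_distance bits are linked, and links are transitive
    (union-find), so a cluster can contain items farther apart than the
    threshold. max_distance=4 is an assumed value, not a recommendation.
    """
    parent = {item_id: item_id for item_id in hashes}

    def find(x):
        # Follow parent pointers to the cluster root, compressing the path.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    # O(n^2) pairwise comparison of all hashes.
    for (id_a, h_a), (id_b, h_b) in combinations(hashes.items(), 2):
        if hamming_distance(h_a, h_b) <= max_distance:
            parent[find(id_a)] = find(id_b)

    clusters = {}
    for item_id in hashes:
        clusters.setdefault(find(item_id), []).append(item_id)
    return {item_id: sorted(clusters[find(item_id)]) for item_id in hashes}


# Toy hashes, invented for illustration.
toy = {
    "a": "ffee001122334455",
    "b": "ffee001122334454",  # differs from "a" in one bit
    "c": "0123456789abcdef",
}
groups = near_duplicate_groups(toy, max_distance=4)
```

With a DataFrame like the one above, the input dict could be built with `dict(zip(df['posting_id'], df['image_phash']))` and the result attached with `df['posting_id'].map(groups)`.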