skills/evaluation-entity-matching-recommender/SKILL.md
Build and evaluate cross-dataset entity matching pipelines for recommender systems. Implements the Reddit-Amazon-EM methodology: rule-based, lexical, embedding-based, graph neural, and LLM-based entity matching with systematic evaluation. Use when: 'match products across catalogs', 'link entities between datasets', 'deduplicate items across platforms', 'entity resolution for recommendations', 'cross-dataset product mapping', 'evaluate entity matching methods'.
npx skillsauth add ndpvt-web/arxiv-claude-skills evaluation-entity-matching-recommender
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to build rigorous cross-dataset entity matching pipelines following the methodology from the Reddit-Amazon-EM paper. The core technique systematically evaluates five families of entity matching methods — rule-based, lexical (BM25), embedding-based (FAISS + fuzzy), graph neural (GNEM), and LLM-based (zero-shot classification) — against a human-annotated gold standard. Claude can apply this to any domain where items described differently across two catalogs must be linked: products, movies, books, restaurants, or any named entities in recommender systems.
The Reddit-Amazon-EM methodology treats entity matching as a multi-stage pipeline rather than a single-model problem. The insight is that different matching families excel at different failure modes: lexical methods handle exact and near-exact title matches cheaply, embedding methods catch semantic paraphrases, graph methods exploit relational structure, and LLMs resolve genuinely ambiguous cases requiring world knowledge. The paper's evaluation framework measures precision, recall, F1, and hit rate across all five families on the same gold-standard pairs, enabling principled method selection.
The practical takeaway is a cascade architecture. Start with cheap, high-precision methods (exact string match, BM25 retrieval) to handle easy cases, escalate uncertain candidates to embedding similarity (sentence-transformers + FAISS for fast approximate nearest neighbor search, combined with fuzzy string matching), and route only the hardest cases to an LLM for zero-shot classification. This cascade balances accuracy against computational cost: the paper found that LLM-based methods (GPT-3.5-Turbo, GPT-4) achieved the highest accuracy overall, but hybrid strategies that combine lightweight retrieval with selective LLM verification are more practical at scale.
The gold-standard construction process is equally important: human annotators manually verified entity pairs across Reddit-Movies and Amazon '23 catalogs, creating positive matches and hard negatives. This annotation methodology — sampling candidate pairs from multiple matching methods, then having humans judge — produces evaluation sets that stress-test each approach's weaknesses rather than just confirming easy matches.
Define the entity schema for both sources. Extract the key identifying fields from each dataset (e.g., title, year, author/director, category). Normalize field names into a common schema with columns like source_id, entity_name, entity_year, entity_category, and source_dataset.
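A minimal sketch of the common schema described in this step, assuming Python dataclasses. The column names come from the step itself; the raw field names (`post_id`, `mention`, `asin`, `title`, `year`, `category`) and the two adapter functions are hypothetical and would need to match the actual source files.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass(frozen=True)
class EntityRecord:
    """One entity in the shared schema used by every later stage."""
    source_id: str
    entity_name: str
    entity_year: Optional[int]
    entity_category: Optional[str]
    source_dataset: str


def from_reddit(row: dict) -> EntityRecord:
    # 'post_id', 'mention', and 'year' are hypothetical raw field names.
    return EntityRecord(
        source_id=str(row["post_id"]),
        entity_name=row["mention"],
        entity_year=row.get("year"),
        entity_category=None,  # assumed unavailable for free-text mentions
        source_dataset="reddit",
    )


def from_amazon(row: dict) -> EntityRecord:
    # 'asin', 'title', 'year', and 'category' are hypothetical raw field names.
    return EntityRecord(
        source_id=str(row["asin"]),
        entity_name=row["title"],
        entity_year=row.get("year"),
        entity_category=row.get("category"),
        source_dataset="amazon",
    )
```

Keeping the year and category as separate optional fields lets later stages pass them to the LLM prompt even after the title text has been normalized.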
Preprocess and normalize entity text. Apply lowercasing, strip special characters, expand common abbreviations, and handle brand synonyms. For movie/product titles, remove parenthetical qualifiers like "(2023 Edition)" or "[Blu-ray]" that differ across platforms but refer to the same entity.
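The normalization step might look roughly like the following, using only the standard library. The abbreviation map is a hypothetical placeholder, and dropping every parenthetical or bracketed qualifier is a simplifying assumption: it also discards a year such as "(2010)", so the year should be extracted into its own field before this function runs.

```python
import re
import unicodedata

# Hypothetical abbreviation map; a real one would be domain specific.
ABBREVIATIONS = {"vol": "volume", "pt": "part", "ed": "edition"}


def normalize_title(text: str) -> str:
    """Return a lowercase, ASCII-folded, qualifier-free version of a title."""
    # Decompose accented characters, then drop the combining marks.
    text = unicodedata.normalize("NFKD", text)
    text = "".join(ch for ch in text if not unicodedata.combining(ch))
    # Assumption: every (...) or [...] span is a platform-specific qualifier.
    text = re.sub(r"\([^)]*\)|\[[^\]]*\]", " ", text)
    text = text.lower()
    # Replace anything that is not a letter, digit, or whitespace.
    text = re.sub(r"[^a-z0-9\s]", " ", text)
    # Expand abbreviations token by token and collapse whitespace.
    tokens = [ABBREVIATIONS.get(tok, tok) for tok in text.split()]
    return " ".join(tokens)
```

Applying the same function to both datasets keeps the lexical and fuzzy scores comparable across sources.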
Build the candidate index using BM25 + FAISS. Index the target dataset (e.g., Amazon catalog) with both a BM25 sparse index (using rank_bm25 or Elasticsearch) and a dense FAISS index (using sentence-transformers embeddings like all-MiniLM-L6-v2). This dual index enables both lexical and semantic retrieval of candidates.
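from sentence_transformers import SentenceTransformer
import faiss
import numpy as np

# amazon_catalog is assumed to be a list of dicts that each carry a 'title' key.
model = SentenceTransformer('all-MiniLM-L6-v2')
target_titles = [item['title'] for item in amazon_catalog]
# With normalized embeddings, inner product equals cosine similarity.
embeddings = model.encode(target_titles, normalize_embeddings=True)
# IndexFlatIP performs exact (brute-force) inner-product search.
index = faiss.IndexFlatIP(embeddings.shape[1])
index.add(np.asarray(embeddings, dtype='float32'))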
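The snippet above covers only the dense half of the dual index. For the sparse half, here is a dependency-free sketch of BM25 scoring to show what the lexical retriever computes. It is not the `rank_bm25` or Elasticsearch API; the class name `TinyBM25`, the whitespace tokenizer, and the defaults `k1=1.5`, `b=0.75` are assumptions for illustration.

```python
import math
from collections import Counter


class TinyBM25:
    """Illustrative in-memory BM25 scorer over whitespace-tokenized titles."""

    def __init__(self, docs, k1=1.5, b=0.75):
        if not docs:
            raise ValueError("docs must not be empty")
        self.k1, self.b = k1, b
        self.tokens = [d.lower().split() for d in docs]
        self.doc_len = [len(t) for t in self.tokens]
        self.avgdl = sum(self.doc_len) / len(self.tokens)
        self.tf = [Counter(t) for t in self.tokens]
        # Document frequency: number of documents containing each term.
        df = Counter(term for t in self.tokens for term in set(t))
        n = len(self.tokens)
        # Smoothed idf that stays positive for very common terms.
        self.idf = {t: math.log((n - c + 0.5) / (c + 0.5) + 1.0) for t, c in df.items()}

    def scores(self, query):
        terms = query.lower().split()
        out = []
        for tf, dl in zip(self.tf, self.doc_len):
            score = 0.0
            for term in terms:
                f = tf.get(term, 0)
                if f:
                    norm = f + self.k1 * (1 - self.b + self.b * dl / self.avgdl)
                    score += self.idf[term] * f * (self.k1 + 1) / norm
            out.append(score)
        return out

    def top_k(self, query, k=10):
        """Return up to k (doc_index, score) pairs with a positive score."""
        s = self.scores(query)
        order = sorted(range(len(s)), key=lambda i: s[i], reverse=True)
        return [(i, s[i]) for i in order[:k] if s[i] > 0]
```

For a catalog of realistic size, a maintained library or search engine is the better choice; the sketch only makes the scoring explicit.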
Retrieve top-K candidates for each source entity. For each entity in the source dataset (e.g., Reddit mentions), query both the BM25 and FAISS indexes. Merge results using reciprocal rank fusion or simple union of top-K (K=10-20) from each retriever. This generates a candidate set per source entity.
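One way to merge the two candidate lists is reciprocal rank fusion, sketched below. The function name and arguments are hypothetical; `k=60` is the smoothing constant commonly used with this technique, not a value taken from the paper.

```python
from collections import defaultdict


def reciprocal_rank_fusion(rankings, k=60, top_n=20):
    """Fuse several ranked lists of candidate ids into one ranked list.

    rankings: list of lists, each ordered best first (e.g. BM25 and FAISS results).
    """
    scores = defaultdict(float)
    for ranking in rankings:
        for rank, candidate in enumerate(ranking, start=1):
            # Each appearance contributes 1 / (k + rank) to the candidate's score.
            scores[candidate] += 1.0 / (k + rank)
    # Sort by fused score, breaking ties by candidate id for determinism.
    fused = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [candidate for candidate, _ in fused[:top_n]]
```

Because the fusion uses ranks rather than raw scores, the BM25 and cosine scales never need to be calibrated against each other.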
Score candidates with embedding similarity + fuzzy matching. Compute cosine similarity from the dense embeddings and a token-level fuzzy match ratio (using thefuzz / rapidfuzz), then combine them: combined_score = 0.6 * cosine_sim + 0.4 * fuzzy_ratio. Candidates scoring above 0.80 are classified as matches; those between 0.65 and 0.80 are uncertain.
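import numpy as np
from rapidfuzz import fuzz

def combined_score(query_emb, candidate_emb, query_text, candidate_text):
    # Embeddings are assumed L2-normalized, so the dot product is the cosine similarity.
    cosine = float(np.dot(query_emb, candidate_emb))
    # token_sort_ratio returns 0-100; rescale to 0-1 to match the cosine range.
    fuzzy = fuzz.token_sort_ratio(query_text, candidate_text) / 100.0
    return 0.6 * cosine + 0.4 * fuzzy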
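The thresholds in this step can be turned into a small routing function, sketched here. The label names are hypothetical, and two details are assumptions the text does not settle: scores exactly at a boundary fall into the higher band, and anything below 0.65 is treated as a non-match.

```python
def route(score, match_threshold=0.80, uncertain_floor=0.65):
    """Map a combined score to 'match', 'uncertain', or 'non_match'."""
    if score >= match_threshold:
        return "match"
    if score >= uncertain_floor:
        return "uncertain"  # candidates in this band go on to LLM verification
    return "non_match"


def partition(scored_pairs):
    """Group (source_id, target_id, score) tuples by routing decision."""
    groups = {"match": [], "uncertain": [], "non_match": []}
    for source_id, target_id, score in scored_pairs:
        groups[route(score)].append((source_id, target_id, score))
    return groups
```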
Route uncertain candidates to LLM verification. For candidate pairs in the uncertainty band (0.65-0.80 combined score), construct a zero-shot prompt asking the LLM whether the two entities refer to the same real-world item. Include all available metadata (title, year, category) in the prompt. Use structured output to get a binary yes/no with a confidence score.
Prompt template:
"Do these two entries refer to the same movie?
Entry A: {source_title} ({source_year})
Entry B: {target_title} ({target_year}, {target_category})
Answer 'yes' or 'no' with a confidence between 0 and 1."
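A sketch of how the template could be filled and the reply parsed, kept independent of any particular LLM client. The dictionary keys (`title`, `year`, `category`) and both function names are hypothetical, and the parser assumes a free-text reply that contains "yes" or "no" plus an optional number between 0 and 1. A structured-output API would make the parsing unnecessary.

```python
import re

PROMPT_TEMPLATE = (
    "Do these two entries refer to the same movie?\n"
    "Entry A: {source_title} ({source_year})\n"
    "Entry B: {target_title} ({target_year}, {target_category})\n"
    "Answer 'yes' or 'no' with a confidence between 0 and 1."
)


def build_prompt(source: dict, target: dict) -> str:
    """Fill the template from two metadata dicts (hypothetical keys)."""
    return PROMPT_TEMPLATE.format(
        source_title=source["title"],
        source_year=source.get("year", "unknown"),
        target_title=target["title"],
        target_year=target.get("year", "unknown"),
        target_category=target.get("category", "unknown"),
    )


def parse_verdict(reply: str):
    """Return (is_match, confidence) or None if no yes/no answer is found."""
    text = reply.strip().lower()
    answer = re.search(r"\b(yes|no)\b", text)
    if answer is None:
        return None
    confidence = None
    # Take the first number that lies in [0, 1]; skip years and the like.
    for token in re.findall(r"\d+(?:\.\d+)?", text):
        value = float(token)
        if 0.0 <= value <= 1.0:
            confidence = value
            break
    return answer.group(1) == "yes", confidence
```

Returning `None` for unparseable replies lets the caller retry or fall back to manual review instead of silently recording a decision.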
Construct positive and negative pairs for evaluation. From confirmed matches (high-confidence automatic + LLM-verified), create positive pairs. Sample hard negatives from near-miss candidates (high similarity but confirmed non-match). Split into train/validation/test sets (60/20/20) if training a supervised matcher like GNEM.
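The 60/20/20 split could be sketched as below. Splitting by source entity, so that all pairs for one source land in the same partition, is an assumption added here to avoid leakage between train and test; the step itself only specifies the ratios. The tuple layout `(source_id, target_id, label)` is hypothetical.

```python
import random
from collections import defaultdict


def split_by_source(pairs, ratios=(0.6, 0.2, 0.2), seed=0):
    """Split labeled pairs into train/val/test, grouped by source_id.

    pairs: iterable of (source_id, target_id, label) tuples.
    """
    by_source = defaultdict(list)
    for pair in pairs:
        by_source[pair[0]].append(pair)
    # Sort before shuffling so the result depends only on the seed.
    source_ids = sorted(by_source)
    random.Random(seed).shuffle(source_ids)
    n = len(source_ids)
    n_train = round(n * ratios[0])
    n_val = round(n * ratios[1])
    buckets = {
        "train": source_ids[:n_train],
        "val": source_ids[n_train:n_train + n_val],
        "test": source_ids[n_train + n_val:],
    }
    return {
        name: [pair for sid in ids for pair in by_source[sid]]
        for name, ids in buckets.items()
    }
```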
Evaluate all methods against the gold standard. Compute precision, recall, F1, and hit@K for each matching method independently on the same test set. Use the gold-standard pairs as ground truth. Report results in a comparison table.
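The metrics named in this step can be computed with a few lines of plain Python. This is a sketch under stated assumptions: predictions and gold labels are sets of `(source_id, target_id)` pairs, and hit@K is taken to mean the fraction of source entities whose gold target appears among the top K ranked candidates, which may differ in detail from the paper's definition.

```python
def precision_recall_f1(predicted, gold):
    """Pair-level precision, recall, and F1 for one matching method."""
    predicted, gold = set(predicted), set(gold)
    true_positives = len(predicted & gold)
    precision = true_positives / len(predicted) if predicted else 0.0
    recall = true_positives / len(gold) if gold else 0.0
    total = precision + recall
    f1 = 2 * precision * recall / total if total else 0.0
    return precision, recall, f1


def hit_at_k(ranked, gold, k):
    """Fraction of source entities whose gold target is in the top k.

    ranked: dict source_id -> list of target_ids, best first.
    gold:   dict source_id -> correct target_id.
    """
    if not gold:
        return 0.0
    hits = sum(1 for sid, tid in gold.items() if tid in ranked.get(sid, [])[:k])
    return hits / len(gold)
```

Running both functions for every method on the same test pairs yields the rows of a comparison table like the one in Example 2.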
Build the final mapping using the cascade. Apply the full cascade (exact match -> BM25+FAISS -> embedding+fuzzy -> LLM verification) to produce the complete cross-dataset mapping. Store as a CSV with columns: source_id, target_id, match_score, match_method, confidence.
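A skeleton for the cascade and its CSV output, as a sketch only. Each stage is modeled as a hypothetical callable that returns `(target_id, score, confidence)` when it is sure of a match and `None` to defer to the next stage; the `unmatched` label for entities that no stage resolves is also an assumption. The column names are the ones listed in this step.

```python
import csv
import io

FIELDS = ["source_id", "target_id", "match_score", "match_method", "confidence"]


def run_cascade(source_ids, stages):
    """Apply stages in order and keep the first confident result per source.

    stages: list of (method_name, fn) where fn(source_id) returns
            (target_id, score, confidence) or None to defer.
    """
    rows = []
    for sid in source_ids:
        for method_name, stage in stages:
            result = stage(sid)
            if result is not None:
                target_id, score, confidence = result
                rows.append({
                    "source_id": sid,
                    "target_id": target_id,
                    "match_score": score,
                    "match_method": method_name,
                    "confidence": confidence,
                })
                break
        else:
            # No stage produced a match: leave the target empty for manual review.
            rows.append({
                "source_id": sid,
                "target_id": "",
                "match_score": "",
                "match_method": "unmatched",
                "confidence": "",
            })
    return rows


def to_csv(rows):
    """Serialize mapping rows to CSV text with the documented columns."""
    buffer = io.StringIO()
    writer = csv.DictWriter(buffer, fieldnames=FIELDS, lineterminator="\n")
    writer.writeheader()
    writer.writerows(rows)
    return buffer.getvalue()
```

Ordering the stages from cheapest to most expensive means the LLM stage only ever sees the sources that every earlier stage deferred.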
Validate a sample with human evaluation. Randomly sample 50-100 matched pairs and 50-100 non-matched pairs from the final mapping. Build a simple evaluation interface (Streamlit works well) for human annotators to verify. Compute inter-annotator agreement and adjust thresholds if precision or recall is inadequate.
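For the agreement check, one common statistic is Cohen's kappa for two annotators; the step does not say which agreement measure to use, so treating it as kappa is an assumption. The sketch below also includes a hypothetical helper for drawing the validation sample.

```python
import random
from collections import Counter


def sample_for_review(matched, unmatched, n_each=50, seed=0):
    """Draw up to n_each items from each group for human verification."""
    rng = random.Random(seed)
    return (
        rng.sample(matched, min(n_each, len(matched))),
        rng.sample(unmatched, min(n_each, len(unmatched))),
    )


def cohens_kappa(labels_a, labels_b):
    """Cohen's kappa for two annotators labeling the same items."""
    if len(labels_a) != len(labels_b) or not labels_a:
        raise ValueError("label lists must be non-empty and of equal length")
    n = len(labels_a)
    # Observed agreement: share of items where both annotators agree.
    observed = sum(a == b for a, b in zip(labels_a, labels_b)) / n
    # Expected agreement by chance, from each annotator's label frequencies.
    count_a, count_b = Counter(labels_a), Counter(labels_b)
    expected = sum(count_a[l] * count_b[l] for l in set(count_a) | set(count_b)) / (n * n)
    if expected == 1.0:
        return 1.0  # both annotators used a single identical label throughout
    return (observed - expected) / (1 - expected)
```

A kappa near 1 indicates strong agreement, while a value near 0 means the annotators agree no more often than chance, which suggests the annotation guidelines need tightening before thresholds are adjusted.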
Example 1: Matching Reddit Movie Mentions to a Product Catalog
User: "I have a dataset of Reddit posts mentioning movies and an Amazon product catalog. I need to link each Reddit movie mention to the correct Amazon product entry."
Approach:
Output:
reddit_post_id,reddit_title,amazon_asin,amazon_title,score,method
r_001,"inception","B004LWZWGQ","Inception (2010) [Blu-ray]",0.94,embedding_fuzzy
r_002,"the dark night","B001GZ6QEC","The Dark Knight (2008)",0.87,embedding_fuzzy
r_003,"shawshank","B000P0J0AQ","The Shawshank Redemption",0.72,llm_verified
Example 2: Evaluating Multiple Entity Matching Strategies
User: "I want to compare different entity matching methods on my product dataset and find which works best."
Approach:
all-MiniLM-L6-v2, threshold 0.80
Output:
| Method | Precision | Recall | F1 | Hit@1 | Hit@5 |
|---------------|-----------|--------|-------|-------|-------|
| Exact Match | 0.98 | 0.42 | 0.59 | 0.42 | 0.42 |
| BM25 | 0.81 | 0.73 | 0.77 | 0.68 | 0.85 |
| Embedding | 0.76 | 0.82 | 0.79 | 0.71 | 0.89 |
| EmbFuzzy | 0.83 | 0.80 | 0.81 | 0.76 | 0.91 |
| LLM (GPT-3.5) | 0.91 | 0.86 | 0.88 | 0.84 | 0.93 |
| Cascade | 0.90 | 0.87 | 0.88 | 0.85 | 0.94 |
Example 3: Building a Cross-Platform Item Deduplication Pipeline
User: "I'm merging two e-commerce catalogs and need to find which items are the same product listed differently."
Approach:
Output:
{
  "canonical_id": "prod_00123",
  "matched_entries": [
    {"source": "catalog_a", "id": "A-789", "title": "Sony WH-1000XM5 Headphones"},
    {"source": "catalog_b", "id": "B-456", "title": "Sony WH1000XM5 Wireless Noise Canceling"}
  ],
  "match_score": 0.92,
  "match_method": "embedding_fuzzy"
}
null rather than forcing a low-confidence match. Flag these for manual review. In the Reddit-Amazon-EM study, approximately 15-20% of mentions had no catalog match.
unicodedata.normalize('NFKD', text) and strip HTML before any matching step.
paraphrase-multilingual-MiniLM-L12-v2) and is not covered by this methodology.