skills/cost-efficient-rag-entity-matching/SKILL.md
Build cost-efficient RAG pipelines for entity matching and deduplication using blocking-based batch retrieval and generation. Reduces LLM API calls and latency by grouping similar entity pairs into blocks before retrieval and inference. Use when the user asks to 'match entities across datasets', 'deduplicate records with LLMs', 'build a RAG pipeline for entity resolution', 'reduce cost of LLM-based record matching', 'link records between two tables', or 'entity matching with knowledge augmentation'.
npx skillsauth add ndpvt-web/arxiv-claude-skills cost-efficient-rag-entity-matching
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and implement cost-efficient retrieval-augmented generation pipelines for entity matching — the task of determining whether two records from different data sources refer to the same real-world entity. Rather than invoking an LLM once per candidate pair (which is prohibitively expensive at scale), this approach groups similar records into blocks, performs batch retrieval of contextual knowledge across entire blocks, and feeds multiple pairs into a single LLM prompt. The technique, based on the CE-RAG4EM architecture, typically achieves comparable or better F1 scores while drastically reducing API calls and end-to-end runtime.
Blocking-based batch retrieval and generation. Standard RAG-for-entity-matching retrieves context and generates a match decision independently for each candidate pair. With N candidate pairs, that means N retrieval calls and N LLM invocations. CE-RAG4EM reduces this by first grouping records into similarity-based blocks using character q-gram or token-based blocking. Within each block, candidate pairs share overlapping attributes, so a single batch retrieval query can fetch contextual knowledge relevant to the entire block. This amortizes retrieval cost across all pairs in the block.
Three retrieval granularities. The framework supports entity-level retrieval (fetching Wikidata entity descriptions by vector similarity), predicate-level retrieval (fetching relation types), and triple-level retrieval (BFS or neighborhood expansion on a knowledge graph). Entity/predicate-level retrieval is cheaper and works well for textual attributes; triple-level retrieval captures richer structural context for numeric or ambiguous attributes but costs more.
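Triple-level retrieval by neighborhood expansion can be pictured as a bounded breadth-first search. The sketch below is illustrative only: the adjacency-dict graph format, the function name `bfs_triples`, and the `max_hops` / `max_triples` limits are assumptions, not part of CE-RAG4EM, and the toy graph is made up for the example.

```python
from collections import deque


def bfs_triples(graph, seeds, max_hops=1, max_triples=20):
    """Collect (subject, predicate, object) triples within max_hops of the seeds.

    graph: dict mapping a node to a list of (predicate, neighbor) edges
    (an assumed in-memory format, not a real knowledge-graph API).
    """
    triples = []
    visited = set(seeds)
    queue = deque((seed, 0) for seed in seeds)
    while queue and len(triples) < max_triples:
        node, depth = queue.popleft()
        if depth >= max_hops:
            continue  # do not expand beyond the hop limit
        for predicate, neighbor in graph.get(node, []):
            triples.append((node, predicate, neighbor))
            if len(triples) >= max_triples:
                break
            if neighbor not in visited:
                visited.add(neighbor)
                queue.append((neighbor, depth + 1))
    return triples


# Hypothetical toy graph, for illustration only.
toy_graph = {
    "Arijit Khan": [("employer", "Aalborg University"), ("field", "graph databases")],
    "Aalborg University": [("country", "Denmark")],
}
one_hop = bfs_triples(toy_graph, ["Arijit Khan"], max_hops=1)
two_hop = bfs_triples(toy_graph, ["Arijit Khan"], max_hops=2)
```

Capping both the hop count and the number of triples keeps the retrieved context small, which matters because triple-level retrieval is the more expensive granularity.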
Batch generation. Instead of one LLM call per pair, the system aggregates all pairs in a block (capped at max_bs, typically 4-6) into a single prompt with shared contextual knowledge. The LLM processes pairs sequentially within the prompt and outputs a match decision per pair. This reduces total LLM invocations by a factor roughly equal to the block size, with minimal impact on F1 when block size stays in the 4-6 range.
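To make the saving concrete, here is a back-of-the-envelope estimate of LLM invocations with and without batching. It assumes every batch is full except possibly the last, so it is a best case; the function name `estimate_llm_calls` is hypothetical.

```python
import math


def estimate_llm_calls(num_pairs, max_bs):
    """Return (per_pair_calls, batched_calls), assuming batches are filled to max_bs."""
    if max_bs < 1:
        raise ValueError("max_bs must be at least 1")
    return num_pairs, math.ceil(num_pairs / max_bs)


for max_bs in (4, 6):
    per_pair, batched = estimate_llm_calls(50_000, max_bs)
    print(f"max_bs={max_bs}: {per_pair} calls per pair vs {batched} batched calls")
```

In practice some blocks hold fewer than max_bs pairs after deduplication, so the real call count sits between the two numbers.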
Load and normalize both tables. Parse source table Ts and target table Tt into a common schema. Identify matching columns (title, name, manufacturer, price, etc.) and normalize data types. Handle missing values by leaving fields empty rather than imputing.
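A minimal sketch of this loading step, using only the standard library so it runs anywhere; the same logic maps onto a pandas DataFrame. The helper name `load_table`, the choice of `price` as the numeric column, and the sample rows are assumptions for illustration.

```python
import csv
import io


def load_table(csv_text, columns, numeric=("price",)):
    """Parse CSV text into a list of dicts restricted to a shared schema."""
    records = []
    for row in csv.DictReader(io.StringIO(csv_text)):
        record = {}
        for col in columns:
            value = (row.get(col) or "").strip()
            if col in numeric and value:
                try:
                    value = float(value.replace("$", "").replace(",", ""))
                except ValueError:
                    value = ""  # unparseable number: treat as missing, do not impute
            record[col] = value  # missing fields stay as empty strings
        records.append(record)
    return records


# Hypothetical sample data.
sample = "title,manufacturer,price\n Adobe Photoshop ,Adobe,$699.00\nNorton Antivirus,,\n"
rows = load_table(sample, ["title", "manufacturer", "price"])
```

Leaving missing values empty, rather than filling them with a default, avoids creating spurious agreement between records that simply lack the same field.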
Select blocking keys and build blocks. Choose a blocking strategy based on attribute types (for example, character q-gram or token-based blocking). For each block B, the candidate pairs are P_B = {(ri, rj) | ri in B ∩ Ts, rj in B ∩ Tt}.
Deduplicate and partition blocks. Remove duplicate pairs that appear in multiple blocks (keep first occurrence only). If any block exceeds max_bs pairs (recommended: 4-6), split it into non-overlapping sub-blocks of at most max_bs pairs each.
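The blocking, deduplication, and partitioning steps above can be sketched with token-based blocking as follows. This is one possible reading rather than the reference implementation: the function names, the `dict id -> text` input format, and the alphabetical block order (used only to make the output deterministic) are assumptions.

```python
from collections import defaultdict


def token_blocks(source, target):
    """source/target: dict of record id -> blocking-key text.

    Returns token -> (source ids, target ids).
    """
    blocks = defaultdict(lambda: ([], []))
    for side, table in enumerate((source, target)):
        for rid, text in table.items():
            for token in set(text.lower().split()):
                blocks[token][side].append(rid)
    return blocks


def partition_pairs(blocks, max_bs=6):
    """Form P_B per block, keep each pair's first occurrence, split into sub-blocks."""
    seen = set()
    sub_blocks = []
    for token in sorted(blocks):
        src_ids, tgt_ids = blocks[token]
        pairs = []
        for s in src_ids:
            for t in tgt_ids:
                if (s, t) not in seen:  # pair already assigned to an earlier block
                    seen.add((s, t))
                    pairs.append((s, t))
        # Non-overlapping sub-blocks of at most max_bs pairs each.
        for start in range(0, len(pairs), max_bs):
            sub_blocks.append(pairs[start:start + max_bs])
    return sub_blocks


# Hypothetical records.
source = {"s1": "Adobe Photoshop CS3", "s2": "Adobe Acrobat"}
target = {"t1": "adobe photoshop", "t2": "Acrobat Pro"}
batches = partition_pairs(token_blocks(source, target), max_bs=6)
```

Unlike a scheme that fills batches across blocks, this variant never mixes pairs from different blocks, so every batch can share one retrieval query at the cost of some under-filled batches.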
Construct batch retrieval queries. For each block, aggregate the key attributes of all entity pairs into a single query string. Embed this query using a dense encoder (e.g., Jina Embeddings V3 or sentence-transformers).
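One way to aggregate a block into a single query string is to join the distinct attribute values of all its records. The sketch below assumes records are dicts keyed by id; `build_block_query` and the `" | "` separator are illustrative choices, not something the paper prescribes.

```python
def build_block_query(pairs, source, target, attrs):
    """Join the distinct key-attribute values of every record in a block."""
    seen = set()
    parts = []
    for s_id, t_id in pairs:
        for record in (source[s_id], target[t_id]):
            for attr in attrs:
                value = str(record.get(attr, "")).strip()
                key = value.lower()
                if value and key not in seen:  # skip blanks and repeated values
                    seen.add(key)
                    parts.append(value)
    return " | ".join(parts)


# Records taken from the citation example in this document.
source = {"s1": {"title": "Graph databases", "authors": "A. Khan"}}
target = {"t1": {"title": "Graph database systems", "authors": "Arijit Khan"}}
query = build_block_query([("s1", "t1")], source, target, ["title", "authors"])
```

The resulting string is what would be passed to the dense encoder; dropping repeated values keeps the query short when records in a block share a manufacturer or venue.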
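Retrieve contextual knowledge. Query a vector index over your knowledge base (Wikidata, domain KG, or a curated reference table) with Top-k=2. Choose a granularity: entity-level or predicate-level for textual attributes, triple-level for numeric or ambiguous attributes.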
Enrich and filter retrieved context. Rank retrieved items by cosine similarity to the original block query. Optionally use a lightweight LLM call to filter irrelevant results. Attach textual descriptions to entity IDs.
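The retrieval and ranking steps above reduce to a cosine-similarity top-k search. The sketch below uses a brute-force scan over an in-memory list instead of a real vector index, and the two-dimensional vectors and `kb:*` identifiers are made up; a production setup would use embeddings from the chosen encoder and an approximate-nearest-neighbor index.

```python
import math


def cosine(u, v):
    """Cosine similarity between two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0


def top_k(query_vec, index, k=2):
    """index: list of (item_id, description, vector).

    Returns the k highest-scoring (score, item_id, description) tuples.
    """
    scored = [(cosine(query_vec, vec), item_id, desc) for item_id, desc, vec in index]
    scored.sort(key=lambda item: item[0], reverse=True)
    return scored[:k]


# Hypothetical knowledge-base entries with toy embeddings.
index = [
    ("kb:1", "graph database researcher", [1.0, 0.0]),
    ("kb:2", "database systems venue", [1.0, 1.0]),
    ("kb:3", "unrelated entry", [0.0, 1.0]),
]
hits = top_k([1.0, 0.0], index, k=2)
```

Because each hit carries its description, the text needed for the prompt's "Additional Information" section is already attached to the entity ID.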
Compose batch prompts. Build an LLM prompt containing: (a) all entity pairs in the block formatted as numbered items, (b) the shared retrieved knowledge as "Additional Information", and (c) instructions to process each pair independently and output Yes/No per pair.
Run batch LLM inference. Send the prompt to the LLM with temperature=0.5, top_p=0.8, max_tokens=1024. Parse the response to extract per-pair match decisions.
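Parsing the batch response is where this approach is most fragile, since the model may skip or reorder pairs. A defensive sketch, assuming the model answers either with lines like "Pair 2: No" or with a bare list of Yes/No tokens; `parse_decisions` is a hypothetical helper, and pairs it cannot resolve come back as `None` so they can be retried individually.

```python
import re


def parse_decisions(response, num_pairs):
    """Return one True/False/None per pair from a batch LLM response."""
    decisions = [None] * num_pairs
    # Preferred form: "Pair <n> ... Yes/No". \D*? stops the match from
    # running past the next pair number.
    numbered = re.findall(
        r"pair\s*(\d+)\D*?\b(yes|no)\b", response, flags=re.IGNORECASE
    )
    if numbered:
        for index, answer in numbered:
            position = int(index) - 1
            if 0 <= position < num_pairs and decisions[position] is None:
                decisions[position] = answer.lower() == "yes"
        return decisions
    # Fallback: a bare sequence such as "[Yes, No]"; accept it only when
    # the number of answers matches the number of pairs.
    answers = re.findall(r"\b(yes|no)\b", response, flags=re.IGNORECASE)
    if len(answers) == num_pairs:
        decisions = [a.lower() == "yes" for a in answers]
    return decisions
```

Returning `None` instead of guessing keeps an ambiguous response from silently turning into a false match.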
Aggregate results and evaluate. Collect all match decisions across blocks. Compute precision, recall, and F1. If ground-truth labels are available, use them to tune max_bs and Top-k on a validation split.
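For the evaluation step, precision, recall, and F1 over matched pairs can be computed directly from sets. The function name and the pair-set representation are assumptions; the formulas are the standard ones.

```python
def precision_recall_f1(predicted, gold):
    """predicted, gold: sets of (source_id, target_id) pairs judged / known to match."""
    true_pos = len(predicted & gold)
    precision = true_pos / len(predicted) if predicted else 0.0
    recall = true_pos / len(gold) if gold else 0.0
    if precision + recall == 0:
        return precision, recall, 0.0
    f1 = 2 * precision * recall / (precision + recall)
    return precision, recall, f1


# Hypothetical predictions and labels.
predicted = {(1, 1), (2, 2), (3, 4)}
gold = {(1, 1), (2, 2), (5, 5), (6, 6)}
p, r, f1 = precision_recall_f1(predicted, gold)
```

Running this on a validation split for each candidate `max_bs` and Top-k gives the numbers needed to pick a configuration.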
Iterate on configuration. Adjust block size (larger blocks = fewer API calls but potential F1 drop), retrieval granularity (entity-level for text attributes, triple-level for numeric/ambiguous), and Top-k to find the cost-quality sweet spot for your dataset.
Example 1: Product catalog deduplication
User: "I have two product CSVs from different vendors — amazon.csv and google.csv — with columns title, manufacturer, price. Help me find matching products without spending a fortune on API calls."
Approach:
[…] title column to generate candidate blocks.
[…] max_bs=6, split oversized blocks.

```python
import pandas as pd  # ts and tt are expected to be pandas DataFrames
from collections import defaultdict


def qgram_blocking(ts, tt, attr="title", q=3, max_bs=6):
    """Group records into blocks using character q-grams."""
    blocks = defaultdict(lambda: {"source": [], "target": []})
    # Index every source and target record under each of its q-grams.
    for idx, row in ts.iterrows():
        val = str(row[attr]).lower()
        for i in range(len(val) - q + 1):
            blocks[val[i:i+q]]["source"].append(idx)
    for idx, row in tt.iterrows():
        val = str(row[attr]).lower()
        for i in range(len(val) - q + 1):
            blocks[val[i:i+q]]["target"].append(idx)
    # Form candidate pairs, deduplicate, partition.
    # Batches are filled across q-gram blocks, so each one holds max_bs
    # pairs except possibly the last.
    seen = set()
    batches = []
    current_batch = []
    for block in blocks.values():
        for s_idx in set(block["source"]):
            for t_idx in set(block["target"]):
                pair = (s_idx, t_idx)
                if pair not in seen:  # keep the first occurrence only
                    seen.add(pair)
                    current_batch.append(pair)
                    if len(current_batch) >= max_bs:
                        batches.append(current_batch)
                        current_batch = []
    if current_batch:
        batches.append(current_batch)
    return batches


def build_batch_prompt(pairs, ts, tt, context, attrs):
    """Compose a single LLM prompt for an entire block of pairs."""
    lines = ["You are an expert in entity matching.\n## Input:"]
    # Number the pairs so per-pair answers can be mapped back.
    for i, (s, t) in enumerate(pairs, 1):
        e1 = {a: str(ts.loc[s, a]) for a in attrs}
        e2 = {a: str(tt.loc[t, a]) for a in attrs}
        lines.append(f"Pair {i} - Entity 1: {e1} Entity 2: {e2}")
    lines.append(f"\nAdditional Information: {context}")
    lines.append("\n## Instruction:")
    lines.append("Process each pair independently. Compare semantics and")
    lines.append("use the additional information to resolve ambiguity.")
    lines.append("\n## Output: For each pair, respond 'Yes' or 'No'.")
    return "\n".join(lines)
```
Output: A CSV of matched pairs with one LLM call per 6 pairs instead of one per pair — roughly 6x fewer API calls.
Example 2: Citation record linking with knowledge graph augmentation
User: "I need to link author records between DBLP and ACM datasets. Some authors have similar names but different publication histories. Can you use a knowledge graph to help disambiguate?"
Approach:
[…] title, authors, venue, year.
[…] authors to group candidates.

Block 3 prompt (max_bs=4):
---
Pair 1 - Entity 1: {title: "Graph databases", authors: "A. Khan", venue: "VLDB"}
Entity 2: {title: "Graph database systems", authors: "Arijit Khan", venue: "PVLDB"}
Pair 2 - Entity 1: {title: "RDF indexing", authors: "P. Groth", venue: "ISWC"}
Entity 2: {title: "RDF index structures", authors: "Paul Groth", venue: "ISWC 2023"}
Additional Information:
- Arijit Khan: Professor at Aalborg University, research areas: graph databases, knowledge graphs
- Paul Groth: Professor at University of Amsterdam, research areas: data provenance, knowledge graphs
Match Decisions: [Yes, Yes]
Example 3: Reducing cost of an existing entity matching pipeline
User: "Our current entity matching pipeline calls GPT-4o for every candidate pair. We have 50,000 pairs and it costs too much. How can I reduce the cost?"
Approach:
[…] max_bs from 6 to 4.

[…] max_bs between 4 and 6. This range consistently yields near-peak F1 while providing substantial cost reduction. Going above 8 degrades quality.
[…] Top-k higher than 2 for retrieval. Experiments show diminishing returns beyond k=2 with increasing noise and cost.
[…] max_bs and split oversized blocks into sub-blocks. Log blocks that exceed 2x the cap as warnings.
[…] max_bs if still over limit.

Paper: Cost-Efficient RAG for Entity Matching with LLMs: A Blocking-based Exploration (Ma et al., 2026). Look for Table 1 (six CE-RAG4EM design variants), Figure 2 (unified framework pipeline), and Section 5 (experiments across 9 benchmarks showing F1 improvements of +2-24% with reduced API calls).