skills/efficient-table-retrieval-understanding/SKILL.md
Build TabRAG-style pipelines that retrieve relevant tables from large image collections and answer natural language queries over them using multimodal LLMs. Implements a three-stage retrieve-rerank-reason architecture for table question answering at scale.

Trigger phrases:
- "find the right table and answer my question"
- "search across table images to answer a query"
- "build a table retrieval pipeline"
- "RAG over table images"
- "table QA from document scans"
- "retrieve and reason over tabular data"
npx skillsauth add ndpvt-web/arxiv-claude-skills efficient-table-retrieval-understanding
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and implement TabRAG-style pipelines — three-stage systems that (1) retrieve candidate tables from large image collections using visual-text embeddings, (2) rerank candidates with a multimodal LLM for fine-grained relevance scoring, and (3) reason over the top-ranked tables to generate answers. The technique is based on the EACL 2026 paper by Xu et al. and is specifically designed for scenarios where the relevant table is not known in advance and must be found from thousands of table images.
TabRAG decomposes the problem of answering queries over large table collections into three stages with distinct computational profiles. Stage 1 (Retrieval) uses jointly trained visual and text encoders — a vision encoder like LayoutLMv3 for table images and a text encoder like GTE for queries — to compute embeddings and perform fast approximate nearest-neighbor search via FAISS. This runs in ~57ms and filters thousands of tables down to ~10 candidates. The key insight is that layout-aware vision encoders (LayoutLMv3) outperform generic vision encoders (CLIP) because they understand spatial structure inherent in tables. The encoders are trained with contrastive learning (InfoNCE loss) to maximize cosine similarity between matched query-table pairs.
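The contrastive objective described above can be illustrated with a small, dependency-free sketch. It assumes in-batch negatives (every other table in the batch is a negative for a given query) and a temperature of 0.05; neither detail is stated in the text, and the function names are hypothetical.

```python
import math

def cosine(u, v):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)

def info_nce_loss(query_embs, table_embs, temperature=0.05):
    """Mean InfoNCE loss where query_embs[i] is matched with table_embs[i].

    The temperature value is an assumption, not taken from the paper.
    """
    losses = []
    for i, q in enumerate(query_embs):
        logits = [cosine(q, t) / temperature for t in table_embs]
        # Numerically stable log-sum-exp over all tables in the batch.
        m = max(logits)
        log_denominator = m + math.log(sum(math.exp(x - m) for x in logits))
        losses.append(log_denominator - logits[i])
    return sum(losses) / len(losses)

# Toy batch: correctly aligned pairs should yield a lower loss than shuffled pairs.
queries = [[1.0, 0.0], [0.0, 1.0]]
tables_aligned = [[0.9, 0.1], [0.1, 0.9]]
tables_shuffled = [[0.1, 0.9], [0.9, 0.1]]
loss_aligned = info_nce_loss(queries, tables_aligned)
loss_shuffled = info_nce_loss(queries, tables_shuffled)
```

Minimizing this loss pushes each query toward its matched table and away from the other tables in the batch, which is what makes the later nearest-neighbor search meaningful.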
Stage 2 (Reranking) passes the top-k candidates through a multimodal LLM (e.g., Mistral-7B with a CLIP ViT visual encoder) that performs fine-grained relevance assessment. The MLLM is trained on three complementary tasks: retrieval-augmented QA, binary context ranking (True/False relevance classification), and multi-table relevance identification. During inference, the model outputs the probability of the "True" token as a relevance score. This stage costs ~810ms but provides the critical precision boost (+7.0% recall improvement).
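As a rough illustration of turning the reranker's output into a relevance score, the sketch below applies a softmax over the logits of the "True" and "False" tokens. Renormalizing over just these two tokens is an assumption (the text only says the probability of the "True" token is used), and `true_probability` is a hypothetical helper name.

```python
import math

def true_probability(logit_true, logit_false):
    """Softmax over the 'True' and 'False' token logits, returning P('True')."""
    m = max(logit_true, logit_false)  # subtract the max for numerical stability
    exp_true = math.exp(logit_true - m)
    exp_false = math.exp(logit_false - m)
    return exp_true / (exp_true + exp_false)

score_relevant = true_probability(3.2, -1.0)    # model leans towards "True"
score_irrelevant = true_probability(-0.5, 2.0)  # model leans towards "False"
```

Because the score is a probability in [0, 1], it can be compared across candidates and reused later as a confidence signal.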
Stage 3 (Answer Generation) feeds the top-ranked table images directly into an MLLM alongside the user query — no OCR conversion needed. The model reasons over the visual table representation to produce answers, achieving 6.1% higher accuracy than prior methods. The direct image input avoids OCR errors that plague text-based approaches, especially for complex layouts, merged cells, and handwritten data.
Inventory and preprocess the table collection. Catalog all table images, normalize to a consistent resolution, and strip any surrounding non-table content (headers, footers, page numbers). Store metadata (source document, page number) alongside each image for provenance tracking.
Encode all table images into embeddings. Use a layout-aware vision encoder (LayoutLMv3 or similar) to compute a fixed-size embedding vector for each table image. Store these in a FAISS index with IVF (inverted file) partitioning for sub-linear search. For collections under 100K tables, a flat (exhaustive) index is sufficient: use L2 distance, or inner product over L2-normalized embeddings to get cosine similarity.
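One way to encode the index choice from this step is a small helper that returns a FAISS `index_factory` description string. This is a sketch under assumptions: the 100K threshold comes from the step above, while the `4 * sqrt(N)` cluster count is a common rule of thumb rather than anything the paper prescribes, and `choose_index_spec` is a hypothetical name.

```python
import math

def choose_index_spec(num_tables, flat_limit=100_000):
    """Return a FAISS index_factory string for a collection of `num_tables` vectors."""
    if num_tables < flat_limit:
        return "Flat"  # exhaustive search is fast enough at this size
    nlist = int(4 * math.sqrt(num_tables))  # assumed rule of thumb for the cluster count
    return f"IVF{nlist},Flat"

spec_small = choose_index_spec(5_000)
spec_large = choose_index_spec(1_000_000)

# With FAISS installed, the string can be handed to the index factory, e.g.:
#   index = faiss.index_factory(dim, spec_large, faiss.METRIC_INNER_PRODUCT)
#   index.train(table_embeddings)  # IVF indexes must be trained before add()
#   index.add(table_embeddings)
```

Keeping the decision in a pure function makes it easy to adjust the threshold once real latency numbers for the collection are available.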
Encode the user query into the shared embedding space. Use a text encoder (GTE, or the text tower of your jointly trained model) to map the natural language query into the same vector space as the table embeddings. Preprocess the query by removing formatting instructions (e.g., "Show answer in JSON") that degrade embedding quality.
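The query clean-up in this step might look like the following sketch. The regular expressions are illustrative assumptions covering a few common output-format instructions; they are not from the paper and would need extending for real query logs.

```python
import re

# Hypothetical patterns for output-format instructions.
FORMAT_INSTRUCTION_PATTERNS = [
    r"\b(?:show|give|return|provide|format)\b[^.?!]*\b(?:json|csv|xml|markdown)\b[^.?!]*[.?!]?",
    r"\banswer in (?:one|a single) (?:word|sentence)[.?!]?",
]

def clean_query(query):
    """Strip formatting instructions so only the information need is embedded."""
    cleaned = query
    for pattern in FORMAT_INSTRUCTION_PATTERNS:
        cleaned = re.sub(pattern, "", cleaned, flags=re.IGNORECASE)
    return re.sub(r"\s+", " ", cleaned).strip()  # collapse leftover whitespace

cleaned = clean_query("What was the GDP growth rate in 2024? Show answer in JSON.")
untouched = clean_query("Which team won in 2019?")
```

The removed instruction can still be appended to the final generation prompt; it is only kept out of the text that gets embedded for retrieval.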
Retrieve top-k candidate tables via approximate nearest neighbor search. Query the FAISS index with the text embedding and retrieve the top 10-20 candidates ranked by cosine similarity. This stage is fast (~57ms) and acts as a coarse filter.
Rerank candidates with a multimodal LLM. For each candidate table image, construct a prompt: "For the question '{query}', assess whether this table contains relevant information. Answer True or False." Feed the table image and prompt to the MLLM, take the probability the model assigns to the "True" token as the relevance score, and re-sort candidates by that score. Keep the top 1-3 tables.
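The re-sorting logic of this step can be sketched independently of any particular model. Here `score_fn` is a hypothetical callable standing in for the MLLM call that returns P("True") for an image and prompt; the fixed scores below exist only to make the example runnable.

```python
def rerank(query, candidate_paths, score_fn, keep=3):
    """Score each candidate with `score_fn` and return the `keep` best (score, path) pairs."""
    prompt = (
        f"For the question '{query}', assess whether this table contains "
        "relevant information. Answer True or False."
    )
    scored = [(score_fn(path, prompt), path) for path in candidate_paths]
    scored.sort(key=lambda pair: pair[0], reverse=True)  # highest relevance first
    return scored[:keep]

# Stand-in scorer with fixed scores, used only for illustration.
fake_scores = {"t1.png": 0.12, "t2.png": 0.94, "t3.png": 0.55, "t4.png": 0.31}
top = rerank(
    "What was the GDP growth rate in 2024?",
    list(fake_scores),
    score_fn=lambda path, prompt: fake_scores[path],
    keep=2,
)
```

Returning the scores alongside the paths keeps them available for the confidence indicator in the post-processing step.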
Generate the answer using the top-ranked table(s). Construct a final prompt that includes the user query and the top-ranked table image(s) as visual inputs. The MLLM reasons directly over the image — no OCR intermediate step. Use task-appropriate prompting: "Answer the following question based on the table:" for QA, or "Verify whether the following statement is supported by the table:" for fact-checking.
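A minimal sketch of task-appropriate prompt construction follows. The two task prefixes are quoted from this step and the grounding instruction is the one recommended in the troubleshooting table; the task keys and the function name are assumptions made for illustration.

```python
TASK_PROMPTS = {
    "qa": "Answer the following question based on the table:",
    "fact_verification": "Verify whether the following statement is supported by the table:",
}

GROUNDING_INSTRUCTION = "Answer ONLY based on the provided table, not your prior knowledge."

def build_generation_prompt(task, text, grounded_only=True):
    """Assemble the text part of the Stage 3 prompt; images are passed separately."""
    if task not in TASK_PROMPTS:
        raise ValueError(f"Unsupported task: {task!r}")
    parts = [TASK_PROMPTS[task], text]
    if grounded_only:
        parts.append(GROUNDING_INSTRUCTION)
    return "\n".join(parts)

qa_prompt = build_generation_prompt("qa", "What was the GDP growth rate in 2024?")
```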
Post-process and validate the answer. For numerical answers, verify units and magnitude against visible table data. For text generation tasks (summaries, descriptions), check that generated entities actually appear in the table. Return the answer along with a confidence indicator based on the reranking score.
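The validation described here could be approximated as below. The sketch assumes the table's cell strings are available for checking (for example from dataset annotations or a separate transcription pass, which the pipeline itself does not require), and the confidence thresholds are arbitrary placeholders to be tuned.

```python
import re

# A number not glued to a preceding word character, with optional thousands separators.
NUMBER_RE = re.compile(r"(?<![\w.])-?\d[\d,]*(?:\.\d+)?")

def extract_numbers(text):
    """Return the numbers found in `text` as floats, ignoring thousands separators."""
    return [float(m.replace(",", "")) for m in NUMBER_RE.findall(text)]

def numbers_grounded(answer, table_cells):
    """True if every number in the answer also appears in some table cell."""
    table_numbers = {n for cell in table_cells for n in extract_numbers(cell)}
    return all(n in table_numbers for n in extract_numbers(answer))

def confidence_label(rerank_score, high=0.8, low=0.5):
    """Map the reranking score to a coarse label; thresholds are assumptions."""
    if rerank_score >= high:
        return "high"
    return "medium" if rerank_score >= low else "low"

cells = ["Year", "Population", "2023", "14,094,034"]
ok = numbers_grounded("Tokyo's population was 14,094,034 in 2023.", cells)
bad = numbers_grounded("The population was 13,500,000.", cells)
```

A failed grounding check does not prove the answer is wrong (derived values such as sums will not appear verbatim), so it is better treated as a flag for review than as a hard rejection.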
Evaluate retrieval and generation quality. Measure retrieval with MRR (Mean Reciprocal Rank) and Recall@k. Measure generation with task-specific metrics: exact-match accuracy for QA/fact verification, BLEU for text generation. Log per-query retrieval rank for debugging.
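The retrieval metrics named in this step can be computed with a few lines of plain Python. The sketch assumes exactly one relevant table per query, which is a simplification; the function names and toy data are illustrative.

```python
def reciprocal_rank(ranked_ids, relevant_id):
    """1 / rank of the relevant table, or 0.0 if it was not retrieved."""
    for rank, table_id in enumerate(ranked_ids, start=1):
        if table_id == relevant_id:
            return 1.0 / rank
    return 0.0

def mean_reciprocal_rank(runs):
    """`runs` is a list of (ranked_ids, relevant_id) pairs, one per query."""
    return sum(reciprocal_rank(r, gold) for r, gold in runs) / len(runs)

def recall_at_k(runs, k):
    """Fraction of queries whose relevant table appears in the top k results."""
    hits = sum(1 for r, gold in runs if gold in r[:k])
    return hits / len(runs)

runs = [
    (["a", "b", "c"], "a"),  # relevant table at rank 1
    (["d", "e", "f"], "e"),  # relevant table at rank 2
    (["g", "h", "i"], "z"),  # relevant table not retrieved
]
mrr = mean_reciprocal_rank(runs)
r_at_1 = recall_at_k(runs, 1)
r_at_2 = recall_at_k(runs, 2)
```

Logging `reciprocal_rank` per query, rather than only the mean, gives the per-query retrieval rank recommended for debugging.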
Iterate on the contrastive training data. Collect hard negatives — tables that are visually similar but contain different data — and retrain the retrieval encoders. The original paper uses batch size 32, learning rate 2e-5 with cosine decay, and Adam optimizer (beta1=0.9, beta2=0.98).
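Two pieces of this step lend themselves to a short sketch. The first function mines hard negatives by taking the highest-ranked non-gold tables from the current retriever, which is a common proxy and an assumption here (the text defines hard negatives by visual similarity). The second computes a cosine-decayed learning rate starting from the reported 2e-5; decaying to zero without warmup is also an assumption.

```python
import math

def mine_hard_negatives(ranked_ids, gold_id, num_negatives=2):
    """Pick the highest-ranked retrieved tables that are not the gold table."""
    return [t for t in ranked_ids if t != gold_id][:num_negatives]

def cosine_decay_lr(step, total_steps, base_lr=2e-5):
    """Learning rate after `step` updates under cosine decay from base_lr to zero."""
    progress = min(step, total_steps) / total_steps
    return 0.5 * base_lr * (1.0 + math.cos(math.pi * progress))

negatives = mine_hard_negatives(["t7", "t2", "t9", "t4"], gold_id="t2")
lr_start = cosine_decay_lr(0, 1000)
lr_mid = cosine_decay_lr(500, 1000)
lr_end = cosine_decay_lr(1000, 1000)
```

Tables that the current retriever ranks highly but that are not the gold table are the ones it confuses most, so adding them as negatives targets the encoder's actual failure cases.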
Example 1: Financial Report QA Pipeline
User: "I have 5,000 table images extracted from quarterly SEC filings. Build a system that can answer questions like 'What was Apple's revenue in Q3 2025?'"
Approach:
Output:
Retrieved table: SEC_filing_AAPL_10Q_2025Q3_page12.png (rerank score: 0.94)
Answer: Apple's revenue in Q3 2025 was $94.8 billion.
Retrieval rank: 1 (MRR: 1.0)
Example 2: Multi-Table Fact Verification
User: "I need to verify claims against a database of 20,000 Wikipedia tables stored as images. Check: 'The population of Tokyo exceeded 14 million in 2023.'"
Approach:
Output:
Retrieved table: wiki_tokyo_demographics_table3.png (rerank score: 0.89)
Verdict: SUPPORTED
Evidence: Table shows Tokyo population as 14,094,034 for 2023.
Example 3: Building the Pipeline from Scratch in Python
User: "Show me how to set up the retrieval and reranking stages."
```python
import faiss
import numpy as np
import torch
from PIL import Image
from transformers import AutoModel, AutoTokenizer, LayoutLMv3ImageProcessor

# Assumed to be supplied by the caller (not defined in this sketch):
#   table_image_paths                  list of table image file paths
#   mllm_score(image_path, prompt)     -> float, P("True") from the reranker MLLM
#   mllm_generate(image_path, prompt)  -> str, answer from the generation MLLM

# Stage 1: Build the retrieval index.
# These off-the-shelf checkpoints do not share an embedding space; fine-tune them
# jointly with the contrastive objective before relying on the similarity scores.
image_processor = LayoutLMv3ImageProcessor(apply_ocr=False)  # resize + normalize, no OCR
vision_encoder = AutoModel.from_pretrained("microsoft/layoutlmv3-base").eval()
text_encoder = AutoModel.from_pretrained("thenlper/gte-base").eval()
tokenizer = AutoTokenizer.from_pretrained("thenlper/gte-base")

# Encode all table images
table_embeddings = []
with torch.no_grad():
    for img_path in table_image_paths:
        img = Image.open(img_path).convert("RGB")
        pixel_values = image_processor(images=img, return_tensors="pt").pixel_values
        emb = vision_encoder(pixel_values=pixel_values).last_hidden_state[:, 0, :]  # CLS token
        emb = emb / emb.norm(dim=-1, keepdim=True)  # L2 normalize
        table_embeddings.append(emb.numpy())

table_embeddings = np.vstack(table_embeddings).astype("float32")
index = faiss.IndexFlatIP(table_embeddings.shape[1])  # inner product = cosine sim on normalized vecs
index.add(table_embeddings)

# Stage 1: Retrieve candidates
query = "What was the GDP growth rate in 2024?"
query_tokens = tokenizer(query, return_tensors="pt", padding=True, truncation=True)
with torch.no_grad():
    query_emb = text_encoder(**query_tokens).last_hidden_state[:, 0, :]
    query_emb = query_emb / query_emb.norm(dim=-1, keepdim=True)
k = min(10, index.ntotal)  # FAISS pads results with -1 when k exceeds the index size
scores, indices = index.search(query_emb.numpy(), k)
candidate_paths = [table_image_paths[i] for i in indices[0]]

# Stage 2: Rerank with MLLM
rerank_scores = []
for path in candidate_paths:
    prompt = f"For the question '{query}', does this table contain relevant information? Answer True or False."
    score = mllm_score(image_path=path, prompt=prompt)  # P(True)
    rerank_scores.append(score)
top_table = candidate_paths[int(np.argmax(rerank_scores))]

# Stage 3: Generate answer
answer = mllm_generate(
    image_path=top_table,
    prompt=f"Answer the following question based on this table: {query}",
)
```
| Problem | Cause | Solution |
|---------|-------|----------|
| Retrieval returns irrelevant tables | Query-image embedding space misalignment | Fine-tune encoders with contrastive loss on domain-specific query-table pairs |
| Reranker assigns high scores to wrong tables | Visually similar tables with different content | Add hard negatives to reranker training; increase candidate pool size |
| MLLM generates hallucinated numbers | Table image resolution too low for fine print | Ensure table images are at least 1024px on the long edge; use high-res visual encoders (ViT-L-336px) |
| FAISS search is slow on large collections | Flat index doesn't scale past ~1M vectors | Switch to IndexIVFFlat or IndexIVFPQ with nprobe tuning |
| Answer contradicts visible table data | MLLM over-relies on parametric knowledge | Add explicit instruction: "Answer ONLY based on the provided table, not your prior knowledge" |
Paper: Efficient Table Retrieval and Understanding with Multimodal Large Language Models (Xu et al., EACL 2026 Findings)
What to look for: Section 3 for the three-stage architecture details, Table 3 for ablation results showing each stage's contribution, and Section 4.6 for computational cost analysis (57ms retrieval / 810ms reranking / 520ms generation).