skills/chunking-retrieval-re-ranking-empirical-evaluation/SKILL.md
Build and optimize two-stage RAG pipelines with bi-encoder retrieval, cross-encoder re-ranking, and empirically-validated chunking strategies. Use when: 'build a RAG pipeline', 'add re-ranking to retrieval', 'optimize chunking for documents', 'set up document QA with re-ranking', 'improve RAG faithfulness', 'two-stage retrieval pipeline'.
npx skillsauth add ndpvt-web/arxiv-claude-skills chunking-retrieval-re-ranking-empirical-evaluation
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design, implement, and optimize Retrieval-Augmented Generation (RAG) pipelines that use a two-stage retrieval architecture: fast bi-encoder retrieval followed by precise cross-encoder re-ranking. Based on empirical results from Maharjan & Yadav (2026), this approach raises faithfulness from ~0.35 (vanilla LLM) to ~0.80 (Advanced RAG) on domain-specific document QA, a 129% improvement over baseline. The skill covers chunking strategy selection, retrieval configuration, re-ranking integration, and RAGAS-based evaluation.
Two-stage retrieval separates the search problem into recall (finding candidates) and precision (selecting the best ones). Stage 1 uses a bi-encoder (e.g., all-MiniLM-L6-v2) that independently encodes queries and document chunks into dense vectors, then retrieves the top-k candidates via cosine similarity. This is fast but shallow — the query and document never "see" each other during encoding. Stage 2 applies a cross-encoder (e.g., ms-marco-MiniLM-L-6-v2) that jointly processes the query concatenated with each candidate chunk, allowing token-level attention between them. This captures causal relationships and disambiguates closely related concepts that bi-encoders miss.
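The control flow of the two stages can be sketched without any model dependencies. In the sketch below, `embed` and `joint_score` are toy stand-ins (assumptions for illustration, not the real bi-encoder or cross-encoder): `embed` builds a bag-of-words vector for each text independently, and `joint_score` looks at the query and chunk together and rewards query word pairs that appear in the same order in the chunk.

```python
import math
from collections import Counter


def embed(text: str) -> Counter:
    """Toy stand-in for a bi-encoder: a bag-of-words count vector."""
    return Counter(text.lower().split())


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a)
    norm = math.sqrt(sum(v * v for v in a.values())) * math.sqrt(
        sum(v * v for v in b.values())
    )
    return dot / norm if norm else 0.0


def joint_score(query: str, chunk: str) -> float:
    """Toy stand-in for a cross-encoder: scores the (query, chunk) pair jointly
    by counting query bigrams that occur in the same order in the chunk."""
    q = query.lower().split()
    c = chunk.lower().split()
    chunk_bigrams = set(zip(c, c[1:]))
    return float(sum(1 for bigram in zip(q, q[1:]) if bigram in chunk_bigrams))


def two_stage(query: str, chunks: list[str], k_initial: int = 10, k_final: int = 3) -> list[str]:
    # Stage 1 (recall): chunk vectors are built without ever seeing the query.
    chunk_vecs = [embed(c) for c in chunks]
    q_vec = embed(query)
    by_cosine = sorted(
        range(len(chunks)), key=lambda i: cosine(q_vec, chunk_vecs[i]), reverse=True
    )
    candidates = by_cosine[:k_initial]
    # Stage 2 (precision): every surviving candidate is scored jointly with the query.
    reranked = sorted(
        candidates, key=lambda i: joint_score(query, chunks[i]), reverse=True
    )
    return [chunks[i] for i in reranked[:k_final]]
```

With the query "sick leave policy", the chunks "leave sick policy covers staff" and "sick leave policy for staff" tie on cosine similarity because word order is invisible to the bag-of-words vectors; only the joint scorer separates them. A real cross-encoder does this with token-level attention rather than bigram counts.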
Chunking strategy directly impacts retrieval quality. Recursive character-based splitting divides text at natural boundaries (paragraphs, sentences) using a fixed character budget. Token-based semantic splitting segments by semantic coherence using sentence embeddings to detect topic shifts. The empirical finding: neither strategy is universally superior — recursive character splitting preserves document structure better for hierarchical policy documents, while semantic splitting handles heterogeneous corpora with mixed content types. The critical bottleneck is that naive chunking fragments multi-step reasoning chains; structure-aware chunking that respects section boundaries is essential for complex queries.
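To make "splitting at natural boundaries under a fixed character budget" concrete, here is a simplified sketch of the recursive idea. It is an illustration under stated assumptions, not LangChain's implementation: the function name `recursive_split` is hypothetical, there is no overlap, and separators are dropped at chunk boundaries instead of being kept.

```python
def recursive_split(
    text: str,
    chunk_size: int,
    separators: tuple[str, ...] = ("\n\n", "\n", ". ", " "),
) -> list[str]:
    """Split text at the coarsest separator that keeps chunks within chunk_size."""
    if len(text) <= chunk_size:
        return [text] if text.strip() else []
    if not separators:
        # No natural boundary left: fall back to a hard cut.
        return [text[i:i + chunk_size] for i in range(0, len(text), chunk_size)]

    sep, finer = separators[0], separators[1:]
    pieces = text.split(sep)
    if len(pieces) == 1:
        # This separator does not occur; try the next, finer one.
        return recursive_split(text, chunk_size, finer)

    chunks: list[str] = []
    current = ""
    for piece in pieces:
        candidate = piece if not current else current + sep + piece
        if len(candidate) <= chunk_size:
            current = candidate  # keep merging small pieces
            continue
        if current:
            chunks.append(current)
        if len(piece) <= chunk_size:
            current = piece
        else:
            # The piece alone is too large: recurse with finer separators.
            chunks.extend(recursive_split(piece, chunk_size, finer))
            current = ""
    if current.strip():
        chunks.append(current)
    return chunks
```

Paragraph breaks are tried first, so a section that fits the budget stays whole; only oversized pieces are broken at sentence or word boundaries. This is why the strategy tends to preserve the structure of hierarchical documents.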
Quantitative benchmarks from the study: Vanilla LLM achieved 0.35 faithfulness / 0.45 relevance. Basic RAG (bi-encoder only) achieved 0.62 / 0.70. Advanced RAG (bi-encoder + cross-encoder re-ranking) achieved 0.80 / 0.80. The cross-encoder recovered catastrophic failures — queries where Basic RAG scored 0.00 faithfulness were rescued to 0.80 by re-ranking.
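The relative improvements quoted in this skill follow directly from these scores. A short check of the arithmetic (the dictionary layout and the `relative_gain` helper are illustrative, only the numbers come from the study):

```python
# Faithfulness / relevance scores as reported above.
scores = {
    "vanilla_llm": {"faithfulness": 0.35, "relevance": 0.45},
    "basic_rag": {"faithfulness": 0.62, "relevance": 0.70},
    "advanced_rag": {"faithfulness": 0.80, "relevance": 0.80},
}


def relative_gain(new: float, old: float) -> float:
    """Relative improvement of new over old, in percent."""
    return (new - old) / old * 100


advanced = scores["advanced_rag"]["faithfulness"]
gain_vs_vanilla = relative_gain(advanced, scores["vanilla_llm"]["faithfulness"])
gain_vs_basic = relative_gain(advanced, scores["basic_rag"]["faithfulness"])

print(f"vs vanilla LLM: {gain_vs_vanilla:.0f}%")  # 129%
print(f"vs basic RAG:   {gain_vs_basic:.0f}%")    # 29%
```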
Analyze the document corpus. Inspect document structure (headings, sections, tables, lists), average document length, and content homogeneity. Determine if documents are hierarchical (policy frameworks) or flat (FAQ pages). This dictates chunking strategy.
Select and configure the chunking strategy.
Build the embedding index. Encode all chunks using a bi-encoder model (all-MiniLM-L6-v2 for lightweight deployments, bge-large-en-v1.5 or e5-large-v2 for higher quality). Store in a vector database (FAISS for local, Chroma/Qdrant for persistent, Pinecone for managed). Include chunk metadata: source document, section title, chunk position index.
Implement Stage 1: Bi-encoder retrieval. Given a user query, encode it with the same bi-encoder, retrieve top-k=10 candidates by cosine similarity. This casts a wide net with high recall.
Implement Stage 2: Cross-encoder re-ranking. Pass each of the 10 candidates through a cross-encoder (cross-encoder/ms-marco-MiniLM-L-6-v2) that takes (query, chunk_text) as input and outputs a relevance score. Sort by score, select top-3 chunks as final context.
Construct the generation prompt. Assemble the top-3 re-ranked chunks into a context block with clear delimiters. Instruct the LLM to answer solely based on the provided context, cite sources, and state when information is insufficient rather than speculating.
Generate and post-process the answer. Pass the prompt to the LLM. Strip any content not grounded in the retrieved chunks. If the answer references information not in the context, flag it.
Evaluate with RAGAS. Run the pipeline against a test set of question-answer pairs. Measure faithfulness (is the answer supported by retrieved context?) and relevance (does the answer address the query?). Target: faithfulness >= 0.75, relevance >= 0.75.
Iterate on failure modes. If faithfulness is low, the re-ranker is passing irrelevant chunks — tighten the cross-encoder threshold or increase top-k for more candidates. If relevance is low, the chunks are fragmenting key information — increase chunk overlap or switch chunking strategy.
Add structure-aware enhancements for multi-step queries. For queries requiring reasoning across multiple document sections, implement parent-child chunk retrieval: retrieve the matching chunk plus its surrounding context (previous and next chunks from the same section).
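The neighbor expansion in the last step can be sketched as a lookup over chunk metadata. The schema here is an assumption for illustration: each chunk is a dict with `section`, `chunk_index`, and `text` keys, mirroring the metadata recommended when building the index, and `expand_with_neighbors` is a hypothetical helper name.

```python
def expand_with_neighbors(
    hits: list[dict], all_chunks: list[dict], window: int = 1
) -> list[dict]:
    """Return each hit plus up to `window` chunks before and after it,
    restricted to the same section and without duplicates."""
    by_position = {(c["section"], c["chunk_index"]): c for c in all_chunks}
    expanded: list[dict] = []
    seen: set[tuple[str, int]] = set()
    for hit in hits:
        for offset in range(-window, window + 1):
            key = (hit["section"], hit["chunk_index"] + offset)
            # Keys outside the section simply do not exist, so expansion
            # never crosses a section boundary.
            if key in by_position and key not in seen:
                seen.add(key)
                expanded.append(by_position[key])
    return expanded
```

Expanding after re-ranking (on the final top-3) keeps the cross-encoder workload unchanged while giving the LLM the surrounding steps of a multi-step passage. Note that the expanded context is up to `2 * window + 1` times larger, so the prompt budget needs to account for it.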
Example 1: Building a policy document QA system
User: "I have a folder of CDC policy PDFs. Build me a RAG pipeline that can answer questions about them accurately."
Approach:
[...] pymupdf or unstructured, preserving section headers and hierarchy
[...] all-MiniLM-L6-v2, store in FAISS index
Output (Python with LangChain):
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_community.embeddings import HuggingFaceEmbeddings
from langchain_community.vectorstores import FAISS
from sentence_transformers import CrossEncoder

# Assumed to be defined elsewhere: `documents` (parsed LangChain Document objects
# whose metadata includes a "section" key) and `llm` (any LangChain LLM or chat model).

# Step 1: Chunk documents
splitter = RecursiveCharacterTextSplitter(
    chunk_size=512,
    chunk_overlap=50,
    separators=["\n\n", "\n", ". ", " "],
    length_function=len,
)
chunks = []
for doc in documents:
    splits = splitter.split_text(doc.page_content)
    for i, split in enumerate(splits):
        chunks.append({
            # Prefix each chunk with its section title so the context stays citable
            "text": f"[{doc.metadata['section']}] {split}",
            "metadata": {**doc.metadata, "chunk_index": i},
        })

# Step 2: Build vector index
embedder = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")
vectorstore = FAISS.from_texts(
    [c["text"] for c in chunks],
    embedder,
    metadatas=[c["metadata"] for c in chunks],
)

# Step 3: Two-stage retrieval
cross_encoder = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def retrieve(query: str, top_k_initial: int = 10, top_k_final: int = 3):
    # Stage 1: Bi-encoder retrieval
    candidates = vectorstore.similarity_search(query, k=top_k_initial)
    # Stage 2: Cross-encoder re-ranking
    pairs = [(query, doc.page_content) for doc in candidates]
    scores = cross_encoder.predict(pairs)
    ranked = sorted(zip(candidates, scores), key=lambda x: x[1], reverse=True)
    return [doc for doc, score in ranked[:top_k_final]]

# Step 4: Generate with grounding prompt
def answer(query: str):
    context_docs = retrieve(query)
    context = "\n---\n".join([doc.page_content for doc in context_docs])
    prompt = f"""Answer the question based ONLY on the context below.
If the context doesn't contain enough information, say "Insufficient information."
Cite the relevant section for each claim.
Context:
{context}
Question: {query}
Answer:"""
    return llm.invoke(prompt)
Example 2: Adding re-ranking to an existing RAG pipeline
User: "My RAG system retrieves ok results but the LLM still hallucinates. How do I add cross-encoder re-ranking?"
Approach:
Output (minimal integration):
from sentence_transformers import CrossEncoder

reranker = CrossEncoder("cross-encoder/ms-marco-MiniLM-L-6-v2")

def rerank(query: str, documents: list, top_k: int = 3) -> list:
    """Re-rank retrieved documents using cross-encoder."""
    pairs = [(query, doc.page_content) for doc in documents]
    scores = reranker.predict(pairs)  # NumPy array, one relevance score per pair
    ranked_indices = scores.argsort()[::-1][:top_k]
    return [documents[i] for i in ranked_indices]

# Integration into existing pipeline:
# Before: docs = vectorstore.similarity_search(query, k=3)
# After:
candidates = vectorstore.similarity_search(query, k=10)  # wider recall
docs = rerank(query, candidates, top_k=3)  # precise selection
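If re-ranking alone still lets weak chunks through, the selection step can also drop candidates below a minimum score, which is one way to "tighten the cross-encoder threshold". The sketch below works on scores that have already been computed, so it runs without a model; `select_by_score` and `min_score` are hypothetical names. Cross-encoder scores may be unnormalized, so any threshold would need to be calibrated on your own test set rather than copied from here.

```python
from typing import Optional, Sequence


def select_by_score(
    documents: Sequence,
    scores: Sequence[float],
    top_k: int = 3,
    min_score: Optional[float] = None,
) -> list:
    """Keep the top_k highest-scoring documents, optionally discarding
    any whose score falls below min_score."""
    ranked = sorted(zip(documents, scores), key=lambda pair: pair[1], reverse=True)
    if min_score is not None:
        ranked = [(doc, score) for doc, score in ranked if score >= min_score]
    return [doc for doc, _ in ranked[:top_k]]
```

With a threshold in place the function may return fewer than `top_k` documents, or none at all. An empty result is a useful signal: the generation step can answer "Insufficient information" instead of grounding on marginal context.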
Example 3: Evaluating chunking strategies
User: "Should I use recursive character splitting or semantic splitting for my legal documents?"
Approach:
from langchain.text_splitter import RecursiveCharacterTextSplitter
from langchain_experimental.text_splitter import SemanticChunker
from langchain_community.embeddings import HuggingFaceEmbeddings
from ragas import evaluate
from ragas.metrics import faithfulness, answer_relevancy

embedder = HuggingFaceEmbeddings(model_name="all-MiniLM-L6-v2")

# Strategy A: Recursive character splitting
recursive_splitter = RecursiveCharacterTextSplitter(
    chunk_size=512, chunk_overlap=50
)

# Strategy B: Semantic splitting
semantic_splitter = SemanticChunker(
    embedder, breakpoint_threshold_type="percentile",
    breakpoint_threshold_amount=75,
)

# Run evaluation for each strategy.
# Assumed to be defined elsewhere: documents, reranker, llm, test_dataset, and the
# hypothetical helpers build_index, build_rag_pipeline, and run_test_set.
for name, splitter in [("recursive", recursive_splitter), ("semantic", semantic_splitter)]:
    chunks = splitter.split_documents(documents)
    vectorstore = build_index(chunks, embedder)
    pipeline = build_rag_pipeline(vectorstore, reranker, llm)
    # RAGAS scores generated answers against their retrieved contexts, so the
    # pipeline has to be run over the test questions first. run_test_set is
    # assumed to return a dataset with question, answer, and contexts columns.
    eval_dataset = run_test_set(pipeline, test_dataset)
    results = evaluate(eval_dataset, metrics=[faithfulness, answer_relevancy])
    # How aggregate scores are read from the result object varies across RAGAS versions.
    print(f"{name}: faithfulness={results['faithfulness']:.3f}, "
          f"relevance={results['answer_relevancy']:.3f}")
Decision guide:
[...] unstructured that detect these elements.
[...] ms-marco-MiniLM-L-6-v2 on GPU. Batching helps.
[...] ms-marco-MiniLM-L-6-v2 cross-encoder is trained on web search data. For highly specialized domains (medical, legal), fine-tuning on domain QA pairs significantly improves re-ranking quality.
Maharjan, A., & Yadav, U. (2026). Chunking, Retrieval, and Re-ranking: An Empirical Evaluation of RAG Architectures for Policy Document Question Answering. arXiv:2601.15457v1. https://arxiv.org/abs/2601.15457v1
Key takeaway: Two-stage retrieval (bi-encoder recall + cross-encoder precision) with top-10-to-top-3 filtering achieves 0.80 faithfulness on domain-specific QA — a 29% improvement over single-stage RAG and 129% over vanilla LLM baselines.