skills/comprehensive-comparison-rag-methods/SKILL.md
Select and configure the right RAG strategy for conversational QA systems based on dataset characteristics. Use when: 'build a conversational RAG pipeline', 'choose a RAG method for multi-turn QA', 'my RAG pipeline performs worse than no retrieval', 'optimize retrieval for dialogue systems', 'compare RAG strategies for my dataset', 'reranking vs HyDE vs hybrid BM25'.
npx skillsauth add ndpvt-web/arxiv-claude-skills comprehensive-comparison-rag-methodsInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill enables Claude to recommend, implement, and debug RAG (Retrieval-Augmented Generation) strategies for multi-turn conversational question answering systems. Based on a systematic comparison of 10 RAG methods across 8 diverse datasets (Alushi et al., EACL SRW 2026), this skill encodes the empirical finding that effective conversational RAG depends on alignment between retrieval strategy and dataset structure -- not method complexity. It provides a concrete decision framework for selecting between vanilla RAG, reranking, hybrid BM25, HyDE, query rewriting, and summarization approaches, and identifies the failure modes that cause advanced methods to underperform a no-retrieval baseline.
The core insight is that RAG method selection should be driven by three dataset characteristics: (1) the context-to-question ratio (how many candidate passages exist per query), (2) the dialogue pattern (stable topic vs. topic-switching), and (3) the answer format (extractive vs. abstractive/informal). These three axes predict which retrieval strategy will succeed far better than any universal ranking of methods.
Three methods consistently outperform vanilla RAG across diverse settings: Reranking (cross-encoder rescoring of top-k candidates), Hybrid BM25 (combining sparse lexical and dense semantic retrieval), and HyDE (generating a hypothetical answer, then using it as the retrieval query). Reranking adds minimal latency and is the safest default upgrade. Hybrid BM25 handles vocabulary mismatch between conversational queries and formal documents. HyDE excels when queries are short or ambiguous, as the hypothetical answer expands the query's semantic footprint -- it tripled MRR on one dataset (8.0 to 25.2).
Critically, several "advanced" methods reliably hurt performance. Query rewriting degrades results on datasets with topic switching or large corpora (INSCIT, QReCC, TopiOCQA) because the LLM rewrites introduce drift. Summarization strips crucial contextual detail, producing the worst average F1 scores across all methods. Combining HyDE + Reranker underperforms either method individually, suggesting the pipeline introduces compounding noise. The practical takeaway: start simple, measure against a no-RAG baseline, and only add complexity when the data demands it.
Profile the dataset characteristics. Count the total contexts, compute the context-to-question ratio, and classify the dialogue pattern (stable topic, gradual drift, or hard topic switches). A ratio above 50:1 signals a large-corpus retrieval challenge where vanilla methods will struggle.
Establish baselines. Implement both a No-RAG baseline (LLM answers from parametric knowledge only) and a Vanilla RAG baseline (top-k dense retrieval with a sentence-transformer encoder). Measure F1 and MRR@5. If No-RAG already scores well, your dataset may not benefit from retrieval at all.
Select the primary retrieval strategy using the decision matrix:
Handle conversation history. Serialize the last N turns (typically 3-5) as context prepended to the current query. Do NOT summarize history -- the paper shows summarization degrades retrieval. For coreference resolution, include the raw dialogue turns rather than rewriting them, unless you have verified that query rewriting improves your specific dataset.
Implement the chosen retrieval method. Use a standard stack: sentence-transformers or OpenAI embeddings for dense retrieval, rank-bm25 or Elasticsearch for sparse retrieval, and a cross-encoder model (e.g., cross-encoder/ms-marco-MiniLM-L-6-v2) for reranking. Retrieve top-20 candidates, then rerank/filter to top-5.
Configure the generation prompt. Pass retrieved contexts as numbered references in the system prompt. Include the conversation history. Instruct the LLM to answer based on the provided contexts and to say when information is insufficient rather than hallucinating.
Evaluate with turn-aware metrics. Measure F1 (token overlap), MRR@5 (retrieval ranking quality), and Recall@5 (coverage). Plot these metrics across conversation turns (turn 1, 2, 3, ..., 8+). Watch for degradation patterns: stable-then-declining curves indicate history is becoming noise after a threshold.
Diagnose retrieval-generation misalignment. If MRR is high but F1 is low, the retrieval works but the LLM ignores or misuses the context (common with informal/opinionated content). If MRR is low but F1 is reasonable, the LLM is answering from parametric knowledge -- retrieval is not contributing.
Iterate on the strategy. If the chosen method underperforms No-RAG on any metric, do not add more complexity. Instead: (a) check chunking strategy and chunk overlap, (b) verify embedding model domain alignment, (c) try grouping contexts by metadata (titles, sections) before indexing, (d) reduce conversation history window.
Document the final configuration. Record the chosen method, hyperparameters (top-k, reranker threshold, BM25 weight, history window), and per-turn performance metrics so the decision can be revisited as the dataset evolves.
Example 1: Building a customer support chatbot with a knowledge base
User: "I'm building a conversational QA system for our customer support docs. We have about 1,200 help articles and users typically ask 3-5 follow-up questions. Which RAG approach should I use?"
Approach:
all-MiniLM-L6-v2)cross-encoder/ms-marco-MiniLM-L-6-v2, keep top-5Output configuration:
rag_config = {
"retrieval": "dense + cross-encoder-reranker",
"embedding_model": "all-MiniLM-L6-v2",
"reranker_model": "cross-encoder/ms-marco-MiniLM-L-6-v2",
"chunk_size": 300,
"chunk_overlap": 50,
"top_k_retrieve": 20,
"top_k_rerank": 5,
"history_window": 3,
"history_mode": "raw_turns", # not summarized
}
Example 2: Diagnosing a RAG pipeline that performs worse than no retrieval
User: "My RAG system answers worse than just asking GPT directly. I'm using query rewriting to handle follow-up questions in a multi-turn setup over a large Wikipedia corpus."
Approach:
Diagnostic steps:
1. Measure MRR@5 for current query-rewriting pipeline
2. Measure MRR@5 for vanilla RAG (no rewriting) as sanity check
3. Implement hybrid BM25 (alpha=0.5 sparse/dense blend)
4. Compare F1 scores across all three at turns 1, 3, 5, 7
5. If hybrid BM25 improves MRR but F1 stays flat, check if
the LLM is ignoring retrieved context (generation issue)
Example 3: Choosing between HyDE and reranking for a research paper QA system
User: "I have a corpus of 500 research papers chunked into paragraphs. Users ask technical questions in multi-turn conversations. Should I use HyDE or reranking?"
Approach:
# Start with this
pipeline_v1 = ["dense_retrieval", "cross_encoder_rerank"]
# Only if recall is low on short queries, switch to this
pipeline_v2 = ["hyde_query_expansion", "dense_retrieval"]
# Do NOT do this -- empirically shown to degrade performance
pipeline_bad = ["hyde_query_expansion", "dense_retrieval", "cross_encoder_rerank"]
Do:
Avoid:
Retrieval returns irrelevant passages: Check embedding model domain alignment. General-purpose encoders may fail on specialized vocabulary. Fine-tune or switch to a domain-adapted model.
Performance degrades mid-conversation: The history window is too large. Reduce from N turns to N-2 and re-measure. Alternatively, use a sliding window that drops the oldest turns.
HyDE generates hallucinated hypothetical answers: The hypothetical answer does not need to be factually correct -- it only needs to be semantically similar to relevant passages. However, if the domain is highly specialized, the LLM may generate off-topic hypotheticals. Mitigate by including a brief domain description in the HyDE prompt.
High MRR but low F1 (retrieval-generation gap): The LLM is not grounding its answers in the retrieved context. Strengthen the generation prompt with explicit instructions to cite or quote from provided passages. Consider few-shot examples that demonstrate grounded answering.
BM25 component returns nothing useful: Conversational queries are often fragmentary ("What about that?"). BM25 depends on lexical overlap, which fails on pronoun-heavy queries. Ensure the hybrid weight favors dense retrieval (e.g., 0.3 BM25 / 0.7 dense) for conversational settings, shifting toward BM25 only for terminology-heavy domains.
Alushi, K., Strich, J., Biemann, C., & Semmann, M. (2026). Comprehensive Comparison of RAG Methods Across Multi-Domain Conversational QA. EACL SRW. arXiv:2602.09552 | Code
Key takeaway: Table 2 (F1 scores across all 8 datasets) and Figure 4 (per-turn performance curves) are the most actionable references for strategy selection. The context-to-question ratio analysis in Section 5 explains why methods fail on specific datasets.
development
Audit LLM-based automatic short answer grading (ASAG) systems for adversarial vulnerabilities using token-level and prompt-level attack strategies from the GradingAttack framework. Triggers: 'test grading robustness', 'adversarial attack on grading', 'audit LLM grader', 'red-team answer grading', 'ASAG vulnerability assessment', 'grading fairness attack'
development
Build structured information-seeking agents that decompose complex queries into multi-turn search-and-browse workflows, aggregate results from multiple web sources, and return answers in typed structured formats (items, sets, lists, tables). Applies the GISA benchmark's ReAct-based agent architecture and evaluation methodology. Trigger phrases: "build an information-seeking agent", "search agent pipeline", "multi-turn web research agent", "structured web search workflow", "aggregate information from multiple sources", "web research with structured output"
data-ai
Optimize LLM prompts using GFlowPO's iterative generate-evaluate-refine loop with diversity-preserving exploration and dynamic memory. Use when: 'optimize this prompt', 'find a better prompt for this task', 'prompt engineering with examples', 'auto-tune my system prompt', 'improve prompt accuracy', 'generate prompt variations'.
development
Constrain LLM generation with executable Pydantic schemas and multi-agent pipelines to produce structurally valid, domain-rich artifacts. Uses ontology-as-grammar to eliminate hallucinated structures while preserving creative output. Trigger phrases: "generate a valid game design", "schema-constrained generation", "build a multi-agent pipeline with Pydantic validation", "ontology-driven content generation", "structured creative generation with DSPy", "generate artifacts that pass domain validation".