skills/compactrag-reducing-calls-token/SKILL.md
Build multi-hop RAG systems that answer complex questions with only 2 LLM calls total, regardless of reasoning depth. Applies CompactRAG's offline atomic QA decomposition and online entity-consistent retrieval to slash token costs by 2-5x vs iterative RAG.
Trigger phrases:
- "build a multi-hop RAG pipeline"
- "reduce LLM calls in my RAG system"
- "answer complex questions over a knowledge base efficiently"
- "implement CompactRAG"
- "optimize token usage in retrieval-augmented generation"
- "build a cost-efficient question answering system"
npx skillsauth add ndpvt-web/arxiv-claude-skills compactrag-reducing-calls-token
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill teaches Claude to design and implement multi-hop retrieval-augmented generation (RAG) systems using the CompactRAG architecture. The core idea: decouple offline corpus restructuring from online reasoning so that complex multi-hop questions are answered with exactly two LLM calls — one for sub-question decomposition and one for final answer synthesis — regardless of how many reasoning hops are needed. Intermediate hops use lightweight models (RoBERTa for extraction, Flan-T5 for rewriting) and dense retrieval, cutting token consumption from 5-10K tokens/query (typical iterative RAG) down to ~1.9K tokens/query.
The Problem with Iterative RAG: Standard multi-hop RAG systems (IRCoT, Self-Ask, Iter-RetGen) call the LLM at every hop — retrieve context, reason, generate a follow-up query, retrieve again, reason again. For a 3-hop question, that's 3+ LLM calls, each consuming thousands of tokens. Worse, entity references degrade across hops: "she" in hop 3 may no longer clearly refer to the person identified in hop 1.
CompactRAG's Solution — Two-Phase Decoupling:
Offline Phase: An LLM reads each document once and converts it into atomic QA pairs — minimal, fine-grained question-answer units where each pair encodes exactly one fact. Entities are pre-annotated with SpaCy and enforced in generation prompts to ensure consistent naming. The Q and A are concatenated ([q;a]) and encoded with Contriever into a dense retrieval index. This is a one-time cost amortized over all future queries.
Online Phase: A complex query triggers exactly two LLM calls. Call 1 decomposes the query into sub-questions organized in a dependency graph (e.g., q1 must be answered before q2 can be resolved). Each sub-question is then processed without the LLM: a Flan-T5-small rewriter grounds entity references from prior answers into the current sub-question, Contriever retrieves top-k atomic QA pairs, and RoBERTa-base extracts the answer span. After all sub-questions resolve, Call 2 synthesizes the final answer from the collected sub-answers. The result: competitive accuracy with ~80% fewer tokens than IRCoT.
Annotate entities in your corpus. Run SpaCy NER over every document to tag named entities (people, organizations, locations, dates). Store annotations alongside the source text. These entities will be enforced during QA pair generation to prevent naming inconsistencies.
Generate atomic QA pairs from each document. Prompt an LLM (GPT-4 at temperature 0, or a strong open model) to decompose each document into atomic question-answer pairs. Each pair must encode a single fact at minimal granularity. Enforce that answers use the exact entity names from the SpaCy annotations. Example prompt structure:
Given the following document with annotated entities [ENTITIES],
generate atomic question-answer pairs where:
- Each pair encodes exactly ONE factual statement
- Answers use the exact entity names provided
- Questions are self-contained (no pronouns or implicit references)
- Pairs are non-overlapping in information content
Document: [TEXT]
Build the dense retrieval index. Concatenate each QA pair into a single text segment [question; answer] and encode it using Contriever (or another unsupervised contrastive dense retriever). Store the embeddings in a vector index (FAISS, Qdrant, or similar). This concatenation maximizes semantic coherence during retrieval.
Train or fine-tune the sub-question rewriter. Using Flan-T5-small, train a rewriter that takes an ambiguous sub-question and a grounding entity, then outputs a rewritten question with the entity explicitly inserted. Training data consists of triples: (ambiguous_question, grounding_entity, rewritten_question). Apply perturbations like entity masking for robustness.
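One way to assemble the (ambiguous_question, grounding_entity, rewritten_question) triples is to start from an explicit question and derive the ambiguous variants from it. The sketch below is a minimal illustration under stated assumptions: the `RewriteExample` class, the `make_rewrite_examples` helper, the vague phrase, and the `<ent>` mask token are all hypothetical choices, not details taken from the paper.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class RewriteExample:
    ambiguous_question: str
    grounding_entity: str
    rewritten_question: str


def make_rewrite_examples(
    explicit_question: str,
    explicit_mention: str,
    entity: str,
    vague_phrase: str,
    mask_token: str = "<ent>",  # assumed placeholder, not from the paper
) -> list[RewriteExample]:
    """Derive two training triples from one fully explicit question."""
    if explicit_mention not in explicit_question or entity not in explicit_mention:
        raise ValueError("mention must occur in the question and contain the entity")
    # Base triple: replace the explicit mention with a vague reference.
    ambiguous = explicit_question.replace(explicit_mention, vague_phrase, 1)
    # Perturbed triple: mask only the entity name inside the question.
    masked = explicit_question.replace(entity, mask_token, 1)
    return [
        RewriteExample(ambiguous, entity, explicit_question),
        RewriteExample(masked, entity, explicit_question),
    ]


examples = make_rewrite_examples(
    explicit_question="Who directed the film Inception?",
    explicit_mention="the film Inception",
    entity="Inception",
    vague_phrase="that film",
)
```

Both triples share the same target, so the rewriter sees the vague and the masked form of a question mapped to one explicit rewrite.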
Train the answer extractor. Fine-tune RoBERTa-base as a span-prediction model over the atomic QA pairs. Training samples include correct QA pairs with marked answer spans and distractor pairs (retrieved but irrelevant) to handle retrieval noise. Use start/end span prediction loss.
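A span-prediction sample needs the answer's position inside a context that mixes the correct pair with distractors. The following sketch shows one possible layout, assuming each pair is rendered as "question answer" and pairs are joined by newlines; `SpanSample` and `build_span_sample` are hypothetical names, and the character offsets would still have to be mapped to token positions by the tokenizer before computing a start/end loss.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SpanSample:
    question: str
    context: str
    answer_start: int  # character offset, inclusive
    answer_end: int    # character offset, exclusive


def build_span_sample(
    question: str,
    gold_pair: dict,
    distractors: list[dict],
    gold_position: int = 0,
) -> SpanSample:
    """Place the gold QA pair among distractors and mark the answer span."""
    if not 0 <= gold_position <= len(distractors):
        raise ValueError("gold_position is out of range")
    pairs = list(distractors)
    pairs.insert(gold_position, gold_pair)
    segments = [f"{p['q']} {p['a']}" for p in pairs]
    # Each earlier segment contributes its length plus one newline separator.
    offset = sum(len(s) + 1 for s in segments[:gold_position])
    # Skip the gold question and the single space before the answer.
    start = offset + len(gold_pair["q"]) + 1
    end = start + len(gold_pair["a"])
    return SpanSample(question, "\n".join(segments), start, end)


sample = build_span_sample(
    question="Who directed the film Inception?",
    gold_pair={"q": "Who directed the film Inception?", "a": "Christopher Nolan"},
    distractors=[{"q": "When was the film Inception released?", "a": "2010"}],
    gold_position=1,
)
```

Varying `gold_position` across samples keeps the extractor from learning that the answer always sits in the first retrieved pair.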
Decompose the complex query (LLM Call 1). Send the user's multi-hop question to the LLM with a prompt that outputs a sequence of sub-questions {q1, q2, ..., qn} and a dependency graph where edges qi -> qj indicate that qi's answer is needed to resolve qj. Process sub-questions in topological order of the dependency graph.
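Processing in topological order can be sketched with the standard library's `graphlib` (Python 3.9+). The dictionary layout here is an assumption: each sub-question id maps to the set of ids whose answers it needs, and `execution_order` is a hypothetical helper name.

```python
from graphlib import CycleError, TopologicalSorter


def execution_order(dependencies: dict[str, set[str]]) -> list[str]:
    """Return sub-question ids so that every parent precedes its dependents."""
    try:
        return list(TopologicalSorter(dependencies).static_order())
    except CycleError as exc:
        # A cyclic decomposition cannot be resolved hop by hop.
        raise ValueError(f"cyclic sub-question dependencies: {exc.args[1]}") from exc


# "What university did the director of Inception attend?"
# q1: Who directed Inception?   q2: What university did <answer of q1> attend?
order = execution_order({"q1": set(), "q2": {"q1"}})
```

Rejecting cycles up front is a design choice for this sketch; a production system might instead ask the LLM to re-decompose the query.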
Rewrite each sub-question for entity consistency. For each sub-question (after the first), pass it through the Flan-T5 rewriter along with the answer entity from its parent sub-question. This replaces pronouns and vague references with explicit entity names. Example: "Who directed that film?" + entity="Inception" becomes "Who directed the film Inception?"
Retrieve and extract answers for each sub-question. Encode the rewritten sub-question with Contriever, retrieve top-5 atomic QA pairs by embedding similarity, and run RoBERTa-base span extraction over the retrieved pairs to identify the answer. No LLM call needed.
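The retrieval step reduces to ranking stored [q;a] embeddings by similarity to the query embedding. The sketch below uses plain Python and inner-product scoring to show the ranking logic only; the vectors are toy values, `retrieve_top_k` is a hypothetical helper, and a real deployment would delegate this search to the vector index (FAISS, Qdrant, or similar).

```python
import heapq


def dot(a: list[float], b: list[float]) -> float:
    """Inner product of two equal-length vectors."""
    return sum(x * y for x, y in zip(a, b))


def retrieve_top_k(
    query_vec: list[float],
    pair_vecs: list[list[float]],
    k: int = 5,
) -> list[tuple[int, float]]:
    """Return (index, score) for the k highest-scoring QA-pair embeddings."""
    scores = [dot(query_vec, vec) for vec in pair_vecs]
    best = heapq.nlargest(k, range(len(scores)), key=scores.__getitem__)
    return [(i, scores[i]) for i in best]


# Toy 2-dimensional embeddings standing in for Contriever outputs.
hits = retrieve_top_k([1.0, 0.0], [[0.9, 0.1], [0.1, 0.9], [0.5, 0.5]], k=2)
```

The returned indices point back into the list of atomic QA pairs, which is what the span extractor reads next.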
Synthesize the final answer (LLM Call 2). After all sub-questions are resolved, send the original query along with all sub-question/answer/evidence triples {Q, {qi, ai, Pi}} to the LLM for holistic reasoning and final answer generation.
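The input to the second LLM call can be assembled from the resolved hops. This is a minimal sketch: the `HopResult` container, the `build_synthesis_prompt` helper, and the prompt wording are illustrative assumptions, not the prompt used in the paper.

```python
from dataclasses import dataclass, field


@dataclass
class HopResult:
    sub_question: str
    answer: str
    evidence: list[str] = field(default_factory=list)  # retrieved atomic QA pairs


def build_synthesis_prompt(query: str, hops: list[HopResult]) -> str:
    """Render the original query plus every (q_i, a_i, P_i) triple as one prompt."""
    lines = [
        "Answer the question using only the resolved sub-questions and evidence below.",
        "",
        f"Question: {query}",
        "",
    ]
    for n, hop in enumerate(hops, start=1):
        lines.append(f"Sub-question {n}: {hop.sub_question}")
        lines.append(f"Answer {n}: {hop.answer}")
        lines.extend(f"  Evidence: {item}" for item in hop.evidence)
    lines += ["", "Final answer:"]
    return "\n".join(lines)


prompt = build_synthesis_prompt(
    "What university did the director of Inception attend?",
    [HopResult("Who directed Inception?", "Christopher Nolan",
               ["Who directed the film Inception? Christopher Nolan"])],
)
```

Keeping the evidence next to each sub-answer also gives the caller the material needed to report provenance.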
Return the answer with provenance. Include the atomic QA pairs used as evidence so the user can trace each reasoning step back to source documents.
Example 1: Building a CompactRAG pipeline for a Wikipedia corpus
User: "I have a corpus of 50K Wikipedia articles. Build me a multi-hop QA system that can answer questions like 'What university did the director of Inception attend?' without making an LLM call for every hop."
Approach:
Encode the [q;a] pairs with Contriever into a FAISS index
Output: 2 LLM calls, ~1.9K tokens total (vs. ~10K for IRCoT on equivalent queries)
Example 2: Optimizing an existing iterative RAG system
User: "My current RAG pipeline calls GPT-4 at every hop and costs $0.15 per query. Can I reduce this?"
Approach:
Output architecture:
Query -> [LLM: decompose] -> sub-questions with dependency graph
For each sub-question (in dependency order):
-> [Flan-T5: rewrite with entity grounding]
-> [Contriever: retrieve top-5 atomic QA pairs]
-> [RoBERTa: extract answer span]
Collected answers -> [LLM: synthesize final answer]
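The architecture above can be sketched as a small orchestration function that takes each component as a callable. Everything here is an assumption made for illustration: the `run_compactrag` name, the callable signatures, and the stub components, which stand in for the real LLM, Flan-T5 rewriter, Contriever index, and RoBERTa extractor.

```python
from graphlib import TopologicalSorter


def run_compactrag(query, decompose, rewrite, retrieve, extract, synthesize, top_k=5):
    """Answer `query` with two LLM-backed calls: decompose and synthesize."""
    sub_questions, dependencies = decompose(query)  # LLM call 1
    # Include sub-questions with no dependencies so they are scheduled too.
    graph = {qid: dependencies.get(qid, set()) for qid in sub_questions}
    answers, trace = {}, []
    for qid in TopologicalSorter(graph).static_order():
        question = sub_questions[qid]
        # Ground vague references using each parent's answer (no LLM involved).
        for parent in sorted(graph[qid]):
            question = rewrite(question, answers[parent])
        evidence = retrieve(question, top_k)        # dense retrieval
        answers[qid] = extract(question, evidence)  # span extraction
        trace.append((question, answers[qid], evidence))
    return synthesize(query, trace), trace          # LLM call 2


# Stub components over a two-entry knowledge base, for illustration only.
kb = {
    "Who directed Inception?": "Christopher Nolan",
    "What university did Christopher Nolan attend?": "University College London",
}
llm_calls = []


def decompose(query):
    llm_calls.append("decompose")
    subs = {"q1": "Who directed Inception?",
            "q2": "What university did that director attend?"}
    return subs, {"q2": {"q1"}}


def synthesize(query, trace):
    llm_calls.append("synthesize")
    return trace[-1][1]


final, trace = run_compactrag(
    "What university did the director of Inception attend?",
    decompose,
    rewrite=lambda q, entity: q.replace("that director", entity),
    retrieve=lambda q, k: [f"{q} {kb[q]}"][:k],
    extract=lambda q, evidence: kb[q],
    synthesize=synthesize,
)
```

Passing the components in as callables keeps the control flow testable with stubs and makes it easy to count how many calls reach the LLM.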
Example 3: Implementing the atomic QA generation step
User: "How do I convert my documents into the atomic QA knowledge base?"
Approach:
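Load a SpaCy model (en_core_web_sm or en_core_web_trf), then generate the QA pairs and index them:

import json

import faiss
import numpy as np
import spacy
from openai import OpenAI

nlp = spacy.load("en_core_web_trf")
client = OpenAI()

def parse_qa_pairs(raw: str) -> list[dict]:
    # Assumes the model returned a bare JSON array, as the prompt requests.
    return json.loads(raw)

def generate_atomic_qa(document: str) -> list[dict]:
    # Collect unique entity strings so the prompt can enforce exact names.
    doc = nlp(document)
    entities = sorted({ent.text for ent in doc.ents})
    response = client.chat.completions.create(
        model="gpt-4",
        temperature=0,
        messages=[{
            "role": "user",
            "content": f"""Convert this document into atomic QA pairs.
Rules:
- Each pair encodes exactly ONE fact
- Use these exact entity names: {entities}
- Questions must be self-contained (no pronouns)
- Output as JSON array of {{"q": "...", "a": "..."}}
Document: {document}"""
        }]
    )
    return parse_qa_pairs(response.choices[0].message.content)

def build_retrieval_index(qa_pairs: list[dict], retriever):
    # `retriever` is assumed to expose encode(list[str]) -> 2-D array of embeddings.
    texts = [f"{pair['q']} {pair['a']}" for pair in qa_pairs]
    embeddings = np.asarray(retriever.encode(texts), dtype="float32")
    # Inner-product FAISS index sized to the embedding dimension.
    index = faiss.IndexFlatIP(embeddings.shape[1])
    index.add(embeddings)
    return index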
Concatenate [q;a] before encoding for retrieval. Encoding the question alone loses answer-side semantic signal, degrading retrieval quality.
Paper: CompactRAG: Reducing LLM Calls and Token Overhead in Multi-Hop Question Answering (Yang et al., 2026)
Key takeaway: Look at Section 3 for the two-phase architecture, Section 3.3 for the entity-consistent rewriting formulation, and Table 1 for the token consumption comparison showing CompactRAG at 1.9K tokens/sample vs. IRCoT at 10.2K tokens/sample while maintaining competitive F1 scores.
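As a quick check on the figures quoted above, the relative saving can be computed directly from the two per-sample token counts. The helper name `token_savings` is hypothetical, and the inputs are the 10.2K and 1.9K values cited from Table 1.

```python
def token_savings(baseline_tokens: float, compact_tokens: float) -> tuple[float, float]:
    """Return (fractional reduction, baseline-to-compact ratio)."""
    reduction = 1 - compact_tokens / baseline_tokens
    ratio = baseline_tokens / compact_tokens
    return reduction, ratio


# IRCoT at 10.2K tokens/sample vs. CompactRAG at 1.9K tokens/sample.
reduction, ratio = token_savings(10_200, 1_900)
print(f"{reduction:.1%} fewer tokens, {ratio:.2f}x ratio")
```

On these two numbers the reduction works out to roughly 81%, in line with the "~80% fewer tokens" figure given earlier.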