skills/a-rag-scaling-agentic-retrieval-augmented/SKILL.md
Build agentic RAG systems where the LLM autonomously decides retrieval strategy using hierarchical interfaces (keyword search, semantic search, chunk read) instead of fixed retrieval pipelines. Replaces single-shot retrieval and predefined workflows with a ReAct-style loop that scales with model capability.

Trigger phrases:
- "build an agentic RAG system"
- "implement hierarchical retrieval"
- "make the LLM control its own retrieval"
- "scale RAG with reasoning"
- "multi-hop question answering over documents"
- "replace my fixed RAG pipeline with an agent"
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

    npx skillsauth add ndpvt-web/arxiv-claude-skills a-rag-scaling-agentic-retrieval-augmented
This skill teaches Claude to design and implement agentic RAG systems where the language model autonomously controls retrieval decisions through three hierarchical interfaces -- keyword search, semantic search, and chunk read -- rather than relying on a fixed retrieve-then-read pipeline. The core insight from A-RAG is that exposing retrieval as callable tools at different granularities and letting the model reason about which to invoke (and when) consistently outperforms both single-shot RAG and predefined workflow RAG, while using comparable or fewer tokens.
The problem with existing RAG: Traditional RAG systems either (1) retrieve passages in a single shot and concatenate them into the prompt, or (2) predefine a workflow (e.g., "first search, then rerank, then read") and prompt the model to execute it step-by-step. Neither approach lets the model participate in retrieval decisions. The model cannot say "that chunk wasn't useful, let me try a keyword search for the entity name instead."
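A-RAG's solution -- hierarchical retrieval interfaces: Instead of a monolithic retrieval step, A-RAG exposes three tools at different granularities that the LLM calls in a ReAct-style loop (reason, act, observe, repeat):
- Keyword search: exact lexical matching across chunks; returns chunk IDs with keyword snippets.
- Semantic search: dense-embedding cosine similarity; returns matching sentences with their chunk IDs.
- Chunk read: returns the full text of a chunk given its ID.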
Why this scales: Because the agent chooses its own retrieval strategy per reasoning step, stronger models make better tool-use decisions and achieve higher accuracy without changing the retrieval infrastructure. Experiments show ~8% improvement when scaling from 5 to 20 agent steps, and ~25% improvement when scaling reasoning effort from minimal to high. The dominant failure mode shifts from "couldn't find the information" (traditional RAG) to "reasoning chain errors" (A-RAG), meaning the retrieval bottleneck is largely removed.
Split documents into chunks of ~1,000 tokens each, respecting sentence boundaries. Store each chunk with a unique ID and metadata (source document, position). This is the unit of retrieval for all three interfaces.
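    # Example chunking logic.
    # split_into_sentences() and count_tokens() are placeholders: supply a
    # sentence splitter and a tokenizer-based counter of your choice.
    def chunk_document(text, max_tokens=1000):
        sentences = split_into_sentences(text)
        chunks, current_chunk, current_len = [], [], 0
        for sent in sentences:
            sent_len = count_tokens(sent)
            # Close the current chunk when this sentence would exceed the budget.
            if current_len + sent_len > max_tokens and current_chunk:
                chunks.append(" ".join(current_chunk))
                current_chunk, current_len = [sent], sent_len
            else:
                current_chunk.append(sent)
                current_len += sent_len
        # Flush the final partial chunk.
        if current_chunk:
            chunks.append(" ".join(current_chunk))
        return chunks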
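The chunking logic above calls `split_into_sentences` and `count_tokens` without defining them, and the step also asks for a unique ID and metadata per chunk. A minimal sketch of both follows. The regex splitter, the whitespace token count, the `Chunk` dataclass, and the `source:position` ID scheme are all illustrative assumptions, not part of A-RAG; a real deployment would likely use a proper sentence segmenter and the tokenizer of the target model.

```python
import re
from dataclasses import dataclass


def split_into_sentences(text: str) -> list[str]:
    """Naive splitter (assumption): break after ., ! or ? followed by whitespace."""
    parts = re.split(r"(?<=[.!?])\s+", text.strip())
    return [p for p in parts if p]


def count_tokens(text: str) -> int:
    """Whitespace word count as a rough token proxy (assumption, not a tokenizer)."""
    return len(text.split())


@dataclass
class Chunk:
    chunk_id: str   # unique ID, here "<source>:<position>" (hypothetical scheme)
    source: str     # source document name
    position: int   # index of the chunk within its document
    text: str


def build_chunks(text: str, source: str, max_tokens: int = 1000) -> list[Chunk]:
    """Greedy sentence packing, then wrap each chunk with an ID and metadata."""
    groups, current, current_len = [], [], 0
    for sent in split_into_sentences(text):
        n = count_tokens(sent)
        # Start a new chunk when this sentence would exceed the budget.
        if current and current_len + n > max_tokens:
            groups.append(current)
            current, current_len = [], 0
        current.append(sent)
        current_len += n
    if current:
        groups.append(current)
    return [
        Chunk(chunk_id=f"{source}:{i}", source=source, position=i, text=" ".join(g))
        for i, g in enumerate(groups)
    ]
```

Keeping `position` in the metadata makes it cheap to fetch neighbouring chunks later if an answer spans a chunk boundary.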
Encode every sentence in the corpus using a sentence encoder (e.g., sentence-transformers/all-MiniLM-L6-v2 or Qwen3-Embedding-0.6B). Store embeddings in a vector index (FAISS, Qdrant, or similar) with mappings from sentence back to parent chunk ID.
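To make the sentence-to-chunk mapping concrete, here is a small pure-Python sketch of a sentence-level index. It is deliberately not a real embedding setup: `toy_embed` is a bag-of-words stand-in for a sentence encoder such as the models named above, and the brute-force list scan stands in for a vector index such as FAISS or Qdrant. The class and function names are hypothetical.

```python
import math
import re
from collections import Counter


def toy_embed(text: str) -> Counter:
    """Stand-in encoder (assumption): bag-of-words counts, not a neural embedding."""
    return Counter(re.findall(r"[a-z0-9]+", text.lower()))


def cosine(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors."""
    dot = sum(a[t] * b[t] for t in a.keys() & b.keys())
    norm_a = math.sqrt(sum(v * v for v in a.values()))
    norm_b = math.sqrt(sum(v * v for v in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


class SentenceIndex:
    """Stores one vector per sentence plus the ID of the chunk it came from."""

    def __init__(self) -> None:
        self.entries: list[tuple[Counter, str, str]] = []  # (vector, sentence, chunk_id)

    def add_chunk(self, chunk_id: str, text: str) -> None:
        for sent in re.split(r"(?<=[.!?])\s+", text.strip()):
            if sent:
                self.entries.append((toy_embed(sent), sent, chunk_id))

    def search(self, query: str, top_k: int = 10) -> list[dict]:
        """Return the best-matching sentences, each tagged with its parent chunk ID."""
        q = toy_embed(query)
        scored = [(cosine(q, vec), sent, cid) for vec, sent, cid in self.entries]
        scored.sort(key=lambda item: item[0], reverse=True)
        return [
            {"chunk_id": cid, "sentence": sent, "score": round(score, 3)}
            for score, sent, cid in scored[:top_k]
            if score > 0
        ]
```

Because every hit carries its `chunk_id`, the agent can follow a matching sentence straight to a full chunk read.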
Implement each as a callable function the LLM agent can invoke:
    def keyword_search(query: str, top_k: int = 10) -> list[dict]:
        """Exact lexical match across chunks. Returns chunk IDs + keyword snippets."""

    def semantic_search(query: str, top_k: int = 10) -> list[dict]:
        """Dense embedding cosine similarity. Returns chunk IDs + matching sentences."""

    def chunk_read(chunk_id: str) -> str:
        """Returns full text of a chunk. Tracks reads to avoid redundancy."""
Maintain a set of already-read chunk IDs per query session. When chunk_read is called on a previously-read chunk, return a brief "already read" message instead of the full text. This prevents the agent from wasting tokens re-consuming the same content.
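One way to sketch this context tracker is a small per-session object that owns the set of read IDs. The `ReadTracker` name, the in-memory `chunks` dictionary, and the exact wording of the messages are assumptions for illustration.

```python
class ReadTracker:
    """Per-query session state: which chunk IDs the agent has already read."""

    def __init__(self, chunks: dict[str, str]) -> None:
        self.chunks = chunks              # chunk ID -> full chunk text
        self.read_ids: set[str] = set()   # IDs returned in full during this session

    def chunk_read(self, chunk_id: str) -> str:
        if chunk_id not in self.chunks:
            return f"No chunk with ID {chunk_id}."
        if chunk_id in self.read_ids:
            # Short notice instead of the full text, to save tokens.
            return f"Chunk {chunk_id} was already read earlier in this session."
        self.read_ids.add(chunk_id)
        return self.chunks[chunk_id]
```

Creating a fresh tracker per question keeps one user's reads from suppressing chunks in another session.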
Define each tool with precise descriptions so the LLM understands when to use each:
    You are a research assistant answering questions using a document corpus.
    You have three retrieval tools:

    - keyword_search(query): Use when you need to find chunks containing specific
      entity names, dates, numbers, or exact phrases. Returns abbreviated snippets.
    - semantic_search(query): Use when you need to find passages related to a concept
      but the exact wording may differ. Returns matching sentences with chunk IDs.
    - chunk_read(chunk_id): Use to read the full content of a chunk identified by
      search. Use this to get complete context after search narrows candidates.

    Strategy: Start with search to identify relevant chunks, then read the most
    promising ones. Reformulate searches if initial results are insufficient.
    Reason about what information you still need before each tool call.
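When the LLM client supports structured tool calling, the same descriptions can also be supplied as machine-readable tool definitions. The sketch below builds them in a generic JSON Schema shape. The envelope (key names such as `parameters`) differs between providers, so treat this layout and the `make_tool` helper as assumptions to adapt to your client.

```python
def make_tool(name: str, description: str, param: str, param_desc: str) -> dict:
    """Build one tool definition with a single required string parameter."""
    return {
        "name": name,
        "description": description,
        "parameters": {
            "type": "object",
            "properties": {param: {"type": "string", "description": param_desc}},
            "required": [param],
        },
    }


RETRIEVAL_TOOLS = [
    make_tool(
        "keyword_search",
        "Find chunks containing specific entity names, dates, numbers, or exact "
        "phrases. Returns abbreviated snippets.",
        "query",
        "Exact terms or phrase to match.",
    ),
    make_tool(
        "semantic_search",
        "Find passages related to a concept when the exact wording may differ. "
        "Returns matching sentences with chunk IDs.",
        "query",
        "Natural-language description of the concept.",
    ),
    make_tool(
        "chunk_read",
        "Read the full content of a chunk identified by search.",
        "chunk_id",
        "ID of the chunk to read.",
    ),
]
```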
Run a loop where the LLM alternates between reasoning (thinking about what it knows and needs) and acting (calling a tool). Set a maximum step budget (10-20 steps) to prevent runaway loops.
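    # llm.chat, system_prompt, RETRIEVAL_TOOLS, and execute_tool are placeholders
    # for your LLM client, system message, tool schemas, and tool dispatcher.
    def agentic_rag(question: str, max_steps: int = 15):
        messages = [system_prompt, {"role": "user", "content": question}]
        read_chunks = set()  # chunk IDs already read in this session
        for step in range(max_steps):
            response = llm.chat(messages, tools=RETRIEVAL_TOOLS)
            if response.has_tool_call:
                result = execute_tool(response.tool_call, read_chunks)
                # Record the tool request, then the observation it produced.
                messages.append(response.message)
                messages.append({"role": "tool", "content": result})
            else:
                return response.content  # Final answer
        return "Could not determine answer within step budget."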
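The loop relies on an `execute_tool` dispatcher that is not shown. A minimal sketch follows, assuming the tool call arrives as a dictionary of the form `{"name": ..., "arguments": {...}}`; real client libraries expose different shapes. The two stub tools and the sample `CHUNKS` data are hypothetical, and `semantic_search` is omitted for brevity.

```python
# Sample in-memory corpus (assumption): chunk ID -> chunk text.
CHUNKS = {"887": "Driving Miss Daisy won Best Picture."}


def keyword_search(query: str) -> str:
    """Stub: list IDs of chunks containing the query as a substring."""
    hits = [cid for cid, text in CHUNKS.items() if query.lower() in text.lower()]
    return f"Matching chunk IDs: {hits}" if hits else "No results."


def chunk_read(chunk_id: str, read_chunks: set) -> str:
    """Stub: return full text once, then a short notice on repeat reads."""
    if chunk_id not in CHUNKS:
        return f"No chunk with ID {chunk_id}."
    if chunk_id in read_chunks:
        return f"Chunk {chunk_id} already read."
    read_chunks.add(chunk_id)
    return CHUNKS[chunk_id]


def execute_tool(tool_call: dict, read_chunks: set) -> str:
    """Route a tool call to its implementation and always return a string."""
    name = tool_call.get("name")
    args = tool_call.get("arguments", {})
    try:
        if name == "keyword_search":
            return keyword_search(args["query"])
        if name == "chunk_read":
            return chunk_read(args["chunk_id"], read_chunks)
    except KeyError as missing:
        # Report bad arguments as an observation so the agent can retry.
        return f"Error: missing argument {missing} for tool {name}."
    return f"Error: unknown tool {name}."
```

Returning errors as ordinary observations, rather than raising, lets the agent correct a malformed call on its next step instead of aborting the loop.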
When keyword or semantic search returns results, show only chunk IDs and brief snippets (not full chunk text). This keeps context compact and forces the agent to explicitly decide which chunks deserve a full read, reducing token waste.
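A sketch of what compact search output might look like: a keyword search that returns only chunk IDs plus a short character window around the first match, and a renderer that prints one line per hit. The substring matching, the term-count ranking, the `window` parameter, and the sample corpus are illustrative assumptions rather than A-RAG's actual scoring.

```python
import re

# Sample in-memory corpus (assumption): chunk ID -> chunk text.
CHUNKS = {
    "887": "Driving Miss Daisy won Best Picture at the 62nd Academy Awards.",
    "201": "The film was directed by Bruce Beresford.",
    "412": "Best Picture nominees were announced in February.",
}


def keyword_search(query: str, top_k: int = 10, window: int = 30) -> list[dict]:
    """Case-insensitive substring match per query term; returns IDs and snippets."""
    terms = re.findall(r"\w+", query.lower())
    results = []
    for chunk_id, text in CHUNKS.items():
        lowered = text.lower()
        hits = [t for t in terms if t in lowered]
        if not hits:
            continue
        # Cut a window of characters around the first matching term.
        pos = lowered.find(hits[0])
        start = max(0, pos - window)
        end = min(len(text), pos + len(hits[0]) + window)
        results.append({
            "chunk_id": chunk_id,
            "matched": len(hits),
            "snippet": f"...{text[start:end]}...",
        })
    # Chunks matching more query terms rank first.
    results.sort(key=lambda r: r["matched"], reverse=True)
    return results[:top_k]


def render_results(results: list[dict]) -> str:
    """One line per hit, so the agent sees IDs and snippets but not full chunks."""
    if not results:
        return "No results."
    return "\n".join(f"[chunk {r['chunk_id']}] {r['snippet']}" for r in results)
```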
After the agent gathers evidence, instruct it to produce a final answer citing specific chunk IDs. This enables verification and builds trust.
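Citations are only useful if they can be checked. The sketch below assumes the answer ends with a marker in the style used in the examples of this document, `[Sources: chunks 887, 201]`, with comma-separated IDs, and flags any cited ID the agent never actually read. Both function names are hypothetical.

```python
import re


def cited_chunk_ids(answer: str) -> set[str]:
    """Extract IDs from a marker such as "[Sources: chunks 887, 201]"."""
    match = re.search(r"\[Sources:\s*chunks?\s*([^\]]*)\]", answer)
    return set(re.findall(r"\w+", match.group(1))) if match else set()


def unsupported_citations(answer: str, read_chunks: set[str]) -> set[str]:
    """Cited IDs that were never read in full during this session."""
    return cited_chunk_ids(answer) - read_chunks
```

A non-empty result is a signal to reject the answer or to ask the agent to read the cited chunk before finalizing.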
Start with max_steps=15 and top_k=10 for searches. Monitor average steps used and retrieved token count. Increase the step budget for harder multi-hop tasks; decrease top-k if snippets overwhelm the context.
Test on multi-hop QA benchmarks (HotpotQA, MuSiQue, 2WikiMultiHopQA) or your domain-specific test set. Track: answer accuracy, average steps used, total retrieved tokens, and tool call distribution. The signature of a well-tuned A-RAG system is high accuracy with low retrieved token count.
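The tracked quantities can be aggregated from per-question logs. This is a minimal sketch; the `RunLog` fields and the `summarize` helper are hypothetical, and how `correct` is judged (exact match, LLM grader, and so on) is left to the evaluation setup.

```python
from collections import Counter
from dataclasses import dataclass, field


@dataclass
class RunLog:
    correct: bool            # whether the final answer was judged correct
    steps: int               # agent steps used for this question
    retrieved_tokens: int    # tokens returned by all tool calls
    tool_calls: list[str] = field(default_factory=list)  # tool names, in call order


def summarize(runs: list[RunLog]) -> dict:
    """Aggregate accuracy, average steps, average tokens, and tool usage counts."""
    n = len(runs)
    if n == 0:
        return {}
    calls = Counter(name for run in runs for name in run.tool_calls)
    return {
        "accuracy": sum(run.correct for run in runs) / n,
        "avg_steps": sum(run.steps for run in runs) / n,
        "avg_retrieved_tokens": sum(run.retrieved_tokens for run in runs) / n,
        "tool_call_distribution": dict(calls),
    }
```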
Example 1: Multi-hop question over a knowledge base
    User: "Who directed the film that won Best Picture the year the Berlin Wall fell?"

    Agent reasoning trace:
      Step 1 [Reason]: The Berlin Wall fell in 1989. I need to find which film won
        Best Picture for 1989 (62nd Academy Awards, held March 1990).
      Step 2 [keyword_search("Best Picture 1989")]:
        -> Chunk 412: "...Best Picture nominees for the 62nd Academy Awards..."
        -> Chunk 887: "...Driving Miss Daisy won Best Picture..."
      Step 3 [chunk_read("887")]:
        -> Full text confirms Driving Miss Daisy won Best Picture for 1989.
      Step 4 [keyword_search("Driving Miss Daisy director")]:
        -> Chunk 201: "...directed by Bruce Beresford..."
      Step 5 [Reason]: I have the answer. Driving Miss Daisy (1989 Best Picture)
        was directed by Bruce Beresford.

    Output: Bruce Beresford directed Driving Miss Daisy, which won Best Picture
    for the year the Berlin Wall fell (1989). [Sources: chunks 887, 201]
Example 2: Concept-based retrieval where exact keywords fail
    User: "What causes the price of goods to rise when a country prints too much money?"

    Agent reasoning trace:
      Step 1 [Reason]: This is about inflation and monetary policy. Let me try
        semantic search since the corpus may not use the exact phrasing "prints money."
      Step 2 [semantic_search("effects of increasing money supply on prices")]:
        -> Chunk 55: sentence about "monetary expansion leading to demand-pull inflation"
        -> Chunk 312: sentence about "quantity theory of money"
      Step 3 [chunk_read("55")]:
        -> Detailed explanation of demand-pull inflation mechanism.
      Step 4 [chunk_read("312")]:
        -> Explains MV = PQ and how increasing M with constant V and Q raises P.
      Step 5 [Reason]: Both chunks explain the mechanism. Chunk 55 covers the
        practical effect, chunk 312 covers the theoretical foundation.

    Output: When a country increases its money supply significantly, it causes
    inflation through demand-pull effects: more money chasing the same goods
    bids up prices. This is formalized by the quantity theory of money (MV = PQ),
    where increasing M (money supply) with stable V (velocity) and Q (output)
    forces P (price level) upward. [Sources: chunks 55, 312]
Example 3: Implementing A-RAG in a Python project
    User: "I have a FastAPI app with a basic RAG endpoint that does single-shot
    retrieval. Help me convert it to agentic RAG."

    Approach:
    1. Read the existing endpoint to understand the current retrieval logic
    2. Refactor the retrieval into three separate tool functions (keyword_search,
       semantic_search, chunk_read) matching the A-RAG interface
    3. Add a context tracker (set of read chunk IDs per request)
    4. Replace the single retrieve-then-generate call with a ReAct loop using
       the LLM's tool-calling API
    5. Add an iteration budget parameter to the endpoint
    6. Update the system prompt to describe the three tools and when to use each

    Output: Modified endpoint with agentic_rag() function, three tool
    implementations, context tracking, and configurable max_steps parameter.
Do:
Avoid:
| Problem | Symptom | Fix |
|---------|---------|-----|
| Agent loops without finding information | Hits max_steps with no answer | Add a fallback: if step budget exhausted, return best-effort answer from chunks already read |
| Search returns no results | Empty result set from keyword/semantic search | Instruct the agent to reformulate the query or switch search modality (keyword <-> semantic) |
| Agent reads too many chunks | High token usage, slow responses | Reduce top-k, tighten snippet length, or lower max_steps budget |
| Entity confusion in multi-hop | Agent conflates two entities with similar names | Add entity disambiguation in the system prompt; instruct the agent to verify entity identity before chaining |
| Chunk boundaries split key information | Answer spans two chunks | Implement adjacent chunk reading: when chunk_read is called, optionally include the preceding/following chunk |
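The adjacent chunk reading fix in the last row can be sketched as follows. The sketch assumes chunks of one document are held in a list in document order, so that a chunk's position doubles as its ID. The function name and the `include_adjacent` flag are hypothetical.

```python
def chunk_read_with_neighbors(
    chunks: list[str], index: int, include_adjacent: bool = False
) -> str:
    """Return chunk `index`; optionally add the preceding and following chunks."""
    if not 0 <= index < len(chunks):
        return f"No chunk at position {index}."
    if not include_adjacent:
        return chunks[index]
    # Clamp the window to the document so the first and last chunks still work.
    start, end = max(0, index - 1), min(len(chunks), index + 2)
    return "\n\n".join(f"[chunk {i}]\n{chunks[i]}" for i in range(start, end))
```

Labelling each returned chunk keeps citations precise even when three chunks come back from one call.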
Paper: A-RAG: Scaling Agentic Retrieval-Augmented Generation via Hierarchical Retrieval Interfaces (Du et al., 2026)
Key takeaway: Exposing three retrieval interfaces (keyword search, semantic search, chunk read) as tools in a ReAct loop lets the LLM make its own retrieval decisions, achieving higher accuracy with fewer retrieved tokens than single-shot or workflow-based RAG -- and the performance scales naturally with model capability improvements.