Build AI-Powered Document Search

Guide the user through building an AI that searches documents and answers questions accurately. Uses DSPy's RAG (retrieval-augmented generation) pattern — retrieve relevant passages, then generate an answer grounded in them.

Step 0: Load your data

If you have documents in files, databases, or SaaS tools, use LangChain's document loaders to get them into a standard format before building your search pipeline.

LangChain document loaders

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    CSVLoader,
    WebBaseLoader,
    DirectoryLoader,
    NotionDBLoader,
    JSONLoader,
)

# PDF files
docs = PyPDFLoader("report.pdf").load()

# All text files in a directory
docs = DirectoryLoader("./docs/", glob="**/*.txt", loader_cls=TextLoader).load()

# Web pages
docs = WebBaseLoader("https://example.com/help").load()

# CSV
docs = CSVLoader("data.csv", source_column="url").load()

# JSON
docs = JSONLoader("data.json", jq_schema=".records[]", content_key="text").load()

Text splitting

Split loaded documents into chunks sized for embedding and retrieval:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
# Each chunk has .page_content (text) and .metadata (source info)

| Splitter | Best for | |----------|----------| | RecursiveCharacterTextSplitter | General-purpose (recommended default) | | MarkdownHeaderTextSplitter | Markdown docs — splits by heading | | TokenTextSplitter | When you need strict token budgets |

Vector store setup with LangChain

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # or HuggingFaceEmbeddings, etc.
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Use as a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
results = retriever.invoke("How do refunds work?")

Other stores follow the same pattern: Pinecone.from_documents(...), FAISS.from_documents(...).

Once your data is loaded and chunked, wire it into a DSPy retriever (Step 3 below) or use ChromaDB directly. For the full LangChain/LangGraph API, see the LangChain docs.

Step 1: Understand the setup

Ask the user:

What documents are you searching? (PDFs, web pages, database, help articles, etc.)
What kind of questions will users ask? (factual lookups, how-to questions, multi-step research?)
Do you have a search backend already? (Elasticsearch, Pinecone, ChromaDB, pgvector, etc.)
Do questions need info from multiple documents? (simple lookup vs. combining info)

Step 2: Build the search-and-answer pipeline

Basic: search then answer

import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

class AnswerFromDocs(dspy.Signature):
    """Answer the question based on the given context."""
    context: list[str] = dspy.InputField(desc="Relevant passages from the knowledge base")
    question: str = dspy.InputField(desc="User's question")
    answer: str = dspy.OutputField(desc="Answer grounded in the context")

class DocSearch(dspy.Module):
    def __init__(self, num_passages=3):
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.answer = dspy.ChainOfThought(AnswerFromDocs)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.answer(context=context, question=question)

Configure the search backend

DSPy supports multiple search backends. Set up via dspy.configure:

# ColBERTv2 (hosted)
colbert = dspy.ColBERTv2(url="http://your-server:port/endpoint")
dspy.configure(lm=lm, rm=colbert)

# Or wrap your own search (Elasticsearch, Pinecone, pgvector, etc.)
class MySearchBackend(dspy.Retrieve):
    def forward(self, query, k=None):
        k = k or self.k
        # Your search logic here
        results = your_search_function(query, top_k=k)
        return dspy.Prediction(passages=[r["text"] for r in results])

Step 3: Set up a vector store

If you do not have a search backend yet, set one up. For prototyping, use dspy.Embeddings (built-in, no external DB needed) or ChromaDB:

DSPy built-in retriever (simplest option)

embedder = dspy.Embedder("openai/text-embedding-3-small")  # or any supported model
retriever = dspy.Embeddings(corpus=corpus_texts, embedder=embedder, k=5)
# Uses FAISS for large corpora (>20K docs), brute-force for smaller ones
# Use retriever("query") to search — returns dspy.Prediction(passages=..., indices=...)
dspy.configure(lm=lm, rm=retriever)

ChromaDB setup

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("my_docs")

Load and chunk documents

Split documents into passages before adding them to the vector store. Sentence-based chunking works well for most use cases:

import re

def chunk_text(text, max_sentences=5):
    """Split text into chunks of N sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = " ".join(sentences[i:i + max_sentences])
        if chunk:
            chunks.append(chunk)
    return chunks

# Load and chunk your documents
for doc in documents:
    chunks = chunk_text(doc["text"])
    collection.add(
        documents=chunks,
        ids=[f"{doc['id']}_chunk_{i}" for i in range(len(chunks))],
        metadatas=[{"source": doc["source"]}] * len(chunks),
    )

Custom embeddings

ChromaDB uses its default embedding function, but you can swap in others:

# SentenceTransformers (local, free)
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
ef = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# OpenAI embeddings (API, paid)
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
ef = OpenAIEmbeddingFunction(api_key="...", model_name="text-embedding-3-small")

collection = client.get_or_create_collection("my_docs", embedding_function=ef)

Chunking strategies

| Strategy | How it works | Best for | |----------|-------------|----------| | Sentence-based | Split on sentence boundaries | Articles, docs, help pages | | Fixed-size | Split every N characters with overlap | Long unstructured text | | Paragraph | Split on double newlines | Well-structured documents | | Overlap | Fixed-size with N-character overlap between chunks | When context at chunk boundaries matters |

Wire it up as a DSPy retriever

class ChromaRetriever(dspy.Retrieve):
    def __init__(self, collection, k=3):
        super().__init__(k=k)
        self.collection = collection

    def forward(self, query, k=None):
        k = k or self.k
        results = self.collection.query(query_texts=[query], n_results=k)
        return dspy.Prediction(passages=results["documents"][0])

# Use it
retriever = ChromaRetriever(collection)
dspy.configure(lm=lm, rm=retriever)

Step 3b: Connect an existing vector store

If you already have a vector store, wire it up as a DSPy retriever. Each follows the same pattern — subclass dspy.Retrieve and implement forward():

Provider comparison

| Store | Type | Best for | Setup | |-------|------|----------|-------| | ChromaDB | Embedded | Prototyping, small datasets | pip install chromadb | | Pinecone | Cloud | Managed, serverless, scales to billions | pip install pinecone | | Qdrant | Self-hosted or cloud | Open-source, filtering, hybrid search | pip install qdrant-client | | Weaviate | Self-hosted or cloud | Multi-modal, GraphQL, hybrid search | pip install weaviate-client | | pgvector | Postgres extension | Teams already using Postgres | pip install pgvector sqlalchemy |

Pinecone retriever

from pinecone import Pinecone

class PineconeRetriever(dspy.Retrieve):
    def __init__(self, index_name, api_key, embed_fn, k=3):
        super().__init__(k=k)
        pc = Pinecone(api_key=api_key)
        self.index = pc.Index(index_name)
        self.embed_fn = embed_fn  # function: str -> list[float]

    def forward(self, query, k=None):
        k = k or self.k
        vector = self.embed_fn(query)
        results = self.index.query(vector=vector, top_k=k, include_metadata=True)
        passages = [m["metadata"]["text"] for m in results["matches"]]
        return dspy.Prediction(passages=passages)

Qdrant retriever

from qdrant_client import QdrantClient

class QdrantRetriever(dspy.Retrieve):
    def __init__(self, collection_name, url, embed_fn, k=3):
        super().__init__(k=k)
        self.client = QdrantClient(url=url)
        self.collection = collection_name
        self.embed_fn = embed_fn

    def forward(self, query, k=None):
        k = k or self.k
        vector = self.embed_fn(query)
        results = self.client.query_points(
            collection_name=self.collection, query=vector, limit=k,
        )
        passages = [p.payload["text"] for p in results.points]
        return dspy.Prediction(passages=passages)

Weaviate retriever

import weaviate

class WeaviateRetriever(dspy.Retrieve):
    def __init__(self, collection_name, url, k=3):
        super().__init__(k=k)
        self.client = weaviate.connect_to_local(url=url)  # or connect_to_weaviate_cloud
        self.collection = self.client.collections.get(collection_name)

    def forward(self, query, k=None):
        k = k or self.k
        results = self.collection.query.near_text(query=query, limit=k)
        passages = [o.properties["text"] for o in results.objects]
        return dspy.Prediction(passages=passages)

pgvector retriever (PostgreSQL)

from sqlalchemy import create_engine, text

class PgvectorRetriever(dspy.Retrieve):
    def __init__(self, engine, table, embed_fn, k=3):
        super().__init__(k=k)
        self.engine = engine
        self.table = table
        self.embed_fn = embed_fn

    def forward(self, query, k=None):
        k = k or self.k
        vector = self.embed_fn(query)
        sql = text(f"""
            SELECT content FROM {self.table}
            ORDER BY embedding <=> :vec LIMIT :k
        """)
        with self.engine.connect() as conn:
            rows = conn.execute(sql, {"vec": str(vector), "k": k}).fetchall()
        passages = [row[0] for row in rows]
        return dspy.Prediction(passages=passages)

All of these work as drop-in replacements for dspy.Retrieve:

retriever = PineconeRetriever("my-index", api_key="...", embed_fn=embed)
dspy.configure(lm=lm, rm=retriever)
# Or pass directly to your module

Step 4: Multi-document search (for complex questions)

When questions need info from multiple places:

class GenerateSearchQuery(dspy.Signature):
    """Generate a search query to find missing information."""
    context: list[str] = dspy.InputField(desc="Information gathered so far")
    question: str = dspy.InputField(desc="The question to answer")
    query: str = dspy.OutputField(desc="Search query to find missing information")

class MultiStepSearch(dspy.Module):
    def __init__(self, num_passages=3, num_searches=2):
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(num_searches)]
        self.answer = dspy.ChainOfThought(AnswerFromDocs)

    def forward(self, question):
        context = []

        for hop in self.generate_query:
            query = hop(context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)

        return self.answer(context=context, question=question)

def deduplicate(passages):
    seen = set()
    result = []
    for p in passages:
        if p not in seen:
            seen.add(p)
            result.append(p)
    return result

Step 5: Test the quality

def search_metric(example, prediction, trace=None):
    # Exact match (simple)
    return prediction.answer == example.answer

# Or use an AI judge for open-ended answers
class JudgeAnswer(dspy.Signature):
    """Is the predicted answer correct given the expected answer?"""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField()
    predicted_answer: str = dspy.InputField()
    is_correct: bool = dspy.OutputField()

def judge_metric(example, prediction, trace=None):
    judge = dspy.Predict(JudgeAnswer)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return result.is_correct

Step 6: Improve accuracy

optimizer = dspy.BootstrapFewShot(metric=search_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(DocSearch(), trainset=trainset)

# Typical improvement: 45-60% exact match -> 65-80% after optimization
# For further gains, upgrade to MIPROv2:
# optimizer = dspy.MIPROv2(metric=search_metric, auto="medium")

When NOT to use RAG

Data fits in context — if all your documents fit within the LM context window (under ~100K tokens), pass them directly as context instead of building a retrieval pipeline. RAG adds complexity for no benefit.
Questions are always about the same document — if every query targets one known document, skip retrieval and just include it.
You need exact keyword search — if users search by ID, SKU, or exact phrases, a database query or full-text search (Elasticsearch, Postgres tsvector) outperforms embedding search. Use RAG only when queries are semantic.
Real-time data — if the answer changes every minute (stock prices, live dashboards), a retrieval index will be stale. Query the source directly.

Key patterns

Prefer ChainOfThought for the answer step — reasoning typically helps ground answers in the documents. Use Predict if latency matters more than accuracy
Include context in the signature so the AI knows to use the retrieved passages
Multi-step search for complex questions — if one search is not enough, chain search queries
Use dspy.Refine to ensure answers actually cite the documents by scoring citation presence in a reward function
Separate search from answer generation — optimize each independently
Consider joint prompt + retrieval optimization — the GEPA paper (arxiv 2507.19457) shows a RAG adapter that jointly optimizes prompts and retrieval strategy for multiplicative gains. See /dspy-gepa for details
Consider dspy.Embeddings as a built-in retriever — it handles embedding, FAISS indexing, and search in one class without needing a separate vector store (see reference.md for API details)

Gotchas

Chunk size matters more than retriever choice — most RAG failures trace to bad chunking, not bad embeddings. Start with 512 tokens with 50-token overlap and tune from there.
Do not skip the reranking step — embedding similarity retrieves candidates; a reranker (or LM-based reranker) filters them. Without reranking, irrelevant passages dilute the context.
k=3 is not always right — the default k (number of retrieved passages) is a critical hyperparameter. Too few and you miss relevant context; too many and you overwhelm the LM. Tune it against your dev set.
Test with questions that require combining information — single-hop retrieval fails when the answer spans multiple chunks. Use dspy.ChainOfThought with multi-step retrieval for these cases.
Embedding models and chunk sizes must match at index and query time — if you re-chunk or switch embedding models, you must rebuild the vector index. Stale indexes silently return bad results.

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

Need to summarize docs instead of answering questions? Use /ai-summarizing
Put your document search behind a REST API — see /ai-serving-apis
Building a chatbot on top of doc search? Use /ai-building-chatbots
Measure and improve your AI — see /ai-improving-accuracy
Define input/output contracts for your signatures — see /dspy-signatures
Add reasoning to your answer step — see /dspy-chain-of-thought
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

Additional resources

For DSPy retrieval API details (Embeddings, Retrieve, ColBERTv2, Embedder), see reference.md
For worked examples, see examples.md

Build AI-Powered Document Search

Step 0: Load your data

If you have documents in files, databases, or SaaS tools, use LangChain's document loaders to get them into a standard format before building your search pipeline.

LangChain document loaders

from langchain_community.document_loaders import (
    PyPDFLoader,
    TextLoader,
    CSVLoader,
    WebBaseLoader,
    DirectoryLoader,
    NotionDBLoader,
    JSONLoader,
)

# PDF files
docs = PyPDFLoader("report.pdf").load()

# All text files in a directory
docs = DirectoryLoader("./docs/", glob="**/*.txt", loader_cls=TextLoader).load()

# Web pages
docs = WebBaseLoader("https://example.com/help").load()

# CSV
docs = CSVLoader("data.csv", source_column="url").load()

# JSON
docs = JSONLoader("data.json", jq_schema=".records[]", content_key="text").load()

Text splitting

Split loaded documents into chunks sized for embedding and retrieval:

from langchain_text_splitters import RecursiveCharacterTextSplitter

splitter = RecursiveCharacterTextSplitter(chunk_size=1000, chunk_overlap=200)
chunks = splitter.split_documents(docs)
# Each chunk has .page_content (text) and .metadata (source info)

Vector store setup with LangChain

from langchain_chroma import Chroma
from langchain_openai import OpenAIEmbeddings

embeddings = OpenAIEmbeddings(model="text-embedding-3-small")  # or HuggingFaceEmbeddings, etc.
vectorstore = Chroma.from_documents(chunks, embeddings, persist_directory="./chroma_db")

# Use as a retriever
retriever = vectorstore.as_retriever(search_kwargs={"k": 5})
results = retriever.invoke("How do refunds work?")

Other stores follow the same pattern: Pinecone.from_documents(...), FAISS.from_documents(...).

Once your data is loaded and chunked, wire it into a DSPy retriever (Step 3 below) or use ChromaDB directly. For the full LangChain/LangGraph API, see the LangChain docs.

Step 1: Understand the setup

Ask the user:

What documents are you searching? (PDFs, web pages, database, help articles, etc.)
What kind of questions will users ask? (factual lookups, how-to questions, multi-step research?)
Do you have a search backend already? (Elasticsearch, Pinecone, ChromaDB, pgvector, etc.)
Do questions need info from multiple documents? (simple lookup vs. combining info)

Step 2: Build the search-and-answer pipeline

Basic: search then answer

import dspy

lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm)

class AnswerFromDocs(dspy.Signature):
    """Answer the question based on the given context."""
    context: list[str] = dspy.InputField(desc="Relevant passages from the knowledge base")
    question: str = dspy.InputField(desc="User's question")
    answer: str = dspy.OutputField(desc="Answer grounded in the context")

class DocSearch(dspy.Module):
    def __init__(self, num_passages=3):
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.answer = dspy.ChainOfThought(AnswerFromDocs)

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.answer(context=context, question=question)

Configure the search backend

DSPy supports multiple search backends. Set up via dspy.configure:

# ColBERTv2 (hosted)
colbert = dspy.ColBERTv2(url="http://your-server:port/endpoint")
dspy.configure(lm=lm, rm=colbert)

# Or wrap your own search (Elasticsearch, Pinecone, pgvector, etc.)
class MySearchBackend(dspy.Retrieve):
    def forward(self, query, k=None):
        k = k or self.k
        # Your search logic here
        results = your_search_function(query, top_k=k)
        return dspy.Prediction(passages=[r["text"] for r in results])

Step 3: Set up a vector store

If you do not have a search backend yet, set one up. For prototyping, use dspy.Embeddings (built-in, no external DB needed) or ChromaDB:

DSPy built-in retriever (simplest option)

embedder = dspy.Embedder("openai/text-embedding-3-small")  # or any supported model
retriever = dspy.Embeddings(corpus=corpus_texts, embedder=embedder, k=5)
# Uses FAISS for large corpora (>20K docs), brute-force for smaller ones
# Use retriever("query") to search — returns dspy.Prediction(passages=..., indices=...)
dspy.configure(lm=lm, rm=retriever)

ChromaDB setup

import chromadb

client = chromadb.PersistentClient(path="./chroma_db")
collection = client.get_or_create_collection("my_docs")

Load and chunk documents

Split documents into passages before adding them to the vector store. Sentence-based chunking works well for most use cases:

import re

def chunk_text(text, max_sentences=5):
    """Split text into chunks of N sentences."""
    sentences = re.split(r'(?<=[.!?])\s+', text.strip())
    chunks = []
    for i in range(0, len(sentences), max_sentences):
        chunk = " ".join(sentences[i:i + max_sentences])
        if chunk:
            chunks.append(chunk)
    return chunks

# Load and chunk your documents
for doc in documents:
    chunks = chunk_text(doc["text"])
    collection.add(
        documents=chunks,
        ids=[f"{doc['id']}_chunk_{i}" for i in range(len(chunks))],
        metadatas=[{"source": doc["source"]}] * len(chunks),
    )

Custom embeddings

ChromaDB uses its default embedding function, but you can swap in others:

# SentenceTransformers (local, free)
from chromadb.utils.embedding_functions import SentenceTransformerEmbeddingFunction
ef = SentenceTransformerEmbeddingFunction(model_name="all-MiniLM-L6-v2")

# OpenAI embeddings (API, paid)
from chromadb.utils.embedding_functions import OpenAIEmbeddingFunction
ef = OpenAIEmbeddingFunction(api_key="...", model_name="text-embedding-3-small")

collection = client.get_or_create_collection("my_docs", embedding_function=ef)

Chunking strategies

Wire it up as a DSPy retriever

class ChromaRetriever(dspy.Retrieve):
    def __init__(self, collection, k=3):
        super().__init__(k=k)
        self.collection = collection

    def forward(self, query, k=None):
        k = k or self.k
        results = self.collection.query(query_texts=[query], n_results=k)
        return dspy.Prediction(passages=results["documents"][0])

# Use it
retriever = ChromaRetriever(collection)
dspy.configure(lm=lm, rm=retriever)

Step 3b: Connect an existing vector store

If you already have a vector store, wire it up as a DSPy retriever. Each follows the same pattern — subclass dspy.Retrieve and implement forward():

Provider comparison

Pinecone retriever

from pinecone import Pinecone

class PineconeRetriever(dspy.Retrieve):
    def __init__(self, index_name, api_key, embed_fn, k=3):
        super().__init__(k=k)
        pc = Pinecone(api_key=api_key)
        self.index = pc.Index(index_name)
        self.embed_fn = embed_fn  # function: str -> list[float]

    def forward(self, query, k=None):
        k = k or self.k
        vector = self.embed_fn(query)
        results = self.index.query(vector=vector, top_k=k, include_metadata=True)
        passages = [m["metadata"]["text"] for m in results["matches"]]
        return dspy.Prediction(passages=passages)

Qdrant retriever

from qdrant_client import QdrantClient

class QdrantRetriever(dspy.Retrieve):
    def __init__(self, collection_name, url, embed_fn, k=3):
        super().__init__(k=k)
        self.client = QdrantClient(url=url)
        self.collection = collection_name
        self.embed_fn = embed_fn

    def forward(self, query, k=None):
        k = k or self.k
        vector = self.embed_fn(query)
        results = self.client.query_points(
            collection_name=self.collection, query=vector, limit=k,
        )
        passages = [p.payload["text"] for p in results.points]
        return dspy.Prediction(passages=passages)

Weaviate retriever

import weaviate

class WeaviateRetriever(dspy.Retrieve):
    def __init__(self, collection_name, url, k=3):
        super().__init__(k=k)
        self.client = weaviate.connect_to_local(url=url)  # or connect_to_weaviate_cloud
        self.collection = self.client.collections.get(collection_name)

    def forward(self, query, k=None):
        k = k or self.k
        results = self.collection.query.near_text(query=query, limit=k)
        passages = [o.properties["text"] for o in results.objects]
        return dspy.Prediction(passages=passages)

pgvector retriever (PostgreSQL)

from sqlalchemy import create_engine, text

class PgvectorRetriever(dspy.Retrieve):
    def __init__(self, engine, table, embed_fn, k=3):
        super().__init__(k=k)
        self.engine = engine
        self.table = table
        self.embed_fn = embed_fn

    def forward(self, query, k=None):
        k = k or self.k
        vector = self.embed_fn(query)
        sql = text(f"""
            SELECT content FROM {self.table}
            ORDER BY embedding <=> :vec LIMIT :k
        """)
        with self.engine.connect() as conn:
            rows = conn.execute(sql, {"vec": str(vector), "k": k}).fetchall()
        passages = [row[0] for row in rows]
        return dspy.Prediction(passages=passages)

All of these work as drop-in replacements for dspy.Retrieve:

retriever = PineconeRetriever("my-index", api_key="...", embed_fn=embed)
dspy.configure(lm=lm, rm=retriever)
# Or pass directly to your module

Step 4: Multi-document search (for complex questions)

When questions need info from multiple places:

class GenerateSearchQuery(dspy.Signature):
    """Generate a search query to find missing information."""
    context: list[str] = dspy.InputField(desc="Information gathered so far")
    question: str = dspy.InputField(desc="The question to answer")
    query: str = dspy.OutputField(desc="Search query to find missing information")

class MultiStepSearch(dspy.Module):
    def __init__(self, num_passages=3, num_searches=2):
        self.retrieve = dspy.Retrieve(k=num_passages)
        self.generate_query = [dspy.ChainOfThought(GenerateSearchQuery) for _ in range(num_searches)]
        self.answer = dspy.ChainOfThought(AnswerFromDocs)

    def forward(self, question):
        context = []

        for hop in self.generate_query:
            query = hop(context=context, question=question).query
            passages = self.retrieve(query).passages
            context = deduplicate(context + passages)

        return self.answer(context=context, question=question)

def deduplicate(passages):
    seen = set()
    result = []
    for p in passages:
        if p not in seen:
            seen.add(p)
            result.append(p)
    return result

Step 5: Test the quality

def search_metric(example, prediction, trace=None):
    # Exact match (simple)
    return prediction.answer == example.answer

# Or use an AI judge for open-ended answers
class JudgeAnswer(dspy.Signature):
    """Is the predicted answer correct given the expected answer?"""
    question: str = dspy.InputField()
    gold_answer: str = dspy.InputField()
    predicted_answer: str = dspy.InputField()
    is_correct: bool = dspy.OutputField()

def judge_metric(example, prediction, trace=None):
    judge = dspy.Predict(JudgeAnswer)
    result = judge(
        question=example.question,
        gold_answer=example.answer,
        predicted_answer=prediction.answer,
    )
    return result.is_correct

Step 6: Improve accuracy

optimizer = dspy.BootstrapFewShot(metric=search_metric, max_bootstrapped_demos=4)
optimized = optimizer.compile(DocSearch(), trainset=trainset)

# Typical improvement: 45-60% exact match -> 65-80% after optimization
# For further gains, upgrade to MIPROv2:
# optimizer = dspy.MIPROv2(metric=search_metric, auto="medium")

When NOT to use RAG

Data fits in context — if all your documents fit within the LM context window (under ~100K tokens), pass them directly as context instead of building a retrieval pipeline. RAG adds complexity for no benefit.
Questions are always about the same document — if every query targets one known document, skip retrieval and just include it.
You need exact keyword search — if users search by ID, SKU, or exact phrases, a database query or full-text search (Elasticsearch, Postgres tsvector) outperforms embedding search. Use RAG only when queries are semantic.
Real-time data — if the answer changes every minute (stock prices, live dashboards), a retrieval index will be stale. Query the source directly.

Key patterns

Prefer ChainOfThought for the answer step — reasoning typically helps ground answers in the documents. Use Predict if latency matters more than accuracy
Include context in the signature so the AI knows to use the retrieved passages
Multi-step search for complex questions — if one search is not enough, chain search queries
Use dspy.Refine to ensure answers actually cite the documents by scoring citation presence in a reward function
Separate search from answer generation — optimize each independently
Consider joint prompt + retrieval optimization — the GEPA paper (arxiv 2507.19457) shows a RAG adapter that jointly optimizes prompts and retrieval strategy for multiplicative gains. See /dspy-gepa for details
Consider dspy.Embeddings as a built-in retriever — it handles embedding, FAISS indexing, and search in one class without needing a separate vector store (see reference.md for API details)

Gotchas

Chunk size matters more than retriever choice — most RAG failures trace to bad chunking, not bad embeddings. Start with 512 tokens with 50-token overlap and tune from there.
Do not skip the reranking step — embedding similarity retrieves candidates; a reranker (or LM-based reranker) filters them. Without reranking, irrelevant passages dilute the context.
k=3 is not always right — the default k (number of retrieved passages) is a critical hyperparameter. Too few and you miss relevant context; too many and you overwhelm the LM. Tune it against your dev set.
Test with questions that require combining information — single-hop retrieval fails when the answer spans multiple chunks. Use dspy.ChainOfThought with multi-step retrieval for these cases.
Embedding models and chunk sizes must match at index and query time — if you re-chunk or switch embedding models, you must rebuild the vector index. Stale indexes silently return bad results.

Cross-references

Install any skill: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>

Need to summarize docs instead of answering questions? Use /ai-summarizing
Put your document search behind a REST API — see /ai-serving-apis
Building a chatbot on top of doc search? Use /ai-building-chatbots
Measure and improve your AI — see /ai-improving-accuracy
Define input/output contracts for your signatures — see /dspy-signatures
Add reasoning to your answer step — see /dspy-chain-of-thought
Install /ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do

Additional resources

For DSPy retrieval API details (Embeddings, Retrieve, ColBERTv2, Embedder), see reference.md
For worked examples, see examples.md

Adoption

lebsral/ai-searching-docs

$ install --global

Security Scan Results

SKILL.md

Build AI-Powered Document Search

Step 0: Load your data

LangChain document loaders

Text splitting

Vector store setup with LangChain

Step 1: Understand the setup

Step 2: Build the search-and-answer pipeline

Basic: search then answer

Configure the search backend

Step 3: Set up a vector store

DSPy built-in retriever (simplest option)

ChromaDB setup

Load and chunk documents

Custom embeddings

Chunking strategies

Wire it up as a DSPy retriever

Step 3b: Connect an existing vector store

Provider comparison

Pinecone retriever

Qdrant retriever

Weaviate retriever

pgvector retriever (PostgreSQL)

Step 4: Multi-document search (for complex questions)

Step 5: Test the quality

Step 6: Improve accuracy

When NOT to use RAG

Key patterns

Gotchas

Cross-references

Additional resources

Related Skills

lebsral/ai-watching-optimization

lebsral/dspy-miprov2

lebsral/dspy-langwatch

lebsral/dspy-gepa

lebsral/ai-searching-docs

$ install --global

Security Scan Results

SKILL.md

Build AI-Powered Document Search

Step 0: Load your data

LangChain document loaders

Text splitting

Vector store setup with LangChain

Step 1: Understand the setup

Step 2: Build the search-and-answer pipeline

Basic: search then answer

Configure the search backend

Step 3: Set up a vector store

DSPy built-in retriever (simplest option)

ChromaDB setup

Load and chunk documents

Custom embeddings

Chunking strategies

Wire it up as a DSPy retriever

Step 3b: Connect an existing vector store

Provider comparison

Pinecone retriever

Qdrant retriever

Weaviate retriever

pgvector retriever (PostgreSQL)

Step 4: Multi-document search (for complex questions)

Step 5: Test the quality

Step 6: Improve accuracy

When NOT to use RAG

Key patterns

Gotchas

Cross-references

Additional resources

Related Skills

lebsral/ai-watching-optimization

lebsral/dspy-miprov2

lebsral/dspy-langwatch

lebsral/dspy-gepa