DSPy retrieval modules (dspy.Retrieve, dspy.ColBERTv2, dspy.Embedder, dspy.retrievers.Embeddings) for searching documents, computing embeddings, and building RAG pipelines. Use when you need to search over documents, build a RAG pipeline, connect DSPy to a vector database, compute embeddings for semantic search, set up ChromaDB or Pinecone with DSPy, or build knowledge-grounded question answering. Also used for RAG pipeline in DSPy, vector database integration, semantic search, embedding retrieval, retrieval augmented generation setup, connect knowledge base to DSPy, search documents then answer, grounded generation with retrieval.
Guide the user through DSPy's retrieval modules for searching documents, computing embeddings, and building RAG (retrieval-augmented generation) pipelines.
Before building retrieval into a DSPy program, clarify:
- Do you already have a search backend or vector store? If so, subclass dspy.Retrieve to wrap it. If not, use dspy.retrievers.Embeddings for a local solution.

DSPy provides retrieval modules that fetch relevant documents or passages given a query. These modules plug into DSPy programs just like dspy.Predict or dspy.ChainOfThought: declare them in __init__, call them in forward(), and optimizers handle the rest.
There are four key components:
| Component | Purpose | When to use |
|-----------|---------|-------------|
| dspy.Retrieve | Base retriever class | Wrap any search backend (Elastic, Pinecone, etc.) |
| dspy.ColBERTv2 | ColBERTv2 retrieval client | Query a hosted ColBERTv2 server |
| dspy.Embedder | Compute embeddings | Turn text into vectors using any LiteLLM-supported model |
| dspy.retrievers.Embeddings | Local vector search | Build a retriever from an embedder + corpus, uses FAISS |
The base class for all retrievers. Use it directly with a configured retrieval model (rm), or subclass it to wrap your own search backend.
```python
import dspy

# Configure a retrieval model globally
colbert = dspy.ColBERTv2(url="http://your-server:8893/api/search")
lm = dspy.LM("openai/gpt-4o-mini")  # or "anthropic/claude-sonnet-4-5-20250929", etc.
dspy.configure(lm=lm, rm=colbert)

# Use dspy.Retrieve -- it delegates to the configured rm
retriever = dspy.Retrieve(k=5)
result = retriever("What is retrieval-augmented generation?")
print(result.passages)  # list[str] of top-k passages
```
Parameters:
- k (int) -- number of passages to retrieve. Can be set at init time or overridden per call.

dspy.Retrieve returns a dspy.Prediction with a .passages attribute: a list[str] of the top-k retrieved passages.
Wrap any search system by subclassing dspy.Retrieve and implementing forward():
```python
import dspy


class MyRetriever(dspy.Retrieve):
    def __init__(self, search_client, k=3):
        super().__init__(k=k)
        self.client = search_client  # any object with search(query, top_k) -> list[dict]

    def forward(self, query, k=None):
        k = k or self.k  # a per-call k overrides the default set at init
        results = self.client.search(query, top_k=k)
        return dspy.Prediction(passages=[r["text"] for r in results])
```
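MyRetriever only assumes that search_client.search(query, top_k=k) returns a list of dicts with a "text" key. For offline testing, a toy in-memory client that satisfies that interface might look like the sketch below. KeywordSearchClient and its word-overlap scoring are hypothetical illustrations, not part of DSPy or any real search library.

```python
import re


class KeywordSearchClient:
    """Hypothetical in-memory client exposing search(query, top_k) -> list[dict]."""

    def __init__(self, documents):
        self.documents = list(documents)

    @staticmethod
    def _tokens(text):
        # Lowercased word tokens; a crude stand-in for a real analyzer.
        return set(re.findall(r"\w+", text.lower()))

    def search(self, query, top_k=3):
        query_tokens = self._tokens(query)
        scored = []
        for doc in self.documents:
            overlap = len(query_tokens & self._tokens(doc))
            if overlap:  # skip documents that share no words with the query
                scored.append({"text": doc, "score": overlap})
        # Highest overlap first; list.sort is stable, so ties keep corpus order.
        scored.sort(key=lambda r: r["score"], reverse=True)
        return scored[:top_k]
```

With this client, MyRetriever(KeywordSearchClient(docs), k=2) would return the two best-overlapping documents as .passages. A real backend (Elastic, Pinecone, etc.) only needs an adapter with the same search() shape.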
The forward() method must:
- Accept query (str) and an optional k (int)
- Return a dspy.Prediction with a passages field (list of strings)

dspy.ColBERTv2 is a retrieval client that queries a hosted ColBERTv2 server. ColBERTv2 is a late-interaction neural retrieval model that provides high-quality passage retrieval.

```python
colbert = dspy.ColBERTv2(url="http://your-server:8893/api/search")
```
Parameters:
- url (str) -- URL of the ColBERTv2 server endpoint

```python
# Direct call
results = colbert("What is DSPy?", k=3)
# Returns a list of dicts with 'text', 'score', etc.

# As a configured retrieval model
dspy.configure(lm=lm, rm=colbert)
retriever = dspy.Retrieve(k=5)
passages = retriever("search query").passages
```
Stanford hosts a public ColBERTv2 server for Wikipedia that you can use for testing:
```python
colbert = dspy.ColBERTv2(url="http://20.102.90.50:2017/wiki17_abstracts")
dspy.configure(lm=lm, rm=colbert)
```
For your own data, you need to run a ColBERTv2 server. See the ColBERT repository for setup instructions.
dspy.Embedder computes embeddings for text using any LiteLLM-supported embedding model. It is not a retriever itself: it turns text into vectors that you can use with dspy.retrievers.Embeddings or your own vector store.
```python
embedder = dspy.Embedder(
    "openai/text-embedding-3-small",  # model identifier (LiteLLM format)
    dimensions=512,  # optional: output dimensions
)
```
Parameters:
- model -- a LiteLLM model string (e.g., "openai/text-embedding-3-small", "cohere/embed-english-v3.0"), or a callable for custom embedding functions
- extra keyword arguments -- passed through to the embedding call (e.g., dimensions=512 for models that support it)

```python
# Embed a single text
vector = embedder("What is DSPy?")
# Returns a 1D numpy array

# Embed multiple texts
vectors = embedder(["text one", "text two", "text three"])
# Returns a 2D numpy array (shape: num_texts x embedding_dim)
```
Any embedding model supported by LiteLLM works:
```python
# OpenAI
embedder = dspy.Embedder("openai/text-embedding-3-small")

# Cohere
embedder = dspy.Embedder("cohere/embed-english-v3.0")

# Local via Ollama
embedder = dspy.Embedder("ollama/nomic-embed-text")
```
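Because dspy.Embedder also accepts a callable, you can swap in a deterministic, offline embedding function for tests. The sketch below is a hypothetical feature-hashing embedder, not a DSPy API. It assumes the callable receives a batch (list) of strings and returns one vector per string. Its vectors reflect word overlap only, not meaning, so use it for wiring tests rather than real semantic search.

```python
import hashlib
import math
import re


def hashing_embed(texts, dim=64):
    """Hypothetical embedder: map each text to an L2-normalized bag-of-words hash vector."""
    vectors = []
    for text in texts:
        vec = [0.0] * dim
        for word in re.findall(r"\w+", text.lower()):
            # md5 gives a stable bucket across runs (built-in hash() is salted per process).
            bucket = int(hashlib.md5(word.encode("utf-8")).hexdigest(), 16) % dim
            vec[bucket] += 1.0
        norm = math.sqrt(sum(v * v for v in vec))
        # Empty texts have no words and stay as all-zero vectors.
        vectors.append([v / norm for v in vec] if norm else vec)
    return vectors
```

Under the same assumption about the callable's calling convention, dspy.Embedder(hashing_embed) would wrap it in place of a LiteLLM model string.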
dspy.retrievers.Embeddings is a local vector search retriever. Give it an Embedder and a corpus, and it builds an in-memory index for fast similarity search, using brute-force search for small corpora and FAISS for larger ones.
```python
import dspy

embedder = dspy.Embedder("openai/text-embedding-3-small", dimensions=512)
search = dspy.retrievers.Embeddings(
    corpus=corpus,  # list[str] of documents
    embedder=embedder,
    k=5,  # number of results to return (default 5)
)
```
Parameters:
- corpus (list[str]) -- the documents to index and search over
- embedder -- a dspy.Embedder instance
- k (int, default 5) -- default number of results to return
- brute_force_threshold (int, default 20000) -- corpus size above which FAISS indexing kicks in (below this, brute-force search)
- normalize (bool, default True) -- whether to normalize embeddings

Avoid re-embedding large corpora on every run:
```python
# Save after initial indexing
search.save("./my_embeddings")

# Load later without re-computing
search = dspy.retrievers.Embeddings.from_saved("./my_embeddings", embedder=embedder)

# Search
result = search("How do I reset my password?")
print(result.passages)  # list[str] of top-k matching documents

# Use in a module
class QA(dspy.Module):
    def __init__(self, search):
        super().__init__()
        self.search = search
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.search(question).passages
        return self.answer(context=context, question=question)
```
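To make normalize and brute_force_threshold more concrete: below the threshold, the retriever compares the query against every document. The pure-Python sketch below illustrates that idea (L2-normalize the vectors, then rank by dot product, which on unit vectors equals cosine similarity). It is an illustration of the technique, not DSPy's implementation.

```python
import math


def l2_normalize(vec):
    norm = math.sqrt(sum(v * v for v in vec))
    return [v / norm for v in vec] if norm else list(vec)


def brute_force_top_k(query_vec, doc_vecs, k=5):
    """Return indices of the k documents most similar to the query (cosine similarity)."""
    q = l2_normalize(query_vec)
    scores = []
    for idx, doc_vec in enumerate(doc_vecs):
        d = l2_normalize(doc_vec)
        # On unit vectors, the dot product equals cosine similarity.
        scores.append((sum(a * b for a, b in zip(q, d)), idx))
    # Highest score first; ties broken by the lower document index.
    scores.sort(key=lambda pair: (-pair[0], pair[1]))
    return [idx for _, idx in scores[:k]]
```

Normalizing first means documents are ranked by vector direction, not magnitude: for the query [1.0, 0.0], the document vector [10.0, 1.0] ranks below [1.0, 0.0] even though its raw dot product is ten times larger.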
| Scenario | Use |
|----------|-----|
| Quick prototyping with small-medium corpus | dspy.retrievers.Embeddings |
| Need a hosted, scalable retrieval server | dspy.ColBERTv2 |
| Already have a vector store (Pinecone, Chroma, etc.) | Subclass dspy.Retrieve |
| Need full control over embeddings | dspy.Embedder + your own vector store |
RAG is the most common use of retrieval in DSPy. The pattern: retrieve relevant passages, then generate an answer grounded in them.
```python
import dspy


class RAG(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        self.retrieve = retriever  # any callable returning a Prediction with .passages
        self.generate = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.generate(context=context, question=question)


# With Embeddings retriever (my_docs is your list[str] corpus; k is set on the retriever)
embedder = dspy.Embedder("openai/text-embedding-3-small", dimensions=512)
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=my_docs, k=5)
rag = RAG(retriever=search)

result = rag(question="How do refunds work?")
print(result.answer)
```
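Use dspy.Refine to retry generation when answers are not grounded in the retrieved context. Refine runs the module up to N times and returns the first prediction whose reward meets the threshold, or the best-scoring one if none does: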
```python
class GroundedRAG(dspy.Module):
    def __init__(self, retriever):
        super().__init__()
        self.retrieve = retriever
        self.generate = dspy.ChainOfThought(
            "context, question -> answer, cited_sources: list[int]"
        )

    def forward(self, question):
        passages = self.retrieve(question).passages
        result = self.generate(context=passages, question=question)
        return dspy.Prediction(
            answer=result.answer,
            cited_sources=result.cited_sources,
            passages=passages,
        )


def grounding_reward(args, pred):
    score = 1.0
    if not pred.cited_sources:
        score -= 0.3  # soft penalty for missing citations (0.7 falls below threshold=0.8)
    return score


grounded_rag = dspy.Refine(
    module=GroundedRAG(retriever=search), N=3, reward_fn=grounding_reward, threshold=0.8
)
```
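grounding_reward only checks that some citation exists. A stricter variant, sketched below as a hypothetical, also penalizes cited indices that do not point at a retrieved passage. It assumes cited_sources holds 0-based positions into pred.passages, which GroundedRAG itself does not enforce, so adjust the check if your prompt numbers passages from 1.

```python
def strict_grounding_reward(args, pred):
    """Hypothetical reward: 1.0 only if every cited index points at a retrieved passage."""
    cited = pred.cited_sources or []
    if not cited:
        return 0.7  # same soft penalty as grounding_reward for missing citations
    # Assumes 0-based indices into pred.passages.
    valid = [i for i in cited if isinstance(i, int) and 0 <= i < len(pred.passages)]
    return len(valid) / len(cited)  # fraction of citations that resolve to a passage
```

It plugs into the same call: dspy.Refine(module=GroundedRAG(retriever=search), N=3, reward_fn=strict_grounding_reward, threshold=0.8). With that threshold, a prediction with missing or half-invalid citations triggers another attempt.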
When a question needs information from multiple documents, chain retrieval steps:
```python
class MultiHopRAG(dspy.Module):
    def __init__(self, retriever, hops=2):
        super().__init__()
        self.retrieve = retriever
        # One query generator per hop, so each hop is a separate predictor
        self.generate_query = [
            dspy.ChainOfThought("context, question -> search_query")
            for _ in range(hops)
        ]
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = []
        for hop in self.generate_query:
            query = hop(context=context, question=question).search_query
            new_passages = self.retrieve(query).passages
            context = list(dict.fromkeys(context + new_passages))  # deduplicate, keep order
        return self.answer(context=context, question=question)
```
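To see how context grows across hops, here is a pure-Python trace of the same loop with hypothetical stubs standing in for the query generator and the retriever (no LM calls, toy data only). dict.fromkeys keeps the first occurrence of each passage in order, so a passage retrieved again on a later hop is not duplicated.

```python
def multi_hop_context(question, generate_query, retrieve, hops=2):
    """Accumulate deduplicated passages over several retrieval hops."""
    context = []
    for _ in range(hops):
        query = generate_query(context, question)
        new_passages = retrieve(query)
        # dict.fromkeys keeps first-seen order and drops repeats.
        context = list(dict.fromkeys(context + new_passages))
    return context


# Hypothetical stubs: a fixed lookup table instead of a real retriever and LM.
TOY_INDEX = {
    "who directed Film X": ["Film X was directed by Ana Ruiz.", "Film X premiered in 2001."],
    "where was Ana Ruiz born": ["Ana Ruiz was born in Lisbon.", "Film X was directed by Ana Ruiz."],
}


def toy_generate_query(context, question):
    # First hop asks about the film; later hops follow up on the director.
    return "who directed Film X" if not context else "where was Ana Ruiz born"


def toy_retrieve(query):
    return TOY_INDEX.get(query, [])
```

In MultiHopRAG, the LM-driven query generators play the role of toy_generate_query, deciding the next search from what has been found so far.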
There are two ways to wire up a retriever:

Option 1: configure a retrieval model globally.

```python
colbert = dspy.ColBERTv2(url="http://your-server:8893/api/search")
dspy.configure(lm=lm, rm=colbert)

# dspy.Retrieve() now uses colbert automatically
retriever = dspy.Retrieve(k=5)
```

Option 2: pass a retriever directly to your module.

```python
embedder = dspy.Embedder("openai/text-embedding-3-small")
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=docs, k=5)


class MyRAG(dspy.Module):
    def __init__(self):
        super().__init__()
        self.search = search  # use directly, no global config needed
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.search(question).passages
        return self.answer(context=context, question=question)
```
Option 2 is more explicit and avoids global state. Prefer it when your program uses a single retriever.
The k parameter controls how many passages to retrieve. It can be set at multiple levels:
```python
# At init time
retriever = dspy.Retrieve(k=5)

# Override per call
result = retriever("query", k=10)

# In Embeddings constructor
search = dspy.retrievers.Embeddings(embedder=embedder, corpus=docs, k=3)
```
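One way to choose k from evaluation results is to measure retrieval recall at several values: the fraction of questions whose gold passage appears in the top k. The minimal pure-Python sketch below assumes you already have ranked retrieval results and one known gold passage per question; the passage IDs are hypothetical toy data.

```python
def recall_at_k(ranked_results, gold_passages, k):
    """Fraction of queries whose gold passage appears in the first k results."""
    hits = sum(1 for ranked, gold in zip(ranked_results, gold_passages) if gold in ranked[:k])
    return hits / len(gold_passages)


# Hypothetical ranked retrieval output for three questions, and the gold passage for each.
ranked = [
    ["p1", "p7", "p3", "p9", "p2"],
    ["p4", "p5", "p6", "p8", "p0"],
    ["p2", "p3", "p1", "p6", "p5"],
]
gold = ["p1", "p8", "p6"]

for k in (1, 3, 5):
    print(f"recall@{k} = {recall_at_k(ranked, gold, k):.2f}")
```

A larger k that raises recall also adds tokens to the generation prompt, so weigh recall against end-to-end answer quality and cost rather than maximizing it alone.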
Choosing k:
- k=3 to k=5 for most tasks
- Higher k for questions that need broader context
- Lower k for faster inference and lower token costs
- Evaluate several values of k to find the best one for your specific task

dspy.experimental.Citations is a type (not a module) that you use as an OutputField to get structured source references from the LM. It works with Anthropic models that support native citations, or falls back to LM-generated citation extraction.
```python
import dspy
from dspy.experimental import Citations, Document


class AnswerWithSources(dspy.Signature):
    """Answer the question and cite the source documents."""

    documents: list[Document] = dspy.InputField()
    question: str = dspy.InputField()
    answer: str = dspy.OutputField()
    citations: Citations = dspy.OutputField()


lm = dspy.LM("anthropic/claude-sonnet-4-5-20250929")  # or "openai/gpt-4o", etc.
dspy.configure(lm=lm)

predictor = dspy.Predict(AnswerWithSources)
# docs: your list[Document]
result = predictor(documents=docs, question="What is the refund policy?")
# result.citations contains structured Citation objects with cited_text, document_index, etc.
```
When to use: RAG pipelines where claims need to trace back to source documents with exact quoted text and document indices.
Note: This is in dspy.experimental — the API may change. For broader anti-hallucination patterns, see /ai-stopping-hallucinations.
Common gotchas:
- Using dspy.Retrieve without configuring rm. Claude often writes dspy.Retrieve(k=5) without setting dspy.configure(rm=...) first. Without a configured retrieval model, calling Retrieve raises a confusing error. Either configure rm globally or pass a concrete retriever (like dspy.retrievers.Embeddings) directly to your module.
- Rebuilding dspy.retrievers.Embeddings(corpus=docs, embedder=embedder) in scripts without saving. For corpora over a few hundred docs, this wastes time and API calls. Use search.save("./embeddings") after initial indexing and Embeddings.from_saved("./embeddings", embedder=embedder) on subsequent runs.
- Forgetting .with_inputs() on RAG examples. When building training data for RAG optimization, Claude creates dspy.Example(question=q, answer=a) without calling .with_inputs("question"). Without it, DSPy cannot separate inputs from expected outputs, so evaluation and optimization break. Always chain .with_inputs() to mark which fields are inputs vs. expected outputs.
- Not returning dspy.Prediction from custom retrievers. When subclassing dspy.Retrieve, the forward() method must return dspy.Prediction(passages=[...]), not a list or dict. Returning the wrong type causes downstream modules to fail when they access .passages.
- Setting k too high. Claude often defaults to k=10 or higher, which can stuff too many passages into the generation prompt and exceed the LM context or degrade answer quality. Start with k=3 to k=5 and increase based on evaluation results.

Install any skill:

```bash
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
```

Related skills:
- /dspy-modules
- /dspy-qdrant
- /ai-searching-docs
- /ai-stopping-hallucinations

Install /ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work:

```bash
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do
```