skills/dspy-rlm/SKILL.md
Recursive Language Model (dspy.RLM) that explores large contexts via a sandboxed Python REPL -- the LM writes code, queries sub-LMs, and iterates until it produces a final answer. Use when your input is too large for the context window, the model needs to explore data iteratively, you need recursive self-refinement with code execution, or you have research-style tasks requiring programmatic investigation. Also used for recursive language model, iterative exploration with LLM, model explores data in REPL, agent that keeps digging until it finds the answer, REPL-based reasoning, explore then answer pattern, deep research agent, when one pass is not enough.
npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills dspy-rlm
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Guide the user through using DSPy's RLM (Recursive Language Model) module. RLM lets the LM explore data programmatically in a sandboxed Python REPL, writing code to examine inputs, querying sub-LMs for semantic analysis, and iterating until it produces a final answer.
Experimental. RLM is marked as experimental in DSPy. The API may change in future releases.
Before building with RLM, clarify:
dspy.RLM implements the Recursive Language Models approach (Zhang, Kraska, Khattab 2025). Instead of feeding the full input context into the LM's prompt, RLM:
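- Gives the LM metadata about the input and a sandboxed Python REPL in which to explore it with code.
- Lets the LM call llm_query(prompt) to do semantic analysis on slices of the data.
- Ends when the LM calls SUBMIT(output) with a final answer.
This makes RLM ideal for tasks where the input is too large for the context window, or where the LM needs to programmatically explore the data to find the answer.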
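The explore-then-submit loop described above can be sketched in plain Python. This is a toy illustration, not DSPy's implementation: the scripted `steps` stand in for code the LM would write, `llm_query` is a keyword-matching stub rather than a model call, and `exec` here is not sandboxed. The names `Submitted` and `run_steps` are hypothetical.

```python
import contextlib
import io


class Submitted(Exception):
    """Raised by SUBMIT to stop the loop and carry the final answer."""

    def __init__(self, output):
        super().__init__("submitted")
        self.output = output


def SUBMIT(output):
    raise Submitted(output)


def llm_query(prompt: str) -> str:
    # Stub for a sub-LM call; a real one would send `prompt` to a model.
    return "refund" if "refund" in prompt else "other"


def run_steps(context: str, steps: list[str], max_iterations: int = 20):
    """Execute scripted code steps against `context` until one calls SUBMIT."""
    namespace = {"context": context, "llm_query": llm_query, "SUBMIT": SUBMIT}
    trajectory = []
    for code in steps[:max_iterations]:
        buffer = io.StringIO()
        try:
            with contextlib.redirect_stdout(buffer):
                exec(code, namespace)  # NOT sandboxed; illustration only
        except Submitted as done:
            trajectory.append({"code": code, "output": buffer.getvalue()})
            return done.output, trajectory
        trajectory.append({"code": code, "output": buffer.getvalue()})
    return None, trajectory


# Scripted steps standing in for code the LM would write.
steps = [
    "print(type(context).__name__, len(context))",
    "lines = context.splitlines()\nprint(len(lines))",
    "labels = [llm_query(line) for line in lines]\nSUBMIT(labels.count('refund'))",
]
context = "please refund my order\nwhere is my parcel\nrefund not received"
answer, trajectory = run_steps(context, steps)
print(answer)
```

Each step only sees what it printed, which is why the loop records captured output alongside the code that produced it.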
| Scenario | Why RLM helps |
|----------|---------------|
| Very large input contexts (100K+ chars) | LM sees metadata, explores programmatically instead of stuffing the context |
| Data exploration tasks | LM writes code to search, filter, aggregate |
| Tasks requiring code + reasoning | Built-in REPL combines computation with LM reasoning |
| Multi-step analysis over structured data | LM can iterate, inspect intermediate results, refine approach |
When RLM is not the right fit:
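- Inputs that fit comfortably in the context window: use dspy.Predict or dspy.ChainOfThought.
- Agent-style tasks built around tool calls: use dspy.ReAct.
RLM's default sandbox requires Deno for the Pyodide WASM interpreter: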
curl -fsSL https://deno.land/install.sh | sh
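Since the default sandbox depends on Deno, it may be worth checking for it before constructing an RLM. The helper below is a minimal sketch using only the standard library; `deno_version` is a hypothetical name, not part of DSPy.

```python
import shutil
import subprocess
from typing import Optional


def deno_version() -> Optional[str]:
    """Return the first line of `deno --version`, or None if Deno is not on PATH."""
    deno_path = shutil.which("deno")
    if deno_path is None:
        return None
    completed = subprocess.run(
        [deno_path, "--version"], capture_output=True, text=True, check=False
    )
    lines = completed.stdout.splitlines()
    return lines[0] if lines else None


version = deno_version()
print(version or "Deno not found on PATH; install it before using dspy.RLM")
```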
import dspy
dspy.configure(lm=dspy.LM("openai/gpt-4o")) # or "anthropic/claude-sonnet-4-5-20250929", etc.
rlm = dspy.RLM("context, query -> answer")
result = rlm(
    context="...very long document or dataset...",
    query="What is the total revenue for Q3?",
)
print(result.answer)
The LM will:
- See metadata about context (type, length, first/last chars).
- Write and run Python code in the REPL to explore it.
- Call llm_query() for semantic questions about slices.
- Call SUBMIT(answer) when ready.
Constructor parameters:
dspy.RLM(
    signature,               # str | Signature -- required, defines inputs/outputs
    max_iterations=20,       # max REPL interaction loops
    max_llm_calls=50,        # max sub-LM query calls per execution
    max_output_chars=10_000, # max chars from REPL output per step
    verbose=False,           # enable detailed execution logging
    tools=None,              # list[Callable] -- custom tool functions
    sub_lm=None,             # dspy.LM -- separate (cheaper) LM for sub-queries
    interpreter=None,        # custom CodeInterpreter (defaults to PythonInterpreter)
)
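To see why max_output_chars matters, here is a rough sketch of capping REPL output at a character budget. How DSPy actually truncates, and what marker it appends, is an assumption here; `truncate_output` is a hypothetical helper.

```python
def truncate_output(output: str, max_output_chars: int = 10_000) -> str:
    """Keep at most `max_output_chars` characters and note how many were dropped."""
    if len(output) <= max_output_chars:
        return output
    omitted = len(output) - max_output_chars
    # The marker format below is made up for illustration.
    return output[:max_output_chars] + f"\n... [{omitted} chars omitted]"


short = truncate_output("total=42", max_output_chars=10)
long = truncate_output("a" * 25, max_output_chars=10)
print(short)
print(long)
```

Under a cap like this, printing a whole dataset wastes the step, so code in the REPL is better off printing counts, samples, or aggregates.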
When the LM writes code in the sandbox, these functions are available:
| Function | Purpose |
|----------|---------|
| llm_query(prompt) | Query the sub-LM with a prompt (up to ~500K chars) |
| llm_query_batched(prompts) | Concurrent multi-prompt queries |
| print() | Display REPL output (required to see results) |
| SUBMIT(output) | End execution and return the final answer |
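The following sketch shows the kind of code the LM might write with these functions: chunk the input, fan out sub-queries, aggregate, then submit. So that it runs outside the sandbox, `llm_query_batched` and `SUBMIT` are replaced by local stubs; the stub's counting logic and the sample `logs` are invented for illustration.

```python
def llm_query_batched(prompts: list[str]) -> list[str]:
    # Stub: a real sub-LM would answer each prompt. This one counts the word
    # "error" in the text after the "TEXT:" marker.
    return [str(p.split("TEXT:\n", 1)[1].lower().count("error")) for p in prompts]


submitted = []


def SUBMIT(output):
    # Stub: inside RLM this ends the run; here it only records the answer.
    submitted.append(output)


logs = "\n".join(
    f"line {i}: {'ERROR disk full' if i % 4 == 0 else 'ok'}" for i in range(12)
)

# --- the kind of code the LM might write inside the REPL ---
lines = logs.splitlines()
chunks = ["\n".join(lines[i:i + 5]) for i in range(0, len(lines), 5)]
print(f"{len(lines)} lines split into {len(chunks)} chunks")

prompts = [
    f"How many error lines are in this text? Reply with a number.\nTEXT:\n{chunk}"
    for chunk in chunks
]
counts = [int(answer) for answer in llm_query_batched(prompts)]
print(counts)
SUBMIT(sum(counts))
```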
Route expensive reasoning to a strong model while using a cheap model for sub-queries:
main_lm = dspy.LM("openai/gpt-4o") # or "anthropic/claude-sonnet-4-5-20250929", etc.
cheap_lm = dspy.LM("openai/gpt-4o-mini") # or any cheaper model
dspy.configure(lm=main_lm)
rlm = dspy.RLM("data, query -> summary", sub_lm=cheap_lm)
result = rlm(data=large_dataset, query="Summarize the key trends")
RLM supports DSPy's typed output fields, just like other modules:
rlm = dspy.RLM("logs -> error_count: int, critical_errors: list[str]")
result = rlm(logs=server_logs)
print(result.error_count) # int
print(result.critical_errors) # list[str]
Pass additional Python functions that the LM can call inside the sandbox:
def fetch_metadata(doc_id: str) -> str:
    """Look up metadata for a document by ID."""
    # `database` is assumed to be defined elsewhere in your application.
    return database.get_metadata(doc_id)
rlm = dspy.RLM("documents, query -> answer", tools=[fetch_metadata])
result = rlm(documents=docs, query="Which document has the latest revision?")
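The snippet above relies on a `database` object that is not shown. A self-contained variant, assuming a small in-memory dictionary in its place, might look like the sketch below; `METADATA` and its contents are hypothetical sample data.

```python
import json

# Hypothetical in-memory store standing in for `database`.
METADATA = {
    "doc-1": {"title": "Q3 report", "revised": "2024-09-30"},
    "doc-2": {"title": "Q4 plan", "revised": "2024-11-12"},
}


def fetch_metadata(doc_id: str) -> str:
    """Look up metadata for a document by ID and return it as a JSON string."""
    record = METADATA.get(doc_id)
    if record is None:
        # Return the problem as data so it shows up in REPL output.
        return json.dumps({"error": f"unknown doc_id: {doc_id}"})
    return json.dumps(record, sort_keys=True)


print(fetch_metadata("doc-2"))
print(fetch_metadata("missing"))
```

A function like this would be passed as tools=[fetch_metadata]. Returning an error string instead of raising is a design choice, not a DSPy requirement; it may make failures easier for the LM to read and recover from.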
After execution, inspect the code-execute-observe steps the LM took:
result = rlm(context=data, query="Find the outlier values")
for step in result.trajectory:
    print(f"Code:\n{step['code']}")
    print(f"Output:\n{step['output']}\n")
This is useful for debugging, understanding the LM's exploration strategy, and building trust in the result.
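One debugging aid is to flag steps that ran code but showed nothing, which usually means a missing print(). The sketch below assumes each step is a dict with 'code' and 'output' keys, as in the loop above; `silent_steps` is a hypothetical helper and the example trajectory is hand-written, not real RLM output.

```python
def silent_steps(trajectory: list[dict]) -> list[int]:
    """Return indexes of steps whose code ran but produced no visible output."""
    return [
        index
        for index, step in enumerate(trajectory)
        if (step.get("code") or "").strip() and not (step.get("output") or "").strip()
    ]


# Hand-written example data; a real run would supply result.trajectory.
example_trajectory = [
    {"code": "print(len(context))", "output": "18234\n"},
    {"code": "totals = [row.total for row in rows]", "output": ""},
    {"code": "print(max(totals))", "output": "912.5\n"},
]
print(silent_steps(example_trajectory))
```

A final step that only calls SUBMIT may also show no output, so a flagged last step is not necessarily a problem.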
RLM also supports async execution through aforward:
async def process():
    result = await rlm.aforward(context=data, query="Summarize findings")
    return result.answer
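When several inputs need processing, async calls can be run concurrently with a cap on how many are in flight. The sketch below uses a stub coroutine, `fake_aforward`, in place of `rlm.aforward` so it runs without a model. Whether one RLM instance can safely serve concurrent calls depends on the interpreter in use, so treat sharing an instance as an assumption to verify.

```python
import asyncio


async def fake_aforward(context: str, query: str) -> str:
    # Stub for `await rlm.aforward(...)`; sleeps instead of calling a model.
    await asyncio.sleep(0.01)
    return f"{query}: {len(context)} chars"


async def run_all(contexts: list[str], query: str, limit: int = 2) -> list[str]:
    """Run one call per context, with at most `limit` in flight at a time."""
    semaphore = asyncio.Semaphore(limit)

    async def bounded(context: str) -> str:
        async with semaphore:
            return await fake_aforward(context=context, query=query)

    # gather returns results in the same order as the inputs.
    return await asyncio.gather(*(bounded(c) for c in contexts))


answers = asyncio.run(run_all(["aaaa", "bb", "cccccc"], "size"))
print(answers)
```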
| Approach | Mechanism | Best for |
|----------|-----------|----------|
| RLM | LM writes code in a REPL to explore data, calls sub-LMs, iterates until SUBMIT() | Large contexts, data exploration, programmatic analysis |
| Refine (dspy.Refine) | Retry with feedback from a reward function until score threshold is met | Improving a single output with a known quality metric |
| Best-of-N | Generate N candidates, pick the best by a metric | When you want diversity of attempts and can score them |
| ChainOfThought | Single-pass step-by-step reasoning | Standard tasks that fit in context |
| Output validation | dspy.Refine / dspy.BestOfN | Reward-based retry with feedback (replaced dspy.Assert/dspy.Suggest in 3.x) |
Key difference: RLM gives the LM a code execution environment to actively explore the input, rather than just re-prompting with feedback. The LM decides its own exploration strategy.
RLM instances with custom interpreters are not thread-safe. For concurrent usage, create separate instances or use the default PythonInterpreter.
- Deno not installed. Run deno --version before running RLM code.
- Using RLM for small inputs. If the input fits in the context window, use dspy.ChainOfThought or dspy.Predict instead.
- Forgetting print() in REPL code. The LM must print() values to see REPL output -- assignments alone produce no visible result. If the LM's exploration seems stuck, check the trajectory for code that computes but never prints.
- Setting max_iterations too low for complex tasks. Claude tends to set max_iterations=5 for brevity. RLM defaults to 20 for a reason -- complex data exploration often needs 10-15 iterations. Only lower it for simple lookups.
- Not setting sub_lm for cost control. Every llm_query() call inside the REPL uses the main LM by default. For large-context tasks with many sub-queries, this gets expensive fast. Always set sub_lm to a cheaper model for semantic analysis calls.
Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
- /dspy-refine
- /dspy-best-of-n
- /ai-improving-accuracy
- /ai-building-pipelines
- Install /ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do