skills/dspy-langfuse/SKILL.md
LLM observability for DSPy with Langfuse -- auto-trace every LM call, attach scores and evaluations, run annotation queues for human review, and track experiments across prompt versions. Use when you want to set up Langfuse, langfuse.com, openinference-instrumentation-dspy, trace DSPy calls, LLM observability with scores, annotation queues, or experiment tracking. Also used for langfuse setup, pip install langfuse, DSPy trace viewer, langfuse vs phoenix, langfuse vs langtrace, observe decorator with DSPy, self-hosted tracing with evaluation, production LLM monitoring with scoring.
Before writing code, clarify whether the program is a short-lived script or a long-running server: scripts need langfuse.flush(); servers do not.

Langfuse is an open-source LLM observability platform with auto-instrumentation for DSPy via the OpenInference plugin. It traces every LM call, retrieval, and module execution, then adds evaluation, scoring, and annotation on top.
| Component | Details captured |
|-----------|-----------------|
| LM calls | Prompts, responses, token counts, latency, cost |
| Retrievals | Queries, passages, relevance |
| Module executions | Input/output per module step |
| Full pipeline | Nested spans showing the complete call tree |

| Feature | What it does |
|---------|-------------|
| Scores | Attach numeric, boolean, categorical, or text scores to any trace |
| Annotation queues | Structured human review workflows for building ground truth |
| Experiments | Compare prompt versions, measure score changes across runs |
| Sessions | Group multi-turn traces (chatbots, agents) |
| Environments | Separate dev/staging/prod traces |
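Use Langfuse when:
- You need scoring, annotation queues, and experiment tracking on top of tracing

Do NOT use Langfuse when:
- You want the easiest one-line auto-instrumentation: use /dspy-langtrace
- You want a local-first open-source trace viewer: use /dspy-phoenix
- Your team already uses W&B: use /dspy-weave
- You need the full ML lifecycle (registry, deploy): use /dspy-mlflow

pip install langfuse dspy openinference-instrumentation-dspy -U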
export LANGFUSE_PUBLIC_KEY="pk-lf-..."
export LANGFUSE_SECRET_KEY="sk-lf-..."
export LANGFUSE_BASE_URL="https://cloud.langfuse.com" # US: us.cloud.langfuse.com | EU: cloud.langfuse.com | JP: jp.cloud.langfuse.com | HIPAA: hipaa.cloud.langfuse.com
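The exports above can be sanity-checked from Python before any client is created. The following is a minimal, stdlib-only sketch: `missing_langfuse_env` and `base_url_for_region` are hypothetical helper names (not part of the Langfuse SDK), and the region hosts are copied from the comment on the `LANGFUSE_BASE_URL` line.

```python
import os

# Variables exported in the setup above (names taken from this document).
EXPECTED_VARS = ("LANGFUSE_PUBLIC_KEY", "LANGFUSE_SECRET_KEY", "LANGFUSE_BASE_URL")

# Region hosts as listed in the comment on the LANGFUSE_BASE_URL line.
REGION_HOSTS = {
    "us": "us.cloud.langfuse.com",
    "eu": "cloud.langfuse.com",
    "jp": "jp.cloud.langfuse.com",
    "hipaa": "hipaa.cloud.langfuse.com",
}


def missing_langfuse_env(env=None):
    """Return the expected variable names that are unset or empty."""
    env = os.environ if env is None else env
    return [name for name in EXPECTED_VARS if not env.get(name)]


def base_url_for_region(region):
    """Build the https base URL for a region key such as 'eu' or 'us'."""
    host = REGION_HOSTS[region.strip().lower()]
    return f"https://{host}"
```

Calling `missing_langfuse_env()` with no argument inspects the real process environment. An unknown region key raises `KeyError`, which is one way to surface a typo early rather than silently defaulting to a region.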
import dspy
from langfuse import get_client
from openinference.instrumentation.dspy import DSPyInstrumentor
# 1. Verify Langfuse credentials
langfuse = get_client()
langfuse.auth_check()
# 2. Auto-instrument DSPy (one line)
DSPyInstrumentor().instrument()
# 3. Configure DSPy
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini")) # or "anthropic/claude-sonnet-4-5-20250929", etc.
# 4. Use DSPy normally -- all calls are traced
program = dspy.ChainOfThought("question -> answer")
result = program(question="What is DSPy?")
# View traces at https://cloud.langfuse.com
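The quick start above depends on three separately installed packages, and the instrumentor lives in its own package. As a rough preflight check, the sketch below uses only the standard library to report which import paths cannot be found. `missing_modules` is a hypothetical helper name; the module paths are the ones imported in the quick start.

```python
import importlib.util


def missing_modules(module_names):
    """Return the module names that cannot be found in the current environment."""
    missing = []
    for name in module_names:
        try:
            found = importlib.util.find_spec(name) is not None
        except (ImportError, ValueError):
            # find_spec raises for dotted names when a parent package is absent.
            found = False
        if not found:
            missing.append(name)
    return missing


# Import paths used by the quick start above.
QUICK_START_MODULES = ["dspy", "langfuse", "openinference.instrumentation.dspy"]

# Example: print(missing_modules(QUICK_START_MODULES))
```

An empty list means all three imports should resolve. A non-empty list names what still needs to be installed.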
Use @observe() and propagate_attributes() to enrich auto-captured traces with user IDs, session IDs, tags, and custom metadata:
from langfuse import observe, propagate_attributes

@observe()
def answer_question(question: str):
    with propagate_attributes(
        user_id="user_123",
        session_id="session_abc",
        tags=["production", "qa-pipeline"],
        metadata={"pipeline_version": "2.1"},
        version="2.1",
    ):
        program = dspy.ChainOfThought("question -> answer")
        return program(question=question)

result = answer_question("How do refunds work?")
For non-decorator workflows (batch jobs, scripts):
from langfuse import get_client, propagate_attributes

langfuse = get_client()

with langfuse.start_as_current_observation(as_type="span", name="batch-qa"):
    with propagate_attributes(
        user_id="batch_user",
        session_id="batch_001",
        metadata={"batch_size": "100"},  # propagated metadata values should be strings
    ):
        program = dspy.ChainOfThought("question -> answer")
        result = program(question="What is DSPy?")

langfuse.flush()  # Required for short-lived scripts
# Use langfuse.shutdown() instead if the process is exiting (also terminates background threads)
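In a batch script, an exception partway through would skip a trailing `flush()` call. One way to guard against that is a small context manager that flushes in a `finally` block. This is a sketch under assumptions: `flushed` is a hypothetical helper, and `FakeClient` is a stand-in so the example runs without the SDK. In a real script the object returned by `get_client()` would be passed instead.

```python
from contextlib import contextmanager


@contextmanager
def flushed(client):
    """Yield the client, then call client.flush() even if the body raises."""
    try:
        yield client
    finally:
        client.flush()


class FakeClient:
    """Stand-in for a Langfuse client so the sketch runs without the SDK."""

    def __init__(self):
        self.flush_calls = 0

    def flush(self):
        self.flush_calls += 1


fake = FakeClient()
with flushed(fake):
    pass  # run the batch job here
```

The same shape works with `shutdown()` in place of `flush()` when the process is about to exit.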
DSPy prompts can contain sensitive data or be very large. Disable input/output capture on specific observations:
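@observe(capture_input=False, capture_output=False)
def process_pii_data(user_data: str):
    # Traces timing and structure but not the actual data for this observation.
    # Auto-instrumented DSPy child spans may still record their own inputs and outputs.
    # `program` is a DSPy module built as in the earlier examples.
    return program(question=user_data)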
Or disable globally via environment variable:
export LANGFUSE_OBSERVE_DECORATOR_IO_CAPTURE_ENABLED=false
Langfuse scores attach evaluation results to traces. Score types: NUMERIC, CATEGORICAL, BOOLEAN, TEXT.
from langfuse import get_client

langfuse = get_client()

# Score a trace after evaluation (create_score is the method on the get_client() SDK generation)
langfuse.create_score(
    trace_id="trace-id-from-dashboard",
    name="accuracy",
    value=0.92,
    data_type="NUMERIC",
    comment="Verified against ground truth",
)
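When scores come from several metrics, a small helper can choose `data_type` from the Python value so call sites stay consistent. The mapping below is an assumption for illustration, not SDK behavior: booleans become `BOOLEAN` with a 0/1 value, numbers become `NUMERIC`, and strings default to `CATEGORICAL` (pass `data_type="TEXT"` explicitly for free text). `score_kwargs` is a hypothetical helper name.

```python
def score_kwargs(name, value, data_type=None):
    """Build keyword arguments for a score call, inferring data_type if omitted."""
    if data_type is None:
        if isinstance(value, bool):  # check bool first: bool is a subclass of int
            data_type = "BOOLEAN"
        elif isinstance(value, (int, float)):
            data_type = "NUMERIC"
        elif isinstance(value, str):
            data_type = "CATEGORICAL"
        else:
            raise TypeError(f"unsupported score value: {value!r}")
    if data_type == "BOOLEAN":
        value = 1 if value else 0  # assumption: boolean scores are sent as 0 or 1
    return {"name": name, "value": value, "data_type": data_type}
```

The returned dictionary can be unpacked into the score call above together with a `trace_id`.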
After running dspy.Evaluate, push results to Langfuse for tracking:
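from dspy.evaluate import Evaluate
from langfuse import get_client

langfuse = get_client()
evaluator = Evaluate(devset=devset, metric=metric, num_threads=4)

# Run the evaluation inside a span so there is an active trace to attach the score to
with langfuse.start_as_current_observation(as_type="span", name="dspy-eval"):
    result = evaluator(my_program)
    # Newer DSPy releases return a result object with a .score attribute; older ones return a float
    score = getattr(result, "score", result)

    # Log the aggregate score
    langfuse.create_score(
        trace_id=langfuse.get_current_trace_id(),
        name="dspy_eval_accuracy",
        value=float(score),
        data_type="NUMERIC",
    )

langfuse.flush()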
A complete example, tracing a RAG pipeline with user attributes:

import dspy
from langfuse import observe, propagate_attributes
from openinference.instrumentation.dspy import DSPyInstrumentor

DSPyInstrumentor().instrument()
dspy.configure(lm=dspy.LM("openai/gpt-4o-mini"))  # or "anthropic/claude-sonnet-4-5-20250929", etc.

class RAGPipeline(dspy.Module):
    def __init__(self):
        super().__init__()
        # dspy.Retrieve needs a retrieval model configured, e.g. dspy.configure(rm=...)
        self.retrieve = dspy.Retrieve(k=3)
        self.answer = dspy.ChainOfThought("context, question -> answer")

    def forward(self, question):
        context = self.retrieve(question).passages
        return self.answer(context=context, question=question)

@observe()
def search_docs(question: str):
    with propagate_attributes(user_id="user_456", tags=["rag"]):
        pipeline = RAGPipeline()
        return pipeline(question=question)

result = search_docs("How do refunds work?")
# Trace shows: RAGPipeline -> Retrieve -> ChainOfThought (nested spans)
| Feature | Langfuse | Arize Phoenix | Langtrace | W&B Weave |
|---------|----------|---------------|-----------|-----------|
| DSPy auto-instrumentation | Yes (OpenInference plugin) | Yes (same plugin) | Yes (built-in) | Manual (@weave.op()) |
| Setup effort | 3 lines + env vars | 2 lines + launch_app() | 1 line | Decorator per function |
| Local mode (no cloud) | Yes (self-hosted Docker) | Yes (px.launch_app()) | Yes (Docker) | No |
| Cloud option | Yes (managed, multi-region) | Yes (Arize platform) | Yes (app.langtrace.ai) | Yes (wandb.ai) |
| Built-in scoring/evals | Yes (4 score types) | Yes (evals module) | Basic | Yes (feedback) |
| Annotation queues | Yes | No | No | No |
| Experiment tracking | Yes | Basic | No | Yes |
| Session grouping | Yes | No | No | No |
| Best for | Tracing + evaluation + human review | Local trace viewer + evals | Easiest DSPy-first setup | Teams already on W&B |
Want DSPy observability?
|
+- Need scoring + annotation queues + experiments? -> Langfuse (/dspy-langfuse)
+- Want local-first open-source trace viewer? -> Phoenix (/dspy-phoenix)
+- Want easiest one-line auto-instrumentation? -> Langtrace (/dspy-langtrace)
+- Team already uses W&B? -> Weave (/dspy-weave)
+- Need full ML lifecycle (registry, deploy)? -> MLflow (/dspy-mlflow)
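The tree above reads top to bottom, so the first matching branch wins. As an illustration only, the same routing can be written as a small lookup. The need labels (`"scoring_annotation_experiments"` and so on) are hypothetical keys invented for this sketch; the skill paths are the ones in the tree.

```python
def pick_observability_skill(needs):
    """Return the skill path for the first matching need, following the tree top to bottom."""
    branches = [
        ("scoring_annotation_experiments", "/dspy-langfuse"),
        ("local_first_viewer", "/dspy-phoenix"),
        ("one_line_instrumentation", "/dspy-langtrace"),
        ("already_on_wandb", "/dspy-weave"),
        ("full_ml_lifecycle", "/dspy-mlflow"),
    ]
    for need, skill in branches:
        if need in needs:
            return skill
    return None  # no branch matched
```

Because the branches are ordered, a team that needs annotation queues and also uses W&B is routed to `/dspy-langfuse` first, matching the order of the tree.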
- Forgetting langfuse.flush() in scripts and notebooks. Langfuse sends traces asynchronously in the background. In long-running servers this is fine, but in scripts, notebooks, and batch jobs the process can exit before traces are sent. Always call langfuse.flush() (or langfuse.shutdown()) at the end of short-lived processes.
- Installing langfuse but forgetting openinference-instrumentation-dspy. The DSPy auto-instrumentation lives in a separate package. Without it, DSPyInstrumentor is not available and no DSPy spans are captured. Install both: pip install langfuse openinference-instrumentation-dspy.
- Calling DSPyInstrumentor().instrument() after dspy.configure() and DSPy calls. The instrumentor must be activated before any DSPy module runs. Calls made before instrumentation are not captured. Always instrument first, then configure DSPy, then run modules.
- Hardcoding LANGFUSE_BASE_URL to US cloud for all users. Langfuse has region-specific endpoints: US (us.cloud.langfuse.com), EU (cloud.langfuse.com), Japan (jp.cloud.langfuse.com), HIPAA (hipaa.cloud.langfuse.com), and self-hosted URLs. Always ask the user which region or instance they use, or read it from environment variables rather than hardcoding.
- Creating a new get_client() instance in every function. get_client() returns a singleton, so calling it multiple times is safe but adds clutter. Call it once at module level or in setup, then reuse the reference.
- Debugging missing traces without logs. Set export LANGFUSE_DEBUG="True" to get verbose logging that shows whether traces are being sent and any API errors. This is the fastest way to diagnose missing traces.

Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
Related skills: /dspy-langtrace, /dspy-phoenix, /dspy-weave, /dspy-mlflow, /dspy-langwatch, /ai-monitoring, /ai-tracing-requests

Install /ai-do if you do not have it. It routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-do