skills/ai-serving-apis/SKILL.md
Put your AI behind an API. Use when you need to serve AI features as web endpoints, add AI to an existing backend, deploy AI for other services to call, wrap a DSPy program in REST or HTTP, build an AI microservice, or put a language model behind FastAPI or Flask. Also use for deploy AI model to production, AI REST API, serve DSPy program over HTTP, Docker AI service, AI endpoint for mobile app, how to productionize my AI, LLM behind a web API, AI microservice architecture, AI backend for React app, put my AI in production, AI API for frontend to call.
npx skillsauth add lebsral/dspy-programming-not-prompting-lms-skills ai-serving-apisInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Wrap a DSPy program in a web API so other services or a frontend can call it over HTTP. Defaults to FastAPI but adapts to the user's existing framework.
Ask the user:
optimized.json from /ai-improving-accuracy?)| Deployment pattern | When to use | |-------------------|-------------| | FastAPI + Docker | Default for production microservices — most teams, most cases | | Flask/Django integration | When adding AI to an existing backend — avoid a second service | | Serverless (Lambda, Cloud Run) | Low-traffic or spiky workloads — pay per invocation, cold starts acceptable | | Direct import (no API) | Internal tooling, notebooks, scripts — skip HTTP entirely |
Recommended layout — keep DSPy logic separate from API code:
project/
├── program.py # DSPy module (already exists from /ai-kickoff)
├── server.py # FastAPI app — routes and startup
├── models.py # Pydantic request/response schemas
├── config.py # Environment configuration
├── optimized.json # Saved optimized program (if available)
├── requirements.txt
├── Dockerfile
└── .env.example
# models.py
from pydantic import BaseModel, Field
class QueryRequest(BaseModel):
"""Request to the AI endpoint."""
query: str = Field(..., description="The input to process", min_length=1)
# Optional: let callers override the model per request
model: str | None = Field(None, description="Override the default LM")
temperature: float | None = Field(None, ge=0, le=2, description="Override temperature")
class QueryResponse(BaseModel):
"""Response from the AI endpoint."""
answer: str
# Include whatever your DSPy program outputs
# reasoning: str | None = None
# confidence: float | None = None
class HealthResponse(BaseModel):
status: str = "ok"
model: str
optimized: bool
# server.py
from contextlib import asynccontextmanager
import dspy
from fastapi import FastAPI
from program import MyProgram
from config import settings
@asynccontextmanager
async def lifespan(app: FastAPI):
"""Load DSPy program once at startup."""
# Configure the default LM
lm = dspy.LM(settings.model_name)
dspy.configure(lm=lm)
# Load the program (with optimization if available)
app.state.program = MyProgram()
app.state.optimized = False
try:
app.state.program.load(settings.program_path)
app.state.optimized = True
print(f"Loaded optimized program from {settings.program_path}")
except FileNotFoundError:
print("Running unoptimized program")
yield # Server runs here
app = FastAPI(title="My AI API", lifespan=lifespan)
# server.py (continued)
from fastapi import HTTPException
from models import QueryRequest, QueryResponse, HealthResponse
@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
"""Run the AI program on input."""
program = app.state.program
# If caller wants a different model, use dspy.context for this request only
if request.model or request.temperature is not None:
lm_kwargs = {}
if request.model:
lm_kwargs["model"] = request.model
if request.temperature is not None:
lm_kwargs["temperature"] = request.temperature
override_lm = dspy.LM(**lm_kwargs) if request.model else dspy.LM(
settings.model_name, temperature=request.temperature
)
with dspy.context(lm=override_lm):
result = program(query=request.query)
else:
result = program(query=request.query)
return QueryResponse(answer=result.answer)
@app.get("/health", response_model=HealthResponse)
async def health():
return HealthResponse(
model=settings.model_name,
optimized=app.state.optimized,
)
For processing multiple inputs at once:
@app.post("/query/batch", response_model=list[QueryResponse])
async def query_batch(requests: list[QueryRequest]):
"""Process multiple inputs."""
program = app.state.program
results = []
for req in requests:
result = program(query=req.query)
results.append(QueryResponse(answer=result.answer))
return results
Map DSPy errors to appropriate HTTP status codes:
@app.post("/query", response_model=QueryResponse)
async def query(request: QueryRequest):
program = app.state.program
try:
if request.model or request.temperature is not None:
override_lm = dspy.LM(
request.model or settings.model_name,
temperature=request.temperature,
)
with dspy.context(lm=override_lm):
result = program(query=request.query)
else:
result = program(query=request.query)
return QueryResponse(answer=result.answer)
except Exception as e:
error_msg = str(e).lower()
# dspy.Refine raises when fail_count is exhausted -- treat as validation failure
if "refine" in error_msg or "reward" in error_msg or "fail_count" in error_msg:
raise HTTPException(status_code=422, detail=f"Output validation failed: {e}")
if "rate limit" in error_msg or "429" in error_msg:
raise HTTPException(status_code=429, detail="Rate limited by AI provider")
if "timeout" in error_msg:
raise HTTPException(status_code=504, detail="AI provider timed out")
raise HTTPException(status_code=500, detail="Internal error processing request")
Use pydantic-settings to manage configuration:
# config.py
from pydantic_settings import BaseSettings
class Settings(BaseSettings):
model_name: str = "openai/gpt-4o-mini" # or "anthropic/claude-sonnet-4-5-20250929", etc.
program_path: str = "optimized.json"
api_key: str = "" # Set via environment variable
model_config = {"env_prefix": "AI_"}
settings = Settings()
# .env.example
AI_MODEL_NAME=openai/gpt-4o-mini # or anthropic/claude-sonnet-4-5-20250929, etc.
AI_PROGRAM_PATH=optimized.json
AI_API_KEY=your-api-key-here
pip install fastapi uvicorn pydantic-settings
uvicorn server:app --reload --port 8000
Visit http://localhost:8000/docs for auto-generated API docs.
FROM python:3.11-slim
WORKDIR /app
COPY requirements.txt .
RUN pip install --no-cache-dir -r requirements.txt
COPY . .
EXPOSE 8000
CMD ["uvicorn", "server:app", "--host", "0.0.0.0", "--port", "8000"]
dspy>=2.5
fastapi>=0.100
uvicorn[standard]
pydantic-settings>=2.0
Add provider-specific packages as needed (e.g., openai, anthropic).
# docker-compose.yml
services:
api:
build: .
ports:
- "8000:8000"
env_file: .env
volumes:
- ./optimized.json:/app/optimized.json:ro
After starting the server, test the endpoints:
# Health check
curl http://localhost:8000/health
# Query endpoint
curl -X POST http://localhost:8000/query \
-H "Content-Type: application/json" \
-d '{"query": "test question"}'
# Check auto-generated docs
open http://localhost:8000/docs
dspy.context() for per-request overrides. Isolates model/temperature changes without affecting other concurrent requests — critical because dspy.configure() sets global state.program.py independent — the same module runs in scripts, tests, and the API.dspy.Refine exhaustion → 422, rate limits → 429, timeouts → 504.# After optimization
optimized_program.save("./artifacts/v1.json")
# At server startup
program = MyProgram()
program.load("./artifacts/v1.json")
The save()/load() API serializes optimized prompts, demos, and weights — no training data or optimizer needed at deploy time.
import mlflow
mlflow.dspy.autolog() # auto-traces all DSPy calls
mlflow.set_experiment("production-qa-api")
This gives you latency breakdowns, token counts, and full prompt/response logs per request. For the full MLflow guide, see /dspy-mlflow.
dspy.configure() sets global state. For concurrent requests with per-request overrides, always use dspy.context():
@app.post("/query")
async def query(request: QueryRequest):
if request.model:
with dspy.context(lm=dspy.LM(request.model)):
result = program(query=request.query)
else:
result = program(query=request.query)
return QueryResponse(answer=result.answer)
dspy.LM() instance on every request. Claude tends to put dspy.LM() inside the route handler. LM initialization has overhead (connection pooling, auth validation). Configure the default LM once at startup; only create per-request LM instances when the caller explicitly overrides the model via dspy.context().program.load() reads from disk and deserializes — doing it per request adds latency and can cause file handle exhaustion under load. Always load in the lifespan handler and store on app.state.async def routes but calling DSPy synchronously. DSPy LM calls are blocking I/O. In an async def FastAPI route, a blocking call ties up the event loop. Either use def (sync) routes so FastAPI runs them in a thread pool, or wrap DSPy calls in asyncio.to_thread().dspy.configure() is global state. Claude often calls dspy.configure() inside a route to change the model per request. This mutates global state and causes race conditions under concurrent load. Use dspy.context() (context manager) for per-request overrides instead.Prediction objects from the API. Claude sometimes returns result directly instead of mapping to a Pydantic response model. Prediction objects are not JSON-serializable by FastAPI — always extract the fields you need into your response model.Install any skill:
npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill <name>
/ai-kickoff/ai-searching-docs/ai-monitoring/ai-cutting-costs/dspy-mlflow/dspy-signatures/ai-fixing-errors/ai-do if you do not have it — it routes any AI problem to the right skill and is the fastest way to work: npx skills add lebsral/DSPy-Programming-not-prompting-LMs-skills --skill ai-dotools
See what is happening during optimizer.compile() instead of waiting blind. Use when you want to watch optimization progress, see scores as they come in, know if your optimizer is working, check if optimization is stuck, understand why optimization is taking too long, get live progress during compile, monitor convergence, detect overfitting during optimization, interpret optimization results, or pick the right tool for watching optimization. Also used for optimizer progress bar, is my optimizer doing anything, optimization seems stuck, how long will optimization take, watch GEPA run, watch MIPROv2 run, live optimization dashboard, optimizer not improving, scores not going up, optimization taking forever, see what optimizer is doing, debug slow optimization, optimization visibility, optimizer metrics, track compile progress, optimization observability.
testing
Use when you want the highest-quality prompt optimization DSPy offers — jointly optimizes instructions and few-shot demos, with auto=light/medium/heavy presets. Common scenarios - you want the best possible accuracy from prompt optimization, jointly tuning instructions and few-shot demonstrations, using auto presets for different compute budgets, or when COPRO or BootstrapFewShot alone are not reaching your accuracy target. Related - ai-improving-accuracy, dspy-copro, dspy-bootstrap-few-shot. Also used for dspy.MIPROv2, best DSPy optimizer, highest quality optimization, auto=light medium heavy, joint instruction and demo optimization, most powerful prompt optimizer, MIPROv2 vs COPRO vs BootstrapFewShot, which optimizer should I use, state of the art prompt optimization, when to use MIPROv2, optimize both instructions and examples, heavy optimization for production, best optimizer for accuracy.
testing
Use LangWatch for DSPy auto-tracing and real-time optimizer progress. Use when you want to set up LangWatch, langwatch.dspy.init, auto-tracing DSPy, real-time optimization dashboard, optimizer progress tracking, app.langwatch.ai, or DSPy optimizer dashboard. Also used for langwatch setup, pip install langwatch, langwatch trace, optimizer progress, real-time optimization, watch optimizer run, LangWatch self-hosted, langwatch docker, langwatch vs langtrace, langwatch autotrack_dspy.
data-ai
Use when you want to optimize instructions without few-shot examples — a lightweight alternative to COPRO when you do not have or do not want to use demonstrations. Common scenarios - optimizing instructions when you do not have or do not want to use few-shot demonstrations, lightweight instruction search as a first step, tasks where examples in the prompt confuse the model, or when you want fast instruction optimization without the cost of COPRO. Related - ai-improving-accuracy, dspy-copro, dspy-miprov2. Also used for dspy.GEPA, instruction optimization without demos, lightweight prompt optimization, optimize instructions only, no few-shot examples needed, GEPA vs COPRO, quick instruction search, when demonstrations hurt performance, zero-shot optimization, instruction-only optimizer, simplest instruction tuner, fast prompt optimization, skip few-shot and just tune instructions, optimize Pydantic field descriptions, GEPA structured output, GEPA does not optimize field desc.