skills/litellm-provider-extension/SKILL.md
Pi extension for LiteLLM proxy provider integration
Create a production-ready Pi extension that registers a LiteLLM proxy as a model provider with automatic model discovery and parameter compatibility handling.
This skill implements a generic extension that:
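- Discovers available models from the /v1/models endpoint

Pi's Provider System:
- Providers are registered via pi.registerProvider()
- Each provider defines baseUrl, apiKey, api type, and models array
- Each model defines id, name, reasoning, input, cost, contextWindow, maxTokens, compat flags

LiteLLM Challenges:
- Not every backend accepts the store or prompt_cache_key parameters

Required environment variables: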
export OPENAI_BASE_URL="https://your-litellm-proxy.example.com"
export OPENAI_API_KEY="your-api-key"
Fetch models from LiteLLM:
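const response = await fetch(`${baseUrl}/v1/models`, {
  headers: {
    'Authorization': `Bearer ${apiKey}`,
    'Content-Type': 'application/json',
  },
});
// Fail early on auth or routing errors instead of parsing an error body
if (!response.ok) {
  throw new Error(`Failed to fetch models: ${response.status} ${response.statusText}`);
}
const data = await response.json();
const models = data.data; // Array of { id, object, created, owned_by }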
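The registration call later in this skill passes a `normalizedBaseUrl`, and the response shape above invites some defensive parsing. The following is a minimal sketch of both steps, assuming the proxy URL may be configured with a trailing slash or a `/v1` suffix. `normalizeBaseUrl` and `extractModelIds` are hypothetical helper names, and the normalization rule is an assumption rather than what the reference implementation does.

```typescript
// Hypothetical helpers: the names and the normalization rule are assumptions.

interface ModelListEntry {
  id: string;
  object?: string;
  created?: number;
  owned_by?: string;
}

// Strip trailing slashes and a trailing "/v1" so that `${baseUrl}/v1/models`
// is never built with a doubled path segment.
function normalizeBaseUrl(raw: string): string {
  return raw.trim().replace(/\/+$/, "").replace(/\/v1$/, "");
}

// Pull model IDs out of a parsed /v1/models body, skipping malformed entries.
function extractModelIds(body: unknown): string[] {
  const data = (body as { data?: unknown } | null)?.data;
  if (!Array.isArray(data)) return [];
  return data
    .filter((m): m is ModelListEntry => typeof m?.id === "string" && m.id.length > 0)
    .map((m) => m.id);
}
```

Keeping the parsing separate from the `fetch` call makes it possible to check the behavior against a saved response without a running proxy.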
Known patterns from testing:
| Model Pattern | store | prompt_cache_key | Backend |
|--------------|---------|-------------------|---------|
| claude* | ✅ | ❌ | Databricks |
| gemini* | ✅ | ❌ | Databricks |
| gpt* | ✅ | ✅ | Azure/OpenAI |
| databricks* | ✅ | ❌ | Databricks |
| Others | ✅ | ✅ | Default (fail-open) |
Implementation:
function getParameterSupport(modelId: string) {
  if (modelId.includes("claude") || modelId.includes("gemini") || modelId.includes("databricks")) {
    return { store: true, prompt_cache_key: false };
  }
  return { store: true, prompt_cache_key: true };
}
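The streaming wrapper shown later looks up support by model ID in a `Map`, so the rule above has to be applied once per discovered model. A minimal sketch of that step follows; `buildParameterSupportMap` is a hypothetical name, and lowercasing the ID before matching is an assumption that the function above does not make.

```typescript
interface ParameterSupport {
  store: boolean;
  prompt_cache_key: boolean;
}

// Same rule as above, repeated so this sketch is self-contained.
function getParameterSupport(modelId: string): ParameterSupport {
  const id = modelId.toLowerCase(); // assumption: match case-insensitively
  if (id.includes("claude") || id.includes("gemini") || id.includes("databricks")) {
    return { store: true, prompt_cache_key: false };
  }
  return { store: true, prompt_cache_key: true }; // fail-open default
}

// Hypothetical helper: one entry per model ID returned by /v1/models.
function buildParameterSupportMap(modelIds: string[]): Map<string, ParameterSupport> {
  const support = new Map<string, ParameterSupport>();
  for (const id of modelIds) {
    support.set(id, getParameterSupport(id));
  }
  return support;
}

const supportMap = buildParameterSupportMap(["claude-haiku-4-5", "gpt-5.2", "gemini-2-5-pro"]);
```

With the three model IDs used in the test commands of this skill, only `gpt-5.2` keeps `prompt_cache_key` enabled.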
Why these patterns:
- Databricks-served models reject prompt_cache_key
- Error returned: "message":"prompt_cache_key: Extra inputs are not permitted"

Reasoning models:
- opus, with-reasoning
- gpt-5, o1, o3

Multimodal (text + images):
- claude, gpt-4, gpt-5, gemini, sonnet, opus, haiku, nova

Context windows:
- opus: 200K
- gemini-2: 1M
- gpt-5: 200K

Max output tokens:
- opus-4.6: 8192
- gpt-5: 8192
- haiku: 4096
- sonnet: 4096

Why conservative? Databricks reserves output capacity based on max_tokens. Lower values = higher admission success rate.
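The pattern lists above can be folded into one lookup. This is a sketch of what a `detectModelCapabilities` helper might look like, not the reference implementation: the `ModelCapabilities` shape, the `"text" | "image"` encoding of `input`, and the 128K default context window are all assumptions.

```typescript
// Assumed shape; the field names follow the model properties listed earlier.
interface ModelCapabilities {
  reasoning: boolean;
  input: ("text" | "image")[];
  contextWindow: number;
  maxTokens: number;
}

const hasAny = (id: string, patterns: string[]): boolean =>
  patterns.some((p) => id.includes(p));

function detectModelCapabilities(modelId: string): ModelCapabilities {
  const id = modelId.toLowerCase();

  // Substring checks are loose: "o1" or "o3" can match unrelated IDs.
  const reasoning = hasAny(id, ["opus", "with-reasoning", "gpt-5", "o1", "o3"]);
  const multimodal = hasAny(id, [
    "claude", "gpt-4", "gpt-5", "gemini", "sonnet", "opus", "haiku", "nova",
  ]);

  // Context windows from the list above; the 128K default is an assumption.
  let contextWindow = 128_000;
  if (id.includes("opus") || id.includes("gpt-5")) contextWindow = 200_000;
  if (id.includes("gemini-2")) contextWindow = 1_000_000;

  // Conservative output limits from the list above; 4096 is the default.
  let maxTokens = 4096;
  if (id.includes("opus-4.6") || id.includes("gpt-5")) maxTokens = 8192;

  return {
    reasoning,
    input: multimodal ? ["text", "image"] : ["text"],
    contextWindow,
    maxTokens,
  };
}
```

Note that an ID written with hyphens, such as `claude-opus-4-6`, does not contain the `opus-4.6` pattern and falls back to 4096 output tokens in this sketch, so it is worth checking the patterns against the exact IDs the proxy returns.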
The key to parameter filtering is a custom streamSimple function:
function createLiteLLMStream(parameterSupport: Map<string, ParameterSupport>) {
  return function streamLiteLLM(model, context, options) {
    const stream = createAssistantMessageEventStream();
    (async () => {
      try {
        // Unknown models fail open: a missing entry means nothing is filtered
        const support = parameterSupport.get(model.id);
        const wrappedOptions = { ...options };
        // Filter sessionId if prompt_cache_key not supported
        if (support && !support.prompt_cache_key && wrappedOptions.sessionId) {
          delete wrappedOptions.sessionId;
        }
        // Use built-in Azure streaming with filtered options
        const piAi = await import("@mariozechner/pi-ai");
        const azureStream = piAi.streamAzureOpenAIResponses(model, context, wrappedOptions);
        // Forward all events
        for await (const event of azureStream) {
          stream.push(event);
        }
      } finally {
        // Always close the stream so the consumer is never left waiting
        stream.end();
      }
    })();
    return stream;
  };
}
Why this approach:
- Reuses the built-in streamAzureOpenAIResponses implementation
- The sessionId option becomes the prompt_cache_key parameter in the Azure provider
- The compat.supportsStore flag handles the store parameter automatically

Register the provider:
pi.registerProvider("litellm", {
  baseUrl: normalizedBaseUrl,
  apiKey,
  api: "azure-openai-responses", // Use Azure API (most compatible)
  authHeader: true, // Add Authorization: Bearer header
  models,
  streamSimple: createLiteLLMStream(parameterSupport),
});
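The option filtering inside the stream wrapper is the only part of this flow with real logic, and it can be isolated as a pure function so it is easy to check without a proxy. The sketch below assumes options are a plain object carrying an optional `sessionId`; `StreamOptions` and `filterStreamOptions` are hypothetical names, not Pi types.

```typescript
interface ParameterSupport {
  store: boolean;
  prompt_cache_key: boolean;
}

// Assumed minimal shape of the options object; the real type has more fields.
interface StreamOptions {
  sessionId?: string;
  [key: string]: unknown;
}

// Returns a copy of the options with sessionId removed when the target model
// rejects prompt_cache_key. Unknown models (no entry) are passed through.
function filterStreamOptions(
  support: ParameterSupport | undefined,
  options: StreamOptions = {},
): StreamOptions {
  const filtered: StreamOptions = { ...options };
  if (support && !support.prompt_cache_key) {
    delete filtered.sessionId;
  }
  return filtered;
}

const original: StreamOptions = { sessionId: "abc", maxTokens: 4096 };
const forClaude = filterStreamOptions({ store: true, prompt_cache_key: false }, original);
const forGpt = filterStreamOptions({ store: true, prompt_cache_key: true }, original);
```

Copying before deleting matters here: the caller's options object may be reused for a later request against a model that does accept `prompt_cache_key`.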
Test each parameter support pattern:
# Claude (no prompt_cache_key)
pi --provider litellm --model claude-haiku-4-5 -p "test"
# GPT (full support)
pi --provider litellm --model gpt-5.2 -p "test"
# Gemini (no prompt_cache_key)
pi --provider litellm --model gemini-2-5-pro -p "test"
What to verify:
IMPORTANT: Use reactive error handling, NOT proactive rate limiting.
Why reactive wins:
- Databricks 429 responses include retry_after, limit_type, limit, current

Key Databricks concepts:
- Output capacity is reserved based on max_tokens before admission

Implementation strategy:
Step 1: Set conservative max_tokens
// Databricks reserves output capacity based on max_tokens
// Lower values = higher admission success rate
let maxTokens = 4096; // Conservative default
if (modelId.includes("opus-4.6")) maxTokens = 8192;
if (modelId.includes("haiku")) maxTokens = 4096;
if (modelId.includes("sonnet")) maxTokens = 4096;
if (modelId.includes("gpt-5")) maxTokens = 8192;
Step 2: Parse structured rate limit errors
function parseRateLimitError(error: any): RateLimitError {
  const errorMessage = error instanceof Error ? error.message : String(error);
  // Error instances stringify to "{}", so search the message text as well
  const candidates = [errorMessage];
  try {
    candidates.push(JSON.stringify(error) ?? "");
  } catch (e) {
    // Circular structures cannot be stringified; rely on the message
  }
  // Look for structured JSON error response. The payload is nested
  // ({"error": {...}}), so take the span from the first "{" to the last "}"
  for (const text of candidates) {
    const start = text.indexOf("{");
    const end = text.lastIndexOf("}");
    if (start === -1 || end <= start) continue;
    try {
      const parsed = JSON.parse(text.slice(start, end + 1));
      if (parsed.error?.type === "rate_limit_exceeded") {
        return {
          isRateLimit: true,
          limitType: parsed.error.limit_type,
          limit: parsed.error.limit,
          current: parsed.error.current,
          retryAfter: parsed.error.retry_after,
          message: parsed.error.message,
        };
      }
    } catch (e) {
      // Fallback to pattern matching
    }
  }
  // Check for Databricks REQUEST_LIMIT_EXCEEDED
  if (errorMessage.includes("REQUEST_LIMIT_EXCEEDED")) {
    const typeMatch = errorMessage.match(/(input|output) tokens per minute/i);
    return {
      isRateLimit: true,
      limitType: typeMatch ? `${typeMatch[1]}_tokens_per_minute` : "unknown",
      message: "Exceeded Databricks workspace rate limit",
    };
  }
  // Generic rate limit patterns
  if (/rate limit/i.test(errorMessage) || errorMessage.includes("429")) {
    return { isRateLimit: true, message: "Rate limit exceeded" };
  }
  return { isRateLimit: false };
}
Step 3: Format user-friendly error messages
function formatRateLimitError(info: RateLimitError, modelId: string): string {
  const lines = [
    `⚠️ Rate Limit Exceeded - Model: ${modelId}`,
    "",
    info.message || "Rate limit exceeded",
    "",
  ];
  if (info.limitType) {
    lines.push(`Limit Type: ${info.limitType.replace(/_/g, " ")}`);
  }
  if (info.limit && info.current) {
    lines.push(`Limit: ${info.limit} | Current: ${info.current}`);
  }
  if (info.retryAfter) {
    lines.push(`Retry After: ${info.retryAfter} seconds`);
  }
  lines.push("");
  lines.push("What to do:");
  lines.push(` • Wait ${info.retryAfter || 60} seconds before retrying`);
  lines.push(` • Switch to a smaller model (e.g., haiku instead of opus)`);
  lines.push(` • Reduce prompt length or max_tokens`);
  lines.push(` • Contact Databricks account team for higher limits`);
  return lines.join("\n");
}
Step 4: Handle in streaming function
// `output` is assumed to be the assistant message being built for this
// request; it is created earlier in the full implementation (not shown here).
try {
  const azureStream = streamAzureOpenAIResponses(model, context, options);
  for await (const event of azureStream) {
    stream.push(event);
  }
} catch (error) {
  const rateLimitInfo = parseRateLimitError(error);
  output.stopReason = "error";
  // A caught value is not guaranteed to be an Error instance
  output.errorMessage = rateLimitInfo.isRateLimit
    ? formatRateLimitError(rateLimitInfo, model.id)
    : error instanceof Error ? error.message : String(error);
  stream.push({ type: "error", reason: output.stopReason, error: output });
} finally {
  stream.end();
}
Databricks Rate Limits (Enterprise Tier):
DON'T implement:
Reason: Databricks does this better server-side with full visibility.
Reference: https://docs.databricks.com/aws/en/machine-learning/foundation-model-apis/limits
Why NOT proactive rate limiting?
Initially considered: Client-side token bucket to pre-reject requests.
Why it doesn't work:
Databricks provides everything needed:
{
  "error": {
    "type": "rate_limit_exceeded",
    "code": 429,
    "limit_type": "input_tokens_per_minute",
    "limit": 200000,
    "current": 200150,
    "retry_after": 15
  }
}
Let the experts handle rate limiting. Focus on clear error messages.
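Handling the response reactively mostly comes down to reading `retry_after` from a body like the one above. The sketch below shows one way to do that, assuming the raw response body is available as a string; `retryDelaySeconds` is a hypothetical helper, and the 60-second fallback mirrors the default used in the formatted error message rather than anything Databricks documents.

```typescript
// Assumed subset of the structured error body shown above.
interface RateLimitBody {
  error?: {
    type?: string;
    retry_after?: number;
  };
}

// Fallback when the server omits retry_after (assumption, see lead-in).
const DEFAULT_RETRY_SECONDS = 60;

// Returns the number of seconds to wait, or null when the body is not a
// structured rate limit error.
function retryDelaySeconds(rawBody: string): number | null {
  let body: RateLimitBody | null;
  try {
    body = JSON.parse(rawBody);
  } catch {
    return null; // not JSON, so not a structured error
  }
  if (body?.error?.type !== "rate_limit_exceeded") return null;
  const retryAfter = body?.error?.retry_after;
  return typeof retryAfter === "number" && retryAfter > 0
    ? retryAfter
    : DEFAULT_RETRY_SECONDS;
}

const sampleBody = JSON.stringify({
  error: {
    type: "rate_limit_exceeded",
    code: 429,
    limit_type: "input_tokens_per_minute",
    limit: 200000,
    current: 200150,
    retry_after: 15,
  },
});
const delay = retryDelaySeconds(sampleBody); // 15
```

Returning `null` for anything that is not a structured rate limit error keeps the decision about other failures with the caller.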
Error: "prompt_cache_key: Extra inputs are not permitted"
- Filter the sessionId option for that model

Error: "REQUEST_LIMIT_EXCEEDED" or 429 Too Many Requests

Error: Cannot find module 'openai'
- Import from @mariozechner/pi-ai instead

Error: Model not found
- Check that the model appears in the /v1/models response
- List registered models with pi --provider litellm --list-models

Error: Infinite loop / repeated requests
- Ensure the for await loop pushes all events and calls stream.end()

See ~/.pi/agent/extensions/litellm.ts for the full reference implementation.
Key files to create:
- ~/.pi/agent/extensions/litellm.ts (main code)
- ~/.pi/agent/extensions/README-litellm.md (usage guide)

// 1. Imports
import type { ExtensionAPI } from "@mariozechner/pi-coding-agent";
import { createAssistantMessageEventStream, ... } from "@mariozechner/pi-ai";
// 2. Type definitions
interface ParameterSupport { store: boolean; prompt_cache_key: boolean; }
interface RateLimitError { isRateLimit: boolean; limitType?: string; ... }
// 3. Helper functions
function getParameterSupport(modelId: string): ParameterSupport { ... }
function detectModelCapabilities(modelId: string) { ... }
async function fetchAvailableModels(baseUrl, apiKey) { ... }
function parseRateLimitError(error: any): RateLimitError { ... }
function formatRateLimitError(info: RateLimitError, modelId: string): string { ... }
function createLiteLLMStream(parameterSupport) { ... }
// 4. Main export
export default async function (pi: ExtensionAPI) {
  // Check environment
  // Fetch models
  // Build parameter support map
  // Build model configurations (with conservative max_tokens)
  // Register provider
}
After implementation:
# Check registration
pi --provider litellm --list-models
# Test various backends
pi --provider litellm --model claude-opus-4-6 -p "hello"
pi --provider litellm --model gpt-5.2 -p "count to 3"
pi --provider litellm --model gemini-2-5-pro -p "hi"
# Verify clean output (no verbose logs)
pi --provider litellm --model gpt-5.2 -p "test" 2>&1 | head -5
# Check max_tokens are conservative (4-8K, not 16K+)
pi --provider litellm --list-models | grep litellm | head -5
# Should show max-out around 4.1K-8.2K
# Test rate limit error formatting (if you can trigger it)
# Make rapid requests to potentially hit limits
for i in {1..10}; do
  pi --provider litellm --model claude-opus-4-6 -p "Count to $i"
done
# Should see clear error message with retry guidance if rate limited
- /opt/homebrew/lib/node_modules/@mariozechner/pi-coding-agent/docs/extensions.md
- /opt/homebrew/lib/node_modules/@mariozechner/pi-coding-agent/docs/custom-provider.md
- pi-mono/packages/ai/src/providers/azure-openai-responses.ts

From implementation sessions:
Parameter Compatibility:
- Databricks backends accept store but reject prompt_cache_key
- Filter the sessionId option before calling the Azure provider

Rate Limit Handling:
- Parse retry_after, limit_type, limit, current from 429s

Key insight: Databricks API is designed for reactive handling. Don't fight it.

- Environment variables set (OPENAI_BASE_URL, OPENAI_API_KEY)
- LiteLLM proxy reachable (/v1/models endpoint)
- Extension file in place (~/.pi/agent/extensions/)
- Models appear in --list-models output