plugins/litellm/skills/litellm/SKILL.md
When calling LLM APIs from Python code. When connecting to llamafile or local LLM servers. When switching between OpenAI/Anthropic/local providers. When implementing retry/fallback logic for LLM calls. When code imports litellm or uses completion() patterns.
Unified Python interface for calling 100+ LLM APIs using a consistent OpenAI format. Provides standardized exception handling, retry/fallback logic, and cost tracking across multiple providers.
Use this skill when:
LiteLLM supports 100+ providers through a consistent OpenAI-style API:
- completion() function for all providers

# Using pip
pip install litellm
# Using uv
uv add litellm
All llamafile models MUST use the llamafile/ prefix for routing:
model = "llamafile/mistralai/mistral-7b-instruct-v0.2"
model = "llamafile/gemma-3-3b"
The api_base MUST point to llamafile's OpenAI-compatible endpoint:
api_base = "http://localhost:8080/v1"
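The two routing rules in this section (a llamafile/ model prefix and an api_base that points at the /v1 endpoint) can be checked before any request is sent. The following is a minimal sketch, not part of LiteLLM: normalize_api_base and ensure_llamafile_model are hypothetical helper names, and the trimming rules (including the /embeddings path) are assumptions based on this section.

```python
def normalize_api_base(api_base: str) -> str:
    """Return api_base trimmed so it ends in /v1 (hypothetical helper, not a LiteLLM API)."""
    base = api_base.strip().rstrip("/")
    # Drop an endpoint path that LiteLLM would otherwise append a second time.
    for endpoint in ("/chat/completions", "/embeddings"):
        if base.endswith(endpoint):
            base = base[: -len(endpoint)]
    if not base.endswith("/v1"):
        base += "/v1"
    return base


def ensure_llamafile_model(model: str) -> str:
    """Add the llamafile/ routing prefix if it is missing (hypothetical helper)."""
    return model if model.startswith("llamafile/") else f"llamafile/{model}"


print(normalize_api_base("http://localhost:8080/v1/chat/completions"))
print(ensure_llamafile_model("gemma-3-3b"))
```

Normalizing once at configuration time keeps the call sites free of string handling; whether to repair a bad value silently or raise an error instead is a design choice left open here.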
Critical Requirements:
- The api_base must include the /v1 suffix
- Do not append endpoint paths such as /chat/completions (LiteLLM adds these automatically)

import os
os.environ["LLAMAFILE_API_BASE"] = "http://localhost:8080/v1"
import litellm
response = litellm.completion(
    model="llamafile/mistralai/mistral-7b-instruct-v0.2",
    messages=[{"role": "user", "content": "Summarize this diff"}],
    api_base="http://localhost:8080/v1",
    temperature=0.2,
    max_tokens=80,
)
print(response.choices[0].message.content)
# Async completion with acompletion()
from litellm import acompletion
import asyncio

async def generate_message():
    response = await acompletion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Write a commit message"}],
        api_base="http://localhost:8080/v1",
        temperature=0.3,
        max_tokens=200,
    )
    return response.choices[0].message.content

result = asyncio.run(generate_message())
print(result)
# Async streaming: print each piece of the reply as it arrives
from litellm import acompletion
import asyncio

async def stream_response():
    response = await acompletion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello, how are you?"}],
        api_base="http://localhost:8080/v1",
        stream=True,
    )
    async for chunk in response:
        if chunk.choices[0].delta.content:
            print(chunk.choices[0].delta.content, end="", flush=True)
    print()

asyncio.run(stream_response())
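When the full text is needed after streaming, the deltas can be accumulated instead of printed. This is a minimal sketch, assuming each chunk exposes choices[0].delta.content as in the loop above. collect_stream is a hypothetical helper, and fake_stream only stands in for a real LiteLLM stream so the example runs without a server.

```python
import asyncio
from types import SimpleNamespace
from typing import AsyncIterator


async def collect_stream(chunks: AsyncIterator) -> str:
    """Join the delta.content pieces of a streamed response (hypothetical helper)."""
    parts = []
    async for chunk in chunks:
        piece = chunk.choices[0].delta.content
        if piece:  # skip empty or None deltas, as the printing loop does
            parts.append(piece)
    return "".join(parts)


async def fake_stream(pieces):
    """Stand-in for a streamed response; yields objects shaped like the chunks above."""
    for piece in pieces:
        delta = SimpleNamespace(content=piece)
        yield SimpleNamespace(choices=[SimpleNamespace(delta=delta)])


text = asyncio.run(collect_stream(fake_stream(["Hel", "lo", None])))
print(text)
```

With a real stream, the object returned by acompletion(..., stream=True) would be passed to collect_stream in place of fake_stream(...).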
from litellm import embedding
import os
os.environ["LLAMAFILE_API_BASE"] = "http://localhost:8080/v1"
response = embedding(
    model="llamafile/sentence-transformers/all-MiniLM-L6-v2",
    input=["Hello world"],
)
print(response)
All exceptions can be imported directly from litellm:
from litellm import (
    BadRequestError,          # 400 errors
    AuthenticationError,      # 401 errors
    NotFoundError,            # 404 errors
    Timeout,                  # 408 errors (alias: openai.APITimeoutError)
    RateLimitError,           # 429 errors
    APIConnectionError,       # 500 errors / connection issues (default)
    ServiceUnavailableError,  # 503 errors
)
| Status Code | Exception Type | Inherits from | Description |
| ----------- | ----------------------------- | ---------------------------- | --------------------------- |
| 400 | BadRequestError | openai.BadRequestError | Invalid request |
| 400 | ContextWindowExceededError | litellm.BadRequestError | Token limit exceeded |
| 400 | ContentPolicyViolationError | litellm.BadRequestError | Content policy violation |
| 401 | AuthenticationError | openai.AuthenticationError | Auth failure |
| 403 | PermissionDeniedError | openai.PermissionDeniedError | Permission denied |
| 404 | NotFoundError | openai.NotFoundError | Invalid model/endpoint |
| 408 | Timeout | openai.APITimeoutError | Request timeout |
| 429 | RateLimitError | openai.RateLimitError | Rate limited |
| 500 | APIConnectionError | openai.APIConnectionError | Default for unmapped errors |
| 500 | APIError | openai.APIError | Generic 500 error |
| 503 | ServiceUnavailableError | openai.APIStatusError | Service unavailable |
| >=500 | InternalServerError | openai.InternalServerError | Unmapped 500+ errors |
All LiteLLM exceptions include:
- status_code: HTTP status code
- message: Error message
- llm_provider: Provider that raised the exception

import litellm
import openai
try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
        timeout=30.0,
    )
except openai.APITimeoutError as e:
    # LiteLLM exceptions inherit from OpenAI types
    print(f"Timeout: {e}")
except litellm.APIConnectionError as e:
    print(f"Connection failed: {e.message}")
    print(f"Provider: {e.llm_provider}")
import litellm
from litellm.exceptions import BadRequestError, AuthenticationError, APIError

try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
    )
except AuthenticationError as e:
    print(f"Authentication failed: {e}")
except BadRequestError as e:
    print(f"Bad request: {e}")
except APIError as e:
    print(f"API error: {e}")
import litellm
try:
    response = litellm.completion(
        model="llamafile/gemma-3-3b",
        messages=[{"role": "user", "content": "Hello"}],
        api_base="http://localhost:8080/v1",
    )
except Exception as e:
    if hasattr(e, 'status_code'):
        should_retry = litellm._should_retry(e.status_code)
        print(f"Should retry: {should_retry}")
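A retry decision like the one above is usually combined with a delay between attempts. The sketch below is an illustration under stated assumptions, not LiteLLM's own retry logic: call_with_backoff, is_retryable, the retryable status set, and the delay values are all hypothetical, and the request is passed in as a plain callable so the loop can be exercised without a server.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Assumed policy for this sketch: timeouts (408), rate limits (429),
# and server errors (500 and above) are worth another attempt.
RETRYABLE_STATUS = {408, 429}


def is_retryable(exc: Exception) -> bool:
    """Decide from the status_code attribute whether another attempt makes sense."""
    status = getattr(exc, "status_code", None)
    if not isinstance(status, int):
        return False
    return status in RETRYABLE_STATUS or status >= 500


def call_with_backoff(
    call: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 0.5,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Run call(), retrying retryable failures with exponentially growing delays."""
    for attempt in range(max_attempts):
        try:
            return call()
        except Exception as exc:
            last_attempt = attempt == max_attempts - 1
            if last_attempt or not is_retryable(exc):
                raise
            sleep(base_delay * 2 ** attempt)  # delay doubles after each failed attempt
    raise ValueError("max_attempts must be at least 1")
```

In practice, call would wrap the request, for example lambda: litellm.completion(model=..., messages=..., api_base=...). The sleep parameter is injectable only so the delays can be observed in tests.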
from litellm import completion
response = completion(
    model="llamafile/gemma-3-3b",
    messages=[{"role": "user", "content": "Hello"}],
    api_base="http://localhost:8080/v1",
    num_retries=3,  # Retry 3 times on failure
    timeout=30.0,   # 30 second timeout
)
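Retries help with transient failures of one model; falling back to another model covers the case where a server stays down. The following is a minimal sketch of that pattern under stated assumptions: complete_with_fallback is a hypothetical helper, not a LiteLLM function, and the completion function is injected so the example runs without any server. With LiteLLM, the injected function would typically be litellm.completion.

```python
from typing import Any, Callable, Sequence


def complete_with_fallback(
    models: Sequence[str],
    complete: Callable[..., Any],
    **kwargs: Any,
) -> Any:
    """Try each model in order and return the first successful response."""
    errors = []
    for model in models:
        try:
            return complete(model=model, **kwargs)
        except Exception as exc:  # real code would catch narrower types, e.g. APIConnectionError
            errors.append(f"{model}: {exc}")
    raise RuntimeError("All models failed: " + "; ".join(errors))


# Stand-in completion function (hypothetical, for illustration only).
def fake_complete(model: str, **kwargs: Any) -> str:
    if model == "llamafile/gemma-3-3b":
        raise ConnectionError("server not running")
    return f"response from {model}"


result = complete_with_fallback(
    ["llamafile/gemma-3-3b", "llamafile/mistralai/mistral-7b-instruct-v0.2"],
    fake_complete,
    messages=[{"role": "user", "content": "Hello"}],
)
print(result)
```

Collecting the per-model error messages keeps the final exception informative when every candidate fails.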
For proxy deployments, use config.yaml:
model_list:
  - model_name: commit-polish-model
    litellm_params:
      model: llamafile/gemma-3-3b # add llamafile/ prefix
      api_base: http://localhost:8080/v1 # add api base for OpenAI compatible provider
import litellm
from litellm import APIConnectionError
def verify_llamafile_connection(api_base: str = "http://localhost:8080/v1") -> bool:
    """Check if llamafile server is running."""
    try:
        litellm.completion(
            model="llamafile/test",
            messages=[{"role": "user", "content": "test"}],
            api_base=api_base,
            max_tokens=1,
        )
        return True
    except APIConnectionError:
        return False
import litellm
from litellm import acompletion, APIConnectionError
import asyncio
class AIService:
    """LiteLLM wrapper with llamafile routing."""

    def __init__(self, model: str, api_base: str, temperature: float = 0.3, max_tokens: int = 200):
        self.model = model
        self.api_base = api_base
        self.temperature = temperature
        self.max_tokens = max_tokens

    async def generate_commit_message(self, diff: str, system_prompt: str) -> str:
        """Generate a commit message using the LLM."""
        try:
            response = await acompletion(
                model=self.model,
                messages=[
                    {"role": "system", "content": system_prompt},
                    {"role": "user", "content": f"Generate a commit message for this diff:\n\n{diff}"},
                ],
                api_base=self.api_base,
                temperature=self.temperature,
                max_tokens=self.max_tokens,
            )
            return response.choices[0].message.content.strip()
        except APIConnectionError as e:
            # Chain the original exception so the traceback keeps the root cause
            raise RuntimeError(f"Failed to connect to llamafile server at {self.api_base}: {e.message}") from e
- llamafile/ prefix: Without the prefix, LiteLLM won't route to the OpenAI-compatible endpoint
- /v1 suffix: API base must end with /v1
- No endpoint path: do not use http://localhost:8080/v1/chat/completions - LiteLLM adds the endpoint path automatically

# ~/.config/commit-polish/config.toml
[ai]
model = "llamafile/gemma-3-3b" # MUST have llamafile/ prefix
temperature = 0.3
max_tokens = 200
export LLAMAFILE_API_BASE="http://localhost:8080/v1"
export LITELLM_LOG="INFO" # Enable LiteLLM logging at INFO level (use "DEBUG" for verbose debug output)
For comprehensive documentation on related tools:
- Skill(command: "llamafile:llamafile") for llamafile server setup, model management, and local LLM deployment patterns
- Skill(command: "python3-development:uv") for Python project management, dependency handling, and virtual environment workflows