plugins/llamafile/skills/llamafile/SKILL.md
When setting up local LLM inference without cloud APIs. When running GGUF models locally. When needing OpenAI-compatible API from a local model. When building offline/air-gapped AI tools. When troubleshooting local LLM server connections.
Configure and manage Mozilla Llamafile - a cross-platform executable distribution format that runs LLMs locally with an OpenAI-compatible API.
Use this skill when:
Llamafile combines llama.cpp with Cosmopolitan Libc to create single-file executables that:
- Provide a /health endpoint for monitoring
- Support embeddings via the --embedding flag

Llamafile exposes these endpoints when running with --server:
| Endpoint | Description | Requirements |
| ------------------------------------------- | -------------------------- | ------------------ |
| http://localhost:8080/v1/chat/completions | Chat completions (primary) | Server mode |
| http://localhost:8080/v1/completions | Text completions | Server mode |
| http://localhost:8080/v1/embeddings | Generate embeddings | --embedding flag |
| http://localhost:8080/health | Health check | Server mode |
Critical Detail: All OpenAI-compatible endpoints require the /v1 prefix in the URL path. The /health endpoint is the exception and is served without it.
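One way to keep the /v1 prefix from being dropped is to build every URL from a single mapping. The sketch below assumes the paths listed in the table above; the names `ENDPOINTS` and `endpoint_url` are illustrative and not part of llamafile.

```python
# Hypothetical helper: endpoint paths copied from the table above.
ENDPOINTS = {
    "chat": "/v1/chat/completions",
    "completions": "/v1/completions",
    "embeddings": "/v1/embeddings",  # server must be started with --embedding
    "health": "/health",             # not under /v1
}


def endpoint_url(name: str, host: str = "localhost", port: int = 8080) -> str:
    """Return the full URL for a named llamafile endpoint."""
    return f"http://{host}:{port}{ENDPOINTS[name]}"


print(endpoint_url("chat"))
```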
# Download llamafile v0.9.3 binary
curl -L -o llamafile https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3
# Make executable
chmod 755 llamafile
# Verify version
./llamafile --version
Alternative download sources:
- https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3
- https://sourceforge.net/projects/llamafile.mirror/files/0.9.3/

Llamafile supports two approaches: pre-packaged llamafile executables (model embedded) or separate GGUF model files.
Pre-packaged llamafile (easiest):
# Download a llamafile with embedded model
curl -LO https://huggingface.co/mozilla-ai/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile --server --nobrowser
Separate GGUF model (use with ./llamafile --server -m model.gguf):
Download GGUF files from HuggingFace model publishers, then load with the llamafile binary.
Pre-packaged llamafile models from mozilla-ai:
| Model | Size | Use Case | Download |
| ------------------ | ------ | ----------------------- | ----------------------------------------------- |
| Qwen3-0.6B | ~500MB | Fast, lower quality | mozilla-ai/Qwen3-0.6B-llamafile |
| Mistral 7B v0.2 | ~4GB | Balanced speed/quality | mozilla-ai/Mistral-7B-Instruct-v0.2-llamafile |
| Llama 3.1 8B | ~5GB | Higher quality, slower | mozilla-ai/Meta-Llama-3.1-8B-Instruct-llamafile |
| LLaVA v1.5 7B | ~4GB | Multimodal (text+image) | mozilla-ai/llava-v1.5-7b-llamafile |
These are self-contained executables — download, chmod +x, and run. No separate llamafile binary needed.
Start llamafile server for local API access:
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--port 8080 \
--host 127.0.0.1
Critical flags explained:
- --server: Required to enable HTTP API endpoints
- -m: Path to GGUF model file (required)
- --nobrowser: Prevents auto-opening browser on startup
- --port 8080: Default port (note: NOT 8000)
- --host 127.0.0.1: Localhost only (secure default)

For GPU-accelerated inference with higher throughput:
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--port 8080 \
--host 127.0.0.1 \
--ctx-size 4096 \
--n-gpu-layers 99 \
--threads 8 \
--cont-batching \
--parallel 4
Advanced flags:
| Flag | Purpose | Default | When to Use |
| ----------------- | ---------------------------- | ------------------- | ----------------------------------------------- |
| --ctx-size | Prompt context window size | 512 | Increase for longer conversations |
| --n-gpu-layers | GPU offload layer count | 0 | Set to 99 to offload all layers to GPU |
| --threads | CPU threads for generation | Auto | Set explicitly for consistent performance |
| --threads-batch | Threads for batch processing | Same as --threads | Tune separately for prompt vs generation |
| --cont-batching | Continuous batching | Off | Enable for multiple concurrent requests |
| --parallel | Parallel sequence count | 1 | Increase for concurrent request handling |
| --mlock | Lock model in memory | Off | Prevent swapping on systems with sufficient RAM |
| --embedding | Enable embeddings endpoint | Off | Required for /v1/embeddings API |
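The flags in the table above can be assembled programmatically so that optional ones are only passed when set. This is a minimal sketch: the function name `build_server_command` and its parameter names are assumptions chosen for illustration, and only flags that appear in this document are used.

```python
from typing import Optional


def build_server_command(
    llamafile_path: str,
    model_path: str,
    port: int = 8080,
    host: str = "127.0.0.1",
    ctx_size: Optional[int] = None,
    gpu_layers: Optional[int] = None,
    threads: Optional[int] = None,
    parallel: Optional[int] = None,
    cont_batching: bool = False,
    embedding: bool = False,
) -> list[str]:
    """Build an argument list suitable for subprocess.Popen."""
    cmd = [
        llamafile_path, "--server",
        "-m", model_path,
        "--nobrowser",
        "--port", str(port),
        "--host", host,
    ]
    # Flags that take a value are added only when a value was supplied.
    valued = {
        "--ctx-size": ctx_size,
        "--n-gpu-layers": gpu_layers,
        "--threads": threads,
        "--parallel": parallel,
    }
    for flag, value in valued.items():
        if value is not None:
            cmd += [flag, str(value)]
    # Boolean switches take no value.
    if cont_batching:
        cmd.append("--cont-batching")
    if embedding:
        cmd.append("--embedding")
    return cmd


print(build_server_command("./llamafile", "/path/to/model.gguf", gpu_layers=99))
```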
To allow connections from other machines (development/testing only):
./llamafile --server \
-m /path/to/model.gguf \
--nobrowser \
--host 0.0.0.0 \
--port 8080
Security warning: Binding to 0.0.0.0 exposes the API to network access. Use only in trusted environments.
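An application that launches the server could guard against accidental exposure by checking the bind address first. The sketch below uses the standard `ipaddress` module; the helper name `is_loopback_host` is hypothetical, and treating unresolved hostnames as non-loopback is a deliberately conservative assumption.

```python
import ipaddress


def is_loopback_host(host: str) -> bool:
    """Return True only if the host is clearly a loopback address."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal; treat as potentially network-reachable.
        return False


for candidate in ("127.0.0.1", "0.0.0.0"):
    label = "local only" if is_loopback_host(candidate) else "network-exposed"
    print(f"{candidate}: {label}")
```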
LiteLLM provides a unified interface for llamafile and cloud LLM providers.
import litellm

response = litellm.completion(
    model="llamafile/gemma-3-3b",  # MUST use llamafile/ prefix
    messages=[{"role": "user", "content": "Hello, world!"}],
    api_base="http://localhost:8080/v1",  # MUST include /v1 suffix
    temperature=0.3,
    max_tokens=200,
)
print(response.choices[0].message.content)
Critical requirements for LiteLLM:
- Model name MUST use the llamafile/ prefix for routing
- api_base MUST include the /v1 suffix

Related skill: For comprehensive LiteLLM configuration, activate the litellm skill:
Skill(skill: "litellm:litellm")
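The two LiteLLM requirements (the llamafile/ model prefix and the /v1 suffix on api_base) can be enforced before any request is made. The following is a sketch only; `normalize_litellm_settings` is a hypothetical helper, not a LiteLLM function.

```python
def normalize_litellm_settings(model: str, api_base: str) -> tuple[str, str]:
    """Return (model, api_base) with the prefix and suffix LiteLLM expects."""
    if not model.startswith("llamafile/"):
        model = f"llamafile/{model}"
    api_base = api_base.rstrip("/")
    if not api_base.endswith("/v1"):
        api_base = f"{api_base}/v1"
    return model, api_base


print(normalize_litellm_settings("gemma-3-3b", "http://localhost:8080"))
```

The returned values can then be passed as `model=` and `api_base=` to `litellm.completion`.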
Direct integration with the OpenAI SDK for llamafile endpoints:
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # MUST include /v1
    api_key="sk-no-key-required",  # Any value works
)
response = client.chat.completions.create(
    model="local-model",  # Model name is flexible
    messages=[
        {"role": "user", "content": "Hello, world!"}
    ],
    temperature=0.3,
    max_tokens=200,
)
print(response.choices[0].message.content)
Verify llamafile server is responding correctly:
# Health check
curl http://localhost:8080/health
# Chat completions
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"messages": [{"role": "user", "content": "Hello"}],
"temperature": 0.3,
"max_tokens": 200
}'
# Embeddings (requires --embedding flag on server)
curl http://localhost:8080/v1/embeddings \
-H "Content-Type: application/json" \
-d '{
"model": "local",
"input": ["Hello world"]
}'
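The chat completions curl call above can also be reproduced with only the Python standard library, which is handy when neither the OpenAI SDK nor LiteLLM is installed. This is a sketch under the assumption that the response follows the OpenAI chat format (`choices[0].message.content`); the function names are illustrative.

```python
import json
import urllib.request


def build_chat_request(
    prompt: str,
    base_url: str = "http://localhost:8080/v1",
    model: str = "local",
    temperature: float = 0.3,
    max_tokens: int = 200,
) -> urllib.request.Request:
    """Build the same POST request as the curl example, without sending it."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
        "temperature": temperature,
        "max_tokens": max_tokens,
    }
    return urllib.request.Request(
        f"{base_url}/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def send_chat(prompt: str, timeout: float = 60.0) -> str:
    """Send the request to a running server and return the reply text."""
    with urllib.request.urlopen(build_chat_request(prompt), timeout=timeout) as resp:
        body = json.load(resp)
    return body["choices"][0]["message"]["content"]
```

Calling `send_chat("Hello")` requires a llamafile server already listening on port 8080.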
Python script to start llamafile as a background process with health checking:
import subprocess
import time

import httpx


def start_llamafile(
    llamafile_path: str,
    model_path: str,
    port: int = 8080,
    host: str = "127.0.0.1",
) -> subprocess.Popen:
    """Start llamafile server as a background process."""
    cmd = [
        llamafile_path,
        "--server",
        "-m", model_path,
        "--nobrowser",
        "--port", str(port),
        "--host", host,
    ]
    # Discard server output: an unread PIPE can fill up and block the server.
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        _wait_for_server(host, port)
    except TimeoutError:
        # Do not leave an orphaned server behind if startup fails.
        process.terminate()
        raise
    return process


def _wait_for_server(host: str, port: int, timeout: int = 30) -> None:
    """Poll the /health endpoint until it returns 200 or the timeout expires."""
    url = f"http://{host}:{port}/health"
    deadline = time.monotonic() + timeout
    while time.monotonic() < deadline:
        try:
            response = httpx.get(url, timeout=2)
            if response.status_code == 200:
                return
        except httpx.RequestError:
            pass  # Server not accepting connections yet; retry.
        time.sleep(0.5)
    raise TimeoutError(f"Server did not start within {timeout} seconds")
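A background server also needs to be shut down when the application exits. One possible approach, sketched here with a hypothetical `managed_process` context manager, is to terminate the process on exit and fall back to a hard kill if it does not stop in time. The command passed in would be the llamafile command line shown earlier.

```python
import subprocess
from contextlib import contextmanager
from typing import Iterator


@contextmanager
def managed_process(
    cmd: list[str], stop_timeout: float = 10.0
) -> Iterator[subprocess.Popen]:
    """Run cmd in the background and make sure it is stopped on exit."""
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        yield process
    finally:
        if process.poll() is None:  # still running
            process.terminate()
            try:
                process.wait(timeout=stop_timeout)
            except subprocess.TimeoutExpired:
                process.kill()  # did not exit gracefully
                process.wait()
```

Usage would look like `with managed_process(["./llamafile", "--server", "-m", "/path/to/model.gguf", "--nobrowser"]) as server: ...`, with the server stopped when the block exits.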
Example TOML configuration for applications using llamafile:
# ~/.config/app-name/config.toml
[ai]
model = "llamafile/gemma-3-3b" # Must use llamafile/ prefix
temperature = 0.3
max_tokens = 200
[llamafile]
path = "/home/user/.local/bin/llamafile"
model_path = "/home/user/.local/share/app-name/models/gemma-3-3b.gguf"
api_base = "http://127.0.0.1:8080/v1" # Include /v1 suffix
Check if port is already in use:
# Find process using port 8080
lsof -i :8080
# Kill existing process
kill $(lsof -t -i :8080)
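The same port check can be done from Python before launching the server, which avoids depending on lsof being installed. This is a minimal sketch using a TCP connect attempt; `is_port_in_use` is an illustrative name.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1") -> bool:
    """Return True if something accepts TCP connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(1.0)
        # connect_ex returns 0 on success instead of raising.
        return sock.connect_ex((host, port)) == 0


if is_port_in_use(8080):
    print("Port 8080 is busy; stop the other process or pick another --port")
else:
    print("Port 8080 is free")
```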
Verify model file exists and is readable:
ls -lh /path/to/model.gguf
Check llamafile binary permissions:
ls -la /path/to/llamafile
# Should show: -rwxr-xr-x (executable)
# Fix permissions if needed
chmod 755 /path/to/llamafile
Verify server is running:
# Check health endpoint
curl http://localhost:8080/health
# Check server is listening
netstat -tlnp | grep 8080
# or
lsof -i :8080
Common causes:
- Missing --server flag
- Missing /v1 in API URL path
- Server bound to 127.0.0.1 but accessing from another machine

Test basic connectivity:
# Verbose health check
curl -v http://localhost:8080/health
# Test chat completions with verbose output
curl -v http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{"model":"test","messages":[{"role":"user","content":"Hi"}]}'
Common API issues:
| Error | Cause | Solution |
| ------------------ | -------------------- | --------------------------------- |
| 404 Not Found | Missing /v1 in URL | Add /v1 before endpoint path |
| Connection refused | Server not running | Start server with --server flag |
| Timeout | Model loading slowly | Wait longer or use smaller model |
| Invalid model | Wrong model path | Verify -m path to GGUF file |
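Client code can surface the hints from the table above when a request fails. The mapping below is a sketch that covers the first three rows; the function name `suggest_fix` and the choice of exception types are assumptions, and real clients may raise library-specific errors instead.

```python
from typing import Optional


def suggest_fix(
    status_code: Optional[int] = None,
    error: Optional[BaseException] = None,
) -> str:
    """Translate a failed request into the matching hint from the table."""
    if isinstance(error, ConnectionRefusedError):
        return "Server not running: start it with the --server flag"
    if isinstance(error, TimeoutError):
        return "Model loading slowly: wait longer or use a smaller model"
    if status_code == 404:
        return "Missing /v1 in URL: add /v1 before the endpoint path"
    return "No known fix: check the server output"


print(suggest_fix(status_code=404))
```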
Optimize inference speed:
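- --n-gpu-layers 99 (offload all layers to GPU)
- --threads 8 (set CPU threads explicitly)
- --cont-batching (continuous batching for concurrent requests)
- --ctx-size 2048 (prompt context window size)

Check GPU availability: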
# NVIDIA GPU
nvidia-smi
# AMD GPU
rocm-smi
# Apple Metal (check activity monitor)
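A launcher script could look for these vendor tools before deciding whether to pass --n-gpu-layers. The sketch below only checks whether the commands are on PATH, which suggests but does not prove that a usable GPU is present; `detect_gpu_tools` is a hypothetical helper.

```python
import shutil

# Vendor tools named in the commands above.
GPU_TOOLS = {"nvidia": "nvidia-smi", "amd": "rocm-smi"}


def detect_gpu_tools() -> dict[str, bool]:
    """Report which GPU vendor tools are available on PATH."""
    return {
        vendor: shutil.which(tool) is not None
        for vendor, tool in GPU_TOOLS.items()
    }


print(detect_gpu_tools())
```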
Avoid these frequent errors when using llamafile:
- Missing /v1 in API URL: Always include /v1 suffix for OpenAI-compatible endpoints
- Missing llamafile/ prefix in model name for proper routing
- Binary not executable (chmod 755)
- Using --n-gpu-layers on CPU-only systems causes errors

Current stable version: 0.9.3 (May 14, 2025)
Version constants:
LLAMAFILE_MAJOR = 0
LLAMAFILE_MINOR = 9
LLAMAFILE_PATCH = 3
Recent changes in 0.9.3:
Skills to activate:
litellm - For unified LLM provider interface and routing
Skill(skill: "litellm:litellm")
External tools: