Llamafile

Configure and manage Mozilla Llamafile, a cross-platform executable distribution format that runs LLMs locally with an OpenAI-compatible API.

When to Use This Skill

Use this skill when:

Installing llamafile binary and GGUF model files
Starting llamafile server with optimal configuration
Integrating llamafile with LiteLLM or OpenAI SDK
Configuring llamafile for different performance profiles (GPU, CPU, network access)
Troubleshooting llamafile server startup or API connection issues
Building applications requiring local LLM inference
Setting up commit message tools, code review systems, or other developer tools with local AI
Managing llamafile as a background service
Selecting and downloading appropriate GGUF models
Validating OpenAI-compatible API responses

Core Capabilities

What Llamafile Provides

Llamafile combines llama.cpp with Cosmopolitan Libc to create single-file executables that:

Run on macOS, Windows, Linux, FreeBSD, OpenBSD, NetBSD
Support AMD64 and ARM64 architectures
Serve OpenAI-compatible HTTP API on localhost
Load GGUF model files for inference
Provide /health endpoint for monitoring
Support GPU acceleration (CUDA, Metal, Vulkan)
Enable embeddings generation with --embedding flag

API Compatibility

Llamafile exposes these OpenAI-compatible endpoints when running with --server:

| Endpoint | Description | Requirements |
| ------------------------------------------- | -------------------------- | ------------------ |
| http://localhost:8080/v1/chat/completions | Chat completions (primary) | Server mode |
| http://localhost:8080/v1/completions | Text completions | Server mode |
| http://localhost:8080/v1/embeddings | Generate embeddings | --embedding flag |
| http://localhost:8080/health | Health check | Server mode |

Critical Detail: All OpenAI-compatible endpoints require the /v1 prefix in the URL path. The /health endpoint is the exception and is served without it.
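
The endpoint paths listed above can be kept in one place so that the /v1 prefix is never forgotten. The following is a minimal sketch; the names ENDPOINT_PATHS and endpoint_url are hypothetical helpers for illustration, not part of llamafile.

```python
# Paths copied from the endpoint list above; /health is the only one without /v1.
ENDPOINT_PATHS = {
    "chat": "/v1/chat/completions",
    "completions": "/v1/completions",
    "embeddings": "/v1/embeddings",
    "health": "/health",
}


def endpoint_url(name: str, host: str = "localhost", port: int = 8080) -> str:
    """Return the full URL for a named llamafile endpoint (hypothetical helper)."""
    if name not in ENDPOINT_PATHS:
        raise ValueError(f"unknown endpoint: {name!r}")
    return f"http://{host}:{port}{ENDPOINT_PATHS[name]}"
```

For example, endpoint_url("chat") yields the chat completions URL on the default port, and endpoint_url("health", port=9000) yields the health check URL on port 9000.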

Installation

Download Llamafile Binary

# Download llamafile v0.9.3 binary
curl -L -o llamafile https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3

# Make executable
chmod 755 llamafile

# Verify version
./llamafile --version

Alternative download sources:

GitHub Release: https://github.com/mozilla-ai/llamafile/releases/download/0.9.3/llamafile-0.9.3
SourceForge Mirror: https://sourceforge.net/projects/llamafile.mirror/files/0.9.3/

Download a Model

Llamafile supports two approaches: pre-packaged llamafile executables (model embedded) or separate GGUF model files.

Pre-packaged llamafile (easiest):

# Download a llamafile with embedded model
curl -LO https://huggingface.co/mozilla-ai/llava-v1.5-7b-llamafile/resolve/main/llava-v1.5-7b-q4.llamafile
chmod +x llava-v1.5-7b-q4.llamafile
./llava-v1.5-7b-q4.llamafile --server --nobrowser

Separate GGUF model (use with ./llamafile --server -m model.gguf):

Download GGUF files from HuggingFace model publishers, then load with the llamafile binary.

Pre-packaged llamafile models from mozilla-ai:

| Model | Size | Use Case | Download |
| ------------------ | ------ | ----------------------- | ----------------------------------------------- |
| Qwen3-0.6B | ~500MB | Fast, lower quality | mozilla-ai/Qwen3-0.6B-llamafile |
| Mistral 7B v0.2 | ~4GB | Balanced speed/quality | mozilla-ai/Mistral-7B-Instruct-v0.2-llamafile |
| Llama 3.1 8B | ~5GB | Higher quality, slower | mozilla-ai/Meta-Llama-3.1-8B-Instruct-llamafile |
| LLaVA v1.5 7B | ~4GB | Multimodal (text+image) | mozilla-ai/llava-v1.5-7b-llamafile |

These are self-contained executables: download, chmod +x, and run. No separate llamafile binary is needed.

Server Configuration

Basic Server Command

Start llamafile server for local API access:

./llamafile --server \
    -m /path/to/model.gguf \
    --nobrowser \
    --port 8080 \
    --host 127.0.0.1

Critical flags explained:

--server: Required to enable HTTP API endpoints
-m: Path to GGUF model file (required)
--nobrowser: Prevents auto-opening browser on startup
--port 8080: Default port (note: NOT 8000)
--host 127.0.0.1: Localhost only (secure default)

Performance-Optimized Configuration

For GPU-accelerated inference with higher throughput:

./llamafile --server \
    -m /path/to/model.gguf \
    --nobrowser \
    --port 8080 \
    --host 127.0.0.1 \
    --ctx-size 4096 \
    --n-gpu-layers 99 \
    --threads 8 \
    --cont-batching \
    --parallel 4

Advanced flags:

| Flag | Purpose | Default | When to Use |
| ----------------- | ---------------------------- | ------------------- | ----------------------------------------------- |
| --ctx-size | Prompt context window size | 512 | Increase for longer conversations |
| --n-gpu-layers | GPU offload layer count | 0 | Set to 99 to offload all layers to GPU |
| --threads | CPU threads for generation | Auto | Set explicitly for consistent performance |
| --threads-batch | Threads for batch processing | Same as --threads | Tune separately for prompt vs generation |
| --cont-batching | Continuous batching | Off | Enable for multiple concurrent requests |
| --parallel | Parallel sequence count | 1 | Increase for concurrent request handling |
| --mlock | Lock model in memory | Off | Prevent swapping on systems with sufficient RAM |
| --embedding | Enable embeddings endpoint | Off | Required for /v1/embeddings API |
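
When the server is launched from application code, the flags above can be assembled programmatically so that optional tuning flags appear only when set. This is a sketch under stated assumptions: build_server_command is a hypothetical helper, and it emits only flags that appear in this document.

```python
from typing import Optional


def build_server_command(
    llamafile_path: str,
    model_path: str,
    host: str = "127.0.0.1",
    port: int = 8080,
    ctx_size: Optional[int] = None,
    n_gpu_layers: Optional[int] = None,
    threads: Optional[int] = None,
    parallel: Optional[int] = None,
    cont_batching: bool = False,
    embedding: bool = False,
) -> list:
    """Build an argv list for a llamafile server (hypothetical helper)."""
    cmd = [
        llamafile_path, "--server", "-m", model_path, "--nobrowser",
        "--host", host, "--port", str(port),
    ]
    # Flags that take a value are added only when the caller sets them,
    # so the server's own defaults apply otherwise.
    valued_flags = {
        "--ctx-size": ctx_size,
        "--n-gpu-layers": n_gpu_layers,
        "--threads": threads,
        "--parallel": parallel,
    }
    for flag, value in valued_flags.items():
        if value is not None:
            cmd += [flag, str(value)]
    # Boolean switches take no value.
    if cont_batching:
        cmd.append("--cont-batching")
    if embedding:
        cmd.append("--embedding")
    return cmd
```

The returned list can be passed directly to subprocess.Popen, which avoids shell quoting issues with model paths that contain spaces.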

Network-Accessible Configuration

To allow connections from other machines (development/testing only):

./llamafile --server \
    -m /path/to/model.gguf \
    --nobrowser \
    --host 0.0.0.0 \
    --port 8080

Security warning: Binding to 0.0.0.0 exposes the API to network access. Use only in trusted environments.
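
An application that accepts a configurable bind address may want to warn before exposing the API. The sketch below uses Python's standard ipaddress module; is_loopback_host is a hypothetical helper name, and treating unresolved hostnames as exposed is a conservative assumption.

```python
import ipaddress


def is_loopback_host(host: str) -> bool:
    """Return True if the bind address only accepts local connections."""
    if host == "localhost":
        return True
    try:
        return ipaddress.ip_address(host).is_loopback
    except ValueError:
        # Not an IP literal (for example a hostname): assume it may be exposed.
        return False
```

With this check, 127.0.0.1 and ::1 count as local, while 0.0.0.0 is flagged as network-accessible.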

API Integration

Using LiteLLM (Recommended)

LiteLLM provides a unified interface for llamafile and cloud LLM providers.

import litellm

response = litellm.completion(
    model="llamafile/gemma-3-3b",  # MUST use llamafile/ prefix
    messages=[{"role": "user", "content": "Hello, world!"}],
    api_base="http://localhost:8080/v1",  # MUST include /v1 suffix
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)

Critical requirements for LiteLLM:

Model name MUST use llamafile/ prefix for routing
api_base MUST include /v1 suffix
No API key required (any placeholder value works)
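
The two naming requirements above can be enforced before any request is made. The following sketch normalizes user-supplied values; litellm_settings is a hypothetical helper and does not call LiteLLM itself.

```python
def litellm_settings(model_name: str, api_base: str) -> dict:
    """Return model and api_base values that satisfy the requirements above."""
    # Add the llamafile/ routing prefix if the caller left it off.
    if model_name.startswith("llamafile/"):
        model = model_name
    else:
        model = f"llamafile/{model_name}"
    # Ensure the base URL ends with exactly one /v1 and no trailing slash.
    base = api_base.rstrip("/")
    if not base.endswith("/v1"):
        base += "/v1"
    return {"model": model, "api_base": base}
```

The result can be unpacked into the call shown earlier, for example litellm.completion(messages=messages, **litellm_settings("gemma-3-3b", "http://localhost:8080")).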

Related skill: For comprehensive LiteLLM configuration, activate the litellm skill:

Skill(skill: "litellm:litellm")

Using OpenAI Python SDK

Direct integration with OpenAI SDK for llamafile endpoints:

from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8080/v1",  # MUST include /v1
    api_key="sk-no-key-required"  # Any value works
)

response = client.chat.completions.create(
    model="local-model",  # Model name is flexible
    messages=[
        {"role": "user", "content": "Hello, world!"}
    ],
    temperature=0.3,
    max_tokens=200
)

print(response.choices[0].message.content)

Using curl for Testing

Verify llamafile server is responding correctly:

# Health check
curl http://localhost:8080/health

# Chat completions
curl http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "messages": [{"role": "user", "content": "Hello"}],
    "temperature": 0.3,
    "max_tokens": 200
  }'

# Embeddings (requires --embedding flag on server)
curl http://localhost:8080/v1/embeddings \
  -H "Content-Type: application/json" \
  -d '{
    "model": "local",
    "input": ["Hello world"]
  }'
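
To validate what the chat completions endpoint returns, the JSON body can be checked before use. This sketch assumes the response follows the OpenAI chat completions shape (a choices list whose first entry holds a message with content); extract_reply is a hypothetical helper, and the sample used to exercise it should be treated as hand-written, not captured llamafile output.

```python
def extract_reply(response: dict) -> str:
    """Return the assistant text from a chat completions response body."""
    choices = response.get("choices")
    if not isinstance(choices, list) or not choices:
        raise ValueError("response has no choices")
    first = choices[0]
    message = first.get("message") if isinstance(first, dict) else None
    if not isinstance(message, dict) or not isinstance(message.get("content"), str):
        raise ValueError("first choice has no message content")
    return message["content"]
```

Raising ValueError on a malformed body keeps shape problems separate from connection errors, which surface earlier as exceptions from the HTTP client.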

Server Management

Process Management Script

Python script to start llamafile as a background process with health checking:

import subprocess
import time
import httpx

def start_llamafile(
    llamafile_path: str,
    model_path: str,
    port: int = 8080,
    host: str = "127.0.0.1"
) -> subprocess.Popen:
    """Start llamafile server as background process."""
    cmd = [
        llamafile_path,
        "--server",
        "-m", model_path,
        "--nobrowser",
        "--port", str(port),
        "--host", host,
    ]
    # Discard output: a PIPE that is never read can fill up and block the server.
    process = subprocess.Popen(
        cmd,
        stdout=subprocess.DEVNULL,
        stderr=subprocess.DEVNULL,
    )
    try:
        _wait_for_server(process, host, port)
    except Exception:
        # Do not leave a half-started server behind.
        process.terminate()
        raise
    return process


def _wait_for_server(
    process: subprocess.Popen, host: str, port: int, timeout: int = 30
) -> None:
    """Wait for server to respond to health checks."""
    url = f"http://{host}:{port}/health"
    start = time.monotonic()
    while time.monotonic() - start < timeout:
        # Fail fast if the process died (bad model path, port in use, ...).
        if process.poll() is not None:
            raise RuntimeError(f"llamafile exited with code {process.returncode}")
        try:
            response = httpx.get(url, timeout=2)
            if response.status_code == 200:
                return
        except httpx.RequestError:
            pass
        time.sleep(0.5)
    raise TimeoutError(f"Server did not start within {timeout} seconds")
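
A background server also needs a clean shutdown path. The sketch below takes a process handle like the one returned by start_llamafile and escalates from a polite terminate to a kill; stop_llamafile and grace_seconds are hypothetical names, and the 5 second default is an arbitrary assumption.

```python
import subprocess


def stop_llamafile(process: subprocess.Popen, grace_seconds: float = 5.0) -> int:
    """Stop a server process and return its exit code."""
    if process.poll() is None:
        # Ask the process to exit first so it can release resources.
        process.terminate()
        try:
            process.wait(timeout=grace_seconds)
        except subprocess.TimeoutExpired:
            # Still running after the grace period: force it.
            process.kill()
            process.wait()
    return process.returncode
```

Calling the function on a process that has already exited is harmless; it returns the recorded exit code.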

Configuration File Pattern

Example TOML configuration for applications using llamafile:

# ~/.config/app-name/config.toml
[ai]
model = "llamafile/gemma-3-3b"  # Must use llamafile/ prefix
temperature = 0.3
max_tokens = 200

[llamafile]
path = "/home/user/.local/bin/llamafile"
model_path = "/home/user/.local/share/app-name/models/gemma-3-3b.gguf"
api_base = "http://127.0.0.1:8080/v1"  # Include /v1 suffix

Troubleshooting

Server Fails to Start

Check if port is already in use:

# Find process using port 8080
lsof -i :8080

# Kill only the process listening on the port (not clients connected to it)
kill $(lsof -t -i TCP:8080 -sTCP:LISTEN)
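
The same port check can be done from Python before launching the server, which is useful where lsof is not installed. This sketch uses only the standard socket module; is_port_in_use is a hypothetical helper, and it reports a port as in use when something accepts a TCP connection on it.

```python
import socket


def is_port_in_use(port: int, host: str = "127.0.0.1", timeout: float = 1.0) -> bool:
    """Return True if a TCP listener accepts connections on host:port."""
    with socket.socket(socket.AF_INET, socket.SOCK_STREAM) as sock:
        sock.settimeout(timeout)
        # connect_ex returns 0 on success and an errno value otherwise.
        return sock.connect_ex((host, port)) == 0
```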

Verify model file exists and is readable:

ls -lh /path/to/model.gguf

Check llamafile binary permissions:

ls -la /path/to/llamafile
# Should show: -rwxr-xr-x (executable)

# Fix permissions if needed
chmod 755 /path/to/llamafile

Connection Refused Errors

Verify server is running:

# Check health endpoint
curl http://localhost:8080/health

# Check server is listening
netstat -tlnp | grep 8080
# or
lsof -i :8080

Common causes:

Server not started with --server flag
Wrong port number (8080 vs 8000)
Missing /v1 in API URL path
Server bound to 127.0.0.1 but accessing from another machine

API Errors

Test basic connectivity:

# Verbose health check
curl -v http://localhost:8080/health

# Test chat completions with verbose output
curl -v http://localhost:8080/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{"model":"test","messages":[{"role":"user","content":"Hi"}]}'

Common API issues:

| Error | Cause | Solution |
| ------------------ | -------------------- | --------------------------------- |
| 404 Not Found | Missing /v1 in URL | Add /v1 before endpoint path |
| Connection refused | Server not running | Start server with --server flag |
| Timeout | Model loading slowly | Wait longer or use smaller model |
| Invalid model | Wrong model path | Verify -m path to GGUF file |

Performance Issues

Optimize inference speed:

Use quantized models (Q4_K_M recommended)
Enable GPU acceleration: --n-gpu-layers 99
Increase threads: --threads 8
Enable continuous batching: --cont-batching
Reduce context size if not needed: --ctx-size 2048

Check GPU availability:

# NVIDIA GPU
nvidia-smi

# AMD GPU
rocm-smi

# Apple Metal (check activity monitor)

Common Pitfalls

Avoid these frequent errors when using llamafile:

Port 8000 vs 8080: Llamafile defaults to port 8080, not 8000
Missing /v1 in API URL: Always include /v1 suffix for OpenAI-compatible endpoints
LiteLLM prefix: Must use llamafile/ prefix in model name for proper routing
API key confusion: No real API key needed, but some clients require placeholder value
Starting server from hooks: Application hooks should check if server is running, not start it
Model path issues: Ensure GGUF file exists and is readable before starting server
Binary permissions: Llamafile must be executable (chmod 755)
GPU layers on CPU: Setting --n-gpu-layers on CPU-only systems causes errors
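
For the hook pitfall above, a hook can probe the health endpoint and skip its work when the server is down instead of starting it. The following sketch uses only the standard library so a hook does not need extra dependencies; server_is_running is a hypothetical helper, and treating any failure as "not running" is a deliberate simplification.

```python
import http.client
import urllib.error
import urllib.request


def server_is_running(host: str = "127.0.0.1", port: int = 8080, timeout: float = 2.0) -> bool:
    """Return True if the /health endpoint answers with HTTP 200."""
    url = f"http://{host}:{port}/health"
    try:
        with urllib.request.urlopen(url, timeout=timeout) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError, http.client.HTTPException):
        # Connection refused, timeout, or a non-2xx status: treat as not running.
        return False
```

A hook would typically call this once and exit quietly with a short message when it returns False.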

Version Information

Current stable version: 0.9.3 (May 14, 2025)

Version constants:

LLAMAFILE_MAJOR = 0
LLAMAFILE_MINOR = 9
LLAMAFILE_PATCH = 3

Recent changes in 0.9.3:

Added Phi4 model support
Added Qwen3 model support
Respects NO_COLOR environment variable
Fixed URL handling in JavaScript (preserves path when building relative URLs)
Added Plaintext output option to LocalScore

Related Skills and Tools

Skills to activate:

litellm - For unified LLM provider interface and routing
```
Skill(skill: "litellm:litellm")
```

External tools:

LiteLLM - Unified interface for multiple LLM providers
OpenAI Python SDK - Direct OpenAI-compatible API access
llama.cpp - Underlying inference engine
GGUF format - Model format specification

References

Official Documentation

Mozilla llamafile GitHub - Primary repository and source code
Mozilla llamafile Documentation - Official documentation site
LiteLLM llamafile Provider - LiteLLM integration guide
llama.cpp Server Documentation - Underlying server implementation
Releases Page - Binary downloads and changelog

Model Resources

Hugging Face Mozilla Models - Official Mozilla GGUF models
GGUF Format Specification - Model file format details

Related Technologies

Cosmopolitan Libc - Cross-platform binary format
llama.cpp - LLM inference engine
OpenAI API Reference - API compatibility reference
