skills/mlops/inference/llama-cpp/SKILL.md
Runs LLM inference on CPU, Apple Silicon, and consumer GPUs without NVIDIA hardware. Use for edge deployment, M1/M2/M3 Macs, AMD/Intel GPUs, or when CUDA is unavailable. Supports GGUF quantization (1.5-8 bit) for reduced memory and 4-10× speedup vs PyTorch on CPU.
npx skillsauth add garrettroi/open-manus llama-cppInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Pure C/C++ LLM inference with minimal dependencies, optimized for CPUs and non-NVIDIA hardware.
Use llama.cpp when:
Use TensorRT-LLM instead when:
Use vLLM instead when:
# macOS/Linux
brew install llama.cpp
# Or build from source
git clone https://github.com/ggerganov/llama.cpp
cd llama.cpp
make
# With Metal (Apple Silicon)
make LLAMA_METAL=1
# With CUDA (NVIDIA)
make LLAMA_CUDA=1
# With ROCm (AMD)
make LLAMA_HIP=1
# Download from HuggingFace (GGUF format)
huggingface-cli download \
TheBloke/Llama-2-7B-Chat-GGUF \
llama-2-7b-chat.Q4_K_M.gguf \
--local-dir models/
# Or convert from HuggingFace
python convert_hf_to_gguf.py models/llama-2-7b-chat/
# Simple chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
-p "Explain quantum computing" \
-n 256 # Max tokens
# Interactive chat
./llama-cli \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--interactive
# Start OpenAI-compatible server
./llama-server \
-m models/llama-2-7b-chat.Q4_K_M.gguf \
--host 0.0.0.0 \
--port 8080 \
-ngl 32 # Offload 32 layers to GPU
# Client request
curl http://localhost:8080/v1/chat/completions \
-H "Content-Type: application/json" \
-d '{
"model": "llama-2-7b-chat",
"messages": [{"role": "user", "content": "Hello!"}],
"temperature": 0.7,
"max_tokens": 100
}'
| Format | Bits | Size (7B) | Speed | Quality | Use Case | |--------|------|-----------|-------|---------|----------| | Q4_K_M | 4.5 | 4.1 GB | Fast | Good | Recommended default | | Q4_K_S | 4.3 | 3.9 GB | Faster | Lower | Speed critical | | Q5_K_M | 5.5 | 4.8 GB | Medium | Better | Quality critical | | Q6_K | 6.5 | 5.5 GB | Slower | Best | Maximum quality | | Q8_0 | 8.0 | 7.0 GB | Slow | Excellent | Minimal degradation | | Q2_K | 2.5 | 2.7 GB | Fastest | Poor | Testing only |
# General use (balanced)
Q4_K_M # 4-bit, medium quality
# Maximum speed (more degradation)
Q2_K or Q3_K_M
# Maximum quality (slower)
Q6_K or Q8_0
# Very large models (70B, 405B)
Q3_K_M or Q4_K_S # Lower bits to fit in memory
# Build with Metal
make LLAMA_METAL=1
# Run with GPU acceleration (automatic)
./llama-cli -m model.gguf -ngl 999 # Offload all layers
# Performance: M3 Max 40-60 tokens/sec (Llama 2-7B Q4_K_M)
# Build with CUDA
make LLAMA_CUDA=1
# Offload layers to GPU
./llama-cli -m model.gguf -ngl 35 # Offload 35/40 layers
# Hybrid CPU+GPU for large models
./llama-cli -m llama-70b.Q4_K_M.gguf -ngl 20 # GPU: 20 layers, CPU: rest
# Build with ROCm
make LLAMA_HIP=1
# Run with AMD GPU
./llama-cli -m model.gguf -ngl 999
# Process multiple prompts from file
cat prompts.txt | ./llama-cli \
-m model.gguf \
--batch-size 512 \
-n 100
# JSON output with grammar
./llama-cli \
-m model.gguf \
-p "Generate a person: " \
--grammar-file grammars/json.gbnf
# Outputs valid JSON only
# Increase context (default 512)
./llama-cli \
-m model.gguf \
-c 4096 # 4K context window
# Very long context (if model supports)
./llama-cli -m model.gguf -c 32768 # 32K context
| CPU | Threads | Speed | Cost | |-----|---------|-------|------| | Apple M3 Max | 16 | 50 tok/s | $0 (local) | | AMD Ryzen 9 7950X | 32 | 35 tok/s | $0.50/hour | | Intel i9-13900K | 32 | 30 tok/s | $0.40/hour | | AWS c7i.16xlarge | 64 | 40 tok/s | $2.88/hour |
| GPU | Speed | vs CPU | Cost | |-----|-------|--------|------| | NVIDIA RTX 4090 | 120 tok/s | 3-4× | $0 (local) | | NVIDIA A10 | 80 tok/s | 2-3× | $1.00/hour | | AMD MI250 | 70 tok/s | 2× | $2.00/hour | | Apple M3 Max (Metal) | 50 tok/s | ~Same | $0 (local) |
LLaMA family:
Mistral family:
Other:
Find models: https://huggingface.co/models?library=gguf
development
# Voice Sanitizer This skill cleans up text before it is sent to the Text-to-Speech (TTS) engine. It removes technical jargon, code blocks, and long URLs to ensure the agent sounds natural and conversational in voice chat. ## Usage To sanitize text for speech, run the following command in the terminal: ```bash python3 /app/skills/voice_sanitizer/sanitizer.py "Your long, technical text with `code` and https://links.com/long-url" ``` ### Example Output ```text Your long, technical text with a
tools
Professional AI video production workflow. Use when creating videos, short films, commercials, or any video content using AI generation tools.
tools
Secure API key access from the centralized vault. Fetch keys on-demand without storing them in environment variables.
testing
# Task Board — Persistent Task Tracking for Open Manus This skill provides a shared task board backed by Redis. Harmony uses it to track delegated work across all agents, and agents use it to report progress and completion. ## When to Use - **Harmony**: Use this whenever you delegate a task to an agent. Add the task to the board, then check the board periodically to follow up. - **Worker Agents**: Use this to update your task status or mark tasks as complete. ## Commands ### Add a new task