skills/ollama/SKILL.md
Ollama for running local LLMs — model management, API usage, and integration patterns. Use when user mentions "ollama", "local LLM", "run llama locally", "local AI", "ollama run", "ollama pull", "self-hosted model", "offline AI", "local inference", or running language models on their own machine.
macOS:
brew install ollama
Linux:
curl -fsSL https://ollama.com/install.sh | sh
Windows: Download from https://ollama.com/download/windows.
Start the server:
ollama serve
The server listens on http://localhost:11434 by default. Set OLLAMA_HOST to change the bind address.
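Client scripts usually need a full base URL rather than a bind address. The sketch below shows one way to derive it from an OLLAMA_HOST-style value. The helper name `ollama_base_url` and its normalization rules (default to plain HTTP, map the wildcard address to localhost) are assumptions for illustration and may not match what the official clients do; IPv6 addresses are not handled.

```python
import os
from urllib.parse import urlparse

DEFAULT_PORT = 11434  # default port stated above


def ollama_base_url(host=None):
    """Turn an OLLAMA_HOST-style value into a base URL for HTTP clients.

    Hypothetical helper: the rules below are assumptions, not Ollama's own.
    """
    raw = host if host is not None else os.environ.get("OLLAMA_HOST", "")
    raw = raw.strip()
    if not raw:
        return f"http://localhost:{DEFAULT_PORT}"
    if "://" not in raw:
        raw = "http://" + raw  # assumption: plain HTTP when no scheme is given
    parsed = urlparse(raw)
    hostname = parsed.hostname or "localhost"
    if hostname == "0.0.0.0":
        hostname = "localhost"  # a wildcard bind address is not a client target
    port = parsed.port or DEFAULT_PORT
    return f"{parsed.scheme}://{hostname}:{port}"
```

For example, `ollama_base_url("0.0.0.0:11434")` returns `http://localhost:11434`, and an empty value falls back to the default address.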
ollama pull llama3 # download without running
ollama run llama3 # run (auto-pulls if missing)
ollama list # list downloaded models
Interactive chat (opens a REPL, type /bye to exit):
ollama run llama3
One-shot generation (pipe input, get output, exit):
echo "Explain quicksort in two sentences" | ollama run llama3
cat main.py | ollama run codellama "Review this code for bugs"
curl http://localhost:11434/api/generate -d '{
"model": "llama3",
"prompt": "Why is the sky blue?",
"stream": false
}'
curl http://localhost:11434/api/chat -d '{
"model": "llama3",
"messages": [
{"role": "system", "content": "You are a helpful assistant."},
{"role": "user", "content": "What is 2+2?"}
],
"stream": false
}'
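The chat endpoint is stateless, so a multi-turn conversation means resending the whole message list on every request. A minimal sketch using only the standard library follows. The helper names `post_chat` and `ask` are hypothetical, and the code assumes the non-streaming reply carries the assistant turn under a "message" key.

```python
import json
import urllib.request

CHAT_URL = "http://localhost:11434/api/chat"  # endpoint from the curl example


def post_chat(payload, url=CHAT_URL):
    """POST a chat payload and return the decoded JSON reply.

    Needs a running Ollama server; nothing is sent until this is called.
    """
    req = urllib.request.Request(
        url,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)


def ask(history, user_text, model="llama3", send=post_chat):
    """Append a user turn, send the full history, record the assistant reply."""
    history.append({"role": "user", "content": user_text})
    reply = send({"model": model, "messages": history, "stream": False})
    history.append(reply["message"])  # assumed shape: {"role": "assistant", ...}
    return reply["message"]["content"]
```

With a server running, start from `history = [{"role": "system", "content": "You are a helpful assistant."}]` and call `ask(history, "What is 2+2?")` repeatedly; each call sees the earlier turns. The `send` parameter exists only so the transport can be swapped out in tests.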
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": "Ollama is a tool for running local LLMs"
}'
Set "stream": true (the default) to receive newline-delimited JSON chunks.
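A streamed reply can be reassembled by parsing one JSON object per line and concatenating the text fields. The sketch below does this for /api/generate chunks. The function name `collect_stream` is hypothetical, and the sample chunks are hand-written stand-ins for real server output.

```python
import json


def collect_stream(lines):
    """Join the text from newline-delimited JSON chunks of /api/generate."""
    parts = []
    for raw in lines:
        line = raw.decode("utf-8") if isinstance(raw, bytes) else raw
        line = line.strip()
        if not line:
            continue  # tolerate blank lines between chunks
        chunk = json.loads(line)
        parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk carries "done": true
    return "".join(parts)


# Offline demo with hand-written chunks shaped like the streamed output.
sample = [
    b'{"model": "llama3", "response": "The sky ", "done": false}\n',
    b'{"model": "llama3", "response": "is blue.", "done": false}\n',
    b'{"model": "llama3", "response": "", "done": true}\n',
]
print(collect_stream(sample))  # The sky is blue.
```

Against a live server, the response object returned by `urllib.request.urlopen` can be passed in directly, since iterating it yields one line of bytes at a time. For /api/chat the text sits under `chunk["message"]["content"]` instead of `chunk["response"]`.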
ollama list # list downloaded models
ollama show llama3 # show model details (parameters, template, license)
ollama cp llama3 my-llama3 # copy/alias a model
ollama rm my-llama3 # delete a model
ollama ps # list currently loaded/running models
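ollama ps shows each loaded model's memory footprint (SIZE), its CPU/GPU split (PROCESSOR), and the time until unload (UNTIL). Use ollama show for the quantization level.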
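The same information is exposed over HTTP at GET /api/ps, which is handy for monitoring scripts. The sketch below summarizes such a response. The helper names are hypothetical, the field names ("size", "size_vram", "expires_at") reflect my understanding of the API and should be checked against your version, and the sample values are made up.

```python
def gpu_share(model_entry):
    """Percent of a loaded model's memory that sits in VRAM (0-100)."""
    size = model_entry.get("size", 0)
    if size <= 0:
        return 0
    return round(100 * model_entry.get("size_vram", 0) / size)


def summarize_ps(ps_response):
    """Return (name, percent_on_gpu, expires_at) for each loaded model."""
    return [
        (m["name"], gpu_share(m), m.get("expires_at"))
        for m in ps_response.get("models", [])
    ]


# Made-up sample shaped like a /api/ps reply (values are illustrative only).
sample_ps = {
    "models": [
        {"name": "llama3:latest", "size": 6_000_000_000,
         "size_vram": 6_000_000_000, "expires_at": "2024-06-04T14:38:31Z"},
        {"name": "llama3:70b", "size": 40_000_000_000,
         "size_vram": 10_000_000_000, "expires_at": "2024-06-04T14:40:00Z"},
    ]
}
for name, pct, until in summarize_ps(sample_ps):
    print(f"{name}: {pct}% on GPU, unloads at {until}")
```

A share below 100 means part of the model is running on the CPU, which usually explains slow generation.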
A Modelfile defines a custom model:
FROM llama3
SYSTEM "You are a senior software engineer. Be concise. Provide code examples."
PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"
Key parameters:
- temperature -- randomness (0.0 = deterministic, 1.0+ = creative).
- num_ctx -- context window in tokens. Higher values use more VRAM.
- top_p -- nucleus sampling threshold.
- top_k -- limits token selection pool.
- repeat_penalty -- penalizes repeated tokens.
- stop -- stop sequence(s).
- num_gpu -- layers to offload to GPU (0 for CPU-only).

Build and run the custom model:
ollama create my-coder -f ./Modelfile
ollama run my-coder
To update, edit the Modelfile and run ollama create again with the same name.
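When several model variants differ only in a few parameters, generating the Modelfile text from data avoids copy-paste drift. A minimal sketch follows. The helper `render_modelfile` is hypothetical, it covers only the FROM, SYSTEM, and PARAMETER instructions shown above, and it assumes string values contain no double quotes.

```python
def render_modelfile(base, system=None, parameters=None):
    """Render Modelfile text from a base model, system prompt, and parameters.

    List values emit one PARAMETER line per item (useful for several stops).
    """
    lines = [f"FROM {base}"]
    if system:
        if '"' in system:
            raise ValueError("this sketch does not escape double quotes")
        lines.append(f'SYSTEM "{system}"')
    for name, value in (parameters or {}).items():
        values = value if isinstance(value, (list, tuple)) else [value]
        for item in values:
            if isinstance(item, str):
                if '"' in item:
                    raise ValueError("this sketch does not escape double quotes")
                item = f'"{item}"'  # quote strings such as stop sequences
            lines.append(f"PARAMETER {name} {item}")
    return "\n".join(lines) + "\n"


print(render_modelfile(
    "llama3",
    system="You are a senior software engineer. Be concise.",
    parameters={"temperature": 0.3, "num_ctx": 8192, "stop": ["<|eot_id|>"]},
))
```

Write the returned text to ./Modelfile and run ollama create as shown above.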
Ollama auto-detects NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (Metal) GPUs.
ollama ps # PROCESSOR column shows gpu or cpu
Force CPU-only:
CUDA_VISIBLE_DEVICES="" ollama serve # per-session
OLLAMA_NUM_GPU=0 ollama serve # server-wide
Per-model GPU control in a Modelfile:
PARAMETER num_gpu 0 # force CPU
PARAMETER num_gpu 999 # offload all layers to GPU (default)
On a multi-GPU NVIDIA system, set CUDA_VISIBLE_DEVICES=0,1 to choose which GPUs Ollama may use.
| Model | Size | Best for |
|---|---|---|
| llama3 (8B) | 4.7 GB | General chat, reasoning, instruction following |
| llama3:70b | 40 GB | Higher quality when you have the VRAM |
| codellama (7B) | 3.8 GB | Code generation, completion, infilling |
| mistral (7B) | 4.1 GB | Fast general-purpose, structured output |
| phi3 (3.8B) | 2.2 GB | Small footprint, good quality for its size |
| gemma2 (9B) | 5.4 GB | Strong reasoning, multilingual |
| deepseek-coder-v2 (16B) | 8.9 GB | Code generation, multi-language |
| nomic-embed-text | 274 MB | Text embeddings for RAG |
| llava (7B) | 4.7 GB | Multi-modal image understanding |
For constrained hardware (8 GB RAM), use phi3 or llama3 with q4 quantization.
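A rough way to shortlist models for a given machine is to compare the download sizes in the table with available memory plus some headroom. The sketch below does that. The function `models_that_fit` is hypothetical and the 1.5x overhead factor is an assumed rule of thumb, not an Ollama figure; real usage depends on context size and on what else is running.

```python
# Download sizes in GB for the chat and code models in the table above.
MODEL_SIZES_GB = {
    "llama3": 4.7,
    "llama3:70b": 40.0,
    "codellama": 3.8,
    "mistral": 4.1,
    "phi3": 2.2,
    "gemma2": 5.4,
    "deepseek-coder-v2": 8.9,
}


def models_that_fit(memory_gb, overhead=1.5):
    """Names whose estimated footprint (size * overhead) fits, largest first.

    overhead is an assumed safety factor for context and runtime buffers.
    """
    fits = [
        (size, name)
        for name, size in MODEL_SIZES_GB.items()
        if size * overhead <= memory_gb
    ]
    return [name for size, name in sorted(fits, reverse=True)]


print(models_that_fit(8))
```

With 8 GB this shortlist contains llama3, mistral, codellama, and phi3. Leave extra room for the operating system when the budget is total RAM rather than free memory.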
Use nomic-embed-text (768 dimensions) or mxbai-embed-large (1024 dimensions).
curl http://localhost:11434/api/embed -d '{
"model": "nomic-embed-text",
"input": ["document chunk one", "document chunk two"]
}'
Response contains an "embeddings" array of float vectors. Store in a vector database (ChromaDB, pgvector, Qdrant, FAISS) for similarity search.
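For small collections, similarity search can be done in plain Python before reaching for a vector database. The sketch below ranks stored vectors by cosine similarity to a query vector. The function names are hypothetical, and the toy 3-dimensional vectors stand in for real embeddings.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0 or norm_b == 0:
        return 0.0  # treat a zero vector as similar to nothing
    return dot / (norm_a * norm_b)


def top_k(query_vec, doc_vecs, k=3):
    """Return (index, score) pairs for the k most similar document vectors."""
    scored = [(i, cosine_similarity(query_vec, v)) for i, v in enumerate(doc_vecs)]
    return sorted(scored, key=lambda pair: pair[1], reverse=True)[:k]


# Toy vectors standing in for the "embeddings" array of a real response.
docs = [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.7, 0.7, 0.0]]
print(top_k([1.0, 0.1, 0.0], docs, k=2))
```

In practice, embed the query with the same model used for the documents, then pass the query vector and the stored "embeddings" array to `top_k`. This linear scan is fine for a few thousand chunks; beyond that a vector database is the better fit.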
Python (pip install requests):
import requests
response = requests.post("http://localhost:11434/api/generate", json={
"model": "llama3",
"prompt": "Explain monads simply",
"stream": False
})
print(response.json()["response"])
Python with the official library (pip install ollama):
import ollama
# Chat
response = ollama.chat(model="llama3", messages=[
{"role": "user", "content": "Explain monads simply"}
])
print(response["message"]["content"])
# Embeddings
result = ollama.embed(model="nomic-embed-text", input="some text")
print(len(result["embeddings"][0])) # 768
# Streaming
for chunk in ollama.chat(model="llama3", messages=[
{"role": "user", "content": "Write a haiku"}
], stream=True):
print(chunk["message"]["content"], end="", flush=True)
JavaScript (built-in fetch):
const response = await fetch("http://localhost:11434/api/generate", {
method: "POST",
body: JSON.stringify({ model: "llama3", prompt: "Explain closures", stream: false }),
});
const data = await response.json();
console.log(data.response);
JavaScript with the official library (npm install ollama):
import { Ollama } from "ollama";
const ollama = new Ollama({ host: "http://localhost:11434" });
const response = await ollama.chat({
model: "llama3",
messages: [{ role: "user", content: "Explain closures" }],
});
console.log(response.message.content);
Python (pip install langchain-ollama):
from langchain_ollama import ChatOllama, OllamaEmbeddings
llm = ChatOllama(model="llama3", temperature=0.3)
response = llm.invoke("What is dependency injection?")
embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectors = embeddings.embed_documents(["first chunk", "second chunk"])
JavaScript (npm install @langchain/ollama):
import { ChatOllama } from "@langchain/ollama";
const llm = new ChatOllama({ model: "llama3", temperature: 0.3 });
const response = await llm.invoke("What is dependency injection?");
Use llava or llava-llama3 for image understanding:
curl http://localhost:11434/api/chat -d '{
"model": "llava",
"messages": [{"role": "user", "content": "Describe this image", "images": ["BASE64_DATA"]}],
"stream": false
}'
import ollama, base64
with open("photo.jpg", "rb") as f:
img = base64.b64encode(f.read()).decode()
response = ollama.chat(model="llava", messages=[
{"role": "user", "content": "What do you see?", "images": [img]}
])
Context size -- use the smallest that fits your workload:
PARAMETER num_ctx 4096 # default
PARAMETER num_ctx 32768 # long documents, more VRAM
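To see why a larger num_ctx costs memory, it helps to estimate the key/value cache, which grows linearly with context length. The sketch below uses the standard transformer estimate with the commonly published Llama 3 8B shape (32 layers, 8 key/value heads, head dimension 128) and 16-bit cache entries. These numbers are assumptions, not values read from Ollama, and the result excludes model weights and compute buffers.

```python
def kv_cache_bytes(num_ctx, n_layers=32, n_kv_heads=8, head_dim=128,
                   bytes_per_value=2):
    """Approximate KV-cache size: one key and one value vector per layer,
    per KV head, per token position. Defaults are assumed Llama 3 8B values.
    """
    per_token = 2 * n_layers * n_kv_heads * head_dim * bytes_per_value
    return per_token * num_ctx


for ctx in (4096, 32768):
    print(ctx, kv_cache_bytes(ctx) / 2**30, "GiB")
# 4096 0.5 GiB
# 32768 4.0 GiB
```

Under these assumptions, going from 4096 to 32768 tokens adds about 3.5 GiB on top of the weights, which is often the difference between fitting in VRAM and spilling to CPU.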
Partial GPU offloading when model exceeds VRAM:
PARAMETER num_gpu 20 # 20 layers on GPU, rest on CPU
Batch size for prompt processing speed:
PARAMETER num_batch 512 # default, increase for faster eval
Keep-alive control (how long model stays loaded):
curl http://localhost:11434/api/generate -d '{
"model": "llama3", "prompt": "hi", "keep_alive": "30m"
}'
Use "keep_alive": 0 to unload immediately, -1 to keep indefinitely.
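Preloading and unloading can be scripted by sending a generate request that has a keep_alive value but no prompt. The sketch below only builds the JSON bodies. The helper `generate_payload` is hypothetical, and the prompt-less load/unload behavior is as I understand the API, so verify it against your Ollama version.

```python
import json


def generate_payload(model, prompt=None, keep_alive=None):
    """Build a /api/generate body, leaving out fields that were not given."""
    payload = {"model": model}
    if prompt is not None:
        payload["prompt"] = prompt
    if keep_alive is not None:
        payload["keep_alive"] = keep_alive
    return payload


preload = generate_payload("llama3", keep_alive=-1)  # load and keep resident
unload = generate_payload("llama3", keep_alive=0)    # unload immediately
warm = generate_payload("llama3", prompt="hi", keep_alive="30m")

for body in (preload, unload, warm):
    print(json.dumps(body))
```

Each printed body can be sent with curl -d or any HTTP client. Preloading before a batch job avoids paying the model load time on the first real request.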
Server-wide environment variables:
- OLLAMA_MAX_LOADED_MODELS -- maximum number of models loaded at once (default: 3 per GPU, or 3 for CPU inference).
- OLLAMA_NUM_PARALLEL -- concurrent requests per model.
- OLLAMA_FLASH_ATTENTION=1 -- reduce memory with flash attention.

macOS (Homebrew):
brew services start ollama
Linux (systemd -- created automatically by the install script):
sudo systemctl enable --now ollama
sudo journalctl -u ollama -f
Customize with sudo systemctl edit ollama:
[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"
Then sudo systemctl daemon-reload && sudo systemctl restart ollama.
Docker:
docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run llama3
With NVIDIA GPU: add --gpus=all to the run command.
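After starting Ollama as a service or container, scripts often need to wait until the API answers before sending work. A minimal polling sketch follows. The function names, retry count, and delay are illustrative assumptions, and it relies on the server answering a plain GET on its root path with status 200.

```python
import time
import urllib.error
import urllib.request


def server_is_up(base_url="http://localhost:11434"):
    """True if the server answers an HTTP GET on its root path."""
    try:
        with urllib.request.urlopen(base_url, timeout=2) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        return False  # refused, timed out, or not reachable yet


def wait_until_ready(probe=server_is_up, attempts=30, delay=1.0,
                     sleep=time.sleep):
    """Call probe() up to `attempts` times, sleeping `delay` seconds between."""
    for attempt in range(attempts):
        if probe():
            return True
        if attempt < attempts - 1:
            sleep(delay)
    return False
```

Call `wait_until_ready()` right after `docker run` or `systemctl start`, and pull or run models only once it returns True. The `probe` and `sleep` parameters are there so the loop can be exercised without a server.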