Adoption

Agent Skills are supported by leading AI development tools.

VS Code, Gemini CLI, GitHub, Goose, Amp, Cursor, Claude Code, Letta, OpenCode, Claude, OpenAI Codex, Factory

1mangesh1/ollama

Name: ollama
Author: 1mangesh1

skills/ollama/SKILL.md

npx skillsauth add 1mangesh1/dev-skills-collection ollama


Ollama

Install and Setup

macOS:

brew install ollama

Linux:

curl -fsSL https://ollama.com/install.sh | sh

Windows: Download from https://ollama.com/download/windows.

Start the server:

ollama serve

The server listens on http://localhost:11434 by default. Set OLLAMA_HOST to change the bind address.

Pull and Run Models

ollama pull llama3       # download without running
ollama run llama3        # run (auto-pulls if missing)
ollama list              # list downloaded models
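The list of downloaded models is also available over HTTP: GET /api/tags returns a JSON object with a "models" array. A minimal sketch using only the Python standard library; the helper names model_names and list_local_models are illustrative and not part of any Ollama client, and the base URL assumes the default address.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address; adjust if OLLAMA_HOST is set


def model_names(tags_response: dict) -> list[str]:
    """Extract model names from a parsed /api/tags response."""
    return [m["name"] for m in tags_response.get("models", [])]


def list_local_models(base_url: str = OLLAMA_URL) -> list[str]:
    """Query a running Ollama server for its downloaded models."""
    with urllib.request.urlopen(f"{base_url}/api/tags", timeout=10) as resp:
        return model_names(json.load(resp))
```

Calling list_local_models() requires a running server; model_names can be exercised offline on a saved response.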

Interactive Chat vs One-Shot Generation

Interactive chat (opens a REPL, type /bye to exit):

ollama run llama3

One-shot generation (pipe input, get output, exit):

echo "Explain quicksort in two sentences" | ollama run llama3
cat main.py | ollama run codellama "Review this code for bugs"

REST API

Generate (completion)

curl http://localhost:11434/api/generate -d '{
  "model": "llama3",
  "prompt": "Why is the sky blue?",
  "stream": false
}'

Chat (multi-turn)

curl http://localhost:11434/api/chat -d '{
  "model": "llama3",
  "messages": [
    {"role": "system", "content": "You are a helpful assistant."},
    {"role": "user", "content": "What is 2+2?"}
  ],
  "stream": false
}'
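The chat endpoint is stateless: the server does not remember earlier turns, so the client resends the whole "messages" list on each request and appends every assistant reply to it. A sketch of that bookkeeping with the standard library; build_chat_request, record_reply, and chat_turn are hypothetical helper names.

```python
import json
import urllib.request


def build_chat_request(model: str, history: list[dict], user_text: str) -> dict:
    """Append the user's turn to history and return the /api/chat request body."""
    history.append({"role": "user", "content": user_text})
    return {"model": model, "messages": history, "stream": False}


def record_reply(history: list[dict], response: dict) -> str:
    """Store the assistant message from a non-streaming response and return its text."""
    message = response["message"]
    history.append({"role": message["role"], "content": message["content"]})
    return message["content"]


def chat_turn(history: list[dict], user_text: str, model: str = "llama3",
              base_url: str = "http://localhost:11434") -> str:
    """Send one turn to a running Ollama server and update history in place."""
    body = json.dumps(build_chat_request(model, history, user_text)).encode()
    req = urllib.request.Request(f"{base_url}/api/chat", data=body,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return record_reply(history, json.load(resp))
```

Each call to chat_turn grows the history by two entries, so long conversations eventually need trimming to stay inside the context window.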

Embeddings

curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": "Ollama is a tool for running local LLMs"
}'

For /api/generate and /api/chat, omit "stream" or set it to true (the default) to receive newline-delimited JSON chunks instead of a single response.
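When streaming is on, each line of the response body is one JSON object: /api/generate chunks carry a "response" fragment, /api/chat chunks carry "message.content", and the last chunk has "done": true. A small sketch of reassembling the text; collect_stream is an illustrative name, not an Ollama API.

```python
import json
from typing import Iterable


def collect_stream(lines: Iterable[bytes]) -> str:
    """Join the text fragments from a streamed /api/generate or /api/chat response."""
    parts = []
    for raw in lines:
        if not raw.strip():
            continue  # skip blank lines between chunks
        chunk = json.loads(raw)
        if "message" in chunk:  # /api/chat chunk
            parts.append(chunk["message"].get("content", ""))
        else:  # /api/generate chunk
            parts.append(chunk.get("response", ""))
        if chunk.get("done"):
            break  # the final chunk carries statistics, not more text
    return "".join(parts)
```

The response object returned by urllib.request.urlopen can be iterated line by line, so it can be passed to collect_stream directly.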

Model Management

ollama list                    # list downloaded models
ollama show llama3             # show model details (parameters, template, license)
ollama cp llama3 my-llama3     # copy/alias a model
ollama rm my-llama3            # delete a model
ollama ps                      # list currently loaded/running models

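ollama ps shows each loaded model's memory footprint (SIZE), its CPU/GPU split (PROCESSOR), and the time until it is unloaded (UNTIL). Use ollama show to see the quantization level.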

Modelfile

A Modelfile defines a custom model:

FROM llama3

SYSTEM "You are a senior software engineer. Be concise. Provide code examples."

PARAMETER temperature 0.3
PARAMETER num_ctx 8192
PARAMETER top_p 0.9
PARAMETER repeat_penalty 1.1
PARAMETER stop "<|eot_id|>"

Key parameters:

temperature -- randomness (0.0 = deterministic, 1.0+ = creative).
num_ctx -- context window in tokens. Higher values use more VRAM.
top_p -- nucleus sampling threshold.
top_k -- limits token selection pool.
repeat_penalty -- penalizes repeated tokens.
stop -- stop sequence(s).
num_gpu -- layers to offload to GPU (0 for CPU-only).
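The same parameters can also be set for a single request through the "options" object in an /api/generate or /api/chat body, without creating a custom model. A minimal sketch; with_options is a hypothetical helper, and the values mirror the Modelfile above.

```python
def with_options(model: str, prompt: str, **options) -> dict:
    """Build an /api/generate body that overrides model parameters for one request."""
    body = {"model": model, "prompt": prompt, "stream": False}
    if options:
        body["options"] = dict(options)
    return body


# Same settings as the Modelfile example, applied per request
payload = with_options(
    "llama3",
    "Summarize RAII in one sentence",
    temperature=0.3,
    num_ctx=8192,
    stop=["<|eot_id|>"],
)
```

Per-request options suit experimentation; a Modelfile is the better fit once the settings should travel with a named model.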

Create Custom Models from Modelfile

ollama create my-coder -f ./Modelfile
ollama run my-coder

To update, edit the Modelfile and run ollama create again with the same name.

GPU vs CPU Detection and Configuration

Ollama auto-detects NVIDIA (CUDA), AMD (ROCm), and Apple Silicon (Metal) GPUs.

ollama ps   # PROCESSOR column shows the split, e.g. "100% GPU" or "100% CPU"

Force CPU-only on NVIDIA hardware by hiding the GPUs from the server with an invalid device ID:

CUDA_VISIBLE_DEVICES=-1 ollama serve

Per-model GPU control in a Modelfile:

PARAMETER num_gpu 0    # force CPU
PARAMETER num_gpu 999  # request all layers on the GPU (by default Ollama chooses the layer count automatically)

To restrict Ollama to specific NVIDIA GPUs, set CUDA_VISIBLE_DEVICES=0,1 before starting the server.

Popular Models and When to Use Which

| Model | Size | Best for |
|---|---|---|
| llama3 (8B) | 4.7 GB | General chat, reasoning, instruction following |
| llama3:70b | 40 GB | Higher quality when you have the VRAM |
| codellama (7B) | 3.8 GB | Code generation, completion, infilling |
| mistral (7B) | 4.1 GB | Fast general-purpose, structured output |
| phi3 (3.8B) | 2.2 GB | Small footprint, good quality for its size |
| gemma2 (9B) | 5.4 GB | Strong reasoning, multilingual |
| deepseek-coder-v2 (16B) | 8.9 GB | Code generation, multi-language |
| nomic-embed-text | 274 MB | Text embeddings for RAG |
| llava (7B) | 4.7 GB | Multi-modal image understanding |

For constrained hardware (8 GB RAM), use phi3 or llama3 with q4 quantization.

Embeddings for RAG

Use nomic-embed-text (768 dimensions) or mxbai-embed-large (1024 dimensions).

curl http://localhost:11434/api/embed -d '{
  "model": "nomic-embed-text",
  "input": ["document chunk one", "document chunk two"]
}'

Response contains an "embeddings" array of float vectors. Store in a vector database (ChromaDB, pgvector, Qdrant, FAISS) for similarity search.
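For a handful of chunks, similarity search does not need a vector database: cosine similarity over the returned vectors is enough. A pure-Python sketch under that assumption; cosine_similarity and rank_chunks are illustrative names, and the vectors would come from the "embeddings" array of the response.

```python
import math


def cosine_similarity(a: list[float], b: list[float]) -> float:
    """Cosine of the angle between two vectors; 0.0 if either has zero length."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def rank_chunks(query_vec: list[float], chunk_vecs: list[list[float]],
                chunks: list[str], top_k: int = 3) -> list[tuple[float, str]]:
    """Return the top_k (score, chunk) pairs, most similar first."""
    scored = [(cosine_similarity(query_vec, v), c) for v, c in zip(chunk_vecs, chunks)]
    return sorted(scored, key=lambda pair: pair[0], reverse=True)[:top_k]
```

The query must be embedded with the same model as the documents; vectors from different embedding models are not comparable.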

Integration with Python

Using requests

import requests

response = requests.post("http://localhost:11434/api/generate", json={
    "model": "llama3",
    "prompt": "Explain monads simply",
    "stream": False
})
print(response.json()["response"])

Using ollama-python (`pip install ollama`)

import ollama

# Chat
response = ollama.chat(model="llama3", messages=[
    {"role": "user", "content": "Explain monads simply"}
])
print(response["message"]["content"])

# Embeddings
result = ollama.embed(model="nomic-embed-text", input="some text")
print(len(result["embeddings"][0]))  # 768

# Streaming
for chunk in ollama.chat(model="llama3", messages=[
    {"role": "user", "content": "Write a haiku"}
], stream=True):
    print(chunk["message"]["content"], end="", flush=True)

Integration with JavaScript

Using fetch

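const response = await fetch("http://localhost:11434/api/generate", {
  method: "POST",
  headers: { "Content-Type": "application/json" },
  body: JSON.stringify({ model: "llama3", prompt: "Explain closures", stream: false }),
});
if (!response.ok) throw new Error(`Ollama request failed: ${response.status}`);
const data = await response.json();
console.log(data.response);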

Using ollama-js (`npm install ollama`)

import { Ollama } from "ollama";
const ollama = new Ollama({ host: "http://localhost:11434" });

const response = await ollama.chat({
  model: "llama3",
  messages: [{ role: "user", content: "Explain closures" }],
});
console.log(response.message.content);

Integration with LangChain

Python (pip install langchain-ollama):

from langchain_ollama import ChatOllama, OllamaEmbeddings

llm = ChatOllama(model="llama3", temperature=0.3)
response = llm.invoke("What is dependency injection?")

embeddings = OllamaEmbeddings(model="nomic-embed-text")
vectors = embeddings.embed_documents(["first chunk", "second chunk"])

JavaScript (npm install @langchain/ollama):

import { ChatOllama } from "@langchain/ollama";
const llm = new ChatOllama({ model: "llama3", temperature: 0.3 });
const response = await llm.invoke("What is dependency injection?");

Multi-Modal Models

Use llava or llava-llama3 for image understanding:

curl http://localhost:11434/api/chat -d '{
  "model": "llava",
  "messages": [{"role": "user", "content": "Describe this image", "images": ["BASE64_DATA"]}],
  "stream": false
}'

import base64

import ollama

# Read the image and base64-encode it for the "images" field
with open("photo.jpg", "rb") as f:
    img = base64.b64encode(f.read()).decode()

response = ollama.chat(model="llava", messages=[
    {"role": "user", "content": "What do you see?", "images": [img]}
])
print(response["message"]["content"])
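The REST request for an image can be assembled without the ollama package: the "images" field holds base64 strings, one per image. A standard-library sketch; image_message and image_chat_body are hypothetical helper names.

```python
import base64


def image_message(prompt: str, image_bytes: bytes) -> dict:
    """Build a user message with one base64-encoded image attached."""
    encoded = base64.b64encode(image_bytes).decode("ascii")
    return {"role": "user", "content": prompt, "images": [encoded]}


def image_chat_body(model: str, prompt: str, image_bytes: bytes) -> dict:
    """Build a non-streaming /api/chat body for a multi-modal model such as llava."""
    return {"model": model, "messages": [image_message(prompt, image_bytes)], "stream": False}
```

The resulting dictionary can be serialized with json.dumps and posted to /api/chat in the same way as a text-only request.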

Performance Tuning

Context size -- use the smallest that fits your workload:

PARAMETER num_ctx 4096   # default in many releases; older ones used 2048
PARAMETER num_ctx 32768  # long documents, more VRAM

Partial GPU offloading when model exceeds VRAM:

PARAMETER num_gpu 20     # 20 layers on GPU, rest on CPU

Batch size for prompt processing speed:

PARAMETER num_batch 512  # default; larger values can speed up prompt processing but use more memory

Keep-alive control (how long model stays loaded):

curl http://localhost:11434/api/generate -d '{
  "model": "llama3", "prompt": "hi", "keep_alive": "30m"
}'

Use "keep_alive": 0 to unload immediately, -1 to keep indefinitely.
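A request that names a model but carries no prompt only loads it, or unloads it when keep_alive is 0, which makes it possible to preload or free a model from a script. A sketch with the standard library; keep_alive_body and set_keep_alive are illustrative names, and the base URL assumes the default address.

```python
import json
import urllib.request


def keep_alive_body(model: str, keep_alive) -> dict:
    """Body for /api/generate with no prompt: loads the model, or unloads it if keep_alive is 0."""
    return {"model": model, "keep_alive": keep_alive, "stream": False}


def set_keep_alive(model: str, keep_alive, base_url: str = "http://localhost:11434") -> dict:
    """Send the load/unload request to a running Ollama server and return its reply."""
    data = json.dumps(keep_alive_body(model, keep_alive)).encode()
    req = urllib.request.Request(f"{base_url}/api/generate", data=data,
                                 headers={"Content-Type": "application/json"})
    with urllib.request.urlopen(req) as resp:
        return json.load(resp)
```

For example, set_keep_alive("llama3", -1) would preload the model and pin it, and set_keep_alive("llama3", 0) would release its memory.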

Server-wide environment variables:

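OLLAMA_MAX_LOADED_MODELS -- maximum number of models loaded at once, memory permitting (documented default: 3 per GPU, or 3 for CPU inference).
OLLAMA_NUM_PARALLEL -- concurrent requests per model; the default depends on the release and on available memory.
OLLAMA_FLASH_ATTENTION=1 -- enable flash attention, which reduces memory use as the context grows.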

Running as a Service

macOS (Homebrew):

brew services start ollama

Linux (systemd -- created automatically by the install script):

sudo systemctl enable --now ollama
sudo journalctl -u ollama -f

Customize with sudo systemctl edit ollama:

[Service]
Environment="OLLAMA_HOST=0.0.0.0:11434"
Environment="OLLAMA_NUM_PARALLEL=4"

Then sudo systemctl daemon-reload && sudo systemctl restart ollama.

Docker:

docker run -d -v ollama:/root/.ollama -p 11434:11434 --name ollama ollama/ollama
docker exec -it ollama ollama run llama3

With an NVIDIA GPU, add --gpus=all to the run command (the host needs the NVIDIA Container Toolkit installed).


Ollama for running local LLMs — model management, API usage, and integration patterns. Use when user mentions "ollama", "local LLM", "run llama locally", "local AI", "ollama run", "ollama pull", "self-hosted model", "offline AI", "local inference", or running language models on their own machine.

3 stars

development

Updated Apr 28, 2026


npx skillsauth add 1mangesh1/dev-skills-collection ollama

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---|---|---|---|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: Apr 28, 2026, 3:59 PM (194.8 s, 1 file scanned)




Install manually


# Clone the repo
git clone https://github.com/1mangesh1/dev-skills-collection.git

# Copy into Claude Code skills folder (global)
cp -r dev-skills-collection/skills/ollama ~/.claude/skills/


Repository

1mangesh1/dev-skills-collection

3 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT
