Ollama Integration: $ARGUMENTS

Patterns for local AI with Ollama.

Ollama Basics

Check Status

# Is Ollama running?
curl http://localhost:11434/api/tags

# List models
ollama list

# Pull a model
ollama pull llama3.2
ollama pull mistral
ollama pull codellama
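
The same status check can be scripted. This is a minimal sketch, assuming the default address `http://localhost:11434` and using only the standard library; `fetch_tags` and `model_names` are hypothetical helper names, not part of Ollama.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434"  # assumed default address


def model_names(tags_payload: dict) -> list:
    """Extract model names from a parsed /api/tags response body."""
    return [m["name"] for m in tags_payload.get("models", [])]


def fetch_tags(base_url: str = OLLAMA_URL, timeout: float = 2.0):
    """Return the parsed /api/tags body, or None if the server is unreachable."""
    try:
        with urllib.request.urlopen(f"{base_url}/api/tags", timeout=timeout) as resp:
            return json.load(resp)
    except (OSError, ValueError):
        # OSError covers refused connections, timeouts, and HTTP errors;
        # ValueError covers a body that is not valid JSON.
        return None
```

Calling `fetch_tags()` returns `None` when the server is down, so a startup script can branch on that before sending any prompts.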

Quick Test

# Chat
ollama run llama3.2 "Hello, how are you?"

# From API
curl http://localhost:11434/api/generate -d '{
  "model": "llama3.2",
  "prompt": "Hello",
  "stream": false
}'

Integration Patterns

1. Python (Recommended for AI work)

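import requests

def ask_ollama(prompt: str, model: str = "llama3.2") -> str:
    # Non-streaming call: the whole answer arrives in one JSON object.
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "stream": False
        },
        timeout=120,  # the first request can be slow while the model loads
    )
    response.raise_for_status()  # surface HTTP errors such as an unknown model
    return response.json()["response"]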

# Usage
answer = ask_ollama("What is the capital of France?")

With streaming:

import requests
import json

def stream_ollama(prompt: str, model: str = "llama3.2"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={"model": model, "prompt": prompt, "stream": True},
        stream=True
    )
    for line in response.iter_lines():
        if line:
            data = json.loads(line)
            yield data.get("response", "")
            if data.get("done"):
                break

# Usage
for chunk in stream_ollama("Tell me a story"):
    print(chunk, end="", flush=True)
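
The streaming endpoint sends one JSON object per line. `iter_lines()` already reassembles lines, but a client that reads raw chunks has to buffer them itself, because a chunk can end in the middle of a line. A minimal sketch of that buffering, where `iter_ndjson` is a hypothetical helper name:

```python
import json
from typing import Iterable, Iterator


def iter_ndjson(chunks: Iterable[bytes]) -> Iterator[dict]:
    """Yield one parsed object per complete newline-terminated JSON line."""
    buffer = b""
    for chunk in chunks:
        buffer += chunk
        # Everything before the last newline is complete; keep the remainder.
        *lines, buffer = buffer.split(b"\n")
        for line in lines:
            if line.strip():
                yield json.loads(line)
    if buffer.strip():  # final line that arrived without a trailing newline
        yield json.loads(buffer)
```

With `requests`, it could be fed from `response.iter_content(chunk_size=None)` in place of `iter_lines()`.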

2. TypeScript/Node

async function askOllama(prompt: string, model = "llama3.2"): Promise<string> {
    const response = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, prompt, stream: false })
    });
    const data = await response.json();
    return data.response;
}

With streaming:

async function* streamOllama(prompt: string, model = "llama3.2") {
    const response = await fetch("http://localhost:11434/api/generate", {
        method: "POST",
        headers: { "Content-Type": "application/json" },
        body: JSON.stringify({ model, prompt, stream: true })
    });

    const reader = response.body?.getReader();
    const decoder = new TextDecoder();
    let buffer = ""; // holds a partial JSON line between chunks

    while (reader) {
        const { done, value } = await reader.read();
        if (done) break;

        // A chunk can end mid-line, so only parse complete lines.
        buffer += decoder.decode(value, { stream: true });
        const lines = buffer.split("\n");
        buffer = lines.pop() ?? "";
        for (const line of lines) {
            if (line.trim()) {
                const data = JSON.parse(line);
                yield data.response || "";
            }
        }
    }
}

3. React Hook

import { useState, useCallback } from "react";

function useOllama(model = "llama3.2") {
    const [loading, setLoading] = useState(false);
    const [response, setResponse] = useState("");

    const ask = useCallback(async (prompt: string) => {
        setLoading(true);
        setResponse("");

        try {
            const res = await fetch("http://localhost:11434/api/generate", {
                method: "POST",
                headers: { "Content-Type": "application/json" },
                body: JSON.stringify({ model, prompt, stream: true })
            });

            const reader = res.body?.getReader();
            const decoder = new TextDecoder();
            let buffer = ""; // holds a partial JSON line between chunks

            while (reader) {
                const { done, value } = await reader.read();
                if (done) break;

                // A chunk can end mid-line, so only parse complete lines.
                buffer += decoder.decode(value, { stream: true });
                const lines = buffer.split("\n");
                buffer = lines.pop() ?? "";
                for (const line of lines) {
                    if (line.trim()) {
                        const data = JSON.parse(line);
                        setResponse(prev => prev + (data.response || ""));
                    }
                }
            }
        } finally {
            setLoading(false);
        }
    }, [model]);

    return { ask, response, loading };
}

RAG (Retrieval Augmented Generation)

For the store AI that answers questions about SOPs/documents:

Simple RAG Pattern

from sentence_transformers import SentenceTransformer
import numpy as np
import requests

# 1. Embed your documents
embedder = SentenceTransformer('all-MiniLM-L6-v2')

documents = [
    "Return policy: Items can be returned within 30 days...",
    "Repair intake: First, get customer name and phone...",
    # ... more SOPs
]

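# Normalize so the dot product below is cosine similarity
embeddings = embedder.encode(documents, normalize_embeddings=True)

# 2. On query, find relevant docs
def find_relevant(query: str, top_k: int = 3):
    query_embedding = embedder.encode([query], normalize_embeddings=True)[0]
    scores = np.dot(embeddings, query_embedding)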
    top_indices = np.argsort(scores)[-top_k:][::-1]
    return [documents[i] for i in top_indices]

# 3. Ask Ollama with context
def ask_with_context(question: str):
    relevant_docs = find_relevant(question)
    context = "\n\n".join(relevant_docs)

    prompt = f"""Use the following context to answer the question.

Context:
{context}

Question: {question}

Answer:"""

    return ask_ollama(prompt)
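
The ranking step in `find_relevant` is a cosine-similarity top-k. A small sketch of that arithmetic, written without NumPy so each step is visible; `cosine`, `top_k`, and the `min_score` cutoff are illustrative additions, not part of the pattern above:

```python
import math


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def top_k(doc_vectors, query_vector, k=3, min_score=0.0):
    """Return (index, score) pairs for the k best-matching vectors, best first."""
    scored = [(i, cosine(v, query_vector)) for i, v in enumerate(doc_vectors)]
    # Drop weak matches so unrelated SOPs are not passed to the model as context.
    scored = [(i, s) for i, s in scored if s >= min_score]
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]
```

A score cutoff is one way to support the "if you don't know, say so" behavior: when nothing clears the threshold, the prompt can say that no relevant document was found.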

Vector DB Option (ChromaDB)

import chromadb

# Setup: PersistentClient stores the collection on disk
# (Chroma 0.4+; the older Settings(chroma_db_impl=...) form no longer works)
client = chromadb.PersistentClient(path="./store_knowledge")

collection = client.get_or_create_collection("sops")

# Add documents
collection.add(
    documents=["Return policy...", "Repair intake..."],
    ids=["sop-1", "sop-2"],
    metadatas=[{"category": "policy"}, {"category": "repair"}]
)

# Query
results = collection.query(
    query_texts=["how do I process a return"],
    n_results=3
)
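
The query result still has to be turned into a prompt. A minimal sketch, assuming a single query text was sent; `prompt_from_results` is a hypothetical helper that reuses the prompt wording from the simple RAG pattern:

```python
def prompt_from_results(results: dict, question: str) -> str:
    """Build a RAG prompt from a Chroma query result for one query text."""
    # Chroma returns one list of documents per query text; we sent one query.
    docs = results["documents"][0]
    context = "\n\n".join(docs)
    return (
        "Use the following context to answer the question.\n\n"
        f"Context:\n{context}\n\n"
        f"Question: {question}\n\n"
        "Answer:"
    )
```

The returned string can be passed straight to `ask_ollama`.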

Model Selection

| Model | Best For | Size |
|-------|----------|------|
| llama3.2 | General purpose, good balance | 3B |
| llama3.1:70b | Best quality (needs big GPU) | 70B |
| mistral | Fast, good for chat | 7B |
| codellama | Code generation | 7B-34B |
| phi3 | Small, fast, surprisingly good | 3.8B |
| mixtral | High quality, MoE | 8x7B |

For store AI POC: Start with llama3.2 or mistral. Upgrade if needed.

System Prompts

def ask_with_system(prompt: str, system: str, model: str = "llama3.2"):
    response = requests.post(
        "http://localhost:11434/api/generate",
        json={
            "model": model,
            "prompt": prompt,
            "system": system,
            "stream": False
        }
    )
    return response.json()["response"]

# Store AI system prompt
STORE_SYSTEM = """You are the Computer Connection store assistant.
You help staff and customers with questions about store policies,
repair procedures, and product information.
Be concise and helpful. If you don't know, say so.
Never make up information about store policies."""

Chat/Conversation

def chat(messages: list, model: str = "llama3.2"):
    """
    messages = [
        {"role": "user", "content": "Hello"},
        {"role": "assistant", "content": "Hi there!"},
        {"role": "user", "content": "What's your return policy?"}
    ]
    """
    response = requests.post(
        "http://localhost:11434/api/chat",
        json={
            "model": model,
            "messages": messages,
            "stream": False
        }
    )
    return response.json()["message"]["content"]
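
The chat endpoint is stateless, so the caller has to resend the full history on every turn. A minimal sketch of that bookkeeping; the `Conversation` class is hypothetical, and `send` is any function with the same shape as `chat` above:

```python
from typing import Callable, Dict, List

Message = Dict[str, str]


class Conversation:
    """Keeps the message history and appends each reply to it."""

    def __init__(self, send: Callable[[List[Message]], str], system: str = ""):
        self._send = send
        self.messages: List[Message] = []
        if system:
            # A leading system message sets the assistant's behavior.
            self.messages.append({"role": "system", "content": system})

    def ask(self, text: str) -> str:
        self.messages.append({"role": "user", "content": text})
        reply = self._send(self.messages)
        self.messages.append({"role": "assistant", "content": reply})
        return reply
```

For the store assistant this could be created as `Conversation(send=chat, system=STORE_SYSTEM)`. Long conversations will eventually exceed the context length, so older turns may need trimming.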

Performance Tips

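Keep model loaded: The first request is slow because the model has to be loaded into memory, and by default it is unloaded after 5 minutes of inactivity. Keep it warm between requests.

# Load the model and keep it in memory for 24 hours
curl http://localhost:11434/api/generate -d '{"model":"llama3.2","keep_alive":"24h"}'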

Use appropriate model size: Don't use 70B if 7B works.
Batch similar requests: If processing multiple docs, batch them.

Set context length: A smaller num_ctx uses less memory and speeds up prompt processing.

json={"model": model, "prompt": prompt, "options": {"num_ctx": 2048}}

GPU vs CPU: Ensure Ollama is using GPU

# Check which processor the loaded model is running on
ollama ps
# The PROCESSOR column should show GPU (e.g. "100% GPU"), not CPU
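
One way to read the "batch similar requests" tip is to send several prompts concurrently instead of one at a time. This is a sketch under that assumption; `ask_many` is a hypothetical helper, and `ask` is any single-prompt function such as `ask_ollama`:

```python
from concurrent.futures import ThreadPoolExecutor
from typing import Callable, List


def ask_many(prompts: List[str], ask: Callable[[str], str], max_workers: int = 2) -> List[str]:
    """Run ask() over all prompts with a small thread pool, keeping input order."""
    with ThreadPoolExecutor(max_workers=max_workers) as pool:
        # pool.map returns results in the same order as the prompts.
        return list(pool.map(ask, prompts))
```

Whether the server actually runs the requests in parallel depends on its configuration and available memory, so it is worth measuring before raising `max_workers`.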

Troubleshooting

Ollama not responding

# Restart the service (Linux installs that use systemd)
sudo systemctl restart ollama
# or run the server in the foreground
ollama serve

Model too slow

Use smaller model
Check GPU is being used
Reduce context length
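
To tell whether one of these changes actually helps, measure generation speed. The final `/api/generate` response reports `eval_count` (tokens generated) and `eval_duration` and `load_duration` in nanoseconds. A minimal sketch that turns them into readable numbers; `generation_stats` is a hypothetical helper name:

```python
def generation_stats(response_body: dict) -> dict:
    """Summarize timing fields from a non-streaming /api/generate response."""
    ns_per_second = 1e9  # Ollama reports durations in nanoseconds
    tokens = response_body.get("eval_count", 0)
    eval_seconds = response_body.get("eval_duration", 0) / ns_per_second
    return {
        "load_seconds": response_body.get("load_duration", 0) / ns_per_second,
        "tokens": tokens,
        "tokens_per_second": tokens / eval_seconds if eval_seconds else 0.0,
    }
```

A large `load_seconds` points at model loading (see the keep-alive tip), while a low `tokens_per_second` points at model size or CPU fallback.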

Out of memory

Use quantized model (e.g., llama3.2:3b-instruct-q4_0)
Reduce context length
Close other GPU apps

Output

Based on $ARGUMENTS:

Identify the integration pattern needed
Provide copy-paste code
Suggest appropriate model
Note any gotchas

