Ollama Optimizer

Optimize Ollama configuration based on system hardware analysis.

When to Use

Use this skill when the user asks to optimize Ollama, configure Ollama, speed up Ollama, fix Ollama running slow, set up a local LLM, tune inference speed, reduce memory usage, or select models that fit their GPU/RAM. The skill analyzes hardware (GPU, VRAM, RAM, CPU) and produces tailored recommendations.

Do not use for LM Studio, llama.cpp, vLLM, or hosted-API LLM providers (OpenAI, Anthropic) — those use different runtimes and tuning surfaces.

Workflow

Phase 1: System Detection

Run the detection script to gather hardware information:

python3 scripts/detect_system.py

Parse the JSON output to identify:

OS and version
CPU model and core count
Total RAM / unified memory
GPU type, VRAM, and driver version
Current Ollama installation and environment variables

Phase 2: Analyze and Recommend

Based on detected hardware, determine the optimization profile:

Hardware Tier Classification:

| Tier | Criteria | Max Model | Key Optimizations | |------|----------|-----------|-------------------| | CPU-only | No GPU detected | 3B | num_thread tuning, Q4_K_M quant | | Low VRAM | <6GB VRAM | 3B | Flash attention, KV cache q4_0 | | Entry | 6-8GB VRAM | 8B | Flash attention, KV cache q8_0 | | Prosumer | 10-12GB VRAM | 14B | Flash attention, full offload | | Workstation | 16-24GB VRAM | 32B | Standard config, Q5_K_M option | | High-end | 48GB+ VRAM | 70B+ | Multiple models, Q5/Q6 quants |

Apple Silicon Special Case:

Unified memory = shared CPU/GPU RAM
8GB Mac → treat as 6GB VRAM tier
16GB Mac → treat as 12GB VRAM tier
32GB+ Mac → treat as workstation tier

Phase 3: Generate Optimization Plan

Create a structured optimization guide with these sections:

1. System Overview

Present detected hardware specs and highlight constraints (e.g., "8GB unified memory limits to 7B models").

2. Dependency Assessment

List what's needed based on the platform:

macOS: Ollama only (Metal automatic)
Linux NVIDIA: Ollama + NVIDIA driver 450+
Linux AMD: Ollama + ROCm 5.0+
Windows: Ollama + NVIDIA driver 452+

3. Configuration Recommendations

Essential environment variables:

# Always recommended
export OLLAMA_FLASH_ATTENTION=1

# Memory-constrained systems (<12GB)
export OLLAMA_KV_CACHE_TYPE=q8_0  # or q4_0 for severe constraints

Model selection guidance:

Recommend specific models from ollama list output
Suggest appropriate quantization (Q4_K_M default, Q5_K_M if headroom exists)
Warn if current models exceed hardware capacity

Modelfile tuning (when needed):

PARAMETER num_gpu <layers>    # Partial offload for limited VRAM
PARAMETER num_thread <cores>  # CPU threads (physical cores, not hyperthreads)
PARAMETER num_ctx <size>      # Reduce context for memory savings

4. Execution Checklist

Provide copy-paste commands in order:

Set environment variables
Restart Ollama service
Pull recommended models
Test with ollama run <model> --verbose

5. Verification Commands

# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>
# Expected output: tokens/s and generation latency. Compare against tier baseline from Phase 2.

# Check GPU memory usage (NVIDIA)
nvidia-smi

# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20

Acceptance Criteria

A run passes when all of the following are true:

[ ] Hardware tier (CPU-only / Low-VRAM / Entry / Prosumer / Workstation / High-end) is identified explicitly in the report.
[ ] Recommended model size fits within detected VRAM/unified-memory budget (no recommending a 14B model on an 8GB Mac).
[ ] Required Ollama environment variables (e.g., OLLAMA_FLASH_ATTENTION, KV-cache quantisation) are written to a shell init file the user actually uses, with a backup of the prior file.
[ ] Apple Silicon special case is applied when detected — unified memory is not double-counted as separate VRAM + RAM.
[ ] Verification step runs ollama run <model> with --verbose and captures the actual offload/cache numbers.
[ ] Rollback instructions are included so the user can revert all env changes with one command.

Step Completion Reports

After completing each major step, output a status report in this format:

◆ [Step Name] ([step N of M] — [context])
··································································
  [Check 1]:          √ pass
  [Check 2]:          √ pass (note if relevant)
  [Check 3]:          × fail — [reason]
  [Check 4]:          √ pass
  [Criteria]:         √ N/M met
  ____________________________
  Result:             PASS | FAIL | PARTIAL

Adapt the check names to match what the step actually validates. Use √ for pass, × for fail, and — to add brief context. The "Criteria" line summarizes how many acceptance criteria were met. The "Result" line gives the overall verdict.

Detection (step 1 of 4)

◆ Detection (step 1 of 4 — hardware profiling)
··································································
  Hardware detected:      √ pass — macOS 14, Apple M2
  GPU identified:         √ pass — Apple Metal (unified memory)
  RAM measured:           √ pass — 16GB unified memory
  [Criteria]:             √ 3/3 met
  ____________________________
  Result:                 PASS

Analysis (step 2 of 4)

◆ Analysis (step 2 of 4 — profile selection)
··································································
  Tier classified:        √ pass — Prosumer (16GB unified)
  Profile selected:       √ pass — Flash attention, full offload
  Bottlenecks identified: √ pass — memory bandwidth primary constraint
  [Criteria]:             √ 3/3 met
  ____________________________
  Result:                 PASS

Plan (step 3 of 4)

◆ Plan (step 3 of 4 — optimization guide)
··································································
  Guide generated:        √ pass — ollama-optimization-guide.md written
  Parameters tuned:       √ pass — OLLAMA_FLASH_ATTENTION=1, KV_CACHE_TYPE=q8_0
  Model recommendations ready: √ pass — llama3.1:14b-instruct-q4_K_M suggested
  [Criteria]:             √ 3/3 met
  ____________________________
  Result:                 PASS

Verification (step 4 of 4)

◆ Verification (step 4 of 4 — config validation)
··································································
  Benchmark commands listed: √ pass — python3 scripts/benchmark_ollama.py
  Config verified:        √ pass — ollama run --verbose output checked
  [Criteria]:             √ 2/2 met
  ____________________________
  Result:                 PASS

Reference Files

VRAM Requirements - Model sizing and quantization guide
Environment Variables - Complete env var reference
Platform-Specific Setup - OS-specific installation and configuration

Expected Output

Generate an ollama-optimization-guide.md file. Ask the user where to save it (suggest ~/.config/ollama/optimization-guide.md or current directory). Contents:

# Ollama Optimization Guide

**Generated:** <timestamp>
**System:** <OS> | <CPU> | <RAM>GB RAM | <GPU>

## System Overview
<hardware summary and constraints>

## Current Configuration
<existing Ollama setup and env vars>

## Recommendations

### Environment Variables
<shell commands to set vars>

### Model Selection
<recommended models with rationale>

### Performance Tuning
<Modelfile adjustments if needed>

## Execution Checklist
- [ ] <step 1>
- [ ] <step 2>
...

## Verification
<benchmark commands and expected results>

## Rollback
<commands to revert changes if needed>

Quick Optimization Commands (opt-in only)

Only use this fast path if the user explicitly asks to skip full hardware analysis. Otherwise always run Phases 1-3 and follow the tier-based recommendation — do not apply these shortcuts by default, and do not let them override a tier decision already made. For the actual per-platform commands and env vars, see Platform-Specific Setup and Environment Variables.

Ollama Optimizer

Optimize Ollama configuration based on system hardware analysis.

When to Use

Do not use for LM Studio, llama.cpp, vLLM, or hosted-API LLM providers (OpenAI, Anthropic) — those use different runtimes and tuning surfaces.

Workflow

Phase 1: System Detection

Run the detection script to gather hardware information:

python3 scripts/detect_system.py

Parse the JSON output to identify:

OS and version
CPU model and core count
Total RAM / unified memory
GPU type, VRAM, and driver version
Current Ollama installation and environment variables

Phase 2: Analyze and Recommend

Based on detected hardware, determine the optimization profile:

Hardware Tier Classification:

Apple Silicon Special Case:

Unified memory = shared CPU/GPU RAM
8GB Mac → treat as 6GB VRAM tier
16GB Mac → treat as 12GB VRAM tier
32GB+ Mac → treat as workstation tier

Phase 3: Generate Optimization Plan

Create a structured optimization guide with these sections:

1. System Overview

Present detected hardware specs and highlight constraints (e.g., "8GB unified memory limits to 7B models").

2. Dependency Assessment

List what's needed based on the platform:

macOS: Ollama only (Metal automatic)
Linux NVIDIA: Ollama + NVIDIA driver 450+
Linux AMD: Ollama + ROCm 5.0+
Windows: Ollama + NVIDIA driver 452+

3. Configuration Recommendations

Essential environment variables:

# Always recommended
export OLLAMA_FLASH_ATTENTION=1

# Memory-constrained systems (<12GB)
export OLLAMA_KV_CACHE_TYPE=q8_0  # or q4_0 for severe constraints

Model selection guidance:

Recommend specific models from ollama list output
Suggest appropriate quantization (Q4_K_M default, Q5_K_M if headroom exists)
Warn if current models exceed hardware capacity

Modelfile tuning (when needed):

PARAMETER num_gpu <layers>    # Partial offload for limited VRAM
PARAMETER num_thread <cores>  # CPU threads (physical cores, not hyperthreads)
PARAMETER num_ctx <size>      # Reduce context for memory savings

4. Execution Checklist

Provide copy-paste commands in order:

Set environment variables
Restart Ollama service
Pull recommended models
Test with ollama run <model> --verbose

5. Verification Commands

# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>
# Expected output: tokens/s and generation latency. Compare against tier baseline from Phase 2.

# Check GPU memory usage (NVIDIA)
nvidia-smi

# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20

Acceptance Criteria

A run passes when all of the following are true:

[ ] Hardware tier (CPU-only / Low-VRAM / Entry / Prosumer / Workstation / High-end) is identified explicitly in the report.
[ ] Recommended model size fits within detected VRAM/unified-memory budget (no recommending a 14B model on an 8GB Mac).
[ ] Required Ollama environment variables (e.g., OLLAMA_FLASH_ATTENTION, KV-cache quantisation) are written to a shell init file the user actually uses, with a backup of the prior file.
[ ] Apple Silicon special case is applied when detected — unified memory is not double-counted as separate VRAM + RAM.
[ ] Verification step runs ollama run <model> with --verbose and captures the actual offload/cache numbers.
[ ] Rollback instructions are included so the user can revert all env changes with one command.

Step Completion Reports

After completing each major step, output a status report in this format:

◆ [Step Name] ([step N of M] — [context])
··································································
  [Check 1]:          √ pass
  [Check 2]:          √ pass (note if relevant)
  [Check 3]:          × fail — [reason]
  [Check 4]:          √ pass
  [Criteria]:         √ N/M met
  ____________________________
  Result:             PASS | FAIL | PARTIAL

Detection (step 1 of 4)

◆ Detection (step 1 of 4 — hardware profiling)
··································································
  Hardware detected:      √ pass — macOS 14, Apple M2
  GPU identified:         √ pass — Apple Metal (unified memory)
  RAM measured:           √ pass — 16GB unified memory
  [Criteria]:             √ 3/3 met
  ____________________________
  Result:                 PASS

Analysis (step 2 of 4)

◆ Analysis (step 2 of 4 — profile selection)
··································································
  Tier classified:        √ pass — Prosumer (16GB unified)
  Profile selected:       √ pass — Flash attention, full offload
  Bottlenecks identified: √ pass — memory bandwidth primary constraint
  [Criteria]:             √ 3/3 met
  ____________________________
  Result:                 PASS

Plan (step 3 of 4)

◆ Plan (step 3 of 4 — optimization guide)
··································································
  Guide generated:        √ pass — ollama-optimization-guide.md written
  Parameters tuned:       √ pass — OLLAMA_FLASH_ATTENTION=1, KV_CACHE_TYPE=q8_0
  Model recommendations ready: √ pass — llama3.1:14b-instruct-q4_K_M suggested
  [Criteria]:             √ 3/3 met
  ____________________________
  Result:                 PASS

Verification (step 4 of 4)

◆ Verification (step 4 of 4 — config validation)
··································································
  Benchmark commands listed: √ pass — python3 scripts/benchmark_ollama.py
  Config verified:        √ pass — ollama run --verbose output checked
  [Criteria]:             √ 2/2 met
  ____________________________
  Result:                 PASS

Reference Files

VRAM Requirements - Model sizing and quantization guide
Environment Variables - Complete env var reference
Platform-Specific Setup - OS-specific installation and configuration

Expected Output

Generate an ollama-optimization-guide.md file. Ask the user where to save it (suggest ~/.config/ollama/optimization-guide.md or current directory). Contents:

# Ollama Optimization Guide

**Generated:** <timestamp>
**System:** <OS> | <CPU> | <RAM>GB RAM | <GPU>

## System Overview
<hardware summary and constraints>

## Current Configuration
<existing Ollama setup and env vars>

## Recommendations

### Environment Variables
<shell commands to set vars>

### Model Selection
<recommended models with rationale>

### Performance Tuning
<Modelfile adjustments if needed>

## Execution Checklist
- [ ] <step 1>
- [ ] <step 2>
...

## Verification
<benchmark commands and expected results>

## Rollback
<commands to revert changes if needed>

Adoption

luongnv89/ollama-optimizer

$ install --global

Security Scan Results

SKILL.md

Ollama Optimizer

When to Use

Workflow

Phase 1: System Detection

Phase 2: Analyze and Recommend

Phase 3: Generate Optimization Plan

1. System Overview

2. Dependency Assessment

3. Configuration Recommendations

4. Execution Checklist

5. Verification Commands

Acceptance Criteria

Step Completion Reports

Detection (step 1 of 4)

Analysis (step 2 of 4)

Plan (step 3 of 4)

Verification (step 4 of 4)

Reference Files

Expected Output

Quick Optimization Commands (opt-in only)

Related Skills

luongnv89/issue-work-loop

luongnv89/herdr-agent-comms

luongnv89/doc-manager

luongnv89/diagram-generator

luongnv89/ollama-optimizer

$ install --global

Security Scan Results

SKILL.md

Ollama Optimizer

When to Use

Workflow

Phase 1: System Detection

Phase 2: Analyze and Recommend

Phase 3: Generate Optimization Plan

1. System Overview

2. Dependency Assessment

3. Configuration Recommendations

4. Execution Checklist

5. Verification Commands

Acceptance Criteria

Step Completion Reports

Detection (step 1 of 4)

Analysis (step 2 of 4)

Plan (step 3 of 4)

Verification (step 4 of 4)

Reference Files

Expected Output

Quick Optimization Commands (opt-in only)

Related Skills

luongnv89/issue-work-loop

luongnv89/herdr-agent-comms

luongnv89/doc-manager

luongnv89/diagram-generator