skills/ollama-optimizer/SKILL.md
Optimize Ollama configuration for the current machine's hardware. Use when asked to speed up Ollama, tune local LLM performance, or pick models that fit available GPU/RAM.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add luongnv89/skills ollama-optimizer
Optimize Ollama configuration based on system hardware analysis.
Use this skill when the user asks to optimize Ollama, configure Ollama, speed up Ollama, fix Ollama running slow, set up a local LLM, tune inference speed, reduce memory usage, or select models that fit their GPU/RAM. The skill analyzes hardware (GPU, VRAM, RAM, CPU) and produces tailored recommendations.
Do not use for LM Studio, llama.cpp, vLLM, or hosted-API LLM providers (OpenAI, Anthropic) — those use different runtimes and tuning surfaces.
Before creating/updating/deleting files in an existing repository, sync the current branch with remote:
branch="$(git rev-parse --abbrev-ref HEAD)"
git fetch origin
git pull --rebase origin "$branch"
If the working tree is not clean, stash first, sync, then restore:
git stash push -u -m "pre-sync"
branch="$(git rev-parse --abbrev-ref HEAD)"
git fetch origin && git pull --rebase origin "$branch"
git stash pop
If origin is missing, pull is unavailable, or rebase/stash conflicts occur, stop and ask the user before continuing.
Run the detection script to gather hardware information:
python3 scripts/detect_system.py
Parse the JSON output to identify:
Based on detected hardware, determine the optimization profile:
Hardware Tier Classification:
| Tier | Criteria | Max Model | Key Optimizations |
|------|----------|-----------|-------------------|
| CPU-only | No GPU detected | 3B | num_thread tuning, Q4_K_M quant |
| Low VRAM | <6GB VRAM | 3B | Flash attention, KV cache q4_0 |
| Entry | 6-8GB VRAM | 8B | Flash attention, KV cache q8_0 |
| Prosumer | 10-12GB VRAM | 14B | Flash attention, full offload |
| Workstation | 16-24GB VRAM | 32B | Standard config, Q5_K_M option |
| High-end | 48GB+ VRAM | 70B+ | Multiple models, Q5/Q6 quants |
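The tier table can be sketched as a small lookup. This is a minimal illustration, not the logic of `scripts/detect_system.py`: the function name `classify_tier` is hypothetical, and because the table leaves the 8-10GB, 12-16GB, and 24-48GB ranges unassigned, the sketch assumes those gaps fall into the lower tier.

```python
from typing import Optional

# (exclusive upper bound in GB, tier name, max model size), from the tier table.
# ASSUMPTION: VRAM sizes in the table's gaps (8-10, 12-16, 24-48 GB) are
# assigned to the lower tier; the table itself does not say.
_TIERS = [
    (6.0, "Low VRAM", "3B"),
    (10.0, "Entry", "8B"),
    (16.0, "Prosumer", "14B"),
    (48.0, "Workstation", "32B"),
]


def classify_tier(vram_gb: Optional[float]) -> tuple[str, str]:
    """Return (tier, max_model) for a detected VRAM size in GB.

    Pass None (or 0) when no GPU was detected.
    """
    if vram_gb is None or vram_gb <= 0:
        return ("CPU-only", "3B")
    for upper_bound, tier, max_model in _TIERS:
        if vram_gb < upper_bound:
            return (tier, max_model)
    return ("High-end", "70B+")
```

For example, `classify_tier(8)` falls in the Entry tier and `classify_tier(24)` in the Workstation tier. Rounding the gaps down is the conservative choice, since it never recommends a model larger than the table allows.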
Apple Silicon Special Case:
Create a structured optimization guide with these sections:
Present detected hardware specs and highlight constraints (e.g., "8GB unified memory limits you to 7B models").
List what's needed based on the platform:
Essential environment variables:
# Always recommended
export OLLAMA_FLASH_ATTENTION=1
# Memory-constrained systems (<12GB)
export OLLAMA_KV_CACHE_TYPE=q8_0 # or q4_0 for severe constraints
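The rule above (flash attention always, KV-cache quantisation under 12GB) could be encoded roughly as follows. The helper names `recommend_env` and `to_exports` are hypothetical, and the 6GB cut-off for "severe constraints" is an assumption borrowed from the Low VRAM row of the tier table.

```python
def recommend_env(vram_gb: float) -> dict[str, str]:
    """Pick Ollama environment variables for a given memory budget in GB."""
    # Flash attention is always recommended.
    env = {"OLLAMA_FLASH_ATTENTION": "1"}
    if vram_gb < 6:
        # ASSUMPTION: "severe constraints" means the Low VRAM tier (<6GB).
        env["OLLAMA_KV_CACHE_TYPE"] = "q4_0"
    elif vram_gb < 12:
        # Memory-constrained systems (<12GB).
        env["OLLAMA_KV_CACHE_TYPE"] = "q8_0"
    return env


def to_exports(env: dict[str, str]) -> str:
    """Render the variables as shell export lines, one per line."""
    return "\n".join(f"export {name}={value}" for name, value in env.items())
```

Calling `to_exports(recommend_env(8))` yields the two export lines used in the 8GB quick configuration.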
Model selection guidance:
ollama list output
Modelfile tuning (when needed):
PARAMETER num_gpu <layers> # Partial offload for limited VRAM
PARAMETER num_thread <cores> # CPU threads (physical cores, not hyperthreads)
PARAMETER num_ctx <size> # Reduce context for memory savings
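As a rough sketch of how these parameters end up in a file, the hypothetical helper `render_modelfile` below builds Modelfile text from a base model and whichever of the three parameters are set. The values in the usage note are illustrative only, not tuned recommendations.

```python
from typing import Optional


def render_modelfile(
    base_model: str,
    num_gpu: Optional[int] = None,
    num_thread: Optional[int] = None,
    num_ctx: Optional[int] = None,
) -> str:
    """Build Modelfile text; parameters left as None are omitted."""
    lines = [f"FROM {base_model}"]
    params = (("num_gpu", num_gpu), ("num_thread", num_thread), ("num_ctx", num_ctx))
    for name, value in params:
        if value is not None:
            lines.append(f"PARAMETER {name} {value}")
    return "\n".join(lines) + "\n"
```

For instance, `render_modelfile("llama3.2:3b", num_thread=4)` produces a two-line Modelfile. Saved as `Modelfile`, it can be registered with `ollama create <name> -f Modelfile`.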
Provide copy-paste commands in order:
ollama run <model> --verbose
# Benchmark current performance
python3 scripts/benchmark_ollama.py --model <model>
# Expected output: tokens/s and generation latency. Compare against tier baseline from Phase 2.
# Check GPU memory usage (NVIDIA)
nvidia-smi
# Verify config is applied
ollama run <model> "test" --verbose 2>&1 | head -20
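The internals of `scripts/benchmark_ollama.py` are not shown here, so the following is only a sketch of how a tokens/s figure can be derived. It assumes a response dictionary shaped like the final JSON object from Ollama's `/api/generate` endpoint, which reports `eval_count` (generated tokens) and `eval_duration` (nanoseconds). The function name is hypothetical.

```python
def tokens_per_second(response: dict) -> float:
    """Generation speed from an Ollama generate response.

    ASSUMPTION: `response` carries `eval_count` (tokens generated) and
    `eval_duration` (nanoseconds spent generating them).
    """
    duration_ns = response["eval_duration"]
    if duration_ns <= 0:
        raise ValueError("eval_duration must be positive")
    # Multiply before dividing to convert nanoseconds to seconds.
    return response["eval_count"] * 1e9 / duration_ns
```

A response with 100 generated tokens over 2 seconds (`eval_duration` of 2,000,000,000) works out to 50 tokens/s, which can then be compared against the tier baseline.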
A run passes when all of the following are true:
- OLLAMA_FLASH_ATTENTION, KV-cache quantisation) are written to a shell init file the user actually uses, with a backup of the prior file.
- ollama run <model> with --verbose and captures the actual offload/cache numbers.
After completing each major step, output a status report in this format:
◆ [Step Name] ([step N of M] — [context])
··································································
[Check 1]: √ pass
[Check 2]: √ pass (note if relevant)
[Check 3]: × fail — [reason]
[Check 4]: √ pass
[Criteria]: √ N/M met
____________________________
Result: PASS | FAIL | PARTIAL
Adapt the check names to match what the step actually validates. Use √ for pass, × for fail, and — to add brief context. The "Criteria" line summarizes how many acceptance criteria were met. The "Result" line gives the overall verdict.
◆ Detection (step 1 of 4 — hardware profiling)
··································································
Hardware detected: √ pass — macOS 14, Apple M2
GPU identified: √ pass — Apple Metal (unified memory)
RAM measured: √ pass — 16GB unified memory
[Criteria]: √ 3/3 met
____________________________
Result: PASS
◆ Analysis (step 2 of 4 — profile selection)
··································································
Tier classified: √ pass — Prosumer (16GB unified)
Profile selected: √ pass — Flash attention, full offload
Bottlenecks identified: √ pass — memory bandwidth primary constraint
[Criteria]: √ 3/3 met
____________________________
Result: PASS
◆ Plan (step 3 of 4 — optimization guide)
··································································
Guide generated: √ pass — ollama-optimization-guide.md written
Parameters tuned: √ pass — OLLAMA_FLASH_ATTENTION=1, KV_CACHE_TYPE=q8_0
Model recommendations ready: √ pass — qwen2.5:14b-instruct-q4_K_M suggested
[Criteria]: √ 3/3 met
____________________________
Result: PASS
◆ Verification (step 4 of 4 — config validation)
··································································
Benchmark commands listed: √ pass — python3 scripts/benchmark_ollama.py
Config verified: √ pass — ollama run --verbose output checked
[Criteria]: √ 2/2 met
____________________________
Result: PASS
Generate an ollama-optimization-guide.md file. Ask the user where to save it (suggest ~/.config/ollama/optimization-guide.md or current directory). Contents:
# Ollama Optimization Guide
**Generated:** <timestamp>
**System:** <OS> | <CPU> | <RAM>GB RAM | <GPU>
## System Overview
<hardware summary and constraints>
## Current Configuration
<existing Ollama setup and env vars>
## Recommendations
### Environment Variables
<shell commands to set vars>
### Model Selection
<recommended models with rationale>
### Performance Tuning
<Modelfile adjustments if needed>
## Execution Checklist
- [ ] <step 1>
- [ ] <step 2>
...
## Verification
<benchmark commands and expected results>
## Rollback
<commands to revert changes if needed>
For users who want immediate results without full analysis:
macOS (Apple Silicon):
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.2:3b # Safe for 8GB, fast
Linux/Windows with 8GB NVIDIA GPU:
export OLLAMA_FLASH_ATTENTION=1
export OLLAMA_KV_CACHE_TYPE=q8_0
ollama pull llama3.1:8b-instruct-q4_K_M
CPU-only systems:
export CUDA_VISIBLE_DEVICES=-1
ollama pull llama3.2:3b
# Create Modelfile with: PARAMETER num_thread 4