skills/ai-agent-development/agent-on-premises/SKILL.md
Guides on-premises deployment of AI agents: local model serving, data sovereignty, air-gapped environments, GPU provisioning, and infrastructure patterns. Use when agents must run locally without sending data to cloud APIs.
npx skillsauth add pkuppens/pkuppens agent-on-premisesInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Patterns for deploying AI agents entirely on-premises — no data leaves the local network. Covers model serving, GPU provisioning, infrastructure, and the trade-offs versus cloud deployment.
On-Premises Network
┌──────────────────────────────────────────────────┐
│ │
│ ┌──────────┐ ┌─────────────┐ ┌──────────┐ │
│ │ Agent │───▶│ Model │ │ Vector │ │
│ │ App │ │ Server │ │ Store │ │
│ │ (FastAPI) │ │ (Ollama / │ │ (Chroma /│ │
│ └──────────┘ │ vLLM) │ │ Qdrant) │ │
│ │ └─────────────┘ └──────────┘ │
│ │ │ │ │
│ └────────────────┼──────────────────┘ │
│ │ │
│ ┌─────┴─────┐ │
│ │ GPU(s) │ │
│ └───────────┘ │
└──────────────────────────────────────────────────┘
| Server | Setup complexity | Production-ready | GPU support | Quantisation | |--------|-----------------|-----------------|-------------|-------------| | Ollama | Low (single binary) | Development / small teams | CUDA, ROCm, Metal | GGUF (Q4, Q5, Q8) | | vLLM | Medium (Python) | Yes (high throughput) | CUDA | AWQ, GPTQ, FP8 | | TGI | Medium (Docker) | Yes (HuggingFace) | CUDA | GPTQ, AWQ, EETQ | | llama.cpp | Low (C++ binary) | Development | CUDA, Metal, Vulkan, CPU | GGUF (extensive) | | LocalAI | Medium (Docker) | Community | CUDA, CPU | GGUF, GPTQ |
| Model parameters | Min GPU VRAM (FP16) | Min GPU VRAM (Q4) | Example GPU | |-----------------|--------------------|--------------------|-------------| | 1-3B | 4 GB | 2 GB | RTX 3060 (12 GB) | | 7-8B | 16 GB | 4-6 GB | RTX 4070 (12 GB) | | 13B | 28 GB | 8-10 GB | RTX 4090 (24 GB) | | 30-34B | 68 GB | 20-24 GB | A100 (40 GB) or 2x RTX 4090 | | 70B | 140 GB | 40-48 GB | A100 (80 GB) or 2x A100 (40 GB) |
services:
ollama:
image: ollama/ollama:latest
ports:
- "11434:11434"
volumes:
- ollama_data:/root/.ollama
deploy:
resources:
reservations:
devices:
- driver: nvidia
count: 1
capabilities: [gpu]
healthcheck:
test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
interval: 30s
retries: 3
chromadb:
image: chromadb/chroma:latest
ports:
- "8000:8000"
volumes:
- chroma_data:/chroma/chroma
agent:
build: .
ports:
- "8080:8080"
environment:
LLM_BACKEND: ollama
LLM_BASE_URL: http://ollama:11434
LLM_MODEL: llama3.2
CHROMA_URL: http://chromadb:8000
depends_on:
ollama:
condition: service_healthy
For environments with no internet access:
ollama pull, download GGUF files)docker save, load with docker loadtools
Creates, queries, updates, and links Azure Boards work items via az boards CLI. Use when filing ADO work items, running WIQL queries, or setting area path, iteration, tags, and assignee.
tools
Creates, reviews, and completes Azure Repos pull requests and branch policies via az repos CLI. Use when opening ADO PRs, setting required reviewers, or configuring build validation policies.
development
Guides Azure Pipelines YAML structure, build validation on PRs, and staged deployment with environments and approvals. Use when authoring azure-pipelines.yml or configuring CI/CD on Azure DevOps.
tools
Orchestrates Azure DevOps work item, repo, and pipeline workflows using az CLI. Use when working with Azure DevOps, Azure Repos, Azure Boards, Azure Pipelines, or az devops commands.