Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

pkuppens/agent-on-premises

Name: agent-on-premises
Author: pkuppens

skills/ai-agent-development/agent-on-premises/SKILL.md

npx skillsauth add pkuppens/pkuppens agent-on-premises

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Agent On-Premises

Patterns for deploying AI agents entirely on-premises — no data leaves the local network. Covers model serving, GPU provisioning, infrastructure, and the trade-offs versus cloud deployment.

When to use

Data sovereignty or regulatory requirements prohibit cloud API usage (GDPR, HIPAA, government)
Air-gapped or restricted network environments
Sensitive data (medical records, financial data, classified information) must not leave the premises
Cost optimisation for high-volume inference (own hardware vs pay-per-token)
Low-latency requirements where cloud round-trips are unacceptable

Architecture

On-Premises Network
┌──────────────────────────────────────────────────┐
│                                                  │
│  ┌──────────┐    ┌─────────────┐   ┌──────────┐ │
│  │  Agent    │───▶│  Model      │   │  Vector  │ │
│  │  App      │    │  Server     │   │  Store   │ │
│  │ (FastAPI) │    │ (Ollama /   │   │ (Chroma /│ │
│  └──────────┘    │  vLLM)      │   │  Qdrant) │ │
│       │          └─────────────┘   └──────────┘ │
│       │                │                  │      │
│       └────────────────┼──────────────────┘      │
│                        │                         │
│                  ┌─────┴─────┐                   │
│                  │   GPU(s)  │                   │
│                  └───────────┘                   │
└──────────────────────────────────────────────────┘

Model serving options

| Server | Setup complexity | Production-ready | GPU support | Quantisation | |--------|-----------------|-----------------|-------------|-------------| | Ollama | Low (single binary) | Development / small teams | CUDA, ROCm, Metal | GGUF (Q4, Q5, Q8) | | vLLM | Medium (Python) | Yes (high throughput) | CUDA | AWQ, GPTQ, FP8 | | TGI | Medium (Docker) | Yes (HuggingFace) | CUDA | GPTQ, AWQ, EETQ | | llama.cpp | Low (C++ binary) | Development | CUDA, Metal, Vulkan, CPU | GGUF (extensive) | | LocalAI | Medium (Docker) | Community | CUDA, CPU | GGUF, GPTQ |

GPU provisioning

Minimum GPU memory by model size

| Model parameters | Min GPU VRAM (FP16) | Min GPU VRAM (Q4) | Example GPU | |-----------------|--------------------|--------------------|-------------| | 1-3B | 4 GB | 2 GB | RTX 3060 (12 GB) | | 7-8B | 16 GB | 4-6 GB | RTX 4070 (12 GB) | | 13B | 28 GB | 8-10 GB | RTX 4090 (24 GB) | | 30-34B | 68 GB | 20-24 GB | A100 (40 GB) or 2x RTX 4090 | | 70B | 140 GB | 40-48 GB | A100 (80 GB) or 2x A100 (40 GB) |

Quantisation trade-offs

Q4_K_M — good balance of quality and memory (most common for on-prem)
Q5_K_M — slightly better quality, 25% more memory
Q8_0 — near-FP16 quality, double the Q4 memory
FP16 — full precision, requires most memory, best quality

Docker Compose pattern

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      retries: 3

  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma

  agent:
    build: .
    ports:
      - "8080:8080"
    environment:
      LLM_BACKEND: ollama
      LLM_BASE_URL: http://ollama:11434
      LLM_MODEL: llama3.2
      CHROMA_URL: http://chromadb:8000
    depends_on:
      ollama:
        condition: service_healthy

Air-gapped deployment

For environments with no internet access:

Pre-download models on a connected machine (ollama pull, download GGUF files)
Transfer via secure media (USB, approved file transfer)
Load from local volume — mount the model directory into the container
Pre-build Docker images — save with docker save, load with docker load
Embedding models — download sentence-transformers models and serve locally

Data sovereignty checklist

[ ] No API calls leave the local network (verify with network monitoring)
[ ] All models are served locally (no cloud model fallback)
[ ] Embedding generation is local (not using cloud embedding APIs)
[ ] Vector store runs on local infrastructure
[ ] Audit logs are stored locally
[ ] Model updates follow an approved change process (no auto-download)

Integration with other skills

agent-llm-providers — on-prem provider configuration
agent-context — local vector stores and embedding models
agent-guardrails — on-prem PII filtering
deployment-build — container image builds
deployment-release — on-prem release process

pkuppens/agent-on-premises

skills/ai-agent-development/agent-on-premises/SKILL.md

Guides on-premises deployment of AI agents: local model serving, data sovereignty, air-gapped environments, GPU provisioning, and infrastructure patterns. Use when agents must run locally without sending data to cloud APIs.

development

Updated May 15, 2026

$ install --global

skillsauth

npx skillsauth add pkuppens/pkuppens agent-on-premises

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: May 15, 2026, 5:40 AM89.1s1 file scanned

SKILL.md

name:: agent-on-premises
description:: >-
Guides on-premises deployment of AI agents:: local model serving, data

Agent On-Premises

Patterns for deploying AI agents entirely on-premises — no data leaves the local network. Covers model serving, GPU provisioning, infrastructure, and the trade-offs versus cloud deployment.

When to use

Data sovereignty or regulatory requirements prohibit cloud API usage (GDPR, HIPAA, government)
Air-gapped or restricted network environments
Sensitive data (medical records, financial data, classified information) must not leave the premises
Cost optimisation for high-volume inference (own hardware vs pay-per-token)
Low-latency requirements where cloud round-trips are unacceptable

Architecture

On-Premises Network
┌──────────────────────────────────────────────────┐
│                                                  │
│  ┌──────────┐    ┌─────────────┐   ┌──────────┐ │
│  │  Agent    │───▶│  Model      │   │  Vector  │ │
│  │  App      │    │  Server     │   │  Store   │ │
│  │ (FastAPI) │    │ (Ollama /   │   │ (Chroma /│ │
│  └──────────┘    │  vLLM)      │   │  Qdrant) │ │
│       │          └─────────────┘   └──────────┘ │
│       │                │                  │      │
│       └────────────────┼──────────────────┘      │
│                        │                         │
│                  ┌─────┴─────┐                   │
│                  │   GPU(s)  │                   │
│                  └───────────┘                   │
└──────────────────────────────────────────────────┘

Model serving options

GPU provisioning

Minimum GPU memory by model size

Quantisation trade-offs

Q4_K_M — good balance of quality and memory (most common for on-prem)
Q5_K_M — slightly better quality, 25% more memory
Q8_0 — near-FP16 quality, double the Q4 memory
FP16 — full precision, requires most memory, best quality

Docker Compose pattern

services:
  ollama:
    image: ollama/ollama:latest
    ports:
      - "11434:11434"
    volumes:
      - ollama_data:/root/.ollama
    deploy:
      resources:
        reservations:
          devices:
            - driver: nvidia
              count: 1
              capabilities: [gpu]
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:11434/api/tags"]
      interval: 30s
      retries: 3

  chromadb:
    image: chromadb/chroma:latest
    ports:
      - "8000:8000"
    volumes:
      - chroma_data:/chroma/chroma

  agent:
    build: .
    ports:
      - "8080:8080"
    environment:
      LLM_BACKEND: ollama
      LLM_BASE_URL: http://ollama:11434
      LLM_MODEL: llama3.2
      CHROMA_URL: http://chromadb:8000
    depends_on:
      ollama:
        condition: service_healthy

Air-gapped deployment

For environments with no internet access:

Pre-download models on a connected machine (ollama pull, download GGUF files)
Transfer via secure media (USB, approved file transfer)
Load from local volume — mount the model directory into the container
Pre-build Docker images — save with docker save, load with docker load
Embedding models — download sentence-transformers models and serve locally

Data sovereignty checklist

[ ] No API calls leave the local network (verify with network monitoring)
[ ] All models are served locally (no cloud model fallback)
[ ] Embedding generation is local (not using cloud embedding APIs)
[ ] Vector store runs on local infrastructure
[ ] Audit logs are stored locally
[ ] Model updates follow an approved change process (no auto-download)

Integration with other skills

agent-llm-providers — on-prem provider configuration
agent-context — local vector stores and embedding models
agent-guardrails — on-prem PII filtering
deployment-build — container image builds
deployment-release — on-prem release process

Related Skills

pkuppens/sync-branch

testing

VerifiedTrustedCommunity

Syncs remote default branch locally (checkout, fetch --prune, pull) and returns to the previous branch when it still exists. Reports stashes and worktrees not yet handled. Use when the user asks to sync main, update default branch, fetch/pull origin, or run /sync-branch.

SKILL.mdUpdated Jun 6, 2026

pkuppens/azure-devops-work-items

tools

VerifiedTrustedCommunity

Creates, queries, updates, and links Azure Boards work items via az boards CLI. Use when filing ADO work items, running WIQL queries, or setting area path, iteration, tags, and assignee.

SKILL.mdUpdated May 29, 2026

pkuppens/azure-devops-work-items

pkuppens/azure-devops-repos

tools

VerifiedTrustedCommunity

Creates, reviews, and completes Azure Repos pull requests and branch policies via az repos CLI. Use when opening ADO PRs, setting required reviewers, or configuring build validation policies.

SKILL.mdUpdated May 29, 2026

pkuppens/azure-devops-repos

pkuppens/azure-devops-pipelines

development

VerifiedTrustedCommunity

Guides Azure Pipelines YAML structure, build validation on PRs, and staged deployment with environments and approvals. Use when authoring azure-pipelines.yml or configuring CI/CD on Azure DevOps.

SKILL.mdUpdated May 29, 2026

pkuppens/azure-devops-pipelines

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/pkuppens/pkuppens.git

# Copy into Claude Code skills folder (global)
cp -r pkuppens/skills/ai-agent-development/agent-on-premises ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

pkuppens/pkuppens

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT