
bagelhole/vllm-server

Name: vllm-server
Author: bagelhole

infrastructure/local-ai/vllm-server/SKILL.md

npx skillsauth add bagelhole/devops-security-agent-skills vllm-server


vLLM Server Management

Deploy production-grade LLM inference servers with vLLM, a high-throughput open-source LLM serving engine built around PagedAttention and continuous batching.

When to Use This Skill

Use this skill when:

Serving open-source LLMs (Llama, Mistral, Qwen, Gemma) at scale
Building an OpenAI-compatible API endpoint for self-hosted models
Optimizing LLM throughput and latency for production traffic
Running multi-GPU inference with tensor or pipeline parallelism
Deploying quantized models to reduce GPU memory requirements

Prerequisites

NVIDIA GPU(s) with CUDA 12.1+ (A100/H100 recommended for production)
Docker or Python 3.9+ with pip
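Enough VRAM for the weights plus KV cache: 16-bit weights need about 2 bytes per parameter (roughly 16GB for 8B models and 140GB for 70B models, usually split across GPUs); 4-bit quantization brings a 70B model down to roughly 40GB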
nvidia-container-toolkit for Docker GPU passthrough
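
A rough way to sanity-check VRAM figures like the ones above is to estimate weight memory from parameter count and precision. The sketch below is back-of-the-envelope arithmetic only; `weight_memory_gib` is a hypothetical helper, and it ignores KV cache, activations, and CUDA overhead, so real requirements are higher.

```python
def weight_memory_gib(params_billion: float, bits_per_param: int = 16) -> float:
    """Approximate memory for model weights only, in GiB (no KV cache or overhead)."""
    total_bytes = params_billion * 1e9 * bits_per_param / 8
    return total_bytes / 2**30


def describe(params_billion: float) -> str:
    """Format 16-bit and 4-bit weight estimates for one model size."""
    fp16 = weight_memory_gib(params_billion, 16)
    int4 = weight_memory_gib(params_billion, 4)
    return f"{params_billion:g}B: ~{fp16:.1f} GiB at 16-bit, ~{int4:.1f} GiB at 4-bit"


for size in (8, 70):
    print(describe(size))
```

Comparing the estimate with the combined VRAM of the GPUs you plan to use gives a quick first check on whether quantization or tensor parallelism is needed.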

Quick Start

# Install vLLM
pip install vllm

# Serve a model (OpenAI-compatible API)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 \
  --port 8000 \
  --api-key your-secret-key

# Test the endpoint
curl http://localhost:8000/v1/chat/completions \
  -H "Content-Type: application/json" \
  -H "Authorization: Bearer your-secret-key" \
  -d '{
    "model": "meta-llama/Llama-3.1-8B-Instruct",
    "messages": [{"role": "user", "content": "Hello!"}]
  }'
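
The same request can be sent from Python using only the standard library. This is a minimal sketch that assumes the server from the Quick Start is running on `localhost:8000` with the placeholder key `your-secret-key`; `build_chat_request` and `chat` are illustrative helper names, not part of vLLM.

```python
import json
import urllib.request


def build_chat_request(base_url: str, api_key: str, model: str, user_message: str) -> urllib.request.Request:
    """Build (but do not send) a POST request for the chat completions endpoint."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": user_message}],
    }
    return urllib.request.Request(
        url=f"{base_url.rstrip('/')}/v1/chat/completions",
        data=json.dumps(payload).encode("utf-8"),
        headers={
            "Content-Type": "application/json",
            "Authorization": f"Bearer {api_key}",
        },
        method="POST",
    )


def chat(base_url: str, api_key: str, model: str, user_message: str, timeout: float = 60.0) -> str:
    """Send the request and return the assistant's reply text."""
    request = build_chat_request(base_url, api_key, model, user_message)
    with urllib.request.urlopen(request, timeout=timeout) as response:
        body = json.load(response)
    return body["choices"][0]["message"]["content"]
```

With the server up, `chat("http://localhost:8000", "your-secret-key", "meta-llama/Llama-3.1-8B-Instruct", "Hello!")` should return the model's reply as a string.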

Docker Deployment

docker run --runtime nvidia --gpus all \
  -v ~/.cache/huggingface:/root/.cache/huggingface \
  -p 8000:8000 \
  --ipc=host \
  vllm/vllm-openai:latest \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --api-key your-secret-key

Docker Compose (Production)

services:
  vllm:
    image: vllm/vllm-openai:latest
    runtime: nvidia
    environment:
      - NVIDIA_VISIBLE_DEVICES=all
      - HUGGING_FACE_HUB_TOKEN=${HF_TOKEN}
    volumes:
      - model-cache:/root/.cache/huggingface
    ports:
      - "8000:8000"
    ipc: host
    command: >
      --model meta-llama/Llama-3.1-70B-Instruct
      --tensor-parallel-size 2
      --max-model-len 32768
      --gpu-memory-utilization 0.90
      --api-key ${VLLM_API_KEY}
    restart: unless-stopped
    healthcheck:
      test: ["CMD", "curl", "-f", "http://localhost:8000/health"]
      interval: 30s
      timeout: 10s
      retries: 3

volumes:
  model-cache:
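
Large models can take several minutes to load, so deployment scripts often wait on the same `/health` endpoint that the healthcheck above probes. The following is a sketch of such a wait loop; `wait_until_healthy` is a hypothetical helper, and the timeout and interval defaults are assumptions to tune for your model size.

```python
import time
import urllib.error
import urllib.request


def http_ok(url: str) -> bool:
    """Return True if a GET to `url` answers with HTTP 200."""
    try:
        with urllib.request.urlopen(url, timeout=5) as response:
            return response.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, or a non-2xx status: not ready yet.
        return False


def wait_until_healthy(url, timeout_s=600.0, interval_s=5.0, probe=http_ok, sleep=time.sleep) -> bool:
    """Poll `probe(url)` until it succeeds or `timeout_s` elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        if probe(url):
            return True
        if time.monotonic() >= deadline:
            return False
        sleep(interval_s)
```

For example, `wait_until_healthy("http://localhost:8000/health")` can gate a smoke test that runs after `docker compose up -d`.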

Key Configuration Options

Multi-GPU Tensor Parallelism

# Split one model across 4 GPUs
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 4 \
  --gpu-memory-utilization 0.90
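
When choosing `--tensor-parallel-size`, note that vLLM expects the model's attention head count to be divisible by the tensor-parallel size. A small sketch of that check follows; `valid_tensor_parallel_sizes` is a hypothetical helper, and the head count of 64 for Llama-3.1-70B is an assumption to confirm against `num_attention_heads` in the model's `config.json`.

```python
def valid_tensor_parallel_sizes(num_attention_heads: int, gpu_count: int) -> list[int]:
    """List tensor-parallel sizes up to `gpu_count` that evenly divide the head count."""
    return [
        tp_size
        for tp_size in range(1, gpu_count + 1)
        if num_attention_heads % tp_size == 0
    ]


# Assumed head count for Llama-3.1-70B; verify in the model's config.json.
print(valid_tensor_parallel_sizes(num_attention_heads=64, gpu_count=8))
```

Under that assumption, a node with 6 GPUs would still be limited to a tensor-parallel size of 1, 2, or 4.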

Quantization (Lower VRAM)

# AWQ quantization (70B on 2x A100 40GB)
vllm serve casperhansen/llama-3-70b-instruct-awq \
  --quantization awq \
  --tensor-parallel-size 2

# GPTQ quantization
vllm serve TheBloke/Llama-2-70B-Chat-GPTQ \
  --quantization gptq

# FP8 (hardware-accelerated on Hopper and Ada GPUs such as H100)
vllm serve meta-llama/Llama-3.1-405B-Instruct \
  --quantization fp8 \
  --tensor-parallel-size 8

Structured Output & Tools

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-auto-tool-choice \
  --tool-call-parser llama3_json \
  --guided-decoding-backend outlines
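
With auto tool choice enabled, clients pass tool definitions in the OpenAI `tools` format. The sketch below only builds the request body; the `get_weather` function and its schema are hypothetical and exist purely for illustration.

```python
import json


def build_tool_call_payload(model: str, question: str) -> dict:
    """Build a chat completions body that lets the model decide whether to call a tool."""
    tools = [
        {
            "type": "function",
            "function": {
                "name": "get_weather",  # hypothetical tool, not provided by vLLM
                "description": "Look up the current weather for a city.",
                "parameters": {
                    "type": "object",
                    "properties": {"city": {"type": "string"}},
                    "required": ["city"],
                },
            },
        }
    ]
    return {
        "model": model,
        "messages": [{"role": "user", "content": question}],
        "tools": tools,
        "tool_choice": "auto",  # relies on --enable-auto-tool-choice on the server
    }


payload = build_tool_call_payload("meta-llama/Llama-3.1-8B-Instruct", "Weather in Paris?")
print(json.dumps(payload, indent=2))
```

If the model decides to call the tool, the response message should carry a `tool_calls` entry instead of plain text content.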

LoRA Adapters

vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --enable-lora \
  --lora-modules sql-lora=/path/to/sql-lora \
                 code-lora=/path/to/code-lora \
  --max-lora-rank 64
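
Once adapters are registered with `--lora-modules`, a client selects one by passing its registered name in the `model` field of the request. A minimal sketch of such a request body, using the `sql-lora` name from the command above (`build_lora_payload` is a hypothetical helper):

```python
def build_lora_payload(adapter_name: str, prompt: str, max_tokens: int = 128) -> dict:
    """Build a chat completions body that routes the request to a named LoRA adapter."""
    return {
        "model": adapter_name,  # the adapter's registered name, not the base model ID
        "messages": [{"role": "user", "content": prompt}],
        "max_tokens": max_tokens,
    }


payload = build_lora_payload("sql-lora", "Write a query that counts users by country.")
print(payload["model"])
```

Requests that name the base model instead are served without any adapter applied.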

Performance Tuning

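# Maximize throughput for batch workloads
# Set MODEL to the model ID you are serving.
#   --max-num-seqs            max concurrent sequences
#   --max-num-batched-tokens  tokens per batch
#   --gpu-memory-utilization  fraction of VRAM to use (0.95 = 95%)
#   --swap-space              CPU swap space (GiB)
vllm serve "$MODEL" \
  --max-num-seqs 256 \
  --max-num-batched-tokens 8192 \
  --gpu-memory-utilization 0.95 \
  --swap-space 4

# Minimize latency for interactive use
# Keep CUDA graphs enabled (the default). --enforce-eager disables them, which
# saves memory and startup time but usually raises per-token latency.
vllm serve "$MODEL" \
  --max-num-seqs 32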

Benchmarking

# The batch runner and benchmark CLI ship with vLLM itself
pip install vllm

# Run offline batch inference over a JSONL file in OpenAI batch format
python -m vllm.entrypoints.openai.run_batch \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --input-file prompts.jsonl \
  --output-file results.jsonl

# Measure offline throughput with vllm bench
vllm bench throughput \
  --model meta-llama/Llama-3.1-8B-Instruct \
  --num-prompts 1000 \
  --input-len 512 \
  --output-len 128
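
The `prompts.jsonl` file passed to `run_batch` above is expected to follow the OpenAI batch format, with one request per line. The sketch below generates such a file; `batch_lines` and `write_batch_file` are hypothetical helpers, and the `max_tokens` default is an arbitrary choice.

```python
import json


def batch_lines(model: str, prompts: list[str], max_tokens: int = 128) -> list[str]:
    """Return one OpenAI-batch-format JSON line per prompt."""
    lines = []
    for index, prompt in enumerate(prompts, start=1):
        record = {
            "custom_id": f"request-{index}",  # echoed back so results can be matched up
            "method": "POST",
            "url": "/v1/chat/completions",
            "body": {
                "model": model,
                "messages": [{"role": "user", "content": prompt}],
                "max_tokens": max_tokens,
            },
        }
        lines.append(json.dumps(record))
    return lines


def write_batch_file(path: str, model: str, prompts: list[str]) -> int:
    """Write the batch file and return the number of requests written."""
    lines = batch_lines(model, prompts)
    with open(path, "w", encoding="utf-8") as handle:
        handle.write("\n".join(lines) + "\n")
    return len(lines)
```

Calling `write_batch_file("prompts.jsonl", "meta-llama/Llama-3.1-8B-Instruct", ["Hello!", "Summarize PagedAttention."])` produces a two-request input file.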

Monitoring

# Check running server stats
curl http://localhost:8000/metrics  # Prometheus metrics

# Key metrics to watch:
# vllm:num_requests_running        - active requests
# vllm:gpu_cache_usage_perc        - KV cache utilization
# vllm:generation_tokens_total     - generated tokens (counter; apply rate() for throughput)
# vllm:time_to_first_token_seconds - TTFT latency (histogram)
# vllm:e2e_request_latency_seconds - end-to-end latency (histogram)
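
For quick checks without a full Prometheus setup, the text returned by `/metrics` can be parsed directly. The sketch below handles only simple `name{labels} value` lines without timestamps; `parse_prometheus_text` is a hypothetical helper, and the sample text is made up for illustration rather than captured from a real server.

```python
def parse_prometheus_text(text: str) -> dict[str, float]:
    """Parse metric lines, summing samples that share a metric name across label sets."""
    values: dict[str, float] = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue  # skip blank lines and HELP/TYPE comments
        name_and_labels, _, raw_value = line.rpartition(" ")
        name = name_and_labels.split("{", 1)[0]
        try:
            value = float(raw_value)
        except ValueError:
            continue  # ignore lines this simple parser cannot read
        values[name] = values.get(name, 0.0) + value
    return values


# Illustrative sample only; real output contains many more metrics.
SAMPLE = """\
# TYPE vllm:num_requests_running gauge
vllm:num_requests_running{model_name="demo"} 3.0
# TYPE vllm:gpu_cache_usage_perc gauge
vllm:gpu_cache_usage_perc{model_name="demo"} 0.42
"""

metrics = parse_prometheus_text(SAMPLE)
print(metrics["vllm:num_requests_running"], metrics["vllm:gpu_cache_usage_perc"])
```

For dashboards and alerting, scraping the endpoint with Prometheus itself remains the more robust option.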

Common Issues

| Issue | Cause | Fix |
|-------|-------|-----|
| CUDA out of memory | Model too large for VRAM | Add --quantization awq or reduce --gpu-memory-utilization |
| Slow cold start | Model not cached | Pre-pull with huggingface-cli download <model> |
| Low throughput | Too few concurrent requests | Increase --max-num-seqs |
| KV cache full errors | Context length too long | Set --max-model-len lower |
| tokenizer error | Tokenizer mismatch | Use --tokenizer to specify correct tokenizer |
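
To see why lowering `--max-model-len` relieves KV cache pressure, it helps to estimate how much cache a single sequence can occupy. The sketch below uses the standard per-token formula (keys and values for every layer and KV head); the shape values used for Llama-3.1-8B (32 layers, 8 KV heads, head dimension 128) are assumptions to verify against the model's `config.json`.

```python
def kv_cache_bytes_per_token(num_layers: int, num_kv_heads: int, head_dim: int, dtype_bytes: int = 2) -> int:
    """Bytes of KV cache per token: one key and one value vector per layer and KV head."""
    return 2 * num_layers * num_kv_heads * head_dim * dtype_bytes


def kv_cache_gib(context_len: int, num_seqs: int, **shape) -> float:
    """KV cache in GiB if `num_seqs` sequences each fill `context_len` tokens."""
    return kv_cache_bytes_per_token(**shape) * context_len * num_seqs / 2**30


# Assumed Llama-3.1-8B shape; verify against the model's config.json.
shape = {"num_layers": 32, "num_kv_heads": 8, "head_dim": 128}
for context_len in (32768, 8192):
    print(context_len, kv_cache_gib(context_len, num_seqs=1, **shape), "GiB per full sequence")
```

Under these assumptions a full 32768-token sequence occupies 4 GiB of 16-bit cache, while an 8192-token limit caps it at 1 GiB, so a lower limit lets more sequences fit at once.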

Best Practices

Use --gpu-memory-utilization 0.90 to leave headroom for CUDA kernels.
Pin model versions with --revision for reproducible deployments.
Set HF_HUB_OFFLINE=1 in production to prevent unexpected downloads.
Use AWQ or GPTQ quantization before tensor parallelism — lower VRAM first.
Enable --enable-chunked-prefill for long-context workloads.
Monitor vllm:gpu_cache_usage_perc; sustained values above 95% cause requests to queue.

Related Skills

llm-inference-scaling - Auto-scaling vLLM deployments
gpu-server-management - GPU driver setup
llm-gateway - Load balancing across vLLM instances
llm-cost-optimization - Cost management
model-serving-kubernetes - K8s deployment

Deploy and manage vLLM for high-throughput LLM inference. Configure continuous batching, tensor parallelism, quantization, and OpenAI-compatible API endpoints for production LLM serving.

18 stars

development

Updated Apr 3, 2026

npx skillsauth add bagelhole/devops-security-agent-skills vllm-server

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Clean (95%)
Snyk (dep): Open source security scanning. Skipped (50%)
Socket.dev: Supply chain security analysis. Skipped (50%)
VirusTotal: Multi-engine malware detection. Skipped (50%)
CrowdStrike: Advanced threat intelligence. Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Skipped (50%)
OWASP Dep-Check: Skipped (50%)

Last scanned: Apr 3, 2026, 5:06 PM (8.3s, 1 file scanned)

SKILL.md

name: vllm-server
description: Deploy and manage vLLM for high-throughput LLM inference. Configure continuous batching, tensor parallelism, quantization, and OpenAI-compatible API endpoints for production LLM serving.
license: MIT
author: devops-skills
version: 1.0

Install manually

# Clone the repo
git clone https://github.com/bagelhole/devops-security-agent-skills.git

# Copy into Claude Code skills folder (global), creating it if needed
mkdir -p ~/.claude/skills
cp -r devops-security-agent-skills/infrastructure/local-ai/vllm-server ~/.claude/skills/

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT