Adoption

Agent Skills are supported by leading AI development tools.

VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory VS Code Gemini CLI GitHub Goose Amp Cursor Claude Code Letta OpenCode Claude OpenAI Codex Factory

curiositech/model-serving-api-builder

Name: model-serving-api-builder
Author: curiositech

skills/model-serving-api-builder/SKILL.md

npx skillsauth add curiositech/windags-skills model-serving-api-builder

Clean

TrivyContainer and dependency vulnerability scanner

Clean

SemgrepStatic code analysis for vulnerabilities

Clean

mcp-scan (Snyk)Model Context Protocol security validation

Skipped

Snyk (dep)Open source security scanning

Skipped

Socket.devSupply chain security analysis

Skipped

VirusTotalMulti-engine malware detection

Skipped

CrowdStrikeAdvanced threat intelligence

Skipped

OSV-ScannerOpen Source Vulnerability database check

Skipped

OWASP Dep-Check

Model Serving API Builder

Deploy machine learning models as production APIs using vLLM, TGI, ONNX Runtime, and custom FastAPI services with batching, autoscaling, and GPU optimization.

Activation Triggers

Activate on: "deploy model API", "serve LLM", "vLLM setup", "inference server", "GPU serving", "model endpoint", "batch inference API", "TGI deployment", "ONNX serving", "autoscale inference"

NOT for: Model training or fine-tuning (ai-engineer), prompt design (prompt-engineer), or LLM application logic (ai-engineer)

Quick Start

Choose serving framework — vLLM for LLMs (best throughput), TGI for HuggingFace models, ONNX Runtime for non-LLM models, Triton for multi-model.
Configure resources — GPU type and count, memory limits, tensor parallelism for large models.
Add batching — Continuous batching (vLLM native) for LLMs, dynamic batching for traditional ML models.
Deploy with health checks — Kubernetes with GPU node pools, or managed services (Modal, Replicate, RunPod).
Set up autoscaling — Scale on GPU utilization (70-80%), request queue depth, or custom latency targets.

Core Capabilities

| Domain | Technologies | Notes | |--------|-------------|-------| | LLM Serving | vLLM, TGI (HuggingFace), SGLang | PagedAttention, continuous batching, speculative decoding | | General ML | ONNX Runtime, Triton, TorchServe | Non-LLM models: vision, audio, tabular | | API Layer | FastAPI, gRPC, OpenAI-compatible endpoints | vLLM exposes OpenAI-compatible API natively | | Orchestration | Kubernetes + GPU operator, Docker, Modal, RunPod | GPU scheduling and resource management | | Quantization | AWQ, GPTQ, GGUF, bitsandbytes | 4-bit reduces VRAM 4x with < 2% quality loss | | Autoscaling | KEDA, HPA on GPU metrics, serverless (Modal) | Scale-to-zero for cost; scale-up for throughput |

Architecture Patterns

Pattern 1: vLLM Production Deployment

Client ──→ [Load Balancer] ──→ [vLLM Instance(s)] ──→ [GPU(s)]
                │                      │
           health checks          OpenAI-compatible API
           rate limiting           /v1/completions
           API key auth            /v1/chat/completions
                                       │
                                  continuous batching
                                  PagedAttention
                                  tensor parallelism

# vLLM serving with optimizations (2026 best practices)
pip install vllm

# Single GPU (e.g., A100 80GB, Llama 3.1 8B)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --dtype auto

# Multi-GPU tensor parallelism (e.g., 2x A100 for 70B)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching

# Quantized serving (4-bit AWQ, fits 70B on single A100)
vllm serve TheBloke/Llama-3.1-70B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95

Pattern 2: Multi-Model Serving Architecture

Requests ──→ [API Gateway / Router]
                    │
          ┌────────┼────────┐
          │        │        │
          ▼        ▼        ▼
      [vLLM]   [ONNX]   [Triton]
      LLM       Vision    Ensemble
      Llama 3   CLIP      Multi-step
                ResNet    Pipeline
          │        │        │
          ▼        ▼        ▼
       GPU 0    GPU 1    GPU 2-3

Router logic:
  /v1/chat/*         → vLLM (LLM inference)
  /v1/embeddings/*   → ONNX (embedding model)
  /v1/classify/*     → Triton (vision classifier)

Pattern 3: Kubernetes GPU Deployment

# k8s deployment with GPU scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
spec:
  replicas: 2
  template:
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=meta-llama/Llama-3.1-8B-Instruct
        - --gpu-memory-utilization=0.90
        - --enable-prefix-caching
        resources:
          limits:
            nvidia.com/gpu: 1      # Request 1 GPU per pod
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "24Gi"
        ports:
        - containerPort: 8000
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120  # Model loading takes time
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

Anti-Patterns

No health check grace period — LLM model loading takes 30-120 seconds. Setting initialDelaySeconds: 5 causes restart loops. Profile startup time and set accordingly.
Over-provisioning GPU memory — Setting gpu-memory-utilization to 1.0 leaves no room for request spikes. Use 0.85-0.92 as the safe range.
Synchronous inference without batching — Processing one request at a time wastes GPU cycles. vLLM's continuous batching handles this automatically; for custom serving, implement dynamic batching.
No quantization evaluation — Serving a 70B FP16 model on 4x A100s when AWQ 4-bit on 1x A100 gives equivalent quality at 1/4 the cost. Always benchmark quantized variants.
Missing graceful shutdown — Killing inference mid-generation corrupts responses. Implement drain mode: stop accepting new requests, finish in-flight, then terminate.

Quality Checklist

[ ] Serving framework chosen based on model type (vLLM for LLMs, ONNX for traditional ML)
[ ] GPU memory utilization set to 0.85-0.92 (not 1.0)
[ ] Quantization benchmarked: quality vs. resource savings for target model
[ ] Health check probes configured with sufficient startup delay
[ ] Continuous batching enabled (vLLM default) or dynamic batching configured
[ ] Prefix caching enabled for repeated prompt patterns
[ ] Autoscaling configured: GPU utilization target 70-80%, or queue-depth based
[ ] Graceful shutdown implemented: drain mode before termination
[ ] Latency monitoring: P50, P95, P99 for time-to-first-token and total generation
[ ] Load tested at 2x expected peak traffic before production deployment

curiositech/model-serving-api-builder

skills/model-serving-api-builder/SKILL.md

Deploy ML models as production APIs with vLLM, TGI, ONNX Runtime, batching, autoscaling, and GPU optimization. Activate on: model serving, deploy LLM, vLLM setup, inference API, GPU serving. NOT for: model training (ai-engineer), prompt engineering (prompt-engineer).

development

Updated Apr 4, 2026

$ install --global

skillsauth

npx skillsauth add curiositech/windags-skills model-serving-api-builder

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Scanners Passed

Scanners in report

Clean

TrivyContainer and dependency vulnerability scanner

95%

Clean

SemgrepStatic code analysis for vulnerabilities

95%

Clean

mcp-scan (Snyk)Model Context Protocol security validation

95%

Skipped

Snyk (dep)Open source security scanning

50%

Skipped

Socket.devSupply chain security analysis

50%

Skipped

VirusTotalMulti-engine malware detection

50%

Skipped

CrowdStrikeAdvanced threat intelligence

50%

Skipped

OSV-ScannerOpen Source Vulnerability database check

50%

Skipped

OWASP Dep-Check

50%

Last scanned: Apr 24, 2026, 7:46 AM4.2s1 file scanned

SKILL.md

license:: Apache-2.0
name:: model-serving-api-builder
description:: Deploy ML models as production APIs with vLLM, TGI, ONNX Runtime, batching, autoscaling, and GPU optimization. Activate on: model serving, deploy LLM, vLLM setup, inference API, GPU serving. NOT for: model training (ai-engineer), prompt engineering (prompt-engineer).
allowed-tools:: Read,Write,Edit,Bash(python:*,pip:*,npm:*,npx:*)
category:: AI & Machine Learning
- skill:: llm-cost-optimizer
reason:: Serving optimization directly reduces inference cost

Model Serving API Builder

Deploy machine learning models as production APIs using vLLM, TGI, ONNX Runtime, and custom FastAPI services with batching, autoscaling, and GPU optimization.

Activation Triggers

Activate on: "deploy model API", "serve LLM", "vLLM setup", "inference server", "GPU serving", "model endpoint", "batch inference API", "TGI deployment", "ONNX serving", "autoscale inference"

NOT for: Model training or fine-tuning (ai-engineer), prompt design (prompt-engineer), or LLM application logic (ai-engineer)

Quick Start

Choose serving framework — vLLM for LLMs (best throughput), TGI for HuggingFace models, ONNX Runtime for non-LLM models, Triton for multi-model.
Configure resources — GPU type and count, memory limits, tensor parallelism for large models.
Add batching — Continuous batching (vLLM native) for LLMs, dynamic batching for traditional ML models.
Deploy with health checks — Kubernetes with GPU node pools, or managed services (Modal, Replicate, RunPod).
Set up autoscaling — Scale on GPU utilization (70-80%), request queue depth, or custom latency targets.

Core Capabilities

Architecture Patterns

Pattern 1: vLLM Production Deployment

Client ──→ [Load Balancer] ──→ [vLLM Instance(s)] ──→ [GPU(s)]
                │                      │
           health checks          OpenAI-compatible API
           rate limiting           /v1/completions
           API key auth            /v1/chat/completions
                                       │
                                  continuous batching
                                  PagedAttention
                                  tensor parallelism

# vLLM serving with optimizations (2026 best practices)
pip install vllm

# Single GPU (e.g., A100 80GB, Llama 3.1 8B)
vllm serve meta-llama/Llama-3.1-8B-Instruct \
  --host 0.0.0.0 --port 8000 \
  --max-model-len 8192 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching \
  --dtype auto

# Multi-GPU tensor parallelism (e.g., 2x A100 for 70B)
vllm serve meta-llama/Llama-3.1-70B-Instruct \
  --tensor-parallel-size 2 \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.90 \
  --enable-prefix-caching

# Quantized serving (4-bit AWQ, fits 70B on single A100)
vllm serve TheBloke/Llama-3.1-70B-Instruct-AWQ \
  --quantization awq \
  --max-model-len 4096 \
  --gpu-memory-utilization 0.95

Pattern 2: Multi-Model Serving Architecture

Requests ──→ [API Gateway / Router]
                    │
          ┌────────┼────────┐
          │        │        │
          ▼        ▼        ▼
      [vLLM]   [ONNX]   [Triton]
      LLM       Vision    Ensemble
      Llama 3   CLIP      Multi-step
                ResNet    Pipeline
          │        │        │
          ▼        ▼        ▼
       GPU 0    GPU 1    GPU 2-3

Router logic:
  /v1/chat/*         → vLLM (LLM inference)
  /v1/embeddings/*   → ONNX (embedding model)
  /v1/classify/*     → Triton (vision classifier)

Pattern 3: Kubernetes GPU Deployment

# k8s deployment with GPU scheduling
apiVersion: apps/v1
kind: Deployment
metadata:
  name: vllm-llama3
spec:
  replicas: 2
  template:
    spec:
      containers:
      - name: vllm
        image: vllm/vllm-openai:latest
        args:
        - --model=meta-llama/Llama-3.1-8B-Instruct
        - --gpu-memory-utilization=0.90
        - --enable-prefix-caching
        resources:
          limits:
            nvidia.com/gpu: 1      # Request 1 GPU per pod
            memory: "32Gi"
          requests:
            nvidia.com/gpu: 1
            memory: "24Gi"
        ports:
        - containerPort: 8000
        livenessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120  # Model loading takes time
          periodSeconds: 10
        readinessProbe:
          httpGet:
            path: /health
            port: 8000
          initialDelaySeconds: 120
      tolerations:
      - key: "nvidia.com/gpu"
        operator: "Exists"
        effect: "NoSchedule"

Anti-Patterns

No health check grace period — LLM model loading takes 30-120 seconds. Setting initialDelaySeconds: 5 causes restart loops. Profile startup time and set accordingly.
Over-provisioning GPU memory — Setting gpu-memory-utilization to 1.0 leaves no room for request spikes. Use 0.85-0.92 as the safe range.
Synchronous inference without batching — Processing one request at a time wastes GPU cycles. vLLM's continuous batching handles this automatically; for custom serving, implement dynamic batching.
No quantization evaluation — Serving a 70B FP16 model on 4x A100s when AWQ 4-bit on 1x A100 gives equivalent quality at 1/4 the cost. Always benchmark quantized variants.
Missing graceful shutdown — Killing inference mid-generation corrupts responses. Implement drain mode: stop accepting new requests, finish in-flight, then terminate.

Quality Checklist

[ ] Serving framework chosen based on model type (vLLM for LLMs, ONNX for traditional ML)
[ ] GPU memory utilization set to 0.85-0.92 (not 1.0)
[ ] Quantization benchmarked: quality vs. resource savings for target model
[ ] Health check probes configured with sufficient startup delay
[ ] Continuous batching enabled (vLLM default) or dynamic batching configured
[ ] Prefix caching enabled for repeated prompt patterns
[ ] Autoscaling configured: GPU utilization target 70-80%, or queue-depth based
[ ] Graceful shutdown implemented: drain mode before termination
[ ] Latency monitoring: P50, P95, P99 for time-to-first-token and total generation
[ ] Load tested at 2x expected peak traffic before production deployment

Related Skills

curiositech/revisiting-interview-data-analysing-turn

data-ai

VerifiedTrustedCommunity

license: Apache-2.0 NOT for unrelated tasks outside this domain.

8SKILL.mdUpdated Jul 19, 2026

curiositech/revisiting-interview-data-analysing-turn

curiositech/redis-patterns-expert

development

VerifiedTrustedCommunity

Use when designing caching strategies (cache-aside, write-through, write-behind), implementing distributed locks, building rate limiters, leaderboards, real-time streams (XADD/consumer groups), pub/sub, or tuning eviction policies. Triggers: thundering-herd on cache miss, dogpile on key expiry, Redlock vs SET-NX-PX choice, sliding-window rate limiter, hot-key on a single cluster slot, big-key blowup, MULTI/EXEC across slots, KEYS in production. NOT for Redis Cluster operations/admin (different domain), embedded KV (SQLite, leveldb), in-process LRU caches, or Memcached.

8SKILL.mdUpdated Jul 19, 2026

curiositech/redis-patterns-expert

curiositech/react-server-components-boundary

tools

VerifiedTrustedCommunity

Drawing the `'use client'` boundary correctly in React Server Components apps (Next.js App Router, RSC frameworks) — leaf-pushing, slot composition, serialization rules, and environment poisoning prevention. Grounded in react.dev and Next.js 16 docs.

8SKILL.mdUpdated Jul 19, 2026

curiositech/react-server-components-boundary

curiositech/rate-limiting-strategy

development

VerifiedTrustedCommunity

Use when designing rate limiting for an API, choosing between token bucket / sliding window / leaky bucket / fixed window, implementing it in Redis, deciding edge (Cloudflare/Upstash) vs origin enforcement, sizing per-user vs per-IP vs per-endpoint quotas, returning the right 429 response with Retry-After, or fixing the boundary-burst bug in fixed-window limiters. Triggers: 429 too many requests, INCR + EXPIRE, ZADD + ZREMRANGEBYSCORE + ZCARD, X-RateLimit-Remaining header, Cloudflare WAF rate limiting rules, Upstash @upstash/ratelimit, leaky bucket shaping vs policing, distributed rate limiter consistency. NOT for DDoS mitigation specifically (different scale), CAPTCHA / bot management, full WAF design, or per-user quota billing.

8SKILL.mdUpdated Jul 19, 2026

curiositech/rate-limiting-strategy

Download

For Claude Desktop. Download once, then upload the file in the app — no terminal needed.

Need help? View full Cowork setup guide →

Install manually

Choose your platform

# Clone the repo
git clone https://github.com/curiositech/windags-skills.git

# Copy into Claude Code skills folder (global)
cp -r windags-skills/skills/model-serving-api-builder ~/.claude/skills/

Claude Code Skills — official skills path docs.

Repository

curiositech/windags-skills

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT