skills/vllm/SKILL.md
High-throughput LLM inference and serving engine with PagedAttention, continuous batching, and OpenAI-compatible API. MANDATORY TRIGGERS: vLLM, vllm, LLM serving, LLM inference engine, PagedAttention. Also trigger when the user wants to serve LLMs in production, deploy models with tensor parallelism, use speculative decoding, quantize models for inference, build OpenAI-compatible API servers, or optimize LLM throughput and latency. When in doubt about whether to use this skill for LLM serving tasks, use it.
npx skillsauth add abhisheksharma-17/skills-graph vllmInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Source: docs.vllm.ai | Version tracked: 0.22.1 |
pip install vllm
| File | Read When |
|------|-----------|
| references/00-overview.md | Starting with vLLM, understanding architecture, installation, quick start |
| references/01-offline-inference.md | Batch inference with LLM class, SamplingParams, chat API, generate() |
| references/02-serving.md | OpenAI-compatible API server, endpoints, vllm serve, client setup |
| references/03-sampling-params.md | Generation parameters: temperature, top_p, top_k, penalties, stop tokens |
| references/04-models.md | Supported model architectures, loading models, HuggingFace, model config |
| references/05-quantization.md | FP8, GPTQ, AWQ, GGUF, BitsAndBytes, hardware compatibility matrix |
| references/06-distributed-inference.md | Tensor/pipeline/expert parallelism, multi-GPU, multi-node, Ray |
| references/07-speculative-decoding.md | Draft models, EAGLE, MTP, n-gram, --speculative-config |
| references/08-structured-outputs.md | JSON schema, regex, grammar constraints, guided decoding backends |
| references/09-tool-calling.md | Function calling, tool parsers, supported models, custom parsers |
| references/10-multimodal.md | Vision, audio, video inputs, embedding inputs, media handling |
| references/11-lora-adapters.md | LoRA serving, dynamic loading/unloading, multi-adapter, plugins |
| references/12-production-deployment.md | Docker, Kubernetes, Prometheus metrics, autoscaling, best practices |
pip install vllm # NVIDIA CUDA (default)
uv pip install vllm --torch-backend=auto # With uv (recommended)
pip install vllm-tpu # Google TPU
tools
Type-safe Python agent framework for building production-grade GenAI applications with Pydantic validation, structured outputs, and dependency injection. MANDATORY TRIGGERS: pydantic-ai, pydantic_ai, PydanticAI, pydantic ai agent. Also trigger when the user wants to build type-safe AI agents in Python, create structured LLM outputs with Pydantic models, implement dependency injection for agents, use tools/capabilities with LLMs, or build multi-agent systems with Python type safety. When in doubt about whether to use this skill for Python AI agent tasks, use it.
development
Durable execution platform for building fault-tolerant workflows, long-running processes, and resilient distributed applications. MANDATORY TRIGGERS: temporal, temporal.io, temporalio, durable execution, workflow orchestration engine. Also trigger when the user wants to build fault-tolerant workflows, implement saga patterns, create long-running distributed processes, orchestrate microservices with retries and timeouts, or build durable AI agent pipelines. When in doubt about whether to use this skill for workflow orchestration or durable execution tasks, use it.
tools
AI framework for building RAG pipelines, agents, workflows, and data-augmented LLM applications with 300+ integrations. MANDATORY TRIGGERS: llamaindex, llama-index, llama_index, LlamaIndex, VectorStoreIndex, SimpleDirectoryReader, LlamaHub, LlamaParse. Also trigger when the user wants to build RAG applications with LlamaIndex, create document indexing pipelines, build agentic workflows with tool calling, implement structured data extraction from documents, or connect LLMs to custom data sources. When in doubt about whether to use this skill for RAG, document indexing, or LLM data augmentation tasks, use it.
development
AI developer platform for experiment tracking, LLM observability, hyperparameter sweeps, artifact versioning, and model registry. MANDATORY TRIGGERS: wandb, weights and biases, weights & biases, W&B, weave, wandb.init, wandb.log. Also trigger when the user wants to track ML experiments, log training metrics, tune hyperparameters with sweeps, version datasets or models, trace LLM calls, evaluate LLM applications, or monitor AI agents. When in doubt about whether to use this skill for ML experiment tracking or LLM observability tasks, use it.