hsliuustc0106/vllm-omni-perf

Name: vllm-omni-perf
Author: hsliuustc0106

skills/vllm-omni-perf/SKILL.md

npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-perf


vLLM-Omni Performance Tuning

Overview

vLLM-Omni provides multiple optimization levers for both autoregressive and diffusion pipelines. Key techniques include KV cache optimization (inherited from vLLM), TeaCache/Cache-DiT for diffusion acceleration, quantization, CPU offloading, and parallelism configuration.

Optimization Quick Reference

| Technique | Applies To | Speedup | Quality Impact |
|-----------|-----------|---------|----------------|
| TeaCache | Diffusion models | 1.5-2.0x | Minimal |
| Cache-DiT | Diffusion models | 1.3-1.8x | Minimal |
| Quantization | All models | 1.2-1.5x | Slight |
| Tensor Parallelism | All models | Near-linear | None |
| Sequence Parallelism | DiT models | Near-linear | None |
| CPU Offloading | All models | Enables larger models | Adds latency |
| GPU Memory Tuning | All models | More throughput | None |
| Multi-Thread Weight Loading | Diffusion models | Faster startup | None |

TeaCache (Diffusion Acceleration)

TeaCache provides adaptive caching for diffusion transformer denoising steps, skipping redundant computations:

vllm serve <model> --omni \
  --enable-teacache \
  --teacache-threshold 0.1

| Parameter | Description | Default |
|-----------|-------------|---------|
| --enable-teacache | Enable TeaCache | Disabled |
| --teacache-threshold | Cache hit threshold (lower = more caching) | Model-specific |

Recommended thresholds by model:

Image models: 0.05-0.15
Video models: 0.08-0.20
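
As a minimal sketch, the recommended ranges above can be turned into a starting value for a threshold sweep. The helper names (`RECOMMENDED_THRESHOLDS`, `starting_threshold`, `teacache_command`) and the model name are hypothetical; only the flags already shown in this section are used, and the midpoint is an arbitrary first guess rather than a tuned value.

```python
import shlex

# Recommended ranges copied from the list above: (low, high) per model family.
RECOMMENDED_THRESHOLDS = {
    "image": (0.05, 0.15),
    "video": (0.08, 0.20),
}


def starting_threshold(family: str) -> float:
    """Return the midpoint of the recommended range as a first value to try."""
    low, high = RECOMMENDED_THRESHOLDS[family]
    return round((low + high) / 2, 3)


def teacache_command(model: str, family: str) -> str:
    """Build the serve command using only the flags shown in this section."""
    args = [
        "vllm", "serve", model, "--omni",
        "--enable-teacache",
        "--teacache-threshold", str(starting_threshold(family)),
    ]
    return shlex.join(args)


# "my-org/my-image-model" is a placeholder, not a real checkpoint.
print(teacache_command("my-org/my-image-model", "image"))
```

From the starting value, adjust the threshold in small steps and compare output quality against a run without TeaCache.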

Cache-DiT

Alternative diffusion acceleration backend:

vllm serve <model> --omni --enable-cache-dit

Can be combined with TeaCache, but test independently first to measure impact.

Supported models: FLUX.2-dev, Helios-Distilled, Wan2.2, and others using ForwardPattern.Pattern_2. Helios achieves ~20% speedup with cache-dit.

TeaCache and CPU Offload hooks are compatible: use them simultaneously with --enable-teacache --enable-cpu-offload (or --cpu-offload-gb). The HookRegistry sorts hooks alphabetically and ensures the forward-overriding hook (TeaCache) runs last in the pre-process chain. Only one forward-overriding hook is allowed at a time.
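
The ordering rule described above can be illustrated with a small sketch. This is not the project's HookRegistry code: `Hook` and `order_hooks` are hypothetical names, and the sketch only mirrors the stated behavior (alphabetical order, the forward-overriding hook last, at most one such hook).

```python
from __future__ import annotations

from dataclasses import dataclass


@dataclass(frozen=True)
class Hook:
    name: str
    overrides_forward: bool = False  # True for a hook that replaces forward()


def order_hooks(hooks: list[Hook]) -> list[Hook]:
    """Sort hooks alphabetically, then place the forward-overriding hook last."""
    overriding = [h for h in hooks if h.overrides_forward]
    if len(overriding) > 1:
        names = ", ".join(sorted(h.name for h in overriding))
        raise ValueError(f"only one forward-overriding hook is allowed, got: {names}")
    regular = sorted(
        (h for h in hooks if not h.overrides_forward), key=lambda h: h.name
    )
    return regular + overriding


hooks = [Hook("teacache", overrides_forward=True), Hook("cpu_offload")]
print([h.name for h in order_hooks(hooks)])  # ['cpu_offload', 'teacache']
```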

Quantization

For full quantization guidance (method selection, AWQ/GPTQ workflows, FP8 KV cache, quality verification), see the dedicated vllm-omni-quantization skill.

Multi-Thread Weight Loading

Diffusion models (Qwen-Image, Wan2.2, FLUX, HunyuanImage3.0, etc.) load safetensors shards in parallel using a thread pool instead of sequentially. This is enabled by default and significantly reduces cold-start time:

Qwen-Image: ~3 min -> substantially faster
Wan2.2-I2V 14B: ~5 min -> substantially faster

No configuration needed -- this is automatic for all diffusion models using safetensors format.

CPU Offloading

Offload model layers to CPU RAM to fit larger models:

Model-Level Offloading

vllm serve <model> --omni --cpu-offload-gb 10

Offloads approximately 10 GB of model weights to CPU. Adds latency for offloaded layers.
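
A rough way to pick a value for --cpu-offload-gb is to compare the weight size with the GPU memory budget. The sketch below is back-of-the-envelope arithmetic only: `offload_gb_needed` and its `reserve_gb` headroom parameter are hypothetical, and it does not model activations, caches, or the server's real memory accounting.

```python
import math


def offload_gb_needed(
    weights_gb: float,
    gpu_total_gb: float,
    gpu_memory_utilization: float = 0.9,
    reserve_gb: float = 0.0,
) -> int:
    """Estimate how many GB of weights must move to CPU RAM.

    budget = gpu_total_gb * gpu_memory_utilization - reserve_gb
    Weights above that budget are candidates for offloading.
    """
    budget = gpu_total_gb * gpu_memory_utilization - reserve_gb
    return max(0, math.ceil(weights_gb - budget))


# Illustrative numbers: 28 GB of weights, a 24 GB GPU at 0.9 utilization,
# with 4 GB kept free for activations.
print(offload_gb_needed(28, 24, 0.9, reserve_gb=4))  # 11
```

Treat the result as a starting point and confirm it by launching the server and checking for OOM.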

Layer-Wise Offloading

For diffusion models, layer-wise offloading moves individual transformer layers to CPU between forward passes:

vllm serve <model> --omni --enable-layerwise-cpu-offload

When multiple DiT transformers exist in a pipeline (e.g., Wan2.2-T2V's transformer + transformer-2), the sequential offloader applies mutual exclusion: only one DiT is loaded on GPU at a time, and all others are offloaded to CPU along with encoders. This prevents OOM on memory-constrained GPUs (64 GB).
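
The mutual-exclusion behavior can be pictured with a toy model. `SequentialOffloader` below is a hypothetical stand-in, not the class used by vLLM-Omni; it only tracks device labels and never touches real GPU memory. The module names mirror the Wan2.2-T2V example above.

```python
class SequentialOffloader:
    """Toy model of mutual exclusion: at most one module is on the GPU."""

    def __init__(self, module_names):
        # Everything starts on the CPU.
        self.device = {name: "cpu" for name in module_names}

    def activate(self, name):
        """Bring one module to the GPU after offloading all the others."""
        if name not in self.device:
            raise KeyError(name)
        for other in self.device:
            if other != name:
                self.device[other] = "cpu"
        self.device[name] = "cuda"

    def on_gpu(self):
        return [n for n, d in self.device.items() if d == "cuda"]


offloader = SequentialOffloader(["transformer", "transformer-2", "text_encoder"])
offloader.activate("transformer")
offloader.activate("transformer-2")
print(offloader.on_gpu())  # ['transformer-2']
```

Under this scheme the peak GPU footprint of the DiT weights is the size of the largest single DiT rather than the sum of all of them.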

GPU Memory Configuration

Maximize throughput by tuning GPU memory allocation:

# Default: 90% of GPU memory
vllm serve <model> --omni --gpu-memory-utilization 0.9

# Conservative: 80% (leaves room for other processes)
vllm serve <model> --omni --gpu-memory-utilization 0.8

# Aggressive: 95%
vllm serve <model> --omni --gpu-memory-utilization 0.95

Benchmarking

Quick Benchmark

python -m vllm_omni.benchmarks.benchmark_serving \
  --model Tongyi-MAI/Z-Image-Turbo \
  --num-prompts 100 \
  --port 8091

Measuring Latency

Time a single request:

time curl -s http://localhost:8091/v1/chat/completions \
  -H "Content-Type: application/json" \
  -d '{
    "messages": [{"role": "user", "content": "a red circle"}],
    "extra_body": {"height": 512, "width": 512, "num_inference_steps": 20}
  }' > /dev/null
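
A single `time curl` sample can be noisy. As a sketch, the same request can be repeated from Python and summarized. The URL and payload are copied from the curl call above; `send_request` and `measure` are hypothetical helper names, and the run and warm-up counts are arbitrary.

```python
import json
import statistics
import time
import urllib.request

URL = "http://localhost:8091/v1/chat/completions"  # same endpoint as the curl call
PAYLOAD = {
    "messages": [{"role": "user", "content": "a red circle"}],
    "extra_body": {"height": 512, "width": 512, "num_inference_steps": 20},
}


def send_request() -> None:
    """POST the payload and read the full response body."""
    req = urllib.request.Request(
        URL,
        data=json.dumps(PAYLOAD).encode("utf-8"),
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        resp.read()


def measure(send, runs: int = 5, warmup: int = 1) -> dict:
    """Time send() over several runs, discarding the warm-up calls."""
    for _ in range(warmup):
        send()
    samples = []
    for _ in range(runs):
        start = time.perf_counter()
        send()
        samples.append(time.perf_counter() - start)
    return {
        "runs": runs,
        "mean_s": statistics.mean(samples),
        "median_s": statistics.median(samples),
        "max_s": max(samples),
    }


# With the server running: print(measure(send_request))
```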

Monitoring During Benchmark

# GPU utilization
watch -n 1 nvidia-smi

# Server metrics
curl http://localhost:8091/metrics
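
To track a few values over a benchmark run instead of reading the raw output, the metrics text can be parsed. This sketch assumes the endpoint returns Prometheus text exposition format without trailing timestamps, as upstream vLLM does; the metric names in the sample are made up for illustration and are not actual vLLM-Omni metrics.

```python
from __future__ import annotations

import urllib.request


def parse_metrics(text: str) -> dict[str, float]:
    """Parse Prometheus text lines into {series: value}, skipping comments."""
    values = {}
    for line in text.splitlines():
        line = line.strip()
        if not line or line.startswith("#"):
            continue
        # The value is the last space-separated field (no timestamp assumed).
        series, _, raw = line.rpartition(" ")
        if not series:
            continue
        try:
            values[series] = float(raw)
        except ValueError:
            continue
    return values


def fetch_metrics(url: str = "http://localhost:8091/metrics") -> dict[str, float]:
    """Fetch and parse the metrics endpoint of a running server."""
    with urllib.request.urlopen(url) as resp:
        return parse_metrics(resp.read().decode("utf-8"))


# Hypothetical sample, only to show the input shape.
sample = """\
# TYPE example_requests_total counter
example_requests_total 12
example_latency_seconds_sum{model="m"} 3.5
"""
print(parse_metrics(sample))
```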

Optimization Workflow

1. Baseline: Run benchmark with default settings
2. Memory: Tune --gpu-memory-utilization to maximize without OOM
3. Parallelism: Add tensor parallelism if multi-GPU available
4. Caching: Enable TeaCache or Cache-DiT for diffusion models
5. Quantization: Apply if memory-constrained
6. Offloading: Use CPU offloading as last resort for large models
7. Re-benchmark: Compare against baseline
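
The final comparison against the baseline can be sketched as a small helper. `speedup_report` is a hypothetical name, and the latencies below are placeholder numbers chosen to show the arithmetic, not measurements.

```python
from __future__ import annotations


def speedup_report(baseline_s: float, results: dict[str, float]) -> dict[str, float]:
    """Speedup of each configuration relative to the baseline latency."""
    if baseline_s <= 0:
        raise ValueError("baseline latency must be positive")
    return {
        name: round(baseline_s / latency_s, 2)
        for name, latency_s in results.items()
    }


# Placeholder latencies in seconds (not real results).
print(speedup_report(10.0, {"teacache": 6.0, "tp2": 5.0}))
# {'teacache': 1.67, 'tp2': 2.0}
```

A value below 1.0 means the configuration is slower than the baseline and should be reverted before moving to the next step.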

Troubleshooting

No speedup with TeaCache: Threshold may be too conservative. Lower it gradually (e.g., 0.05) and check quality.

OOM after optimization: Apply quantization to reduce memory use, and combine it with a lower --gpu-memory-utilization value.

Latency regression with TP: For small models, the communication overhead of tensor parallelism may exceed the compute savings. Use TP only for models that saturate a single GPU.

References

For TeaCache configuration details, see references/teacache.md
For quantization methods and compatibility, see references/quantization.md

Optimize vLLM-Omni performance through benchmarking, TeaCache, Cache-DiT, quantization, CPU offloading, and parallelism tuning. Use when improving inference speed, reducing latency, lowering memory usage, running benchmarks, or enabling diffusion acceleration.

67 stars

testing

Updated May 25, 2026

npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-perf

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean. Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: May 25, 2026, 2:17 AM (10.5s, 4 files scanned)

Install manually


# Clone the repo
git clone https://github.com/hsliuustc0106/vllm-omni-skills.git

# Copy into Claude Code skills folder (global)
cp -r vllm-omni-skills/skills/vllm-omni-perf ~/.claude/skills/


Repository: hsliuustc0106/vllm-omni-skills

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT