hsliuustc0106/vllm-omni-quantization

Name: vllm-omni-quantization
Author: hsliuustc0106

skills/vllm-omni-quantization/SKILL.md

npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-quantization

vLLM-Omni Quantization

Overview

Use this skill for vllm-omni quantization work. The current codebase has a unified quantization framework centered on vllm_omni.quantization.build_quant_config(), but the runtime integration still splits into three distinct patterns:

AR and general quantization inherited from upstream vllm
diffusion quantization for DiT models in vllm-omni, currently fp8, int8, and gguf
multi-stage omni quantization using scoped pre-quantized checkpoints such as Qwen3-Omni thinker ModelOpt checkpoints

Core principle: keep generic quantization infrastructure in upstream vllm. Keep vllm-omni focused on unified config routing, component scoping, diffusion-specific model wiring, adapter logic, and verification.

Quick Decision

| Task | Use |
|------|-----|
| Quantize Qwen-Omni, Qwen-TTS, or another AR-backed model | references/methods.md and references/modality-compat.md |
| Use build_quant_config() or per-component quantization | references/methods.md |
| Quantize diffusion transformer weights with fp8, int8, or gguf | references/diffusion.md |
| Add quantization support to a new diffusion model | references/adding-models.md |
| Add a new quantization method such as nvfp4 or a new ModelOpt path | references/diffusion.md and references/adding-models.md |
| Unsure whether a change belongs in vllm or vllm-omni | references/diffusion.md |

When to Use

Choosing a quantization method for memory or throughput
Checking whether a modality or model family actually supports quantization
Using the unified build_quant_config() entrypoint or per-component config dicts
Enabling diffusion fp8, int8, or gguf
Adding a new diffusion quantization method or a pre-quantized multi-stage model path
Debugging quantized loading, tensor-name mapping, shape mismatch, quality drift, or performance regressions

AR vs Diffusion Boundary

AR and general quantization usually mean upstream vllm methods such as awq, gptq, fp8, and KV-cache FP8.
Diffusion quantization means vllm-omni DiT-specific integration on top of the unified framework and should not duplicate upstream vllm kernels or config semantics.
Multi-stage omni quantization often means pre-quantized checkpoints whose scope must be constrained to the intended component, such as the thinker language_model.

Rule: if a new method is missing generic kernels, loader behavior, or config classes, fix upstream vllm first. vllm-omni should add thin wrappers, component routing, and model-specific wiring, not a private quantization stack.
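
To make the idea of component scoping concrete, here is a minimal sketch of how a per-component spec could be resolved to a method for each module. It is not the vllm-omni implementation: the spec shape, the `resolve_method` helper, and the module names are hypothetical and exist only for illustration.

```python
from typing import Mapping, Optional, Union

# Hypothetical spec shape (an assumption, not the vllm-omni API):
# either one method name applied everywhere, or a mapping from a
# component-name prefix to a method name.
QuantSpec = Union[str, Mapping[str, Optional[str]]]


def resolve_method(module_name: str, spec: QuantSpec) -> Optional[str]:
    """Return the quantization method for a module, or None to leave it unquantized."""
    if isinstance(spec, str):
        return spec
    # The longest matching prefix wins, so a nested component can override its parent.
    best = None
    for prefix in spec:
        if module_name == prefix or module_name.startswith(prefix + "."):
            if best is None or len(prefix) > len(best):
                best = prefix
    return spec[best] if best is not None else None


# Scope a pre-quantized checkpoint to the thinker language model only.
# The module names below are illustrative, not taken from a real checkpoint.
spec = {"thinker.language_model": "modelopt"}
for name in (
    "thinker.language_model.layers.0.mlp",
    "thinker.audio_tower.conv1",
    "talker.model.layers.0",
):
    print(name, "->", resolve_method(name, spec))
```

Only the first module resolves to a method; the other two fall outside the scoped prefix and stay unquantized, which is the behavior the multi-stage rule asks for.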

Example: Enable FP8 for a Diffusion Model

# 1. Start the server with fp8 quantization (--port matches the URLs used below)
vllm serve black-forest-labs/FLUX.1-dev --omni --port 8091 \
  --quantization fp8 --tensor-parallel-size 2

# 2. Verify quantized model loaded correctly
curl -s http://localhost:8091/v1/models | python3 -c "
import sys, json
models = json.load(sys.stdin)['data']
print(f'Loaded: {models[0][\"id\"]}') if models else print('ERROR: No models loaded')
"

# 3. Generate test image with fixed seed for quality comparison
curl -s http://localhost:8091/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"a cat","seed":42}' -o test_fp8.json

# 4. Compare against BF16 baseline (run separately without --quantization)
# If PSNR < 25 dB or visual artifacts appear, check references/diffusion.md
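
The PSNR check in step 4 can be sketched as follows. This is a minimal standard-library version that assumes both images have already been decoded into flat sequences of 8-bit pixel values of equal length; decoding the API response is left out, and the `psnr` and `passes_gate` helper names are hypothetical. The 25 dB threshold is the one quoted in step 4.

```python
import math
from typing import Sequence

PSNR_GATE_DB = 25.0  # threshold quoted in step 4 of the example


def psnr(baseline: Sequence[int], candidate: Sequence[int], max_value: int = 255) -> float:
    """Peak signal-to-noise ratio in dB between two equally sized pixel sequences."""
    if len(baseline) != len(candidate) or len(baseline) == 0:
        raise ValueError("inputs must be non-empty and the same length")
    # Mean squared error over all pixel values.
    mse = sum((a - b) ** 2 for a, b in zip(baseline, candidate)) / len(baseline)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def passes_gate(baseline: Sequence[int], candidate: Sequence[int]) -> bool:
    """True when the quantized output stays at or above the PSNR gate."""
    return psnr(baseline, candidate) >= PSNR_GATE_DB


# Toy pixel data standing in for the BF16 baseline and the fp8 output.
bf16_pixels = [100, 100, 200, 50]
fp8_pixels = [102, 99, 198, 51]
print(f"PSNR: {psnr(bf16_pixels, fp8_pixels):.2f} dB, pass: {passes_gate(bf16_pixels, fp8_pixels)}")
```

A single-image PSNR is only a coarse signal, so treat it as a first filter before the fuller verification flow in references/diffusion.md.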

Common Mistakes

| Symptom | Likely Cause | Fix |
|--------|--------------|-----|
| quantization flag has no visible effect | wrong model stage or unsupported modality | check references/modality-compat.md |
| unified config behaves unexpectedly | per-component dict, default routing, or method override is misunderstood | check references/methods.md and references/diffusion.md |
| AR model quality drops too much | aggressive 4-bit setup or wrong method | check calibration and method tradeoffs in references/methods.md |
| diffusion method works on one image only | no baseline comparison, no LPIPS gate, or no ignored_layers tuning | use the verification flow in references/diffusion.md and references/adding-models.md |
| GGUF mapping fails | missing architecture-specific adapter | add explicit adapter logic, do not rely on generic fallback |
| new quantization method design keeps growing | unified framework boundary is unclear | re-check ownership before touching model code |
| multi-stage omni checkpoint loads but wrong stages get quantized | component scope is not constrained correctly | check component routing and model config normalization |
| diffusion FA uses wrong KV-cache dtype when --kv-cache-dtype is passed | vLLM's --kv-cache-dtype leaked into diffusion config | use --diffusion-kv-cache-dtype instead (#3596) |
| offline text_to_image fails | quantization config conflicts with diffusion pipeline init | fixed in #1515, update vllm-omni |
| OmniDiffusion init crash | pipeline_class variable not initialized during quantized load | fixed in #1562, update vllm-omni |
| HunyuanImage3.0 load_weights error | weight loading fails with quantized HunyuanImage3.0 | fixed in #1598, update vllm-omni |
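
The "explicit adapter logic" fix for GGUF mapping failures can be pictured roughly like this. The sketch below is not code from vllm-omni: the `RULES` table, the `map_tensor_names` helper, and both the checkpoint-side and model-side tensor names are made up for illustration. The point it shows is that unmapped names raise an error instead of silently passing through a generic fallback.

```python
import re
from typing import Dict, Iterable, List, Pattern, Tuple

# Hypothetical rules for one architecture:
# (pattern on the checkpoint tensor name, template for the model parameter name).
RULES: List[Tuple[Pattern[str], str]] = [
    (re.compile(r"^blk\.(\d+)\.attn_q\.weight$"), r"transformer_blocks.\1.attn.to_q.weight"),
    (re.compile(r"^blk\.(\d+)\.attn_k\.weight$"), r"transformer_blocks.\1.attn.to_k.weight"),
]


def map_tensor_names(names: Iterable[str]) -> Dict[str, str]:
    """Map checkpoint tensor names to model parameter names, failing on any gap."""
    mapped: Dict[str, str] = {}
    unmapped: List[str] = []
    for name in names:
        for pattern, template in RULES:
            if pattern.match(name):
                mapped[name] = pattern.sub(template, name)
                break
        else:
            # No rule matched: record it rather than guessing a target name.
            unmapped.append(name)
    if unmapped:
        raise KeyError(f"no explicit mapping for tensors: {unmapped}")
    return mapped


print(map_tensor_names(["blk.0.attn_q.weight", "blk.7.attn_k.weight"]))
```

Failing loudly at mapping time tends to surface a missing adapter rule much earlier than a shape mismatch deep inside weight loading would.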

References

AR and general methods: references/methods.md
Model and modality support matrix: references/modality-compat.md
Diffusion fp8, int8, gguf, and unified-framework workflow: references/diffusion.md
Adding quantization support to a new model: references/adding-models.md

Use when working on vLLM-Omni quantization for autoregressive, diffusion, or multi-stage omni models, choosing methods such as `awq`, `gptq`, `fp8`, `int8`, `gguf`, or ModelOpt checkpoints, adding quantized model support, or debugging memory, loader, quality, or performance issues.

67 stars

development

Updated May 25, 2026

Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):

npx skillsauth add hsliuustc0106/vllm-omni-skills vllm-omni-quantization

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

| Scanner | Description | Status | Percentage |
|---------|-------------|--------|------------|
| Trivy | Container and dependency vulnerability scanner | Clean | 95% |
| Semgrep | Static code analysis for vulnerabilities | Clean | 95% |
| mcp-scan (Snyk) | Model Context Protocol security validation | Clean | 95% |
| Snyk (dep) | Open source security scanning | Skipped | 50% |
| Socket.dev | Supply chain security analysis | Skipped | 50% |
| VirusTotal | Multi-engine malware detection | Skipped | 50% |
| CrowdStrike | Advanced threat intelligence | Skipped | 50% |
| OSV-Scanner | Open Source Vulnerability database check | Skipped | 50% |
| OWASP Dep-Check | | Skipped | 50% |

Last scanned: May 25, 2026, 2:17 AM (13.1 s, 5 files scanned)

Install manually

# Clone the repo
git clone https://github.com/hsliuustc0106/vllm-omni-skills.git

# Copy into Claude Code skills folder (global)
cp -r vllm-omni-skills/skills/vllm-omni-quantization ~/.claude/skills/

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT