skills/vllm-omni-quantization/SKILL.md
Use when working on vLLM-Omni quantization for autoregressive, diffusion, or multi-stage omni models, choosing methods such as `awq`, `gptq`, `fp8`, `int8`, `gguf`, or ModelOpt checkpoints, adding quantized model support, or debugging memory, loader, quality, or performance issues.
Use this skill for `vllm-omni` quantization work. The current codebase has a unified quantization framework centered on `vllm_omni.quantization.build_quant_config()`, but the runtime integration still splits into distinct patterns:

- AR-backed models: methods provided by upstream `vllm`
- diffusion models: methods integrated in `vllm-omni`, currently `fp8`, `int8`, and `gguf`

Core principle: keep generic quantization infrastructure in upstream `vllm`. Keep `vllm-omni` focused on unified config routing, component scoping, diffusion-specific model wiring, adapter logic, and verification.
| Task | Use |
|------|-----|
| Quantize Qwen-Omni, Qwen-TTS, or another AR-backed model | references/methods.md and references/modality-compat.md |
| Use build_quant_config() or per-component quantization | references/methods.md |
| Quantize diffusion transformer weights with fp8, int8, or gguf | references/diffusion.md |
| Add quantization support to a new diffusion model | references/adding-models.md |
| Add a new quantization method such as nvfp4 or a new ModelOpt path | references/diffusion.md and references/adding-models.md |
| Unsure whether a change belongs in vllm or vllm-omni | references/diffusion.md |
- Unified configuration: the `build_quant_config()` entrypoint or per-component config dicts.
- Diffusion methods: `fp8`, `int8`, or `gguf`.
- AR-backed models: upstream `vllm` methods such as `awq`, `gptq`, `fp8`, and KV-cache FP8.
- Diffusion quantization is `vllm-omni` DiT-specific integration on top of the unified framework and should not duplicate upstream `vllm` kernels or config semantics.
- Component scoping: `language_model`.

Rule: if a new method is missing generic kernels, loader behavior, or config classes, fix upstream `vllm` first. `vllm-omni` should add thin wrappers, component routing, and model-specific wiring, not a private quantization stack.
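The idea of per-component config dicts and component scoping can be illustrated with a small standalone sketch. It does not call `build_quant_config()` or reproduce its signature; the helper `resolve_component_methods` and the component names `transformer` and `vae` are hypothetical, and only `language_model` and the method names come from this skill.

```python
from typing import Dict, Optional, Sequence, Union

# A spec is either one method name for everything, or a dict keyed by component.
QuantSpec = Union[str, Dict[str, Optional[str]]]


def resolve_component_methods(
    spec: QuantSpec,
    components: Sequence[str],
    default: Optional[str] = None,
) -> Dict[str, Optional[str]]:
    """Map every component name to a quantization method, or None for unquantized.

    Hypothetical helper for illustration only; not part of vllm or vllm-omni.
    """
    if isinstance(spec, str):
        # A bare string applies the same method to every component.
        return {name: spec for name in components}
    unknown = set(spec) - set(components)
    if unknown:
        # Fail loudly instead of silently quantizing the wrong stage.
        raise ValueError(f"unknown components: {sorted(unknown)}")
    return {name: spec.get(name, default) for name in components}


# Scope fp8 to the (hypothetical) diffusion transformer only.
scoped = resolve_component_methods(
    {"transformer": "fp8"},
    ["language_model", "transformer", "vae"],
)
print(scoped)
```

Rejecting unknown component names up front is one way to surface the "wrong stages get quantized" failure early; whether the real framework validates names this way is an assumption.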
```bash
# 1. Start the server with fp8 quantization (runs in the foreground; use a second
#    terminal for the steps below). --port matches the port used by the curl calls.
vllm serve black-forest-labs/FLUX.1-dev --omni \
  --quantization fp8 --tensor-parallel-size 2 --port 8091

# 2. Verify that the quantized model loaded correctly
curl -s http://localhost:8091/v1/models | python3 -c "
import sys, json
models = json.load(sys.stdin)['data']
print(f'Loaded: {models[0][\"id\"]}' if models else 'ERROR: No models loaded')
"

# 3. Generate a test image with a fixed seed for quality comparison
curl -s http://localhost:8091/v1/images/generations \
  -H 'Content-Type: application/json' \
  -d '{"prompt":"a cat","seed":42}' -o test_fp8.json

# 4. Compare against a BF16 baseline (run separately without --quantization)
# If PSNR < 25 dB or visual artifacts appear, check references/diffusion.md
```
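The PSNR check in step 4 can be sketched in plain Python. This is a minimal illustration, assuming both images have already been decoded into equal-length flat sequences of 8-bit channel values; the decoding step and the function names `psnr` and `passes_psnr_gate` are assumptions, while the 25 dB threshold is the one quoted above.

```python
import math
from typing import Iterable, Sequence, Tuple


def psnr(reference: Sequence[int], candidate: Sequence[int],
         max_value: float = 255.0) -> float:
    """Peak signal-to-noise ratio in dB between two equal-length pixel sequences."""
    if len(reference) == 0 or len(reference) != len(candidate):
        raise ValueError("inputs must be non-empty and of equal length")
    # Mean squared error over all channel values.
    mse = sum((r - c) ** 2 for r, c in zip(reference, candidate)) / len(reference)
    if mse == 0:
        return math.inf  # identical images
    return 10.0 * math.log10(max_value ** 2 / mse)


def passes_psnr_gate(pairs: Iterable[Tuple[Sequence[int], Sequence[int]]],
                     threshold_db: float = 25.0) -> bool:
    """True only if every (baseline, quantized) pair meets the threshold."""
    return all(psnr(ref, cand) >= threshold_db for ref, cand in pairs)


# Toy data standing in for decoded BF16 and fp8 outputs of the same prompt and seed.
baseline = [0, 64, 128, 255]
quantized = [1, 63, 130, 254]
print(f"PSNR: {psnr(baseline, quantized):.2f} dB")
print("gate passed:", passes_psnr_gate([(baseline, quantized)]))
```

Gating on several prompt and seed pairs rather than a single image gives more confidence that a method is not just working on one sample; PSNR alone does not replace a perceptual metric such as LPIPS.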
| Symptom | Likely Cause | Fix |
|--------|--------------|-----|
| quantization flag has no visible effect | wrong model stage or unsupported modality | check references/modality-compat.md |
| unified config behaves unexpectedly | per-component dict, default routing, or method override is misunderstood | check references/methods.md and references/diffusion.md |
| AR model quality drops too much | aggressive 4-bit setup or wrong method | check calibration and method tradeoffs in references/methods.md |
| diffusion method works on one image only | no baseline comparison, no LPIPS gate, or no ignored_layers tuning | use the verification flow in references/diffusion.md and references/adding-models.md |
| GGUF mapping fails | missing architecture-specific adapter | add explicit adapter logic, do not rely on generic fallback |
| new quantization method design keeps growing | unified framework boundary is unclear | re-check ownership before touching model code |
| multi-stage omni checkpoint loads but wrong stages get quantized | component scope is not constrained correctly | check component routing and model config normalization |
| diffusion FA uses wrong KV-cache dtype when --kv-cache-dtype is passed | vLLM's --kv-cache-dtype leaked into diffusion config | use --diffusion-kv-cache-dtype instead (#3596) |
| offline text_to_image fails | quantization config conflicts with diffusion pipeline init | fixed in #1515, update vllm-omni |
| OmniDiffusion init crash | pipeline_class variable not initialized during quantized load | fixed in #1562, update vllm-omni |
| HunyuanImage3.0 load_weights error | weight loading fails with quantized HunyuanImage3.0 | fixed in #1598, update vllm-omni |
`fp8`, `int8`, `gguf`, and unified-framework workflow: references/diffusion.md