skills/model-optimization/sglang/sglang-qwen-image-optimization/SKILL.md
PR-backed and current-main optimization manual for Qwen-Image and Qwen-Image-Edit in SGLang Diffusion. Use when Codex needs to recover, extend, or audit diffusion transformer loading, layer serving, CUDA graph, TeaCache, IMA, ModelOpt FP8, AMD kernels, Qwen-Image detectors, or cookbook diffusion recipes.
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-qwen-image-optimization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Qwen-Image is the diffusion Qwen lane. It is intentionally split from autoregressive Qwen because the runtime surfaces are SGLang Diffusion pipelines, transformer layers, ModelOpt export/loading, image-edit conditioning, CUDA graph, TeaCache, and AMD diffusion kernels.
Current evidence snapshot:
- origin/main: bca3dd958 on 2026-04-24
- origin/main: 816bad5 on 2026-04-21
- 2026-04-23
- docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image.mdx, Qwen-Image-Edit.mdx, and docs_new/src/snippets/diffusion/qwen-image*.jsx

When writing or updating Qwen-Image optimization docs, manually inspect every cited PR diff. Include PR link, state, diff stats, motivation, implementation path, key code excerpt, reviewed files, and validation implications. Use model-pr-diff-dossier for new PR-history production.
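The required contents of a PR card listed above can be sketched as a small record with a completeness check. This is a minimal illustration only: the `PRCard` class, its field names, and the `missing_fields` helper are hypothetical and are not part of SGLang or of the model-pr-diff-dossier skill.

```python
from dataclasses import dataclass, field, fields


@dataclass
class PRCard:
    """Hypothetical record for one diff-reviewed PR (names are illustrative)."""

    link: str = ""
    state: str = ""  # for example "open" or "merged"
    diff_stats: str = ""
    motivation: str = ""
    implementation_path: str = ""
    key_code_excerpt: str = ""
    reviewed_files: list[str] = field(default_factory=list)
    validation_implications: str = ""


def missing_fields(card: PRCard) -> list[str]:
    """Return the names of fields that are still empty, in declaration order."""
    return [f.name for f in fields(card) if not getattr(card, f.name)]


if __name__ == "__main__":
    # Only the PR number and state are taken from this document;
    # every other field is deliberately left empty.
    card = PRCard(link="#22953", state="merged")
    print(missing_fields(card))
```

A check like this could flag a card as incomplete before it is added to references/pr-history.md; it does not replace the manual diff review.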
Capture:
Keep diffusion pipeline changes separate from autoregressive Qwen.
- docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image.mdx
- docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image-Edit.mdx
- docs_new/src/snippets/diffusion/qwen-image-deployment.jsx
- docs_new/src/snippets/diffusion/qwen-image-edit-deployment.jsx
- python/sglang/multimodal_gen/configs/pipeline_configs/qwen_image.py
- python/sglang/multimodal_gen/configs/models/dits/qwenimage.py
- python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py
- python/sglang/multimodal_gen/runtime/models/encoders/qwen2_5vl.py
- python/sglang/multimodal_gen/runtime/cache/teacache.py
- python/sglang/multimodal_gen/runtime/utils/cuda_graph.py
- python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py

Read references/pr-history.md for the diff-reviewed card of each PR. Most entries are currently open; #22953 is merged in current main and can be treated as current behavior.
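Before auditing, it can help to confirm that the source paths named in this section still exist in the checkout being reviewed. The sketch below assumes all paths resolve under a single checkout root, which may not hold if the cookbook docs live in a separate repository. The `missing_paths` helper and the `QWEN_IMAGE_PATHS` name are hypothetical.

```python
from pathlib import Path

# Paths copied from this section; whether they share one repo root is an assumption.
QWEN_IMAGE_PATHS = [
    "docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image.mdx",
    "docs_new/cookbook/diffusion/Qwen-Image/Qwen-Image-Edit.mdx",
    "docs_new/src/snippets/diffusion/qwen-image-deployment.jsx",
    "docs_new/src/snippets/diffusion/qwen-image-edit-deployment.jsx",
    "python/sglang/multimodal_gen/configs/pipeline_configs/qwen_image.py",
    "python/sglang/multimodal_gen/configs/models/dits/qwenimage.py",
    "python/sglang/multimodal_gen/runtime/models/dits/qwen_image.py",
    "python/sglang/multimodal_gen/runtime/models/encoders/qwen2_5vl.py",
    "python/sglang/multimodal_gen/runtime/cache/teacache.py",
    "python/sglang/multimodal_gen/runtime/utils/cuda_graph.py",
    "python/sglang/multimodal_gen/tools/build_modelopt_fp8_transformer.py",
]


def missing_paths(repo_root, paths=QWEN_IMAGE_PATHS):
    """Return the listed paths that are not regular files under repo_root."""
    root = Path(repo_root)
    return [p for p in paths if not (root / p).is_file()]


if __name__ == "__main__":
    # "." is a placeholder; point it at the checkout under review.
    for path in missing_paths("."):
        print("missing:", path)
```

A non-empty result suggests a file was moved or renamed since the evidence snapshot, so the corresponding PR card should be re-reviewed rather than trusted.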
- references/pr-history.md: manually diff-reviewed PR cards.
- references/playbook.md: symptom map, toggle matrix, and validation rules.