skills/model-optimization/sglang/sglang-moss-vl-optimization/SKILL.md
PR-backed and current-main optimization manual for Moss-VL in SGLang. Use when an engineer needs to audit or extend Moss-VL multimodal runtime support, Qwen3VL-like vision encoder plumbing, cross-attention custom masks, vision position ids, image/video processor behavior, conversation template registration, flashinfer prefill requirements, or Moss-VL weight loading.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-moss-vl-optimization
Moss-VL landed as a native SGLang runtime path in #23454. It is not a thin alias: the PR added a dedicated model file, processor, multimodal scheduling fields, prompt template, encoder-prefix handling, and a flashinfer prefill requirement for cross-attention custom masks.
Current evidence snapshot:
- origin/main: bca3dd958 on 2026-04-24
- moss_vl.py, multimodal/processors/moss_vl.py, schedule_batch.py, conversation.py, server_args.py

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar.
Every PR cited for this family must be based on diff reading, not only PR titles.
Capture:
- grid_thw, media_nums_per_sample, visible_frame_counts, and vision_position_ids
- moss-vl)

Debug Moss-VL at the multimodal boundary.
- cross_attention_mask and becomes a packed flashinfer custom mask.

Before adding Moss-VL evidence, open the PR diff/source and update references/pr-history.md with motivation, key implementation, real code excerpts, reviewed files, and validation implications.
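When auditing the step where cross_attention_mask becomes a packed custom mask, it can help to have a small reference packer to compare against. The following is a minimal NumPy sketch, not the Moss-VL or flashinfer implementation: the helper names `pack_cross_attention_masks` and `unpack_mask`, the per-request byte alignment, and the little-endian bit order are all assumptions made for illustration, and the layout flashinfer actually expects should be confirmed against its documentation and the PR diff.

```python
import numpy as np


def pack_cross_attention_masks(masks):
    """Pack per-request boolean masks into one uint8 buffer (hypothetical helper).

    Each mask has shape (num_query_tokens, num_kv_tokens); True means the query
    token may attend to that key/value position. Every request is flattened
    row-major and packed on its own, so each request starts on a byte boundary
    (an assumption of this sketch).

    Returns (packed, byte_offsets): request i occupies
    packed[byte_offsets[i]:byte_offsets[i + 1]].
    """
    chunks = []
    for mask in masks:
        flat = np.asarray(mask, dtype=bool).reshape(-1)
        # bitorder="little": the first mask element becomes the lowest bit.
        chunks.append(np.packbits(flat, bitorder="little"))
    sizes = np.array([c.size for c in chunks], dtype=np.int64)
    byte_offsets = np.concatenate(([0], np.cumsum(sizes))).astype(np.int64)
    packed = np.concatenate(chunks) if chunks else np.zeros(0, dtype=np.uint8)
    return packed, byte_offsets


def unpack_mask(packed, byte_offsets, index, shape):
    """Recover the boolean mask of request `index` with the given 2-D shape."""
    start, end = int(byte_offsets[index]), int(byte_offsets[index + 1])
    num_bits = int(shape[0]) * int(shape[1])
    bits = np.unpackbits(packed[start:end], count=num_bits, bitorder="little")
    return bits.astype(bool).reshape(shape)


# Two requests: a 3x3 causal mask and a 1x5 mask with some positions hidden.
example_masks = [
    np.tril(np.ones((3, 3), dtype=bool)),
    np.array([[True, False, True, True, False]]),
]
example_packed, example_offsets = pack_cross_attention_masks(example_masks)
```

Round-tripping a mask through `unpack_mask` and comparing it with the original is a cheap way to check that a packed buffer still encodes the intended visibility pattern before it reaches the attention backend.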
- --prefill-attention-backend flashinfer and failure when another backend is forced

- references/pr-history.md: diff-reviewed Moss-VL PR cards.
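The flashinfer prefill requirement and the forced-backend failure noted above can be illustrated with a small guard. This is a hedged sketch of the expected behavior only: `resolve_prefill_backend` and `REQUIRED_PREFILL_BACKEND` are hypothetical names, and the actual check in server_args.py should be read from the PR diff rather than inferred from this example.

```python
from typing import Optional

# Assumption for illustration: models that need cross-attention custom masks
# must run prefill on this backend.
REQUIRED_PREFILL_BACKEND = "flashinfer"


def resolve_prefill_backend(needs_custom_mask: bool, requested: Optional[str]) -> Optional[str]:
    """Return the prefill backend to use, or raise if the request conflicts.

    needs_custom_mask: True for models (such as Moss-VL) whose prefill relies
        on a cross-attention custom mask.
    requested: the backend the user forced on the command line, or None if
        the user left it unset.
    """
    if not needs_custom_mask:
        # No special requirement: keep whatever the user asked for.
        return requested
    if requested is None:
        # Nothing forced: fall back to the required backend.
        return REQUIRED_PREFILL_BACKEND
    if requested != REQUIRED_PREFILL_BACKEND:
        raise ValueError(
            f"prefill attention backend {requested!r} was forced, but "
            f"cross-attention custom masks require {REQUIRED_PREFILL_BACKEND!r}"
        )
    return requested
```

A validation lane built on this idea would cover three cases: the backend left unset, flashinfer set explicitly, and another backend forced, with the last one expected to fail at startup rather than during the first prefill.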