BBuf/sglang-qwen3-core-optimization

Name: sglang-qwen3-core-optimization
Author: BBuf

skills/model-optimization/sglang/sglang-qwen3-core-optimization/SKILL.md

npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-qwen3-core-optimization

SGLang Qwen3 Core Optimization

This skill covers the non-hybrid Qwen3 text path: dense Qwen3, Qwen3 MoE, Qwen3-30B-A3B, Qwen3-235B-A22B, Qwen3 Instruct/Thinking variants, and the shared Qwen3 infrastructure reused by Qwen3.5, Qwen3-Next, Qwen3.6, Qwen3 Omni thinker-only, and Qwen-family quantization loaders.

Mandatory Evidence Standard

When writing or revising Qwen3 Core optimization docs, use ../../model-pr-diff-dossier/SKILL.md as the production standard.

Do not fill PR history from a script, generated table, or title-level summary. For every PR you cite, read the PR diff or the final merge commit and record:

why the PR existed,
which runtime/docs/tests changed,
the key implementation idea,
the most important real code excerpt,
validation impact and regression risk.

If a PR is docs-only, quote the exact launch/config line that changed and explain why that line matters for serving or validation.
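The five items above can be captured as one record per PR. The following is a minimal sketch, assuming a hypothetical `PRDiffCard` schema; the class name, field names, and the `missing_fields` helper are illustrative and are not a format defined by the referenced dossier skill.

```python
from dataclasses import dataclass, fields


@dataclass(frozen=True)
class PRDiffCard:
    """One per-PR evidence record (hypothetical schema, for illustration only)."""

    pr_number: int
    motivation: str           # why the PR existed
    changed_surfaces: str     # which runtime/docs/tests changed
    key_idea: str             # the key implementation idea
    code_excerpt: str         # the most important real code excerpt from the diff
    validation_and_risk: str  # validation impact and regression risk

    def missing_fields(self) -> list[str]:
        """Return the names of text fields that are empty or whitespace-only."""
        return [
            f.name
            for f in fields(self)
            if isinstance(getattr(self, f.name), str)
            and not getattr(self, f.name).strip()
        ]


card = PRDiffCard(
    pr_number=0,  # placeholder; use the real PR number
    motivation="placeholder: reason taken from the PR description and diff",
    changed_surfaces="placeholder: files touched",
    key_idea="placeholder: implementation idea",
    code_excerpt="",  # left empty on purpose so the check reports it
    validation_and_risk="placeholder: tests run and what could regress",
)
print(card.missing_fields())
```

A card with a non-empty `missing_fields()` result has not met the evidence standard yet; the check only confirms that each field is filled, not that it came from reading the diff.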

Evidence Files

references/pr-history.md: canonical per-PR diff dossier for Qwen3 Core.
references/playbook.md: fast triage map from symptom to PR families and validation lanes.
model-pr-optimization-history/sglang/qwen3-core/README.zh.md: Chinese PR optimization history.
model-pr-optimization-history/sglang/qwen3-core/README.en.md: English PR optimization history.

Current evidence snapshot:

SGLang mainline checked around 2026-04-22: b3e6cf60a
sgl-cookbook mainline checked around 2026-04-21: 816bad5

Runtime Surfaces

Start from these files before making a Qwen3 Core change:

python/sglang/srt/models/qwen3.py
python/sglang/srt/models/qwen3_moe.py
python/sglang/srt/models/qwen2.py
python/sglang/srt/layers/moe/
python/sglang/srt/layers/layernorm.py
python/sglang/srt/layers/quantization/
python/sglang/srt/layers/attention/
python/sglang/srt/distributed/
python/sglang/srt/function_call/qwen25_detector.py
test/registered/models/test_qwen_models.py
test/registered/4-gpu-models/test_qwen3_30b.py
test/registered/stress/test_stress_qwen3_235b.py
test/srt/models/test_lora_qwen3.py
test/registered/backends/test_qwen3_fp4_trtllm_gen_moe.py
NPU Qwen tests under test/registered/npu/
docs/basic_usage/qwen3.md
docs_new/cookbook/autoregressive/Qwen/Qwen3.mdx
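Because the list above reflects one snapshot of the tree, it can help to confirm that the paths still exist in the checkout being edited. A small sketch, assuming a local SGLang clone; `missing_surfaces` and `RUNTIME_SURFACES` are hypothetical names, and only a subset of the paths is included.

```python
from pathlib import Path

# A subset of the paths listed above; extend as needed.
RUNTIME_SURFACES = [
    "python/sglang/srt/models/qwen3.py",
    "python/sglang/srt/models/qwen3_moe.py",
    "python/sglang/srt/models/qwen2.py",
    "python/sglang/srt/layers/moe/",
    "python/sglang/srt/layers/layernorm.py",
]


def missing_surfaces(repo_root: str, surfaces: list[str] = RUNTIME_SURFACES) -> list[str]:
    """Return the listed paths that do not exist under repo_root."""
    root = Path(repo_root)
    return [p for p in surfaces if not (root / p).exists()]
```

Calling `missing_surfaces("/path/to/sglang")` returns an empty list when every listed file or directory is present; any entry it returns is a path that has moved or been removed since the snapshot.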

Orientation

Treat Qwen3 Core as the compatibility baseline:

Dense Qwen3 owns the common packed QKV, Q/K RMSNorm, RoPE fallback, LM-head, LoRA, embedding, parser, and many platform paths.
Qwen3 MoE owns FusedMoE/EPMoE/DeepEP/EPLB/TBO/FlashInfer/FP8/NVFP4/CP behavior.
Later Qwen families often reuse Qwen3 fixes: PP splitting, tied embeddings, packed module mapping, RoPE config fallback, fused QK norm, EAGLE3 hidden capture, and backend guards.
Open PRs are radar only. Re-read their latest diffs before presenting them as supported behavior.
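The RoPE config fallback mentioned above can be illustrated with a plain dictionary lookup. This is a sketch only: it assumes newer checkpoints carry `rope_theta` inside a `rope_parameters` dict while older ones keep it at the top level, and the lookup order and the `resolve_rope_theta` name are assumptions rather than SGLang's actual loader code.

```python
from typing import Any, Mapping, Optional


def resolve_rope_theta(config: Mapping[str, Any], default: Optional[float] = None) -> float:
    """Prefer rope_parameters["rope_theta"], then top-level rope_theta, then default."""
    rope_parameters = config.get("rope_parameters") or {}
    if rope_parameters.get("rope_theta") is not None:
        return float(rope_parameters["rope_theta"])
    if config.get("rope_theta") is not None:
        return float(config["rope_theta"])
    if default is not None:
        return float(default)
    raise KeyError("no rope_theta in rope_parameters or at the top level")


# Illustrative config fragments, not copied from a real checkpoint.
new_style = {"rope_parameters": {"rope_type": "default", "rope_theta": 1000000.0}}
old_style = {"rope_theta": 1000000.0, "rope_scaling": None}

print(resolve_rope_theta(new_style), resolve_rope_theta(old_style))
```

Raising when neither location has a value, instead of silently picking a default, keeps a mismatched config from producing plausible but wrong positions.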

Change Order

1. Identify the exact model class: Qwen3ForCausalLM, Qwen3MoeForCausalLM, pooled output, embedding, or a downstream Qwen family reusing Qwen3 logic.
2. Confirm config compatibility: rope_parameters, top-level rope_theta, rope_scaling, layer_types, tie_word_embeddings, and quantization config.
3. Establish BF16 correctness with the smallest viable TP/PP shape.
4. Add or debug quantization only after BF16 works: ModelOpt FP8/NVFP4, MXFP4, W4AFP8, GPTQ, ModelSlim, or platform-native quant.
5. For MoE, separate routing/top-k, expert weight loading, A2A backend, all-reduce/reduce-scatter, and load balancing.
6. For fused kernels, prove fallback correctness first, then enable fused QK-norm/RoPE, fused KV store, fused all-reduce, or FP8 KV write.
7. For platform work, keep CUDA, NPU, XPU, MLX, and AMD logic behind backend-specific gates.
8. Update PR history docs with per-PR diff cards if the change affects model support or serving behavior.
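The fused-kernel step above ("prove fallback correctness first") amounts to comparing a candidate against a trusted unfused reference under a tolerance. The sketch below uses a pure-Python RMSNorm over a single head vector as the reference; real checks would run on device tensors, and the function names, the `eps` value, and the tolerance are all assumptions chosen for illustration.

```python
import math
from typing import Callable, Sequence


def rmsnorm_reference(x: Sequence[float], weight: Sequence[float], eps: float = 1e-6) -> list[float]:
    """Unfused RMSNorm over one vector: x / sqrt(mean(x^2) + eps) * weight."""
    mean_sq = sum(v * v for v in x) / len(x)
    inv_rms = 1.0 / math.sqrt(mean_sq + eps)
    return [v * inv_rms * w for v, w in zip(x, weight)]


def max_abs_diff(a: Sequence[float], b: Sequence[float]) -> float:
    """Largest elementwise absolute difference between two equal-length vectors."""
    return max(abs(p - q) for p, q in zip(a, b))


def matches_fallback(candidate: Callable, x, weight, atol: float = 1e-5) -> bool:
    """True if candidate(x, weight) agrees with the reference within atol."""
    expected = rmsnorm_reference(x, weight)
    actual = candidate(x, weight)
    return len(actual) == len(expected) and max_abs_diff(expected, actual) <= atol


def fused_stand_in(x, weight, eps: float = 1e-6):
    """Stand-in for a fused path: same math, with the scale folded into the weight."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [(w * scale) * v for v, w in zip(x, weight)]


def broken_stand_in(x, weight, eps: float = 1e-6):
    """Deliberately wrong variant that drops the weight, to show the check failing."""
    scale = 1.0 / math.sqrt(sum(v * v for v in x) / len(x) + eps)
    return [scale * v for v in x]


x = [0.5, -1.0, 2.0, 0.25]
weight = [1.0, 0.5, 2.0, 1.5]
print(matches_fallback(fused_stand_in, x, weight))   # agrees with the reference
print(matches_fallback(broken_stand_in, x, weight))  # does not agree
```

Keeping the reference path untouched while the candidate changes is what makes the comparison meaningful; the tolerance would need to be loosened for BF16 or FP8 outputs.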

Open Optimization Items

#9147: Qwen3-MoE W4AFP8.
#20127: tied embeddings for Qwen MoE and Qwen3-Next.
#20474: Intel XPU Qwen3 layernorm/MRoPE support.
#20520: NPU TP communication compression.
#21412: dense Qwen3 old-style RoPE compatibility.
#21770: Apple MLX Qwen3 tests.
#22529: sliding window attention for Qwen3.
#22674: shared Qwen NPU quant packed mappings.
#22837: Qwen3 reasoning detector tool-call boundary.
#23372: NPU speculative decoding CI.
#23397: dense deterministic math for alignment.
#23434: Qwen3 pooled-output embedding accessor.

Validation Lanes

Dense correctness: Qwen3 tiny/0.6B/1.7B/4B/8B launch and deterministic prompt checks.
MoE correctness: Qwen3-30B-A3B and Qwen3-235B-A22B under TP/EP/DP attention.
Quantization: BF16 reference plus one target checkpoint per backend, especially ModelOpt FP8/NVFP4 and NPU GPTQ/ModelSlim.
Fused kernels: compare fallback vs fused QK-norm/RoPE, fused KV store, fused FP8 KV write, and fused all-reduce.
PP/tied embeddings: tied and untied checkpoints, with and without lm_head.weight.
Parser: streaming and non-streaming reasoning plus <tool_call> before and after </think>.
Platform: CUDA H100/H200/B200/GB200, Ascend NPU, Intel XPU, Apple MLX, and AMD lanes separately.
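For the parser lane, test inputs need to cover a `<tool_call>` on either side of `</think>`, in both whole-string and chunked form. The helpers below are a hypothetical fixture builder, not SGLang's detector: they only label where the tag sits and split text into chunks to mimic streaming deltas, without asserting what the parser should return for each case.

```python
THINK_END = "</think>"
TOOL_CALL_START = "<tool_call>"


def tool_call_position(text: str) -> str:
    """Classify where the first <tool_call> sits relative to the first </think>."""
    think_end = text.find(THINK_END)
    tool_start = text.find(TOOL_CALL_START)
    if tool_start == -1:
        return "none"
    if think_end == -1:
        return "no_think_end"
    return "before_think_end" if tool_start < think_end else "after_think_end"


def stream_chunks(text: str, size: int) -> list[str]:
    """Split text into fixed-size chunks to mimic streaming deltas."""
    return [text[i:i + size] for i in range(0, len(text), size)]


# Illustrative model outputs for the two boundary cases.
after_case = '<think>plan</think><tool_call>{"name": "f"}</tool_call>'
before_case = "<think>plan <tool_call>{}</tool_call></think>answer"

for case in (after_case, before_case):
    print(tool_call_position(case), len(stream_chunks(case, 3)))
```

Small chunk sizes are useful here because they split the tags themselves across deltas, which is where streaming and non-streaming parsers tend to diverge.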

Non-Negotiable Evidence Rule

Use skills/model-optimization/model-pr-diff-dossier/SKILL.md as the production bar. Every PR cited for this family must be based on diff reading, not only PR titles.

PR-diff-backed optimization manual for Qwen3 dense and Qwen3 MoE in SGLang. Use when an engineer needs to recover, extend, audit, or write documentation for Qwen3/Qwen3-30B/Qwen3-235B-A22B, FP8/NVFP4/MXFP4/W4A4, fused QK-norm/RoPE/KV-store paths, FlashInfer TRTLLM-GEN-MoE, DeepEP/EPLB/TBO/context parallel, EAGLE3, LoRA, PP/tied embeddings, Ascend NPU/XPU/MLX support, or Qwen3 reasoning/tool-parser behavior.

195 stars

tools

Updated May 2, 2026

Install this skill globally with one command:

npx skillsauth add BBuf/AI-Infra-Auto-Driven-SKILLS sglang-qwen3-core-optimization

Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Clean (95%). Container and dependency vulnerability scanner.
Semgrep: Clean (95%). Static code analysis for vulnerabilities.
mcp-scan (Snyk): Clean (95%). Model Context Protocol security validation.
Snyk (dep): Skipped (50%). Open source security scanning.
Socket.dev: Skipped (50%). Supply chain security analysis.
VirusTotal: Skipped (50%). Multi-engine malware detection.
CrowdStrike: Skipped (50%). Advanced threat intelligence.
OSV-Scanner: Skipped (50%). Open Source Vulnerability database check.
OWASP Dep-Check: Skipped (50%).

Last scanned: May 2, 2026, 2:37 AM (145.9s, 3 files scanned)

Install manually

# Clone the repo
git clone https://github.com/BBuf/AI-Infra-Auto-Driven-SKILLS.git

# Create the global Claude Code skills folder if it does not exist yet
mkdir -p ~/.claude/skills

# Copy the skill into it
cp -r AI-Infra-Auto-Driven-SKILLS/skills/model-optimization/sglang/sglang-qwen3-core-optimization ~/.claude/skills/

Claude Code Skills: official skills path docs.

Repository: BBuf/AI-Infra-Auto-Driven-SKILLS

Compatible with: Claude Code, OpenAI Codex CLI, ChatGPT