
Perform SGLang code review in the style of human maintainers by consulting the full non-agent PR review episode corpus from project start through the latest refresh (June 2026), including inline review threads, top-level PR comments, review submissions, original multilingual text, and multi-round discussions. Use when reviewing SGLang PRs, diffs, patches, or local changes for correctness, tests, performance, GPU/runtime risks, API compatibility, and maintainability.
Use when an SGLang, vLLM, or TensorRT-LLM serving/model optimization task needs prior model-family PR evidence. Query and read the PR-driven history docs under model-pr-optimization-history before choosing source paths, fast paths, kernel/fusion ideas, regression risks, or validation lanes.
Inspect LLM torch profiler traces at forward-pass, layer, and kernel level. Use when you need layer timings, anchor-kernel boundaries, representative kernel flows, or Perfetto time ranges.
Run an autonomous Humanize-governed SGLang SOTA performance loop for one LLM model: first perform the fixed fair SGLang/vLLM/TensorRT-LLM deployment search and benchmark, then start one RLCR loop that repeatedly decides the gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches SGLang code, optionally uses ncu-report-skill for kernel evidence, and revalidates until SGLang matches or beats the best observed framework under the same workload and SLA.
Run an autonomous Humanize-governed vLLM SOTA performance loop for one LLM model: first perform the fixed fair vLLM/SGLang/TensorRT-LLM deployment search and benchmark, then start one RLCR loop that repeatedly decides the gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches vLLM code, optionally uses ncu-report-skill for kernel evidence, and revalidates until vLLM matches or beats the best observed framework under the same workload and SLA.
Build an operator-level compute template for an LLM and estimate FLOPs/MFU for a serving shape. Use when you need tensor shapes, per-op FLOPs, kernel-to-op MFU mapping, or parallelism what-if analysis.
Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.
Use when creating or revising model PR optimization history documents for SGLang, vLLM, or another serving framework that cite GitHub PRs. Requires manual, per-PR source-diff review and documentation of motivation, key implementation approach, most important code excerpts, reviewed files, and validation implications instead of generated or one-line summaries.
Unified LLM torch-profiler triage skill for `sglang`, `vllm`, and `TensorRT-LLM`. Use it to inspect an existing `trace.json(.gz)` or profile directory, or to drive live profiling against a running server and return one three-table report with kernel, overlap-opportunity, and fuse-pattern tables.
Framework-independent LLM serving benchmark skill for comparing SGLang, vLLM, TensorRT-LLM, or another serving framework. Use when a user wants to find the best deployment command for one model across multiple serving frameworks under the same workload, GPU budget, and latency SLA.
Use when developing, optimizing, debugging, or porting AI-infra GPU kernels through an AKO4ALL-centered loop, including Triton, CUDA C++/PTX, CUTLASS/CuTe C++, and CuTe DSL kernels; also use when setting up a sibling AKO4ALL repo, creating microbench harnesses, profiling with nsys/ncu, and validating kernel changes against real operator or model benchmarks. Do not trigger on simple Triton or CUDA API lookups; this skill is for full optimization or rewrite tasks where AKO discipline pays off.
Return public original model architecture diagrams for user-specified LLM, VLM, MoE, diffusion, OCR, and SGLang/sgl-cookbook model families. Use when the user asks for a model structure chart, architecture diagram, or rendered image link for a specific model such as DeepSeek, GLM, Qwen, Kimi, MiniMax, Step, Hunyuan, or Qwen3-VL.
PR-backed optimization manual for MOSS-VL in vLLM. Use when an engineer needs to audit, debug, extend, or document MOSS-VL support in vLLM. Tracking note only: MOSS-VL does not have a native vLLM mainline model module today.
PR-backed optimization manual for Step3.5 / Step3-VL in vLLM. Use when an engineer needs to audit, debug, extend, or document Step3.5-Flash and Step3-VL serving, NVFP4, tool/reasoning parser, and HF-style processor evolution.
PR-backed optimization manual for GPT-OSS in SGLang. Use when an engineer needs to audit, debug, extend, or document OpenAI GPT-OSS MoE, MXFP4/FP8 quantization, DP/EP, reasoning parser, tool calling, and Eagle/spec decode.
PR-backed and current-main optimization manual for DeepSeek V3 and DeepSeek R1 in SGLang. Use when an engineer needs to recover, extend, or audit DeepSeek V3/R1 MLA, MoE, shared experts, FP8/FP4/W4AFP8/MXFP4/NVFP4 loading, MTP, DeepEP, DP attention, LoRA, backend selection, or validation lanes.
PR-backed and current-main optimization manual for GLM-4V, GLM-4.1V, GLM-4.5V, GLM-4.6V, GLM-Glyph, and GLM-OCR in SGLang. Use when an engineer needs to recover, extend, or audit GLM vision processors, GLM4V MoE, vision encoder DP/PP, GLM-OCR/NextN loading, transformers compatibility, Conv3D-to-linear projection, AMD/NPU validation, or VLM/OCR cookbook recipes.
PR-backed and current-main optimization manual for GLM-4.5 and GLM-4.5 Air/MoE in SGLang. Use when an engineer needs to recover, extend, or audit GLM-4.5 MoE loading, A2A/DeepEP, reduce-scatter behavior, NVFP4 padding, tool parser behavior, AMD/NPU/Blackwell validation, or GLM-4.5 cookbook recipes.
PR-backed optimization manual for Gemma 4 in SGLang. Use when an engineer needs to audit, debug, extend, or document Gemma 4 text, MoE, multimodal, reasoning, tool use, and quantized MoE serving.
PR-backed optimization manual for Ernie4.5 / Ernie4.5-VL in SGLang. Use when an engineer needs to audit, debug, extend, or document the SGLang Ernie4.5 multimodal runtime, especially the initial VL landing, fused Triton rotary path, and later cos/sin cache rewrite for Ernie4.5-VL.
PR-backed optimization manual for Intern-S1 in SGLang. Use when an engineer needs to audit, debug, extend, or document Intern-S1 language and video-aware serving, processor integration, and tool/reasoning parser behavior.
PR-backed and current-main optimization manual for GLM-5 and GLM-5.1 in SGLang. Use when an engineer needs to recover, extend, or audit GLM-5 DSA/NSA/NSA indexer paths, GLM-5.1 FP8/MXFP4/NVFP4, NextN/MTP, dense-attention threshold, NSA TileLang/AITER, tool templates, EAGLE, PCG, AMD/Blackwell/NPU validation, or GLM-5 cookbook recipes.
PR-backed optimization manual for InternVL3.5 in SGLang. Use when an engineer needs to audit, debug, extend, or document InternVL3.5 multimodal processor, video support, ViT DP / CUDA graph, and non-CUDA backend compatibility.
PR-backed and current-main optimization manual for `moonshotai/Kimi-K2*`, `moonshotai/Kimi-K2.5*`, and `moonshotai/Kimi-K2.6*` in SGLang. Use when an engineer needs to recover, extend, or audit Kimi optimizations, including K2 router/MoE fast paths, K2 thinking Marlin paths, K2.5/K2.6 wrapper/multimodal/runtime plumbing, W4AFP8/W4A16/MXFP4 quant tracks, parser contracts, LoRA coverage, and backend-specific validation.
PR-backed and current-main optimization manual for Moss-VL in SGLang. Use when an engineer needs to audit or extend Moss-VL multimodal runtime support, Qwen3VL-like vision encoder plumbing, cross-attention custom masks, vision position ids, image/video processor behavior, conversation template registration, flashinfer prefill requirements, or Moss-VL weight loading.
PR-backed optimization manual for MiMo-V2 / MiMo-V2-Flash / MiMo-V2.5 in SGLang. Use when an engineer needs to audit, debug, extend, or document MiMo-V2 inference-centric MoE runtime, flashinfer/TRT-LLM fused all-reduce, overlap, MTP/EAGLE, multimodal/pro variants, and reasoning parser behavior.
PR-backed and current-main optimization manual for the `MiniMaxAI/MiniMax-M2` series, including M2, M2.1, M2.5, M2.7, and M2.7-highspeed. Use when an engineer needs to recover, extend, or audit MiniMax-specific optimizations, TP QK norm/all-reduce behavior, parser contracts, distributed runtime behavior, quantized loading, or backend-specific validation.
PR-backed and current-main optimization manual for GLM-4.6, GLM-4.6V-adjacent text paths, GLM-4.7, and GLM-4.7-Flash in SGLang. Use when an engineer needs to recover, extend, or audit GLM shared-expert fusion, dual-stream MoE GEMM overlap, GLM-4.7 tool parser, NVFP4/MTP, GLM4-MoE-Lite/Flash loading, AMD/NPU validation, or GLM-4.6/4.7 cookbook recipes.
PR-backed optimization manual for Nemotron Super / Nano Hybrid in SGLang. Use when an engineer needs to audit, debug, extend, or document NemotronH, Nemotron 3 Super, Nemotron Nano hybrid Mamba+Attention+MoE, MTP, NVFP4, and VL adjacencies.
PR-backed and current-main optimization manual for Qwen3.5 in SGLang. Use when an engineer needs to recover, extend, or audit Qwen3.5 dense/MoE, Qwen3.5 FP8/NVFP4/MXFP4, MTP, GDN projection, PP, EPLB, AMD/NPU/Blackwell deployments, FP8 KV caution paths, or Qwen3.5 cookbook recipes.
PR-backed and current-main optimization manual for Qwen3-Coder and Qwen3-Coder-Next in SGLang. Use when an engineer needs to recover, extend, or audit Qwen3-Coder-480B-A35B, Qwen3-Coder-Next, tool-call parser behavior, incremental streaming tool arguments, NVFP4/FP8 loading, MoE fused configs, AMD/NPU/Blackwell recipes, or coding-agent deployment docs.
PR-backed and current-main optimization manual for Tencent Hunyuan 3 Preview in SGLang. Use when an engineer needs to audit or extend Hunyuan3 Preview cookbook recipes, BF16 MoE hardware sizing, H200/B200/B300/GB300 command generation, MTP/EAGLE flags, `hunyuan` reasoning/tool parsers, Blackwell attention backend selection, or trust-remote-code launch guidance.
PR-backed optimization manual for DeepSeek V3.1 in vLLM. Use when an engineer needs to audit, debug, extend, or document DeepSeek V3.1 parser, scale-format, DeepGEMM, and reasoning-tooling deltas layered on top of the base DeepSeek V3 runtime.
PR-backed optimization manual for DeepSeek V3.2 in vLLM. Use when an engineer needs to audit, debug, extend, or document DeepSeek V3.2 sparse-MLA / DSA runtime, indexer, tool parser, MTP fallback, and long-context decode kernels in vLLM.
PR-backed optimization manual for DeepSeek V4 in vLLM. Use when an engineer needs to audit, debug, extend, or document DeepSeek V4 current-main support in vLLM, including the model module, MTP path, tokenizer/renderer, DSML tool parser, expert-dtype handling, and BF16 persistent-topk follow-up.
PR-backed optimization manual for GLM VLM / OCR in vLLM. Use when an engineer needs to audit, debug, extend, or document GLM-4V, GLM-4.1V, GLM-OCR, GLM visual processor, MRoPE, video, and OCR MTP behavior in vLLM.
PR-backed optimization manual for GLM-4.5 / 4.5V in vLLM. Use when an engineer needs to audit, debug, extend, or document GLM-4.5 text, GLM-4.5V, GLM-4.5-Air, shared MoE routing, and tool/reasoning parser behavior in vLLM.
PR-backed optimization manual for GPT-OSS in vLLM. Use when an engineer needs to audit, debug, extend, or document OpenAI GPT-OSS MoE, MXFP4/FP8 quantization, DP/EP, reasoning parser, tool calling, and Eagle/spec decode.
PR-backed optimization manual for GLM-4.6 / 4.7 in vLLM. Use when an engineer needs to audit, debug, extend, or document GLM-4.6, GLM-4.6V, GLM-4.7, GLM-4.7-Flash, GLM-Lite, and the parser / quant / fused-MoE deltas after the 4.5 generation.
PR-backed optimization manual for MiMo-V2 / MiMo-V2-Flash / MiMo-V2.5 in vLLM. Use when an engineer needs to audit, debug, extend, or document MiMo-V2 inference-centric MoE runtime, MTP behavior, MiMo-V2.5 Pro/Omni support, and the transition from older MiMo checkpoints in vLLM.
PR-backed optimization manual for Llama 4 in vLLM. Use when an engineer needs to audit, debug, extend, or document Llama4 text and multimodal runtime, FP8/FP4 quantization, router behavior, long-context attention, and Eagle support.
PR-backed optimization manual for Kimi K2 / K2.5 / Linear / Audio / VL in vLLM. Use when an engineer needs to audit, debug, extend, or document Kimi-VL, Kimi-Linear, Kimi-K2.5, Kimi-Audio, parser aliases, and quantized MLA behavior in vLLM.
PR-backed optimization manual for Ernie4.5 / Ernie4.5-VL in vLLM. Use when an engineer needs to audit, debug, extend, or document Baidu Ernie4.5 text/VL/MoE runtime, vision rotary, and long-input stability.
PR-backed optimization manual for InternVL3.5 in vLLM. Use when an engineer needs to audit, debug, extend, or document InternVL3.5 multimodal processor, video support, ViT DP / torch.compile, and backend compatibility.
PR-backed optimization manual for Hunyuan 3 Preview in vLLM. Use when an engineer needs to audit, debug, extend, or document the adjacent Hunyuan dense / OCR / VL support in vLLM that is relevant to Hunyuan 3 Preview planning; vLLM has no dedicated `Hunyuan3Preview` mainline alias yet.
PR-backed optimization manual for Nemotron Super / Nano Hybrid in vLLM. Use when an engineer needs to audit, debug, extend, or document NemotronH, Nemotron 3 Super, Nemotron Nano hybrid Mamba+Attention+MoE, MTP, NVFP4, and VL adjacencies.
PR-backed optimization manual for Intern-S1 in vLLM. Use when an engineer needs to audit, debug, extend, or document Intern-S1 language and video-aware serving, processor integration, and tool/reasoning parser behavior.
PR-backed optimization manual for Step3.5 / Step3-VL in SGLang. Use when an engineer needs to audit, debug, extend, or document Step3.5-Flash and Step3-VL-10B serving, MTP, MoE all-reduce, tool/reasoning parser, and processor evolution.
PR-backed optimization manual for Qwen3 Coder in vLLM. Use when an engineer needs to audit, debug, extend, or document Qwen3 Coder tool parser, structured tool arguments, and coding-oriented parser behavior layered on top of the base Qwen3 runtime.
PR-backed optimization manual for Qwen3-Next in vLLM. Use when an engineer needs to audit, debug, extend, or document Qwen3-Next GDN attention, MTP, packed module naming, PP, and cross-hardware tuned MoE configuration in vLLM.
PR-backed optimization manual for Mixtral Quark / INT4-FP8 MoE in vLLM. Use when an engineer needs to audit, debug, extend, or document Mixtral MoE, expert parallelism, FP8 / ModelOpt quantization, and EPLB in vLLM, which together form the nearest equivalent to Quark INT4-FP8 Mixtral serving.
PR-backed optimization manual for Mistral Small 4 in vLLM. Use when an engineer needs to audit, debug, extend, or document Mistral Small 4, Leanstral, and closely related Mistral Large 3 / Ministral serving behavior, including multimodal and MoE execution.
Replay-first debug flow for SGLang serving problems. Use when a live or recent server shows health-check failures, latency or throughput regressions, queue growth, timeouts, distributed stalls, crash dumps, wrong outputs after deploys, or PD/EP/HiCache issues, and the job is to turn the problem into a replay plus the right next debug tool.
PR-backed optimization manual for Qwen3 Core in vLLM. Use when an engineer needs to audit, debug, extend, or document Qwen3 dense, Qwen3 MoE, embeddings/rerankers, GGUF/GPTQ/ModelOpt quant paths, and Eagle3 speculative decoding in vLLM.
PR-backed optimization manual for Qwen3.6 in vLLM. Use when an engineer needs to audit, debug, extend, or document Qwen3.6 support in vLLM. Tracking note only: current vLLM mainline does not expose a dedicated Qwen3.6 architecture alias.
PR-backed optimization manual for Qwen3.5 in vLLM. Use when an engineer needs to audit, debug, extend, or document Qwen3.5 dense / MoE / GDN runtime, MTP, FP8 and NVFP4 quantization, LoRA, and Eagle3 in vLLM.
End-to-end SGLang SOTA performance workflow. Use when a user names an LLM model and wants SGLang to match or beat the best observed vLLM and TensorRT-LLM serving performance by searching each framework's best deployment command, benchmarking them fairly, profiling SGLang if it is slower, identifying kernel/overlap/fusion bottlenecks, patching SGLang code, and revalidating with real model runs.
PR-backed and current-main optimization manual for DeepSeek-V4 in SGLang. Use when an engineer needs to audit or extend DeepSeek-V4 Flash/Pro serving recipes, FP4-vs-FP8 checkpoint selection, H200/B200/GB300 launch commands, DeepEP dispatch-token budgets, context-parallel and PD-disaggregation recipes, MTP/EAGLE settings, or DeepSeek-V4 parser flags.
PR-backed optimization manual for Mixtral-8x7B with SGLang's AMD-only quark_int4fp8_moe online MoE quantization. Use when an engineer needs to audit or extend Mixtral AMD quantization, online INT4-to-FP8 MoE loading, AITER fused-MoE execution, or the registered GSM8K regression test.
PR-diff-backed optimization manual for Qwen3-Next, Qwen3-Next MTP, and Qwen3-Coder-Next shared hybrid paths in SGLang. Use when an engineer needs to audit, extend, or debug Qwen3-Next GDN/Mamba/RadixLinearAttention, MTP/EAGLE/NEXTN, FP8/NVFP4/ModelOpt loading, CPU offload, FlashInfer/CuTe/Gluon GDN kernels, AMD/NPU/Blackwell paths, mixed-chunk extra_buffer behavior, or Qwen3-Next cookbook deployment flags.
PR-backed optimization manual for Qwen2.5-VL / Qwen3-VL / Qwen3-Omni / Qwen3-ASR in vLLM. Use when an engineer needs to audit, debug, extend, or document the multimodal Qwen runtime in vLLM, especially Qwen2.5-VL attention hot paths, Qwen3-VL video and interleaved MRoPE handling, Qwen3-Omni thinker audio-in-video logic, and Qwen3-ASR / realtime speech support.
PR-backed and current-main optimization manual for DeepSeek V3.1 and DeepSeek-V3.1-Terminus in SGLang. Use when an engineer needs to recover, extend, or audit DeepSeek V3.1 tool calling, thinking mode, chat templates, streaming parser behavior, loading fixes, MTP validation, fused MoE configs, or backend-specific tests.
PR-backed optimization manual for Llama 4 in SGLang. Use when an engineer needs to audit, debug, extend, or document Llama4 text and multimodal runtime, FP8/FP4 quantization, router behavior, long-context attention, and Eagle support.
PR-backed optimization manual for Mistral Small 4 in SGLang. Use when an engineer needs to audit, debug, extend, or document Mistral Small 4, Leanstral, and closely related Mistral Large 3 / Ministral serving behavior, including multimodal and EAGLE paths.
PR-diff-backed optimization manual for Qwen2.5-VL, Qwen3-VL, Qwen3-VL-MoE, Qwen3-Omni, Qwen3-ASR, and Qwen3.5 multimodal paths in SGLang. Use when an engineer needs to audit, debug, extend, or document multimodal processors, ViT DP/PP/chunk/cache, mRoPE, DeepStack, EAGLE3, LoRA, audio encoder, streaming ASR, encoder disaggregation, AMD/NPU/CPU support, or Qwen VLM cookbook deployment recipes.
PR-diff-backed optimization manual for Qwen3 dense and Qwen3 MoE in SGLang. Use when an engineer needs to recover, extend, audit, or write documentation for Qwen3/Qwen3-30B/Qwen3-235B-A22B, FP8/NVFP4/MXFP4/W4A4, fused QK-norm/RoPE/KV-store paths, FlashInfer TRTLLM-GEN-MoE, DeepEP/EPLB/TBO/context parallel, EAGLE3, LoRA, PP/tied embeddings, Ascend NPU/XPU/MLX support, or Qwen3 reasoning/tool-parser behavior.
PR-backed optimization manual for DeepSeek V3 / R1 in vLLM. Use when an engineer needs to audit, debug, extend, or document DeepSeek V3 and DeepSeek R1 MLA, MoE, packed-module loading, LoRA, MTP/Eagle, and quantized ROCm/CUDA validation paths.
PR-backed optimization manual for Gemma 4 in vLLM. Use when an engineer needs to audit, debug, extend, or document Gemma 4 text, MoE, multimodal, reasoning, tool use, and quantized MoE serving.
PR-backed optimization manual for GLM-5 / GLM-5.1 in vLLM. Use when an engineer needs to audit, debug, extend, or document the current partial GLM-5 bring-up in vLLM, especially the `GlmMoeDsaForCausalLM` aliasing into the DeepSeek-V2/V3 runtime, rope interleave handling, and GLM-5 MTP correctness.
PR-backed and current-main optimization manual for DeepSeek V3.2, V3.2-Exp, V3.2-Speciale, NVFP4, and MXFP4 in SGLang. Use when an engineer needs to recover, extend, or audit DSA/NSA sparse attention, NSA indexer, FP8/BF16/FP4 KV cache, context parallel, MTP, IndexCache, DSML tool calling, V3.2 docs/tests, AMD/NPU/Blackwell backends, or open NSA/DSA PRs.
PR-backed optimization manual for MiniMax M1 / M2 / VL in vLLM. Use when an engineer needs to audit, debug, extend, or document MiniMaxText01, MiniMax-M1, MiniMax-M2, MiniMax-VL-01, LoRA, and Eagle3 support in vLLM.
PR-backed and current-main optimization manual for Qwen3.6 in SGLang. Use when an engineer needs to recover, extend, or audit Qwen3.6-35B-A3B/27B dense deployment, hybrid Gated Delta Network behavior, multimodal inputs, thinking preservation, Qwen3 reasoning plus Qwen3-Coder tool parser, MTP, Mamba scheduler strategy, FP8/BF16 commands, CPU offload, or cookbook parity.
SSH into host `h100_sglang`, enter Docker container `sglang_bbuf`, work in `/data/bbuf/repos/sglang`, and use the ready H100 remote environment for SGLang **diffusion** development and validation. Use when a task needs diffusion model smoke tests, Triton/CUDA kernel validation, torch.compile diffusion checks, or a safe remote copy for diffusion-specific SGLang changes.
SSH into host `h100_sglang`, enter Docker container `sglang_bbuf`, work in `/sgl-workspace/sglang`, and use the ready H100 remote environment for SGLang development and validation. Use when a task needs remote CUDA work, GPU-backed smoke tests, diffusion checks, or a safe remote copy instead of local-only execution.
PR-backed and current-main optimization manual for Qwen-Image and Qwen-Image-Edit in SGLang Diffusion. Use when Codex needs to recover, extend, or audit diffusion transformer loading, layer serving, CUDA graph, TeaCache, IMA, ModelOpt FP8, AMD kernels, Qwen-Image detectors, or cookbook diffusion recipes.
PR-backed optimization manual for Z-Image-Turbo in vLLM. Use when Codex needs to audit, debug, extend, or document Z-Image-Turbo diffusion generation. Tracking note only: Z-Image-Turbo is outside vLLM mainline today.
PR-backed optimization manual for Qwen-Image in vLLM. Use when Codex needs to audit, debug, extend, or document Qwen-Image diffusion generation. Tracking note only: Qwen-Image is outside the current vLLM runtime surface.
PR-backed and current-main optimization manual for LTX-2.3 High Quality pipeline in SGLang Diffusion. Use when Codex needs to audit or extend LTX2TwoStageHQPipeline, LTX-2.3 two-stage LoRA switching, HQ sigma/timestep semantics, res2s RK2 refinement, audio/video denoising, Gemma prompt trimming, low-VRAM device snapshots, or LTX-2.3 HQ sampling defaults.
PR-backed optimization manual for LTX 2.3 HQ in vLLM. Use when Codex needs to audit, debug, extend, or document LTX 2.3 HQ diffusion/video-style models. Tracking note only: these models are outside current vLLM autoregressive runtime coverage.
PR-backed optimization manual for Z-Image and Z-Image-Turbo in SGLang Diffusion. Use when Codex needs to audit or extend Z-Image registry entries, Turbo/base sampling defaults, CFG normalization, sequence-parallel latent sharding, Cache-DiT/TeaCache behavior, LoRA/FP8 coverage, or AMD nightly validation.
Compact SGLang torch-profiler triage skill. Use when Codex should inspect an existing `trace.json(.gz)` or profile directory, trigger `sglang.profiler` against a live server, and return one compact report with kernel, overlap-opportunity, and fuse-pattern tables. Single-trace triage is enough for quick diagnosis; mapping+formal two-trace triage gives stronger overlap conclusions.