BBuf

81 verified skills · 17,015 total stars

sglang-humanize-review

Perform SGLang code review in the style of human maintainers by consulting the full non-agent PR review episode corpus from project start through the latest refresh (June 2026), including inline review threads, top-level PR comments, review submissions, original multilingual text, and multi-round discussions. Use when reviewing SGLang PRs, diffs, patches, or local changes for correctness, tests, performance, GPU/runtime risks, API compatibility, and maintainability.

development · 531

model-pr-history-knowledge

Use when an SGLang, vLLM, or TensorRT-LLM serving/model optimization task needs prior model-family PR evidence. Query and read the PR-driven history docs under model-pr-optimization-history before choosing source paths, fast paths, kernel/fusion ideas, regression risks, or validation lanes.

documentation · 531

llm-pipeline-analysis

Inspect LLM torch profiler traces at forward-pass, layer, and kernel level. Use when you need layer timings, anchor-kernel boundaries, representative kernel flows, or Perfetto time ranges.

devops · 423

sglang-sota-humanize-loop

Run an autonomous Humanize-governed SGLang SOTA performance loop for one LLM model: first perform the fixed fair SGLang/vLLM/TensorRT-LLM deployment search and benchmark, then start one RLCR loop that repeatedly decides the gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches SGLang code, optionally uses ncu-report-skill for kernel evidence, and revalidates until SGLang matches or beats the best observed framework under the same workload and SLA.

development · 423

vllm-sota-humanize-loop

Run an autonomous Humanize-governed vLLM SOTA performance loop for one LLM model: first perform the fixed fair vLLM/SGLang/TensorRT-LLM deployment search and benchmark, then start one RLCR loop that repeatedly decides the gap, profiles the current bottleneck, runs layer/kernel pipeline analysis, patches vLLM code, optionally uses ncu-report-skill for kernel evidence, and revalidates until vLLM matches or beats the best observed framework under the same workload and SLA.

development · 423

model-compute-simulation

Build an operator-level compute template for an LLM and estimate FLOPs/MFU for a serving shape. Use when you need tensor shapes, per-op FLOPs, kernel-to-op MFU mapping, or parallelism what-if analysis.

development · 313

llm-serving-capacity-planner

Parse SGLang/vLLM startup logs to explain GPU memory use and request capacity. Use for KV cache budget, mem-fraction-static comparisons, OOM triage, and max-concurrency estimates.

testing · 313

model-pr-diff-dossier

Use when creating or revising model PR optimization history documents for SGLang, vLLM, or another serving framework that cite GitHub PRs. Requires manual, per-PR source-diff review and documentation of motivation, key implementation approach, most important code excerpts, reviewed files, and validation implications instead of generated or one-line summaries.

development · 295

llm-torch-profiler-analysis

Unified LLM torch-profiler triage skill for `sglang`, `vllm`, and `TensorRT-LLM`. Use it to inspect an existing `trace.json(.gz)` or profile directory, or to drive live profiling against a running server and return one three-table report with kernel, overlap-opportunity, and fuse-pattern tables.

data-ai · 295

llm-serving-auto-benchmark

Framework-independent LLM serving benchmark skill for comparing SGLang, vLLM, TensorRT-LLM, or another serving framework. Use when a user wants to find the best deployment command for one model across multiple serving frameworks under the same workload, GPU budget, and latency SLA.

development · 295

gpu-kernel-ako4all

Use when developing, optimizing, debugging, or porting AI-infra GPU kernels through an AKO4ALL-centered loop, including Triton, CUDA C++/PTX, CUTLASS/CuTe C++, and CuTe DSL kernels; also use when setting up a sibling AKO4ALL repo, creating microbench harnesses, profiling with nsys/ncu, and validating kernel changes against real operator or model benchmarks. Do not trigger on simple Triton or CUDA API lookups; this skill is for full optimization or rewrite tasks where AKO discipline pays off.

development · 276

model-architecture-diagram

Return public original model architecture diagrams for user-specified LLM, VLM, MoE, diffusion, OCR, and SGLang/sgl-cookbook model families. Use when the user asks for a model structure chart, architecture diagram, or rendered image link for a specific model such as DeepSeek, GLM, Qwen, Kimi, MiniMax, Step, Hunyuan, or Qwen3-VL.

testing · 202

vllm-moss-vl-optimization

PR-backed optimization manual for MOSS-VL in vLLM. Use when an engineer needs to audit, debug, extend, or document MOSS-VL support. This is a tracking note: MOSS-VL does not have a native vLLM mainline model module today.

vllm-step35-optimization

sglang-gpt-oss-optimization

sglang-deepseek-v3-r1-optimization

sglang-glm-vlm-ocr-optimization

sglang-glm45-optimization

sglang-gemma4-optimization

sglang-ernie45-optimization

sglang-intern-s1-optimization

sglang-glm5-glm51-optimization

sglang-internvl35-optimization

sglang-kimi-k2-k25-optimization

sglang-moss-vl-optimization

sglang-mimo-v2-flash-optimization

sglang-minimax-m2-series-optimization

sglang-glm46-glm47-optimization

sglang-nemotron-super-optimization

sglang-qwen35-optimization

sglang-qwen3-coder-optimization

sglang-hunyuan3-preview-optimization

vllm-deepseek-v31-optimization

vllm-deepseek-v32-optimization

vllm-deepseek-v4-optimization

vllm-glm-vlm-ocr-optimization

vllm-glm45-optimization

vllm-gpt-oss-optimization

vllm-glm46-glm47-optimization

vllm-mimo-v2-flash-optimization

vllm-llama4-optimization

vllm-kimi-optimization

vllm-ernie45-optimization

vllm-internvl35-optimization

vllm-hunyuan3-preview-optimization

vllm-nemotron-super-optimization

vllm-intern-s1-optimization

sglang-step35-optimization

vllm-qwen3-coder-optimization

vllm-qwen3-next-optimization

vllm-mixtral-quark-int4fp8-moe-optimization

vllm-mistral-small-4-optimization

sglang-prod-incident-triage

vllm-qwen3-core-optimization

vllm-qwen36-optimization

vllm-qwen35-optimization

sglang-sota-performance

sglang-deepseek-v4-optimization

sglang-mixtral-quark-int4fp8-moe-optimization

sglang-qwen3-next-optimization

vllm-qwen-vlm-omni-asr-optimization

sglang-deepseek-v31-optimization

sglang-llama4-optimization

sglang-mistral-small-4-optimization

sglang-qwen-vlm-omni-asr-optimization

sglang-qwen3-core-optimization

vllm-deepseek-v3-r1-optimization

vllm-gemma4-optimization

vllm-glm5-glm51-optimization

sglang-deepseek-v32-optimization

vllm-minimax-optimization

sglang-qwen36-optimization

h100-sglang-diffusion

h100

sglang-qwen-image-optimization

vllm-z-image-turbo-optimization

vllm-qwen-image-optimization

sglang-ltx23-hq-optimization

vllm-ltx23-hq-optimization