skills/doppler-perf/SKILL.md
Diagnose and improve Doppler model/path performance with baselines, profiling traces, and controlled runtime/code experiments. (project)
Use this skill when Doppler is slower than expected on decode, prefill, TTFT, or model-load-sensitive warm UX, and you need to diagnose or change the hot path for a specific model or runtime path. Use doppler-bench when the goal is reproducible benchmark evidence, compare-engine reporting, or vendor-registry coverage rather than tuning.
Read these before non-trivial performance, profiling, or methodology changes:

- docs/style/general-style-guide.md
- docs/style/javascript-style-guide.md
- docs/style/config-style-guide.md
- docs/style/harness-style-guide.md
- docs/style/benchmark-style-guide.md

Also read:

- docs/style/wgsl-style-guide.md for shader changes
- docs/style/command-interface-design-guide.md when changing bench or debug command behavior

When performance work requires additive implementation changes, also open:

- docs/developer-guides/README.md

Common routes:

- docs/developer-guides/06-kernel-path-config.md
- docs/developer-guides/11-wgsl-kernel.md
- docs/developer-guides/13-attention-variant.md
- docs/developer-guides/15-kvcache-layout.md
- docs/developer-guides/12-command-surface.md

runtime profiles + workload contracts).

# Start from one clean benchmark baseline
npm run cli -- profiles --json
npm run bench -- --config '{"request":{"modelId":"MODEL_ID","cacheMode":"warm"},"run":{"surface":"browser","bench":{"save":true}}}' --runtime-profile profiles/throughput --json
Read from output:

- model load
- decode tok/s
- prompt tok/s (TTFT)
- TTFT

If you need compare-engine or publication-grade evidence at this stage, switch to doppler-bench instead of expanding the squeeze loop.
Before changing kernels:

- decodeMode, batchGuardReason, speculation state, decodeRecordMs, decodeSubmitWaitMs, decodeReadbackWaitMs, singleTokenReadbackWaitMs, and singleTokenOrchestrationMs.

A fast kernel may already exist but the model's manifest pins a slower variant. This is the highest-ROI check before writing anything new. It has been seen twice: once on Gemma 4 E2B global-attention layers (iter 24–25), and once on Qwen 3.5 0.8B full-attention layers + matmuls (2026-04-18). In both cases the fast kernel had been sitting in the repo for months while the manifest/conversion config stayed pinned to older, slower choices.
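A small script can make the pinned-variant check repeatable. The sketch below scans captured log text for the `attention variant=` and `matmul variant=` markers and flags the red-flag variant names listed in this section. The helper names, and the assumption that the variant name directly follows `variant=` as a run of word characters, are illustrative and not part of Doppler; adjust the pattern to the real KernelSelect line format.

```javascript
// Red-flag variant names taken from this section.
// A trailing "*" marks a prefix match (e.g. prefill_small*).
const RED_FLAG_VARIANTS = [
  'prefill_streaming_f16kv',
  'q4_fused_batched_multicol_shared',
  'prefill_small*',
];

// Hypothetical helper: pull "<op> variant=<name>" pairs out of log text.
// ASSUMPTION: the variant name is the word-character run after "variant=".
function findVariants(logText) {
  const pattern = /\b(attention|matmul) variant=([A-Za-z0-9_]+)/g;
  const hits = [];
  for (const match of logText.matchAll(pattern)) {
    hits.push({ op: match[1], variant: match[2] });
  }
  return hits;
}

// True when the variant equals a red-flag name or matches a "*" prefix entry.
function isRedFlag(variant) {
  return RED_FLAG_VARIANTS.some((entry) =>
    entry.endsWith('*')
      ? variant.startsWith(entry.slice(0, -1))
      : variant === entry
  );
}

// Return each distinct red-flag (op, variant) pair once, in first-seen order.
function redFlagVariants(logText) {
  const seen = new Set();
  const flagged = [];
  for (const hit of findVariants(logText)) {
    const key = `${hit.op}:${hit.variant}`;
    if (isRedFlag(hit.variant) && !seen.has(key)) {
      seen.add(key);
      flagged.push(hit);
    }
  }
  return flagged;
}
```

Feed it the output of one info-level run. A non-empty result is only a prompt to check whether a faster variant exists for that head_dim, M, and dtype; it does not prove the manifest is wrong.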
How to detect (two steps):

1. runtime.shared.debug.profiler.enabled: true and runtime.shared.tooling.intent: investigate. Example: profiles/gemma4-e2b-prefill-profile.json, profiles/qwen-3-5-0-8b-prefill-profile.json. Run via doppler debug; per-chunk log.warn('Profile', ...) reports list each kernel's wall time.
2. defaultLogLevel to info on one run to capture KernelSelect variant lines. Grep for attention variant= and matmul variant=.

Red-flag variants (immediate suspicion at prefill):
- prefill_streaming_f16kv: one-thread-per-workgroup fallback. If fast variants exist for this head_dim, the manifest is routing past them.
- q4_fused_batched_multicol_shared: older Q4K matmul, many small workgroups. WideTile (q4_fused_widetile / q4_fused_widetile_f16) is usually 2×+ faster on M≥4.
- prefill_small* on a model whose architecture.headDim has a dedicated *_head{N}_f16kv.wgsl (e.g. 256 or 512).
- _f16kv attention means f32 activations with f16 KV. The f16 lane should route attention to _f16 kernels and Q4 projections to *_f16a kernels, then prove correctness with debug/verify output before perf claims.

How to fix (manifest-level, no new code):
- src/config/conversion/<family>/<model-id>.json) AND the runtime manifest (models/local/<model-id>/manifest.json). Digest comes from src/config/kernels/kernel-ref-digests.js (the normalized content hash, NOT raw sha256sum of the .wgsl file).
- attn_stream / q4_prefill / similar label in the execution-graph steps to the new ref.
- general-style-guide.md §Contracts first).

Recipe, concise:

1. *-prefill-profile.json with profiler.enabled: true + longer prompt (~80 tok).
2. node src/cli/doppler-cli.js debug --config '{"request":{"modelId":"<id>","runtimeProfile":"profiles/<id>-prefill-profile"}}' --surface browser | grep -A1 '"module": "Profile"'
3. attention, fused_ffn, matmul:*:L<i> lines. Top 1-3 by wall time.
4. info, re-run, grep "matmul variant=\|attention variant=", then confirm which kernel is actually firing.
5. src/gpu/kernels/ and registry. If a fast variant matches the op's head_dim/M/dtype, the fix is a manifest swap, not a new kernel.
6. doppler verify. Measure with doppler bench.

When this check does NOT apply:
Session receipts to crib from:

- project_gemma4_iter23_attention_is_wall.md → project_gemma4_iter25_head512_shipped.md. Full-attn global layers were streaming; routed to (new) attention_head512_f16kv.wgsl for 1.74× prefill at prefill=81.
- project_qwen35_08b_attn_and_widetile_shipped_2026_04_18.md. No new kernel needed: attention_head256_f16kv.wgsl + fused_matmul_q4_widetile.wgsl already existed; Qwen manifest was just pinned to attn_stream + q4_fused_batched_multicol_shared. +57.5% prefill at prefill=80.

# Single-token-style control
npm run bench -- \
--config '{"request":{"modelId":"MODEL_ID","cacheMode":"warm"},"run":{"surface":"browser","bench":{"save":true}}}' \
--runtime-config '{"inference":{"generation":{"maxTokens":128},"sampling":{"temperature":0,"topK":1,"topP":1,"repetitionPenalty":1,"greedyThreshold":0},"session":{"decodeLoop":{"batchSize":1,"stopCheckMode":"batch","readbackInterval":1,"ringTokens":1,"ringStop":1,"ringStaging":1,"disableCommandBatching":false}}}}' \
--json
# Moderate batched candidate
npm run bench -- \
--config '{"request":{"modelId":"MODEL_ID","cacheMode":"warm"},"run":{"surface":"browser","bench":{"save":true}}}' \
--runtime-config '{"inference":{"generation":{"maxTokens":128},"sampling":{"temperature":0,"topK":1,"topP":1,"repetitionPenalty":1,"greedyThreshold":0},"session":{"decodeLoop":{"batchSize":4,"stopCheckMode":"batch","readbackInterval":4,"ringTokens":1,"ringStop":1,"ringStaging":1,"disableCommandBatching":false}}}}' \
--json
# Higher-throughput candidate
npm run bench -- \
--config '{"request":{"modelId":"MODEL_ID","cacheMode":"warm"},"run":{"surface":"browser","bench":{"save":true}}}' \
--runtime-config '{"inference":{"generation":{"maxTokens":128},"sampling":{"temperature":0,"topK":1,"topP":1,"repetitionPenalty":1,"greedyThreshold":0},"session":{"decodeLoop":{"batchSize":8,"stopCheckMode":"batch","readbackInterval":8,"ringTokens":1,"ringStop":1,"ringStaging":1,"disableCommandBatching":false}}}}' \
--json
# Trace-heavy debug run
npm run debug -- --config '{"request":{"modelId":"MODEL_ID"},"run":{"surface":"auto"}}' --runtime-profile profiles/verbose-trace --json
# Logit-focused browser investigation
npm run debug -- --config '{"request":{"modelId":"MODEL_ID","runtimeProfile":"diagnostics/debug-logits"},"run":{"surface":"browser","browser":{"channel":"chrome","console":true}}}' --json
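The three decode-loop bench runs above (batchSize 1, 4, 8) differ only in `batchSize` and `readbackInterval`, so generating them avoids hand-editing long JSON strings. The following is a minimal sketch, assuming the sweep keeps those two values equal as the commands above do; the function names are hypothetical and every config field is copied from those commands.

```javascript
// Build the --runtime-config payload for one decode-loop candidate.
// Only batchSize and readbackInterval vary between candidates.
function decodeLoopRuntimeConfig(batchSize) {
  return {
    inference: {
      generation: { maxTokens: 128 },
      sampling: {
        temperature: 0,
        topK: 1,
        topP: 1,
        repetitionPenalty: 1,
        greedyThreshold: 0,
      },
      session: {
        decodeLoop: {
          batchSize,
          stopCheckMode: 'batch',
          readbackInterval: batchSize,
          ringTokens: 1,
          ringStop: 1,
          ringStaging: 1,
          disableCommandBatching: false,
        },
      },
    },
  };
}

// Serialize a value as JSON and wrap it in single quotes for a POSIX shell.
function shellQuote(value) {
  return "'" + JSON.stringify(value).replace(/'/g, "'\\''") + "'";
}

// Assemble one bench command line for the given model and batch size.
function benchCommand(modelId, batchSize) {
  const config = {
    request: { modelId, cacheMode: 'warm' },
    run: { surface: 'browser', bench: { save: true } },
  };
  return [
    'npm run bench --',
    `--config ${shellQuote(config)}`,
    `--runtime-config ${shellQuote(decodeLoopRuntimeConfig(batchSize))}`,
    '--json',
  ].join(' ');
}

// Print the control run plus the two batched candidates.
const sweep = [1, 4, 8].map((batchSize) => benchCommand('MODEL_ID', batchSize));
console.log(sweep.join('\n'));
```

Keeping sampling and ring settings fixed in one place makes it harder to change a second variable by accident between runs.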
Rules:

- bench is calibrate-only; do not override its intent in runtime config.
- debug is the investigate surface for traces, layer probes, and diagnostics.

Correctness gate: If a performance change (kernel swap, transform, materialization path) produces incorrect output (garbled text, numerical divergence, repeated tokens), stop the perf workflow and invoke doppler-debug immediately. Do not continue tuning a path that produces wrong answers. The debug ladder must confirm correctness before perf measurement resumes.
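As a cheap first signal for the correctness gate, two greedy runs (temperature 0, topK 1) of the same prompt can be compared token by token before and after a change. The sketch below is only a tripwire under that assumption and does not replace doppler-debug; the function names and the `maxRepeatRun` threshold are hypothetical, and the token arrays are whatever token IDs you extract from your own run output.

```javascript
// Index of the first position where two token-id sequences differ.
// Returns -1 when both sequences are identical in content and length.
function firstDivergence(baselineTokens, candidateTokens) {
  const shared = Math.min(baselineTokens.length, candidateTokens.length);
  for (let i = 0; i < shared; i += 1) {
    if (baselineTokens[i] !== candidateTokens[i]) return i;
  }
  return baselineTokens.length === candidateTokens.length ? -1 : shared;
}

// Length of the longest run of one token repeated back to back.
function longestRepeatRun(tokens) {
  let best = 0;
  let run = 0;
  for (let i = 0; i < tokens.length; i += 1) {
    run = i > 0 && tokens[i] === tokens[i - 1] ? run + 1 : 1;
    if (run > best) best = run;
  }
  return best;
}

// True when the candidate diverges from the baseline or repeats one token
// more than maxRepeatRun times in a row.
// ASSUMPTION: maxRepeatRun = 8 is an arbitrary illustrative threshold.
function needsDebugCheck(baselineTokens, candidateTokens, maxRepeatRun = 8) {
  return (
    firstDivergence(baselineTokens, candidateTokens) !== -1 ||
    longestRepeatRun(candidateTokens) > maxRepeatRun
  );
}
```

A `true` result means perf measurement should pause until the debug ladder has confirmed or cleared the change. A `false` result is not proof of correctness, since a short prompt may simply not exercise the changed path.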
Context budget: Investigation and diagnosis should consume no more than 30% of available context. If you have not formed a testable hypothesis after reading 5–8 files, stop reading code and run a diagnostic probe or write a minimal reproduction instead. Exhaustive static code tracing without running a test is an anti-pattern.
Common patterns:
Priority code hotspots:

- src/inference/pipelines/text/logits/index.js
- src/inference/pipelines/text/generator.js
- src/inference/browser-harness.js
- src/memory/buffer-pool.js

# Re-run the clean benchmark baseline after each material change
npm run bench -- --config '{"request":{"modelId":"MODEL_ID","cacheMode":"warm"},"run":{"surface":"browser","bench":{"save":true}}}' --runtime-profile profiles/throughput --json
When the final evidence will be published, compared across engines, or used to update vendor-facing numbers, hand off to doppler-bench instead of extending the squeeze loop.
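Between re-runs, it helps to reduce the baseline and after results to the same few ratios each time. The sketch below compares two metric records and reports the ratio and percent change per metric. The metric key names are hypothetical placeholders, not the field names of the saved bench JSON; map them to whatever your result files actually contain.

```javascript
// Hypothetical metric keys and whether a larger value is an improvement.
// ASSUMPTION: these names do not come from the Doppler bench schema.
const HIGHER_IS_BETTER = {
  decodeTokPerSec: true,
  promptTokPerSec: true,
  ttftMs: false,
  modelLoadMs: false,
};

// Compare two runs metric by metric. Metrics that are missing, non-numeric,
// or zero in the baseline are skipped rather than guessed.
function compareRuns(baseline, after) {
  const rows = [];
  for (const [metric, higherIsBetter] of Object.entries(HIGHER_IS_BETTER)) {
    const before = baseline[metric];
    const now = after[metric];
    if (typeof before !== 'number' || typeof now !== 'number' || before === 0) {
      continue;
    }
    const ratio = now / before;
    rows.push({
      metric,
      before,
      after: now,
      ratio,
      percentChange: (ratio - 1) * 100,
      improved: higherIsBetter ? now > before : now < before,
    });
  }
  return rows;
}
```

A ratio of 1.74 on a throughput metric corresponds to a +74% change, and +57.5% corresponds to a ratio of 1.575, so record one form consistently across iterations.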
For each perf iteration, capture:

- baseline command + result file
- change (runtime-only or code patch)
- after command + result file
- model load, decode tok/s, prompt tok/s (TTFT), TTFT

- docs/agents/benchmark-protocol.md: vendor benchmark registry and update checklist
- docs/agents/hardware-notes.md: GPU memory assumptions

- doppler-bench for publication-grade benchmark execution, compare-engine evidence, and vendor normalization.
- doppler-debug for correctness checks while tuning performance.
- doppler-kernel-reviewer for WGSL/JS kernel quality review on perf patches.