skills/dart-diffusion-inspired-speculative-decoding/SKILL.md
Set up and use DART (Diffusion-Inspired Speculative Decoding) for fast LLM inference. DART replaces autoregressive draft models with parallel masked-position prediction using a single transformer layer, combined with N-gram-enforced tree pruning. Triggers: 'speed up LLM inference with DART', 'set up speculative decoding with DART', 'integrate DART for faster generation', 'configure DART draft model', 'compare DART vs EAGLE3 speculative decoding', 'optimize LLM serving latency with parallel drafting'.
npx skillsauth add ndpvt-web/arxiv-claude-skills dart-diffusion-inspired-speculative-decoding
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
DART (Diffusion-Inspired Speculative Decoding) enables Claude to help users set up, configure, and integrate a speculative decoding framework that achieves 2x-3.4x wall-clock speedup over standard autoregressive LLM inference. Unlike EAGLE3, which drafts tokens autoregressively (creating a bottleneck in the draft stage itself), DART predicts logits for multiple future masked positions in parallel through a single forward pass of a lightweight transformer layer, then constructs high-quality draft token trees using an N-gram-enforced pruning algorithm implemented in C++.
fvliang/DART GitHub repository, including installation, model download, and configuration
The Drafting Bottleneck Problem. Standard speculative decoding uses a small "draft" model to propose candidate tokens, which the larger "target" model then verifies in parallel. Methods like EAGLE3 improve draft accuracy but still generate candidates autoregressively -- each draft token depends on the previous one, requiring multiple sequential forward passes through the draft model. This makes the drafting stage itself a latency bottleneck, especially as draft sequence length grows.
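The draft-then-verify loop described above can be sketched in a few lines. This is a minimal illustration of greedy acceptance only, not DART's or EAGLE3's code: `verify_draft` and `toy_target` are hypothetical names, and a real system scores every draft position in one batched target forward pass rather than calling the target once per token.

```python
from typing import Callable, List, Sequence


def verify_draft(
    prefix: Sequence[int],
    draft: Sequence[int],
    target_next: Callable[[Sequence[int]], int],
) -> List[int]:
    """Keep draft tokens while they match the target's greedy choice.

    `target_next` is a stand-in for the target model (assumption): it
    returns the greedy next token for a given context.
    """
    accepted: List[int] = []
    context = list(prefix)
    for token in draft:
        expected = target_next(context)
        if token != expected:
            # First mismatch: emit the target's own token and stop.
            accepted.append(expected)
            return accepted
        accepted.append(token)
        context.append(token)
    # Every draft token matched, so the target contributes one extra token.
    accepted.append(target_next(context))
    return accepted


def toy_target(context: Sequence[int]) -> int:
    # Toy "target model": the next token is always the last token plus one.
    return context[-1] + 1


print(verify_draft([1, 2], [3, 4, 9], toy_target))  # [3, 4, 5]
```

Two of the three draft tokens are accepted and the mismatch is replaced by the target's token, so one verification step yields three tokens instead of one.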
DART's Parallel Drafting. Inspired by diffusion-based language models (dLLMs), DART eliminates autoregressive rollouts in the draft model entirely. It takes hidden states from an intermediate layer of the target model and feeds them into a single lightweight transformer layer that predicts logits for multiple future token positions simultaneously in one forward pass. This is analogous to how diffusion models denoise multiple positions in parallel -- the draft model "fills in" masked future positions all at once rather than left-to-right. The result is dramatically lower drafting latency: one forward pass through one transformer layer versus N sequential passes through a multi-layer draft model.
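To make the "one pass versus N sequential passes" comparison concrete, the sketch below counts dependent layer evaluations only. It is a rough illustration under stated assumptions: the function names are hypothetical, and the count ignores FLOPs, memory traffic, and kernel launch costs.

```python
def autoregressive_draft_steps(num_draft_tokens: int, draft_layers: int) -> int:
    """Sequential layer evaluations when each draft token waits for the previous one."""
    return num_draft_tokens * draft_layers


def parallel_draft_steps(num_draft_tokens: int) -> int:
    """One pass through one layer, however many positions are predicted."""
    return 1


for n in (2, 4, 8):
    ar = autoregressive_draft_steps(n, draft_layers=1)
    par = parallel_draft_steps(n)
    print(f"{n} draft tokens: autoregressive={ar} sequential steps, parallel={par}")
```

The sequential count grows linearly with draft length even for a single-layer autoregressive drafter, while the parallel count stays constant.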
N-gram-Enforced Tree Pruning. Raw parallel predictions lack the sequential coherence of autoregressive generation. DART compensates with a C++-implemented tree pruning algorithm that takes the top-K candidates at each position and filters them using N-gram frequency statistics from a prebuilt N-gram model. This enforces semantic continuity -- candidate branches that form implausible N-gram sequences are pruned, producing compact, high-quality draft trees that the target model can verify efficiently. The combination of cheap parallel drafting + smart tree construction yields 30% higher throughput than EAGLE3 on average, with up to 65% improvement on code-centric workloads.
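The pruning idea can be illustrated with a toy bigram filter. This is a simplified sketch, not the C++ algorithm from the repository: `build_pruned_paths`, the bigram count table, and the `min_count` and `max_paths` parameters are all assumptions made for illustration, and the sketch returns only the deepest surviving paths rather than a full tree.

```python
from typing import Dict, List, Tuple


def build_pruned_paths(
    last_token: str,
    candidates: List[List[str]],
    bigram_counts: Dict[Tuple[str, str], int],
    min_count: int = 1,
    max_paths: int = 8,
) -> List[List[str]]:
    """Expand per-position top-K candidates into paths, dropping any branch
    whose newest (previous, current) pair is rarer than `min_count`."""
    paths: List[List[str]] = [[]]
    for position_candidates in candidates:
        extended: List[List[str]] = []
        for path in paths:
            prev = path[-1] if path else last_token
            for tok in position_candidates:
                if bigram_counts.get((prev, tok), 0) >= min_count:
                    extended.append(path + [tok])
        if not extended:
            break  # no plausible continuation; keep the shorter paths
        paths = extended[:max_paths]  # candidates are assumed to be rank-ordered
    return [p for p in paths if p]


counts = {("def", "quick"): 5, ("def", "fast"): 2, ("quick", "sort"): 9, ("sort", "("): 7}
top_k_per_position = [["quick", "fast"], ["sort", "("], ["(", ":"]]
print(build_pruned_paths("def", top_k_per_position, counts))
# [['quick', 'sort', '(']]
```

Of the eight possible three-token combinations, only one forms a chain of observed bigrams, so the verifier receives a much smaller candidate set.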
Install DART and dependencies. Clone the repository and use uv for dependency management:
git clone https://github.com/fvliang/DART.git
cd DART
curl -LsSf https://astral.sh/uv/install.sh | sh
uv sync
uv pip install -e .
Select the appropriate model trio. DART requires three components: a base model, a DART draft head, and an N-gram model. Match the DART head to the base model size:
| Base Model | DART Head | N-gram Model |
|---|---|---|
| Qwen/Qwen3-1.7B | fvliang/qwen1.7b-dart | fvliang/dart-qwen3-ngram |
| Qwen/Qwen3-4B | fvliang/qwen4b-dart | fvliang/dart-qwen3-ngram |
| Qwen/Qwen3-8B | fvliang/qwen8b-dart | fvliang/dart-qwen3-ngram |
| Qwen/Qwen3-14B | fvliang/qwen14b-dart | fvliang/dart-qwen3-ngram |
| Qwen/Qwen3-32B | fvliang/qwen32b-dart | fvliang/dart-qwen3-ngram |
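A small lookup helper can guard against pairing a base model with the wrong draft head. The repository IDs below are copied from the table above; the `ModelTrio` type and `resolve_trio` function are hypothetical helpers, not part of the DART package.

```python
from typing import Dict, NamedTuple


class ModelTrio(NamedTuple):
    base: str
    dart_head: str
    ngram: str


NGRAM_REPO = "fvliang/dart-qwen3-ngram"

# Base model -> (base, DART head, N-gram model), as listed in the table.
MODEL_TRIOS: Dict[str, ModelTrio] = {
    "Qwen/Qwen3-1.7B": ModelTrio("Qwen/Qwen3-1.7B", "fvliang/qwen1.7b-dart", NGRAM_REPO),
    "Qwen/Qwen3-4B": ModelTrio("Qwen/Qwen3-4B", "fvliang/qwen4b-dart", NGRAM_REPO),
    "Qwen/Qwen3-8B": ModelTrio("Qwen/Qwen3-8B", "fvliang/qwen8b-dart", NGRAM_REPO),
    "Qwen/Qwen3-14B": ModelTrio("Qwen/Qwen3-14B", "fvliang/qwen14b-dart", NGRAM_REPO),
    "Qwen/Qwen3-32B": ModelTrio("Qwen/Qwen3-32B", "fvliang/qwen32b-dart", NGRAM_REPO),
}


def resolve_trio(base_model: str) -> ModelTrio:
    """Return the matching trio or fail early with the supported options."""
    try:
        return MODEL_TRIOS[base_model]
    except KeyError:
        supported = ", ".join(sorted(MODEL_TRIOS))
        raise ValueError(
            f"No DART head listed for {base_model!r}; supported: {supported}"
        ) from None


print(resolve_trio("Qwen/Qwen3-8B").dart_head)  # fvliang/qwen8b-dart
```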
Load the model in Python. Use DartModel.from_pretrained with all three paths:
import torch
from dart import DartModel
model = DartModel.from_pretrained(
    base_model_name_or_path="Qwen/Qwen3-8B",
    dart_model_name_or_path="fvliang/qwen8b-dart",
    ngram_model_name_or_path="fvliang/dart-qwen3-ngram",
    torch_dtype=torch.float16,
    device_map="auto",
    is_small_ngram=False,  # True for faster loading during testing
)
Format the input using the chat template registry. DART uses TEMPLATE_REGISTRY for prompt formatting:
from dart import TEMPLATE_REGISTRY
template = TEMPLATE_REGISTRY["qwen"]
prompt = template.format(messages=[{"role": "user", "content": "Explain speculative decoding."}])
Run inference with dart_generate. Configure sampling parameters and generation limits:
from dart import dart_generate
output = dart_generate(
    model,
    prompt=prompt,
    temperature=0.7,
    top_p=0.9,
    top_k=50,
    max_new_token_num=512,
    max_length=2048,
)
Launch a Gradio demo for interactive testing. Use the provided shell scripts or the direct CLI:
uv run python dart/app/app.py \
--base-model-name-or-path Qwen/Qwen3-4B \
--dart-model-name-or-path fvliang/qwen4b-dart \
--ngram-model-name-or-path fvliang/dart-qwen3-ngram \
--device cuda \
--max-new-tokens 2048 \
--server-port 30000
Benchmark against EAGLE3. Add the --compare-eagle3 flag to the Gradio app to run side-by-side comparisons on the same prompts.
Tune tree construction parameters. Adjust the top-K candidate count and N-gram pruning aggressiveness based on your latency/accuracy tradeoff. Larger K gives more candidates but bigger verification trees; stricter N-gram filtering gives smaller trees but may miss valid continuations.
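To see why larger K inflates verification cost, it helps to count nodes in an unpruned candidate tree. This is plain arithmetic rather than anything taken from the DART code: with K candidates at each of D positions, the full tree has K + K^2 + ... + K^D nodes. The function name is hypothetical.

```python
def full_tree_nodes(top_k: int, depth: int) -> int:
    """Nodes in an unpruned tree with `top_k` children per node and `depth` levels."""
    return sum(top_k ** level for level in range(1, depth + 1))


for k in (2, 4, 8):
    print(f"K={k}, depth=4 -> {full_tree_nodes(k, 4)} nodes before pruning")
# K=2, depth=4 -> 30 nodes before pruning
# K=4, depth=4 -> 340 nodes before pruning
# K=8, depth=4 -> 4680 nodes before pruning
```

The growth is exponential in depth, which is why the N-gram filter matters more as K or draft length increases.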
Use --use-small-ngram for rapid prototyping. This flag loads a reduced N-gram model that is faster to initialize, useful during development before running full benchmarks.
Profile end-to-end latency. Measure three components separately: (a) target model forward pass, (b) DART draft head forward pass, and (c) tree construction + verification. DART's advantage is specifically in reducing component (b) to a single pass.
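One way to separate the three components is a small accumulating timer wrapped around each stage. The sketch below is a generic standard-library helper, not part of DART: `StageTimer` and the stage names are hypothetical, and the `time.sleep` calls stand in for real model calls. When timing GPU work, call `torch.cuda.synchronize()` before each clock read, since CUDA kernels run asynchronously.

```python
import time
from collections import defaultdict
from contextlib import contextmanager
from typing import DefaultDict, Iterator


class StageTimer:
    """Accumulates wall-clock time and call counts per named stage."""

    def __init__(self) -> None:
        self.totals: DefaultDict[str, float] = defaultdict(float)
        self.calls: DefaultDict[str, int] = defaultdict(int)

    @contextmanager
    def stage(self, name: str) -> Iterator[None]:
        start = time.perf_counter()
        try:
            yield
        finally:
            self.totals[name] += time.perf_counter() - start
            self.calls[name] += 1

    def report(self) -> str:
        total = sum(self.totals.values()) or 1.0
        rows = sorted(self.totals.items(), key=lambda kv: -kv[1])
        return "\n".join(
            f"{name}: {secs * 1e3:.2f} ms over {self.calls[name]} calls ({secs / total:.0%})"
            for name, secs in rows
        )


timer = StageTimer()
for _ in range(3):
    with timer.stage("target_forward"):
        time.sleep(0.002)  # placeholder for the target model forward pass
    with timer.stage("draft_forward"):
        time.sleep(0.001)  # placeholder for the DART draft head forward pass
    with timer.stage("tree_and_verify"):
        time.sleep(0.001)  # placeholder for tree construction + verification
print(timer.report())
```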
Example 1: Setting up DART for a Qwen3-8B deployment
User: "I want to speed up my Qwen3-8B inference using DART speculative decoding. Help me set it up."
Approach:
- Install with uv sync && uv pip install -e .
- Load DartModel.from_pretrained with Qwen/Qwen3-8B, fvliang/qwen8b-dart, and fvliang/dart-qwen3-ngram
- Call dart_generate with standard sampling parameters
Output:
import time
import torch
from dart import DartModel, dart_generate, TEMPLATE_REGISTRY
# Load the base model, the matching DART draft head, and the shared N-gram model.
model = DartModel.from_pretrained(
    base_model_name_or_path="Qwen/Qwen3-8B",
    dart_model_name_or_path="fvliang/qwen8b-dart",
    ngram_model_name_or_path="fvliang/dart-qwen3-ngram",
    torch_dtype=torch.float16,
    device_map="auto",
    is_small_ngram=False,
)
template = TEMPLATE_REGISTRY["qwen"]
prompt = template.format(messages=[{"role": "user", "content": "Write a Python quicksort."}])
# Greedy decoding (temperature=0.0, top_k=1) keeps the timing free of sampling variance.
start = time.perf_counter()
output = dart_generate(model, prompt=prompt, temperature=0.0, top_k=1, max_new_token_num=256, max_length=2048)
elapsed = time.perf_counter() - start
print(f"Generated {len(output.tokens)} tokens in {elapsed:.2f}s ({len(output.tokens)/elapsed:.1f} tok/s)")
print(output.text)
Example 2: Launching a Gradio demo comparing DART vs EAGLE3
User: "I want to visually compare DART and EAGLE3 speed on my prompts."
Approach:
- Launch the Gradio app with the --compare-eagle3 flag
Output:
cd DART
uv run python dart/app/app.py \
--base-model-name-or-path Qwen/Qwen3-4B \
--dart-model-name-or-path fvliang/qwen4b-dart \
--ngram-model-name-or-path fvliang/dart-qwen3-ngram \
--device cuda \
--max-new-tokens 2048 \
--compare-eagle3 \
--server-port 30000
# Open http://localhost:30000 to see side-by-side generation
Example 3: Integrating DART into an existing serving script
User: "I have a batch inference script that processes prompts from a JSONL file. How do I swap in DART?"
Approach:
- Replace the model.generate() call with dart_generate()
- Use is_small_ngram=True for initial testing, switch to False for production
Output:
import json
import torch
from dart import DartModel, dart_generate, TEMPLATE_REGISTRY
model = DartModel.from_pretrained(
    base_model_name_or_path="Qwen/Qwen3-8B",
    dart_model_name_or_path="fvliang/qwen8b-dart",
    ngram_model_name_or_path="fvliang/dart-qwen3-ngram",
    torch_dtype=torch.float16,
    device_map="auto",
    is_small_ngram=False,
)
template = TEMPLATE_REGISTRY["qwen"]
# Read one JSON object per line, skipping blank lines.
with open("prompts.jsonl") as f:
    prompts = [json.loads(line) for line in f if line.strip()]
results = []
for item in prompts:
    formatted = template.format(messages=[{"role": "user", "content": item["prompt"]}])
    output = dart_generate(
        model, prompt=formatted,
        temperature=item.get("temperature", 0.7),
        top_p=0.9, top_k=50,
        max_new_token_num=item.get("max_tokens", 512),
        max_length=2048,
    )
    results.append({"prompt": item["prompt"], "response": output.text})
# Write one JSON result per line.
with open("results.jsonl", "w") as f:
    for r in results:
        f.write(json.dumps(r) + "\n")
- Use torch.float16 (or bfloat16 on Ampere+ GPUs) for both the base model and DART head to minimize memory and maximize throughput.
- Use is_small_ngram=True during development to speed up model loading, then switch to the full N-gram model for production benchmarks.
- Use temperature=0.0 and top_k=1 when benchmarking raw speedup, to isolate the speculative decoding performance from sampling variance.
- Do not increase max_new_token_num without also increasing max_length, as the tree verification requires buffer space beyond the raw token count.
| Problem | Cause | Fix |
|---|---|---|
| RuntimeError: CUDA out of memory | Base model + DART head + N-gram model exceed GPU VRAM | Use a smaller base model, enable is_small_ngram=True, or use device_map="auto" for multi-GPU sharding |
| Draft head path not found | Mismatched DART head for the chosen base model | Verify you are using the correct fvliang/qwen{size}b-dart matching your Qwen/Qwen3-{size}B |
| Slow tree construction | C++ tree search extension not compiled | Run uv sync again to ensure native extensions are built; check that a C++ compiler is available |
| Low acceptance rate (tau) | N-gram model not loaded or is_small_ngram=True in production | Switch to the full N-gram model (is_small_ngram=False) for higher-quality tree pruning |
| No speedup over vanilla | Very short outputs (< 20 tokens) | Speculative decoding overhead is amortized over longer sequences; test with max_new_token_num >= 128 |
| Template formatting errors | Wrong template name for model family | Use TEMPLATE_REGISTRY["qwen"] for all Qwen3 models |
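The "Low acceptance rate" and "No speedup over vanilla" rows can be reasoned about with a back-of-the-envelope model. This is an illustrative estimate, not a formula from the DART paper: it assumes each draft-and-verify cycle costs one target forward pass plus a fractional drafting cost, yields `tau` tokens on average, and that each request pays a fixed setup overhead measured in target-pass units. All parameter names and default values are hypothetical.

```python
def estimated_speedup(
    num_tokens: int,
    tau: float,
    draft_cost: float = 0.1,
    fixed_overhead: float = 5.0,
) -> float:
    """Rough speedup over vanilla decoding, in units of target forward passes.

    tau            -- average tokens accepted per cycle (assumed input)
    draft_cost     -- drafting + tree cost per cycle, relative to one target pass
    fixed_overhead -- one-off per-request cost, in target-pass units
    """
    vanilla = float(num_tokens)  # one target pass per generated token
    cycles = num_tokens / tau
    speculative = fixed_overhead + cycles * (1.0 + draft_cost)
    return vanilla / speculative


for n in (16, 128, 512):
    print(f"{n} tokens: ~{estimated_speedup(n, tau=3.0):.2f}x")
```

Under these assumptions the fixed overhead dominates short outputs, and the estimate approaches tau / (1 + draft_cost) as the output grows, which is why a low tau caps the achievable speedup regardless of sequence length.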
is_small_ngram=True is necessary but reduces acceptance rates.
Paper: DART: Diffusion-Inspired Speculative Decoding for Fast LLM Inference (Liu et al., 2026) -- Focus on Section 3 (method) for the parallel drafting mechanism and Section 3.3 for the N-gram-enforced tree pruning algorithm.
Code: https://github.com/fvliang/DART -- Apache 2.0 licensed, supports Qwen3 1.7B through 32B with pre-trained draft heads.