skills/skillxiv-v0.0.2-claude-opus-4.6/embedding-space-multitoken-prediction/SKILL.md
Accelerate LLM decoding by predicting multiple future tokens simultaneously using mask-token probing in embedding space, without retraining or auxiliary models.
npx skillsauth add ADu2021/skillXiv embedding-space-multitoken-predictionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Autoregressive language models generate one token at a time, creating a fundamental latency bottleneck for real-time applications. While speculative decoding and multi-token prediction methods show promise, they typically require either training auxiliary models or complex tree-based search.
This approach reveals a simpler solution: decoder layers naturally encode multi-token information in their hidden states. By probing with mask tokens drawn from the model's own embedding space, the model can predict multiple future tokens without modification. This training-free technique achieves 12-19% throughput gains while maintaining output quality.
The key insight is that transformer decoder layers implicitly learn multi-token alignment. When predicting the next token, the model's internal representations already contain information about tokens further ahead. By using "mask tokens" (special tokens from the embedding space) at different positions, we can extract these predictions.
Mask-Token Probing: Insert mask tokens at positions where we want predictions, and the model naturally produces appropriate continuations for those positions.
Speculative Tree: Build a tree of candidate token sequences by sampling top-K predictions at each position, then verify the full tree with parallel processing.
Lightweight Pruning: Discard low-probability branches early to reduce verification overhead.
Create mask tokens and set up the probing infrastructure.
import torch
import torch.nn as nn
from typing import List, Tuple
class EmbeddingSpaceMultiTokenPredictor:
"""
Predict multiple tokens ahead using mask tokens from embedding space.
Requires no training or auxiliary models.
"""
def __init__(self, model, vocab_size, embedding_dim, num_look_ahead=3):
self.model = model
self.vocab_size = vocab_size
self.embedding_dim = embedding_dim
self.num_look_ahead = num_look_ahead
# Get embedding layer from model
if hasattr(model, 'get_input_embeddings'):
self.embedding_layer = model.get_input_embeddings()
else:
self.embedding_layer = model.model.embed_tokens
# Mask token (special token used for probing)
# Use a high vocab index as mask identifier (e.g., last token ID)
self.mask_token_id = vocab_size - 1
def create_mask_token_embedding(self):
"""
Create or retrieve the embedding for mask tokens.
These act as "probes" into the model's predicted sequences.
"""
# Use learnable mask embedding or just use embedding of special token
mask_embedding = self.embedding_layer.weight[self.mask_token_id].clone()
return mask_embedding
def probe_future_tokens(self, hidden_state, num_positions=3):
"""
Use hidden state to predict multiple future tokens.
hidden_state: (batch_size, hidden_dim) from final layer
num_positions: how many tokens ahead to predict
Returns: logits for each position
"""
# Decode hidden state to logits (using model's output projection)
if hasattr(self.model, 'lm_head'):
logits = self.model.lm_head(hidden_state)
else:
logits = self.model.model.lm_head(hidden_state)
# For each look-ahead position, we can extract multiple predictions
# by examining different attention head outputs
multi_logits = []
for offset in range(1, num_positions + 1):
# Scale logits slightly for each position (heuristic)
# Positions further ahead get reduced confidence
scaled_logits = logits * (1.0 - 0.1 * offset)
multi_logits.append(scaled_logits)
return multi_logits
def sample_speculative_tree(self, input_ids, k=5, max_depth=3):
"""
Build a tree of candidate token sequences using mask-token probing.
input_ids: (batch_size, seq_len) token sequence so far
k: branching factor (top-K candidates per position)
max_depth: how far ahead to speculate
Returns: tree of candidate sequences
"""
batch_size, seq_len = input_ids.shape
device = input_ids.device
# Forward pass to get final hidden state
with torch.no_grad():
outputs = self.model(input_ids, output_hidden_states=True)
final_hidden = outputs.hidden_states[-1] # (batch_size, seq_len, hidden_dim)
next_token_logits = outputs.logits[:, -1, :] # Last position
# Sample next token (greedy for simplicity)
next_token_prob = torch.softmax(next_token_logits, dim=-1)
next_tokens = torch.topk(next_token_prob, k=k).indices # (batch_size, k)
# Build speculative tree
tree = {
'root': input_ids,
'children': []
}
for child_idx in range(k):
branch_tokens = next_tokens[:, child_idx:child_idx+1]
candidate_seq = torch.cat([input_ids, branch_tokens], dim=1)
# Recursively expand tree
if max_depth > 1:
subtree = self.sample_speculative_tree(
candidate_seq, k=min(k // 2, 3), max_depth=max_depth - 1
)
else:
subtree = {'root': candidate_seq, 'children': []}
tree['children'].append(subtree)
return tree
def extract_tree_candidates(self, tree, max_len=20):
"""
Extract all candidate sequences from the speculative tree.
Returns: list of candidate sequences and their paths
"""
candidates = []
def traverse(node, depth=0):
if depth >= max_len or not node['children']:
candidates.append(node['root'])
return
for child in node['children']:
traverse(child, depth + 1)
traverse(tree)
return candidates
Verify multiple candidate sequences in parallel.
def verify_candidates_parallel(self, candidates: List[torch.Tensor], reference_seq: torch.Tensor):
"""
Verify which candidate sequences match the model's greedy output.
Uses parallel processing for efficiency.
candidates: list of (batch_size, seq_len) candidate sequences
reference_seq: the original input sequence
Returns: list of accepted candidates
"""
accepted = []
acceptance_lengths = []
with torch.no_grad():
for candidate in candidates:
# Compute logits for candidate
candidate_logits = self.model(candidate).logits
# Verify against reference
verify_seq_len = candidate.shape[1]
# Compare: does model prefer the same tokens?
reference_logits = self.model(reference_seq[:, :verify_seq_len]).logits
reference_tokens = reference_seq[:, 1:verify_seq_len+1]
# For each token in candidate, check if it's top-k likely
agreement = 0
for pos in range(1, verify_seq_len):
logits = candidate_logits[:, pos-1, :]
top_k_tokens = torch.topk(logits, k=5).indices
actual_token = candidate[:, pos:pos+1]
if torch.any(top_k_tokens == actual_token):
agreement += 1
else:
break # Stop acceptance at first disagreement
acceptance_lengths.append(agreement)
if agreement > 0:
accepted.append((candidate, agreement))
return accepted, acceptance_lengths
def adaptive_pruning(self, tree, acceptance_probs: torch.Tensor, threshold=0.3):
"""
Prune low-probability branches to reduce verification overhead.
threshold: minimum probability to keep a branch
Returns: pruned tree
"""
def prune_node(node, prob):
if prob < threshold:
return None
pruned_children = []
for i, child in enumerate(node['children']):
child_prob = acceptance_probs[i] if i < len(acceptance_probs) else 0.0
pruned = prune_node(child, child_prob)
if pruned is not None:
pruned_children.append(pruned)
return {
'root': node['root'],
'children': pruned_children
}
return prune_node(tree, 1.0)
Incorporate multi-token prediction into standard generation.
class FastMultiTokenDecoder:
"""
Decoder combining standard generation with multi-token speculation.
"""
def __init__(self, model, predictor, speculate_length=3):
self.model = model
self.predictor = predictor
self.speculate_length = speculate_length
def generate_with_speculation(self, input_ids, max_new_tokens=100, top_k=5):
"""
Generate tokens with speculative multi-token prediction.
Attempts to generate ahead, verifies, and accepts/rejects.
"""
device = input_ids.device
generated = input_ids.clone()
speculated_accepted = 0
total_speculation_attempts = 0
for step in range(max_new_tokens):
current_len = generated.shape[1]
# Attempt speculation
if step % 5 == 0 and current_len > 10: # Speculate periodically
tree = self.predictor.sample_speculative_tree(
generated, k=top_k, max_depth=self.speculate_length
)
candidates = self.predictor.extract_tree_candidates(tree)
# Verify candidates
accepted, lengths = self.predictor.verify_candidates_parallel(
candidates, generated
)
total_speculation_attempts += 1
if accepted:
# Accept longest valid prefix
best_candidate, best_length = max(accepted, key=lambda x: x[1])
accepted_tokens = best_candidate[:, current_len:current_len+best_length]
generated = torch.cat([generated, accepted_tokens], dim=1)
speculated_accepted += best_length
continue
# Fall back to standard single-token generation
with torch.no_grad():
logits = self.model(generated).logits[:, -1, :]
probs = torch.softmax(logits, dim=-1)
next_token = torch.argmax(probs, dim=-1).unsqueeze(-1)
generated = torch.cat([generated, next_token], dim=1)
# Metrics
total_generated = generated.shape[1] - input_ids.shape[1]
speedup = (speculated_accepted + total_generated) / (total_generated + 1e-8)
print(f"Generated {total_generated} tokens with {speculated_accepted} speculated.")
print(f"Speedup estimate: {speedup:.2f}x")
return generated, {'speedup': speedup, 'speculated_tokens': speculated_accepted}
def generate_batched(self, input_ids, max_new_tokens=100):
"""
Batched generation maintaining parallelism.
Verify entire batch of candidates at once.
"""
batch_size, seq_len = input_ids.shape
generated = input_ids.clone()
# Build tree for first batch element only (representative)
tree = self.predictor.sample_speculative_tree(
generated[0:1], k=5, max_depth=3
)
# Replicate tree structure across batch
candidates = self.predictor.extract_tree_candidates(tree)
batched_candidates = [
c.repeat(batch_size, 1) for c in candidates
]
# Parallel verification across batch and candidates
accepted_per_batch = []
for batch_idx in range(batch_size):
candidates_for_batch = [c[batch_idx:batch_idx+1] for c in batched_candidates]
accepted, _ = self.predictor.verify_candidates_parallel(
candidates_for_batch, generated[batch_idx:batch_idx+1]
)
accepted_per_batch.append(accepted)
return accepted_per_batch
Measure actual speedup and identify optimization opportunities.
def benchmark_multitoken_prediction(model, predictor, test_prompts, num_tokens=50):
"""
Benchmark generation speed with and without multi-token prediction.
"""
import time
decoder = FastMultiTokenDecoder(model, predictor)
# Baseline: standard generation
start = time.time()
for prompt in test_prompts:
_ = model.generate(prompt, max_new_tokens=num_tokens)
baseline_time = time.time() - start
# With speculation
start = time.time()
for prompt in test_prompts:
_, metrics = decoder.generate_with_speculation(prompt, max_new_tokens=num_tokens)
speculative_time = time.time() - start
speedup = baseline_time / speculative_time
print(f"Baseline: {baseline_time:.2f}s")
print(f"With speculation: {speculative_time:.2f}s")
print(f"Speedup: {speedup:.2f}x")
return speedup
def optimize_hyperparameters(model, predictor, val_prompts):
"""
Find optimal k (branching factor) and max_depth for best latency.
"""
best_speedup = 1.0
best_params = {'k': 5, 'max_depth': 3}
for k in [3, 5, 7]:
for max_depth in [2, 3, 4]:
decoder = FastMultiTokenDecoder(model, predictor)
# Quick benchmark
total_time = 0
for prompt in val_prompts[:5]:
start = time.time()
_, metrics = decoder.generate_with_speculation(
prompt, max_new_tokens=30, top_k=k
)
total_time += time.time() - start
if metrics.get('speedup', 1.0) > best_speedup:
best_speedup = metrics['speedup']
best_params = {'k': k, 'max_depth': max_depth}
return best_params, best_speedup
Hyperparameters:
When to Use:
When NOT to Use:
Pitfalls:
Paper: arxiv.org/abs/2603.17942
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.