skills/skillxiv-v0.0.2-claude-opus-4.6/clasp-layer-skip-inference/SKILL.md
Accelerate LLM inference by dynamically skipping transformer layers based on in-context signals, without training auxiliary draft models or changing model weights.
npx skillsauth add ADu2021/skillXiv clasp-layer-skip-inferenceInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Speculative decoding accelerates LLM inference by using a lightweight draft model to generate candidate tokens, which are verified by the full model. The challenge is maintaining compatibility across diverse LLMs and avoiding the need to train and deploy separate draft models. CLaSp solves this through self-speculative decoding: the full model itself acts as its own draft by skipping layers during inference.
The key insight is that early transformer layers can often skip later layers without significant accuracy loss for high-confidence tokens. By dynamically deciding which layers to skip based on the input, you create an internal draft mechanism without extra models. This is compatible with any LLM and requires zero additional training.
CLaSp enables dynamic layer skipping through:
The mechanism is elegant: if the output logits stabilize at layer N, layers N+1 through M are redundant for this token. By detecting this in-context, you skip unnecessary computation.
Build a layer-skipping inference engine:
# CLaSp: In-Context Layer Skip for Self-Speculative Decoding
import torch
import torch.nn as nn
from transformers import AutoModelForCausalLM, AutoTokenizer
from typing import List, Tuple, Optional
import numpy as np
class LayerSkipInferenceEngine:
"""
Enable dynamic layer skipping for faster inference.
"""
def __init__(self, model_name: str, skip_threshold: float = 0.8):
self.model = AutoModelForCausalLM.from_pretrained(model_name)
self.tokenizer = AutoTokenizer.from_pretrained(model_name)
self.skip_threshold = skip_threshold
self.num_layers = self.model.config.num_hidden_layers
# Track statistics
self.skip_counts = [0] * self.num_layers
self.layer_importance = [1.0] * self.num_layers
def measure_logit_stability(self, logits_sequence: List[torch.Tensor]) -> torch.Tensor:
"""
Measure how stable predictions are across layers.
High stability = confident prediction = safe to skip remaining layers.
"""
# Convert logits to probabilities
probs_sequence = [torch.softmax(logits, dim=-1) for logits in logits_sequence]
# Compute KL divergence from layer i to layer i+1
kl_divs = []
for i in range(len(probs_sequence) - 1):
# KL(P_i || P_{i+1})
kl = torch.nn.functional.kl_div(
torch.log(probs_sequence[i] + 1e-10),
probs_sequence[i + 1],
reduction='none'
).mean()
kl_divs.append(kl)
# Stability = inverse of divergence
kl_divs = torch.tensor(kl_divs)
stability = torch.exp(-kl_divs) # Higher = more stable
return stability
def decide_layer_skip(self, stability: torch.Tensor, current_layer: int) -> bool:
"""
Decide whether to skip next layer based on stability.
"""
if current_layer >= len(stability):
return False
current_stability = stability[current_layer]
# Skip if confidence is high enough
should_skip = current_stability > self.skip_threshold
if should_skip:
self.skip_counts[current_layer] += 1
return should_skip
def generate_with_layer_skip(self, prompt: str, max_new_tokens: int = 50) -> dict:
"""
Generate text with dynamic layer skipping.
"""
input_ids = self.tokenizer.encode(prompt, return_tensors='pt')
device = input_ids.device
generated_ids = input_ids.clone()
all_logits = []
skip_decisions = []
total_skipped_layers = 0
total_computed_layers = 0
for token_idx in range(max_new_tokens):
# Store intermediate logits for stability measurement
layer_logits = []
# Forward pass with layer-by-layer access
with torch.no_grad():
current_hidden = self.model.get_input_embeddings()(generated_ids)
for layer_idx, layer_module in enumerate(self.model.transformer.h):
# Check if we should skip this layer
if layer_idx > 0 and len(layer_logits) > 0:
stability = self.measure_logit_stability(layer_logits)
if self.decide_layer_skip(stability, layer_idx - 1):
# Skip this layer - reuse previous hidden state
skip_decisions.append((token_idx, layer_idx, True))
total_skipped_layers += 1
continue
# Compute this layer
layer_output = layer_module(current_hidden)
current_hidden = layer_output[0]
total_computed_layers += 1
# Get logits at this layer
logits = self.model.lm_head(current_hidden)[:, -1, :]
layer_logits.append(logits)
# Final logits
final_logits = self.model.lm_head(current_hidden)[:, -1, :]
all_logits.append(final_logits)
# Sample next token
probs = torch.softmax(final_logits, dim=-1)
next_token = torch.multinomial(probs, num_samples=1)
generated_ids = torch.cat([generated_ids, next_token], dim=1)
# Compute statistics
generated_text = self.tokenizer.decode(generated_ids[0])
skip_rate = total_skipped_layers / max(1, total_skipped_layers + total_computed_layers)
return {
'text': generated_text,
'skip_rate': skip_rate,
'skipped_layers': total_skipped_layers,
'computed_layers': total_computed_layers,
'speedup': 1.0 / (1.0 - skip_rate) if skip_rate > 0 else 1.0
}
def adaptive_threshold_tuning(self, calibration_prompts: List[str],
target_skip_rate: float = 0.3):
"""
Automatically tune skip threshold to achieve target skip rate.
"""
skip_rates = []
# Try different thresholds
for threshold in np.linspace(0.5, 0.99, 10):
self.skip_threshold = threshold
skip_rates_for_threshold = []
for prompt in calibration_prompts:
result = self.generate_with_layer_skip(prompt, max_new_tokens=20)
skip_rates_for_threshold.append(result['skip_rate'])
avg_skip_rate = np.mean(skip_rates_for_threshold)
skip_rates.append((threshold, avg_skip_rate))
print(f"Threshold {threshold:.2f}: {avg_skip_rate:.2%} skip rate")
# Find threshold closest to target
skip_rates.sort(key=lambda x: abs(x[1] - target_skip_rate))
best_threshold = skip_rates[0][0]
self.skip_threshold = best_threshold
print(f"\nOptimal threshold: {best_threshold:.2f}")
return best_threshold
Implement verification and fallback mechanisms:
class VerifiedLayerSkipDecoding:
"""
Layer skipping with verification (like speculative decoding).
Ensure skip path produces same output as full path.
"""
def __init__(self, model_name: str, verify_every_n_tokens: int = 5):
self.skip_engine = LayerSkipInferenceEngine(model_name)
self.verify_every_n_tokens = verify_every_n_tokens
self.mismatch_count = 0
self.total_verify_count = 0
def generate_with_verification(self, prompt: str, max_new_tokens: int = 50) -> dict:
"""
Generate with layer skipping and periodic verification.
"""
input_ids = self.skip_engine.tokenizer.encode(prompt, return_tensors='pt')
generated_ids = input_ids.clone()
verification_failures = []
for token_idx in range(max_new_tokens):
# Generate with skip
skip_result = self.skip_engine.generate_with_layer_skip(prompt, max_new_tokens=1)
# Periodically verify
if (token_idx + 1) % self.verify_every_n_tokens == 0:
# Run full forward pass to verify
with torch.no_grad():
full_output = self.skip_engine.model(generated_ids)
full_logits = full_output.logits[:, -1, :]
# Get skip output logits (last from skip_result)
skip_logits = torch.tensor(skip_result['logits'][-1])
# Check if top-1 predictions match
skip_pred = torch.argmax(skip_logits)
full_pred = torch.argmax(full_logits)
if skip_pred != full_pred:
self.mismatch_count += 1
verification_failures.append(token_idx)
self.total_verify_count += 1
generated_ids = torch.cat([
generated_ids,
torch.tensor([[skip_result['next_token_id']]])
], dim=1)
generated_text = self.skip_engine.tokenizer.decode(generated_ids[0])
mismatch_rate = self.mismatch_count / max(1, self.total_verify_count)
return {
'text': generated_text,
'verification_mismatches': len(verification_failures),
'mismatch_rate': mismatch_rate,
'skip_rate': skip_result['skip_rate']
}
Implement layer importance analysis:
def analyze_layer_importance(model_name: str, test_prompts: List[str]) -> dict:
"""
Measure how important each layer is for final predictions.
"""
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)
num_layers = model.config.num_hidden_layers
layer_importance = np.zeros(num_layers)
for prompt in test_prompts:
input_ids = tokenizer.encode(prompt, return_tensors='pt')
# Full forward pass
with torch.no_grad():
full_output = model(input_ids)
full_logits = full_output.logits[:, -1, :]
# Skip each layer and measure impact
for skip_layer in range(num_layers):
# Forward pass skipping layer skip_layer
# (simplified; actual implementation needs to modify forward pass)
# Measure KL divergence or accuracy change
# layer_importance[skip_layer] = kl_div(full, skip_one_layer)
pass
return {
'layer_importance': layer_importance,
'most_important': np.argsort(layer_importance)[-5:],
'least_important': np.argsort(layer_importance)[:5]
}
def identify_skippable_patterns(skip_engine: LayerSkipInferenceEngine) -> dict:
"""
Analyze when and why layers are skipped.
"""
skip_stats = {
'total_skips': sum(skip_engine.skip_counts),
'per_layer': skip_engine.skip_counts,
'skip_likelihood': [count / max(1, 100) for count in skip_engine.skip_counts]
}
# Early layers more likely to be skipped than late layers?
early_skip_rate = np.mean(skip_stats['skip_likelihood'][:len(skip_engine.skip_counts)//2])
late_skip_rate = np.mean(skip_stats['skip_likelihood'][len(skip_engine.skip_counts)//2:])
return {
'skip_stats': skip_stats,
'early_layer_skip_rate': early_skip_rate,
'late_layer_skip_rate': late_skip_rate,
'pattern': 'skip early' if early_skip_rate > late_skip_rate else 'skip late'
}
| Parameter | Typical Range | Notes | |-----------|---------------|-------| | Skip threshold | 0.7 - 0.95 | Higher = more aggressive skipping, lower accuracy risk | | Verification frequency | Every 5-20 tokens | Balance safety with overhead | | Target skip rate | 20% - 50% | More is faster but riskier; sweet spot ~30% | | Initial threshold | Task-dependent | Calibrate on development set | | Fallback policy | Full forward if mismatch | Ensure correctness over speed |
When to use CLaSp:
When NOT to use:
Common pitfalls:
CLaSp: In-Context Layer Skip for Self-Speculative Decoding https://arxiv.org/abs/2505.24196
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.