skills/skillxiv-v0.0.2-claude-opus-4.6/dot-resize-optimal-transport-compression/SKILL.md
Compress LLMs by 20-30% in width while preserving functionality through optimal transport-based neuron merging. Instead of discarding neurons, redistribute their signal to retained neurons via learned transport maps. Use when you need to reduce model size with minimal accuracy loss and measurable computational speedup.
Traditional pruning discards low-importance neurons, wasting the signal they carry. DOTResize reframes width reduction as a discrete optimal transport problem: instead of deletion, the method computes optimal mappings from full-width neurons to a reduced subset, redistributing information through learned transformation matrices. This approach preserves functional equivalence by leveraging QR decomposition to maintain RMSNorm invariance in Transformers.
The key innovation is computing transport maps based on neuron activation similarities, then using QR factorization to decompose the transport matrix into components that preserve normalization properties in residual networks. This enables dropping 20-30% of neurons with measurable real-world speedup rather than just parameter reduction.
The method solves a discrete optimal transport problem in which source neurons are mapped to target neurons according to the similarity of their activations on calibration data. The Sinkhorn algorithm with entropy regularization solves this efficiently. A critical insight is that orthogonal maps commute with RMSNorm, whereas a general transport map does not. The transport matrix is therefore factored as T = QR, where Q has orthonormal columns (norm-preserving) and R is upper triangular and carries the scaling.
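The RMSNorm claim can be checked numerically. The sketch below is illustrative only: `rms_norm` is a hypothetical helper that assumes RMSNorm without a learned gain (a gain would first have to be folded into an adjacent weight), and the orthogonal matrix is a random square one, not a map produced by DOTResize.

```python
import torch


def rms_norm(x: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    """RMSNorm without a learned gain: x / sqrt(mean(x^2) + eps)."""
    return x / torch.sqrt(x.pow(2).mean(dim=-1, keepdim=True) + eps)


torch.manual_seed(0)
d = 8
x = torch.randn(4, d, dtype=torch.float64)

# Random square orthogonal matrix, taken from the QR of a Gaussian matrix.
Q, _ = torch.linalg.qr(torch.randn(d, d, dtype=torch.float64))

# A generic (non-orthogonal) map for comparison.
M = torch.randn(d, d, dtype=torch.float64)

# Orthogonal maps commute with the normalization; generic maps do not.
orth_gap = (rms_norm(x @ Q) - rms_norm(x) @ Q).abs().max().item()
generic_gap = (rms_norm(x @ M) - rms_norm(x) @ M).abs().max().item()
print(f"orthogonal gap: {orth_gap:.2e}, generic gap: {generic_gap:.2e}")
```

The orthogonal gap sits at floating-point noise while the generic gap does not, which is why the non-orthogonal part of the map has to be handled separately.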
The learned transformation pairs are then applied at sublayer outputs and absorbed into adjacent weight matrices, allowing the model to function with fewer neurons without retraining the LLM backbone.
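Absorbing a map into an adjacent weight matrix is plain linear algebra. As a minimal sketch, assuming a `torch.nn.Linear` layer and a random stand-in `M` for the transport-derived map (all sizes here are made up for illustration):

```python
import torch

torch.manual_seed(0)
d_in, d_out, d_small = 6, 8, 5

linear = torch.nn.Linear(d_in, d_out).double()
# Stand-in for a transport-derived map from d_out neurons to d_small neurons.
M = torch.randn(d_out, d_small, dtype=torch.float64)

# nn.Linear stores weight as (out_features, in_features) and computes
# x @ W.T + b, so right-multiplying its output by M is the same as using
# weight M.T @ W and bias b @ M.
absorbed = torch.nn.Linear(d_in, d_small).double()
with torch.no_grad():
    absorbed.weight.copy_(M.T @ linear.weight)
    absorbed.bias.copy_(linear.bias @ M)

x = torch.randn(3, d_in, dtype=torch.float64)
max_diff = (linear(x) @ M - absorbed(x)).abs().max().item()
print(f"max difference after absorption: {max_diff:.2e}")
```

After absorption the narrower layer produces the mapped output directly, so no extra matrix multiply remains at inference time.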
The Sinkhorn algorithm solves the optimal transport problem with entropy regularization, balancing accuracy and computational efficiency.
```python
import torch
import torch.nn.functional as F


class OptimalTransportCompressor:
    def __init__(self, source_dim, target_dim, regularization_lambda=0.1):
        self.source_dim = source_dim
        self.target_dim = target_dim
        self.lambda_reg = regularization_lambda

    def compute_cost_matrix(self, source_activations, target_activations):
        """
        Compute pairwise L1 distances between source and target neuron activations.

        Args:
            source_activations: (calib_samples, source_dim) from full model
            target_activations: (calib_samples, target_dim) from reduced model
        Returns:
            cost_matrix: (source_dim, target_dim) pairwise L1 distances
        """
        # Normalize each neuron's activation vector (a column) for scale invariance
        source_norm = F.normalize(source_activations.float(), p=2, dim=0)
        target_norm = F.normalize(target_activations.float(), p=2, dim=0)
        # Pairwise L1 distances between neurons (rows after transposing)
        return torch.cdist(source_norm.T, target_norm.T, p=1)

    def sinkhorn_algorithm(self, cost_matrix, num_iterations=100, tolerance=1e-6):
        """
        Solve optimal transport via Sinkhorn-Knopp algorithm with entropy regularization.

        Minimizes: <T, C> - λ*H(T)
        where H(T) = -∑T_ij log(T_ij) is entropy, encouraging soft assignment.

        Args:
            cost_matrix: (source_dim, target_dim) cost matrix
            num_iterations: iterations of Sinkhorn algorithm
            tolerance: convergence threshold
        Returns:
            transport_matrix: (source_dim, target_dim) soft assignment matrix
        """
        source_dim, target_dim = cost_matrix.shape
        device, dtype = cost_matrix.device, cost_matrix.dtype
        # Rescale costs to [0, 1] so exp(-C / λ) does not underflow for small λ
        cost_matrix = cost_matrix / cost_matrix.max().clamp_min(1e-12)
        # Gibbs kernel of the objective above: K = exp(-C / λ)
        K = torch.exp(-cost_matrix / self.lambda_reg)
        # Marginal constraints (how much mass leaves each source)
        a = torch.full((source_dim,), 1.0 / source_dim, device=device, dtype=dtype)
        # How much mass arrives at each target (uniform)
        b = torch.full((target_dim,), 1.0 / target_dim, device=device, dtype=dtype)
        v = torch.ones(target_dim, device=device, dtype=dtype)
        for iteration in range(num_iterations):
            # Alternating Sinkhorn scaling; u must use the current v
            u = a / (K @ v + 1e-8)
            v = b / (K.T @ u + 1e-8)
            # Check convergence of the row marginals
            if iteration % 10 == 0:
                T = u.unsqueeze(1) * K * v.unsqueeze(0)
                if torch.abs(T.sum(dim=1) - a).max() < tolerance:
                    break
        return u.unsqueeze(1) * K * v.unsqueeze(0)

    def compute_transport_map(self, source_acts, target_acts):
        """
        Compute optimal transport mapping from source to target neurons.
        """
        cost = self.compute_cost_matrix(source_acts, target_acts)
        return self.sinkhorn_algorithm(cost)
```
The Sinkhorn algorithm efficiently finds soft assignments that trade transport cost against entropy, so each source neuron can spread its signal across several target neurons instead of being forced onto a single one.
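To make the effect of the regularization concrete, here is a small standalone sketch on a toy cost matrix. It assumes the standard entropic-OT convention with kernel `exp(-cost / reg)` and uniform marginals; the function `sinkhorn` and the toy costs are illustrative and not taken from the DOTResize code.

```python
import torch


def sinkhorn(cost: torch.Tensor, reg: float, n_iters: int = 500) -> torch.Tensor:
    """Entropic OT with uniform marginals, using the kernel K = exp(-cost / reg)."""
    n, m = cost.shape
    a = torch.full((n,), 1.0 / n, dtype=cost.dtype)  # mass leaving each source
    b = torch.full((m,), 1.0 / m, dtype=cost.dtype)  # mass arriving at each target
    K = torch.exp(-cost / reg)
    v = torch.ones(m, dtype=cost.dtype)
    for _ in range(n_iters):
        u = a / (K @ v)
        v = b / (K.T @ u)
    return u.unsqueeze(1) * K * v.unsqueeze(0)


# Toy problem: 4 source neurons, 2 target neurons.
# Sources 0-1 resemble target 0; sources 2-3 resemble target 1.
cost = torch.tensor(
    [[0.0, 1.0], [0.1, 0.9], [0.9, 0.1], [1.0, 0.0]], dtype=torch.float64
)

sharp = sinkhorn(cost, reg=0.05)   # weak regularization: near-hard assignment
smooth = sinkhorn(cost, reg=5.0)   # strong regularization: near-uniform plan
print(sharp)
print(smooth)
```

With a small `reg` almost all of each source's mass goes to its closest target; with a large `reg` the plan approaches the uniform coupling. Both plans still satisfy the row and column marginals.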
Transport maps preserve functional equivalence by decomposing into orthogonal and scaling components.
```python
def qr_decomposed_transport(transport_matrix):
    """
    Decompose transport matrix T = QR for norm preservation.

    Q (orthonormal columns) preserves norms through RMSNorm layers.
    R (upper triangular) handles scaling adjustments.

    Args:
        transport_matrix: (source_dim, target_dim) optimal transport matrix
    Returns:
        Q: (source_dim, target_dim) component with orthonormal columns
        R: (target_dim, target_dim) upper triangular scaling
    """
    # Reduced QR (the PyTorch default): Q is tall with orthonormal columns
    Q, R = torch.linalg.qr(transport_matrix)
    return Q, R


def apply_transport_with_qr(hidden_states, transport_matrix):
    """
    Apply transport mapping while preserving norm properties.

    Args:
        hidden_states: (batch, seq_len, source_dim)
        transport_matrix: (source_dim, target_dim)
    Returns:
        transformed: (batch, seq_len, target_dim) with norm properties preserved
    """
    Q, R = qr_decomposed_transport(transport_matrix)
    # Apply the orthonormal part of the map
    transformed = hidden_states @ Q  # (batch, seq_len, target_dim)
    # Scale by the inverse of R to maintain activation magnitudes.
    # R is upper triangular, so solve X @ R = transformed instead of forming R^-1
    # (this assumes R is non-singular, i.e. the transport matrix has full column rank).
    flat = transformed.reshape(-1, transformed.shape[-1])
    scaled = torch.linalg.solve_triangular(R, flat, upper=True, left=False)
    return scaled.reshape(transformed.shape)
```
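One subtlety is worth checking by hand: for a tall transport matrix, the reduced QR factor `Q` has orthonormal columns but is not a square orthogonal matrix. The sketch below uses a random non-negative matrix as a stand-in for a transport map (an assumption made only so the example is self-contained) and verifies the basic identities.

```python
import torch

torch.manual_seed(0)
source_dim, target_dim = 6, 4
# Stand-in for a transport map; any full-column-rank matrix works for this check.
T = torch.rand(source_dim, target_dim, dtype=torch.float64)

Q, R = torch.linalg.qr(T)  # default mode="reduced": Q is (6, 4), R is (4, 4)

# Q has orthonormal columns, so Q.T @ Q is the identity...
ortho_err = (Q.T @ Q - torch.eye(target_dim, dtype=torch.float64)).abs().max().item()
# ...but Q @ Q.T is only a projection onto the column space of T, not the identity.
proj_err = (Q @ Q.T - torch.eye(source_dim, dtype=torch.float64)).abs().max().item()
# The factors reproduce the original matrix.
recon_err = (Q @ R - T).abs().max().item()

# Mapping through Q can only keep or shrink a vector's norm.
x = torch.randn(3, source_dim, dtype=torch.float64)
norm_before = x.norm(dim=-1)
norm_after = (x @ Q).norm(dim=-1)
print(f"ortho_err={ortho_err:.2e} proj_err={proj_err:.2e} recon_err={recon_err:.2e}")
```

In other words, norms are preserved exactly only for the part of the signal that lies in the column space of the transport matrix; whatever falls outside it is discarded by the width reduction.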
Apply transformations at attention and FFN layer boundaries, absorbing matrices into adjacent weights.
```python
class CompressedAttentionBlock(torch.nn.Module):
    def __init__(self, original_attention, transport_map_attention):
        super().__init__()
        self.attention = original_attention
        Q_att, R_att = qr_decomposed_transport(transport_map_attention)
        # Buffers follow the module across .to(device) and into the state dict
        self.register_buffer("transport_map", transport_map_attention)
        self.register_buffer("Q_att", Q_att)
        self.register_buffer("R_att", R_att)

    def forward(self, hidden_states):
        # Original attention computation (assumed to return a single tensor)
        attn_output = self.attention(hidden_states)  # (batch, seq, orig_dim)
        # Apply transport mapping at output
        return attn_output @ self.Q_att  # (batch, seq, target_dim)


class CompressedFFNBlock(torch.nn.Module):
    def __init__(self, original_ffn, transport_map_ffn):
        super().__init__()
        self.ffn = original_ffn
        Q_ffn, R_ffn = qr_decomposed_transport(transport_map_ffn)
        self.register_buffer("transport_map", transport_map_ffn)
        self.register_buffer("Q_ffn", Q_ffn)
        self.register_buffer("R_ffn", R_ffn)

    def forward(self, hidden_states):
        # Original FFN computation
        ffn_output = self.ffn(hidden_states)  # (batch, seq, orig_dim)
        # Apply transport mapping
        return ffn_output @ self.Q_ffn  # (batch, seq, target_dim)


def absorb_transport_into_weights(layer, transport_map, position='output'):
    """
    Absorb transformation matrix into layer weights for efficient inference.

    Args:
        layer: torch.nn.Linear layer to modify
        transport_map: (source_dim, target_dim) transport matrix
        position: 'input' or 'output' where to apply transformation
    Returns:
        Modified layer with absorbed transformation
    """
    Q, R = qr_decomposed_transport(transport_map)
    Q, R = Q.to(layer.weight), R.to(layer.weight)  # match dtype and device
    with torch.no_grad():
        if position == 'output':
            # nn.Linear computes x @ W.T + b with W of shape (out, in), so
            # mapping the output by M = Q @ R^-1 gives W_new = M.T @ W_old.
            M = Q @ torch.linalg.inv(R)
            layer.weight = torch.nn.Parameter(M.T @ layer.weight)
            if layer.bias is not None:
                layer.bias = torch.nn.Parameter(layer.bias @ M)
            layer.out_features = Q.shape[1]
        elif position == 'input':
            # Assumption: the inputs were already compressed by M = Q @ R^-1.
            # Its pseudo-inverse is R @ Q.T, which gives W_new = W_old @ Q @ R.T.
            layer.weight = torch.nn.Parameter(layer.weight @ Q @ R.T)
            layer.in_features = Q.shape[1]
        else:
            raise ValueError(f"position must be 'input' or 'output', got {position!r}")
    return layer
```
Use a small calibration dataset to compute transport maps without full retraining.
```python
def compress_llm_with_optimal_transport(
    model,
    calibration_data,
    target_width_ratio=0.75,  # Keep 75% of neurons
    calibration_samples=128
):
    """
    Compress LLM using optimal transport without retraining.

    Args:
        model: Original LLM model
        calibration_data: Small dataset (128-256 sequences) for calibration
        target_width_ratio: Fraction of neurons to retain
        calibration_samples: Number of token activations to use for calibration
    Returns:
        Compressed model with transport maps absorbed
    """
    device = next(model.parameters()).device
    # Collect activations on calibration data
    model.eval()
    all_source_acts = []
    with torch.no_grad():
        for batch in calibration_data:
            input_ids = batch['input_ids'].to(device)
            # Forward pass, collecting layer outputs
            outputs = model(input_ids, output_hidden_states=True)
            last_hidden = outputs.hidden_states[-1]
            # Flatten (batch, seq, dim) into one row per token
            all_source_acts.append(last_hidden.reshape(-1, last_hidden.shape[-1]))
    source_activations = torch.cat(all_source_acts, dim=0)[:calibration_samples].float()
    # Determine target dimension
    source_dim = source_activations.shape[1]
    target_dim = int(source_dim * target_width_ratio)
    # Compute target neuron activations (select top variance neurons)
    neuron_variance = source_activations.var(dim=0)
    _, top_indices = torch.topk(neuron_variance, target_dim)
    target_activations = source_activations[:, top_indices]
    # Compute optimal transport map (the compressor needs both dimensions)
    compressor = OptimalTransportCompressor(source_dim, target_dim, regularization_lambda=0.1)
    transport_map = compressor.compute_transport_map(source_activations, target_activations)
    # Apply to model. The attribute names below (transformer.h, attn.out_proj,
    # mlp.down_proj) are assumptions; adapt them to the architecture in use.
    # Every layer that reads the narrowed activations needs a matching
    # position='input' absorption as well, which this sketch does not perform.
    for layer in model.transformer.h:
        # Compress attention output projection
        absorb_transport_into_weights(layer.attn.out_proj, transport_map, position='output')
        # Compress FFN output projection
        absorb_transport_into_weights(layer.mlp.down_proj, transport_map, position='output')
    return model
```
| Parameter | Value | Notes |
|-----------|-------|-------|
| Width Reduction Target | 20-30% | 70-80% neuron retention; larger reductions hurt accuracy |
| Sinkhorn λ (Regularization) | 0.1 | Controls entropy regularization strength; 0.01-1.0 range works |
| Calibration Samples | 128-256 | Minimum for stable transport computation; 128 token sequences |
| Cost Metric | L1 distance | L1 more robust than L2 for outlier neurons |
| Effective Rank Threshold | 99% variance | Select target neurons with cumulative 99% activation variance |
| Sinkhorn Iterations | 100 | Convergence typically achieved in 50-100 iterations |
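For convenience, the defaults in the table can be gathered into one configuration object. This is only a sketch: the class name `DOTResizeConfig` and its field names are hypothetical, chosen for illustration rather than taken from any DOTResize release.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class DOTResizeConfig:
    """Hypothetical container for the hyperparameters listed in the table."""

    width_ratio: float = 0.75        # keep 75% of neurons (a 25% width reduction)
    sinkhorn_reg: float = 0.1        # entropy regularization strength (λ)
    sinkhorn_iters: int = 100        # maximum Sinkhorn iterations
    calibration_samples: int = 128   # calibration sequences
    cost_metric: str = "l1"          # pairwise distance between neuron activations

    def __post_init__(self) -> None:
        if not 0.0 < self.width_ratio <= 1.0:
            raise ValueError("width_ratio must be in (0, 1]")
        if self.sinkhorn_reg <= 0.0:
            raise ValueError("sinkhorn_reg must be positive")

    def target_dim(self, source_dim: int) -> int:
        """Number of neurons retained for a layer of width source_dim."""
        return int(source_dim * self.width_ratio)


config = DOTResizeConfig()
print(config.target_dim(4096))  # 3072
```

Validating the ratio up front keeps an out-of-range value from silently producing a zero-width or over-wide layer later in the pipeline.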
Chen, Y., Liu, S., Wang, Z., et al. (2025). DOTResize: Reducing LLM Width via Discrete Optimal Transport-based Neuron Merging. arXiv preprint arXiv:2507.04517.