skills/skillxiv-v0.0.2-claude-opus-4.6/bottom-up-policy/SKILL.md
Optimize language model policies layer-by-layer rather than monolithically to understand internal reasoning structure. Decompose models into per-layer and per-module policies via residual streams, analyze entropy patterns revealing exploration→convergence phases, and optimize layers sequentially—improving reasoning on math tasks by up to 4.69 points.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ADu2021/skillXiv bottom-up-policy
Bottom-up Policy Optimization (BuPO) treats language models as compositional reasoning systems rather than monolithic policies. By analyzing internal layer policies via residual streams, the framework reveals that models naturally exhibit a universal structure: early layers explore solution spaces while top layers converge to predictions. Sequential optimization respects this structure.
The key insight is that residual streams enable additive decomposition of layer and module policies.
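As a sketch of what this decomposition looks like, assume a standard residual Transformer with $L$ layers. The symbols here are notation chosen for illustration, not taken from the source: $h^{0}$ is the embedding output, $a^{l}$ and $f^{l}$ are the attention and FFN outputs of layer $l$, and $W_U$ is the unembedding matrix.

```latex
\begin{aligned}
h^{l} &= h^{l-1} + a^{l} + f^{l}, \qquad l = 1, \dots, L \\
h^{L} &= h^{0} + \sum_{l=1}^{L} \left( a^{l} + f^{l} \right) \\
\pi^{l}(\cdot \mid s) &= \operatorname{softmax}\!\left( h^{l} W_U \right)
\end{aligned}
```

Because the unembedding is linear, the final logits $h^{L} W_U$ split into one term per layer and per module. This ignores any final normalization layer, which a real model may apply before unembedding.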
Internal Policy Decomposition: Define policies at different architectural levels using hidden states and the unembedding matrix.
# Layer and module policy definition
# Assumes a model exposing `layers` and an nn.Linear-style `unembedding`.
class InternalPolicyDecomposition:
    def __init__(self, model):
        self.model = model
        self.num_layers = len(model.layers)
        self.unembedding = model.unembedding

    def layer_policy(self, layer_idx):
        """
        Define the policy for an individual layer from its hidden state.
        Policy: hidden_state @ unembedding^T → logits
        """
        def pi_layer(residual_stream):
            # Hidden state of the residual stream after this layer
            layer_output = residual_stream[layer_idx]
            # Convert to logits via unembedding; the weight is assumed to
            # have shape (vocab, hidden), hence the transpose
            logits = layer_output @ self.unembedding.weight.T
            return logits
        return pi_layer

    def module_policy(self, layer_idx, module_type):
        """
        Define the policy for an individual module (attention vs FFN).
        Isolate each module's contribution to reasoning.
        """
        if module_type == 'attention':
            return lambda x: self.model.layers[layer_idx].self_attn(x)
        elif module_type == 'ffn':
            return lambda x: self.model.layers[layer_idx].mlp(x)
        # Fail loudly instead of silently returning None
        raise ValueError(f"Unknown module_type: {module_type!r}")
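To make the decomposition concrete, here is a minimal NumPy sketch on a toy residual stream. Every quantity is a hypothetical stand-in (random vectors in place of real attention and FFN outputs, and no final normalization layer). It only illustrates that per-layer policies can be read from the running sum, and that the final logits equal the sum of each contribution's logits.

```python
import numpy as np

rng = np.random.default_rng(0)
hidden, vocab, num_layers = 8, 5, 3

# Hypothetical toy quantities standing in for a real model's activations
h0 = rng.normal(size=hidden)                       # embedding output
attn_out = rng.normal(size=(num_layers, hidden))   # per-layer attention contributions
ffn_out = rng.normal(size=(num_layers, hidden))    # per-layer FFN contributions
W_U = rng.normal(size=(hidden, vocab))             # unembedding matrix


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


# The hidden state after each layer is a running sum of contributions
residual_stream = h0 + np.cumsum(attn_out + ffn_out, axis=0)  # (num_layers, hidden)

# One policy (a distribution over the vocabulary) per layer
layer_policies = np.stack([softmax(h @ W_U) for h in residual_stream])

# Additivity: final logits equal the sum of every contribution's logits
final_logits = residual_stream[-1] @ W_U
summed_logits = h0 @ W_U + (attn_out @ W_U).sum(axis=0) + (ffn_out @ W_U).sum(axis=0)
additive = np.allclose(final_logits, summed_logits)
```

The additivity holds at the logit level only. The softmax is nonlinear, so the layer policies themselves do not add.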
Internal Policy Entropy Analysis: Entropy patterns reveal universal reasoning structure across models.
import torch
import torch.nn.functional as F

def analyze_entropy_structure(model, dataset):
    """
    Measure entropy of each layer's policy across inputs.
    High entropy: exploration of solution space
    Low entropy: convergence to prediction
    """
    entropy_by_layer = {}
    for layer_idx in range(len(model.layers)):
        layer_entropies = []
        for batch in dataset:
            with torch.no_grad():
                residual_streams = model.get_residual_streams(batch)
                layer_hidden = residual_streams[layer_idx]
                # Compute logits for this layer (weight assumed (vocab, hidden))
                logits = layer_hidden @ model.unembedding.weight.T
                # log_softmax avoids log(0) for near-zero probabilities
                log_probs = F.log_softmax(logits, dim=-1)
                # Entropy of the layer's policy, averaged over positions
                entropy = -(log_probs.exp() * log_probs).sum(dim=-1).mean()
            layer_entropies.append(entropy.item())
        entropy_by_layer[layer_idx] = sum(layer_entropies) / len(layer_entropies)
    # Typical pattern:
    # - Early layers: high entropy (exploring)
    # - Middle layers: medium entropy
    # - Top layers: low entropy (converged)
    return entropy_by_layer
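As a small self-contained check of the entropy measure itself, the following NumPy sketch computes policy entropy from logits and compares two extreme cases. The helper name `policy_entropy` and the vocabulary size are illustrative assumptions, not part of the original skill.

```python
import numpy as np


def policy_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits) along the last axis."""
    # Work in log space for numerical stability
    z = logits - logits.max(axis=-1, keepdims=True)
    log_probs = z - np.log(np.exp(z).sum(axis=-1, keepdims=True))
    return -(np.exp(log_probs) * log_probs).sum(axis=-1)


vocab = 1000  # illustrative vocabulary size

# "Exploring" policy: all tokens equally likely
uniform_logits = np.zeros(vocab)

# "Converged" policy: almost all mass on one token
peaked_logits = np.zeros(vocab)
peaked_logits[0] = 50.0

h_uniform = policy_entropy(uniform_logits)
h_peaked = policy_entropy(peaked_logits)
max_entropy = np.log(vocab)  # upper bound, reached by the uniform policy
```

Dividing by `max_entropy` gives a value in [0, 1], which makes layers comparable across models with different vocabulary sizes.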
Sequential Layer-by-Layer Optimization: Optimize layers in order, establishing better foundations for upper layers.
import torch

def sequential_layer_optimization(model, dataset, reward_fn, log_prob_fn, lr=1e-5):
    """
    Optimize each layer sequentially, respecting natural reasoning structure:
    early layers → feature refinement
    top layers → final prediction

    `reward_fn(outputs)` and `log_prob_fn(outputs, batch)` are caller-supplied
    placeholders that return tensors; `model.forward_until_layer` is assumed.
    """
    num_layers = len(model.layers)
    for layer_idx in range(num_layers):
        print(f"Optimizing layer {layer_idx + 1}/{num_layers}")
        # Freeze all other layers
        for i in range(num_layers):
            for param in model.layers[i].parameters():
                param.requires_grad = (i == layer_idx)
        # Optimizer over this layer's parameters only
        optimizer = torch.optim.AdamW(model.layers[layer_idx].parameters(), lr=lr)
        for batch in dataset:
            # Baseline from the frozen layers below; layer 0 has none,
            # so a zero baseline is used there (a choice made in this sketch)
            with torch.no_grad():
                if layer_idx == 0:
                    baseline_reward = 0.0
                else:
                    baseline = model.forward_until_layer(batch, layer_idx - 1)
                    baseline_reward = reward_fn(baseline)
            # Get predictions with this layer
            with_layer = model.forward_until_layer(batch, layer_idx)
            # Advantage: improvement from this layer, treated as a constant
            advantage = (reward_fn(with_layer) - baseline_reward).detach()
            # Policy-gradient (REINFORCE-style) loss for this layer only;
            # a full PPO update would add the clipped probability ratio
            policy_loss = -(advantage * log_prob_fn(with_layer, batch)).mean()
            optimizer.zero_grad()
            policy_loss.backward()
            # Update this layer only
            optimizer.step()
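The sequential scheme above can be illustrated on a toy problem that runs without a real model. In this sketch, which is an assumption-laden simplification rather than the source's training procedure, each "layer" is a vector of logits added into a shared residual sum, the policy is the softmax of that sum, and the exact policy gradient of expected reward replaces sampled rollouts. The names `optimize_bottom_up`, `rewards`, and `init_layers` are hypothetical.

```python
import numpy as np


def softmax(z):
    """Numerically stable softmax over a 1-D vector."""
    z = z - z.max()
    e = np.exp(z)
    return e / e.sum()


def expected_reward(layers, rewards):
    """Expected reward of the policy softmax(sum of per-layer logits)."""
    return float(softmax(layers.sum(axis=0)) @ rewards)


def optimize_bottom_up(layers, rewards, steps=200, lr=0.5):
    """Train one layer at a time, lowest first, keeping the others frozen."""
    layers = layers.copy()
    history = [expected_reward(layers, rewards)]
    for layer_idx in range(layers.shape[0]):
        for _ in range(steps):
            probs = softmax(layers.sum(axis=0))
            baseline = probs @ rewards  # expected reward acts as the baseline
            # Exact gradient of expected reward with respect to the logits
            grad = probs * (rewards - baseline)
            # Only the current layer is updated; all others stay fixed
            layers[layer_idx] += lr * grad
        history.append(expected_reward(layers, rewards))
    return layers, history


rewards = np.array([0.0, 0.2, 1.0, 0.1])  # toy per-action rewards
init_layers = np.zeros((3, 4))            # 3 layers, 4 actions, uniform start
trained_layers, history = optimize_bottom_up(init_layers, rewards)
```

Here `history` holds the expected reward before training and after each layer's stage, so it has one more entry than there are layers. Because all toy layers feed the same logit sum, this only demonstrates the freeze-and-update loop, not the layer-specific behavior of a real Transformer.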
Use Bottom-up Policy Optimization when:
Avoid this approach if:
The framework requires: