ADu2021/bottom-up-policy

Name: bottom-up-policy
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/bottom-up-policy/SKILL.md

npx skillsauth add ADu2021/skillXiv bottom-up-policy

Overview

Bottom-up Policy Optimization (BuPO) treats language models as compositional reasoning systems rather than monolithic policies. By analyzing internal layer policies via residual streams, the framework reveals that models naturally exhibit a universal structure: early layers explore solution spaces while top layers converge to predictions. Sequential optimization respects this structure.

Core Technique

The key insight is that residual streams enable additive decomposition of layer and module policies.
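To make the additive decomposition concrete, here is a minimal pure-Python sketch. It assumes a toy unembedding matrix and made-up contribution vectors (the names `W_U`, `embed`, `attn_0`, and `ffn_0` are illustrative, not from the source), and it ignores any final normalization layer. Because the unembedding is linear, the logits of the summed residual stream equal the sum of the logits of each contribution. The additivity holds for logits only; the softmax that follows is nonlinear.

```python
import math

def matvec(matrix, vec):
    """Multiply a (rows x cols) matrix by a length-cols vector."""
    return [sum(w * x for w, x in zip(row, vec)) for row in matrix]

def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    peak = max(logits)
    exps = [math.exp(z - peak) for z in logits]
    total = sum(exps)
    return [e / total for e in exps]

# Toy unembedding matrix: 3 vocabulary tokens, hidden size 2 (made-up numbers).
W_U = [[1.0, 0.0],
       [0.0, 1.0],
       [1.0, 1.0]]

# Residual-stream contributions: the embedding, then each module's output.
contributions = {
    "embed":  [0.5, -0.5],
    "attn_0": [1.0, 0.0],
    "ffn_0":  [0.0, 2.0],
}

# The residual stream is the running sum of all contributions so far.
hidden = [sum(vals) for vals in zip(*contributions.values())]

# Logits of the summed state equal the sum of per-contribution logits.
logits_total = matvec(W_U, hidden)
logits_parts = [matvec(W_U, c) for c in contributions.values()]
logits_summed = [sum(vals) for vals in zip(*logits_parts)]

# The policy is the softmax of the total logits.
policy = softmax(logits_total)
```

Each entry of `logits_parts` can be read as one module's vote on the vocabulary, which is what makes per-layer and per-module policies well defined.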

Internal Policy Decomposition: Define policies at different architectural levels using hidden states and the unembedding matrix.

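# Layer and module policy definition
# `model` is a hypothetical wrapper: `model.layers` is a list of transformer blocks and
# `model.unembedding` is an nn.Linear-style head with weight of shape (vocab, hidden).
class InternalPolicyDecomposition:
    def __init__(self, model):
        self.model = model
        self.num_layers = len(model.layers)
        self.unembedding = model.unembedding

    def _to_logits(self, hidden):
        # Project a hidden state of shape (..., hidden) onto the vocabulary
        return hidden @ self.unembedding.weight.T

    def layer_policy(self, layer_idx):
        """
        Define policy for individual layer via the residual stream after that layer.
        Policy: hidden_state @ unembedding^T → logits
        """
        def pi_layer(residual_stream):
            # residual_stream[layer_idx]: hidden state after layer `layer_idx`
            layer_output = residual_stream[layer_idx]
            # Convert to logits via unembedding
            return self._to_logits(layer_output)
        return pi_layer

    def module_policy(self, layer_idx, module_type):
        """
        Define policy for individual module (attention vs FFN).
        Isolate each module's contribution to reasoning.
        """
        layer = self.model.layers[layer_idx]
        if module_type == 'attention':
            module = layer.self_attn
        elif module_type == 'ffn':
            module = layer.mlp
        else:
            raise ValueError(f"unknown module_type: {module_type!r}")
        # Assumes the module maps a hidden state to a tensor of the same shape;
        # its output is projected to logits like any other residual contribution
        return lambda x: self._to_logits(module(x))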

Internal Policy Entropy Analysis: Entropy patterns reveal universal reasoning structure across models.

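import torch
import torch.nn.functional as F

def analyze_entropy_structure(model, dataset):
    """
    Measure entropy of each layer's policy across inputs.
    High entropy: exploration of solution space
    Low entropy: convergence to prediction

    Assumes hypothetical helpers: `model.get_residual_streams(batch)` returns one
    hidden-state tensor per layer, and `model.unembedding.weight` has shape (vocab, hidden).
    """
    entropy_by_layer = {}

    with torch.no_grad():  # analysis only, no gradients needed
        for layer_idx in range(len(model.layers)):
            layer_entropies = []

            for batch in dataset:
                residual_streams = model.get_residual_streams(batch)
                layer_hidden = residual_streams[layer_idx]

                # Compute logits for this layer
                logits = layer_hidden @ model.unembedding.weight.T
                # log_softmax avoids taking log(0) for near-zero probabilities
                log_probs = F.log_softmax(logits, dim=-1)

                # Entropy of layer's policy, averaged over positions in the batch
                entropy = -(log_probs.exp() * log_probs).sum(dim=-1)
                layer_entropies.append(entropy.mean().item())

            entropy_by_layer[layer_idx] = sum(layer_entropies) / len(layer_entropies)

    # Typical pattern:
    # - Early layers: high entropy (exploring)
    # - Middle layers: medium entropy
    # - Top layers: low entropy (converged)

    return entropy_by_layer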
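
As a worked illustration of what "high" and "low" entropy mean here, the following pure-Python sketch computes the entropy of two toy policies over four tokens. The logit values are invented for the example. A uniform policy reaches the maximum entropy, log(4) nats, while a sharply peaked policy has entropy close to zero.

```python
import math

def policy_entropy(logits):
    """Shannon entropy (in nats) of softmax(logits), computed stably."""
    peak = max(logits)
    shifted = [z - peak for z in logits]
    log_norm = math.log(sum(math.exp(z) for z in shifted))
    log_probs = [z - log_norm for z in shifted]
    return -sum(math.exp(lp) * lp for lp in log_probs)

flat_logits = [0.0, 0.0, 0.0, 0.0]      # uniform over 4 tokens
peaked_logits = [10.0, 0.0, 0.0, 0.0]   # almost all mass on one token

flat_entropy = policy_entropy(flat_logits)      # equals log(4), the maximum for 4 tokens
peaked_entropy = policy_entropy(peaked_logits)  # close to zero
```

In the pattern described above, an early layer's policy would resemble the flat case and a top layer's policy the peaked case.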

Sequential Layer-by-Layer Optimization: Optimize layers in order, establishing better foundations for upper layers.

import torch

def sequential_layer_optimization(model, dataset, reward_fn, log_prob_fn, lr=1e-6):
    """
    Optimize each layer sequentially, respecting natural reasoning structure:
    early layers → feature refinement
    top layers → final prediction

    Hypothetical helpers (placeholders, not part of any library):
      model.forward_until_layer(batch, k): predictions decoded after layer k
      reward_fn(predictions): per-sample reward tensor
      log_prob_fn(model, batch, k): per-sample log-probability of those predictions
                                    under layer k's policy, differentiable w.r.t. layer k
    """
    num_layers = len(model.layers)

    for layer_idx in range(num_layers):
        print(f"Optimizing layer {layer_idx + 1}/{num_layers}")

        # Freeze all other layers
        for i in range(num_layers):
            for param in model.layers[i].parameters():
                param.requires_grad = (i == layer_idx)
        optimizer = torch.optim.AdamW(model.layers[layer_idx].parameters(), lr=lr)

        for batch in dataset:
            with torch.no_grad():
                # Reward with this layer included
                with_layer = reward_fn(model.forward_until_layer(batch, layer_idx))
                # Baseline from frozen layers up to this point (none below layer 0)
                if layer_idx > 0:
                    baseline = reward_fn(model.forward_until_layer(batch, layer_idx - 1))
                else:
                    baseline = torch.zeros_like(with_layer)
                # Advantage: improvement from this layer
                advantage = with_layer - baseline

            # REINFORCE-style policy-gradient step for this layer only
            # (a full PPO update would add a clipped importance ratio)
            policy_loss = -(advantage * log_prob_fn(model, batch, layer_idx)).mean()
            optimizer.zero_grad()
            policy_loss.backward()
            optimizer.step()
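
The layer-specific advantage in the sketch above is the reward after a layer minus the reward before it. A small arithmetic example, using invented reward values rather than measured results, shows a useful property of that definition: the per-layer advantages telescope, so they sum to the total improvement from the input to the final layer. The function name `layer_advantages` is introduced here for illustration only.

```python
def layer_advantages(rewards_by_depth):
    """
    rewards_by_depth[k] is the reward when decoding from the state after k layers
    (index 0 = before any layer). Returns one advantage per layer.
    """
    return [after - before
            for before, after in zip(rewards_by_depth, rewards_by_depth[1:])]

# Hypothetical rewards for a 4-layer model, not measured results.
rewards = [0.0, 0.25, 0.25, 0.5, 1.0]
advantages = layer_advantages(rewards)

# Sum of per-layer advantages equals the end-to-end improvement.
total_improvement = rewards[-1] - rewards[0]
```

Under this definition a layer whose advantage is zero (the second layer in the example) contributes nothing measurable to the reward, which is one way to read off where optimization effort might matter.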

When to Use This Technique

Use Bottom-up Policy Optimization when:

The task involves reasoning and interpretability is a requirement
The target domain is math or coding problem-solving
Understanding the model's internal structure is valuable
Sequential optimization fits your hardware and training setup

When NOT to Use This Technique

Avoid this approach if:

A single monolithic policy is the most efficient option
Sequential layer optimization adds unacceptable overhead
Interpretability is not required (end-to-end training is faster)
The model architecture doesn't support residual stream analysis

Implementation Notes

The framework requires:

Access to residual streams at each layer
Unembedding matrix for policy conversion
Layer-wise gradient control and optimization
Entropy analysis infrastructure for structure understanding

Key Performance

Improvements up to 4.69 points on mathematical reasoning (AIME24)
Consistent gains across Qwen and Llama models
Interpretable reasoning structure
Foundation for further optimization strategies

References

Layer and module policy decomposition via residual streams
Internal policy entropy analysis
Sequential optimization respecting reasoning structure
Universal exploration→convergence pattern across models

2 stars

testing

Updated Apr 16, 2026

Install globally:

npx skillsauth add ADu2021/skillXiv bottom-up-policy

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy (container and dependency vulnerability scanner): Clean, 95%
Semgrep (static code analysis for vulnerabilities): Clean, 95%
mcp-scan (Snyk) (Model Context Protocol security validation): Clean, 95%
Snyk (dep) (open source security scanning): Skipped, 50%
Socket.dev (supply chain security analysis): Skipped, 50%
VirusTotal (multi-engine malware detection): Skipped, 50%
CrowdStrike (advanced threat intelligence): Skipped, 50%
OSV-Scanner (Open Source Vulnerability database check): Skipped, 50%
OWASP Dep-Check: Skipped, 50%

Last scanned: Apr 24, 2026, 8:55 PM (1.9s, 1 file scanned)

SKILL.md

name: bottom-up-policy
title: Bottom-up Policy Optimization: Internal Policies in LMs
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2512.19673
keywords: [reinforcement-learning, interpretability, layer-wise, reasoning, llm]
description: Optimize language model policies layer-by-layer rather than monolithically to understand internal reasoning structure. Decompose models into per-layer and per-module policies via residual streams, analyze entropy patterns revealing exploration→convergence phases, and optimize layers sequentially, improving reasoning on math tasks by up to 4.69 points.

Install manually

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into the Claude Code skills folder (global), creating it first if needed
mkdir -p ~/.claude/skills
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/bottom-up-policy ~/.claude/skills/

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT