ADu2021/ace-confidence-penalty

Name: ace-confidence-penalty
Author: ADu2021

skills/skillxiv-v0.0.2-claude-opus-4.6/ace-confidence-penalty/SKILL.md

npx skillsauth add ADu2021/skillXiv ace-confidence-penalty

Technique: Asymmetric Confidence Penalty for RL Error Correction

Large language models trained with reinforcement learning often commit confidently to spurious reasoning paths that are wrong. Uniform error penalization treats all mistakes equally, so confident errors are corrected no more strongly than tentative ones and the model can keep reinforcing incorrect high-confidence predictions. The result is persistent errors on mathematical and reasoning tasks, where confidence calibration is crucial.

ACE addresses this by modulating negative advantages with a confidence shift metric. Instead of penalizing all errors uniformly, it applies stronger penalties to mistakes where the model was confidently wrong, which are the errors most damaging to final performance.

Core Concept

The core insight is that not all errors are equal in RL training:

Low-confidence mistakes: Often contain useful learning signal; standard penalty suffices
High-confidence mistakes: Actively reinforce incorrect patterns; need stronger correction

ACE quantifies confidence as the shift between the policy being trained and a reference policy (typically the base model). The confidence shift metric captures how far the model has moved from its original behavior on a given output; a large shift toward a wrong answer suggests spurious learning.
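In symbols, the description above can be read as follows. The notation is an assumption chosen to match the code in the implementation steps (it is not taken from the paper): $x$ is the prompt, $y$ the sampled response, $A$ the advantage from PPO/GRPO, and $\alpha$ the sensitivity hyperparameter.

```latex
c(x, y) = \log \pi_\theta(y \mid x) - \log \pi_{\mathrm{ref}}(y \mid x)

\tilde{A}(x, y) =
\begin{cases}
  A(x, y)\,\bigl(1 + \alpha \max(c(x, y),\, 0)\bigr) & \text{if } A(x, y) < 0 \\
  A(x, y) & \text{otherwise}
\end{cases}
```

Under this reading, an error the policy has become more confident in than the reference ($c > 0$) is penalized more strongly, while errors with $c \le 0$ and all correct outputs keep their original advantage.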

Architecture Overview

Confidence Shift Metric: Compute log-probability ratio between current and reference policy
Advantage Modulation: Scale negative advantages proportionally to confidence shift
Integration: Drop-in replacement for existing PPO/GRPO training
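Low Overhead: Requires only log-probability tracking; no extra forward passes when the training loop already computes reference-policy log-probabilities (for example, for a KL penalty)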
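The confidence shift metric listed above needs one log-probability per response from each policy. A minimal sketch of that step in plain Python follows, assuming the per-response value is the sum of token log-probabilities (the source gives only the shape `[batch_size]`, so summing rather than averaging is an assumption). The helper names `sequence_log_prob` and `confidence_shift` are hypothetical.

```python
import math


def sequence_log_prob(token_logits, token_ids):
    """Sum the log-probabilities of the sampled tokens.

    token_logits: one list of vocabulary logits per generated position.
    token_ids: the index of the token actually sampled at each position.
    Assumption: the sequence score is the SUM over tokens, not the mean.
    """
    total = 0.0
    for logits, tok in zip(token_logits, token_ids):
        # Numerically stable log-softmax: subtract the max before exponentiating.
        m = max(logits)
        log_z = m + math.log(sum(math.exp(l - m) for l in logits))
        total += logits[tok] - log_z
    return total


def confidence_shift(cur_logits, ref_logits, token_ids):
    """Log-probability ratio between current and reference policy."""
    return sequence_log_prob(cur_logits, token_ids) - sequence_log_prob(
        ref_logits, token_ids
    )


# Toy example: vocabulary of 3 tokens, response of 2 tokens.
ref = [[0.0, 0.0, 0.0], [0.0, 0.0, 0.0]]  # reference is uniform
cur = [[2.0, 0.0, 0.0], [0.0, 3.0, 0.0]]  # current policy favors the sampled tokens
shift = confidence_shift(cur, ref, [0, 1])
print(shift > 0)  # the current policy is more confident than the reference
```

In a real training loop the same quantity would come from batched tensor operations; the loop form here is only meant to make the arithmetic explicit.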

Implementation Steps

ACE integrates seamlessly into standard RL training loops. Here's how to add it to GRPO or PPO training:

Calculate the confidence shift for each training example. The shift metric compares the current policy's likelihood against a reference policy:

import torch

# Inputs, one value per sampled response:
#   log_prob_current: log-probability under the policy being trained, shape [batch_size]
#   log_prob_ref: log-probability under the reference policy, shape [batch_size]
#   advantages: advantages from the RL algorithm (GRPO/PPO), shape [batch_size]
#   alpha: sensitivity to confidence shift (float)

# Confidence shift: log-probability ratio between current and reference policy.
# Detached on the assumption that it acts as a weight and should carry no gradient.
confidence_shift = (log_prob_current - log_prob_ref).detach()

# Asymmetric penalty: stronger correction for high-confidence errors.
# The scale factor grows with positive confidence shift and is never below 1.
scale_factor = 1.0 + torch.clamp(confidence_shift, min=0.0) * alpha
scaled_advantages = torch.where(
    advantages < 0,  # Negative advantages (errors)
    advantages * scale_factor,  # Amplify penalty for confident errors
    advantages,  # Keep non-negative advantages unchanged
)

# Use scaled_advantages in the standard PPO/GRPO loss computation, e.g.
# policy_loss = -(scaled_advantages * log_prob_current).mean()

Integrate this into your existing training loop by replacing the raw advantages with scaled advantages in the loss computation. The parameter alpha controls sensitivity (typical range 0.5–2.0).
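To make the effect of alpha concrete, here is a small worked example in plain Python that mirrors the scaling rule above on single numbers. The function name `ace_scale` and the sample values are illustrative assumptions, not part of the skill.

```python
def ace_scale(advantage, confidence_shift, alpha):
    """Scalar version of the asymmetric scaling rule.

    Negative advantages are multiplied by 1 + alpha * max(shift, 0);
    non-negative advantages are returned unchanged.
    """
    if advantage >= 0:
        return advantage
    return advantage * (1.0 + alpha * max(confidence_shift, 0.0))


# A confidently wrong response: advantage -1.0, confidence shift +2.0.
print(ace_scale(-1.0, 2.0, alpha=0.5))  # -2.0
print(ace_scale(-1.0, 2.0, alpha=2.0))  # -5.0

# A wrong response the policy is less confident in than the reference:
# the clamp leaves the penalty at its original size.
print(ace_scale(-1.0, -1.5, alpha=2.0))  # -1.0

# A correct response is never rescaled, whatever the shift.
print(ace_scale(1.0, 2.0, alpha=2.0))  # 1.0
```

Across the quoted range of alpha, the same confident error is penalized between 2x and 5x the baseline in this example, which is why starting at the low end is the safer default.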

Practical Guidance

When to Use:

Training LLMs on verifiable reasoning tasks (math, logic, code)
When the base model has reasonable reference performance
When you observe high-confidence errors persisting through training
Works best with RLVR (verifiable reward) paradigms

When NOT to Use:

Open-ended generation tasks without clear correctness signal
When most errors are already low-confidence
Very early training stages where you want maximum exploration
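One way to check whether these criteria apply is to measure how many errors in a batch are high-confidence before enabling ACE. The sketch below is an assumption about how such a check could look; the function name `confident_error_rate` and the `threshold` parameter are hypothetical and not defined by the source.

```python
def confident_error_rate(advantages, confidence_shifts, threshold=0.0):
    """Fraction of negative-advantage samples whose confidence shift exceeds threshold.

    Returns 0.0 when the batch contains no errors.
    """
    error_shifts = [c for a, c in zip(advantages, confidence_shifts) if a < 0]
    if not error_shifts:
        return 0.0
    confident = sum(1 for c in error_shifts if c > threshold)
    return confident / len(error_shifts)


# Toy batch: three errors (negative advantage), one correct response.
advantages = [-1.0, -0.5, 1.0, -2.0]
shifts = [0.8, -0.2, 1.0, 0.3]
rate = confident_error_rate(advantages, shifts)
print(f"{rate:.2f} of errors are high-confidence")
```

A rate near zero suggests most errors are already low-confidence, in which case the scaling would rarely activate.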

Hyperparameters:

alpha: Sensitivity to confidence shift (0.5–2.0). Higher values = stronger penalty for overconfident errors
Reference policy: Use base model weights or exponential moving average of weights
Only apply to negative advantages to avoid suppressing learning from correct outputs
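For the exponential-moving-average option, the update can be sketched as below. This is a toy version over a dict of scalar parameters; the name `ema_update` and the default `decay=0.99` are assumptions for illustration, since the source does not give a decay value.

```python
def ema_update(ref_params, cur_params, decay=0.99):
    """Return new reference parameters as an exponential moving average.

    new_ref = decay * ref + (1 - decay) * current, applied per parameter.
    A decay close to 1 keeps the reference policy slow-moving.
    """
    return {
        name: decay * ref_params[name] + (1.0 - decay) * cur_params[name]
        for name in ref_params
    }


ref = {"w": 1.0, "b": 0.0}
cur = {"w": 0.0, "b": 1.0}
ref = ema_update(ref, cur, decay=0.9)
print(ref)  # w moves 10% of the way toward the current value, as does b
```

With real model weights the same formula would be applied tensor by tensor, without gradient tracking.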

Common Pitfalls:

Setting alpha too high causes training instability; start conservative (0.5) and tune upward
Reference policy must be stable; don't update it too frequently
ACE amplifies penalties, so ensure your base reward signals are well-calibrated
Monitor loss curves; sharp divergence indicates miscalibration
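The loss-monitoring advice can be automated with a simple spike detector. The sketch below flags a loss value that deviates from the recent window by several standard deviations; the class name `LossSpikeMonitor` and its thresholds (`window`, `num_std`, `min_history`) are assumptions and would need tuning for a real run.

```python
import statistics
from collections import deque


class LossSpikeMonitor:
    """Flag loss values that deviate sharply from a recent window."""

    def __init__(self, window=20, num_std=4.0, min_history=5):
        self.history = deque(maxlen=window)
        self.num_std = num_std
        self.min_history = min_history

    def update(self, loss):
        """Record a loss value; return True if it looks like a spike."""
        spiked = False
        if len(self.history) >= self.min_history:
            mean = statistics.fmean(self.history)
            std = statistics.pstdev(self.history)
            # Floor the std so a perfectly flat window does not divide the
            # threshold down to zero.
            spiked = abs(loss - mean) > self.num_std * max(std, 1e-8)
        self.history.append(loss)
        return spiked


monitor = LossSpikeMonitor()
losses = [1.0, 1.1, 0.9, 1.0, 1.05, 0.95, 5.0]
flags = [monitor.update(l) for l in losses]
print(flags)  # only the final value is flagged
```

If the detector fires soon after enabling ACE, lowering alpha is the first thing to try, in line with the pitfall about starting conservative.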

Integration: The method requires no architectural changes. Add the confidence shift computation and advantage scaling to an existing PPO/GRPO implementation. Tested on model sizes from 8B to 685B.

Reference: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for RL (https://arxiv.org/abs/2602.21420)

Asymmetric Confidence-aware Error Penalty (ACE) dynamically penalizes overconfident mistakes in RL training, improving reasoning quality without requiring additional computation.

2 stars

testing

Updated Apr 16, 2026

npx skillsauth add ADu2021/skillXiv ace-confidence-penalty

Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.

Security Scan Results

3 of 9 scanners reported clean

Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.

Trivy: Container and dependency vulnerability scanner. Status: Clean (95%)
Semgrep: Static code analysis for vulnerabilities. Status: Clean (95%)
mcp-scan (Snyk): Model Context Protocol security validation. Status: Clean (95%)
Snyk (dep): Open source security scanning. Status: Skipped (50%)
Socket.dev: Supply chain security analysis. Status: Skipped (50%)
VirusTotal: Multi-engine malware detection. Status: Skipped (50%)
CrowdStrike: Advanced threat intelligence. Status: Skipped (50%)
OSV-Scanner: Open Source Vulnerability database check. Status: Skipped (50%)
OWASP Dep-Check. Status: Skipped (50%)

Last scanned: Apr 16, 2026, 3:02 PM (4.7s, 1 file scanned)

SKILL.md

name: ace-confidence-penalty
title: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for RL
version: 0.0.2
engine: skillxiv-v0.0.2-claude-opus-4.6
license: MIT
url: https://arxiv.org/abs/2602.21420
keywords: [Reinforcement Learning, Confidence Calibration, Error Penalty, LLM Reasoning, RLVR]
description: Asymmetric Confidence-aware Error Penalty (ACE) dynamically penalizes overconfident mistakes in RL training, improving reasoning quality without requiring additional computation.

Related Skills

ADu2021/flow-map-trajectory-tilting

testing

Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.

2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flexible-data-mixture-of-experts

testing

Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.

2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flexibility-trap-diffusion-reasoning

data-ai

Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.

2 · SKILL.md · Updated Apr 17, 2026

ADu2021/flex-continuous-agent-evolution

devops

Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures, achieving 23% improvement on reasoning without gradient-based parameter updates or external training.

2 · SKILL.md · Updated Apr 17, 2026

Install manually

# Clone the repo
git clone https://github.com/ADu2021/skillXiv.git

# Copy into Claude Code skills folder (global); create the folder if it does not exist yet
mkdir -p ~/.claude/skills
cp -r skillXiv/skills/skillxiv-v0.0.2-claude-opus-4.6/ace-confidence-penalty ~/.claude/skills/

Repository

ADu2021/skillXiv

2 stars

Compatible with

Claude Code

OpenAI Codex CLI

ChatGPT