skills/skillxiv-v0.0.2-claude-opus-4.6/ace-confidence-penalty/SKILL.md
Asymmetric Confidence-aware Error Penalty (ACE) dynamically penalizes overconfident mistakes in RL training, improving reasoning quality without requiring additional computation.
Large language models trained with reinforcement learning often commit confidently to spurious reasoning paths that are factually wrong. Uniform error penalization treats all mistakes equally, so it does nothing to single out incorrect high-confidence predictions, and the model can keep reinforcing them. The result is persistent errors on mathematical and reasoning tasks, where confidence calibration is crucial.
ACE addresses this by modulating negative advantages based on a confidence shift metric. Instead of penalizing all errors uniformly, it applies stronger penalties to mistakes where the model was confidently wrong—exactly the errors most damaging to final performance.
The core insight is that not all errors are equal in RL training: a mistake the model makes with high confidence does more damage than one it makes while uncertain.
ACE quantifies confidence as the shift between the policy being trained and a reference policy (typically the base model). A confidence shift metric captures how much the model is diverging from its original behavior—high divergence on wrong answers suggests spurious learning.
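One way to write this down, consistent with the code in this section, is sketched below. The symbols are introduced here for illustration and are not taken from the paper: $\pi_\theta$ is the policy being trained, $\pi_{\text{ref}}$ the reference policy, $y$ a sampled response to prompt $x$, $A$ the advantage from the RL algorithm, and $\alpha$ the sensitivity hyperparameter.

```latex
c(x, y) = \log \pi_\theta(y \mid x) - \log \pi_{\text{ref}}(y \mid x)

\tilde{A} =
\begin{cases}
  A \,\bigl(1 + \alpha \max(0,\, c(x, y))\bigr) & \text{if } A < 0 \\
  A & \text{otherwise}
\end{cases}
```

Under this reading, a negative shift (the policy is less confident than the reference) leaves the penalty at its original size, and positive advantages are never rescaled.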
ACE integrates seamlessly into standard RL training loops. Here's how to add it to GRPO or PPO training:
Calculate the confidence shift for each training example. The shift metric compares the current policy's likelihood against a reference policy:
import torch

alpha = 1.0  # sensitivity to confidence shift (typical range 0.5–2.0)

# After scoring each sampled response under the current and reference policies:
# log_prob_current: shape [batch_size], log-likelihood under the policy being trained
# log_prob_ref: shape [batch_size], log-likelihood under the reference policy
# Assumption: detach so the scale factor acts as a constant weight,
# not as an extra gradient path through log_prob_current
confidence_shift = (log_prob_current - log_prob_ref).detach()

# advantages: shape [batch_size], from the RL algorithm (GRPO/PPO)
advantages = reward_advantage

# Asymmetric penalty: the scale factor grows with a positive confidence shift
# and is applied only where the advantage is negative
scale_factor = 1.0 + torch.clamp(confidence_shift, min=0.0) * alpha
scaled_advantages = torch.where(
    advantages < 0,             # negative advantages (errors)
    advantages * scale_factor,  # amplify penalty for confident errors
    advantages,                 # keep positive advantages unchanged
)

# Use scaled_advantages in the standard PPO/GRPO loss computation
# policy_loss = -scaled_advantages * log_prob_current
Integrate this into your existing training loop by replacing the raw advantages with scaled advantages in the loss computation. The parameter alpha controls sensitivity (typical range 0.5–2.0).
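To get a feel for how alpha changes the penalty, here is a minimal scalar sketch of the scaling rule described above. The function name `ace_scale` and the sample values are hypothetical and chosen only for illustration. It uses plain Python so it runs without PyTorch.

```python
def ace_scale(advantage: float, confidence_shift: float, alpha: float = 1.0) -> float:
    """Scalar version of the asymmetric scaling rule (illustrative only)."""
    if advantage >= 0:
        return advantage  # positive advantages pass through unchanged
    # Only a positive shift (policy more confident than the reference) adds penalty
    return advantage * (1.0 + alpha * max(confidence_shift, 0.0))


# Three hypothetical samples as (advantage, confidence_shift):
# a confident error, an unconfident error, and a confident correct answer
samples = [(-1.0, 2.0), (-1.0, -0.5), (1.0, 2.0)]

for alpha in (0.5, 1.0, 2.0):
    scaled = [ace_scale(a, c, alpha) for a, c in samples]
    print(f"alpha={alpha}: {scaled}")
```

Only the first sample changes as alpha grows. The unconfident error keeps its original penalty and the correct answer is untouched, which is the asymmetry the method relies on.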
When to Use:
When NOT to Use:
Hyperparameters:
alpha: Sensitivity to confidence shift (0.5–2.0). Higher values = stronger penalty for overconfident errors.
Common Pitfalls:
Integration: The method requires no architectural changes. Add the confidence shift computation and advantage scaling to an existing PPO/GRPO implementation. Tested on model sizes from 8B to 685B.
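As a rough sketch of where the scaling could sit in a GRPO-style pipeline, the example below standardizes rewards within one prompt's group and then rescales the negative advantages. Everything here is an assumption for illustration: the helper names `grpo_advantages` and `ace_rescale`, the use of the population standard deviation, and the reward and shift values. A real implementation would operate on tensors inside the training step.

```python
from statistics import mean, pstdev


def grpo_advantages(rewards, eps=1e-6):
    """Group-relative advantages: standardize rewards within one prompt's group."""
    mu, sigma = mean(rewards), pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


def ace_rescale(advantages, shifts, alpha=1.0):
    """Amplify negative advantages by 1 + alpha * max(shift, 0); leave the rest."""
    return [
        a * (1.0 + alpha * max(c, 0.0)) if a < 0 else a
        for a, c in zip(advantages, shifts)
    ]


# Hypothetical group of four responses to one prompt (1.0 = correct, 0.0 = wrong)
rewards = [1.0, 0.0, 0.0, 1.0]
# Hypothetical sequence-level confidence shifts, one per response
shifts = [0.3, 1.5, -0.2, 0.0]

adv = grpo_advantages(rewards)
scaled = ace_rescale(adv, shifts, alpha=1.0)

for r, c, a, s in zip(rewards, shifts, adv, scaled):
    print(f"reward={r} shift={c:+.1f} advantage={a:+.3f} scaled={s:+.3f}")
```

The two wrong responses start with the same advantage, but the one with the large positive shift ends up with a penalty 2.5 times larger, while both correct responses keep their original advantage.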
Reference: Overconfident Errors Need Stronger Correction: Asymmetric Confidence Penalties for RL