skills/skillxiv-v0.0.2-claude-opus-4.6/dupo-dual-preference-optimization/SKILL.md
Implement dual preference optimization to generate self-supervised feedback without manual annotation by decomposing tasks into known/unknown components and reconstructing hidden information from model outputs.
npx skillsauth add ADu2021/skillXiv dupo-dual-preference-optimization
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
DuPO enables large language models to self-verify outputs without manual feedback by leveraging dual learning frameworks. The technique decomposes task inputs into known and unknown components, then constructs complementary reconstruction tasks. The model's ability to reconstruct hidden information from its primary output serves as an intrinsic reward signal. This approach eliminates dependency on expensive labeled data while improving reasoning quality across mathematical, translation, and general reasoning domains.
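The reward idea described above can be illustrated on a toy arithmetic task. This is a minimal sketch, not the paper's implementation: the names `dual_reward`, `correct_adder`, `faulty_adder`, and `subtract_known` are hypothetical, and plain Python functions stand in for the language model on both the primal and dual sides.

```python
from typing import Callable


def dual_reward(
    known: int,
    hidden: int,
    primal: Callable[[int, int], int],
    dual: Callable[[int, int], int],
) -> float:
    """Return 1.0 when the dual task recovers the hidden component, else 0.0."""
    output = primal(known, hidden)        # primal task sees the full input
    reconstructed = dual(output, known)   # dual task sees only output + known part
    return 1.0 if reconstructed == hidden else 0.0


def correct_adder(a: int, b: int) -> int:
    """Stand-in for a model that solves the primal task correctly."""
    return a + b


def faulty_adder(a: int, b: int) -> int:
    """Stand-in for a model that makes an off-by-one error."""
    return a + b + 1


def subtract_known(total: int, a: int) -> int:
    """Dual task for addition: recover the hidden addend from the sum."""
    return total - a


reward_ok = dual_reward(3, 4, correct_adder, subtract_known)
reward_bad = dual_reward(3, 4, faulty_adder, subtract_known)
print(reward_ok, reward_bad)
```

A correct primal output lets the dual task recover the hidden addend, so it earns the reward; the faulty output does not. No labeled answer is consulted at any point, which is the property the overview relies on.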
Create paired representations separating known and unknown components:
def decompose_task(task_input: str, task_type: str) -> tuple[str, str]:
    """
    Decompose task into observable and hidden components.
    For math: full equation becomes observable setup + hidden variables
    For translation: source+target becomes observable pairs + masked segments
    """
    if task_type == "math":
        # Extract equation structure, mask coefficients/variables.
        # Both helpers are assumed to be supplied by the caller; they are not defined here.
        observable = extract_equation_skeleton(task_input)
        hidden = extract_hidden_variables(task_input)
    elif task_type == "translation":
        # Preserve source, mask target segments. Input format: "source|||target"
        source, target = task_input.split("|||", 1)
        observable = source  # source language
        hidden = target  # target language
    else:
        # Fail loudly instead of hitting an unbound variable at the return
        raise ValueError(f"Unsupported task_type: {task_type!r}")
    return observable, hidden
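The math branch above calls `extract_equation_skeleton` and `extract_hidden_variables` without defining them. One possible reading, sketched here under the assumption that "hidden variables" means the numeric literals in the problem text, is a regex-based masker. The `<NUM>` placeholder and the comma-joined output format are illustrative choices, not something the source specifies.

```python
import re

# Unsigned integers and decimals; a leading "-" is left in place as an operator.
NUMBER = re.compile(r"\d+(?:\.\d+)?")


def extract_equation_skeleton(task_input: str) -> str:
    """Replace every numeric literal with a placeholder, keeping the structure."""
    return NUMBER.sub("<NUM>", task_input)


def extract_hidden_variables(task_input: str) -> str:
    """Collect the masked numeric literals in order of appearance."""
    return ", ".join(NUMBER.findall(task_input))


problem = "Solve 3x + 12 = 27"
skeleton = extract_equation_skeleton(problem)
hidden = extract_hidden_variables(problem)
print(skeleton)  # Solve <NUM>x + <NUM> = <NUM>
print(hidden)    # 3, 12, 27
```

Masking every number is the bluntest option. A real setup would more likely hide a single quantity so that the dual task has a unique answer.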
Create complementary tasks to reconstruct the hidden information:
def create_dual_task(observable: str, task_type: str) -> str:
    """
    Construct dual task that reconstructs hidden information from primary output.
    `observable` is the text the dual task is allowed to see.
    """
    if task_type == "math":
        # Task: Given solution, recover hidden variables
        prompt = f"Solution: {observable}\nRecover the original variables/coefficients."
    elif task_type == "translation":
        # Task: Given translated output, reconstruct source
        prompt = f"Translated text: {observable}\nReconstruct the original source-language text."
    else:
        # Fail loudly instead of returning an unbound prompt
        raise ValueError(f"Unsupported task_type: {task_type!r}")
    return prompt
Evaluate how well the model reconstructs hidden information:
def compute_reconstruction_score(
    original_hidden: str,
    reconstructed: str,
    similarity_metric: str = "exact_match"
) -> float:
    """
    Measure reconstruction quality as self-supervised reward.
    """
    if similarity_metric == "exact_match":
        return 1.0 if reconstructed.strip() == original_hidden.strip() else 0.0
    elif similarity_metric == "semantic":
        # Use embedding similarity or parsing-based comparison.
        # encode and cosine_similarity are assumed helpers; they are not defined here.
        original_embedding = encode(original_hidden)
        reconstructed_embedding = encode(reconstructed)
        return cosine_similarity(original_embedding, reconstructed_embedding)
    # An unknown metric is a configuration error, not a zero reward
    raise ValueError(f"Unsupported similarity_metric: {similarity_metric!r}")
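The "semantic" branch above depends on `encode` and `cosine_similarity`, which are left undefined. As a dependency-free stand-in, the sketch below uses bag-of-words token counts in place of learned embeddings. This is an assumption made for illustration only; a sentence-embedding model would normally fill this role.

```python
import math
from collections import Counter


def encode(text: str) -> Counter:
    """Toy 'embedding': lowercase whitespace-token counts."""
    return Counter(text.lower().split())


def cosine_similarity(a: Counter, b: Counter) -> float:
    """Cosine similarity between two sparse count vectors, in [0, 1]."""
    dot = sum(a[token] * b[token] for token in a.keys() & b.keys())
    norm_a = math.sqrt(sum(count * count for count in a.values()))
    norm_b = math.sqrt(sum(count * count for count in b.values()))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0  # empty text carries no signal
    return dot / (norm_a * norm_b)


same = cosine_similarity(encode("x = 3, y = 4"), encode("x = 3, y = 4"))
partial = cosine_similarity(encode("the cat sat"), encode("the cat ran"))
disjoint = cosine_similarity(encode("alpha beta"), encode("gamma delta"))
print(round(same, 3), round(partial, 3), disjoint)
```

Unlike exact match, this gives partial credit to near-miss reconstructions, which yields a denser reward but can also reward outputs that share wording without sharing meaning.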
Use reconstruction scores as reward signals in preference learning:
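import torch

def apply_preference_optimization(
    model: LLM,
    training_examples: list[dict],
    reconstruction_scores: list[float],
    learning_rate: float = 1e-5
) -> LLM:
    """
    Optimize model preferences based on reconstruction quality rewards.
    LLM is assumed to be a torch.nn.Module subclass; it is not defined here.
    """
    # The optimizer was previously undefined and learning_rate unused
    optimizer = torch.optim.AdamW(model.parameters(), lr=learning_rate)
    # Sort examples by reconstruction score
    # (create_preference_pairs is an assumed helper, not defined here)
    ranked_pairs = create_preference_pairs(
        training_examples,
        reconstruction_scores
    )
    # Apply DPO-style loss: prefer higher-scored outputs
    for batch in ranked_pairs:
        preferred_output = batch["high_score"]
        dispreferred_output = batch["low_score"]
        # compute_dpo_loss is an assumed helper; standard DPO also needs
        # log-probabilities from a frozen reference model.
        loss = compute_dpo_loss(
            model,
            preferred_output,
            dispreferred_output,
            beta=0.5
        )
        optimizer.zero_grad()
        loss.backward()
        optimizer.step()
    return model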
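The training loop above relies on `create_preference_pairs` and `compute_dpo_loss` without showing them. The sketch below is one way the two pieces could look, under stated assumptions: each training example is a dict with hypothetical `"prompt"` and `"output"` keys, pairs are formed per prompt from the best and worst scored outputs, and the loss is written over precomputed sequence log-probabilities (hence the different name `dpo_loss_from_logps`) so that it runs without a model. The loss itself is the standard DPO objective, which compares the policy against a frozen reference model.

```python
import math


def create_preference_pairs(
    examples: list[dict], scores: list[float], margin: float = 0.0
) -> list[dict]:
    """Pair the best and worst scored outputs for each prompt."""
    by_prompt: dict[str, list[tuple[float, str]]] = {}
    for example, score in zip(examples, scores):
        by_prompt.setdefault(example["prompt"], []).append((score, example["output"]))
    pairs = []
    for prompt, scored in by_prompt.items():
        best = max(scored, key=lambda item: item[0])
        worst = min(scored, key=lambda item: item[0])
        if best[0] - worst[0] > margin:  # skip prompts with no usable preference
            pairs.append({"prompt": prompt, "high_score": best[1], "low_score": worst[1]})
    return pairs


def dpo_loss_from_logps(
    policy_chosen: float,
    policy_rejected: float,
    ref_chosen: float,
    ref_rejected: float,
    beta: float = 0.5,
) -> float:
    """DPO loss -log(sigmoid(z)) for one pair of sequence log-probabilities."""
    z = beta * ((policy_chosen - ref_chosen) - (policy_rejected - ref_rejected))
    # Numerically stable softplus(-z), equal to -log(sigmoid(z))
    return max(-z, 0.0) + math.log1p(math.exp(-abs(z)))


examples = [
    {"prompt": "p1", "output": "A"},
    {"prompt": "p1", "output": "B"},
    {"prompt": "p2", "output": "C"},
    {"prompt": "p2", "output": "D"},
]
pairs = create_preference_pairs(examples, [1.0, 0.0, 1.0, 1.0])
neutral = dpo_loss_from_logps(0.0, 0.0, 0.0, 0.0)
improved = dpo_loss_from_logps(-1.0, -3.0, -2.0, -2.0)
print(pairs, neutral, improved)
```

Prompt `p2` produces no pair because both candidates reconstruct equally well, so there is no preference to learn from. The loss falls below log 2 once the policy favors the chosen output more strongly than the reference does.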
Extend framework to tasks without direct mathematical inverses:
def handle_non_invertible_task(
    primary_output: str,
    task_type: str,
    learned_inverter: LLM
) -> tuple[str, float]:
    """
    For non-invertible tasks, use learned approximation to reconstruct hidden info.
    """
    # Use auxiliary model trained to approximate the inverse
    reconstructed = learned_inverter.generate(
        f"Given output: {primary_output}\nApproximate original input:"
    )
    # Score based on output validity and consistency checks
    validity_score = check_output_consistency(reconstructed, task_type)
    return reconstructed, validity_score
She, S., Bao, Y., Lu, Y., Xu, L., Li, T., Zhu, W., Huang, S., Cheng, S., Lu, L., & Wang, Y. (2025). DuPO: Enabling Reliable LLM Self-Verification via Dual Preference Optimization. arXiv preprint arXiv:2508.14460.