skills/skillxiv-v0.0.2-claude-opus-4.6/cure-coevolving-llm-testing/SKILL.md
Improve code and test generation through co-evolution where LLMs generate both solutions and tests, optimizing each based on mutual evaluation and discriminative testing performance.
npx skillsauth add ADu2021/skillXiv cure-coevolving-llm-testing
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
CURE introduces "mutual supervision", a mechanism in which a code generator and a unit test generator co-evolve through reinforcement learning (RL) without requiring ground-truth solutions. Rather than training on static code-test pairs, the framework generates candidate code, evaluates it against generated tests, and optimizes both components based on how well the tests discriminate between solutions. The approach is theoretically grounded: reward precision (the probability that generated tests rank correct code above incorrect code) converges to 1 as more tests are generated, which enables self-supervised learning.
This removes the dependency on expensive labeled datasets. The reported gains are a 5.3% improvement in one-shot accuracy and a 9.0% improvement in Best-of-N performance, while reducing inference by 64.8% on long-CoT variants.
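The convergence claim for reward precision can be illustrated with a toy model. The sketch below is not the paper's derivation: it assumes each generated test passes a correct solution independently with probability `p_correct` and an incorrect solution with probability `p_incorrect` (both values are illustrative assumptions), then computes the exact probability that the correct solution passes strictly more tests.

```python
from math import comb


def binom_pmf(n, k, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p) ** (n - k)


def ranking_precision(n_tests, p_correct=0.8, p_incorrect=0.4):
    """P(correct solution passes strictly more tests than an incorrect one).

    Toy assumption: tests are independent, and every test passes correct code
    with probability p_correct and incorrect code with probability p_incorrect.
    """
    pmf_x = [binom_pmf(n_tests, k, p_correct) for k in range(n_tests + 1)]
    pmf_y = [binom_pmf(n_tests, k, p_incorrect) for k in range(n_tests + 1)]
    total = 0.0
    cdf_y = 0.0  # running P(Y < k)
    for k in range(n_tests + 1):
        total += pmf_x[k] * cdf_y
        cdf_y += pmf_y[k]
    return total


if __name__ == "__main__":
    for n in (1, 5, 20, 100):
        print(n, round(ranking_precision(n), 4))
```

Under these assumptions the probability climbs toward 1 as `n_tests` grows, provided `p_correct` exceeds `p_incorrect`. This is the intuition behind aggregating many imperfect tests into a usable reward.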
# Pseudo-code for reward calculation
# NOTE: execute_test and solution_correct are helpers assumed to be defined elsewhere.
# This simplified sketch labels correctness against a reference passed in by the caller.
def compute_joint_rewards(candidate_codes, candidate_tests, ground_truth_solution):
    """
    Score code solutions and unit tests based on mutual evaluation.
    Tests are rewarded by discriminative power; code by correctness and test agreement.
    """
    # Execute all code solutions against all tests
    execution_matrix = []  # [num_solutions, num_tests]
    for solution in candidate_codes:
        results = []
        for test in candidate_tests:
            try:
                passed = bool(execute_test(test, solution))
            except Exception:
                # Treat crashes during execution as failures
                passed = False
            results.append(passed)
        execution_matrix.append(results)

    # Label each solution once instead of re-checking it inside the test loop
    is_correct = [solution_correct(code, ground_truth_solution) for code in candidate_codes]
    correct_indices = [i for i, ok in enumerate(is_correct) if ok]
    incorrect_indices = [i for i, ok in enumerate(is_correct) if not ok]

    # Compute code rewards: correctness plus a small bonus for passing more tests
    num_tests = len(candidate_tests)
    code_rewards = []
    for i in range(len(candidate_codes)):
        code_reward = 1.0 if is_correct[i] else 0.0
        if num_tests:  # avoid division by zero when no tests were generated
            code_reward += 0.1 * (sum(execution_matrix[i]) / num_tests)
        code_rewards.append(code_reward)

    # Compute test rewards based on discriminative power:
    # the gap between the pass rate of correct and incorrect solutions
    test_rewards = []
    for j in range(num_tests):
        discrimination_score = 0.0
        if correct_indices and incorrect_indices:
            correct_pass_rate = sum(execution_matrix[i][j] for i in correct_indices) / len(correct_indices)
            incorrect_pass_rate = sum(execution_matrix[i][j] for i in incorrect_indices) / len(incorrect_indices)
            discrimination_score = correct_pass_rate - incorrect_pass_rate
        # Tests that favor incorrect solutions are clipped to zero reward
        test_rewards.append(max(0.0, discrimination_score))
    return code_rewards, test_rewards
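To make the test reward concrete, here is a small self-contained sketch that applies the same pass-rate gap to a hand-written execution matrix. The helper name `discrimination_scores`, the matrix, and the correctness labels are all hypothetical values chosen for illustration.

```python
def discrimination_scores(execution_matrix, is_correct):
    """Per-test gap between pass rates on correct and incorrect solutions.

    execution_matrix[i][j] is True when solution i passes test j.
    is_correct[i] is the (assumed known) correctness label of solution i.
    """
    correct = [row for row, ok in zip(execution_matrix, is_correct) if ok]
    incorrect = [row for row, ok in zip(execution_matrix, is_correct) if not ok]
    num_tests = len(execution_matrix[0]) if execution_matrix else 0
    if not correct or not incorrect:
        # No contrast between the two groups, so no test can discriminate
        return [0.0] * num_tests
    scores = []
    for j in range(num_tests):
        correct_rate = sum(row[j] for row in correct) / len(correct)
        incorrect_rate = sum(row[j] for row in incorrect) / len(incorrect)
        scores.append(max(0.0, correct_rate - incorrect_rate))
    return scores


matrix = [
    [True, True, False],   # correct solution A
    [True, True, False],   # correct solution B
    [True, False, False],  # incorrect solution C
    [True, False, True],   # incorrect solution D
]
labels = [True, True, False, False]
scores = discrimination_scores(matrix, labels)
print(scores)
```

In this example the first test passes every solution and so carries no signal, the second separates the two groups perfectly, and the third favors an incorrect solution and is clipped to zero.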
def apply_length_penalty(reward, response_tokens, target_length=None):
    """
    Penalize overly long responses while preserving correctness signal.
    Useful for chain-of-thought models.
    """
    if target_length is None:
        target_length = 512  # Typical reasoning length
    if response_tokens > target_length:
        # Linear penalty for excessive length, capped at a 50% reduction
        excess = response_tokens - target_length
        length_penalty = 1.0 - min(0.5, excess / (2 * target_length))
    else:
        # Slight bonus for conciseness (at most 5%)
        length_penalty = 1.0 + 0.05 * (1 - response_tokens / target_length)
    adjusted_reward = reward * length_penalty
    return max(0.0, adjusted_reward)
# NOTE: grpo_loss and kl_divergence are helpers assumed to be defined elsewhere.
def training_step(coder, tester, reference_model, task_batch, optimizer_code, optimizer_test):
    """
    Single training iteration co-optimizing code and test generators.
    Generates multiple rollouts per task for stable estimates.
    """
    total_code_loss = 0.0
    total_test_loss = 0.0
    for task in task_batch:
        # Generate candidate solutions and tests
        code_rollouts = coder.sample(task, num_samples=16, temperature=1.0)
        test_rollouts = tester.sample(task, num_samples=16, temperature=1.0)
        # Compute mutual rewards
        code_rewards, test_rewards = compute_joint_rewards(
            code_rollouts, test_rollouts, task['solution']
        )
        # GRPO losses: relative policy optimization
        code_loss = grpo_loss(coder, code_rollouts, code_rewards)
        test_loss = grpo_loss(tester, test_rollouts, test_rewards)
        # Apply KL penalty against the reference model for stability
        code_loss += 0.01 * kl_divergence(coder, reference_model)
        test_loss += 0.01 * kl_divergence(tester, reference_model)
        total_code_loss += code_loss
        total_test_loss += test_loss
    # Update both models (assumes a non-empty batch, so the totals are tensors)
    optimizer_code.zero_grad()
    total_code_loss.backward()
    optimizer_code.step()
    optimizer_test.zero_grad()
    total_test_loss.backward()
    optimizer_test.step()
    return total_code_loss.item(), total_test_loss.item()
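The `grpo_loss` helper is left undefined in the pseudo-code above. As a minimal sketch of the part that gives GRPO its name, the function below normalizes rewards within one group of rollouts for the same task. The function name, the `eps` value, and the use of the population standard deviation are assumptions; the full loss (policy ratios, clipping) is not shown.

```python
from statistics import mean, pstdev


def group_relative_advantages(rewards, eps=1e-6):
    """Normalize rewards within one group of rollouts for the same task.

    Each advantage is the reward's distance from the group mean, scaled by
    the group standard deviation. eps guards against division by zero when
    every rollout in the group receives the same reward.
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)
    return [(r - mu) / (sigma + eps) for r in rewards]


advantages = group_relative_advantages([1.0, 0.0, 0.0, 1.0])
print(advantages)
```

When all rollouts in a group receive the same reward, every advantage is zero and the group contributes no gradient. This is one reason to sample enough rollouts per task (16 in the pseudo-code above) to see a mix of outcomes.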
When to Apply:
Setup Requirements:
Expected Results:
Key Tuning Decisions:
Transferability:
Implemented on Qwen2.5 (7B, 14B) and Qwen3-4B, and evaluated across LiveBench, MBPP, LiveCodeBench, CodeContests, and CodeForces. Training uses 16 A100 GPUs with the GRPO algorithm and mutual supervision. The results demonstrate label-free optimization with no dependency on proprietary model outputs.