skills/skillxiv-v0.0.2-claude-opus-4.6/code-a1-adversarial-rl-code/SKILL.md
Train code and test generators through adversarial co-evolution where test LLM generates adversarial test cases to expose code defects. Prevent self-collusion by separating models and enabling white-box test generation.
npx skillsauth add ADu2021/skillXiv code-a1-adversarial-rl-codeInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
Standard reinforcement learning for code generation suffers from self-collusion: a single model learns to generate trivial tests to easily pass, or creates generic tests that miss implementation-specific bugs. Code-A1 solves this through architectural separation—a Code LLM generates implementations while a Test LLM generates adversarial tests designed to expose defects. This adversarial co-evolution enables the Code LLM to improve robustly against high-quality tests without gaming the reward signal, matching or exceeding performance on human-annotated test benchmarks.
The technique combines architectural separation with two stabilizing mechanisms: a mistake book preserving high-quality test cases, and composite rewards balancing test validity against difficulty.
The adversarial co-evolution loop operates as:
The architectural separation ensures the Test LLM cannot trivially satisfy itself, forcing generation of genuinely challenging test cases.
Start by setting up the adversarial reward structure that scores both code and test quality.
import ast
import subprocess
from typing import Tuple
class AdversarialReward:
"""Compute rewards for code and test quality in adversarial setting."""
def __init__(self, validity_weight=0.6, difficulty_weight=0.4):
self.validity_weight = validity_weight
self.difficulty_weight = difficulty_weight
def score_code(self, code: str, tests: list) -> float:
"""Score code by tests passed / failed."""
if not self._is_valid_code(code):
return -1.0 # Invalid code
passed = 0
failed = 0
for test in tests:
result = self._run_test(code, test)
if result['success']:
passed += 1
else:
failed += 1
# Code reward: higher for passing more tests
return passed / max(passed + failed, 1)
def score_test(self, test: str, code: str, reference_code: str = None) -> float:
"""Score test by validity and difficulty (finding bugs)."""
if not self._is_valid_test(test, code):
return -0.5 # Invalid test
# Validity: does it run without error on correct code?
validity_result = self._run_test(reference_code or code, test)
validity_score = 1.0 if validity_result['success'] else 0.0
# Difficulty: does it catch bugs in candidate code?
candidate_result = self._run_test(code, test)
difficulty_score = 1.0 if not candidate_result['success'] else 0.0
# Composite: value tests that are valid and expose bugs
return (self.validity_weight * validity_score +
self.difficulty_weight * difficulty_score)
def _is_valid_code(self, code: str) -> bool:
"""Check if code is syntactically valid Python."""
try:
ast.parse(code)
return True
except SyntaxError:
return False
def _is_valid_test(self, test: str, code: str) -> bool:
"""Check if test is syntactically valid and runs."""
try:
ast.parse(test)
# Test should import or reference the code
return 'assert' in test or 'self.assert' in test
except SyntaxError:
return False
def _run_test(self, code: str, test: str, timeout=5) -> dict:
"""Execute test against code in sandbox."""
full_script = f"{code}\n\n{test}"
try:
result = subprocess.run(['python', '-c', full_script],
capture_output=True,
timeout=timeout,
text=True)
return {
'success': result.returncode == 0,
'output': result.stdout,
'error': result.stderr
}
except subprocess.TimeoutExpired:
return {'success': False, 'output': '', 'error': 'Timeout'}
Next, implement the Mistake Book—a memory buffer storing high-quality failing test cases to maintain training stability.
from collections import deque
import heapq
class MistakeBook:
"""Store high-quality test cases that expose bugs."""
def __init__(self, max_size=1000):
self.tests = deque(maxlen=max_size)
self.quality_scores = []
def add_test(self, test: str, quality_score: float):
"""Add test if it has sufficient quality."""
if quality_score > 0.3: # Threshold for "good" test
self.tests.append(test)
self.quality_scores.append(quality_score)
def sample_batch(self, batch_size: int) -> list:
"""Sample tests, biasing toward high-quality ones."""
if not self.tests:
return []
# Probability proportional to quality
total_quality = sum(self.quality_scores)
if total_quality == 0:
# Uniform sampling if all quality is zero
sampled_indices = np.random.choice(len(self.tests), batch_size,
replace=True)
else:
probabilities = [q / total_quality for q in self.quality_scores]
sampled_indices = np.random.choice(len(self.tests), batch_size,
p=probabilities, replace=True)
return [self.tests[i] for i in sampled_indices]
def get_top_k(self, k: int) -> list:
"""Return top-k quality tests for evaluation."""
indexed = list(enumerate(zip(self.tests, self.quality_scores)))
top_k = heapq.nlargest(k, indexed, key=lambda x: x[1][1])
return [test for _, (test, _) in top_k]
Now implement the main adversarial training loop coordinating code and test generation.
import torch
from torch.optim import AdamW
class AdversarialCodeTrainer:
"""Co-train Code LLM and Test LLM with adversarial objectives."""
def __init__(self, code_llm, test_llm, reward_fn, max_code_len=1024):
self.code_llm = code_llm
self.test_llm = test_llm
self.reward_fn = reward_fn
self.mistake_book = MistakeBook(max_size=2000)
self.max_code_len = max_code_len
def step(self, problem_specs: list, num_samples=4):
"""One training step with adversarial loop."""
results = {
'code_rewards': [],
'test_rewards': [],
'code_loss': 0,
'test_loss': 0
}
for spec in problem_specs:
# Phase 1: Code LLM generates implementations
code_samples = self.code_llm.generate_batch(
spec['prompt'],
num_samples=num_samples,
max_length=self.max_code_len
)
# Filter valid code
valid_code = [c for c in code_samples
if self.reward_fn._is_valid_code(c)]
if not valid_code:
continue
# Phase 2: Test LLM generates tests with white-box access
# Concatenate code as context for Test LLM
test_context = f"Code under test:\n{valid_code[0]}\n\nGenerate adversarial tests:"
test_samples = self.test_llm.generate_batch(
test_context,
num_samples=num_samples,
max_length=512
)
# Filter valid tests
valid_tests = [t for t in test_samples
if self.reward_fn._is_valid_test(t, valid_code[0])]
# Phase 3: Score implementations
code_rewards = []
for code in valid_code:
# Score against all valid tests
code_reward = self.reward_fn.score_code(code, valid_tests)
code_rewards.append(code_reward)
results['code_rewards'].append(code_reward)
# Phase 4: Score tests
test_rewards = []
for test in valid_tests:
# Test quality on best code sample
best_code = valid_code[np.argmax(code_rewards)]
test_reward = self.reward_fn.score_test(test, best_code)
test_rewards.append(test_reward)
results['test_rewards'].append(test_reward)
# Add high-quality tests to mistake book
if test_reward > 0.5:
self.mistake_book.add_test(test, test_reward)
# Phase 5: Update Code LLM with policy gradient
code_rewards_tensor = torch.tensor(code_rewards, dtype=torch.float32)
code_loss = self._compute_policy_loss(code_samples, code_rewards_tensor)
self._update_model(self.code_llm, code_loss)
results['code_loss'] += code_loss.item()
# Phase 6: Update Test LLM with policy gradient
test_rewards_tensor = torch.tensor(test_rewards, dtype=torch.float32)
test_loss = self._compute_policy_loss(test_samples, test_rewards_tensor)
self._update_model(self.test_llm, test_loss)
results['test_loss'] += test_loss.item()
return results
def _compute_policy_loss(self, samples: list, rewards: torch.Tensor):
"""Policy gradient loss: maximize reward * log_prob."""
# Normalize rewards for stability
rewards = (rewards - rewards.mean()) / (rewards.std() + 1e-8)
# Get log probabilities of samples
logprobs = self.code_llm.compute_logprob_batch(samples)
# Policy gradient objective
loss = -(rewards * logprobs).mean()
return loss
def _update_model(self, model, loss):
"""Gradient update step."""
optimizer = AdamW(model.parameters(), lr=1e-5)
optimizer.zero_grad()
loss.backward()
torch.nn.utils.clip_grad_norm_(model.parameters(), 1.0)
optimizer.step()
def train(self, problem_specs, num_steps=100):
"""Run full adversarial training."""
for step in range(num_steps):
results = self.step(problem_specs)
if (step + 1) % 10 == 0:
avg_code_reward = np.mean(results['code_rewards'])
avg_test_reward = np.mean(results['test_rewards'])
print(f"Step {step+1}: Code Reward={avg_code_reward:.3f}, "
f"Test Reward={avg_test_reward:.3f}")
Hyperparameters and When to Use:
When NOT to use:
Common Pitfalls:
Paper: Code-A1: Adversarial Evolving of Code LLM and Test LLM via RL
testing
Uses flow maps as look-ahead operators to enable principled reward-guided diffusion by predicting trajectory endpoints at any denoising step. Deploy when applying rewards or preferences to diffusion trajectories with meaningful gradients throughout generation.
testing
Train language models where each expert learns independently on closed datasets, enabling flexible inference with selective data inclusion or exclusion. 41% performance improvement while allowing users to opt out of specific data sources without retraining.
data-ai
Understand how token generation flexibility in diffusion LMs paradoxically constrains reasoning, as models exploit ordering flexibility to avoid uncertain tokens, and apply simplified approaches that preserve parallel decoding benefits. Use when optimizing diffusion-based language models for reasoning tasks.
devops
Enable LLM agents to improve continuously during deployment by constructing structured experience libraries through self-reflection on successes and failures—achieving 23% improvement on reasoning without gradient-based parameter updates or external training.