skills/skillxiv-v0.0.2-claude-opus-4.6/auto-codebench-generator/SKILL.md
Automatically generates diverse multilingual code benchmarks using LLMs, creating 3920 problems across 20 programming languages with quality assurance filtering.
npx skillsauth add ADu2021/skillXiv auto-codebench-generator
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
AutoCodeBench enables automatic generation of code benchmark datasets without manual annotation. The approach leverages LLMs to generate problem statements, creates test cases autonomously, and applies quality filtering to ensure correctness. This creates diverse, multilingual benchmarks addressing limitations in existing datasets that focus primarily on Python.
Step 1: Implement Problem Generation
Generate coding problems automatically:
import json

import torch


class ProblemGenerator:
    def __init__(self, llm_model):
        # llm_model is assumed to expose generate(prompt, max_length=...) -> str
        self.llm = llm_model
        self.difficulty_levels = ['easy', 'medium', 'hard']
        self.categories = ['strings', 'arrays', 'graphs', 'math', 'dp']

    def generate_problem(self, category, difficulty):
        """Generate single problem with statement and examples."""
        prompt = f"""Generate a {difficulty} {category} programming problem.
        Include:
        1. Clear problem statement
        2. Constraints
        3. Example input/output
        Return JSON with 'statement' and 'examples' keys."""
        # Inference only, so gradient tracking is disabled
        with torch.no_grad():
            response = self.llm.generate(prompt, max_length=500)
        return self._parse_problem_response(response)

    def _parse_problem_response(self, response):
        """Parse generated problem."""
        try:
            return json.loads(response)
        except (json.JSONDecodeError, TypeError):
            # Fall back to treating the raw response as the statement
            return {'statement': response, 'examples': []}

    def generate_problems_batch(self, num_problems, languages=20):
        """Generate batch of problems."""
        # Note: `languages` is not used in this step
        problems = []
        for i in range(num_problems):
            # Cycle through categories and difficulty levels for coverage
            category = self.categories[i % len(self.categories)]
            difficulty = self.difficulty_levels[i % len(self.difficulty_levels)]
            problem = self.generate_problem(category, difficulty)
            problems.append(problem)
        return problems
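LLM responses often wrap JSON in a code fence or surround it with prose, in which case a plain `json.loads` call fails and the whole response becomes the statement. The following is a minimal sketch of a more tolerant parser, not part of the original skill. The function name `extract_json_object` and the `FENCE` constant are hypothetical, and the final fallback mirrors `_parse_problem_response` above.

```python
import json
import re

# Three backticks, built programmatically so this snippet nests safely in Markdown.
FENCE = "`" * 3


def extract_json_object(response: str) -> dict:
    """Best-effort extraction of a JSON object from free-form LLM output.

    Hypothetical helper: tries the whole response, then a fenced block,
    then the span between the first '{' and the last '}'.
    """
    candidates = [response]

    # Contents of a fenced block, with an optional "json" language tag.
    fenced = re.search(FENCE + r"(?:json)?\s*(.*?)" + FENCE, response, re.DOTALL)
    if fenced:
        candidates.append(fenced.group(1))

    # Widest brace-delimited span in the response.
    start, end = response.find("{"), response.rfind("}")
    if start != -1 and end > start:
        candidates.append(response[start:end + 1])

    for candidate in candidates:
        try:
            parsed = json.loads(candidate)
        except json.JSONDecodeError:
            continue
        if isinstance(parsed, dict):
            return parsed

    # Same fallback shape as _parse_problem_response.
    return {"statement": response, "examples": []}
```

Swapping this in for the body of `_parse_problem_response` would be one option. The brace-span heuristic can still misfire on responses containing several unrelated JSON objects, so it should be treated as a convenience rather than a validator.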
Step 2: Implement Test Case Generation
Create test inputs and outputs:
import random


class TestCaseGenerator:
    def __init__(self, sandbox_executor):
        # sandbox_executor is assumed to expose execute(code, test_input) -> output
        self.executor = sandbox_executor

    def generate_test_cases(self, problem_statement, solution_code, num_tests=5):
        """Generate test cases by executing solution."""
        test_cases = []
        # Generate diverse inputs
        inputs = self._generate_diverse_inputs(problem_statement, num_tests)
        for test_input in inputs:
            try:
                # Execute solution in sandbox
                output = self.executor.execute(solution_code, test_input)
                test_cases.append({
                    'input': test_input,
                    'output': output,
                    'valid': True
                })
            except Exception as e:
                # Record the failure; invalid cases are dropped below
                test_cases.append({
                    'input': test_input,
                    'error': str(e),
                    'valid': False
                })
        # Keep only the inputs the reference solution handled without error
        return [t for t in test_cases if t['valid']]

    def _generate_diverse_inputs(self, problem_stmt, num):
        """Generate diverse test inputs."""
        inputs = []
        # Edge cases
        inputs.append(self._generate_edge_case(problem_stmt))
        # Random inputs
        for _ in range(num - 1):
            inputs.append(self._generate_random_input(problem_stmt))
        return inputs

    def _generate_edge_case(self, stmt):
        """Generate edge case input."""
        # Keyword heuristic: the empty list is the default edge case
        stmt_lower = stmt.lower()
        if 'empty' in stmt_lower:
            return '[]'
        if 'single' in stmt_lower:
            return '[1]'
        return '[]'

    def _generate_random_input(self, stmt):
        """Generate random input."""
        # Random integer list serialized as a string; the statement is not inspected
        size = random.randint(1, 100)
        values = [random.randint(0, 100) for _ in range(size)]
        return str(values)
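The `sandbox_executor` passed to `TestCaseGenerator` is not defined in this skill. As a rough sketch of what such an object could look like for Python solutions, the class below runs the solution source in a child interpreter with a timeout. The class name `SubprocessExecutor` and its `timeout_seconds` parameter are assumptions made for illustration. Note that a child process with a timeout is only process isolation, not a security sandbox; untrusted code would need a container or similar on top.

```python
import subprocess
import sys


class SubprocessExecutor:
    """Hypothetical executor: runs Python source in a child process with a timeout."""

    def __init__(self, timeout_seconds: float = 5.0):
        self.timeout_seconds = timeout_seconds

    def execute(self, solution_code: str, test_input: str) -> str:
        """Feed test_input on stdin and return the solution's stripped stdout."""
        result = subprocess.run(
            # -I starts the interpreter in isolated mode (ignores environment
            # variables and the user site-packages directory).
            [sys.executable, "-I", "-c", solution_code],
            input=test_input,
            capture_output=True,
            text=True,
            timeout=self.timeout_seconds,  # raises subprocess.TimeoutExpired
        )
        if result.returncode != 0:
            # Surface the failure so the caller can mark the test case invalid.
            raise RuntimeError(result.stderr.strip() or f"exit code {result.returncode}")
        return result.stdout.strip()
```

Both `RuntimeError` and `subprocess.TimeoutExpired` derive from `Exception`, so the `except Exception` branch in `generate_test_cases` would record either as an invalid test case.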
Step 3: Implement Quality Filtering
Ensure benchmark quality:
class BenchmarkQualityFilter:
    def filter_problems(self, problems, test_cases):
        """Filter out low-quality problems."""
        filtered = []
        # problems and test_cases are parallel lists: test_cases[i] belongs to problems[i]
        for problem, tests in zip(problems, test_cases):
            if self._is_high_quality(problem, tests):
                filtered.append((problem, tests))
        return filtered

    def _is_high_quality(self, problem, tests):
        """Assess problem quality."""
        # Check 1: Problem clarity (statement longer than 20 words)
        clarity = len(problem.get('statement', '').split()) > 20
        if not clarity:
            return False
        # Check 2: Valid test cases (at least two)
        valid_tests = len([t for t in tests if t.get('valid')])
        if valid_tests < 2:
            return False
        # Check 3: Not too easy/hard
        # The keyword score below ranges from 0 to 4, so only the lower
        # bound can reject a problem with the current indicator list
        complexity = self._estimate_complexity(problem)
        if complexity < 1 or complexity > 10:
            return False
        return True

    def _estimate_complexity(self, problem):
        """Estimate problem complexity."""
        # Count how many indicator keywords appear in the statement
        stmt = problem.get('statement', '').lower()
        complexity_indicators = ['loop', 'recursion', 'graph', 'dynamic']
        return sum(1 for ind in complexity_indicators if ind in stmt)

    def apply_reverse_verification(self, problem, tests):
        """Verify problem by generating solution and testing."""
        # Placeholder: always accepts. A full version would:
        # 1. Generate solution attempt
        # 2. Verify solution passes tests
        return True
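`apply_reverse_verification` is left as a stub that always returns `True`. One way the check could be filled in, sketched here under assumptions, is to run an independently produced candidate solution against the stored test cases and require a minimum pass rate. The function name `reverse_verify`, the `run_solution` callable, and the `min_pass_rate` parameter are all hypothetical; how the candidate solution is generated is outside this sketch.

```python
from typing import Callable, Dict, List


def reverse_verify(
    run_solution: Callable[[str], str],
    tests: List[Dict],
    min_pass_rate: float = 1.0,
) -> bool:
    """Return True if the candidate reproduces enough of the stored outputs.

    run_solution is a hypothetical callable mapping a test input string to
    the candidate solution's output string.
    """
    valid_tests = [t for t in tests if t.get("valid")]
    if not valid_tests:
        # Nothing to verify against, so the problem cannot be confirmed.
        return False

    passed = 0
    for test in valid_tests:
        try:
            actual = run_solution(test["input"])
        except Exception:
            continue  # a crash counts as a failed test
        # Compare as stripped strings to ignore trailing whitespace.
        if str(actual).strip() == str(test["output"]).strip():
            passed += 1

    return passed / len(valid_tests) >= min_pass_rate
```

Exact string comparison is a deliberate simplification. Problems with several acceptable outputs or floating-point results would need a custom comparator.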
Step 4: Build Final Benchmark Dataset
Assemble completed benchmark:
import uuid


class BenchmarkAssembler:
    def assemble_benchmark(self, filtered_problems, languages=20):
        """Create final benchmark across languages."""
        benchmark = {
            'problems': [],
            'statistics': {},
            'language_distribution': {}
        }
        for problem, tests in filtered_problems:
            benchmark_problem = {
                'problem_id': str(uuid.uuid4()),
                'statement': problem['statement'],
                'test_cases': tests,
                'languages': languages,
                'difficulty': self._assess_difficulty(problem)
            }
            benchmark['problems'].append(benchmark_problem)
        # Compute statistics
        benchmark['statistics'] = {
            'total_problems': len(benchmark['problems']),
            'total_languages': languages,
            'total_test_cases': sum(len(p['test_cases']) for p in benchmark['problems'])
        }
        return benchmark

    def _assess_difficulty(self, problem):
        """Placeholder (assumption): the original does not define this helper."""
        # Use a 'difficulty' key if the problem carries one, else 'unknown'
        return problem.get('difficulty', 'unknown')
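Once assembled, the benchmark dictionary is usually persisted and summarized. The sketch below shows one possible approach, assuming the dictionary shape produced by `assemble_benchmark` above: it writes one problem per line as JSON Lines and tallies problems per difficulty label. The function names `write_jsonl`, `read_jsonl`, and `difficulty_distribution` are hypothetical and not part of the original skill.

```python
import json
from collections import Counter
from pathlib import Path


def difficulty_distribution(benchmark: dict) -> dict:
    """Count problems per difficulty label ('unknown' if the key is missing)."""
    labels = (p.get("difficulty", "unknown") for p in benchmark["problems"])
    return dict(Counter(labels))


def write_jsonl(benchmark: dict, path: str) -> int:
    """Write one problem per line and return the number of records written."""
    records = benchmark["problems"]
    with Path(path).open("w", encoding="utf-8") as handle:
        for record in records:
            # ensure_ascii=False keeps non-English statements readable on disk.
            handle.write(json.dumps(record, ensure_ascii=False) + "\n")
    return len(records)


def read_jsonl(path: str) -> list:
    """Load the records back, skipping blank lines."""
    with Path(path).open(encoding="utf-8") as handle:
        return [json.loads(line) for line in handle if line.strip()]
```

JSON Lines is chosen here because each problem can be streamed or appended independently; a single JSON file would work equally well for small benchmarks.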
Hyperparameters and Configuration:
When to Use AutoCodeBench:
When NOT to Use:
Implementation Notes:
Paper: AutoCodeBench: LLMs as Code Benchmark Generators
ArXiv: 2508.09101
Performance: Created 3920 problems across 20 languages; even advanced models struggle with complexity and multilingual nature