skills/skillxiv-v0.0.2-claude-opus-4.6/critic-guided-formalization/SKILL.md
Improve the formalization of mathematical problems by treating criticism (the evaluation of semantic correctness) as a learning signal. Train critic models to distinguish correct from incorrect formalizations, then use their feedback to guide RL-based proof generation.
Translating informal mathematics into formal, executable code (e.g., Lean 4) requires not just generating syntactically correct proofs but ensuring they capture the original mathematical intent. Prior work focused on generation and compilation; CriticLean shifts focus to the critic phase—the evaluation of whether a formalization is semantically correct. By training critic models to assess semantic accuracy and using their feedback as a reinforcement learning signal, CriticLean improves both the quality of generated proofs and the reliability of the evaluation process itself.
The core problem is that compilation alone does not verify semantics: a formalization can type-check and still miss the mathematical meaning. A proof might compile and be "correct" in isolation yet prove a statement that differs from what the original problem asked for.
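
As a minimal illustration (not taken from the CriticLean paper), consider the informal statement "for every natural number n, n + 0 = n". Both Lean 4 theorems below compile, but only the first states what was asked; the second adds a positivity hypothesis that the problem never mentioned. The theorem names are made up for this sketch.

```lean
-- Faithful: quantifies over every natural number, as the problem states.
theorem add_zero_all (n : Nat) : n + 0 = n := rfl

-- Compiles, but proves a weaker claim: the extra hypothesis `h` restricts
-- the statement to positive n, which the informal problem never required.
theorem add_zero_pos (n : Nat) (h : 0 < n) : n + 0 = n := rfl
```

A compiler accepts both; only a semantic check against the informal statement can tell them apart.
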
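CriticLean operates on three interconnected components, each covered in turn below: a critic model fine-tuned on semantic correctness judgments, a benchmark (CriticLeanBench) that tests the critic on challenging cases, and critic-guided reinforcement learning for proof generation.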
The framework elevates criticism from a post-hoc filter to an active learning component that guides proof generation toward semantic fidelity.
Build the critic model by fine-tuning on semantic correctness judgments:

    import numpy as np
    from transformers import (
        AutoModelForSequenceClassification,
        AutoTokenizer,
        DataCollatorWithPadding,
        Trainer,
        TrainingArguments,
    )

    # Dataset with human-verified semantic correctness labels
    # (the FineLeanCorpus loader interface is taken as given here)
    from criticlean.data import FineLeanCorpus

    MODEL_NAME = "meta-llama/Llama-2-7b-hf"  # Transformers-format checkpoint

    # Load base model for critic training
    critic_base = AutoModelForSequenceClassification.from_pretrained(
        MODEL_NAME,
        num_labels=2,  # Binary: semantically correct or incorrect
    )
    tokenizer = AutoTokenizer.from_pretrained(MODEL_NAME)

    # Llama tokenizers ship without a pad token; reuse EOS so batches can be padded
    tokenizer.pad_token = tokenizer.eos_token
    critic_base.config.pad_token_id = tokenizer.pad_token_id

    corpus = FineLeanCorpus(split="train")

    def prepare_criticism_example(problem, formal_proof, label):
        """Prepare input for semantic correctness assessment."""
        # Concatenate problem statement and formal proof
        text = (
            f"Problem: {problem['statement']}\n"
            f"Formal Proof:\n{formal_proof}\n"
            "Is this formalization semantically correct? (captures the problem intent)"
        )
        # No return_tensors here: the data collator pads and tensorizes each batch
        encoding = tokenizer(text, max_length=2048, truncation=True)
        return {
            "input_ids": encoding["input_ids"],
            "attention_mask": encoding["attention_mask"],
            "labels": int(label),  # 1 if correct, 0 if incorrect
        }

    def compute_metrics(eval_pred):
        """Report accuracy so that evaluate() returns `eval_accuracy`."""
        logits, labels = eval_pred
        predictions = np.argmax(logits, axis=-1)
        return {"accuracy": float((predictions == labels).mean())}

    examples = [
        prepare_criticism_example(p, p["proof"], p["semantic_correctness"])
        for p in corpus
    ]
    # Hold out the last 10% for evaluation; eval_strategy="steps" needs an eval set
    split = int(0.9 * len(examples))
    train_dataset, eval_dataset = examples[:split], examples[split:]

    # Fine-tune critic on semantic correctness
    training_args = TrainingArguments(
        output_dir="./critic_model",
        num_train_epochs=3,
        per_device_train_batch_size=8,
        learning_rate=2e-5,
        eval_strategy="steps",
        eval_steps=500,
        save_strategy="steps",
        save_steps=500,
    )

    trainer = Trainer(
        model=critic_base,
        args=training_args,
        train_dataset=train_dataset,
        eval_dataset=eval_dataset,
        data_collator=DataCollatorWithPadding(tokenizer),
        compute_metrics=compute_metrics,
    )

    trainer.train()  # returns training statistics, not the model
    critic_model = trainer.model
    print(f"Critic model trained; validation accuracy: {trainer.evaluate()['eval_accuracy']:.2%}")

Evaluate the critic on a benchmark of challenging cases:

    from criticlean.benchmarks import CriticLeanBench

    bench = CriticLeanBench()

    # Load test set with pairs: correct proof, subtle incorrect variants
    test_pairs = bench.load_challenge_pairs()

    # Examples of subtle incorrectness the critic must distinguish:
    # 1. Proof is valid but proves a slightly different theorem
    # 2. Proof uses different assumptions than the problem statement
    # 3. Proof omits a crucial mathematical constraint
    # 4. Proof redefines terms ambiguously

    # evaluate_semantic_correctness(critic, problem, proof) is not defined in
    # this document; it is assumed to return a scalar score where higher means
    # "more likely semantically correct".
    pairwise_wins = []
    for problem, correct_proof, incorrect_proof in test_pairs:
        # Evaluate both proofs
        correct_score = evaluate_semantic_correctness(
            critic=critic_model,
            problem=problem,
            proof=correct_proof,
        )
        incorrect_score = evaluate_semantic_correctness(
            critic=critic_model,
            problem=problem,
            proof=incorrect_proof,
        )
        # Critic should rank correct > incorrect
        pairwise_wins.append(correct_score > incorrect_score)

    accuracy = sum(pairwise_wins) / len(pairwise_wins)
    print(f"Critic accuracy on challenge pairs: {accuracy:.2%}")

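
The benchmark loop calls `evaluate_semantic_correctness`, which this document never defines. The following is a minimal sketch of what such a helper could look like, under stated assumptions: the critic is wrapped as a plain callable that maps a prompt string to two raw logits `(incorrect, correct)`, and the score is the two-class softmax probability rescaled to [-1, 1]. The `CriticFn` type and `build_prompt` helper are hypothetical names introduced here; wrapping the fine-tuned model into such a callable (tokenize, forward pass, read the logits) is left out.

```python
import math
from typing import Callable, Mapping, Tuple

# Hypothetical critic interface for this sketch: any callable that maps a
# prompt string to two raw logits, ordered (incorrect, correct).
CriticFn = Callable[[str], Tuple[float, float]]


def build_prompt(problem: Mapping[str, str], proof: str) -> str:
    """Mirror the prompt layout used when fine-tuning the critic."""
    return (
        f"Problem: {problem['statement']}\n"
        f"Formal Proof:\n{proof}\n"
        "Is this formalization semantically correct? (captures the problem intent)"
    )


def evaluate_semantic_correctness(
    critic: CriticFn, problem: Mapping[str, str], proof: str
) -> float:
    """Return a score in [-1, 1]; higher means more likely semantically correct."""
    logit_incorrect, logit_correct = critic(build_prompt(problem, proof))
    # For two classes, softmax gives p_correct = sigmoid(diff), and
    # 2 * sigmoid(diff) - 1 equals tanh(diff / 2), which is numerically stable.
    diff = logit_correct - logit_incorrect
    return math.tanh(diff / 2.0)


if __name__ == "__main__":
    # Stub critic that is fairly confident the formalization is correct.
    confident_critic = lambda prompt: (0.0, 5.0)
    problem = {"statement": "For every natural number n, n + 0 = n."}
    print(evaluate_semantic_correctness(confident_critic, problem, "theorem ..."))
```

Mapping to [-1, 1] keeps the score on the same scale as the -1.0 penalty that the reward function assigns to proofs that fail to compile.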
Use critic feedback to guide proof generation via reinforcement learning:

    from lean.prover import LeanProver
    from criticlean.rl import CriticGuidedRL

    prover = LeanProver()
    rl_trainer = CriticGuidedRL(
        critic_model=critic_model,
        prover=prover,
    )

    def critic_reward(problem, generated_proof):
        """Score proof generation based on semantic correctness."""
        try:
            # Step 1: Does the proof compile? (hard constraint)
            compilation_result = prover.compile(generated_proof)
            if not compilation_result["success"]:
                return -1.0  # Failed to compile

            # Step 2: Is it semantically correct? (soft signal)
            # The semantic score is assumed to be normalized to [-1, 1]
            return evaluate_semantic_correctness(
                critic=critic_model,
                problem=problem,
                proof=generated_proof,
            )
        except Exception:
            # Treat prover or critic failures as the worst outcome
            return -1.0

    # RL training loop
    problems = corpus.load_problems(split="train")
    for epoch in range(10):
        total_best_reward = 0.0
        for problem in problems:
            # Generate multiple proof candidates
            candidates = prover.generate_proof_candidates(
                problem=problem,
                num_candidates=5,
            )

            # Score each candidate
            rewards = [critic_reward(problem, cand) for cand in candidates]

            # Update proof generation model based on rewards
            # High reward: this candidate is semantically sound
            # Low reward: steer generation away from this candidate
            rl_trainer.update(
                problem=problem,
                candidates=candidates,
                rewards=rewards,
            )
            total_best_reward += max(rewards)

        # Average of the best candidate's reward per problem
        avg_best_reward = total_best_reward / len(problems)
        print(f"Epoch {epoch} average best-candidate critic reward: {avg_best_reward:.3f}")

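
The source does not show what `rl_trainer.update` does with the rewards. One common way to turn per-candidate rewards into a learning signal is to center and scale them within each problem's candidate group, so that candidates are compared against their siblings rather than against an absolute scale. The sketch below illustrates that general technique (group-relative advantages); it is an assumption for illustration, not a description of CriticLean's implementation, and `group_relative_advantages` is a hypothetical name.

```python
from statistics import fmean, pstdev
from typing import List, Sequence


def group_relative_advantages(
    rewards: Sequence[float], eps: float = 1e-8
) -> List[float]:
    """Center and scale rewards within one problem's candidate group."""
    if not rewards:
        return []
    mean = fmean(rewards)
    std = pstdev(rewards)
    # If every candidate scored the same, there is nothing to prefer.
    if std < eps:
        return [0.0 for _ in rewards]
    return [(r - mean) / std for r in rewards]


if __name__ == "__main__":
    # Five candidates: two failed to compile (-1.0), three got critic scores.
    rewards = [-1.0, 0.2, 0.8, -1.0, 0.5]
    for reward, advantage in zip(rewards, group_relative_advantages(rewards)):
        print(f"reward={reward:+.2f}  advantage={advantage:+.3f}")
```

With this normalization, a group where all candidates fail to compile contributes no gradient signal, which is one reason generating several candidates per problem helps exploration.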
Use this approach for:
Avoid CriticLean for:
The benchmark distinguishes subtle failure modes:
| Category | Example | Criticality |
|----------|---------|-------------|
| Compilation mismatch | Proof uses undefined identifier | Critical |
| Assumption divergence | Proof assumes A≥0 but problem didn't state it | High |
| Constraint omission | Proof ignores uniqueness requirement | High |
| Scope drift | Proof proves different theorem entirely | Critical |
| Subtle interpretation | Proof uses non-standard definition | Medium |
FineLeanCorpus contains:
| Metric | Value |
|--------|-------|
| Total problems | 285,000+ |
| Domains covered | Math, CS, Logic, Statistics |
| Human annotations | Correctness labels from domain experts |
| Proof variants | Multiple proofs per problem (some incorrect) |
| Difficulty range | Beginner to research-level |

| Parameter | Typical Range | Guidance |
|-----------|---------------|----------|
| Critic learning rate | 1e-5 to 5e-5 | Standard; monitor for divergence |
| RL reward discount | 0.99 | Standard for RL |
| Proof generation temperature | 0.7 | Controls diversity of candidates |
| Num candidates per problem | 3-10 | More candidates improve exploration |
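
The hyperparameters in this table can be gathered into a single configuration object. The sketch below is one possible layout, not part of any published CriticLean API: `CriticLeanConfig` and `outside_typical_range` are hypothetical names, the learning rate and candidate count defaults are taken from the code examples in this document, and the range check only flags values rather than rejecting them, since the table gives typical ranges and not hard limits.

```python
from dataclasses import dataclass
from typing import List


@dataclass(frozen=True)
class CriticLeanConfig:
    """Hypothetical container for the hyperparameters listed in the table."""

    critic_learning_rate: float = 2e-5  # value used in the fine-tuning example
    reward_discount: float = 0.99
    generation_temperature: float = 0.7
    num_candidates: int = 5  # value used in the RL loop example


def outside_typical_range(cfg: CriticLeanConfig) -> List[str]:
    """Return the names of settings that fall outside the table's typical ranges."""
    flagged = []
    if not 1e-5 <= cfg.critic_learning_rate <= 5e-5:
        flagged.append("critic_learning_rate")
    if not 3 <= cfg.num_candidates <= 10:
        flagged.append("num_candidates")
    return flagged


if __name__ == "__main__":
    print(outside_typical_range(CriticLeanConfig()))
    print(outside_typical_range(CriticLeanConfig(critic_learning_rate=1e-3)))
```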
When critic errors occur, categorize them:

- False negatives: the critic rejects semantically correct proofs.
- False positives: the critic accepts semantically incorrect proofs.

False positives are more dangerous, because an unfaithful formalization then passes as verified; prioritize reducing them.
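
To track both error types, the critic's scores can be compared against labeled examples at a chosen acceptance threshold. The sketch below assumes scores in [-1, 1] (higher means more likely correct) and labels of 1 for semantically correct and 0 for incorrect; `critic_error_rates` is a hypothetical helper, and the sample data is invented purely to show the trade-off.

```python
from typing import Dict, Sequence


def critic_error_rates(
    scores: Sequence[float], labels: Sequence[int], threshold: float = 0.0
) -> Dict[str, float]:
    """False positive and false negative rates at a given acceptance threshold."""
    if len(scores) != len(labels):
        raise ValueError("scores and labels must have the same length")
    false_pos = false_neg = positives = negatives = 0
    for score, label in zip(scores, labels):
        accepted = score > threshold
        if label == 1:
            positives += 1
            false_neg += int(not accepted)  # correct proof rejected
        else:
            negatives += 1
            false_pos += int(accepted)  # incorrect proof accepted
    return {
        "false_positive_rate": false_pos / negatives if negatives else 0.0,
        "false_negative_rate": false_neg / positives if positives else 0.0,
    }


if __name__ == "__main__":
    scores = [0.9, 0.2, -0.4, 0.6]  # invented critic scores
    labels = [1, 1, 0, 0]  # 1 = semantically correct
    print(critic_error_rates(scores, labels, threshold=0.0))
    print(critic_error_rates(scores, labels, threshold=0.7))
```

Raising the threshold in this toy data removes the false positive at the cost of rejecting one correct proof, which is the direction to lean when false positives are the more costly error.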
"CriticLean: Critic-Guided Reinforcement Learning for Mathematical Formalization" - arXiv:2507.06181