skills/skillxiv-v0.0.2-claude-opus-4.6/agent0-vl-self-evolution/SKILL.md
Enable vision-language agents to self-evolve by grounding verification in tool outputs rather than text. Implement nested loops in which a Solver and a Verifier generate trajectories and tool-based feedback, then optimize the policy via GRPO using self-generated rewards, without external supervision.
npx skillsauth add ADu2021/skillXiv agent0-vl-self-evolution
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
Vision-language agents typically struggle with self-evaluation—they can easily hallucinate confidence scores or generate false critiques without external grounding. This skill demonstrates how to build agents that self-evolve by grounding their verification process in tool-generated evidence, creating a unified loop where reasoning, verification, and self-repair happen through the same agentic mechanism, all optimized via reinforcement learning.
The key innovation is tool-grounded verification: instead of text-based self-evaluation (prone to hallucination), the Verifier evaluates reasoning steps by checking tool outputs, enabling genuine self-correction before policy updates.
Agent0-VL implements a Self-Evolving Reasoning Cycle (SERC) with two nested loops:
Inner Loop (Generation & Verification): Solver generates reasoning trajectories with tool calls; Verifier evaluates using tool-generated evidence and produces structured feedback. When confidence is low, self-repair mechanisms correct reasoning before re-execution.
Outer Loop (Policy Update): GRPO optimizes the unified policy using self-generated rewards, requiring zero external reward supervision.
The system operates entirely on tool-grounded evidence, avoiding the evaluation hallucination that is common when a text-only LLM evaluates its own output.
The self-evolution process cycles through reasoning generation, tool-grounded verification, optional repair, and policy updates.
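The cycle described above can be sketched as a small control loop. This is a minimal illustration under stated assumptions, not the Agent0-VL implementation: `solve`, `verify`, `repair`, and `update_policy` are hypothetical callables standing in for the Solver, the Verifier, self-repair, and the GRPO update, and the default threshold and repair budget are assumed values.

```python
from typing import Callable, Dict, List


def run_serc_cycle(
    tasks: List[str],
    solve: Callable[[str], Dict],
    verify: Callable[[Dict], Dict],
    repair: Callable[[Dict, Dict], Dict],
    update_policy: Callable[[List[Dict]], None],
    confidence_threshold: float = 0.75,  # assumed value
    max_repairs: int = 2,                # assumed value
) -> List[Dict]:
    """One outer-loop iteration: roll out every task, then update the policy once."""
    rollouts = []
    for task in tasks:
        # Inner loop: generate, verify, and repair while confidence stays low.
        trajectory = solve(task)
        verdict = verify(trajectory)
        repairs = 0
        while verdict["confidence"] < confidence_threshold and repairs < max_repairs:
            trajectory = repair(trajectory, verdict)
            verdict = verify(trajectory)
            repairs += 1
        rollouts.append({"trajectory": trajectory, "verdict": verdict, "repairs": repairs})
    # Outer loop: a single policy update over the collected rollouts.
    update_policy(rollouts)
    return rollouts
```

Passing the four roles in as callables keeps the loop testable with stubs before any model is attached.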
1. Initialize Unified Solver-Verifier Model
Create model infrastructure supporting both generation (Solver) and verification (Verifier) roles.
def create_solver_verifier_model(base_model_name, max_tokens=2048):
    """
    Initialize unified model capable of both reasoning generation and verification.
    Both roles output structured tokens for tool calls and confidence scores.
    """
    model = load_vlm_model(base_model_name)

    # Solver prompt template
    solver_template = """
    Image: [image]
    Question: [question]
    Reason step by step:
    1. Observe the image carefully
    2. Plan which tools to call
    3. Call tools and interpret results
    4. Generate final answer
    Your reasoning:
    """

    # Verifier prompt template
    verifier_template = """
    Image: [image]
    Question: [question]
    Reasoning trajectory: [trajectory]
    Tool outputs:
    [tool_results]
    Evaluate the reasoning step. Provide:
    - score (0-100): How correct is this step based on tool outputs?
    - confidence (0-1): How certain are you in this score?
    - feedback: Specific issues if score < 80
    """

    return {
        'model': model,
        'solver_template': solver_template,
        'verifier_template': verifier_template,
        'max_tokens': max_tokens
    }
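The templates above use bracketed placeholders such as `[image]` and `[question]`. A small helper can substitute them in one pass and fail loudly when a value is missing. The name `fill_template` and the lowercase-placeholder pattern are assumptions made for this sketch; the skill itself uses chained `str.replace` calls.

```python
import re
from typing import Mapping

# Matches lowercase placeholders like [image]; uppercase markers such as
# [TOOL_CALL] are deliberately left alone.
PLACEHOLDER = re.compile(r"\[([a-z_]+)\]")


def fill_template(template: str, values: Mapping[str, str]) -> str:
    """Replace every [name] placeholder with values[name]; raise if one is missing."""
    def substitute(match: re.Match) -> str:
        name = match.group(1)
        if name not in values:
            raise KeyError(f"no value supplied for placeholder [{name}]")
        return str(values[name])

    return PLACEHOLDER.sub(substitute, template)
```

Because substitution happens in a single pass, bracketed text inside a supplied value (for example a question that itself contains `[image]`) is left untouched, which chained `str.replace` calls do not guarantee.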
2. Implement Solver: Generate Reasoning with Tool Calls
Solver generates multi-step reasoning trajectories with explicit tool call instructions.
def solver_generate_trajectory(image, question, model_config, tools_available):
    """
    Generate reasoning trajectory with tool calls.
    Solver outputs structured reasoning including tool invocations.
    `image` is assumed to be a string reference (e.g. a path or image token),
    since it is substituted into the text prompt.
    """
    prompt = (model_config['solver_template']
              .replace('[image]', image)
              .replace('[question]', question))
    trajectory = {
        'steps': [],
        'tool_calls': [],
        'image': image,
        'question': question
    }

    # Generate step-by-step with tool calls
    for step_idx in range(5):  # Max 5 reasoning steps
        response = model_config['model'].generate(
            prompt + trajectory_to_text(trajectory['steps']),
            max_tokens=512,
            stop_tokens=['[TOOL_CALL_END]']
        )

        # Parse for tool calls
        if '[TOOL_CALL]' in response:
            tool_call = extract_tool_call(response)
            trajectory['tool_calls'].append(tool_call)

            # Execute tool
            tool_result = execute_tool(tool_call, tools_available)
            trajectory['steps'].append({
                'type': 'reasoning',
                'content': response,
                'tool_result': tool_result
            })
        else:
            # Final answer
            trajectory['steps'].append({
                'type': 'answer',
                'content': response
            })
            break

    return trajectory
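The Solver code calls `extract_tool_call` and `execute_tool` without defining them. The sketch below shows one way they might look. It assumes the model emits a JSON object with `name` and `args` fields after the `[TOOL_CALL]` marker, and that tools are plain Python callables in a dict; neither detail is specified by the skill.

```python
import json
from typing import Any, Callable, Dict

TOOL_CALL_START = "[TOOL_CALL]"
TOOL_CALL_END = "[TOOL_CALL_END]"


def extract_tool_call(response: str) -> Dict[str, Any]:
    """Parse the JSON payload that follows the [TOOL_CALL] marker (assumed format)."""
    _, _, payload = response.partition(TOOL_CALL_START)
    # The end marker may be absent when generation stopped on it.
    payload = payload.split(TOOL_CALL_END, 1)[0].strip()
    call = json.loads(payload)
    if not isinstance(call, dict) or "name" not in call:
        raise ValueError("tool call must be a JSON object with a 'name' field")
    call.setdefault("args", {})
    return call


def execute_tool(call: Dict[str, Any], tools: Dict[str, Callable[..., Any]]) -> Dict[str, Any]:
    """Run the named tool and report success or failure as data, not as an exception."""
    tool = tools.get(call["name"])
    if tool is None:
        return {"ok": False, "error": f"unknown tool: {call['name']}"}
    try:
        return {"ok": True, "output": tool(**call["args"])}
    except Exception as exc:  # surface tool failures so the Verifier can see them
        return {"ok": False, "error": str(exc)}
```

Returning failures as data rather than raising lets a failed tool call appear in the trajectory, where the Verifier can treat it as evidence.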
3. Implement Tool-Grounded Verifier
Verifier evaluates reasoning steps using tool outputs as ground truth evidence.
def verifier_evaluate_trajectory(trajectory, model_config, max_confidence=0.95):
    """
    Verify reasoning using tool-generated evidence.
    Returns score and confidence purely based on tool outputs, not subjective text.
    """
    verifier_prompt = (model_config['verifier_template']
                       .replace('[image]', trajectory['image'])
                       .replace('[question]', trajectory['question']))
    verifier_prompt = verifier_prompt.replace('[trajectory]', trajectory_to_text(trajectory['steps']))

    # Pass the tool outputs recorded on each step, not just the call requests
    tool_results = [s['tool_result'] for s in trajectory['steps'] if 'tool_result' in s]
    verifier_prompt = verifier_prompt.replace('[tool_results]', format_tool_results(tool_results))

    verification = model_config['model'].generate(
        verifier_prompt,
        max_tokens=256,
        response_format={'score': 'int', 'confidence': 'float', 'feedback': 'str'}
    )

    return {
        'score': verification['score'],
        # Cap confidence so the Verifier can never report full certainty
        'confidence': min(verification['confidence'], max_confidence),
        'feedback': verification['feedback'],
        'step_index': len(trajectory['steps']) - 1
    }
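Structured Verifier output is only useful if its fields are in range. The sketch below coerces a raw result into the ranges the verifier prompt asks for (score 0-100, confidence 0-1, capped at `max_confidence`). The function name `normalize_verification` and the choice to default missing fields to zero are assumptions made for this illustration.

```python
from typing import Any, Dict


def normalize_verification(raw: Dict[str, Any], max_confidence: float = 0.95) -> Dict[str, Any]:
    """Coerce Verifier output into the ranges requested by the verifier prompt."""
    score = int(round(float(raw.get("score", 0))))
    confidence = float(raw.get("confidence", 0.0))
    return {
        "score": max(0, min(100, score)),                          # clamp to 0-100
        "confidence": max(0.0, min(max_confidence, confidence)),   # clamp to 0-cap
        "feedback": str(raw.get("feedback", "")).strip(),
    }
```

Defaulting a missing confidence to 0.0 errs on the side of triggering repair rather than silently accepting an unverified step.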
4. Implement Self-Repair Loop
When verification confidence is low, automatically regenerate the problematic step.
def self_repair_step(trajectory, verification_result, model_config, tools_available, max_repairs=2):
    """
    Repair low-confidence reasoning steps.
    Re-generates the problematic step with explicit guidance from verification feedback.
    """
    if verification_result['confidence'] >= 0.75 or max_repairs <= 0:
        return trajectory  # No repair needed or max repairs reached

    # Generate repair prompt, filling in the trajectory context
    step_index = verification_result['step_index']
    previous_steps = trajectory_to_text(trajectory['steps'][:step_index])
    repair_prompt = f"""
    Previous reasoning had issues:
    {verification_result['feedback']}
    Image: {trajectory['image']}
    Question: {trajectory['question']}
    Previous steps: {previous_steps}
    Please re-reason about the next step, addressing the feedback:
    """

    # Regenerate the step
    repair_response = model_config['model'].generate(repair_prompt, max_tokens=512)

    # Parse and execute tools if present; tool_result stays None otherwise
    tool_result = None
    if '[TOOL_CALL]' in repair_response:
        tool_call = extract_tool_call(repair_response)
        tool_result = execute_tool(tool_call, tools_available)

    # Replace the problematic step
    trajectory['steps'][step_index] = {
        'type': 'reasoning',
        'content': repair_response,
        'tool_result': tool_result,
        'repaired': True
    }

    # Re-verify
    new_verification = verifier_evaluate_trajectory(trajectory, model_config)
    if new_verification['confidence'] > verification_result['confidence']:
        # Accept the repair as soon as confidence improves
        return trajectory
    else:
        # Otherwise retry with the remaining repair budget
        return self_repair_step(trajectory, new_verification, model_config, tools_available, max_repairs - 1)
5. Design Reward Signal from Tool-Grounded Verification
Compute each trajectory's reward from answer correctness, tool execution, repair count, and trajectory length.
def compute_self_generated_reward(trajectory, final_answer, ground_truth_answer):
    """
    Compute a scalar reward for one trajectory.
    Combines answer correctness, tool execution, repair count, and trajectory length.
    Note: the correctness term compares against ground_truth_answer, which is an
    external label. A label-free setup would need to derive that term from the
    Verifier's score instead.
    """
    reward = 0.0

    # Base reward: answer correctness
    if final_answer == ground_truth_answer:
        reward += 1.0
    else:
        # Partial credit for close answers
        similarity = compute_answer_similarity(final_answer, ground_truth_answer)
        reward += similarity * 0.5

    # Step-wise rewards: credit each step that actually produced a tool result
    for step in trajectory['steps']:
        if step.get('tool_result') is not None:
            reward += 0.1

    # Penalty for excessive repairs
    repair_count = sum(1 for s in trajectory['steps'] if s.get('repaired', False))
    reward -= repair_count * 0.05

    # Efficiency bonus for shorter trajectories
    num_steps = len(trajectory['steps'])
    if num_steps <= 3:
        reward += 0.2

    return max(0.0, reward)  # Clip to [0, inf)
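The reward function relies on `compute_answer_similarity`, which the skill does not define. One possible stand-in, shown here as an assumption rather than the skill's method, is the character-level ratio from Python's standard `difflib.SequenceMatcher`, applied after case and whitespace normalization.

```python
from difflib import SequenceMatcher


def compute_answer_similarity(predicted: str, reference: str) -> float:
    """Return a similarity in [0, 1] after case and whitespace normalization."""
    a = " ".join(str(predicted).lower().split())
    b = " ".join(str(reference).lower().split())
    if not a and not b:
        return 1.0  # two empty answers are treated as identical
    return SequenceMatcher(None, a, b).ratio()
```

With the partial-credit weight of 0.5 used in the reward function, an answer whose similarity is 0.8 contributes 0.4 to the reward. A character-level ratio is a crude proxy for numeric or multiple-choice answers, so a task-specific matcher may be a better fit there.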
6. GRPO Optimization with Self-Generated Rewards
Update policy using self-generated rewards through gradient-based optimization.
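import numpy as np

def grpo_step(batch_trajectories, model_config, optimizer, lr=1e-5):
    """
    Single simplified GRPO step optimizing policy with self-generated rewards.
    No external reward model needed; uses trajectory-internal verification and tool correctness.
    Assumes batch_trajectories is one group of rollouts sampled for the same prompt.
    `lr` is kept for interface compatibility; the step size is set on `optimizer`.
    """
    rewards = np.array([t['self_reward'] for t in batch_trajectories], dtype=np.float64)

    # Group-relative advantage: normalize each reward against the group mean and std
    advantages = (rewards - rewards.mean()) / (rewards.std() + 1e-8)

    losses = []
    for trajectory, advantage in zip(batch_trajectories, advantages):
        # Log-probability of the sampled trajectory under the current policy
        log_prob = compute_log_probability(trajectory, model_config['model'])

        # Policy-gradient loss weighted by the group-relative advantage.
        # Full GRPO also uses a clipped probability ratio and a KL penalty, omitted here.
        loss = -log_prob * float(advantage)
        losses.append(loss)

    # Backpropagate and update
    total_loss = sum(losses) / len(losses)
    optimizer.zero_grad()
    total_loss.backward()
    optimizer.step()

    return {'loss': total_loss.item(), 'mean_reward': float(rewards.mean())}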
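GRPO weights each trajectory by its advantage relative to the other rollouts in its group, rather than by the raw reward. The normalization can be sketched with the standard library alone. The function name `group_relative_advantages` and the `eps` value are choices made for this sketch, and the clipped-ratio and KL terms of the full objective are not shown.

```python
from statistics import mean, pstdev
from typing import List


def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group: (r - mean) / (std + eps)."""
    if not rewards:
        return []
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population standard deviation over the group
    return [(r - mu) / (sigma + eps) for r in rewards]
```

When every rollout in a group receives the same reward, all advantages are zero and the group contributes no gradient, so groups need some reward spread to drive learning.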
When to Use Agent0-VL:
When NOT to Use:
Key Hyperparameters:
confidence_threshold: When to trigger repair (0.70-0.80 typical)
max_repairs_per_trajectory: Prevent infinite repair loops (1-3)
num_grpo_steps: Training iterations before evaluation (100-1000)
batch_size: Trajectories per GRPO update (16-64)
learning_rate: Gradient step size (1e-6 to 1e-4 typical)
Optimization Tips:
Pitfalls to Avoid:
Research paper: https://arxiv.org/abs/2511.19900