skills/skillxiv-v0.0.2-claude-opus-4.6/flex-continuous-agent-evolution/SKILL.md
Enable LLM agents to improve continuously during deployment by building structured experience libraries through self-reflection on successes and failures, with a reported 23% improvement on mathematical reasoning (AIME25) and no gradient-based parameter updates or external training.
Deployed language model agents are typically static: once trained, they do not improve from real-world interactions. FLEX addresses this through gradient-free continuous learning. Agents maintain a structured experience library that records successes, failures, and their contexts. During subsequent interactions, the agent retrieves relevant past experiences, reflects on them, and folds the lessons into its prompts without retraining.
The reported gains are substantial: 23% on mathematical reasoning (AIME25), 10% on chemical synthesis, and 14% on protein engineering, all from self-refinement during deployment rather than additional training.
FLEX treats deployed agent improvement as a problem of structured experience management rather than parameter optimization. The system maintains three components:
This approach is particularly powerful because it requires no gradient computation, model retraining, or calls to external LLMs during learning. It needs only structured reflection during inference.
Step 1: Experience Data Structure
Define a structured format for recording and retrieving agent interactions.
from dataclasses import dataclass, asdict
from typing import List, Dict, Any
import json
from datetime import datetime

@dataclass
class Experience:
    """Single interaction record in the experience library."""
    problem: str                # Problem description
    solution_attempt: str       # Agent's attempted solution
    ground_truth: str           # Correct answer (if available)
    is_correct: bool            # Did the solution succeed?
    domain: str                 # Problem domain (math, coding, etc.)
    difficulty: str             # Estimated difficulty
    timestamp: str              # When this occurred
    techniques_used: List[str]  # Techniques employed (e.g., 'divide-and-conquer')
    failure_reason: str         # Why it failed (empty if not applicable)
    reflection: str             # Agent's own analysis of the attempt
    metadata: Dict[str, Any]    # Additional context (tokens used, latency, etc.)

    def to_dict(self) -> Dict[str, Any]:
        """Serialize using the dataclass field names, so that
        Experience(**d) in load_from_disk can rebuild the record."""
        return asdict(self)

class ExperienceLibrary:
    """Maintains structured experience collection."""

    def __init__(self, storage_path='./experience_library.jsonl'):
        self.storage_path = storage_path
        self.experiences: List[Experience] = []
        self.load_from_disk()

    def add_experience(self, exp: Experience):
        """Record a new experience."""
        self.experiences.append(exp)
        # Append to disk for persistence (one JSON object per line)
        with open(self.storage_path, 'a') as f:
            f.write(json.dumps(exp.to_dict()) + '\n')

    def retrieve_relevant(self, problem: str, domain: str, k=3) -> List[Experience]:
        """
        Find most relevant past experiences for a given problem.

        Args:
            problem: Current problem description
            domain: Problem domain
            k: Number of experiences to retrieve
        Returns:
            relevant_experiences: Top-k similar past experiences
        """
        # Imported lazily so scikit-learn is only required when retrieval runs
        from sklearn.feature_extraction.text import TfidfVectorizer
        from sklearn.metrics.pairwise import cosine_similarity

        # Filter by domain first
        domain_exps = [e for e in self.experiences if e.domain == domain]
        # With k or fewer candidates there is nothing to rank
        if len(domain_exps) <= k:
            return domain_exps

        # Compute similarity between current problem and past problems
        all_problems = [e.problem for e in domain_exps] + [problem]
        vectorizer = TfidfVectorizer(max_features=100)
        tfidf = vectorizer.fit_transform(all_problems)

        # Similarity of current problem (last row) to all past problems
        similarities = cosine_similarity(tfidf[-1:], tfidf[:-1])[0]

        # Sort by similarity and return top-k, most similar first
        top_indices = similarities.argsort()[-k:][::-1]
        return [domain_exps[i] for i in top_indices]

    def load_from_disk(self):
        """Load experiences from persistent storage."""
        try:
            with open(self.storage_path, 'r') as f:
                for line in f:
                    if not line.strip():
                        continue  # Skip blank lines
                    exp_dict = json.loads(line)
                    self.experiences.append(Experience(**exp_dict))
        except FileNotFoundError:
            pass  # First run: empty library
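The TF-IDF retrieval above depends on scikit-learn. Where that dependency is unavailable, a plain token-overlap ranking could stand in. The following is a minimal sketch under that assumption and is not part of FLEX itself: `tokenize`, `jaccard`, and `top_k_by_overlap` are hypothetical helper names, and Jaccard overlap is a swapped-in similarity measure that is cruder than TF-IDF cosine similarity.

```python
import re
from typing import List, Set, Tuple

def tokenize(text: str) -> Set[str]:
    """Lowercase the text and return its set of alphanumeric tokens."""
    return set(re.findall(r"[a-z0-9]+", text.lower()))

def jaccard(a: Set[str], b: Set[str]) -> float:
    """Jaccard similarity |a & b| / |a | b|; 0.0 when both sets are empty."""
    union = a | b
    return len(a & b) / len(union) if union else 0.0

def top_k_by_overlap(query: str, past_problems: List[str], k: int = 3) -> List[Tuple[int, float]]:
    """Return (index, score) pairs for the k past problems most similar to the query."""
    q = tokenize(query)
    scored = [(i, jaccard(q, tokenize(p))) for i, p in enumerate(past_problems)]
    # Highest score first; the sort is stable, so ties keep their original order
    scored.sort(key=lambda pair: pair[1], reverse=True)
    return scored[:k]

# Hypothetical past problems, for illustration only
past = [
    "Find the sum of all positive integers n such that n divides 100",
    "Design a synthesis route for aspirin",
    "Find the number of positive integers less than 100 divisible by 7",
]
ranked = top_k_by_overlap("Find the sum of positive integers dividing 60", past, k=2)
best_index = ranked[0][0]  # Index into `past` of the closest match
```

The returned indices can be used to look up the matching `Experience` records after the same domain filter that `retrieve_relevant` applies.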
Step 2: Self-Reflection Engine
Generate structured reflections on why attempts succeeded or failed.
from typing import Optional

def generate_reflection(problem: str, solution: str, is_correct: bool,
                        ground_truth: Optional[str] = None, llm_api=None) -> str:
    """
    Generate agent's reflection on an attempt.

    Args:
        problem: Original problem
        solution: Agent's attempted solution
        is_correct: Whether solution was correct
        ground_truth: Correct solution (if available)
        llm_api: Object with a generate(prompt, max_tokens=...) method
    Returns:
        reflection: Natural language analysis
    """
    if is_correct:
        prompt = (
            "Analyze why this solution was correct:\n"
            f"Problem: {problem}\n"
            f"Solution: {solution}\n"
            "Provide a brief reflection on what techniques made this solution work:"
        )
    else:
        # Include the reference answer only when one is available
        truth_line = f"Correct solution: {ground_truth}\n" if ground_truth else ""
        prompt = (
            "Analyze why this solution failed:\n"
            f"Problem: {problem}\n"
            f"Your solution: {solution}\n"
            f"{truth_line}"
            "Identify the key mistake or misconception:"
        )

    # Call LLM to generate reflection
    if llm_api:
        return llm_api.generate(prompt, max_tokens=200)

    # Fallback when no LLM is supplied: simple pattern matching.
    # Correctness is checked first so a correct solution that merely
    # mentions an exception name is not labeled as an error.
    if is_correct:
        return "Solution approach was sound"
    if "ValueError" in solution or "TypeError" in solution:
        return "Code raised a value or type error"
    return "Solution logic was flawed"
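The reflection step only assumes that `llm_api` exposes a `generate(prompt, max_tokens=...)` method. One way to make that assumption explicit, and to test reflection logic without a real model, is a `typing.Protocol` plus a canned stub. This is an illustrative sketch: `TextGenerator`, `CannedLLM`, and `reflect` are hypothetical names, and the stub's word-based truncation only approximates a token limit.

```python
from typing import List, Protocol

class TextGenerator(Protocol):
    """Hypothetical interface: anything with generate(prompt, max_tokens) -> str."""
    def generate(self, prompt: str, max_tokens: int = 200) -> str: ...

class CannedLLM:
    """Test double that records prompts and returns a fixed reply (not a real LLM client)."""
    def __init__(self, reply: str):
        self.reply = reply
        self.prompts: List[str] = []

    def generate(self, prompt: str, max_tokens: int = 200) -> str:
        self.prompts.append(prompt)
        # Crude stand-in for a token limit: keep at most max_tokens words
        return " ".join(self.reply.split()[:max_tokens])

def reflect(llm: TextGenerator, problem: str, solution: str) -> str:
    """Ask the model for a short reflection on one failed attempt."""
    prompt = f"Problem: {problem}\nSolution: {solution}\nIdentify the key mistake:"
    return llm.generate(prompt, max_tokens=5)

stub = CannedLLM("Forgot to check the boundary case n = 0 before dividing")
reflection = reflect(stub, "Compute 10 / n for n in 0..3", "Loop n from 0 to 3 and divide")
```

Because the protocol is structural, any client object with a matching `generate` method satisfies it without inheriting from `TextGenerator`.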
Step 3: Experience-Augmented Prompting
Incorporate retrieved experiences into prompts during inference.
def augment_prompt_with_experiences(
        original_prompt: str,
        relevant_experiences: List[Experience],
        include_failures: bool = True) -> str:
    """
    Create augmented prompt including relevant past experiences.

    Args:
        original_prompt: User's problem description
        relevant_experiences: Retrieved past experiences
        include_failures: Whether to include negative examples
    Returns:
        augmented_prompt: Enhanced prompt with examples
    """
    successful_exps = [e for e in relevant_experiences if e.is_correct]
    failed_exps = (
        [e for e in relevant_experiences if not e.is_correct]
        if include_failures else []
    )
    # With nothing to show, return the prompt unchanged instead of an empty preamble
    if not successful_exps and not failed_exps:
        return original_prompt

    parts = ["You have access to relevant past experiences.\n\n"]
    if successful_exps:
        parts.append("Use insights from successes:\n\n")
        for i, exp in enumerate(successful_exps, start=1):
            parts.append(f"Example {i} - Success:\n")
            parts.append(f"Problem: {exp.problem}\n")
            parts.append(f"Solution: {exp.solution_attempt}\n")
            parts.append(f"Key insight: {exp.reflection}\n\n")
    if failed_exps:
        parts.append("Learn from past mistakes:\n\n")
        for i, exp in enumerate(failed_exps, start=1):
            # Fall back to the reflection when no failure reason was stored
            reason = exp.failure_reason or exp.reflection
            parts.append(f"Past Mistake {i}:\n")
            parts.append(f"Problem: {exp.problem}\n")
            parts.append(f"Failed attempt: {exp.solution_attempt}\n")
            parts.append(f"Why it failed: {reason}\n\n")
    parts.append(f"Now solve this new problem:\n{original_prompt}")
    return "".join(parts)
Step 4: Agent Deployment Loop
Main loop integrating experience capture and retrieval during deployment.
from typing import Optional

class DeployedAgent:
    """LLM agent that learns from deployment experiences."""

    def __init__(self, base_model, experience_library, domain='general'):
        self.model = base_model
        self.library = experience_library
        self.domain = domain
        self.success_count = 0
        self.total_attempts = 0

    def solve_problem(self, problem: str, ground_truth: Optional[str] = None) -> Dict[str, Any]:
        """
        Solve a problem, recording experience for future learning.

        Args:
            problem: Problem description
            ground_truth: Correct answer (if available for offline validation)
        Returns:
            result: {solution, is_correct, experience_recorded}
        """
        # Step 1: Retrieve relevant past experiences
        relevant_exps = self.library.retrieve_relevant(problem, self.domain, k=3)

        # Step 2: Augment prompt with relevant experiences
        augmented_prompt = augment_prompt_with_experiences(
            problem, relevant_exps, include_failures=True
        )

        # Step 3: Generate solution
        solution = self.model.generate(augmented_prompt, max_tokens=2048)

        # Step 4: Validate. Without ground truth there is no verdict, so the
        # attempt is returned but not stored as either a success or a failure.
        if not ground_truth:
            return {'solution': solution, 'is_correct': None,
                    'experience_recorded': False}
        is_correct = self._validate_solution(solution, ground_truth)
        self.total_attempts += 1
        if is_correct:
            self.success_count += 1

        # Step 5: Generate reflection with the agent's own model
        reflection = generate_reflection(
            problem, solution, is_correct, ground_truth, llm_api=self.model
        )

        # Step 6: Record experience
        failure_reason = '' if is_correct else self._analyze_failure(solution, ground_truth)
        experience = Experience(
            problem=problem,
            solution_attempt=solution,
            ground_truth=ground_truth,
            is_correct=is_correct,
            domain=self.domain,
            difficulty=self._estimate_difficulty(problem),
            timestamp=datetime.now().isoformat(),
            techniques_used=self._extract_techniques(solution),
            failure_reason=failure_reason,
            reflection=reflection,
            metadata={
                'model': getattr(self.model, 'name', 'unknown'),
                'num_examples': len(relevant_exps),
                # total_attempts >= 1 here, so the division is safe
                'accuracy_rate': self.success_count / self.total_attempts
            }
        )
        self.library.add_experience(experience)

        return {
            'solution': solution,
            'is_correct': is_correct,
            'experience_recorded': True
        }

    def _validate_solution(self, solution: str, ground_truth: str) -> bool:
        """Check if solution matches ground truth."""
        # Simple string matching; extend for domain-specific validation
        return solution.strip() == ground_truth.strip()

    def _analyze_failure(self, solution: str, ground_truth: str) -> str:
        """Identify type of failure."""
        if len(solution.strip()) == 0:
            return "No Output Generated"
        if "ValueError" in solution or "TypeError" in solution:
            return "Value/Type Error"
        return "Incorrect Logic"

    def _estimate_difficulty(self, problem: str) -> str:
        """Estimate problem difficulty from word count (a rough heuristic)."""
        word_count = len(problem.split())
        if word_count < 50:
            return "easy"
        elif word_count < 200:
            return "medium"
        else:
            return "hard"

    def _extract_techniques(self, solution: str) -> List[str]:
        """Identify reasoning techniques used via keyword matching."""
        text = solution.lower()
        techniques = []
        if "divide" in text or "split" in text:
            techniques.append("divide-and-conquer")
        if "recursion" in text:
            techniques.append("recursion")
        if "dynamic" in text:
            techniques.append("dynamic-programming")
        if "greedy" in text:
            techniques.append("greedy")
        return techniques
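The exact string comparison in `_validate_solution` rejects a correct answer that arrives wrapped in explanation. For math domains with integer answers, such as AIME, one domain-specific option is to compare the final integer in the model output with the reference. The sketch below is an assumption about how that extension might look: `extract_final_integer` and `validate_integer_answer` are hypothetical helpers, and "take the last integer" is a heuristic that can misfire on outputs that end with unrelated numbers.

```python
import re
from typing import Optional

def extract_final_integer(text: str) -> Optional[int]:
    """Return the last integer appearing in the text, or None if there is none."""
    # Drop thousands separators so "1,024" is read as 1024
    matches = re.findall(r"-?\d+", text.replace(",", ""))
    return int(matches[-1]) if matches else None

def validate_integer_answer(solution: str, ground_truth: str) -> bool:
    """True when the final integer in the solution equals the one in the ground truth."""
    predicted = extract_final_integer(solution)
    expected = extract_final_integer(ground_truth)
    return predicted is not None and predicted == expected

ok = validate_integer_answer("We get 3 cases, so the answer is 204.", "204")
wrong = validate_integer_answer("The answer is 205", "204")
no_answer = validate_integer_answer("I could not solve it", "204")
```

A check like this could replace the body of `_validate_solution` for a math-only agent, while other domains would need their own validators.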
Step 5: Continuous Monitoring and Adaptation
Track learning over time and identify when improvements plateau.
import time

def monitor_agent_learning(agent: DeployedAgent, window_size: int = 100,
                           poll_seconds: int = 60):
    """
    Monitor improvement trends in agent performance.

    Args:
        agent: Deployed agent instance
        window_size: Number of recent attempts to analyze
        poll_seconds: Delay between measurements
    Yields:
        metrics: Performance statistics
    """
    while True:
        recent_exps = agent.library.experiences[-window_size:]
        if recent_exps:
            success_rate = sum(1 for e in recent_exps if e.is_correct) / len(recent_exps)
            avg_reflection_length = sum(
                len(e.reflection.split()) for e in recent_exps
            ) / len(recent_exps)
            metrics = {
                'success_rate': success_rate,
                'sample_count': agent.total_attempts,
                # Placeholder: equals the raw success rate until a baseline is supplied
                'improvement': success_rate,
                'avg_reflection_length': avg_reflection_length,
                'unique_domains': len(set(e.domain for e in recent_exps))
            }
            # A high success rate suggests the current domain is saturated
            if success_rate > 0.9:
                print("Agent has reached high performance; consider expanding domain")
            yield metrics
        time.sleep(poll_seconds)  # Monitor every minute by default
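The monitoring loop reports a success rate but does not itself decide whether learning has plateaued. One simple way to do that is to compare the latest window of outcomes with the window before it. This is a sketch under assumptions, not the method from the FLEX paper: `has_plateaued` and its `min_gain` threshold of 0.02 are hypothetical choices that would need tuning per deployment.

```python
from typing import List

def success_rate(outcomes: List[bool]) -> float:
    """Fraction of True values; 0.0 for an empty list."""
    return sum(outcomes) / len(outcomes) if outcomes else 0.0

def has_plateaued(outcomes: List[bool], window_size: int = 100,
                  min_gain: float = 0.02) -> bool:
    """Compare the latest window with the previous one.

    Returns True when the success rate gained less than min_gain.
    Note that a decline also counts as "no meaningful gain".
    """
    if len(outcomes) < 2 * window_size:
        return False  # Not enough history to compare two full windows
    previous = outcomes[-2 * window_size:-window_size]
    latest = outcomes[-window_size:]
    return success_rate(latest) - success_rate(previous) < min_gain

# Illustrative outcome histories (True = correct attempt)
improving = [False] * 8 + [True] * 2 + [False] * 4 + [True] * 6  # 0.2 then 0.6
flat = [True, False] * 10                                        # 0.5 then 0.5

improving_plateau = has_plateaued(improving, window_size=10)
flat_plateau = has_plateaued(flat, window_size=10)
```

The outcome list can be built from the library with `[e.is_correct for e in agent.library.experiences]`.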
When to Use FLEX:
When NOT to Use:
Hyperparameters and Configuration:
Pitfalls to Avoid:
Reference: https://arxiv.org/abs/2511.06449