skills/empirical-mcts-continuous-agent-evolution/SKILL.md
Applies Empirical-MCTS dual-loop reasoning: structured tree search with persistent memory that accumulates experience across problems. Use when asked to 'solve this step by step with learning', 'use MCTS reasoning', 'try multiple approaches and remember what works', 'evolve your strategy as you go', 'search for the best solution systematically', or 'use tree search with memory'.
npx skillsauth add ndpvt-web/arxiv-claude-skills empirical-mcts-continuous-agent-evolutionInstall this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
3 of 9 scanners reported clean
Some scanners were skipped, did not run, or reported a non-clean status. Review each row below.
This skill enables Claude to apply the Empirical-MCTS framework from Lu et al. (2026) to complex reasoning and coding tasks. Instead of treating each problem attempt as stateless, Claude maintains a dual-loop architecture: an inner loop that explores solution candidates via tree search with dynamically evolving evaluation criteria, and an outer loop that distills successful reasoning patterns into a persistent memory store. This transforms brute-force retry strategies into a disciplined search that gets smarter with each iteration, accumulating empirical wisdom across related problems.
Empirical-MCTS replaces stateless LLM inference with a dual-loop search process. The inner loop is a Monte Carlo Tree Search where each node represents a candidate solution or reasoning state. Nodes are selected via UCB (Upper Confidence Bound), expanded by generating new candidate solutions, and scored using Pairwise-Experience-Evolutionary Meta-Prompting (PE-EMP). PE-EMP does not simply rate a solution in isolation -- it compares two candidates head-to-head, dynamically generates evaluation criteria specific to the problem context, assigns weighted scores across those criteria, and distills the comparison into actionable insights and an evolved meta-prompt. This means the search criteria themselves adapt as the search progresses, unlike fixed heuristics.
The outer loop maintains a global experience memory -- a structured repository of distilled insights tagged by task type. After each problem (or search iteration), a Memory Optimization Agent performs atomic operations on this store: Add new high-quality insights, Modify existing ones with richer understanding, Merge fragmented observations into cohesive principles, and Delete obsolete or low-quality entries. Retrieved experiences serve as a "policy prior" for subsequent problems, biasing the search toward strategies that have worked before while preserving the ability to explore novel approaches.
The critical insight is that neither loop alone is sufficient. Stateless MCTS discards hard-won reasoning patterns. Pure experience replay without structured search drifts toward confirmation bias. The coupling of disciplined exploration with empirical accumulation is what produces the 10-20 percentage point gains observed on benchmarks like AIME25 (73.3% vs 63.3% for LLaMA-Berry) and MathArena Apex.
Create a root node containing the problem statement and an initial system prompt (generic or domain-appropriate). Initialize an empty experience memory (or load prior experiences if solving a series of related problems). Set search parameters: rollout budget (T=4-8 for most tasks), retrieval count (k=3-5 prior experiences), and decay factor (gamma=0.7 for backpropagation).
Classify the problem by task type (e.g., "graph algorithm", "string parsing", "geometry proof"). Query the experience memory for the top-k most semantically relevant insights. These become the experience_prior that biases the current search.
Apply Upper Confidence Bound to balance exploitation (high-scoring nodes) with exploration (under-visited nodes). For the first iteration, this is trivially the root.
Using the current node's meta-prompt, the retrieved experiences, and the problem statement, generate a new candidate solution. This is a full attempt at solving the problem -- not a partial step.
Compare the new candidate against the current best (or baseline) using the 7-stage PE-EMP process:
Update Q-values from the scored child node back to the root using decay: Q(parent) = (1 - gamma) * Q(parent) + gamma * Q(child). This lets strong insights propagate efficiently.
Loop steps 3-6 for the remaining rollout budget. Each iteration uses the evolved meta-prompt from the previous PE-EMP cycle, so the search narrows toward high-quality solutions while the criteria themselves sharpen.
After all rollouts, select the node with the highest Q-value as the final answer.
Feed the best insights from this problem's search into the Memory Optimization Agent. Execute atomic operations:
Present the best solution to the user. Store the updated experience memory for use in subsequent problems.
Example 1: Debugging a series of failing tests
User: I have 5 failing tests in my payment processing module. Help me fix them systematically.
Approach:
1. Initialize memory store (empty). Classify task: "unit test debugging / payment logic."
2. For Test 1, generate an initial fix attempt (Rollout 1).
3. PE-EMP: Compare fix against the failing state. Criteria generated:
- "Correctness of currency rounding" (weight: 40%)
- "Handling of null/zero amounts" (weight: 35%)
- "Consistency with existing payment model" (weight: 25%)
4. Rollout 2: Evolved meta-prompt now includes "verify rounding to 2 decimal
places before comparing amounts." Generate improved fix. Score higher.
5. After 4 rollouts, best fix passes Test 1. Distill insight to memory:
ADD: "Payment amount comparisons require epsilon-based floating point
comparison or integer cent conversion. Direct equality fails."
6. For Test 2, retrieve the insight from Test 1 as experience_prior.
First rollout already avoids the rounding trap. Search converges faster.
7. Continue through Tests 3-5, accumulating insights. By Test 5, the memory
contains 4 distilled principles that make the fix nearly immediate.
Output:
- 5 targeted code fixes, each informed by lessons from previous fixes
- A distilled set of payment-module heuristics for future debugging
Example 2: Solving a complex algorithmic problem
User: Find the minimum number of operations to transform string A into string B,
where allowed operations are: insert, delete, substitute, and transpose adjacent.
Approach:
1. Classify: "string algorithm / edit distance variant (Damerau-Levenshtein)."
No prior experiences. Initialize with generic meta-prompt.
2. Rollout 1: Generate a standard edit distance DP solution.
3. PE-EMP scores against baseline (brute force). Criteria:
- "Handles transposition correctly" (45%) -- candidate scores 3/10
(missing transpose case)
- "Time complexity" (30%) -- scores 8/10
- "Edge cases (empty strings, single char)" (25%) -- scores 7/10
4. Evolved meta-prompt: "Ensure the recurrence relation includes a
d[i-2][j-2] + cost term for adjacent transpositions when s[i]==t[j-1]
and s[i-1]==t[j]."
5. Rollout 2: Generates correct Damerau-Levenshtein with transpose handling.
PE-EMP scores: transpose 9/10, complexity 8/10, edge cases 6/10.
Insight: "Off-by-one in boundary check for i<2 or j<2."
6. Rollout 3: Fixes boundary. All criteria score 9+/10.
7. Memory update:
ADD: "Damerau-Levenshtein requires a separate check for i>=2 and j>=2
before accessing d[i-2][j-2]. Initialize boundary with INF sentinel."
Output:
- Correct Damerau-Levenshtein implementation with O(nm) time and space
- Documented reasoning trace showing how each rollout improved on the last
Example 3: Iterative prompt engineering for a code generation task
User: Write a function that parses arbitrary nested JSON-like config files
with comments, environment variable interpolation, and include directives.
Approach:
1. Classify: "parser implementation / recursive descent with extensions."
2. Rollout 1: Generate a basic recursive JSON parser.
3. PE-EMP comparison against requirements. Criteria:
- "Comment stripping (// and /* */)" (25%) -- 0/10, not implemented
- "Env var interpolation ${VAR}" (25%) -- 0/10, not implemented
- "Include directive handling" (25%) -- 0/10, not implemented
- "Core JSON correctness" (25%) -- 7/10
4. Evolved meta-prompt: "Build a two-pass parser: first pass resolves
includes and strips comments, second pass handles interpolation and
JSON parsing. Use a tokenizer to avoid breaking strings."
5. Rollout 2: Implements two-pass approach. Scores improve to 7/10 across
all criteria. Insight: "Nested includes can cause circular references."
6. Rollout 3: Adds include cycle detection via visited-path set.
Evolved meta-prompt adds: "Track interpolation depth to prevent
infinite ${VAR} -> ${VAR} recursion."
7. Rollout 4: Final solution handles all cases. Memory distilled:
ADD: "Config parsers need cycle detection for both includes and
variable interpolation. Maintain a visited set for each."
ADD: "Two-pass architecture (structural then semantic) prevents
comment/string ambiguity in config formats."
Output:
- Production-quality config parser with all requested features
- Clear separation of concerns across parsing phases
Do:
Avoid:
Lu, H., Huang, H., Zhou, Y., Li, C., & Zhu, N. (2026). Empirical-MCTS: Continuous Agent Evolution via Dual-Experience Monte Carlo Tree Search. arXiv:2602.04248v1. https://arxiv.org/abs/2602.04248v1
Key sections to study: Algorithm 1 (full search loop), Section on PE-EMP's 7-stage cognitive process, and the Memory Optimization Agent's atomic operations (Add/Modify/Merge/Delete). The ablation study confirms both components are essential -- removing either PE-EMP or memory drops AIME25 performance from 76.7% to 56.7%.
development
Audit LLM-based automatic short answer grading (ASAG) systems for adversarial vulnerabilities using token-level and prompt-level attack strategies from the GradingAttack framework. Triggers: 'test grading robustness', 'adversarial attack on grading', 'audit LLM grader', 'red-team answer grading', 'ASAG vulnerability assessment', 'grading fairness attack'
development
Build structured information-seeking agents that decompose complex queries into multi-turn search-and-browse workflows, aggregate results from multiple web sources, and return answers in typed structured formats (items, sets, lists, tables). Applies the GISA benchmark's ReAct-based agent architecture and evaluation methodology. Trigger phrases: "build an information-seeking agent", "search agent pipeline", "multi-turn web research agent", "structured web search workflow", "aggregate information from multiple sources", "web research with structured output"
data-ai
Optimize LLM prompts using GFlowPO's iterative generate-evaluate-refine loop with diversity-preserving exploration and dynamic memory. Use when: 'optimize this prompt', 'find a better prompt for this task', 'prompt engineering with examples', 'auto-tune my system prompt', 'improve prompt accuracy', 'generate prompt variations'.
development
Constrain LLM generation with executable Pydantic schemas and multi-agent pipelines to produce structurally valid, domain-rich artifacts. Uses ontology-as-grammar to eliminate hallucinated structures while preserving creative output. Trigger phrases: "generate a valid game design", "schema-constrained generation", "build a multi-agent pipeline with Pydantic validation", "ontology-driven content generation", "structured creative generation with DSPy", "generate artifacts that pass domain validation".