skills/deltaevolve-accelerating-scientific-discovery/SKILL.md
Iteratively evolve code solutions using momentum-driven semantic deltas instead of full-code histories. Use when: 'evolve a better heuristic for bin packing', 'optimize this algorithm iteratively', 'use LLM-driven evolution to improve this function', 'find a better solution through evolutionary search', 'iteratively refine this solver', 'apply DeltaEvolve to discover a better algorithm'.
npx skillsauth add ndpvt-web/arxiv-claude-skills deltaevolve-accelerating-scientific-discovery
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to iteratively evolve code solutions toward optimal performance using the DeltaEvolve framework. Instead of tracking full code snapshots between iterations (as in AlphaEvolve/FunSearch), DeltaEvolve captures structured semantic deltas -- concise descriptions of what changed and why -- between successive program versions. These deltas accumulate into a momentum vector that steers the LLM toward productive search directions while consuming ~37% fewer tokens than full-code approaches. The technique applies to any problem where a candidate program can be scored by an evaluator: heuristic design, symbolic regression, optimization solvers, algorithm discovery, and more.
DeltaEvolve formalizes LLM-driven evolution as an Expectation-Maximization loop. In the E-step, the LLM samples candidate programs by applying delta-based modifications to a parent solution. In the M-step, the system evaluates candidates, updates a database of scored deltas, and selects context for the next iteration. The critical insight is that the optimizable context -- what the LLM sees in its prompt -- matters far more than scalar fitness scores. Ablation studies show that removing numerical scores barely hurts performance, but random context selection causes collapse. This means the system works primarily through in-context learning from well-chosen examples, not regression on scores.
The core innovation is replacing full-code histories with a three-level delta representation. Level 1 is a one-line summary ("FROM: greedy-by-weight TO: greedy-by-density-with-lookahead", ~30 tokens). Level 2 is a structured delta plan listing each modified component with old logic, new logic, and the hypothesis behind the change. Level 3 is the full executable code. A progressive disclosure mechanism feeds older iterations as Level-1 summaries ("ancient history"), recent iterations as Level-2 plans ("recent insights"), and only the parent node as Level-3 full code ("immediate context"). This mirrors how momentum in SGD accumulates successive parameter updates (x_i - x_{i-1}): the semantic deltas capture program differences (p_i - p_{i-1}), and their accumulated trajectory encodes the prevailing direction of improvement.
The population is managed via an island model with multiple subpopulations. Parent selection is stochastic but biased toward high-reward nodes. Elite nodes (top-k scorers) and diverse nodes (selected via a MAP-Elites grid over behavioral features) are included in context to balance exploitation and exploration.
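The selection rules above can be sketched for a single island as follows. This is a minimal illustration, not the paper's implementation: the softmax weighting, the `temperature` parameter, the node fields (`id`, `score`, `features`), and the two-dimensional feature grid are all assumptions made for the example.

```python
import math
import random
from typing import Any, Dict, Iterable, List, Tuple

Node = Dict[str, Any]  # hypothetical record: {"id": int, "score": float, "features": (f1, f2)}


def select_parent(nodes: List[Node], rng: random.Random, temperature: float = 0.1) -> Node:
    """Stochastic parent choice biased toward high scores (softmax weights, an assumed scheme)."""
    best = max(n["score"] for n in nodes)
    weights = [math.exp((n["score"] - best) / temperature) for n in nodes]
    return rng.choices(nodes, weights=weights, k=1)[0]


def select_elites(nodes: List[Node], k: int = 3) -> List[Node]:
    """Top-k nodes by score."""
    return sorted(nodes, key=lambda n: n["score"], reverse=True)[:k]


def select_diverse(nodes: List[Node], bins_per_axis: int = 4, limit: int = 2,
                   exclude_ids: Iterable[int] = ()) -> List[Node]:
    """MAP-Elites-style grid: keep the best node per feature cell, return unused cell champions."""
    excluded = set(exclude_ids)
    grid: Dict[Tuple[int, int], Node] = {}
    for n in nodes:
        f1, f2 = n["features"]  # assumed to be normalized to [0, 1]
        cell = (min(int(f1 * bins_per_axis), bins_per_axis - 1),
                min(int(f2 * bins_per_axis), bins_per_axis - 1))
        if cell not in grid or n["score"] > grid[cell]["score"]:
            grid[cell] = n
    champions = [n for n in grid.values() if n["id"] not in excluded]
    return sorted(champions, key=lambda n: n["score"], reverse=True)[:limit]
```

Lowering `temperature` makes parent selection greedier; raising it spreads probability across weaker nodes, which is one way to trade exploitation for exploration.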
Define the problem interface. Write a solve(input) -> output function signature and an evaluate(output) -> float scoring function. The scoring function must be deterministic and fast -- it runs on every candidate. Clearly specify what "higher is better" means.
Create the seed program. Write a simple baseline implementation of solve(). This can be a naive greedy heuristic, a random approach, or even a stub. It must be runnable and produce a valid score. Store this as iteration 0 in the evolution database.
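As a concrete illustration of the interface and seed described above, here is a minimal sketch for offline bin packing. The problem choice, the first-fit seed, and the scoring convention (negative bin count, so that higher is better) are assumptions for the example, not requirements of the method.

```python
from typing import List

CAPACITY = 1.0
EPS = 1e-9  # tolerance for floating-point sums


def solve(items: List[float]) -> List[List[float]]:
    """Seed program: first-fit packing into unit-capacity bins."""
    bins: List[List[float]] = []
    for item in items:
        for b in bins:
            if sum(b) + item <= CAPACITY + EPS:
                b.append(item)
                break
        else:  # no existing bin had room
            bins.append([item])
    return bins


def evaluate(output: List[List[float]]) -> float:
    """Deterministic score, higher is better: negative number of bins used.

    Invalid packings (an overfull bin) receive the worst possible score.
    """
    if any(sum(b) > CAPACITY + EPS for b in output):
        return float("-inf")
    return -float(len(output))
```

Running `evaluate(solve(items))` on a fixed set of instances gives the baseline score to record for iteration 0.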
Initialize the delta database. Create a structured store (list or dict) with entries of the form:
    {
      "iteration": 0,
      "level1_summary": "Baseline: naive greedy by first-fit",
      "level2_plan": [],
      "code": "<full source of seed>",
      "score": <baseline_score>,
      "parent": null
    }
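In Python, the same entry can be kept as a plain dict in a list. The sketch below is one possible layout: the `make_node` helper, the plan-step keys (`component`, `old`, `new`, `hypothesis`), the seed source, and the scores are placeholders chosen for illustration.

```python
import json
from typing import Any, Dict, List, Optional


def make_node(iteration: int, level1_summary: str, level2_plan: List[Dict[str, str]],
              code: str, score: float, parent: Optional[int]) -> Dict[str, Any]:
    """Build one database entry holding all three delta levels."""
    return {
        "iteration": iteration,
        "level1_summary": level1_summary,
        "level2_plan": level2_plan,
        "code": code,
        "score": score,
        "parent": parent,  # iteration number of the parent, None for the seed
    }


database: List[Dict[str, Any]] = []

# Iteration 0: the seed program (placeholder source and score).
seed_source = "def solve(items):\n    return list(items)\n"
database.append(make_node(0, "Baseline: naive greedy by first-fit", [], seed_source, 0.72, None))

# A later entry carries a Level-2 plan and a link back to its parent.
database.append(make_node(
    1,
    "FROM greedy-first-fit TO greedy-best-fit",
    [{"component": "bin selection", "old": "first bin that fits",
      "new": "bin with least remaining space", "hypothesis": "tighter fits waste less"}],
    "<full source of child>",
    0.78,
    0,
))

serialized = json.dumps(database, indent=2)  # None is written as JSON null
```

Keeping entries JSON-serializable makes it easy to checkpoint the database between iterations.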
Select context nodes for the prompt. For each new iteration, pick:
Construct the evolution prompt with progressive disclosure. Build the LLM prompt as:
    ## Problem: <problem description and evaluation criteria>
    ## Ancient History (Level 1 summaries):
    - Iter 2: FROM greedy-first-fit TO greedy-best-fit | Score: 0.72 -> 0.78
    - Iter 5: FROM greedy-best-fit TO density-aware-best-fit | Score: 0.78 -> 0.83
    ## Recent Insights (Level 2 delta plans):
    - Iter 8: Changed bin selection from min-remaining to max-density.
      Hypothesis: denser packing leaves larger contiguous gaps. Result: +0.03
    ## Current Parent (Level 3 full code):
    <complete source code of parent, score=0.86>
    ## Elite Scores: [0.91, 0.89, 0.86]
    ## Task: Generate a new variant. Output a delta summary, delta plan, then full code.
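A prompt of this shape could be assembled roughly as follows. This is a sketch under assumptions: the node fields, the plan-step keys, and the `recent_window` cutoff between "ancient" and "recent" history are illustrative choices, not values taken from the paper.

```python
from typing import Any, Dict, List

Node = Dict[str, Any]


def render_level1(node: Node) -> str:
    """One-line summary for older iterations."""
    return f"- Iter {node['iteration']}: {node['level1_summary']} | Score: {node['score']:.2f}"


def render_level2(node: Node) -> str:
    """Structured delta plan for recent iterations."""
    lines = [f"- Iter {node['iteration']}:"]
    for step in node["level2_plan"]:
        lines.append(
            f"  {step['component']}: {step['old']} -> {step['new']}. "
            f"Hypothesis: {step['hypothesis']}"
        )
    return "\n".join(lines)


def build_prompt(problem: str, history: List[Node], parent: Node,
                 elite_scores: List[float], recent_window: int = 3) -> str:
    """Progressive disclosure: Level 1 for old nodes, Level 2 for recent ones, Level 3 for the parent.

    `history` is ordered oldest first and does not include the parent.
    """
    split = max(len(history) - recent_window, 0)
    ancient, recent = history[:split], history[split:]
    parts = [
        f"## Problem: {problem}",
        "## Ancient History (Level 1 summaries):",
        *[render_level1(n) for n in ancient],
        "## Recent Insights (Level 2 delta plans):",
        *[render_level2(n) for n in recent],
        "## Current Parent (Level 3 full code):",
        parent["code"],
        f"## Elite Scores: {elite_scores}",
        "## Task: Generate a new variant. Output a delta summary, delta plan, then full code.",
    ]
    return "\n".join(parts)
```

Only the parent contributes full source, so prompt length grows with the number of summaries and plans rather than with the number of stored programs.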
Generate a candidate. Invoke the LLM with the constructed prompt. Parse the response into three levels using delimiters:
- #Delta-Summary-Start ... #Delta-Summary-End -> Level 1
- #Delta-Plan-Start ... #Delta-Plan-End -> Level 2
- #Code-Start ... #Code-End -> Level 3 (executable code)
Evaluate the candidate. Run the extracted code through the evaluation function. Record the score. Compute the delta: delta_score = new_score - parent_score. Tag the delta as "Improved" or "Degraded".
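Parsing the delimited response and tagging the score delta might look like the sketch below. The function names are hypothetical, and treating a zero delta as "Degraded" is an assumption, since the text only names the two tags.

```python
import re
from typing import Dict, Tuple

# Maps the output key to the delimiter name used in the response format.
_SECTIONS = {"summary": "Delta-Summary", "plan": "Delta-Plan", "code": "Code"}


def parse_response(text: str) -> Dict[str, str]:
    """Split an LLM response into Level 1 (summary), Level 2 (plan), and Level 3 (code)."""
    levels: Dict[str, str] = {}
    for key, name in _SECTIONS.items():
        match = re.search(rf"#{name}-Start(.*?)#{name}-End", text, re.DOTALL)
        if match is None:
            raise ValueError(f"missing #{name}-Start ... #{name}-End block")
        levels[key] = match.group(1).strip()
    return levels


def tag_delta(new_score: float, parent_score: float) -> Tuple[float, str]:
    """Return (delta_score, tag); a delta of zero or less counts as 'Degraded' here."""
    delta_score = new_score - parent_score
    return delta_score, "Improved" if delta_score > 0 else "Degraded"
```

A `ValueError` from `parse_response` is a natural place to trigger a retry with few-shot examples of the delimiter format.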
Update the database. Store the new node with all three levels, the score, and a link to its parent. If the candidate improves on the parent, it becomes eligible for future parent selection. If it degrades, it still enters the database (negative results inform future context selection by showing what not to do).
Repeat for N iterations. Repeat the cycle from context selection through the database update for a fixed budget of iterations (typically 20-100). Monitor for convergence: if the top score has not improved in 10 iterations, increase the diversity selection weight or reset one island with a fresh random seed.
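The convergence check can be sketched as a small helper over the per-iteration candidate scores. The function names, the `step` size, and the `cap` on the diversity weight are assumptions for illustration; only the 10-iteration patience comes from the text above.

```python
from typing import List


def stalled(scores: List[float], patience: int = 10) -> bool:
    """True if the best score seen in the last `patience` iterations does not beat the earlier best."""
    if len(scores) <= patience:
        return False  # not enough history to judge
    return max(scores[-patience:]) <= max(scores[:-patience])


def adjust_diversity(weight: float, scores: List[float], patience: int = 10,
                     step: float = 0.1, cap: float = 1.0) -> float:
    """Raise the diversity selection weight when the search has stalled; otherwise leave it alone."""
    if stalled(scores, patience):
        return min(cap, weight + step)
    return weight
```

Calling `adjust_diversity` once per iteration keeps the weight unchanged while the search is still improving and nudges it upward only after a full stalled window.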
Extract the best solution. Return the highest-scoring program from the database along with its full ancestry chain of Level-1 summaries, which serves as a human-readable explanation of how the solution was discovered.
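Recovering the ancestry chain amounts to following parent links back to the seed. The sketch below assumes each entry stores its parent's iteration number under `parent` (with `None` for the seed); the function names are hypothetical.

```python
from typing import Any, Dict, List, Tuple

Node = Dict[str, Any]


def ancestry_summaries(database: List[Node], iteration: int) -> List[str]:
    """Level-1 summaries from the seed down to the given node, oldest first."""
    by_iteration = {node["iteration"]: node for node in database}
    chain: List[str] = []
    current = iteration
    while current is not None:
        node = by_iteration[current]
        chain.append(node["level1_summary"])
        current = node["parent"]  # None once the seed is reached
    chain.reverse()
    return chain


def best_solution(database: List[Node]) -> Tuple[str, List[str]]:
    """Return the highest-scoring program's code and its discovery path."""
    best = max(database, key=lambda node: node["score"])
    return best["code"], ancestry_summaries(database, best["iteration"])
```

The returned list reads as a short changelog of the search, for example "first-fit", then "best-fit", then the final variant.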
Example 1: Evolving a Bin Packing Heuristic
User: "I have an online bin packing problem. Items arrive one at a time with sizes in (0,1]. I need a function that decides which bin to place each item in. Help me evolve a good heuristic."
Approach:
place_item(item_size, bins) -> bin_index and evaluator that measures total bins used on 1000 random instances.
Output:
    def place_item(item_size, bins):
        """Evolved heuristic: adaptive threshold with density scoring.
        Discovery path: first-fit -> best-fit -> lookahead -> density -> adaptive-threshold
        """
        THRESHOLD = 0.85
        best_bin, best_score = -1, float('inf')
        for i, remaining in enumerate(bins):
            if remaining >= item_size:
                utilization = 1.0 - remaining
                if utilization > THRESHOLD:
                    score = remaining - item_size  # tight packing mode
                else:
                    score = abs(remaining - item_size * 2.2)  # density mode
                if score < best_score:
                    best_score, best_bin = score, i
        if best_bin == -1:
            bins.append(1.0 - item_size)
            return len(bins) - 1
        bins[best_bin] -= item_size
        return best_bin
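An evaluator for heuristics with this signature could look like the sketch below. It is an assumption-laden illustration: the uniform item distribution, `items_per_instance`, the fixed seed, and the `first_fit` baseline are not from the source, which specifies only that total bins are measured on 1000 random instances.

```python
import random
from typing import Callable, List


def first_fit(item_size: float, bins: List[float]) -> int:
    """Baseline heuristic; `bins` holds the remaining capacity of each open bin."""
    for i, remaining in enumerate(bins):
        if remaining >= item_size:
            bins[i] -= item_size
            return i
    bins.append(1.0 - item_size)
    return len(bins) - 1


def evaluate(place_item: Callable[[float, List[float]], int], instances: int = 1000,
             items_per_instance: int = 50, seed: int = 0) -> float:
    """Negative mean bin count over seeded random instances (higher is better)."""
    rng = random.Random(seed)  # fixed seed keeps the score deterministic
    total_bins = 0
    for _ in range(instances):
        bins: List[float] = []
        for _ in range(items_per_instance):
            size = 1.0 - rng.random()  # uniform in (0, 1]
            place_item(size, bins)
        total_bins += len(bins)
    return -total_bins / instances
```

Because every candidate sees the same seeded instances, score differences reflect the heuristic rather than sampling noise.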
Example 2: Symbolic Regression
User: "I have (x, y) data from a physics experiment. Find a compact formula that fits it. Use evolutionary search."
Approach:
formula(x) -> y and evaluator = negative mean squared error on held-out data.
y = x (MSE = 14.3).
Output:
    import numpy as np

    def formula(x):
        """Discovered: damped oscillator with phase shift.
        y = 2.31 * exp(-0.15*x) * sin(1.87*x + 0.42) + 0.73*x
        """
        return 2.31 * np.exp(-0.15 * x) * np.sin(1.87 * x + 0.42) + 0.73 * x
Example 3: Optimizing a Sort Comparator
User: "I have a custom sort for ranking search results with multiple signals (relevance, recency, popularity). Evolve a better ranking function."
Approach:
score = 0.5*relevance + 0.3*recency + 0.2*popularity. NDCG = 0.71.
Output: the best-scoring ranking function with its delta ancestry as documentation.
| Issue | Symptom | Resolution |
|-------|---------|------------|
| Generated code doesn't parse | Syntax errors in extracted code block | Re-prompt with the error message appended. Use stricter delimiters. Fall back to parent. |
| Score regression across all candidates | Every candidate in a batch scores lower than parent | Widen diversity selection. Include more elite nodes. Reduce mutation aggressiveness by showing more Level-2 detail. |
| Evaluation timeout | Candidate contains infinite loops or exponential complexity | Wrap evaluation in a timeout. Assign worst-possible score to timed-out candidates. Add "must terminate in O(n log n)" to prompt constraints. |
| Prompt exceeds context window | Too many history nodes selected | Aggressively compress: show only Level-1 for all but the 3 most recent nodes. Reduce elite/diverse counts. |
| Premature convergence | All candidates become near-copies of the best solution | Reset one island with a random or adversarial seed. Temporarily increase the temperature of parent selection. |
| Delta parse failure | LLM doesn't produce delimited output | Retry with explicit few-shot examples of the delimiter format. If persistent, extract delta post-hoc by diffing parent and child code. |
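For the evaluation-timeout row, one way to enforce a time limit is to run each candidate in a separate interpreter process. This is a sketch under assumptions: the convention that a candidate script prints its score as the last line of stdout, the `WORST_SCORE` value, and the function name are all illustrative.

```python
import subprocess
import sys

WORST_SCORE = float("-inf")  # assigned to timed-out, crashing, or malformed candidates


def evaluate_with_timeout(candidate_source: str, timeout_s: float = 5.0) -> float:
    """Run a candidate script in a child interpreter and read its score from stdout.

    Assumed convention: the script prints a single float score as its last output line.
    """
    try:
        result = subprocess.run(
            [sys.executable, "-c", candidate_source],
            capture_output=True,
            text=True,
            timeout=timeout_s,  # the child is killed if it runs longer than this
        )
    except subprocess.TimeoutExpired:
        return WORST_SCORE
    if result.returncode != 0:  # syntax error or uncaught exception in the candidate
        return WORST_SCORE
    try:
        return float(result.stdout.strip().splitlines()[-1])
    except (IndexError, ValueError):
        return WORST_SCORE
```

Running candidates out of process also isolates the evolution loop from crashes in generated code, though it is not a security sandbox.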
Paper: DeltaEvolve: Accelerating Scientific Discovery through Momentum-Driven Evolution by Jiachen Jiang, Tianyu Ding, Zhihui Zhu (2026). Look for: the EM formalization (Section 3), the three-level delta database (Section 4.2), progressive disclosure rendering (Section 4.3), and ablation results in Table 1 showing context selection dominates over scalar feedback.