skills/darwin-dynamic-agentically-rewriting/SKILL.md
Evolutionary multi-agent code optimization using genetic algorithms. Agents mutate each other's training/configuration code, benchmark results, and select survivors across generations. Use when: 'evolve my training config', 'optimize this code with genetic search', 'set up evolutionary hyperparameter tuning', 'multi-agent code mutation pipeline', 'self-improving training loop', 'darwin-style evolutionary optimization'.
npx skillsauth add ndpvt-web/arxiv-claude-skills darwin-dynamic-agentically-rewriting
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to implement DARWIN-style evolutionary optimization pipelines where multiple independent agents iteratively mutate, benchmark, and select code variants using a genetic algorithm structure. Rather than tuning parameters by hand or running grid search, you set up a population of code variants that evolve: agents rewrite each other's source code (training scripts, configs, pipelines), run benchmarks in isolated environments, and a selection step retains only the top performers for the next generation. Persistent JSON memory tracks what changes correlated with improvements, so mutations become increasingly informed over iterations.
DARWIN treats code optimization as an evolutionary process. A population of N code variants (typically 10) is maintained, each in an isolated working directory. Each generation, variants are paired and an LLM agent rewrites targeted sections of one variant's code — adjusting hyperparameters, refactoring data loading, changing optimization strategies, or restructuring model architecture code. This is the mutation step. The mutation is not random: the LLM receives the full code context, chunked by regex into imports, class definitions, and function bodies, along with a memory log of what previous changes helped or hurt. Each chunk has a mutation probability (default 0.3), and module-level imports are preserved to maintain dependency consistency.
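The chunking and per-chunk mutation probability described above can be sketched as follows. This is a minimal illustration, not the paper's parser: the `Chunk` type, the regex, and the function names are hypothetical, and the sketch assumes that everything before the first top-level `def` or `class` is the import header.

```python
from __future__ import annotations

import random
import re
from dataclasses import dataclass


@dataclass
class Chunk:
    kind: str     # "imports", "class", or "function"
    text: str     # exact source text of the chunk
    frozen: bool  # frozen chunks are never offered for mutation


# A chunk starts at a top-level `def`, `async def`, or `class`,
# including any single-line decorators directly above it.
_TOP_LEVEL = re.compile(
    r"^(?:@[^\n]*\n)*(?P<kw>async\s+def|def|class)\s", re.MULTILINE
)


def chunk_source(source: str) -> list[Chunk]:
    """Split a module into a frozen header plus one chunk per top-level block."""
    matches = list(_TOP_LEVEL.finditer(source))
    chunks: list[Chunk] = []

    # Assumption: the text before the first def/class is the import header.
    header = source[: matches[0].start()] if matches else source
    if header.strip():
        chunks.append(Chunk("imports", header, frozen=True))

    for i, m in enumerate(matches):
        end = matches[i + 1].start() if i + 1 < len(matches) else len(source)
        kind = "class" if m.group("kw") == "class" else "function"
        chunks.append(Chunk(kind, source[m.start():end], frozen=False))
    return chunks


def pick_for_mutation(
    chunks: list[Chunk], p: float = 0.3, rng: random.Random | None = None
) -> list[int]:
    """Return indices of non-frozen chunks, each chosen with probability p."""
    rng = rng or random.Random()
    return [
        i for i, c in enumerate(chunks)
        if not c.frozen and rng.random() < p
    ]
```

Because each chunk keeps its exact source text, concatenating the chunks in order reproduces the file, which makes reassembly after mutation straightforward. A regex splitter like this is fooled by multi-line decorators and by `def` at column 0 inside a docstring; a parser built on Python's `ast` module would be more robust, at the cost of departing from the regex approach the text describes.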
After mutation, all variants train in parallel in isolated containers or directories. A standardized benchmark evaluates each variant on metrics like perplexity and model FLOPS utilization (MFU). The selection step retains the top K performers (typically 4 out of 10) as parents for the next generation. Failed variants get one error-correction attempt before being dropped. The population is replenished to N by sampling from survivors before the next mutation cycle begins.
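A minimal sketch of the selection and replenishment step, assuming scores are held in a dictionary keyed by agent ID and that dropped variants are recorded as `None`. The function name and return shape are illustrative, not taken from the paper.

```python
from __future__ import annotations

import random


def select_and_replenish(
    scores: dict[str, float | None],
    n: int = 10,
    k: int = 4,
    lower_is_better: bool = True,
    rng: random.Random | None = None,
) -> tuple[list[str], list[str]]:
    """Rank agents, keep the top k, and refill the population to n.

    Returns (survivors, parents), where parents[i] names the survivor
    whose code should be copied into slot i of the next generation.
    """
    rng = rng or random.Random()

    # Dropped variants (metric is None) never compete.
    valid = {agent: s for agent, s in scores.items() if s is not None}
    if not valid:
        raise ValueError("no variant produced a benchmark score")

    ranked = sorted(valid, key=valid.get, reverse=not lower_is_better)
    survivors = ranked[:k]

    # Survivors keep their own slots; the rest are sampled with replacement.
    refill = [rng.choice(survivors) for _ in range(n - len(survivors))]
    return survivors, survivors + refill
```

For perplexity or runtime, keep `lower_is_better=True`; for accuracy or MFU, pass `lower_is_better=False`.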
The critical differentiator is the persistent JSON memory system. Every mutation is logged with timestamps, file paths, LLM-generated summaries of changes, and the resulting performance delta. This memory is fed back into mutation prompts, creating an increasingly informed search. Ablation experiments showed removing memory degraded performance by ~3%. A bidirectional human-in-the-loop interface lets agents request structural changes (new datasets, libraries, file reorganization) that require human approval, keeping the system safe while allowing it to break out of local optima.
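One way to implement this memory log is sketched below. It assumes the memory file is a single JSON array and uses perplexity as the tracked metric; the helper names `log_mutation` and `memory_for_prompt` are hypothetical.

```python
from __future__ import annotations

import json
from datetime import datetime, timezone
from pathlib import Path


def log_mutation(
    memory_path: str,
    *,
    generation: int,
    agent_id: str,
    file: str,
    chunk: str,
    change_summary: str,
    parent_perplexity: float,
    result_perplexity: float,
) -> dict:
    """Append one mutation record to the memory file and return it."""
    path = Path(memory_path)
    entries = json.loads(path.read_text()) if path.exists() else []
    entry = {
        "generation": generation,
        "agent_id": agent_id,
        "timestamp": datetime.now(timezone.utc).isoformat(timespec="seconds"),
        "file": file,
        "chunk": chunk,
        "change_summary": change_summary,
        "parent_perplexity": parent_perplexity,
        "result_perplexity": result_perplexity,
        # Negative delta means the mutation lowered (improved) perplexity.
        "delta": round(result_perplexity - parent_perplexity, 4),
    }
    entries.append(entry)
    path.write_text(json.dumps(entries, indent=2))
    return entry


def memory_for_prompt(memory_path: str, chunk: str, limit: int = 5) -> list[str]:
    """Return short summaries of the most recent logged changes to one chunk."""
    path = Path(memory_path)
    if not path.exists():
        return []
    entries = [e for e in json.loads(path.read_text()) if e["chunk"] == chunk]
    return [
        f'gen {e["generation"]}: {e["change_summary"]} (delta {e["delta"]:+.2f})'
        for e in entries[-limit:]
    ]
```

The lines returned by `memory_for_prompt` can be pasted into a mutation prompt so the LLM sees which earlier edits to the same chunk helped or hurt. Rewriting the whole file on every append is fine at this scale; with several agents writing concurrently, a lock or one file per agent would be safer.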
1. Define the baseline: Identify the code to optimize (training script, config file, pipeline). Establish a working version that runs successfully and produces measurable benchmark metrics (loss, perplexity, throughput, accuracy, latency, etc.).
2. Set up the population structure: Create N isolated working directories (e.g., agents/agent_00/ through agents/agent_09/), each containing a copy of the baseline code. Use Docker containers or separate virtualenvs if mutations might introduce dependency changes.
3. Implement the code chunker: Write a regex-based parser that segments each target source file into chunks: module-level imports (frozen by default), class definitions, and function bodies. This allows selective mutation within token limits and preserves structural integrity.
4. Build the mutation prompt: For each chunk eligible for mutation (probability 0.3), construct a prompt containing: (a) the chunk source code, (b) global variables and file structure metadata, (c) summaries of other already-mutated chunks in this pass, and (d) relevant entries from the memory log showing what prior changes improved or degraded performance.
5. Execute mutations via LLM: Send mutation prompts to the LLM (Claude, GPT-4o-mini, etc.). The LLM returns modified code for each chunk. Reassemble the full source file, preserving import blocks. Write the mutated code to the agent's working directory.
6. Run parallel training/benchmarking: Execute each agent's code in its isolated directory with standardized parameters. Capture benchmark metrics (perplexity, MFU, loss, accuracy, wall-clock time). If a variant crashes, send the error traceback back to the LLM for one correction attempt; drop the variant if it fails again.
7. Log results to memory: For each agent, write a JSON entry recording: timestamp, file path modified, LLM summary of changes made, benchmark metrics achieved, and the delta from baseline/parent. Append to the persistent memory file.
8. Select survivors: Rank all agents by the target metric. Retain the top K (e.g., 4 out of 10) as parents for the next generation. Replenish the population to N by copying code from randomly sampled survivors into empty agent slots.
9. Iterate: Repeat steps 4-8 for the target number of generations. Monitor for convergence (diminishing improvement deltas) or divergence (increasing error rates).
10. Extract and report the best variant: After all generations complete, identify the top-performing agent. Diff its code against the original baseline to produce a human-readable summary of all beneficial changes discovered.
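The benchmarking step above (run each variant, allow one correction attempt on a crash, then drop it) can be sketched as below. The benchmark runner and the LLM repair call are passed in as plain callables because their real implementations depend on your setup; both parameter names are hypothetical.

```python
from __future__ import annotations

import traceback
from typing import Callable


def evaluate_with_retry(
    code: str,
    run_benchmark: Callable[[str], float],
    fix_code: Callable[[str, str], str],
) -> tuple[float | None, str]:
    """Benchmark a variant, allowing exactly one LLM correction attempt.

    run_benchmark(code) returns the metric or raises on a crash.
    fix_code(code, traceback_text) returns repaired code.
    Returns (metric, final_code); metric is None if the variant is dropped.
    """
    for attempt in range(2):
        try:
            return run_benchmark(code), code
        except Exception:
            if attempt == 1:
                break  # second failure: drop the variant
            # First failure: hand the traceback to the LLM once.
            code = fix_code(code, traceback.format_exc())
    return None, code
```

In a real pipeline `run_benchmark` would typically launch the training script in the agent's isolated directory or container and parse the metric from its output. Returning `None` for dropped variants lets the selection step filter them out before ranking.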
Example 1: Evolving a nanoGPT training script
User: "I have a nanoGPT training script that gets 38.5 perplexity. Can you set up an evolutionary loop to optimize it?"
Approach:
- Copy train.py into 10 agent directories
- Split train.py into chunks: imports, get_batch(), model config block, training loop, optimizer setup
- Log each mutation and its benchmark result to memory.json
Output structure:
darwin_workspace/
  memory.json            # Persistent mutation/performance log
  baseline/train.py      # Original unmodified script
  agents/
    agent_00/train.py    # Mutated variant
    agent_01/train.py
    ...
    agent_09/train.py
  results/
    generation_0.json    # Per-generation benchmark results
    generation_1.json
    ...
    best_diff.patch      # Diff of best variant vs baseline
    summary.md           # Human-readable optimization report
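A file like best_diff.patch can be produced with Python's standard `difflib`. The sketch below is one possible approach; the function name and the default file labels (including `agent_03` as the winner) are placeholders.

```python
import difflib


def make_patch(
    baseline_src: str,
    best_src: str,
    baseline_name: str = "baseline/train.py",
    best_name: str = "agents/agent_03/train.py",  # hypothetical winner
) -> str:
    """Return a unified diff from the baseline to the best variant."""
    diff = difflib.unified_diff(
        baseline_src.splitlines(keepends=True),
        best_src.splitlines(keepends=True),
        fromfile=baseline_name,
        tofile=best_name,
    )
    return "".join(diff)
```

An empty string means the best variant is identical to the baseline. Otherwise the result can be written to disk as the patch file and summarized for the report.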
Sample memory.json entry:
{
  "generation": 2,
  "agent_id": "agent_03",
  "timestamp": "2026-02-13T14:22:01Z",
  "file": "train.py",
  "chunk": "get_batch",
  "change_summary": "Replaced dual np.memmap calls with single shared memmap, added prefetch buffer of 4 batches",
  "parent_perplexity": 38.12,
  "result_perplexity": 37.85,
  "delta": -0.27,
  "mfu": 0.401
}
Example 2: Optimizing a data pipeline config
User: "I have a Spark ETL pipeline config that takes 45 minutes. Set up a DARWIN-style evolutionary search over the config parameters."
Approach:
- Apply the same workflow to pipeline_config.yaml, benchmarking each variant on a 10% data sample
Output:
Generation 0: Best runtime 4m32s (baseline 4m30s on 10% sample)
Generation 1: Best runtime 4m18s — increased parallelism in shuffle stage
Generation 2: Best runtime 3m55s — switched to broadcast join for small tables
Generation 3: Best runtime 3m48s — reduced partition count post-filter, added compression
Final improvement: ~15% reduction in runtime on benchmark sample
Example 3: Multi-agent prompt optimization
User: "I have a classification prompt that gets 72% accuracy. Can we evolve it?"
Approach:
Output:
Baseline accuracy: 72%
Gen 1 best: 74% — added explicit negative examples to few-shot section
Gen 3 best: 77% — restructured instructions as numbered constraints
Gen 5 best: 79% — replaced vague "classify appropriately" with decision tree
Final prompt saved to best_prompt.txt with full mutation history
DARWIN: Dynamic Agentically Rewriting Self-Improving Network — Henry Jiang, 2026. Focus on Section 3 (System Design) for the mutation/selection architecture, Section 4 (Experiments) for benchmarking methodology, and the memory system design for implementing informed mutations.