skills/contextevolve-multi-agent-context-compression/SKILL.md
Multi-agent iterative code optimization using context compression. Decomposes optimization into three agents (Summarizer, Navigator, Sampler) that mirror RL state/policy/replay to evolve better code across iterations. Trigger phrases: 'optimize this algorithm iteratively', 'evolve better code', 'multi-agent code optimization', 'compress optimization context', 'iterative code improvement with agents', 'ContextEvolve approach'
npx skillsauth add ndpvt-web/arxiv-claude-skills contextevolve-multi-agent-context-compression
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to iteratively optimize code using the ContextEvolve framework -- a three-agent system that achieves reinforcement-learning-level search efficiency without parameter updates. Instead of brute-force prompting or naive evolutionary search, it decomposes the optimization context into three orthogonal dimensions: semantic state compression (Summarizer), optimization direction distillation (Navigator), and prioritized exemplar retrieval (Sampler). These agents collaborate to pack maximum information into limited context windows, enabling principled code evolution that outperforms baselines by 33% while using 29% fewer tokens.
ContextEvolve establishes a functional isomorphism between reinforcement learning and a training-free multi-agent framework. Three agents decompose the optimization context:
Summarizer Agent (State Representation): Converts raw code into concise natural language abstracts that capture both inherited traits from the parent solution and novel modifications in the offspring. This code-to-language abstraction compresses high-dimensional code into dense semantic descriptions, freeing context space. Critically, the summarizer must preserve ancestral traits -- summarizing only novel changes causes an "amnesia effect" that loses valuable inherited properties (10%+ performance drop in ablations).
Navigator Agent (Policy Gradient): Analyzes trajectories of parent-child pairs with their score deltas to distill high-level optimization directions. It samples from three trajectory categories: consistent improvement, mixed fluctuation, and consistent decline. The navigator outputs deliberately loose directional guidance rather than specific implementation steps, because over-specificity narrows the solution space prematurely (32.7% performance collapse when too specific). Think of it as estimating which direction to explore, not dictating the exact path.
Sampler Agent (Experience Replay): Curates a small set of high-value exemplars from the full population buffer based on relevance to the current parent state and navigator guidance. It prioritizes informative semantics over raw score -- failed candidates with novel logic often drive breakthroughs. Restricting to only top-scoring exemplars caused 27.8% performance loss by preventing heuristic discovery. This mirrors prioritized experience replay in RL, where surprising transitions teach more than predictable ones.
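The three trajectory categories the Navigator samples from can be illustrated with a small helper. This is a minimal sketch, not the paper's procedure: it assumes a trajectory is the list of scores along one lineage, and the names `classify_trajectory`, `sample_trajectories`, and the `tol` threshold are hypothetical.

```python
import random
from typing import Dict, List


def classify_trajectory(scores: List[float], tol: float = 0.0) -> str:
    """Label a lineage's score sequence as 'improving', 'declining', or 'mixed'.

    Assumption: 'consistent' means every step moves in the same direction
    by more than `tol`; anything else counts as mixed fluctuation.
    """
    deltas = [b - a for a, b in zip(scores, scores[1:])]
    if not deltas:
        raise ValueError("a trajectory needs at least two scores")
    if all(d > tol for d in deltas):
        return "improving"
    if all(d < -tol for d in deltas):
        return "declining"
    return "mixed"


def sample_trajectories(lineages: List[List[float]],
                        per_category: int = 1,
                        seed: int = 0) -> List[List[float]]:
    """Draw up to `per_category` lineages from each of the three categories."""
    rng = random.Random(seed)
    groups: Dict[str, List[List[float]]] = {
        "improving": [], "mixed": [], "declining": [],
    }
    for scores in lineages:
        groups[classify_trajectory(scores)].append(scores)
    picked: List[List[float]] = []
    for members in groups.values():
        picked.extend(rng.sample(members, min(per_category, len(members))))
    return picked
```

Sampling from all three groups, rather than only from improving lineages, gives the Navigator evidence about what hurts as well as what helps.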
1. Define the evaluation function. Write or identify a concrete scoring function that takes generated code and returns a numeric fitness score. This must be deterministic and capture the optimization objective (e.g., execution time, correctness rate, combined weighted metrics). Without measurable feedback, the framework cannot guide search.
2. Initialize the Evolve Buffer. Generate 2-3 initial candidate solutions and evaluate each. Store tuples of (code, score, semantic_abstract) in a buffer data structure. The semantic abstract is a natural language summary of what the code does and its key design choices.
3. Select a parent from the buffer. Choose a candidate from the evolve buffer as the starting point for this iteration. Prefer higher-scoring candidates but maintain diversity -- don't always pick the top scorer.
4. Run the Summarizer Agent. Given the parent's existing abstract and any offspring code from the previous iteration, produce an updated natural language summary. The prompt should ask: "Given the parent's description [z_parent] and this new code [c_child], summarize the complete solution -- both inherited design choices and new modifications." Preserve ancestral traits explicitly.
5. Run the Navigator Agent. Sample 3-5 trajectory pairs from the buffer -- each pair consisting of (parent_abstract, child_abstract, score_delta). Include a mix of improving, declining, and mixed trajectories. Prompt the navigator: "Analyze these optimization trajectories and describe high-level directions that tend to improve performance. Be directional, not prescriptive -- suggest what kinds of changes help, not specific code." Output is a short paragraph of optimization guidance.
6. Run the Sampler Agent. Present the full buffer contents (abstracts and scores) along with the current parent abstract and navigator guidance to the sampler. Prompt: "Select 2-3 exemplars from this population that would be most informative as few-shot references for the next code generation. Prioritize diversity of approach and informativeness over raw score." Return the selected code examples.
7. Compose the generation context. Assemble the prompt for code generation by combining: (a) the parent's semantic abstract (not raw code -- this is the compression), (b) the navigator's directional guidance, (c) the sampler's curated exemplars with their scores. This composed context replaces naive concatenation of all previous code.
8. Generate offspring code. Prompt the code generator with the composed context and the task specification. The generator produces a new candidate solution informed by compressed state, directional guidance, and curated examples.
9. Evaluate and update the buffer. Run the evaluation function on the offspring. Generate a semantic abstract for the new code via the Summarizer. Store (offspring_code, score, abstract) in the evolve buffer. Track the best score seen so far.
10. Iterate or terminate. Repeat steps 3-9 for a fixed budget of iterations (typically 30-100). Terminate early if the score plateaus for 10+ consecutive iterations. Return the highest-scoring solution from the buffer.
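The loop described above can be sketched as a short driver. This is a sketch under stated assumptions, not the paper's Algorithm 1: the agents are passed in as plain callables (`summarize`, `navigate`, `sample`, `generate` are hypothetical stand-ins for LLM calls), parent selection uses rank-weighted sampling as one possible way to favor high scores while keeping diversity, and the Summarizer is run once per offspring at evaluation time.

```python
import random
from dataclasses import dataclass
from typing import Callable, List, Optional, Tuple


@dataclass
class Candidate:
    """One evolve-buffer entry: (code, score, abstract) plus lineage info."""
    code: str
    score: float
    abstract: str
    parent_abstract: Optional[str] = None  # lets a Navigator form (parent, child, delta)
    score_delta: float = 0.0


def select_parent(buffer: List[Candidate], rng: random.Random) -> Candidate:
    """Rank-weighted pick: favors high scores without always taking the top one."""
    ranked = sorted(buffer, key=lambda c: c.score)
    weights = list(range(1, len(ranked) + 1))  # worst gets weight 1, best gets len
    return rng.choices(ranked, weights=weights, k=1)[0]


def evolve(seeds: List[str],
           evaluate: Callable[[str], float],
           summarize: Callable[[Optional[str], str], str],
           navigate: Callable[[List[Candidate]], str],
           sample: Callable[[List[Candidate], Candidate, str], List[Candidate]],
           generate: Callable[[dict], str],
           max_iters: int = 30,
           patience: int = 10,
           seed: int = 0) -> Tuple[Candidate, List[Candidate]]:
    rng = random.Random(seed)
    # Initialize the buffer from the seed solutions.
    buffer = [Candidate(c, evaluate(c), summarize(None, c)) for c in seeds]
    best = max(buffer, key=lambda c: c.score)
    stale = 0  # consecutive iterations without a new best score
    for _ in range(max_iters):
        parent = select_parent(buffer, rng)
        guidance = navigate(buffer)
        exemplars = sample(buffer, parent, guidance)
        # Compressed context: parent abstract, guidance, curated exemplars.
        context = {
            "parent_abstract": parent.abstract,
            "guidance": guidance,
            "exemplars": [(e.code, e.score) for e in exemplars],
        }
        child_code = generate(context)
        child_score = evaluate(child_code)
        child = Candidate(child_code, child_score,
                          summarize(parent.abstract, child_code),
                          parent.abstract, child_score - parent.score)
        buffer.append(child)
        if child.score > best.score:
            best, stale = child, 0
        else:
            stale += 1
        if stale >= patience:  # plateau: stop early
            break
    return best, buffer
```

Passing the agents as callables keeps the driver testable with toy functions before any model is wired in; the `evaluate` callable must be deterministic for the plateau check to be meaningful.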
Example 1: Optimizing a load balancer assignment algorithm
User: "I have a GPU load balancer that assigns tasks to GPUs using round-robin. I need to optimize it for both speed and balance. Here's my evaluation function that scores solutions 0-100."
Approach:
Output (best solution summary):
Score: 91/100 (speed: 95, balance: 87)
Algorithm: Largest-remainder proportional apportionment with snake-order
GPU assignment using vectorized numpy operations. O(1) amortized per
task batch. Discovered at iteration 66 by reintegrating speed techniques
from iteration 10 lineage via exemplar retrieval.
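For readers unfamiliar with the technique named in that summary, largest-remainder apportionment can be sketched in a few lines. This is a plain-Python illustration of the general method only, not the evolved solution from the example (whose code is not shown here); the function name `largest_remainder` and the `capacities` parameter are hypothetical, and the snake-order assignment and numpy vectorization are omitted.

```python
from typing import List


def largest_remainder(total_tasks: int, capacities: List[float]) -> List[int]:
    """Split `total_tasks` across GPUs in proportion to `capacities`.

    Each GPU first gets the floor of its exact quota; leftover tasks go
    to the GPUs with the largest fractional remainders.
    """
    total_cap = sum(capacities)
    if total_cap <= 0:
        raise ValueError("capacities must sum to a positive number")
    quotas = [total_tasks * c / total_cap for c in capacities]
    counts = [int(q) for q in quotas]  # floor, since quotas are non-negative
    leftover = total_tasks - sum(counts)
    # Indices ordered by fractional remainder, largest first (ties keep input order).
    order = sorted(range(len(quotas)),
                   key=lambda i: quotas[i] - counts[i],
                   reverse=True)
    for i in order[:leftover]:
        counts[i] += 1
    return counts
```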
Example 2: Iteratively improving a SQL query optimizer heuristic
User: "Optimize my query reordering heuristic to maximize KV cache hit rate while minimizing reordering latency."
Approach:
Output (iteration log excerpt):
Iter 1: score=45 (baseline)
Iter 5: score=52 Navigator: "dependency-aware ordering improves locality"
Iter 12: score=67 Merged dependency-graph + batch-frequency approaches
Iter 25: score=74 Navigator: "pre-computed access matrices reduce runtime"
Iter 40: score=81 Best: clause clustering with precomputed affinity matrix
Buffer: 40 candidates, context per iteration: ~3.2K tokens (vs ~18K raw)
Example 3: Setting up the three-agent pipeline for a custom optimization task
User: "I want to use the ContextEvolve approach to optimize my packet scheduling algorithm."
Approach:
SUMMARIZER_PROMPT: "Given the parent solution description: {parent_abstract}
And this new implementation: {offspring_code}
Write a 3-5 sentence summary covering: (a) inherited design choices,
(b) new modifications, (c) key algorithmic properties (complexity, data structures).
Preserve description of ancestral traits even if unchanged."
NAVIGATOR_PROMPT: "Analyze these optimization trajectories:
{for each trajectory: parent_desc -> child_desc, score_change: +/-N}
What high-level directions tend to improve performance?
Be directional (e.g., 'reduce memory allocations') not prescriptive
(e.g., 'use array pooling on line 42'). Output 2-3 sentences."
SAMPLER_PROMPT: "From this population of solutions:
{for each: abstract, score}
Current parent: {parent_abstract}
Current guidance: {navigator_output}
Select 2-3 exemplars as few-shot references. Prioritize:
- Diverse approaches (not just highest scores)
- Novel logic that could inspire new directions
- Relevance to current guidance
Return the selected solution codes."
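The `{for each ...}` placeholders in the Navigator and Sampler prompts are loops rather than simple substitutions, so they need a small formatting step. The following is one possible way to fill them, assuming trajectories arrive as (parent_desc, child_desc, score_delta) tuples and the population as (abstract, score) tuples; the helper names `build_navigator_prompt` and `build_sampler_prompt` are hypothetical.

```python
from typing import List, Tuple

NAVIGATOR_TEMPLATE = (
    "Analyze these optimization trajectories:\n"
    "{trajectories}\n"
    "What high-level directions tend to improve performance?\n"
    "Be directional (e.g., 'reduce memory allocations') not prescriptive\n"
    "(e.g., 'use array pooling on line 42'). Output 2-3 sentences."
)

SAMPLER_TEMPLATE = (
    "From this population of solutions:\n"
    "{population}\n"
    "Current parent: {parent_abstract}\n"
    "Current guidance: {navigator_output}\n"
    "Select 2-3 exemplars as few-shot references. Prioritize:\n"
    "- Diverse approaches (not just highest scores)\n"
    "- Novel logic that could inspire new directions\n"
    "- Relevance to current guidance\n"
    "Return the selected solution codes."
)


def build_navigator_prompt(pairs: List[Tuple[str, str, float]]) -> str:
    """Render one line per trajectory with a signed score change."""
    lines = [
        f"{parent_desc} -> {child_desc}, score_change: {delta:+g}"
        for parent_desc, child_desc, delta in pairs
    ]
    return NAVIGATOR_TEMPLATE.format(trajectories="\n".join(lines))


def build_sampler_prompt(population: List[Tuple[str, float]],
                         parent_abstract: str,
                         navigator_output: str) -> str:
    """Render the population as indexed (score, abstract) lines."""
    lines = [
        f"[{i}] score={score:g}: {abstract}"
        for i, (abstract, score) in enumerate(population)
    ]
    return SAMPLER_TEMPLATE.format(
        population="\n".join(lines),
        parent_abstract=parent_abstract,
        navigator_output=navigator_output,
    )
```

Indexing the population lines gives the Sampler a compact way to name its selections, so the caller can look up the corresponding code in the buffer instead of asking the model to reproduce it.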
The power of ContextEvolve comes from mapping RL concepts to text-space operations. Understanding this mapping helps you tune the framework:
| RL Concept | ContextEvolve Agent | What It Means in Practice |
|---|---|---|
| State encoder | Summarizer | Compresses code into a latent representation (natural language abstract) that the "policy" can act on |
| Policy gradient | Navigator | Estimates which direction to move in solution space by analyzing which changes correlated with score improvement |
| Experience replay buffer | Sampler + Evolve Buffer | Stores and retrieves past experiences, prioritizing informative ones over merely successful ones |
| Policy network | Code Generator | Produces actions (new code) conditioned on the composed state |
| Reward signal | Evaluation function | Provides the scalar feedback that drives the entire loop |
The key insight: you get RL-like sample efficiency (learning from past experience, directed search) without any gradient computation or parameter updates. The "gradients" are natural language directions. The "state encoding" is summarization. The "replay" is exemplar selection. All of it happens in text.
Paper: ContextEvolve: Multi-Agent Context Compression for Systems Code Optimization (Su, Zheng, Li, 2026). Look for Algorithm 1 (the main loop), Table 2 (ablation showing each agent's contribution), and Figure 3 (the load balancing case study showing how separate optimization lineages merge via exemplar retrieval).