skills/experience-driven-multi-agent-systems-training-fre/SKILL.md
Build self-evolving multi-agent systems that accumulate tool-level expertise through structured interaction without model fine-tuning. Uses GeoEvolver's architecture: retrieval-augmented orchestration, parallel sub-goal exploration, contrastive memory distillation, and root-cause failure attribution. Triggers: 'build a self-evolving agent pipeline', 'create an experience-driven multi-agent system', 'add memory to my agent workflow', 'implement tool exploration with failure learning', 'make agents learn from execution history', 'build a GeoEvolver-style system'
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills experience-driven-multi-agent-systems-training-fre
This skill enables Claude to design and implement multi-agent systems that progressively acquire tool-level expertise through structured interaction — without any model parameter updates. Based on the GeoEvolver architecture (Dai et al., 2026), the core insight is that LLM agents can learn fine-grained tool configuration patterns and failure recovery strategies by decomposing complex tasks into independent sub-goals, exploring diverse tool-parameter configurations in parallel, and distilling both successes and root-cause failure attributions into an evolving memory bank that provides in-context demonstrations for future queries.
The core problem: LLM agents generating multi-step tool-calling pipelines frequently fail in specialized domains not because the high-level plan is wrong, but because subtle tool-parameter misconfigurations propagate silently through the pipeline. Existing approaches either require expensive fine-tuning or rely on generic retry logic that doesn't learn from past failures.
GeoEvolver's solution uses four cooperating agents in a closed loop: (1) a Retriever that queries an embedding-indexed memory bank for relevant past execution patterns, (2) an Orchestrator that decomposes queries into N independent sub-goals with explicit input/output contracts, (3) K parallel Executor variants that each independently attempt the full pipeline with up to A corrective retries per sub-goal, and (4) a Judge that emits binary success labels per sub-goal and selects the best overall execution trace. The key architectural insight is decoupling planning from execution and making sub-goals independently evaluable, which localizes failures to specific tool calls rather than the entire pipeline.
Experience accumulation happens through two distillation mechanisms after each task. Single-variant extraction captures analysis patterns (successful tool orderings and parameter choices) or error attributions (failure symptoms and corrective guardrails) from the best trace. Contrastive memory distillation compares all K parallel variants to synthesize workflow invariants that held across successes and failure modes with recovery strategies. Entries are deduplicated by (source_id, pattern_type, title) keys and stored in a persistent memory bank. Future queries retrieve top-k similar entries via embedding similarity, providing the orchestrator with execution-grounded demonstrations rather than generic instructions.
Define the agent roles and their separation of concerns. Create four distinct agent components: Retriever (memory lookup), Orchestrator (task decomposition), Executor (tool-calling sub-goal execution), and Judge (outcome evaluation). Each agent gets a focused system prompt describing only its responsibility.
Design the memory bank schema. Create a persistent store (JSON file, SQLite, or vector DB) with entries containing: source_id, pattern_type (one of analysis_pattern or error_attribution), title, embedding (for retrieval), content (the actual pattern or failure description), tool_sequence (ordered tool calls), parameter_snapshot (key parameter values), and timestamp. Include a deduplication function keyed on (source_id, pattern_type, title).
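The schema and deduplication key described above can be sketched as a dataclass. This is a minimal illustration, not the paper's implementation: the class name `MemoryEntry`, the `key()` helper, and the `deduplicate` signature are assumptions; only the field names and the `(source_id, pattern_type, title)` key come from the description above.

```python
from dataclasses import dataclass, field
import time

# The two pattern types named in the schema description.
PATTERN_TYPES = {"analysis_pattern", "error_attribution"}


@dataclass
class MemoryEntry:
    source_id: str
    pattern_type: str
    title: str
    content: str
    tool_sequence: list[str] = field(default_factory=list)
    parameter_snapshot: dict = field(default_factory=dict)
    embedding: list[float] = field(default_factory=list)
    timestamp: float = field(default_factory=time.time)

    def __post_init__(self):
        # Reject entries whose type is not one of the two allowed values.
        if self.pattern_type not in PATTERN_TYPES:
            raise ValueError(f"unknown pattern_type: {self.pattern_type!r}")

    def key(self) -> tuple[str, str, str]:
        # Deduplication key: (source_id, pattern_type, title).
        return (self.source_id, self.pattern_type, self.title)


def deduplicate(new_entries, memory_bank):
    """Return the new entries whose key is not already in the bank.

    Duplicates inside new_entries are dropped too (first one wins).
    """
    seen = {entry.key() for entry in memory_bank}
    unique = []
    for entry in new_entries:
        if entry.key() not in seen:
            seen.add(entry.key())
            unique.append(entry)
    return unique
```

Keying on the title rather than the content means two differently worded descriptions of the same pattern from the same task collapse into one entry, at the cost of keeping whichever was stored first.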
Implement the retrieval mechanism. On each new query, embed the query text using a sentence embedding model (or LLM-based embedding), retrieve top-k entries from the memory bank by cosine similarity, and aggregate them into a "strategy context" string. Apply a leakage filter: exclude entries whose content overlaps with expected outputs to prevent shortcutting.
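A minimal sketch of the retrieval step, assuming entries are plain dicts that already carry a precomputed `embedding` and that the query has been embedded elsewhere. The function names and the `forbidden` argument are hypothetical, and the leakage filter here is a simple substring check standing in for whatever overlap test a real system would use.

```python
import math


def cosine(a, b):
    # Cosine similarity between two equal-length vectors; 0.0 for zero vectors.
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_embedding, entries, k=5, forbidden=()):
    """Return the top-k entries by cosine similarity.

    Entries whose content contains any string in `forbidden` (for example
    the expected answer) are excluded first, as a crude leakage filter.
    """
    candidates = [
        e for e in entries
        if not any(s.lower() in e["content"].lower() for s in forbidden)
    ]
    ranked = sorted(
        candidates,
        key=lambda e: cosine(query_embedding, e["embedding"]),
        reverse=True,
    )
    return ranked[:k]


def build_strategy_context(entries):
    # Aggregate retrieved entries into one string for the orchestrator prompt.
    return "\n".join(
        f"[{e['pattern_type']}] {e['title']}: {e['content']}" for e in entries
    )
```

For a memory bank of more than a few thousand entries, a vector index would replace the linear scan; the ranking logic stays the same.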
Build the orchestrator's sub-goal decomposition. The orchestrator receives the query plus strategy context and produces N independent sub-goals. Each sub-goal must specify: (a) a natural language description, (b) input format contract (what data it receives), (c) output format contract (what it produces), (d) success criteria (how to verify correctness). The global task succeeds only when all sub-goals succeed: Y = AND(Y_1, Y_2, ..., Y_N).
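One way to make the four-part sub-goal contract explicit is a small immutable record, with the global success rule written as a conjunction. The names `SubGoal` and `global_success` are illustrative assumptions; treating an empty label list as failure is also a choice made here, not something the description specifies.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SubGoal:
    description: str       # (a) natural language description
    input_contract: str    # (b) what data the sub-goal receives
    output_contract: str   # (c) what it produces
    success_criteria: str  # (d) how the judge verifies correctness


def global_success(labels: list[bool]) -> bool:
    # Y = AND(Y_1, ..., Y_N); an empty plan is treated as a failure.
    return bool(labels) and all(labels)
```

A plan for the NDVI scenario might then contain `SubGoal("Select red and NIR bands", "Sentinel-2 L2A scene path", "two 10m rasters", "both rasters share shape and CRS")`, which gives the judge something concrete to check.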
Implement parallel exploration with K variants. For each query, spawn K independent execution pipelines. Each variant runs the full retrieve-plan-execute-judge loop independently. Within each variant, each sub-goal executor gets up to A corrective retry attempts. Use asyncio.gather() or thread pools to run variants in parallel where possible.
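The following sketch shows K variants running concurrently with `asyncio.gather()`, each retrying a failed sub-goal with the previous error as feedback. It makes several assumptions: `execute` is a caller-supplied coroutine returning `(ok, output)`, A is interpreted as retries after one initial attempt, and `demo_execute` is a deterministic stand-in for a real tool-calling executor.

```python
import asyncio


async def run_subgoal(execute, variant_id, subgoal, retries):
    # One initial attempt plus up to `retries` corrective retries (assumption).
    feedback = None
    attempts = 0
    for attempts in range(1, retries + 2):
        ok, output = await execute(variant_id, subgoal, feedback)
        if ok:
            return {"subgoal": subgoal, "success": True,
                    "attempts": attempts, "output": output}
        feedback = output  # pass the error to the next attempt
    return {"subgoal": subgoal, "success": False,
            "attempts": attempts, "output": feedback}


async def run_variant(variant_id, subgoals, execute, retries):
    # Sub-goals run in order inside a variant; variants run concurrently.
    results = [await run_subgoal(execute, variant_id, sg, retries)
               for sg in subgoals]
    return {"variant": variant_id, "results": results,
            "success": all(r["success"] for r in results)}


async def explore(subgoals, execute, k=2, retries=2):
    # Launch K independent variants and collect their traces in order.
    return await asyncio.gather(
        *(run_variant(i, subgoals, execute, retries) for i in range(k))
    )


async def demo_execute(variant_id, subgoal, feedback):
    # Stand-in executor: "ndvi" fails until it has seen corrective feedback.
    await asyncio.sleep(0)
    if subgoal == "ndvi" and feedback is None:
        return False, "resolution mismatch"
    return True, f"{subgoal} done by variant {variant_id}"
```

Calling `asyncio.run(explore(["bands", "ndvi"], demo_execute))` returns one trace per variant. In a real system each variant would also receive a different tool-parameter configuration or sampling seed, otherwise the K traces carry no contrastive signal.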
Build the executor agents with working memory. Each executor maintains a compressed working memory: H_t = summarize(H_{t-1}) || tail(trajectory, L) where older interactions are summarized and the L most recent raw tool calls are retained verbatim. This prevents context overflow during long-horizon execution while preserving recent actionable detail.
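The working-memory update can be read as a function of the previous memory and the raw trajectory. The sketch below assumes memory is a plain string, `||` is newline concatenation, and `summarize` is any callable (in practice an LLM call); the function name `update_working_memory` is hypothetical.

```python
def update_working_memory(prev_memory, trajectory, tail_len, summarize):
    """Compute H_t = summarize(H_{t-1}) || tail(trajectory, L).

    prev_memory: H_{t-1} as a string
    trajectory:  list of raw tool-call records (strings here)
    tail_len:    L, the number of recent calls kept verbatim
    summarize:   callable that compresses a string
    """
    # Guard tail_len == 0, because trajectory[-0:] would return everything.
    recent = trajectory[-tail_len:] if tail_len > 0 else []
    parts = [summarize(prev_memory)] + list(recent)
    return "\n".join(parts)
```

With a truncating stand-in such as `lambda s: s[:200]` for `summarize`, the memory size stays bounded by the summary length plus L raw records, regardless of how long the episode runs.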
Implement the judge with sub-goal-level evaluation. The judge inspects each sub-goal's output against its success criteria, emitting per-sub-goal binary labels Y_n and a validity signal v. Select the best variant: best = argmax_k Score(trajectory_k | Y_k, v_k), prioritizing verified success with validity as tiebreaker.
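Variant selection can be sketched as a maximum over a tuple-valued score. The exact form of Score is not given above, so this version simply assumes that full success outranks everything and the validity signal breaks ties; the dict layout and function names are illustrative.

```python
def score(labels, validity):
    # Verified success first, validity signal v as the tiebreaker.
    succeeded = bool(labels) and all(labels)
    return (succeeded, validity)


def select_best(variants):
    """Pick the variant with the highest score.

    Each variant is a dict with "labels" (per-sub-goal booleans Y_n)
    and "validity" (a numeric signal v). Ties keep the earliest variant.
    """
    return max(variants, key=lambda v: score(v["labels"], v["validity"]))
```

Because Python compares tuples left to right, a variant that passes every sub-goal always beats one that does not, however high the latter's validity.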
Distill experience after each task. Run two extraction passes: (a) Single-variant extraction: from the best trace, produce an analysis_pattern if successful or an error_attribution if failed, capturing the specific tool sequence, parameter values, and outcome. (b) Contrastive distillation: compare all K variants to identify what was consistent across successes (workflow invariants) and what distinguished failures (divergence points with recovery strategies).
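In the architecture described here, distillation is performed by an LLM reading the traces. As a rough deterministic stand-in, the contrastive pass can be approximated by comparing parameter snapshots: values shared by every successful variant become invariants, and failed variants are diffed against them. The function name and dict layout below are assumptions made for illustration.

```python
def contrastive_distill(variants):
    """Compare K variants' parameter snapshots.

    Each variant is a dict with "success" (bool) and "params" (dict).
    Returns the parameters shared by all successes and, for each failed
    variant, the parameters where it diverged from those invariants.
    """
    successes = [v["params"] for v in variants if v["success"]]
    failures = [v["params"] for v in variants if not v["success"]]

    invariants = {}
    if successes:
        first = successes[0]
        invariants = {
            key: value for key, value in first.items()
            if all(key in s and s[key] == value for s in successes[1:])
        }

    divergences = []
    for params in failures:
        diff = {
            key: {"failed": params.get(key), "succeeded": value}
            for key, value in invariants.items()
            if params.get(key) != value
        }
        if diff:
            divergences.append(diff)

    return {"workflow_invariants": invariants, "divergences": divergences}
```

The divergence list only points at candidate causes. Turning it into a recovery strategy still needs the root-cause reasoning step, since a parameter that differs in a failed run is not necessarily the one that broke it.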
Consolidate into the memory bank. Merge new entries with deduplication: memory_bank = memory_bank UNION deduplicate(new_entries, memory_bank). Re-index embeddings for the new entries. Optionally prune old low-utility entries based on retrieval frequency and recency.
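A sketch of the consolidation step with optional pruning, using plain dict entries. The `retrieval_count` and `last_used` fields and the `max_size` cap are hypothetical additions used to illustrate "low-utility" pruning; re-indexing embeddings is left out because it depends on the store in use.

```python
def entry_key(entry):
    # Deduplication key: (source_id, pattern_type, title).
    return (entry["source_id"], entry["pattern_type"], entry["title"])


def utility(entry):
    # Assumed utility: retrieval frequency first, then recency.
    return (entry.get("retrieval_count", 0), entry.get("last_used", 0.0))


def consolidate(memory_bank, new_entries, max_size=None):
    """Merge new entries into the bank, skipping duplicates.

    If max_size is given and exceeded, keep only the highest-utility
    entries. The input lists are not modified.
    """
    merged = list(memory_bank)
    seen = {entry_key(e) for e in merged}
    for entry in new_entries:
        key = entry_key(entry)
        if key not in seen:
            seen.add(key)
            merged.append(entry)

    if max_size is not None and len(merged) > max_size:
        merged = sorted(merged, key=utility, reverse=True)[:max_size]
    return merged
```

Ranking by retrieval count first means a brand-new entry with zero retrievals is the first candidate for eviction once the cap is hit, so a real system would likely add a grace period for fresh entries.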
Tune hyperparameters for your domain. Start with K=2 variants, at most N=3 sub-goals, A=2 retry attempts, and top-k=5 for memory retrieval. The paper reports strong interaction effects between K and N: increasing both together yields substantial gains, while moderate settings balance performance against compute cost. Retrieval effectiveness improves as the memory bank grows.
Example 1: Building a Self-Evolving Data Pipeline Agent
User: "I need a system that processes geospatial satellite images — it has to select the right bands, apply atmospheric correction, run NDVI calculation, and classify land cover. The tools keep failing with wrong parameters."
Approach:
{
  "entries": [
    {
      "source_id": "task_042",
      "pattern_type": "analysis_pattern",
      "title": "Sentinel-2 NDVI band selection",
      "content": "For Sentinel-2 L2A, use Band 4 (Red, 10m) and Band 8 (NIR, 10m). Do NOT use Band 8A (20m) as resolution mismatch causes silent errors in NDVI calculation.",
      "tool_sequence": ["band_selector", "ndvi_calculator"],
      "parameter_snapshot": {"red_band": "B04", "nir_band": "B08", "resolution": "10m"},
      "embedding": [0.12, -0.34, ...]
    },
    {
      "source_id": "task_037",
      "pattern_type": "error_attribution",
      "title": "Atmospheric correction CRS mismatch",
      "content": "sen2cor fails silently when input CRS is WGS84 geographic. Must reproject to UTM zone first. Symptom: output raster has all-zero reflectance values.",
      "tool_sequence": ["reproject", "sen2cor"],
      "parameter_snapshot": {"target_crs": "EPSG:32633"},
      "embedding": [0.45, 0.11, ...]
    }
  ]
}
Output: A Python system with retriever.py, orchestrator.py, executor.py, judge.py, and memory_bank.json that self-improves with each processed image.
Example 2: Adding Experience Memory to an Existing LangChain Agent
User: "I have a LangChain agent that calls APIs to process financial data but it keeps misconfiguring date formats and pagination parameters. Can you add learning from failures?"
Approach:
Add an ExperienceMemory class wrapping a vector store (FAISS or ChromaDB):

# MemoryEntry, ToolCall, load_or_create_vectorstore, extract_analysis_pattern,
# and extract_root_cause are project helpers assumed to be defined elsewhere.
class ExperienceMemory:
    def __init__(self, db_path: str):
        # Open the persistent vector store, creating it on first use.
        self.store = load_or_create_vectorstore(db_path)

    def retrieve(self, query: str, k: int = 5) -> list[MemoryEntry]:
        # Top-k most similar past experiences for this query.
        results = self.store.similarity_search(query, k=k)
        return [MemoryEntry.from_document(doc) for doc in results]

    def distill_success(self, query: str, trajectory: list[ToolCall]) -> MemoryEntry:
        # Store the tool ordering and parameters that worked.
        pattern = extract_analysis_pattern(query, trajectory)
        self.store.add_documents([pattern.to_document()])
        return pattern

    def distill_failure(self, query: str, trajectory: list[ToolCall], error: str) -> MemoryEntry:
        # Store the root cause so later runs can avoid it.
        attribution = extract_root_cause(query, trajectory, error)
        self.store.add_documents([attribution.to_document()])
        return attribution
Wrap invoke() to run K=2 parallel attempts, each injecting retrieved experience into the system prompt.
Output: The agent learns entries like "Bloomberg API requires ISO-8601 dates with timezone (2024-01-15T00:00:00Z), not YYYY-MM-DD" and "pagination cursor must be passed as header X-Next-Page, not query parameter", and stops making these mistakes on future queries.
Example 3: Contrastive Memory Distillation Across Parallel Variants
User: "My CI/CD pipeline agent tries different deployment configurations but I want it to learn which patterns work reliably."
Approach:
{
  "pattern_type": "analysis_pattern",
  "title": "Kubernetes deployment rollout invariant",
  "content": "Gradual rollout strategy (maxSurge=1, maxUnavailable=0) is required for services with startup latency >10s. Both successful variants used gradual rollout; the failure used immediate rollout causing crash loops during health check window.",
  "workflow_invariants": ["strategy=RollingUpdate", "maxUnavailable=0"],
  "failure_modes": [{"symptom": "CrashLoopBackOff", "cause": "immediate rollout + slow startup", "fix": "use gradual rollout with maxSurge=1"}]
}
Output: Future deployments retrieve this invariant and the orchestrator plans gradual rollouts by default for services with known startup latency.
Working memory (H_t) is episode-specific compressed context that prevents context overflow; the memory bank is the persistent cross-episode knowledge store.
Compress older history with summarize(H_{t-1}) and keep only the L most recent raw tool calls.
Paper: Experience-Driven Multi-Agent Systems Are Training-free Context-aware Earth Observers (Dai et al., 2026). Look for: Algorithm 1 (full GeoEvolver pseudocode), Table 2 (cross-backbone results showing the 12% average gain), the ablation in Table 4 (component contributions; contrastive distillation is the single most impactful component at +21.87pp), and the hyperparameter sensitivity analysis showing K*N interaction effects.