skills/cowork-x-experience-optimized-co-evolution-multi-a/SKILL.md
Build multi-agent collaboration systems with experience-driven co-evolution using HTN skill libraries and post-episode optimization. Use when: 'build a multi-agent system that improves over episodes', 'create agents that coordinate in real-time with low latency', 'design a skill library for collaborative agents', 'implement experience-based co-evolution for agent teams', 'optimize multi-agent token budget while improving performance', 'set up HTN-based task decomposition for cooperative agents'.
npx skillsauth add ndpvt-web/arxiv-claude-skills cowork-x-experience-optimized-co-evolution-multi-a
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and implement multi-agent collaboration systems that improve across episodes through structured skill libraries and post-episode optimization. The core technique from CoWork-X separates fast in-episode execution (via hierarchical task network skill retrieval with zero online LLM calls) from slow post-episode reasoning (via a Co-Optimizer that patches the skill library based on diagnosed failures). This produces agents that coordinate in real time, accumulate performance gains episode over episode, and steadily reduce token cost.
CoWork-X treats multi-agent collaboration as a closed-loop optimization problem inspired by fast-slow memory separation from neuroscience. During execution, a Skill-Agent retrieves and invokes pre-compiled skills from a structured HTN skill library -- no LLM calls happen in-episode. After each episode, a Co-Optimizer analyzes execution logs, diagnoses failures (runtime errors, stagnation, action inefficiencies), and synthesizes targeted patches to the skill library. The loop is: Execute(S_k) -> Diagnose(D_k) -> Update(S_k -> S_{k+1}) -> Execute(S_{k+1}).
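The closed loop described above can be sketched as a small driver function. This is a minimal illustration, not the paper's implementation: `co_evolve`, `run_episode`, `diagnose`, and `optimize` are hypothetical names, and the toy stand-ins below treat the library as a plain string.

```python
from typing import Callable, List, Tuple


def co_evolve(
    library: str,
    run_episode: Callable[[str], Tuple[float, list]],
    diagnose: Callable[[list], dict],
    optimize: Callable[[str, dict], str],
    episodes: int,
) -> Tuple[str, List[float]]:
    """Run Execute(S_k) -> Diagnose(D_k) -> Update(S_k -> S_{k+1}) for `episodes` rounds."""
    scores: List[float] = []
    for _ in range(episodes):
        score, log = run_episode(library)    # fast path: no LLM calls in-episode
        scores.append(score)
        report = diagnose(log)               # offline analysis of the execution log
        library = optimize(library, report)  # slow path: patch the library
    return library, scores


# Toy stand-ins (assumptions for illustration only): the score is the number
# of "+fix" markers, and the log reports an error until two fixes are present.
def toy_run(lib: str) -> Tuple[float, list]:
    fixes = lib.count("+fix")
    log = ["error"] if fixes < 2 else []
    return float(fixes), log


def toy_diagnose(log: list) -> dict:
    return {"failures": len(log)}


def toy_optimize(lib: str, report: dict) -> str:
    return lib + "+fix" if report["failures"] else lib


final_library, history = co_evolve("base", toy_run, toy_diagnose, toy_optimize, episodes=4)
print(final_library, history)
```

In a real system `run_episode` would drive the Skill-Agents and `optimize` would wrap the Co-Optimizer call; only the ordering of the three phases is taken from the text.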
The skill library is a Python module with three layers: (1) State representation -- compact abstractions of task-relevant features, omitting spatial noise; (2) Operators -- atomic primitives with precondition checks and state-update effects; (3) Methods -- task decompositions that map high-level goals to ordered operator sequences. This HTN structure makes skills interpretable, composable, and cheap to execute. The library starts deliberately impaired (syntactically valid but semantically incomplete) to force the Co-Optimizer to learn corrections from real trajectories.
Drift regularization prevents catastrophic edits: the Co-Optimizer receives both the current library and the best-performing historical library, discouraging radical rewrites and enabling rollback when a patch degrades performance. Budget constraints are enforced by design -- the Skill-Agent uses zero online tokens, and the Co-Optimizer operates offline with validation retries (up to 3). In experiments, CoWork-X achieved 15.6x the score of baselines while using 27% fewer total tokens over 30 episodes.
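A minimal sketch of the rollback half of drift regularization, assuming each library is stored as a source string and each episode yields a single numeric score. `LibraryArchive` and its methods are hypothetical names introduced here for illustration.

```python
from dataclasses import dataclass, field
from typing import List, Optional, Tuple


@dataclass
class LibraryArchive:
    """Keeps every (source, score) pair so the best library stays recoverable."""
    history: List[Tuple[str, float]] = field(default_factory=list)

    def record(self, source: str, score: float) -> None:
        self.history.append((source, score))

    def best(self) -> Optional[Tuple[str, float]]:
        # max() returns the earliest entry among ties, so an older library wins a tie.
        return max(self.history, key=lambda item: item[1]) if self.history else None

    def next_library(self, candidate: str, candidate_score: float) -> str:
        """Record the candidate, then return the library to run next."""
        self.record(candidate, candidate_score)
        best_source, best_score = self.best()
        # Roll back only when the candidate scored strictly below the best so far.
        return best_source if candidate_score < best_score else candidate


archive = LibraryArchive()
first = archive.next_library("v1", 10.0)   # nothing better yet, keep v1
second = archive.next_library("v2", 4.0)   # regression, roll back to v1
third = archive.next_library("v3", 12.0)   # improvement, adopt v3
print(first, second, third)
```

The same archive can supply the best-performing historical library that the Co-Optimizer receives alongside the current one.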
Define the state representation. Identify the minimal set of task-relevant features agents need to reason about. Strip spatial details and ephemeral data. Encode as a Python class or dictionary with typed fields (e.g., ingredient counts, queue status, agent assignments).
Implement atomic operators. Write Python functions for each primitive action. Each operator must: (a) check preconditions against the current state, (b) return the unchanged state on precondition failure, (c) apply state updates and return the modified state on success. Name them descriptively (e.g., op_assemble_burger, op_serve_order).
Compose methods as task decompositions. For each high-level goal (e.g., "fulfill order X"), write a method that decomposes it into an ordered sequence of operators. Methods should check applicability conditions and return the subtask list. Register operators and methods with an HTN planner (e.g., Pyhop).
Initialize the skill library as deliberately impaired. Start with syntactically valid but semantically incomplete code -- operators that skip precondition checks or return unchanged state. This forces the optimization loop to discover corrections from real execution data rather than relying on a priori assumptions.
Build the execution harness. Run episodes where multiple agents share the same skill library. Each agent independently invokes the HTN planner to decompose goals into primitives, then a low-level controller executes the primitives. Log every timestep: actions taken, state transitions, errors, and stagnation intervals.
Implement the diagnostic pipeline. After each episode, analyze logs to extract three categories of signal: (a) Runtime failures -- exact error locations with surrounding context (2 timesteps before and after), (b) Stagnation -- intervals of 100+ consecutive inactive timesteps with the state at stagnation onset, (c) Action distribution -- frequency counts by action category to identify inefficient patterns.
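The three diagnostic signals could be extracted along these lines. The log format is an assumption: a list of per-timestep dicts with an `action` key, an optional `error` key, and an optional `state` key, where the hypothetical action name `noop` marks an inactive timestep.

```python
from collections import Counter
from typing import Dict, List, Optional

STAGNATION_MIN = 100  # consecutive inactive timesteps, per the text
CONTEXT = 2           # timesteps of context on each side of an error


def diagnose(log: List[dict]) -> Dict[str, object]:
    failures: List[dict] = []
    stagnation: List[dict] = []
    run_start: Optional[int] = None

    def close_run(end: int) -> None:
        # Keep the run only if it is long enough to count as stagnation.
        if run_start is not None and end - run_start + 1 >= STAGNATION_MIN:
            stagnation.append(
                {"start": run_start, "end": end, "state": log[run_start].get("state")}
            )

    for t, step in enumerate(log):
        if step.get("error"):
            lo, hi = max(0, t - CONTEXT), min(len(log), t + CONTEXT + 1)
            failures.append({"t": t, "error": step["error"], "context": log[lo:hi]})
        if step["action"] == "noop":
            if run_start is None:
                run_start = t
        else:
            close_run(t - 1)
            run_start = None
    close_run(len(log) - 1)  # a run may extend to the end of the episode

    actions = Counter(step["action"] for step in log)
    return {"failures": failures, "stagnation": stagnation, "actions": dict(actions)}


demo_log = (
    [{"action": "move"}, {"action": "grill", "error": "KeyError: 'buns'"}, {"action": "move"}]
    + [{"action": "noop"}] * 100
    + [{"action": "serve"}]
)
report = diagnose(demo_log)
print(report["stagnation"], report["actions"])
```

Timesteps are taken to be list indices here; a harness that logs explicit timestamps would use those instead.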
Build the Co-Optimizer prompt. Construct an LLM prompt containing: the current skill library source, the original (initial) library for reference, the best-performing historical library for rollback context, the diagnostic report, and performance metrics from the last 5 episodes. Instruct the LLM to output a patched Python skill library file.
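One way to assemble such a prompt is plain string concatenation. The section titles, the instruction wording, and the function name `build_optimizer_prompt` are assumptions made for this sketch; only the list of inputs and the 5-episode window come from the text.

```python
from typing import List


def build_optimizer_prompt(
    current: str,
    initial: str,
    best: str,
    report: str,
    scores: List[float],
    window: int = 5,
) -> str:
    """Concatenate the Co-Optimizer inputs into one prompt string."""
    recent = scores[-window:]  # only the last `window` episode scores
    sections = [
        ("CURRENT LIBRARY", current),
        ("INITIAL LIBRARY (reference)", initial),
        ("BEST HISTORICAL LIBRARY (rollback context)", best),
        ("DIAGNOSTIC REPORT", report),
        ("RECENT SCORES", ", ".join(f"{s:g}" for s in recent)),
    ]
    body = "\n\n".join(f"### {title}\n{text}" for title, text in sections)
    instruction = (
        "Return the complete patched Python skill library file. "
        "Prefer small, targeted edits over rewrites."
    )
    return f"{body}\n\n### TASK\n{instruction}"


prompt = build_optimizer_prompt(
    current="def op_serve(state, agent): return state",
    initial="def op_serve(state, agent): return state",
    best="def op_serve(state, agent): return state",
    report="stagnation: 1 interval starting at t=3",
    scores=[1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0],
)
print(prompt)
```

Placing the best historical library next to the current one gives the LLM a concrete reference point, which is how the text describes discouraging radical rewrites.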
Validate and roll back. Parse the Co-Optimizer's output as Python and run syntax validation, retrying up to 3 times if validation fails. Execute one episode with the patched library. If performance drops below the historical best, roll back to that best library. Track a rolling window of the last 5 episode scores.
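The syntax-validation part of this step can be sketched with the standard-library `ast` module. The `ask_llm` callable is a hypothetical stand-in for the Co-Optimizer call, and reading "up to 3 retries" as one initial attempt plus three retries is an assumption.

```python
import ast
from typing import Callable, List, Optional

MAX_RETRIES = 3


def validated_patch(current: str, ask_llm: Callable[[str, Optional[str]], str]) -> str:
    """Return a syntactically valid patched library, or `current` if every attempt fails."""
    feedback: Optional[str] = None
    for _ in range(1 + MAX_RETRIES):  # one initial attempt plus up to 3 retries
        candidate = ask_llm(current, feedback)
        try:
            ast.parse(candidate)      # syntax check only; nothing is executed
            return candidate
        except SyntaxError as exc:
            # Feed the error back so the next attempt can correct it.
            feedback = f"SyntaxError at line {exc.lineno}: {exc.msg}"
    return current


# Fake LLM for illustration: the first reply is broken, the second is valid.
replies = iter(["def op(:\n    pass", "def op(state):\n    return state\n"])
calls: List[Optional[str]] = []


def fake_llm(library: str, feedback: Optional[str]) -> str:
    calls.append(feedback)
    return next(replies)


patched = validated_patch("x = 1\n", fake_llm)
print(patched)
```

`ast.parse` only confirms that the file is valid Python; semantic regressions are caught by the episode run and rollback that follow.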
Iterate the co-evolution loop. Repeat steps 5-8 (execute, diagnose, optimize, validate) across episodes. Monitor three convergence signals: rising episode scores, decreasing failure counts in diagnostics, and stabilizing library diffs (fewer patches needed per iteration).
Extract and freeze the mature skill library. Once scores plateau and patches become minimal, freeze the library as the production artifact. The frozen library executes with zero LLM tokens and deterministic behavior.
Example 1: Multi-agent order fulfillment system
User: "Build a multi-agent kitchen system where two agents collaboratively prepare burger orders, improving their coordination over time."
Approach:
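State representation: {beef_patties: int, grilled_patties: int, buns: int, lettuce: int, assembled_burgers: dict, pending_orders: list, agent_assignments: dict}

    def op_grill_patty(state, agent):
        # Precondition: at least one raw patty is available.
        if state.beef_patties <= 0:
            return state  # precondition failure
        state.beef_patties -= 1
        state.grilled_patties += 1
        return state

    def op_assemble_beef_burger(state, agent):
        # Precondition: one grilled patty and one bun are available.
        if state.grilled_patties < 1 or state.buns < 1:
            return state  # precondition failure
        state.grilled_patties -= 1
        state.buns -= 1
        # .get() avoids a KeyError when no beef burger has been assembled yet.
        state.assembled_burgers["beef"] = state.assembled_burgers.get("beef", 0) + 1
        return state

    def method_fulfill_beef_order(state, agent):
        # Applicable only when the next pending order is a beef order.
        if state.pending_orders and "beef" in state.pending_orders[0]:
            # op_serve is referenced here but not defined in this excerpt.
            return [("op_grill_patty", agent),
                    ("op_assemble_beef_burger", agent),
                    ("op_serve", agent, "beef")]
        return False  # method not applicable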
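To see the Example 1 decomposition run end to end, here is a self-contained sketch. It uses a hand-rolled runner rather than Pyhop, a `SimpleNamespace` as the state object, and a hypothetical `op_serve` plus a `served` field, none of which appear in the original example.

```python
from types import SimpleNamespace


def op_grill_patty(state, agent):
    if state.beef_patties <= 0:
        return state  # precondition failure
    state.beef_patties -= 1
    state.grilled_patties += 1
    return state


def op_assemble_beef_burger(state, agent):
    if state.grilled_patties < 1 or state.buns < 1:
        return state  # precondition failure
    state.grilled_patties -= 1
    state.buns -= 1
    state.assembled_burgers["beef"] = state.assembled_burgers.get("beef", 0) + 1
    return state


def op_serve(state, agent, kind):  # hypothetical operator, assumed for this sketch
    if state.assembled_burgers.get(kind, 0) < 1 or not state.pending_orders:
        return state  # precondition failure
    state.assembled_burgers[kind] -= 1
    state.served.append(state.pending_orders.pop(0))
    return state


def method_fulfill_beef_order(state, agent):
    if state.pending_orders and "beef" in state.pending_orders[0]:
        return [("op_grill_patty", agent),
                ("op_assemble_beef_burger", agent),
                ("op_serve", agent, "beef")]
    return False


OPERATORS = {f.__name__: f for f in (op_grill_patty, op_assemble_beef_burger, op_serve)}


def run_goal(state, agent, method):
    """Decompose one goal with `method`, then apply each primitive in order."""
    subtasks = method(state, agent)
    if subtasks is False:
        return []  # method not applicable in this state
    executed = []
    for name, *args in subtasks:
        state = OPERATORS[name](state, *args)
        executed.append(name)
    return executed


state = SimpleNamespace(
    beef_patties=2, grilled_patties=0, buns=2,
    assembled_burgers={}, pending_orders=["beef burger"], served=[],
)
trace = run_goal(state, "agent_0", method_fulfill_beef_order)
print(trace, state.served)
```

Because the operators return the unchanged state on a failed precondition, this runner cannot tell success from failure by the return value alone; a real harness would likely compare states or log each precondition check.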
Example 2: Document processing pipeline with co-evolving agents
User: "I have two agents -- one extracts data from PDFs, one validates and stores it. They keep stepping on each other. Can you make them improve their coordination automatically?"
Approach:
State: {pending_docs: list, extracted: dict, validated: dict, errors: list, agent_locks: dict}
Operators: op_claim_document, op_extract_fields, op_validate_record, op_store_record -- each with lock-checking preconditions
Method: method_process_document decomposes into claim -> extract -> handoff -> validate -> store
Output after co-evolution:
    # Patched operator with learned precondition
    def op_claim_document(state, agent):
        if not state.pending_docs:
            return state  # nothing left to claim
        if state.agent_locks.get(state.pending_docs[0]) is not None:
            return state  # another agent already claimed this doc
        doc = state.pending_docs.pop(0)
        state.agent_locks[doc] = agent
        # in_progress is not part of the state listed above; it must exist as a dict.
        state.in_progress[agent] = doc
        return state
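A short usage sketch of the patched operator, assuming the state is a `SimpleNamespace` that also carries an `in_progress` dict. The agent names and file names are made up for illustration.

```python
from types import SimpleNamespace


def op_claim_document(state, agent):
    if not state.pending_docs:
        return state  # nothing left to claim
    if state.agent_locks.get(state.pending_docs[0]) is not None:
        return state  # another agent already claimed this doc
    doc = state.pending_docs.pop(0)
    state.agent_locks[doc] = agent
    state.in_progress[agent] = doc
    return state


# Two agents claim in turn: each ends up with a different document.
state = SimpleNamespace(pending_docs=["a.pdf", "b.pdf"], agent_locks={}, in_progress={})
op_claim_document(state, "extractor")
op_claim_document(state, "validator")
print(state.in_progress)

# A document that is already locked stays untouched when another agent tries to claim it.
locked = SimpleNamespace(
    pending_docs=["c.pdf"], agent_locks={"c.pdf": "extractor"}, in_progress={}
)
op_claim_document(locked, "validator")
print(locked.pending_docs, locked.in_progress)
```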
Example 3: Applying CoWork-X to an existing agent codebase
User: "I have a multi-agent system but agents waste tokens re-reasoning every step. How do I apply the CoWork-X pattern to reduce cost?"
Approach:
Run ast.parse() on the output. On failure, feed the syntax error back to the LLM with the original library and retry (up to 3 times). If all retries fail, keep the current library for the next episode.
Paper: CoWork-X: Experience-Optimized Co-Evolution for Multi-Agent Collaboration System (Lin et al., 2026)
Key insight: Separating fast skill-based execution from slow post-episode optimization lets multi-agent systems achieve real-time coordination with zero in-episode token cost while accumulating performance gains across episodes through structured, patch-style library evolution.