skills/autonomous-multi-agent-ai-high-throughput/SKILL.md
Build multi-agent AI systems for high-throughput scientific workflows with metacognitive self-assessment. Implements the Polymer Research Lifecycle (PRL) architecture: a Planner Agent decomposes complex scientific tasks into subtasks assigned to specialized domain agents (Research, Characterization, ML Model, Safety, Synthesis, Execution, Reporting), which produce consensus predictions with uncertainty estimates and continuously self-optimize via three-layer metacognitive reflection.
Trigger phrases:
- "Build a multi-agent pipeline for materials property prediction"
- "Create a high-throughput screening system with agent consensus"
- "Implement metacognitive self-assessment for an agent swarm"
- "Design an autonomous scientific workflow with specialized agents"
- "Set up a polymer informatics pipeline with uncertainty quantification"
- "Orchestrate domain-specific agents for computational chemistry"
npx skillsauth add ndpvt-web/arxiv-claude-skills autonomous-multi-agent-ai-high-throughput
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to architect and implement multi-agent systems that follow the Polymer Research Lifecycle (PRL) pipeline pattern from Roy et al. (2026). The core idea: a central Planner Agent decomposes complex scientific or engineering tasks into subtasks dispatched to specialized domain agents, each with distinct tools and models. Agents return independent predictions that are aggregated via consensus for uncertainty quantification, while a three-layer metacognitive framework (tactical, strategic, meta-strategic) monitors agent effectiveness and dynamically adjusts execution strategies. This pattern generalizes beyond polymers to any domain requiring high-throughput prediction, multi-model consensus, and self-improving agent orchestration.
Hierarchical Agent Orchestration with Consensus. The PRL architecture uses a four-layer pipeline: (1) Data Ingestion, where a centralized repository integrates heterogeneous sources; (2) Preprocessing and Feature Engineering, where domain-specific tokenizers and encoders produce embeddings; (3) Agent Processing, where specialized agents (Research, Characterization, ML Model, Safety, Synthesis, Execution, Reporting) perform domain computations; and (4) Output Integration, where results are aggregated, visualized, and reported. The Planner Agent sits at the top, decomposing user requests into subtasks, assigning them to the appropriate specialist, and merging outputs. For prediction tasks, multiple independent agents (e.g., a GNN agent, a descriptor-based predictor, and a simulation agent) produce separate estimates. The consensus prediction is the mean, and uncertainty is the standard deviation across agents, giving an inexpensive, disagreement-based uncertainty estimate without Bayesian methods.
Metacognitive Self-Assessment. The system implements three reflection layers that run after each task cycle. Tactical reflection evaluates individual agent operations (Did the Research Agent find relevant papers? Did the ML Agent's predictions fall within expected error bounds?). Strategic reflection evaluates overall progress toward the research objective and pipeline efficiency. Meta-strategic reflection tracks learning patterns across multiple cycles, identifying persistent weaknesses and triggering corrective actions such as curriculum-like retraining objectives for underperforming agents. Each agent receives an effectiveness score; agents scoring below the population average are flagged for adjustment. In the paper's polystyrene case study, this mechanism detected that the Synthesis Agent (score 0.30) and Research Agent (score 0.57) were underperforming and dynamically generated improvement objectives.
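The flagging rule can be sketched in a few lines. This is a minimal illustration rather than the paper's implementation: the function name `flag_underperformers` is hypothetical, the Synthesis and Research scores are the ones quoted above, and the other two scores are placeholders chosen only to make the example run.

```python
from statistics import mean


def flag_underperformers(scores: dict[str, float]) -> dict[str, bool]:
    """Flag every agent whose effectiveness is below the population average."""
    average = mean(scores.values())
    return {agent: score < average for agent, score in scores.items()}


# Synthesis and Research scores come from the case study; the rest are placeholders.
scores = {"synthesis": 0.30, "research": 0.57, "ml_model": 0.85, "safety": 0.90}
flags = flag_underperformers(scores)
```

With these inputs the population average is 0.655, so only the Synthesis and Research agents are flagged.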
Linear Scalability via Parallel Dispatch. By decomposing work into independent subtasks, the system achieves O(n) time complexity scaling to 10,000+ items. Parallel agent execution yields ~5x speedup on multi-core systems. This makes the pattern suitable for high-throughput screening where thousands of candidates must be evaluated cheaply.
Define the agent roster and their tool access. Create a configuration specifying each specialist agent's name, role description, available tools (e.g., RDKit for molecular descriptors, a GNN model for graph-based prediction, an API for literature retrieval), and the LLM backing each agent. Use a JSON or YAML manifest:
{
"agents": [
{"name": "research", "role": "Retrieve and summarize scientific literature", "tools": ["arxiv_search", "semantic_scholar"]},
{"name": "ml_model", "role": "Run property predictions via trained models", "tools": ["polygnn", "property_predictor"]},
{"name": "safety", "role": "Screen candidates against safety and feasibility criteria", "tools": ["safety_db", "toxicity_checker"]},
{"name": "execution", "role": "Orchestrate task flow, handle errors, verify consistency", "tools": ["task_queue", "logger"]}
]
}
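One way to consume such a manifest is to parse it into typed records and reject obvious mistakes early. The sketch below assumes the JSON layout shown above; `AgentSpec` and `load_roster` are hypothetical names, and the two validation rules (unique names, at least one tool) are illustrative choices.

```python
import json
from dataclasses import dataclass


@dataclass(frozen=True)
class AgentSpec:
    name: str
    role: str
    tools: tuple[str, ...]


def load_roster(manifest_text: str) -> dict[str, AgentSpec]:
    """Parse the manifest and index agents by name, rejecting duplicates."""
    roster: dict[str, AgentSpec] = {}
    for entry in json.loads(manifest_text)["agents"]:
        if entry["name"] in roster:
            raise ValueError(f"duplicate agent name: {entry['name']}")
        if not entry["tools"]:
            raise ValueError(f"agent {entry['name']} has no tools")
        roster[entry["name"]] = AgentSpec(entry["name"], entry["role"], tuple(entry["tools"]))
    return roster


MANIFEST = """
{"agents": [
  {"name": "research", "role": "Retrieve and summarize scientific literature",
   "tools": ["arxiv_search", "semantic_scholar"]},
  {"name": "ml_model", "role": "Run property predictions via trained models",
   "tools": ["polygnn", "property_predictor"]}
]}
"""
roster = load_roster(MANIFEST)
```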
Implement the Planner Agent as a task decomposer. The Planner receives the user's high-level request, breaks it into atomic subtasks with explicit inputs/outputs, assigns each to the appropriate specialist agent, and defines the dependency graph. Use structured output (JSON) so downstream agents receive typed inputs:
def plan_task(request: str) -> list[Subtask]:
    # LLM call to decompose request into subtasks
    # Each subtask has: id, agent_name, input_schema, output_schema, depends_on
    ...
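Once the Planner has produced subtasks, the dependency graph determines a valid execution order. The following sketch uses the standard-library `graphlib.TopologicalSorter`; the `Subtask` fields mirror the ones listed above (schemas omitted for brevity), and `execution_order` is a hypothetical helper.

```python
from dataclasses import dataclass, field
from graphlib import TopologicalSorter


@dataclass
class Subtask:
    id: str
    agent_name: str
    depends_on: list[str] = field(default_factory=list)


def execution_order(subtasks: list[Subtask]) -> list[str]:
    """Return subtask ids so that every subtask comes after its dependencies."""
    graph = {task.id: set(task.depends_on) for task in subtasks}
    # static_order() raises graphlib.CycleError if the plan contains a cycle.
    return list(TopologicalSorter(graph).static_order())


plan = [
    Subtask("report", "reporting", depends_on=["screen"]),
    Subtask("predict", "ml_model", depends_on=["literature"]),
    Subtask("literature", "research"),
    Subtask("screen", "safety", depends_on=["predict"]),
]
order = execution_order(plan)
```

Subtasks with no dependency between them can be dispatched in parallel; this linear example only shows the ordering guarantee.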
Build the four-layer data pipeline. Layer 1: Ingest raw data (SMILES strings, experimental measurements, literature references) into a unified store. Layer 2: Preprocess into model-ready features (molecular graphs, fingerprint vectors, normalized descriptors). Layer 3: Dispatch to specialist agents. Layer 4: Aggregate and format outputs.
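The four layers can be expressed as plain functions that each take and return a list of records. This is a structural sketch only: all function names are hypothetical, and the "feature" computed in layer 2 is a character count standing in for real molecular descriptors.

```python
from typing import Any, Callable

Record = dict[str, Any]


def ingest(raw: list[str]) -> list[Record]:
    """Layer 1: normalize and de-duplicate raw SMILES strings."""
    unique = dict.fromkeys(s.strip() for s in raw if s.strip())
    return [{"smiles": s} for s in unique]


def featurize(records: list[Record]) -> list[Record]:
    """Layer 2: attach features (placeholder: string length, not a real descriptor)."""
    return [{**r, "features": [float(len(r["smiles"]))]} for r in records]


def dispatch(records: list[Record], agents: dict[str, Callable[[list[float]], float]]) -> list[Record]:
    """Layer 3: run every agent on every record."""
    return [{**r, "outputs": {name: fn(r["features"]) for name, fn in agents.items()}} for r in records]


def integrate(records: list[Record]) -> list[Record]:
    """Layer 4: keep only the fields meant for the final report."""
    return [{"smiles": r["smiles"], **r["outputs"]} for r in records]


toy_agents = {"toy_model": lambda features: features[0] * 2}
output = integrate(dispatch(featurize(ingest([" CC ", "CC", "", "CCO"])), toy_agents))
```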
Implement consensus prediction with uncertainty. For any prediction target, run at least two independent agents (different model architectures or data representations). Compute the consensus as the mean of agent predictions and uncertainty as the standard deviation:
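import numpy as np

# Collect one independent estimate per agent, then aggregate.
predictions = [agent.predict(sample) for agent in prediction_agents]
consensus = np.mean(predictions)
uncertainty = np.std(predictions)  # population std (ddof=0) across agents
result = {"value": consensus, "uncertainty": uncertainty, "agent_predictions": predictions}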
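Note that `np.std` defaults to the population standard deviation (`ddof=0`), which tends to understate the spread when only two or three agents contribute. The sketch below contrasts the two estimators using the standard library so it runs without NumPy; the three prediction values are made up for illustration.

```python
from statistics import mean, pstdev, stdev

predictions = [370.0, 380.0, 390.0]  # illustrative agent outputs, in kelvin

consensus = mean(predictions)
population_std = pstdev(predictions)  # same estimator as np.std(..., ddof=0)
sample_std = stdev(predictions)       # same estimator as np.std(..., ddof=1)
```

Which estimator to report is a design choice; the sample version is the more conservative of the two for small agent counts.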
Add safety and feasibility filtering. Before any candidate reaches the output stage, route it through the Safety Agent, which checks against domain constraints (physical plausibility, toxicity thresholds, regulatory compliance, cost bounds). Reject or flag candidates that fail.
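A Safety Agent's rule layer might be organized as a list of named constraints, so that rejected candidates carry the reasons they failed. In this sketch `Constraint` and `screen` are hypothetical names, and every threshold is an illustrative placeholder, not a value from the paper.

```python
from dataclasses import dataclass
from typing import Callable

Candidate = dict[str, float]


@dataclass(frozen=True)
class Constraint:
    name: str
    check: Callable[[Candidate], bool]  # returns True when the candidate passes


def screen(candidate: Candidate, constraints: list[Constraint]) -> list[str]:
    """Return the names of all failed constraints (an empty list means pass)."""
    return [c.name for c in constraints if not c.check(candidate)]


# Thresholds below are placeholders chosen for the example.
CONSTRAINTS = [
    Constraint("physical_plausibility", lambda c: 0.0 < c["tg_K"] < 1000.0),
    Constraint("toxicity", lambda c: c["toxicity_score"] <= 0.5),
    Constraint("cost", lambda c: c["cost_usd_per_kg"] <= 20.0),
]

failures = screen({"tg_K": 350.0, "toxicity_score": 0.8, "cost_usd_per_kg": 5.0}, CONSTRAINTS)
```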
Implement the three-layer metacognitive loop. After each task cycle, run reflection:
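Tactical: score each agent's output after it runs and log the result, e.g. {"agent": "research", "effectiveness": 0.57}.
Strategic: compute pipeline efficiency = (accuracy * success_rate) / normalized_time. Compare against target thresholds.
Meta-strategic: track scores across cycles to identify persistent weaknesses and trigger corrective actions.
Enable parallel dispatch for throughput. For independent subtasks (e.g., predicting properties for a batch of 1,000 molecules), dispatch them in parallel across agent instances. Use async execution or a task queue: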
import asyncio

async def screen_batch(items: list[str]) -> list[Result]:
    # Schedule one consensus prediction per item and run them concurrently.
    tasks = [predict_with_consensus(item) for item in items]
    return await asyncio.gather(*tasks)
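For large batches it is often useful to cap how many predictions are in flight at once, for example to respect API rate limits. The sketch below adds an `asyncio.Semaphore` to the same pattern; `screen_batch_bounded` and the `max_concurrent` parameter are hypothetical, and `predict_with_consensus` is stubbed out with a dummy body.

```python
import asyncio


async def predict_with_consensus(item: str) -> dict:
    """Stand-in for the real consensus call; returns a dummy result."""
    await asyncio.sleep(0)  # yield control, as a real I/O-bound call would
    return {"smiles": item, "value": float(len(item))}


async def screen_batch_bounded(items: list[str], max_concurrent: int = 8) -> list[dict]:
    """Run predictions concurrently, never more than max_concurrent at once."""
    semaphore = asyncio.Semaphore(max_concurrent)

    async def guarded(item: str) -> dict:
        async with semaphore:
            return await predict_with_consensus(item)

    # gather() returns results in the same order as the inputs.
    return await asyncio.gather(*(guarded(item) for item in items))


results = asyncio.run(screen_batch_bounded(["CC", "CCO", "c1ccccc1"], max_concurrent=2))
```

`asyncio` provides concurrency for I/O-bound calls on a single thread; CPU-bound model inference would need a process pool or an external task queue to use multiple cores.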
Wire up the generative design loop (if applicable). For design tasks, chain: user requirements -> candidate generation (LLM-based) -> property prediction (ML agents) -> safety screening -> scoring (novelty, feasibility, creativity) -> ranked output. Implement as a closed loop where top candidates can be fed back for refinement.
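The closed loop can be sketched as a function that takes each stage as a callable. Everything here is an assumption made for illustration: the name `design_loop`, its parameters, and the toy stages in the usage lines (which grow a string by one character per round) stand in for the LLM generator, ML predictors, Safety Agent, and scoring function.

```python
from typing import Callable


def design_loop(
    seeds: list[str],
    generate: Callable[[list[str]], list[str]],
    predict: Callable[[str], float],
    is_safe: Callable[[str], bool],
    score: Callable[[str, float], float],
    rounds: int = 3,
    keep: int = 2,
) -> list[tuple[str, float]]:
    """Generate, predict, screen, score, rank; feed the top candidates back."""
    ranked: list[tuple[str, float]] = []
    parents = seeds
    for _ in range(rounds):
        candidates = [c for c in generate(parents) if is_safe(c)]
        scored = [(c, score(c, predict(c))) for c in candidates]
        # Merge with earlier survivors, keeping each candidate's best score.
        best = dict(ranked)
        for candidate, value in scored:
            best[candidate] = max(value, best.get(candidate, value))
        ranked = sorted(best.items(), key=lambda pair: pair[1], reverse=True)[:keep]
        parents = [candidate for candidate, _ in ranked]
    return ranked


# Toy stages: extend each parent by one carbon, "predict" its length, cap size at 5.
top = design_loop(
    seeds=["CC"],
    generate=lambda parents: [p + "C" for p in parents] + parents,
    predict=lambda c: float(len(c)),
    is_safe=lambda c: len(c) <= 5,
    score=lambda c, y: y / 5.0,
)
```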
Add structured reporting. The Reporting Agent produces a summary with: predictions table, uncertainty intervals, agent agreement metrics, flagged issues, and metacognitive scores. Output as markdown, JSON, or a visualization.
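As a rough sketch of the markdown output path, a report can be assembled from prediction rows and the list of flagged agents. The function name `render_report` and the row keys are hypothetical; the numbers are illustrative.

```python
def render_report(rows: list[dict], flagged_agents: list[str]) -> str:
    """Render predictions and flagged agents as a small markdown report."""
    lines = ["| SMILES | Value | Uncertainty |", "| --- | --- | --- |"]
    for row in rows:
        lines.append(f"| {row['smiles']} | {row['value']:.1f} | ±{row['uncertainty']:.1f} |")
    lines.append("")
    lines.append("Flagged agents: " + (", ".join(flagged_agents) or "none"))
    return "\n".join(lines)


report = render_report(
    [{"smiles": "CC(c1ccccc1)", "value": 378.2, "uncertainty": 12.7}],
    flagged_agents=["research"],
)
```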
Test with a reference case. Validate the pipeline end-to-end on a known input (e.g., polystyrene, SMILES: CC(c1ccccc1)) where ground truth is available. Verify that consensus predictions fall within experimental ranges and that the metacognitive loop correctly identifies agent performance issues.
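A reference check of this kind can be as simple as testing whether the consensus interval overlaps an accepted experimental band. In the sketch below, `within_range` is a hypothetical helper, the consensus numbers are illustrative, and the band is an assumption: polystyrene's glass transition is commonly reported near 373 K, and the ±15 K tolerance is chosen only for the example.

```python
def within_range(result: dict, low: float, high: float) -> bool:
    """True when [value - uncertainty, value + uncertainty] overlaps [low, high]."""
    lower = result["value"] - result["uncertainty"]
    upper = result["value"] + result["uncertainty"]
    return lower <= high and upper >= low


# Illustrative consensus for polystyrene Tg, in kelvin.
reference = {"value": 378.2, "uncertainty": 12.7}
# Assumed acceptance band: 373 K plus or minus 15 K.
ok = within_range(reference, low=358.0, high=388.0)
```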
Example 1: High-throughput polymer property screening
User: "I have a CSV of 500 polymer SMILES strings. Predict glass transition temperature and density for each, with uncertainty estimates."
Approach:
Write one row per polymer with the columns smiles, tg_predicted, tg_uncertainty, density_predicted, density_uncertainty, flagged.
Output:
smiles,tg_predicted_K,tg_uncertainty_K,density_predicted_gcc,density_uncertainty_gcc,flagged
CC(c1ccccc1),378.2,12.7,1.021,0.027,false
C(=O)(O)CC(=O)O,285.4,8.3,1.312,0.015,false
...
Example 2: Metacognitive agent self-improvement
User: "My multi-agent pipeline has a research retrieval agent that keeps returning irrelevant papers. How do I add self-assessment?"
Approach:
Log each tactical score as a structured record, e.g. {"cycle": 12, "agent": "research", "effectiveness": 0.42, "task": "retrieve Tg data for polyamides"}.
Output (metacognitive dashboard):
{
"cycle": 15,
"agent_scores": {
"research": {"effectiveness": 0.42, "trend": "declining", "action": "switching to semantic_scholar API + adding reranker"},
"ml_model": {"effectiveness": 0.87, "trend": "stable", "action": "none"},
"safety": {"effectiveness": 0.91, "trend": "stable", "action": "none"}
},
"pipeline_efficiency": 0.73,
"population_average": 0.68
}
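The "trend" field in a dashboard like this could be derived from an agent's score history. One simple option, shown here as an assumption rather than the paper's method, compares the mean of the recent half of the history with the earlier half; the function name, the `tolerance` default, and the sample histories are all illustrative.

```python
from statistics import mean


def trend(history: list[float], tolerance: float = 0.02) -> str:
    """Classify a score history by comparing its recent half with its earlier half."""
    if len(history) < 2:
        return "stable"
    half = len(history) // 2
    delta = mean(history[half:]) - mean(history[:half])
    if delta < -tolerance:
        return "declining"
    if delta > tolerance:
        return "improving"
    return "stable"


research_trend = trend([0.55, 0.50, 0.45, 0.42])
ml_model_trend = trend([0.87, 0.86, 0.87, 0.88])
```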
Example 3: Generative polymer design with multi-objective constraints
User: "Design biodegradable polymers with Tg between 70-90C and low production cost."
Approach:
Output:
Rank | SMILES              | Tg (°C)    | Density (g/cm³) | Novelty | Feasibility | Score
1    | OC(=O)C(O)C(=O)O... | 78.3 ± 4.1 | 1.24 ± 0.02     | 0.85    | 0.92        | 0.88
2    | CC(O)C(=O)OC(C)...  | 82.1 ± 5.7 | 1.18 ± 0.03     | 0.79    | 0.88        | 0.84
...
Roy, M., Bazgir, A., Santos, A. d. S. S., & Zhang, Y. (2026). Autonomous Multi-Agent AI for High-Throughput Polymer Informatics: From Property Prediction to Generative Design Across Synthetic and Bio-Polymers. arXiv:2602.00103v1. https://arxiv.org/abs/2602.00103v1
Key sections to study: the eight-agent roster and Planner Agent decomposition pattern (Section 2), the consensus uncertainty mechanism (Section 3, Table 6), the metacognitive self-assessment framework with tactical/strategic/meta-strategic layers (Section 4, Tables 8-9), and the polystyrene end-to-end case study demonstrating all components working together (Section 5).