skills/commcp-multi-agent-coordination-llm-based/SKILL.md
Build decentralized multi-agent coordination systems using LLM-based communication calibrated with conformal prediction. Agents share only statistically reliable messages, reducing noise and redundancy. Use when: 'coordinate multiple agents with LLM messaging', 'build a multi-agent system with calibrated communication', 'reduce noisy messages between cooperating agents', 'implement conformal prediction for agent communication filtering', 'design decentralized agent coordination without a central controller', 'filter unreliable LLM-generated messages in multi-agent pipelines'.
npx skillsauth add ndpvt-web/arxiv-claude-skills commcp-multi-agent-coordination-llm-based
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to build multi-agent coordination systems where agents communicate through LLM-generated messages that are statistically calibrated using conformal prediction. Instead of agents flooding each other with every observation, each message passes through a confidence filter derived from split conformal prediction, so only messages that clear a calibrated relevance threshold are transmitted. This targets the core problem in multi-agent LLM systems: receiver distraction from irrelevant or misleading inter-agent messages.
The Core Problem: In multi-agent systems using LLMs for communication, agents generate messages about their observations (e.g., "I found object X, it may be relevant to your task"). Raw LLM outputs are uncalibrated — a model saying "80% confident" does not mean it is correct 80% of the time. Sending all messages creates noise; naive thresholding misses important information unpredictably.
Conformal Prediction as the Solution: CommCP applies split conformal prediction to calibrate LLM-generated relevance assessments. During a calibration phase, the system collects LLM confidence scores on labeled examples (where ground truth relevance is known). From the scores of the truly relevant examples it computes a threshold q̂: the ε quantile of the confidence scores or, equivalently, one minus the (1 - ε) quantile of the nonconformity scores 1 - confidence. Here ε is the acceptable miscoverage rate (e.g., ε = 0.05 for 95% coverage). At runtime, an agent transmits a message only if the LLM's confidence score for that message is at least q̂. This provides a distribution-free statistical guarantee, provided calibration and runtime examples are exchangeable: the probability that a truly relevant observation is filtered out is at most ε.
Decentralized Architecture: Each agent independently runs four modules: (1) a perception module that detects objects/entities from observations, (2) a communication module that uses an LLM to assess relevance of observations to other agents' tasks and applies conformal filtering before sending, (3) a planning module that computes action priorities from its own observations plus received messages, and (4) a confidence check module that determines when accumulated evidence is sufficient to commit to an answer or action. No central coordinator is needed — agents exchange peer-to-peer messages, and each receiver trusts incoming messages because they have passed the sender's calibrated filter.
Define the agent roles and their tasks. Enumerate each agent's capabilities and assigned objectives. Each agent should have a clear task description string (e.g., "Find and identify the red mug on the kitchen counter") and a set of possible observations it can make.
Design the message schema. Define a structured message format that agents exchange. At minimum, include: sender_id, receiver_id, observation (what was seen), relevance_category (classification of how it relates to the receiver's task), confidence_score (LLM's probability for the relevance classification), and metadata (position, timestamp, etc.).
from dataclasses import dataclass

@dataclass
class AgentMessage:
    sender_id: str
    receiver_id: str
    observation: str
    relevance_category: str  # e.g., "same_target", "related", "irrelevant"
    confidence_score: float  # LLM probability for the chosen category
    metadata: dict           # position, timestamp, context
Implement the LLM relevance assessor. For each observation an agent makes, prompt the LLM to classify its relevance to every other agent's task. Use chain-of-thought reasoning and request logprob outputs for the classification options. The key is extracting a probability for every option, not just the top-1 label; these raw scores are what the conformal step calibrates afterwards.
def assess_relevance(observation: str, other_agent_task: str, llm) -> dict:
    """Classify how relevant an observation is to another agent's task."""
    prompt = f"""Given this observation: "{observation}"
And this agent's task: "{other_agent_task}"
Classify the relevance:
A) Same target object: directly fulfills the task
B) Highly related: provides useful information for the task
C) Unrelated: no connection to the task
D) Common/generic: too generic to be useful
Think step by step, then choose A, B, C, or D."""
    # `llm.generate` and `extract_option_probabilities` are placeholders for your
    # LLM client and a helper that reads per-option token log-probabilities.
    response = llm.generate(prompt, return_logprobs=True)
    probs = extract_option_probabilities(response.logprobs, options=["A", "B", "C", "D"])
    return {"category": max(probs, key=probs.get), "scores": probs}
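The `extract_option_probabilities` helper is called above but never defined. A minimal sketch of what it could look like follows, assuming (this is not specified in the source) that the LLM client exposes the log-probabilities at the final answer position as a plain `dict` mapping candidate tokens to natural-log probabilities. Real client libraries return different structures, so the input handling would need adapting.

```python
import math

def extract_option_probabilities(logprobs: dict, options: list) -> dict:
    """Turn answer-token log-probabilities into probabilities over `options`.

    Assumes `logprobs` maps candidate tokens (e.g. "A", " B") to natural-log
    probabilities at the position where the model emits its final choice.
    """
    floor = -1e9  # stand-in log-probability for options the model never listed
    raw = {opt: floor for opt in options}
    for token, logprob in logprobs.items():
        key = token.strip()
        if key in raw:
            # Keep the most likely surface form of each option ("A" vs " A").
            raw[key] = max(raw[key], logprob)
    # Softmax restricted to the option tokens, shifted for numerical stability.
    peak = max(raw.values())
    weights = {opt: math.exp(lp - peak) for opt, lp in raw.items()}
    total = sum(weights.values())
    return {opt: w / total for opt, w in weights.items()}
```

Renormalizing over the four options discards probability mass the model put on unrelated tokens, so the scores always sum to 1. If none of the options appears in the input, the function falls back to a uniform distribution.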
Build the conformal calibration set. Collect 50-200 labeled examples where you know the ground-truth relevance of observations to tasks. Run each through the LLM relevance assessor. Store the confidence scores for positive-class predictions (categories A and B separately).
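import numpy as np

def calibrate(calibration_data: list, llm, epsilon: float = 0.05) -> dict:
    """Compute per-category send thresholds from labeled (obs, task, label) triples."""
    scores_a, scores_b = [], []
    for obs, task, true_label in calibration_data:
        result = assess_relevance(obs, task, llm)
        if true_label == "A":
            scores_a.append(result["scores"]["A"])
        elif true_label == "B":
            scores_b.append(result["scores"]["B"])
    if not scores_a or not scores_b:
        raise ValueError("Calibration data needs at least one example each of labels A and B")
    # Lower epsilon-quantile: about (1 - epsilon) of truly relevant examples score above it
    q_a = np.quantile(scores_a, epsilon)  # Lower bound for "same target"
    q_b = np.quantile(scores_b, epsilon)  # Lower bound for "related"
    return {"threshold_A": q_a, "threshold_B": q_b}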
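The `np.quantile(scores, epsilon)` call above is the plain empirical quantile. Standard split conformal prediction instead picks a finite-sample-corrected rank, so that the "at most ε" bound holds for a calibration set of any size n when the calibration and runtime scores are exchangeable. The sketch below shows that variant. The function name `conformal_threshold` is hypothetical, and this is the textbook rule rather than code taken from the CommCP paper.

```python
import math

def conformal_threshold(confidences, epsilon: float = 0.05) -> float:
    """Lowest confidence a message may have and still be sent.

    `confidences` are assessor scores for calibration examples that are truly
    in the category (for instance, scores["A"] where the true label is "A").
    """
    ordered = sorted(confidences)
    n = len(ordered)
    # Finite-sample rank: use the k-th smallest score, k = floor((n + 1) * epsilon).
    k = math.floor((n + 1) * epsilon)
    if k < 1:
        return 0.0  # too few examples to filter anything at this epsilon
    return ordered[k - 1]
```

With ε = 0.05 the rank k first reaches 1 at n = 19, so a smaller calibration set cannot filter anything at that level and the function returns 0.0 (every message passes). It can stand in for the two `np.quantile` calls in `calibrate`.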
Implement the conformal message filter. Before any message is sent, check whether its confidence score exceeds the calibrated threshold for its category. Only transmit if it passes. This is the core mechanism that provides the statistical guarantee.
def should_send_message(relevance_result: dict, thresholds: dict) -> bool:
    cat = relevance_result["category"]
    score = relevance_result["scores"][cat]
    if cat == "A" and score >= thresholds["threshold_A"]:
        return True
    if cat == "B" and score >= thresholds["threshold_B"]:
        return True
    return False  # Categories C, D are never sent
Build the agent planning module. Each agent maintains a priority map over possible actions. Own observations contribute directly; received (calibrated) messages from other agents contribute with a weighting factor. Use semantic value scoring to rank next actions.
# TAU_1, TAU_2 (weights) and score_relevance are assumed to be defined elsewhere.
def compute_action_priority(own_observations, received_messages, task):
    value_map = {}
    for obs in own_observations:
        value_map[obs.target] = score_relevance(obs, task) * TAU_1
    for msg in received_messages:
        # Received messages already passed conformal filter, so trust them
        value_map[msg.metadata["position"]] = (
            score_relevance(msg, task) * TAU_2  # TAU_2 > TAU_1 rewards communication
        )
    return max(value_map, key=value_map.get)
Implement the confidence check for task completion. An agent commits to an answer or action only when P(answer) * P(relevant_view) > 1 - ε₂, where ε₂ is a task-specific error tolerance (so 1 - ε₂ is the confidence the joint estimate must exceed). This prevents premature commitment on insufficient evidence.
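The commit rule is a single comparison, sketched below. The function name `ready_to_commit` and the default `epsilon_2 = 0.1` are illustrative assumptions, not values from the source. How the two probabilities are estimated is left to the caller.

```python
def ready_to_commit(p_answer: float, p_relevant_view: float, epsilon_2: float = 0.1) -> bool:
    """Return True when the joint confidence clears the 1 - epsilon_2 bar."""
    for name, p in (("p_answer", p_answer), ("p_relevant_view", p_relevant_view)):
        if not 0.0 <= p <= 1.0:
            raise ValueError(f"{name} must be a probability in [0, 1], got {p}")
    return p_answer * p_relevant_view > 1.0 - epsilon_2
```

For example, with `epsilon_2 = 0.1` an agent that is 0.98 sure of its answer and 0.95 sure it is looking at a relevant view commits (0.931 > 0.9), while 0.9 and 0.9 does not (0.81 < 0.9).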
Wire up the decentralized communication loop. Each agent runs an independent event loop: observe → assess relevance to peers → filter with conformal threshold → send surviving messages → receive messages from peers → update action priorities → act → check completion.
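One way to picture the loop is the in-memory sketch below, which covers the observe, assess, filter, send, and receive stages and leaves planning, acting, and the completion check to the caller. Everything here is an illustrative assumption: the `Agent` class, the `deque` inboxes standing in for a real transport, and `toy_assess`, a keyword matcher standing in for the LLM assessor.

```python
from collections import deque
from dataclasses import dataclass, field
from typing import Callable

@dataclass
class Agent:
    agent_id: str
    task: str
    assess: Callable[[str, str], dict]  # (observation, peer_task) -> relevance result
    thresholds: dict                    # e.g. {"threshold_A": 0.8, "threshold_B": 0.7}
    inbox: deque = field(default_factory=deque)
    sent: int = 0
    filtered: int = 0

    def passes_filter(self, result: dict) -> bool:
        # Categories without a calibrated threshold (C, D) are never sent.
        cat = result["category"]
        key = f"threshold_{cat}"
        return key in self.thresholds and result["scores"][cat] >= self.thresholds[key]

    def step(self, observation: str, peers: list) -> list:
        """One loop iteration: assess, filter, send, then drain the inbox."""
        for peer in peers:
            result = self.assess(observation, peer.task)
            if self.passes_filter(result):
                peer.inbox.append((self.agent_id, observation, result["category"]))
                self.sent += 1
            else:
                self.filtered += 1
        received = list(self.inbox)
        self.inbox.clear()
        return received  # hand these to the planner, then act and check completion


def toy_assess(observation: str, peer_task: str) -> dict:
    """Keyword-overlap stand-in for the LLM relevance assessor (illustration only)."""
    task_words = peer_task.lower().split()
    hit = any(word in task_words for word in observation.lower().split())
    return {"category": "A" if hit else "C",
            "scores": {"A": 0.9 if hit else 0.05, "C": 0.1 if hit else 0.95}}


thresholds = {"threshold_A": 0.8, "threshold_B": 0.7}
a1 = Agent("a1", "find the red mug", toy_assess, thresholds)
a2 = Agent("a2", "find the blue book", toy_assess, thresholds)

a1.step("blue book on shelf", peers=[a2])      # relevant to a2, so it is sent
received = a2.step("empty table", peers=[a1])  # irrelevant to a1, so it is filtered
print(received, a1.sent, a2.filtered)
```

Each agent holds its own thresholds and inbox, so no shared coordinator state is needed. Swapping the deques for a network transport changes only the body of `step`.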
Run online calibration updates (optional). As agents accumulate new labeled interactions during deployment, periodically recompute the conformal thresholds on the expanded calibration set to adapt to distribution shifts.
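A possible shape for the periodic update is sketched below. It deliberately swaps the ever-growing calibration set described above for a bounded sliding window, so that old scores age out under distribution shift. That substitution is mine, not the source's. The class name and the `window` and `min_samples` defaults are likewise hypothetical. Thresholds use the finite-sample conformal rank, the floor((n + 1) * ε)-th smallest confidence score.

```python
import math
from collections import deque

class RollingCalibrator:
    """Keeps recent labeled confidence scores and refreshes send thresholds."""

    def __init__(self, epsilon: float = 0.05, window: int = 200, min_samples: int = 50):
        self.epsilon = epsilon
        self.min_samples = min_samples
        # One bounded buffer per positive category; old scores fall off the left.
        self.scores = {"A": deque(maxlen=window), "B": deque(maxlen=window)}

    def add(self, true_label: str, score: float) -> None:
        """Record the assessor's score for an example whose true label is known."""
        if true_label in self.scores:
            self.scores[true_label].append(score)

    def _threshold(self, values) -> float:
        # Conformal rank: the floor((n + 1) * epsilon)-th smallest confidence.
        ordered = sorted(values)
        k = math.floor((len(ordered) + 1) * self.epsilon)
        return ordered[k - 1] if k >= 1 else 0.0

    def refresh(self, previous: dict) -> dict:
        """Return updated thresholds, keeping old values where data is too thin."""
        updated = dict(previous)
        for label, values in self.scores.items():
            if len(values) >= self.min_samples:
                updated[f"threshold_{label}"] = self._threshold(values)
        return updated
```

Calling `refresh` on a schedule (for instance, every few hundred labeled interactions) keeps the previous threshold for any category that has not yet accumulated `min_samples` scores.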
Monitor and log communication efficiency. Track the message send rate, filter rate, and downstream task success. The ratio of messages sent to messages generated quantifies how much noise the conformal filter eliminates.
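The send and filter rates can be tracked with a small counter such as the sketch below. The `CommStats` name and its fields are assumptions for illustration, and downstream task success would be logged separately.

```python
from dataclasses import dataclass

@dataclass
class CommStats:
    generated: int = 0  # candidate messages produced by the assessor
    sent: int = 0       # candidates that passed the conformal filter

    def record(self, was_sent: bool) -> None:
        self.generated += 1
        if was_sent:
            self.sent += 1

    @property
    def send_rate(self) -> float:
        return self.sent / self.generated if self.generated else 0.0

    @property
    def filter_rate(self) -> float:
        return 1.0 - self.send_rate if self.generated else 0.0


stats = CommStats()
for i in range(47):
    stats.record(was_sent=i < 12)  # 12 of 47 candidates pass the filter
print(f"filter rate: {stats.filter_rate:.1%}")
```

With 47 generated and 12 sent, this reproduces the 74.5% filter rate reported in Example 1.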
Example 1: Multi-Agent Document Research System
User: "Build a system where 3 agents each search different document databases, and share relevant findings with each other. Agent 1 searches legal docs, Agent 2 searches financial filings, Agent 3 searches news articles. They're all working on the same investigation query."
Approach:
Output:
Agent 2 → Agent 1: "Filing SEC-2024-4821 references entity 'Meridian Holdings'
    matching your legal search. Confidence: 0.91 (threshold: 0.82).
    Relevance: same_target. Source: financial_db/sec_filings/"
Communication stats:
    Messages generated: 47 | Messages sent: 12 | Filter rate: 74.5%
    Task completion: 3/3 agents converged | False leads avoided: ~35
Example 2: Parallel Code Review Agents
User: "I want multiple agents reviewing different parts of a large PR — one checks security, one checks performance, one checks correctness. They should share findings only when relevant to another reviewer's domain."
Approach:
Output:
# Calibrated thresholds (from 80 historical examples, epsilon=0.05):
thresholds = {
    "security→performance": {"A": 0.89, "B": 0.78},
    "security→correctness": {"A": 0.85, "B": 0.75},
    "performance→security": {"A": 0.92, "B": 0.84},
    # ... other pairs
}
# Runtime message log:
# [SENT] security → correctness: "Unparameterized SQL in db/queries.py:142,
#        potential injection AND incorrect escaping" (score=0.88, thresh=0.75)
# [FILTERED] performance → security: "Slow loop in api/handler.py:89"
#        (score=0.31, thresh=0.84)
# Review completed: 6 cross-domain findings shared out of 23 generated (73.9% filtered)
Example 3: Multi-Agent Data Pipeline Monitoring
User: "Set up agents that monitor different stages of our data pipeline (ingestion, transformation, serving). When one detects an anomaly, it should alert the others only if it's genuinely relevant to their stage."
Approach:
Output:
Alert routing decision:
  Anomaly: schema_drift in ingestion.users_table
  → transformation agent: SEND (score=0.94 ≥ threshold_A=0.80) ✓
  → serving agent: SUPPRESS (score=0.61 < threshold_B=0.72) ✗
  Reason: Schema drift directly breaks transformation mappings but
          serving uses cached transformed data unaffected until next refresh.
Do:
Set ε conservatively (0.05-0.10) to start. A lower ε means fewer relevant messages are accidentally filtered, at the cost of more irrelevant messages getting through.
Avoid:
ε incrementally (e.g., from 0.05 to 0.15) or expand the calibration set with more representative examples.
Paper: CommCP: Efficient Multi-Agent Coordination via LLM-Based Communication with Conformal Prediction (Zhang et al., ICRA 2026). Focus on Section III (framework architecture), Section III-C (conformal prediction calibration), and Section IV (MM-EQA benchmark results). Key insight: conformal prediction provides distribution-free coverage guarantees for LLM-generated inter-agent messages, reducing communication volume by ~74% while improving task success rates.