skills/constrained-process-maps-multi-agent/SKILL.md
Build multi-agent workflows structured as constrained DAG process maps with Monte Carlo uncertainty estimation. Each agent occupies a specialized review role, with predefined escalation paths and terminal states (automated label or human review). Use when: 'build a multi-agent compliance pipeline', 'add uncertainty-aware escalation to my agent workflow', 'create a review chain with human fallback', 'design a DAG-based agent workflow with confidence thresholds', 'implement Monte Carlo sampling for agent decisions', 'build a multi-stage content moderation system'.
Install this skill globally with one command (works with Claude Code, Cursor, and Windsurf):
npx skillsauth add ndpvt-web/arxiv-claude-skills constrained-process-maps-multi-agent
This skill enables Claude to design and implement multi-agent systems structured as directed acyclic graphs (DAGs), where each node is a specialized LLM agent with a defined role (e.g., content reviewer, risk assessor, legal checker). Agents quantify their own epistemic uncertainty through Monte Carlo sampling -- running the same classification multiple times and measuring agreement -- then use predefined escalation edges to route uncertain cases to downstream specialists or human review. The framework is formalized as a finite-horizon MDP, which guarantees termination and enables principled accuracy-vs-cost trade-offs. Joshi & Rudow (2026) report up to 19% accuracy gains and 85x reduction in human review burden over single-agent baselines for this approach.
DAG-Structured Process Maps. The workflow is a directed acyclic graph G = (S, E) where nodes are specialized agents and edges are allowable escalation paths. Terminal (absorbing) states are the final outcomes: a concrete label (e.g., "safe", "unsafe") or "human review". Because the graph is acyclic, every trajectory is guaranteed to terminate in at most as many steps as the DAG's longest path -- no infinite loops, no runaway agent chains. Each agent has a narrowly-scoped system prompt (a Standard Operating Procedure) defining its review focus.
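The termination guarantee can be checked mechanically before any agent runs. The following is a minimal sketch, not the paper's implementation: the node names in `EDGES` and the helper `validate_process_map` are hypothetical, and the check relies only on Python's standard `graphlib` module.

```python
from graphlib import CycleError, TopologicalSorter

# Hypothetical process map: node -> nodes it may transition to.
# Terminal (absorbing) states have no outgoing edges.
EDGES = {
    "worker": ["triage", "safe", "unsafe"],
    "triage": ["risk", "legal"],
    "risk": ["safe", "unsafe", "human_review"],
    "legal": ["safe", "unsafe", "human_review"],
    "safe": [],
    "unsafe": [],
    "human_review": [],
}
TERMINALS = {"safe", "unsafe", "human_review"}


def validate_process_map(edges, terminals):
    """Return the maximum number of transitions any trajectory can take.

    Raises ValueError if the graph has a cycle or a non-terminal dead end.
    """
    # TopologicalSorter treats each mapping value as the key's predecessors,
    # so feeding it successor lists yields an order in which every node
    # comes after the nodes it can escalate to.
    try:
        order = list(TopologicalSorter(edges).static_order())
    except CycleError as exc:
        raise ValueError(f"process map contains a cycle: {exc.args[1]}") from exc

    longest = {}
    for node in order:
        successors = edges.get(node, [])
        if not successors and node not in terminals:
            raise ValueError(f"dead end at non-terminal node {node!r}")
        # Longest remaining path from this node, counted in transitions.
        longest[node] = 1 + max(longest[s] for s in successors) if successors else 0
    return max(longest.values())


print(validate_process_map(EDGES, TERMINALS))  # 3
```

In a finite acyclic graph every path ends at a node with no outgoing edges, so requiring all such nodes to be terminal states is enough to ensure every trajectory reaches one.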
Monte Carlo Uncertainty Estimation. Instead of asking an LLM "how confident are you?" (which is unreliable), each agent generates n independent classification samples for the same input. The distribution of outputs (e.g., 8 "safe", 1 "unsafe", 1 "uncertain" out of n=10) empirically quantifies epistemic uncertainty without accessing model internals. A policy function (e.g., majority vote with a confidence threshold) maps this sample vector to a decision: finalize the label, or escalate to the next agent in the DAG.
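As a rough illustration of the sample-vector idea, the sketch below turns the 8/1/1 split from the paragraph above into an agreement fraction and a finalize-or-escalate decision. The helper names `agreement` and `decide` and the 0.8 default threshold are assumptions chosen for the example, not part of the source framework.

```python
from collections import Counter


def agreement(samples):
    """Return (most common label, fraction of samples that voted for it)."""
    if not samples:
        raise ValueError("need at least one sample")
    label, votes = Counter(samples).most_common(1)[0]
    return label, votes / len(samples)


def decide(samples, threshold=0.8, final_labels=("safe", "unsafe")):
    """Finalize the top label if agreement meets the threshold, else escalate."""
    label, fraction = agreement(samples)
    if label in final_labels and fraction >= threshold:
        return label
    return "escalate"


# The n=10 distribution from the text: 8 "safe", 1 "unsafe", 1 "uncertain".
samples = ["safe"] * 8 + ["unsafe"] + ["uncertain"]
print(agreement(samples))              # ('safe', 0.8)
print(decide(samples, threshold=0.8))  # safe
print(decide(samples, threshold=0.9))  # escalate
```

Note that "uncertain" is never finalized even when it wins the vote; a majority of "uncertain" samples is itself a reason to escalate.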
Two-Level Uncertainty Model. Agent-level uncertainty is the Monte Carlo sample distribution. System-level uncertainty is captured by which terminal state the MDP reaches -- if the pipeline cannot resolve the case with sufficient confidence, it terminates in the human-review absorbing state. This separation lets you tune each agent's threshold independently and optimize the overall system for your target accuracy-vs-human-review trade-off.
Define the terminal states. Enumerate the final outcomes your pipeline must produce: concrete labels (e.g., safe, unsafe, approved, rejected) plus human_review as the catch-all escalation sink.
Identify agent roles and draw the DAG. Map each review concern to a dedicated agent node. Define directed edges representing escalation paths. A typical three-tier structure: Worker -> Triage -> {Risk Agent, Legal Agent} -> terminal states. Verify the graph is acyclic and every path reaches a terminal state.
Write scoped system prompts (SOPs) for each agent. Each agent gets a narrow mandate. The Worker handles first-pass classification, the Triage agent routes based on uncertainty type, and specialist agents (Risk, Legal, etc.) apply domain-specific criteria. Prompts must instruct the agent to output one of the allowed labels (e.g., safe, unsafe, uncertain) followed by reasoning.
Implement Monte Carlo sampling per agent. For each input at each agent node, call the LLM n times independently to produce a sample vector [a^(1), ..., a^(n)]. Start with n=1 for speed, scale to n=3-5 for confidence, or n=25+ for fine-grained threshold tuning on critical subsets.
Define the policy function for each agent. Map the sample vector to a decision. The simplest policy is majority vote: if >50% of samples agree on a label, finalize it; otherwise escalate. For tighter control, set a confidence threshold (e.g., require 80% agreement to finalize). The threshold controls the accuracy-vs-escalation trade-off.
Implement the transition logic. Code the DAG traversal: start at the Worker, apply the agent's policy, follow the appropriate edge to the next node or terminal state. Pass the input and any accumulated context along escalation edges.
Wire in the human review terminal state. When the pipeline exhausts all agent nodes without reaching a confident label, route to human review. Log the full trajectory (which agents saw the case, their sample distributions, their reasoning) so the human reviewer has context.
Run calibration experiments. Test the pipeline on a labeled dataset across different values of n and different confidence thresholds. Measure accuracy, false positive/negative rates, human review volume, and latency. Use 5+ independent trials to estimate confidence intervals.
Tune thresholds per agent. Agents at different positions in the DAG may need different confidence levels. A first-pass Worker can be aggressive (low threshold, high throughput), while a Legal agent near the end of the pipeline should be conservative (high threshold, low risk of error).
Deploy with monitoring. Track per-agent uncertainty distributions, escalation rates, and terminal-state frequencies in production. Drift in these metrics signals model degradation or distribution shift in inputs.
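The calibration and threshold-tuning steps above can be sketched as a simple sweep over candidate thresholds for one agent. This assumes sample vectors have already been collected on a labeled set; the `RECORDS` data is made up purely for illustration, and `decide` and `sweep` are hypothetical helper names.

```python
from collections import Counter


def decide(samples, threshold):
    """Finalize the top label if its vote share meets the threshold."""
    label, votes = Counter(samples).most_common(1)[0]
    if label in ("safe", "unsafe") and votes / len(samples) >= threshold:
        return label
    return "escalate"


def sweep(records, thresholds):
    """records: list of (gold_label, samples). Returns one row per threshold."""
    rows = []
    for t in thresholds:
        decisions = [(gold, decide(samples, t)) for gold, samples in records]
        finalized = [(g, d) for g, d in decisions if d != "escalate"]
        correct = sum(g == d for g, d in finalized)
        rows.append({
            "threshold": t,
            "escalation_rate": 1 - len(finalized) / len(records),
            "accuracy_on_finalized": correct / len(finalized) if finalized else None,
        })
    return rows


# Made-up calibration records with n=5 samples each (illustration only).
RECORDS = [
    ("safe", ["safe"] * 5),
    ("safe", ["safe"] * 4 + ["uncertain"]),
    ("unsafe", ["unsafe"] * 3 + ["safe"] * 2),
    ("unsafe", ["safe"] * 3 + ["unsafe"] * 2),
]

rows = sweep(RECORDS, [0.6, 0.8, 1.0])
for row in rows:
    print(row)
```

On this toy data, raising the threshold from 0.6 to 0.8 removes the one wrong finalized label at the cost of escalating half the cases, which is the accuracy-vs-escalation trade-off the threshold is meant to control. A real calibration run would repeat this over several independent trials before reading anything into the numbers.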
Example 1: Content Safety Moderation Pipeline
User: "Build a multi-agent content moderation system that classifies user posts as safe or unsafe, with human review fallback for uncertain cases."
Approach:
Terminal states: safe, unsafe, human_review
DAG: ContentReviewer -> Triage -> {RiskAssessor, PolicyChecker} -> terminal states

import asyncio
from collections import Counter

# --- DAG Definition ---
PROCESS_MAP = {
    "content_reviewer": {
        "escalates_to": "triage",
        "can_finalize": ["safe", "unsafe"],
    },
    "triage": {
        "escalates_to": ["risk_assessor", "policy_checker"],
        "can_finalize": [],  # triage only routes, never finalizes
    },
    "risk_assessor": {
        "escalates_to": "human_review",
        "can_finalize": ["safe", "unsafe"],
    },
    "policy_checker": {
        "escalates_to": "human_review",
        "can_finalize": ["safe", "unsafe"],
    },
}

# --- Scoped system prompts (SOPs); triage has none because it only routes ---
AGENT_PROMPTS = {
    "content_reviewer": (
        "You are a content safety reviewer. Classify the following user post "
        "with a single word: 'safe', 'unsafe', or 'uncertain'. "
        "Then provide a comma and your reasoning."
    ),
    "risk_assessor": (
        "You are a risk assessment specialist. Evaluate whether this content "
        "poses risks of harm, misinformation, or bias. Classify as 'safe', "
        "'unsafe', or 'uncertain', followed by a comma and reasoning."
    ),
    "policy_checker": (
        "You are a platform policy compliance checker. Evaluate whether this "
        "content violates terms of service or community guidelines. Classify "
        "as 'safe', 'unsafe', or 'uncertain', followed by a comma and reasoning."
    ),
}


async def monte_carlo_sample(llm_client, prompt: str, input_text: str, n: int) -> list[str]:
    """Generate n independent classification samples.

    llm_client is assumed to expose an async complete(prompt) -> str method.
    """
    tasks = [
        llm_client.complete(f"{prompt}\n\nContent: {input_text}")
        for _ in range(n)
    ]
    responses = await asyncio.gather(*tasks)
    labels = []
    for r in responses:
        # The label is everything before the first comma.
        first_word = r.strip().split(",")[0].strip().lower()
        if first_word in ("safe", "unsafe", "uncertain"):
            labels.append(first_word)
        else:
            # Unparseable output counts as uncertainty.
            labels.append("uncertain")
    return labels


def apply_policy(samples: list[str], threshold: float = 0.67) -> str:
    """Majority vote with confidence threshold. Returns label or 'escalate'."""
    counts = Counter(samples)
    total = len(samples)
    if total == 0:
        return "escalate"  # no samples means no evidence to finalize on
    # Note: with n=3, two agreeing samples give 2/3 = 0.667 < 0.67, so the
    # default threshold only finalizes on a unanimous vote.
    for label in ["safe", "unsafe"]:
        if counts.get(label, 0) / total >= threshold:
            return label
    return "escalate"


async def run_pipeline(llm_client, input_text: str, n: int = 3) -> dict:
    """Traverse the DAG from content_reviewer to a terminal state."""
    trajectory = []
    current_agent = "content_reviewer"
    while current_agent not in ("safe", "unsafe", "human_review"):
        config = PROCESS_MAP[current_agent]
        prompt = AGENT_PROMPTS.get(current_agent)
        if prompt is None:
            # Triage node: this simplified version always routes to the first
            # listed specialist, so policy_checker is never reached.
            current_agent = config["escalates_to"][0]
            continue
        samples = await monte_carlo_sample(llm_client, prompt, input_text, n)
        decision = apply_policy(samples, threshold=0.67)
        trajectory.append({
            "agent": current_agent,
            "samples": samples,
            "decision": decision,
        })
        if decision == "escalate":
            next_node = config["escalates_to"]
            current_agent = next_node if isinstance(next_node, str) else next_node[0]
        else:
            current_agent = decision  # terminal state
    return {"label": current_agent, "trajectory": trajectory}

Output: Each input gets a label (safe, unsafe, or human_review) plus a full trajectory showing which agents processed it and their sample distributions.
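Because cases that land in human_review are read by a person, it can help to render the trajectory compactly. The sketch below assumes the result dictionary has the shape described above (a `label` plus a `trajectory` of `agent`, `samples`, and `decision` entries); `summarize_trajectory` and `example_result` are hypothetical names introduced for illustration.

```python
from collections import Counter


def summarize_trajectory(result):
    """Render a pipeline result as plain-text lines for a human reviewer."""
    lines = [f"final state: {result['label']}"]
    for step in result["trajectory"]:
        counts = Counter(step["samples"])
        # Sort labels alphabetically so the summary is stable across runs.
        votes = ", ".join(f"{label}={n}" for label, n in sorted(counts.items()))
        lines.append(f"{step['agent']}: {votes} -> {step['decision']}")
    return "\n".join(lines)


# Hand-written result in the same shape as the pipeline output.
example_result = {
    "label": "human_review",
    "trajectory": [
        {"agent": "content_reviewer",
         "samples": ["safe", "unsafe", "uncertain"],
         "decision": "escalate"},
        {"agent": "risk_assessor",
         "samples": ["unsafe", "unsafe", "safe"],
         "decision": "escalate"},
    ],
}

print(summarize_trajectory(example_result))
# final state: human_review
# content_reviewer: safe=1, uncertain=1, unsafe=1 -> escalate
# risk_assessor: safe=1, unsafe=2 -> escalate
```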
Example 2: Document Compliance Review
User: "I need a three-stage review pipeline for loan applications: initial screening, risk assessment, and regulatory compliance. Uncertain cases go to a human underwriter."
Approach:
Terminal states: approved, rejected, human_underwriter
DAG: Screener -> RiskAnalyst -> RegulatoryReviewer -> human_underwriter

LOAN_DAG = {
    "screener": {
        "sop": "Check basic eligibility: income, credit score, employment status.",
        "escalates_to": "risk_analyst",
        "labels": ["approved", "rejected", "uncertain"],
        "threshold": 0.8,  # high confidence to auto-approve at first pass
    },
    "risk_analyst": {
        "sop": "Assess default risk, debt-to-income ratio, collateral value.",
        "escalates_to": "regulatory_reviewer",
        "labels": ["approved", "rejected", "uncertain"],
        "threshold": 0.6,
    },
    "regulatory_reviewer": {
        "sop": "Verify compliance with lending regulations, fair lending laws, TILA.",
        "escalates_to": "human_underwriter",
        "labels": ["approved", "rejected", "uncertain"],
        "threshold": 0.6,
    },
}

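Each agent runs 5 independent LLM calls (n=5), so the Screener's 0.8 threshold needs 4 of 5 samples to agree and the 0.6 thresholds need 3 of 5. The Screener uses the high threshold because auto-approving at the first stage carries the most risk. Later agents use lower thresholds because cases reaching them have already been flagged as marginal. Note that this reverses the permissive-first-pass guidance given for threshold tuning; here the cost of a wrong early approval is taken to outweigh throughput.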
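A minimal traversal over a linear DAG with per-agent thresholds might look like the following. It is a sketch under stated assumptions: `LOAN_DAG` here is a trimmed copy holding only the routing fields, `traverse` and `decide` are hypothetical helpers, and `CANNED` supplies fixed votes in place of real model calls.

```python
from collections import Counter

# Trimmed copy of the loan DAG: only the fields needed for routing.
LOAN_DAG = {
    "screener": {"escalates_to": "risk_analyst", "threshold": 0.8},
    "risk_analyst": {"escalates_to": "regulatory_reviewer", "threshold": 0.6},
    "regulatory_reviewer": {"escalates_to": "human_underwriter", "threshold": 0.6},
}
FINAL_LABELS = ("approved", "rejected")


def decide(samples, threshold):
    """Finalize the top label if its vote share meets the agent's threshold."""
    label, votes = Counter(samples).most_common(1)[0]
    if label in FINAL_LABELS and votes / len(samples) >= threshold:
        return label
    return "escalate"


def traverse(dag, start, sample_fn):
    """Walk the DAG; sample_fn(agent) stands in for n LLM calls at that agent."""
    node, trajectory = start, []
    while node in dag:  # any name not in the DAG is a terminal state
        decision = decide(sample_fn(node), dag[node]["threshold"])
        trajectory.append((node, decision))
        node = dag[node]["escalates_to"] if decision == "escalate" else decision
    return node, trajectory


# Canned votes (n=5 per agent) instead of real model output.
CANNED = {
    "screener": ["approved"] * 3 + ["uncertain"] * 2,     # 0.6 < 0.8: escalate
    "risk_analyst": ["approved"] * 3 + ["rejected"] * 2,  # 0.6 >= 0.6: approved
}

final, path = traverse(LOAN_DAG, "screener", CANNED.__getitem__)
print(final, path)
# approved [('screener', 'escalate'), ('risk_analyst', 'approved')]
```

The same 3-of-5 vote escalates at the Screener but finalizes at the RiskAnalyst, which shows how a per-agent threshold changes the outcome for identical evidence.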
Example 3: Tuning n and Thresholds
User: "My pipeline sends too many cases to human review. How do I reduce that?"
Approach:
n=1: accuracy=88.0%, human_review=0.2% (fast, aggressive)
n=3: accuracy=84.8%, human_review=2.3% (balanced)
n=5: accuracy=84.3%, human_review=3.8% (conservative)
n=25: accuracy=86.1%, human_review=1.1% (fine-grained thresholds)
Key insight: n=1 with a simple pass/escalate policy often outperforms higher n values on both accuracy and human review volume, because it avoids the "regression to uncertainty" that occurs when Monte Carlo samples introduce noise at moderate n. Use higher n only when you need fine-grained threshold control for specific agents.
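One concrete reason n and the threshold interact is resolution: with n samples, agreement can only take the values 0/n, 1/n, ..., n/n. The sketch below (the helper names are hypothetical) shows how many agreeing votes a given threshold actually demands. It illustrates only this granularity effect and does not by itself explain the accuracy figures above.

```python
def achievable_fractions(n):
    """All agreement values that n samples can produce."""
    return [k / n for k in range(n + 1)]


def min_votes_to_finalize(n, threshold):
    """Smallest k with k / n >= threshold, or None if no k qualifies."""
    for k in range(1, n + 1):
        if k / n >= threshold:
            return k
    return None


for n, threshold in [(1, 0.67), (3, 0.67), (5, 0.8), (25, 0.67)]:
    k = min_votes_to_finalize(n, threshold)
    print(f"n={n}, threshold={threshold}: need {k} of {n} votes")
# n=1, threshold=0.67: need 1 of 1 votes
# n=3, threshold=0.67: need 3 of 3 votes
# n=5, threshold=0.8: need 4 of 5 votes
# n=25, threshold=0.67: need 17 of 25 votes
```

At n=3 a 0.67 threshold is effectively a unanimity requirement, because 2/3 falls just short, whereas at n=25 thresholds can be moved in steps of 0.04.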
Joshi, A. & Rudow, M. (2026). Constrained Process Maps for Multi-Agent Generative AI Workflows. arXiv:2602.02034v1. Key sections: Section 3 (MDP formalization and DAG structure), Section 4 (Monte Carlo uncertainty estimation), Section 5 (compliance case study with self-harm detection results).