skills/bayesflow-probability-inference-framework/SKILL.md
Generate high-quality multi-step LLM workflows using Bayesian inference with parallel look-ahead rollouts and importance-weighted resampling. Use when: 'build a workflow for this task', 'generate an agent pipeline', 'create a multi-step LLM chain', 'optimize my prompt chain', 'design an agentic workflow', 'Bayesian workflow generation'.
npx skillsauth add ndpvt-web/arxiv-claude-skills bayesflow-probability-inference-framework
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to design and construct multi-step LLM workflows (chains of LLM calls, tool invocations, and post-processing logic) using the Bayesian Workflow Generation (BWG) framework from the BayesFlow paper. Instead of hand-crafting a single pipeline or brute-force searching over workflow configurations, BWG treats workflow generation as Bayesian inference: the LLM's internal knowledge serves as a prior over plausible workflows, task-specific reward signals form the likelihood, and the posterior concentrates probability on workflows that actually solve the task. The framework builds workflows step-by-step using parallel look-ahead rollouts for importance weighting and a sequential refiner for pool-wide quality improvements.
Core insight: Most workflow generation methods treat pipeline design as an optimization problem -- they search or evolve workflows heuristically. BWG recasts this as Bayesian inference over a posterior distribution on workflows: q(s_{1:T} | s_0) proportional to p(s_{1:T} | s_0) * exp[R(s_{1:T})], where p is the prior (the LLM's natural distribution over workflow steps given a task description s_0) and R is a reward function measuring task performance. The posterior concentrates mass on workflows that are both plausible (under the LLM's prior knowledge) and effective (high reward).
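To make the posterior concrete, here is a minimal sketch that evaluates q proportional to prior times exp(reward) over a tiny, fully enumerated set of workflows. The three workflow names, their prior probabilities, and their rewards are made-up numbers for illustration; in BWG the space of workflows is far too large to enumerate, which is why the framework samples instead.

```python
import math
from typing import Dict


def posterior(prior: Dict[str, float], reward: Dict[str, float]) -> Dict[str, float]:
    """Return q(w) proportional to prior[w] * exp(reward[w]), normalized to sum to 1."""
    unnormalized = {w: prior[w] * math.exp(reward[w]) for w in prior}
    z = sum(unnormalized.values())
    return {w: v / z for w, v in unnormalized.items()}


# Hypothetical numbers, for illustration only.
prior = {
    "single_call": 0.6,              # most plausible under the prior
    "decompose_then_answer": 0.3,
    "decompose_answer_verify": 0.1,  # least plausible under the prior
}
reward = {
    "single_call": 0.0,
    "decompose_then_answer": 1.0,
    "decompose_answer_verify": 2.0,  # highest reward
}

q = posterior(prior, reward)
for name, prob in q.items():
    print(f"{name}: {prob:.2f}")
```

With these numbers the posterior is roughly 0.28, 0.38, and 0.34: the middle workflow wins because it is both reasonably plausible and reasonably rewarded, while the highest-reward workflow is held back by its low prior.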
How it works in practice: BWG maintains a pool of N partial workflows. At each step t, it (1) extends each partial workflow by sampling a next step from the LLM, (2) scores each extended prefix by running K parallel look-ahead rollouts to completion and averaging the rewards to compute importance weights w_i = (1/K) * sum(exp[R(completion_k)]), and (3) resamples the pool according to normalized weights so promising prefixes are duplicated and poor ones are dropped. This step-level resampling avoids the "weight degeneracy" problem of scoring only complete trajectories. An optional in-loop refiner selects top candidates, applies targeted edits (a single consequential change), and inserts refined workflows back into the pool for diversity.
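The extend, weight, and resample cycle described above can be sketched as a single function. This is a simplified illustration, not the paper's Algorithm 1: `step_update`, `propose_step`, and `rollout_reward` are hypothetical names, and `rollout_reward` stands in for "complete the workflow with the LLM, run it on validation examples, and return the reward". The toy callables at the bottom replace the LLM entirely so the sketch runs on its own.

```python
import math
import random
from typing import Callable, List

Prefix = List[str]  # a partial workflow, one string per step


def step_update(
    pool: List[Prefix],
    propose_step: Callable[[Prefix], str],
    rollout_reward: Callable[[Prefix], float],
    k: int,
    rng: random.Random,
) -> List[Prefix]:
    """One round: extend each prefix, weight it by look-ahead, then resample."""
    # (1) Extend every partial workflow by one sampled step.
    extended = [prefix + [propose_step(prefix)] for prefix in pool]

    # (2) Importance weight: average of exp(reward) over K look-ahead rollouts.
    weights = [
        sum(math.exp(rollout_reward(prefix)) for _ in range(k)) / k
        for prefix in extended
    ]

    # (3) Resample N prefixes with replacement, proportional to weight.
    #     random.choices normalizes the weights internally.
    chosen = rng.choices(extended, weights=weights, k=len(pool))
    return [list(prefix) for prefix in chosen]  # copy so duplicates stay independent


# Toy stand-ins (assumptions): steps are plain strings, and the "reward"
# favors prefixes whose latest step is a verification step, plus noise.
rng = random.Random(0)
candidate_steps = ["decompose", "answer", "verify"]


def toy_propose(prefix: Prefix) -> str:
    return rng.choice(candidate_steps)


def toy_rollout_reward(prefix: Prefix) -> float:
    return (1.0 if prefix[-1] == "verify" else 0.0) + rng.gauss(0.0, 0.1)


pool: List[Prefix] = [[] for _ in range(8)]
pool = step_update(pool, toy_propose, toy_rollout_reward, k=3, rng=rng)
```

Because resampling happens after every step rather than once at the end, weak prefixes are dropped before any more rollouts are spent on them.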
Why it matters: The framework has convergence guarantees -- without the refiner, the weighted empirical distribution provably converges to the target posterior as N grows. In practice, BayesFlow improved accuracy by up to 9 percentage points over SOTA baselines (AFlow, ADAS) across math, QA, knowledge, and coding benchmarks, while using 5-80% fewer tokens. The key practical takeaway: build workflows incrementally, score partial progress via look-ahead, and prune aggressively.
When a user asks you to design a multi-step LLM workflow using BWG principles:
1. Define the task and reward signal. Clarify the end-to-end task (e.g., "answer multi-hop questions", "generate and test code", "research and summarize"). Identify a concrete reward function: exact-match accuracy, F1 score, test pass rate, or user-defined quality criteria. The reward must be computable on validation examples.
2. Establish the helper interface. Define the two primitives available to the workflow: (a) chat_completion(system, instructions, history) -> response for LLM calls, and (b) exec_code(code, test_input) -> result or equivalent tool invocation. Keep the interface minimal -- BayesFlow deliberately avoids predefined agent modules (no hardcoded "Ensemble" or "Revise" nodes) and instead lets the workflow discover its own structure.
3. Generate an initial pool of N candidate workflow prefixes. Prompt the LLM to propose 5-15 distinct first steps for the task. Each step should be a self-contained code chunk with explicit # Step 1: annotation. Encourage diversity by varying the system prompt slightly (e.g., "propose a direct approach", "propose an approach with self-verification", "propose a chain-of-thought approach").
4. Extend each prefix by one step. For each partial workflow, prompt the LLM: "Given the workflow so far [steps 1..t-1], propose Step t." The LLM generates the next meaningful code chunk (an LLM call, a tool invocation, a parsing/validation step, or a conditional branch).
5. Score via parallel look-ahead rollouts. For each extended prefix, generate K=3-5 stochastic completions (the remaining steps through the end of the workflow) by prompting the LLM to finish the workflow. Run each completed workflow against validation examples and compute the reward. The importance weight for prefix i is: w_i = (1/K) * sum(exp[R(completion_k)]).
6. Resample the pool. Normalize the importance weights across all N prefixes. Resample N prefixes with replacement according to these weights. High-scoring prefixes get duplicated; low-scoring ones are dropped. This is the core Bayesian update -- it concentrates the pool on promising partial workflows.
7. Apply the in-loop refiner (optional but recommended). Select the top 3 candidates by validation reward. Sample one proportionally (softmax with temperature 0.1). Prompt the LLM: "Make exactly one consequential edit to this workflow to improve it. Preserve the overall structure and code format." Insert the edited workflow back into the pool. This adds diversity and enables global revisions that step-level extension cannot achieve.
8. Repeat steps 4-7 for T rounds (typically T=4-8, matching the natural number of steps in the workflow). At each round, collect any workflows that are already complete (all steps generated and functional).
9. Select the best workflow. From all completed workflows across all rounds, select the one with the highest validation reward. Optionally, keep the top-3 as an ensemble for even higher accuracy.
10. Output the final workflow as executable code. Present the workflow as a clean Python function (or structured pseudocode) with annotated steps, clear LLM call parameters, and error handling. Include the validation score and a brief rationale for why this structure was selected.
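The refiner's candidate selection (top 3 by validation reward, then a softmax draw at temperature 0.1) is easy to get subtly wrong, so here is a minimal sketch of one way to do it. The function names and the example scores are assumptions for illustration; the LLM edit itself is omitted.

```python
import math
import random
from typing import List, Sequence, Tuple


def softmax(scores: Sequence[float], temperature: float) -> List[float]:
    """Numerically stable softmax of scores / temperature."""
    scaled = [s / temperature for s in scores]
    peak = max(scaled)
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    return [e / total for e in exps]


def pick_for_refinement(
    scored: Sequence[Tuple[str, float]],
    rng: random.Random,
    top_k: int = 3,
    temperature: float = 0.1,
) -> str:
    """Keep the top_k workflows by reward, then sample one via softmax."""
    top = sorted(scored, key=lambda pair: pair[1], reverse=True)[:top_k]
    probs = softmax([reward for _, reward in top], temperature)
    return rng.choices([name for name, _ in top], weights=probs, k=1)[0]


# Hypothetical pool: (workflow id, validation reward).
scored_pool = [("wf_a", 0.78), ("wf_b", 0.74), ("wf_c", 0.60), ("wf_d", 0.10)]
chosen = pick_for_refinement(scored_pool, random.Random(0))
```

At temperature 0.1, a reward gap of 0.04 already shifts the odds noticeably (about 0.54 versus 0.37 for the top two scores here), so the draw leans toward the best candidate without always picking it.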
Example 1: Building a multi-hop QA workflow
User: "Design an LLM workflow that answers multi-hop questions like HotpotQA -- questions requiring information from multiple sources."
Approach:
Output (best workflow discovered):
```python
# chat_completion, parse_sub_questions, and extract_answer are helpers
# assumed to be defined elsewhere; they are not shown in this example.
def multi_hop_qa(question: str) -> str:
    # Step 1: Decompose the question into sub-questions
    decomposition = chat_completion(
        system="You are a question decomposition expert.",
        instructions=f"Break this multi-hop question into 2-3 independent sub-questions:\n{question}",
    )
    # Step 2: Answer each sub-question independently
    sub_answers = []
    for sub_q in parse_sub_questions(decomposition):
        ans = chat_completion(
            system="Answer the following factual question concisely.",
            instructions=sub_q,
        )
        sub_answers.append((sub_q, ans))
    # Step 3: Synthesize sub-answers into a final answer
    context = "\n".join(f"Q: {q}\nA: {a}" for q, a in sub_answers)
    final = chat_completion(
        system="You synthesize information from multiple sources.",
        instructions=f"Original question: {question}\n\nSub-answers:\n{context}\n\nProvide the final answer.",
    )
    # Step 4: Self-verify and correct
    verification = chat_completion(
        system="You are a critical reviewer.",
        instructions=f"Question: {question}\nProposed answer: {final}\n\nIs this answer correct and complete? If not, provide a corrected answer.",
    )
    return extract_answer(verification)
```
Validation F1: 0.78 (vs 0.64 for single-shot CoT baseline).
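The workflow above calls parse_sub_questions and extract_answer without defining them. The sketch below shows one plausible way to write them. Both implementations are hypothetical: they assume the model lists one sub-question per line (optionally numbered or bulleted) and that a final answer, when marked, follows an "Answer:" label. Real model output varies, so treat the parsing rules as a starting point.

```python
import re
from typing import List


def parse_sub_questions(decomposition: str) -> List[str]:
    """Return lines that look like questions, with list markers stripped."""
    questions = []
    for line in decomposition.splitlines():
        # Drop leading markers such as "1.", "2)", "-", or "*".
        cleaned = re.sub(r"^\s*(?:\d+[.)]|[-*])\s*", "", line).strip()
        if cleaned.endswith("?"):
            questions.append(cleaned)
    return questions


def extract_answer(verification: str) -> str:
    """Return the text after the last 'answer:' label, else the last non-empty line."""
    matches = re.findall(r"answer:\s*(.+)", verification, flags=re.IGNORECASE)
    if matches:
        return matches[-1].strip()
    lines = [ln.strip() for ln in verification.splitlines() if ln.strip()]
    return lines[-1] if lines else ""


subs = parse_sub_questions("1. Who directed Inception?\n2) When was he born?\n\nNotes: none")
answer = extract_answer("The draft is incomplete.\nCorrected answer: 1970")
```

Taking the last "answer:" match matters for the verification step: a reviewer response may quote the proposed answer first and give the corrected one afterwards.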
Example 2: Code generation with test-driven self-correction
User: "Create a workflow for generating Python functions that pass unit tests, like the MBPP benchmark."
Approach:
Output (best workflow discovered):
```python
# chat_completion and exec_code are helpers assumed to be defined elsewhere.
# exec_code is assumed to return an object with all_passed and error_log.
def code_gen_workflow(task_description: str, test_cases: list[str]) -> str:
    # Step 1: Generate initial solution
    solution = chat_completion(
        system="You are an expert Python programmer.",
        instructions=f"Write a Python function for:\n{task_description}",
    )
    # Step 2: Early exit if tests pass
    result = exec_code(solution, test_cases)
    if result.all_passed:
        return solution
    # Step 3: Analyze errors and generate targeted fix
    fix = chat_completion(
        system="You are a debugging expert. Do not use try-except blocks. Fix the root cause.",
        instructions=f"Code:\n{solution}\n\nErrors:\n{result.error_log}\n\nFix the code.",
    )
    # Step 4: Re-test the fix
    result2 = exec_code(fix, test_cases)
    if result2.all_passed:
        return fix
    # Step 5: Rewrite from scratch with error context
    rewrite = chat_completion(
        system="You are an expert Python programmer. Previous attempts failed.",
        instructions=f"Task: {task_description}\n\nPrevious errors: {result2.error_log}\n\nWrite a completely new solution.",
    )
    # Step 6: Final fallback test; if the rewrite also fails, return the earlier fix
    result3 = exec_code(rewrite, test_cases)
    return rewrite if result3.all_passed else fix
```
Validation pass@1: 85.0% (vs 75.5% for AFlow baseline).
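The workflow above relies on exec_code returning something with all_passed and error_log attributes, but the helper itself is not shown. Here is a minimal sketch of what it could look like, assuming each test case is a Python statement such as an assert. ExecResult and this implementation are hypothetical, and the sketch runs untrusted code in-process with no sandbox or timeout, which a real harness would need.

```python
import traceback
from dataclasses import dataclass
from typing import List


@dataclass
class ExecResult:
    all_passed: bool
    error_log: str


def exec_code(code: str, test_cases: List[str]) -> ExecResult:
    """Define the candidate code, then run each test statement against it.

    WARNING: exec() runs arbitrary code. Use a sandboxed subprocess with a
    timeout for anything beyond local experiments.
    """
    namespace: dict = {}
    try:
        exec(code, namespace)  # load the candidate solution
    except Exception:
        # Syntax errors and import-time failures end the run immediately.
        return ExecResult(all_passed=False, error_log=traceback.format_exc())

    errors: List[str] = []
    for test in test_cases:
        try:
            exec(test, namespace)  # e.g. "assert add(1, 2) == 3"
        except Exception as exc:
            errors.append(f"{test} -> {type(exc).__name__}: {exc}")
    return ExecResult(all_passed=not errors, error_log="\n".join(errors))


passing = exec_code("def add(a, b):\n    return a + b", ["assert add(1, 2) == 3"])
failing = exec_code("def add(a, b):\n    return a - b", ["assert add(1, 2) == 3"])
```

Recording the failing test statement alongside the exception type gives the debugging step in the workflow something specific to act on.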
Example 3: Applying BWG principles to optimize an existing workflow
User: "I have a summarization pipeline that calls the LLM once. Help me improve it using BayesFlow principles."
Approach:
Output:
```python
# Original (baseline): Single LLM call
# BWG-discovered workflow (3 steps, ROUGE-L improved from 0.34 to 0.41):
# chat_completion is a helper assumed to be defined elsewhere.
def summarize(document: str) -> str:
    # Step 1: Extract key claims and supporting evidence
    extraction = chat_completion(
        system="Extract the 5 most important claims from this document with supporting quotes.",
        instructions=document,
    )
    # Step 2: Draft summary from extracted claims
    draft = chat_completion(
        system="Write a concise summary covering all the key claims below.",
        instructions=extraction,
    )
    # Step 3: Verify coverage and compress
    # Only the first 2,000 characters of the original are passed to the reviewer.
    final = chat_completion(
        system="Review this summary against the original. Fix any missing key points. Remove redundancy. Keep it under 150 words.",
        instructions=f"Original:\n{document[:2000]}\n\nDraft summary:\n{draft}",
    )
    return final
```
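This example uses ROUGE-L as its reward, so a sketch of the metric may help when wiring up a validation loop. The version below is deliberately simplified: it lowercases, splits on whitespace, applies no stemming, and reports the balanced F-measure over the longest common subsequence. Published implementations differ in tokenization and in how they weight recall against precision, so scores from this sketch will not match them exactly.

```python
from typing import List


def lcs_length(a: List[str], b: List[str]) -> int:
    """Length of the longest common subsequence, using a rolling DP row."""
    prev = [0] * (len(b) + 1)
    for x in a:
        curr = [0]
        for j, y in enumerate(b, start=1):
            if x == y:
                curr.append(prev[j - 1] + 1)
            else:
                curr.append(max(prev[j], curr[j - 1]))
        prev = curr
    return prev[-1]


def rouge_l(candidate: str, reference: str) -> float:
    """Simplified ROUGE-L F1 on lowercased whitespace tokens."""
    cand = candidate.lower().split()
    ref = reference.lower().split()
    lcs = lcs_length(cand, ref)
    if lcs == 0:
        return 0.0
    precision = lcs / len(cand)
    recall = lcs / len(ref)
    return 2 * precision * recall / (precision + recall)


score = rouge_l("the cat sat on the mat", "the cat lay on the mat")
```

Averaging this score over a handful of validation documents gives a reward that can be plugged into the look-ahead weighting.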
Keep the helper interface minimal (chat_completion and tool execution). Let the workflow discover its own structure -- avoid hardcoding agent roles or module types.
Annotate each step explicitly (# Step N: description) so the LLM can reason about workflow structure when extending or refining.
Paper: "BayesFlow: A Probability Inference Framework for Meta-Agent Assisted Workflow Generation" (Yuan et al., EACL 2026 Findings). arXiv: 2601.22305. Look for Algorithm 1 (StepUpdate) and Algorithm 2 (BayesFlow main loop) for the core pseudocode, and Tables 1-2 for benchmark comparisons against AFlow, ADAS, and other baselines.