skills/gflowpo-generative-flow-network/SKILL.md
Optimize LLM prompts using GFlowPO's iterative generate-evaluate-refine loop with diversity-preserving exploration and dynamic memory. Use when: 'optimize this prompt', 'find a better prompt for this task', 'prompt engineering with examples', 'auto-tune my system prompt', 'improve prompt accuracy', 'generate prompt variations'.
npx skillsauth add ndpvt-web/arxiv-claude-skills gflowpo-generative-flow-network
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to systematically optimize prompts for language model tasks using the GFlowPO methodology -- a probabilistic framework that treats prompt search as posterior inference over a latent prompt space. Instead of greedy hill-climbing or random mutation, GFlowPO maintains a diverse pool of candidate prompts, scores them against real task examples, and uses a dynamic memory mechanism to progressively focus search on high-reward regions while preserving exploration diversity. Claude applies this as a structured iterative workflow: generate diverse candidates, evaluate on held-out examples, update a meta-prompt with both top-performing and diverse reference prompts, then repeat.
Prompt optimization as posterior inference. GFlowPO reframes the prompt search problem: instead of maximizing a single reward signal (which tends to cause mode collapse onto a narrow prompt style), it samples prompts in proportion to a reward-weighted posterior. The target distribution is proportional to R(z) * p_ref(z|M), where R(z) is the task accuracy reward and p_ref(z|M) is a prior from a reference LM conditioned on a meta-prompt M. The optimizer therefore balances exploitation (high accuracy) against staying close to coherent, well-formed prompts.
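As a toy illustration of sampling in proportion to R(z) * p_ref(z|M), the sketch below normalizes that product over a small, finite candidate set. The function name `posterior_weights` and all numbers are hypothetical; GFlowPO itself does not enumerate the prompt space this way.

```python
def posterior_weights(rewards, prior_probs):
    """Normalize R(z) * p_ref(z|M) over a finite list of candidates."""
    unnormalized = [r * p for r, p in zip(rewards, prior_probs)]
    total = sum(unnormalized)
    if total <= 0:
        raise ValueError("at least one candidate needs positive reward and prior")
    return [u / total for u in unnormalized]


# Hypothetical values for three candidate prompts.
rewards = [8.01, 6.01, 2.01]    # R(z): epsilon + number of correct answers
prior_probs = [0.1, 0.5, 0.4]   # p_ref(z|M): assumed prior from the reference LM
weights = posterior_weights(rewards, prior_probs)
```

In this made-up example the second candidate receives the largest weight even though the first has the highest reward, because the prior favors it. A pure reward maximizer would always pick the first.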
Dynamic Memory Update (DMU). The core practical insight is the DMU mechanism: at each iteration, the meta-prompt is updated by injecting (1) diverse prompts sampled uniformly from a replay buffer of all previously seen candidates, and (2) the top-performing prompts from a priority queue. This is training-free -- no gradient updates, just swapping reference examples in the meta-prompt. The diversity injection prevents the search from collapsing to a single prompt family, while the top-performer injection steers generation toward proven strategies.
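A minimal sketch of the reference-selection part of DMU, under some simplifying assumptions: the replay buffer is a plain list of (prompt, score) pairs, the top performers are read from that list with `heapq.nlargest` rather than from a separately maintained priority queue, and the function name and default counts are illustrative rather than taken from the paper.

```python
import heapq
import random


def dynamic_memory_update(replay_buffer, n_diverse=2, n_top=1, rng=None):
    """Pick reference prompts for the next meta-prompt.

    replay_buffer: list of (prompt, score) pairs seen so far.
    Returns the top-scoring entries followed by uniformly sampled ones.
    """
    rng = rng or random.Random()
    top = heapq.nlargest(n_top, replay_buffer, key=lambda pair: pair[1])
    # Uniform sampling ignores the score, which is what keeps low-scoring
    # but stylistically different prompts in circulation.
    diverse = rng.sample(replay_buffer, min(n_diverse, len(replay_buffer)))
    return top + diverse


# Hypothetical buffer contents.
buffer = [
    ("Classify the review.", 7.01),
    ("Label the sentiment.", 9.01),
    ("Is it positive?", 5.01),
]
refs = dynamic_memory_update(buffer, rng=random.Random(0))
```

Because the diverse draw is uniform over the whole buffer, it can occasionally repeat the top prompt. Whether to deduplicate is a design choice this sketch leaves open.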
Off-policy replay for sample efficiency. Unlike on-policy RL prompt optimizers that discard past evaluations, GFlowPO stores all (prompt, score) pairs in a replay buffer and reuses them. The training policy mixes 50% fresh generations with 50% replay samples. This means every expensive LLM evaluation is used multiple times, making the approach 2-4x more sample-efficient than on-policy baselines like StablePrompt.
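The 50/50 mix can be sketched as follows. `generate_fn` is a hypothetical callable standing in for the prompt-generating LM, and `mixed_batch` is an illustrative name, not an API from the paper.

```python
import random


def mixed_batch(generate_fn, replay_buffer, batch_size, replay_fraction=0.5, rng=None):
    """Build a batch from fresh generations plus replayed prompts.

    generate_fn(n) is a hypothetical callable returning n new prompt strings.
    replay_buffer is a list of (prompt, score) pairs from earlier iterations.
    """
    rng = rng or random.Random()
    # Never ask for more replayed prompts than the buffer holds.
    n_replay = min(int(batch_size * replay_fraction), len(replay_buffer))
    n_fresh = batch_size - n_replay
    replayed = [prompt for prompt, _ in rng.sample(replay_buffer, n_replay)]
    return generate_fn(n_fresh) + replayed


# Hypothetical usage with a stub generator.
buffer = [(f"old prompt {i}", float(i)) for i in range(6)]
batch = mixed_batch(
    lambda n: [f"new prompt {i}" for i in range(n)],
    buffer,
    batch_size=8,
    rng=random.Random(0),
)
```

Replayed prompts already have stored scores, so only the fresh half of the batch needs new evaluations. On the first iteration, when the buffer is empty, the batch falls back to all fresh generations.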
1. Define the task and collect evaluation examples. Gather 10-50 input/output pairs that represent the target task. Split them into a scoring set (used to evaluate candidate prompts) and a held-out test set (used only for final validation). Identify the evaluation metric (exact match, F1, contains-answer, etc.).
2. Construct the initial meta-prompt. Build a template with three sections: (a) a task-agnostic instruction like "Generate a clear, concise instruction for the following task", (b) 3-5 randomly sampled input/output demonstration pairs from the scoring set, and (c) 2-3 initial reference prompts (the user's current best prompt plus 1-2 simple variations).
3. Generate a diverse batch of candidate prompts. Using the meta-prompt, generate 8-16 candidate prompts. Apply temperature 0.7-1.0 to encourage diversity. Each candidate should be a complete instruction that could replace the original prompt. Vary phrasing, structure, level of detail, and tone.
4. Score each candidate on the scoring set. For every candidate prompt, run it against all examples in the scoring set. Compute the reward as R(z) = epsilon + (number of correct predictions), where epsilon is a small constant (0.01) that keeps rewards strictly positive. Record (prompt, score) pairs.
5. Update the replay buffer and priority queue. Add all (prompt, score) pairs to the replay buffer. Update the priority queue to hold the top-5 highest-scoring prompts seen across all iterations.
6. Perform Dynamic Memory Update on the meta-prompt. Sample 2 prompts uniformly at random from the replay buffer (diversity injection) and 1 prompt from the top of the priority queue (exploitation injection). Replace the reference prompts in the meta-prompt with these 3 prompts and their scores.
7. Generate the next batch of candidates using the updated meta-prompt. The meta-prompt now contains both high-performing examples and diverse alternatives, steering generation toward promising yet varied regions of prompt space.
8. Repeat steps 4-7 for 3-6 iterations. Each iteration refines the meta-prompt's reference prompts, progressively improving candidate quality. Monitor whether the top score plateaus -- if the best score hasn't improved in 2 consecutive iterations, stop.
9. Select the final prompt from the priority queue. Take the top-3 prompts from the priority queue and evaluate each on the held-out test set. Report the winner with its test accuracy and the runner-up alternatives.
10. Document the optimization trace. Output a summary showing: initial prompt score, final prompt score, number of iterations, total candidates evaluated, and the top-3 prompts with their scores for reproducibility.
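The workflow above can be sketched as a single loop. This is a simplified outline, not the paper's implementation: `generate` and `predict` are hypothetical callables wrapping whatever LLM API is in use, the priority queue is replaced by `heapq.nlargest` over the replay buffer, and the toy stand-ins at the bottom exist only to make the sketch runnable.

```python
import heapq
import random

EPSILON = 0.01  # small constant so no prompt gets exactly zero reward


def reward(prompt, examples, predict):
    """R(z) = epsilon + number of scoring examples answered correctly."""
    return EPSILON + sum(predict(prompt, x) == y for x, y in examples)


def optimize(seed_prompts, examples, generate, predict,
             max_iters=6, batch_size=8, patience=2, rng=None):
    """Run the generate / score / update loop with plateau-based stopping."""
    rng = rng or random.Random()
    replay = [(p, reward(p, examples, predict)) for p in seed_prompts]
    references = list(replay)
    best = max(score for _, score in replay)
    stale = 0
    for _ in range(max_iters):
        candidates = generate(references, batch_size)
        replay += [(p, reward(p, examples, predict)) for p in candidates]
        # Dynamic Memory Update: 1 top prompt + 2 uniformly sampled ones.
        top = heapq.nlargest(1, replay, key=lambda pair: pair[1])
        references = top + rng.sample(replay, min(2, len(replay)))
        new_best = top[0][1]
        stale = 0 if new_best > best else stale + 1
        best = max(best, new_best)
        if stale >= patience:  # no improvement for `patience` iterations
            break
    return heapq.nlargest(3, replay, key=lambda pair: pair[1])


# Toy stand-ins: the "model" only answers correctly if the prompt says "label".
examples = [("great product", "positive"), ("broke in a day", "negative")]


def predict(prompt, x):
    if "label" not in prompt:
        return "unknown"
    return "positive" if "great" in x else "negative"


def generate(references, n):
    return [f"{references[0][0]} Output only the label." for _ in range(n)]


top3 = optimize(["Classify the review."], examples, generate, predict,
                rng=random.Random(0))
```

The returned top-3 would then be re-scored with the same `reward` function on the held-out test set, which this sketch leaves out.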
Example 1: Optimizing a sentiment classification prompt
User: "I have this prompt for sentiment analysis: 'Classify the following review as positive or negative.' It gets 72% accuracy on my test set. Can you optimize it?"
Approach:
Output:
Optimization complete (4 iterations, 48 candidates evaluated)
Best prompt (88% test accuracy):
"Read the customer review carefully. Based on the overall sentiment
expressed by the reviewer, classify it as exactly 'positive' or
'negative'. Output only the classification label, no explanation."
Runner-up (86% test accuracy):
"Determine whether the following product review is positive or negative.
Consider the reviewer's overall satisfaction. Respond with one word:
positive or negative."
Improvement: 72% -> 88% (+16 percentage points)
Example 2: Finding an instruction for a code explanation task
User: "I need a prompt that makes GPT-4 explain code snippets to beginners. I have 15 example pairs of (code, good_explanation). Help me find the best instruction."
Approach:
Output:
Optimization complete (3 iterations, 30 candidates evaluated)
Best prompt (ROUGE-L: 0.71):
"You are a patient programming tutor. Explain the following code snippet
line by line in plain English. Assume the reader has no programming
experience. Use analogies where helpful. End with a one-sentence summary
of what the code does overall."
Runner-up (ROUGE-L: 0.68):
"Break down this code for someone who has never programmed before.
Explain what each line does using everyday language, then summarize
the overall purpose."
Original prompt ROUGE-L: 0.54 -> Best: 0.71 (+31% relative improvement)
Example 3: Optimizing a multi-step reasoning prompt
User: "My chain-of-thought prompt for math word problems only gets 60% on GSM8K-style questions. I have 25 example problems with solutions. Optimize the prompt."
Approach:
Output:
Optimization complete (5 iterations, 40 candidates evaluated)
Best prompt (77% test accuracy):
"Solve this math problem step by step. First, identify all given
quantities and the question being asked. Then set up the equations
needed. Solve each step showing your arithmetic. Finally, verify
your answer by checking it satisfies the original conditions.
Write your final numerical answer after 'Answer:'."
Improvement: 60% -> 77% (+17 percentage points)
Top-3 prompts preserved for A/B testing.
Paper: GFlowPO: Generative Flow Network as a Language Model Prompt Optimizer (Cho et al., 2026)
Key insight: Casting prompt search as posterior inference with R(z) * p_ref(z|M) as the target distribution, combined with a training-free Dynamic Memory Update that injects both diverse and top-performing prompts into the meta-prompt, achieves state-of-the-art prompt optimization with 2-4x better sample efficiency than on-policy RL baselines.