skills/c-mop-integrating-momentum-boundary-aware/SKILL.md
Optimize LLM system prompts iteratively using boundary-aware contrastive sampling and momentum-guided clustering from the C-MOP framework. Use when: 'optimize this prompt', 'improve my system prompt', 'evolve prompt for better accuracy', 'automatic prompt tuning', 'prompt optimization with examples', 'refine prompt using test cases'.
npx skillsauth add ndpvt-web/arxiv-claude-skills c-mop-integrating-momentum-boundary-aware
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to systematically optimize LLM prompts using the C-MOP (Cluster-based Momentum Optimized Prompting) framework. Instead of ad-hoc prompt tweaking, C-MOP applies two structured mechanisms: Boundary-Aware Contrastive Sampling (BACS) to identify the most informative success/failure cases, and Momentum-Guided Semantic Clustering (MGSC) to accumulate stable optimization signals across iterations. The result is a disciplined prompt evolution process that avoids the noisy, contradictory edits typical of naive prompt refinement.
The core problem with naive prompt optimization is conflicting signals: fixing failure case A introduces a change that breaks previously passing case B, so successive iterations oscillate instead of converging. C-MOP addresses this with two mechanisms.
Boundary-Aware Contrastive Sampling (BACS) categorizes evaluation examples into three groups after each prompt trial: Anchors (cases the prompt consistently gets right -- these define "what's working"), Hard Negatives (cases the prompt consistently gets wrong -- these define "what's broken"), and Boundary Pairs (cases near the decision boundary that sometimes pass, sometimes fail -- these reveal where the prompt is ambiguous). Instead of feeding the optimizer a random mix of successes and failures, BACS selects a structured triplet of anchor + hard negative + boundary case. This gives the optimization signal maximum contrast: "keep doing X (anchor), stop doing Y (hard negative), and clarify Z (boundary)."
Momentum-Guided Semantic Clustering (MGSC) maintains a history of suggested prompt edits ("gradients") across iterations and clusters them semantically using embeddings. Edits that recur across multiple iterations get amplified (they represent persistent issues), while one-off suggestions decay via a temporal weight factor (momentum = alpha * previous_momentum + (1 - alpha) * new_gradient, with alpha typically 0.8-0.9). This filters out noise and surfaces the stable consensus about what the prompt actually needs.
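A minimal sketch of the momentum update described above, assuming each gradient is represented as a plain embedding vector. The function name `update_momentum`, the list-of-floats representation, and the toy two-dimensional gradients are illustrative assumptions, not the paper's implementation.

```python
from typing import List


def update_momentum(previous: List[float], new_gradient: List[float],
                    alpha: float = 0.85) -> List[float]:
    """Exponential moving average: alpha * previous + (1 - alpha) * new."""
    if len(previous) != len(new_gradient):
        raise ValueError("vectors must have the same length")
    return [alpha * p + (1.0 - alpha) * g for p, g in zip(previous, new_gradient)]


# Toy example (assumed data): dimension 0 is a suggestion that recurs in every
# iteration, dimension 1 is a one-off suggestion that appears only at step 2.
momentum = [0.0, 0.0]
for step in range(5):
    gradient = [1.0, 1.0 if step == 2 else 0.0]
    momentum = update_momentum(momentum, gradient)

# The recurring signal accumulates while the one-off signal decays.
print(momentum)
```

With alpha in the 0.8-0.9 range, a single noisy suggestion contributes at most (1 - alpha) and then shrinks by a factor of alpha each round, which is why only recurring themes build up weight.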
Collect the prompt and evaluation set. Obtain the user's current system prompt and a set of at least 10 to 20 test cases with inputs and expected outputs. Larger sets (50+ cases) yield better boundary detection. Parse them into a structured list: [{input, expected_output, category (optional)}].
Run baseline evaluation. Execute the current prompt against all test cases. Record each result as pass/fail with the actual output. Compute baseline accuracy. This is iteration 0.
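The collection and baseline steps can be sketched as below. This is a rough illustration under stated assumptions: `EvalCase`, `EvalResult`, `evaluate`, and `accuracy` are hypothetical names, `run_model` stands in for whatever function calls the target LLM, and case-insensitive exact match is only one possible pass/fail criterion.

```python
from dataclasses import dataclass
from typing import Callable, List, Optional


@dataclass
class EvalCase:
    input: str
    expected_output: str
    category: Optional[str] = None


@dataclass
class EvalResult:
    case: EvalCase
    actual_output: str
    passed: bool


def evaluate(prompt: str, cases: List[EvalCase],
             run_model: Callable[[str, str], str]) -> List[EvalResult]:
    """Run the prompt on every case and record pass/fail with the actual output."""
    results = []
    for case in cases:
        actual = run_model(prompt, case.input)
        # Assumption: a case passes on a case-insensitive exact match.
        passed = actual.strip().lower() == case.expected_output.strip().lower()
        results.append(EvalResult(case, actual, passed))
    return results


def accuracy(results: List[EvalResult]) -> float:
    """Fraction of passing cases; 0.0 for an empty result list."""
    return sum(r.passed for r in results) / len(results) if results else 0.0


def stub_model(prompt: str, text: str) -> str:
    # Stand-in for a real LLM call so the sketch runs offline.
    return "positive" if "great" in text else "negative"


cases = [EvalCase("great product", "positive"), EvalCase("it was fine", "neutral")]
baseline = evaluate("Classify the review.", cases, stub_model)
baseline_accuracy = accuracy(baseline)  # iteration 0
```

Replacing `stub_model` with a real model call is the only change needed to use this on an actual test set; tasks with free-form outputs would need a different pass criterion.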
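Classify examples using BACS tripartite sampling. Sort results into three bins: Anchors (cases the prompt consistently gets right), Hard Negatives (cases it consistently gets wrong), and Boundary cases (cases that sometimes pass and sometimes fail).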
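One way to sketch the three-way BACS split is to run each case several times and bin it by its pass pattern. The function name `classify_bacs`, the repeated-trial setup, and the all-pass / all-fail thresholds are assumptions made for illustration; the paper may define the boundary differently.

```python
from typing import Dict, List


def classify_bacs(outcomes: Dict[str, List[bool]]) -> Dict[str, List[str]]:
    """Bin case ids by their pass/fail pattern across repeated trials."""
    bins: Dict[str, List[str]] = {"anchors": [], "hard_negatives": [], "boundary": []}
    for case_id, trials in outcomes.items():
        if not trials:
            continue  # no evidence for this case yet
        if all(trials):
            bins["anchors"].append(case_id)         # consistently right
        elif not any(trials):
            bins["hard_negatives"].append(case_id)  # consistently wrong
        else:
            bins["boundary"].append(case_id)        # sometimes passes, sometimes fails
    return bins


# Assumed toy outcomes: three trials per case.
outcomes = {
    "case-1": [True, True, True],
    "case-2": [False, False, False],
    "case-3": [True, False, True],
}
bins = classify_bacs(outcomes)
```

A triplet for the optimizer is then one id drawn from each bin.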
Generate structured optimization gradient. Present the LLM optimizer with the current prompt plus the BACS triplet in this format:
Current prompt: [prompt text]
ANCHOR (working correctly):
Input: [anchor input] -> Expected: [expected] -> Got: [correct output]
Analysis: The prompt handles this well because [reason].
HARD NEGATIVE (failing):
Input: [hard neg input] -> Expected: [expected] -> Got: [wrong output]
Analysis: The prompt fails here because [diagnosis].
BOUNDARY CASE (ambiguous):
Input: [boundary input] -> Expected: [expected] -> Got: [partial/wrong output]
Analysis: The prompt is unclear about [specific ambiguity].
Task: Generate 2-3 specific, minimal edits to the prompt that fix the hard negative
and boundary case WITHOUT breaking the anchor case pattern.
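The template above can be filled programmatically. The following is a sketch only: `TripletExample` and `build_gradient_request` are hypothetical names, and the analysis strings are assumed to be supplied by the caller (for instance from an earlier diagnostic pass).

```python
from dataclasses import dataclass


@dataclass
class TripletExample:
    input: str
    expected: str
    got: str
    analysis: str


def build_gradient_request(prompt: str, anchor: TripletExample,
                           hard_negative: TripletExample,
                           boundary: TripletExample) -> str:
    """Render the current prompt plus one BACS triplet in the template format."""
    def block(title: str, ex: TripletExample) -> str:
        return (f"{title}\n"
                f"Input: {ex.input} -> Expected: {ex.expected} -> Got: {ex.got}\n"
                f"Analysis: {ex.analysis}")

    return "\n".join([
        f"Current prompt: {prompt}",
        block("ANCHOR (working correctly):", anchor),
        block("HARD NEGATIVE (failing):", hard_negative),
        block("BOUNDARY CASE (ambiguous):", boundary),
        "Task: Generate 2-3 specific, minimal edits to the prompt that fix the hard negative",
        "and boundary case WITHOUT breaking the anchor case pattern.",
    ])


# Assumed toy triplet for a sentiment prompt.
request = build_gradient_request(
    "Classify the review as positive, negative, or neutral.",
    TripletExample("Loved it", "positive", "positive", "Clear sentiment word."),
    TripletExample("Oh great, it broke again", "negative", "positive", "Sarcasm is not handled."),
    TripletExample("It works", "neutral", "positive", "Neutral criteria are unspecified."),
)
```

The returned string is what gets sent to the optimizer model; its reply holds the candidate edits for this iteration.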
Apply momentum to candidate edits. Maintain a running list of all suggested edits from previous iterations. Cluster semantically similar suggestions (e.g., "add output format specification" and "clarify expected response structure" are the same theme). Weight each cluster by recurrence count and recency:
weight = count * decay^(current_iteration - last_seen_iteration), where decay = 0.85
Generate candidate prompts. Produce 3-4 prompt variants by applying the top-weighted edit themes. Each variant should make minimal, targeted changes. Keep a "beam" of the top candidates (beam_size = 4 is the paper's default).
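The recency-weighted cluster score and the choice of top-weighted edit themes can be sketched as follows. The sketch assumes the semantic clustering has already been done (for example with an embedding model, which is not shown); `EditCluster`, `cluster_weight`, and `top_themes` are hypothetical names.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class EditCluster:
    theme: str                 # human-readable label for the cluster
    count: int                 # how many suggested edits fell into this cluster
    last_seen_iteration: int   # most recent iteration that produced such an edit


def cluster_weight(cluster: EditCluster, current_iteration: int,
                   decay: float = 0.85) -> float:
    """weight = count * decay^(current_iteration - last_seen_iteration)."""
    age = current_iteration - cluster.last_seen_iteration
    return cluster.count * decay ** age


def top_themes(clusters: List[EditCluster], current_iteration: int,
               k: int = 3, decay: float = 0.85) -> List[str]:
    """Return the k highest-weighted themes, heaviest first."""
    ranked = sorted(clusters,
                    key=lambda c: cluster_weight(c, current_iteration, decay),
                    reverse=True)
    return [c.theme for c in ranked[:k]]


# Assumed toy history at iteration 8.
clusters = [
    EditCluster("add emoji interpretation", count=1, last_seen_iteration=3),
    EditCluster("specify output format", count=4, last_seen_iteration=8),
    EditCluster("add sarcasm handling", count=2, last_seen_iteration=7),
]
themes = top_themes(clusters, current_iteration=8, k=2)
```

In this toy history the theme seen once at iteration 3 has weight 0.85^5 (about 0.44), well below the recurring themes, so it drops out of the candidate edits.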
Evaluate candidates on a fresh minibatch. Score each candidate prompt against a sample of test cases (the paper uses 256 samples per eval round). Rank by accuracy.
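A minimal sketch of minibatch scoring and ranking, assuming a caller-supplied `score_fn(prompt, case)` that returns True when the prompt handles the case correctly. Both `rank_candidates` and `score_fn` are hypothetical names; uniform random sampling is an assumption and does not reproduce the reference implementation's UCB bandit evaluation.

```python
import random
from typing import Callable, List, Optional, Sequence, Tuple, TypeVar

Case = TypeVar("Case")


def rank_candidates(candidates: Sequence[str], cases: Sequence[Case],
                    score_fn: Callable[[str, Case], bool],
                    batch_size: int = 256,
                    seed: Optional[int] = None) -> List[Tuple[float, str]]:
    """Score every candidate on one shared random minibatch; best first."""
    if not cases:
        raise ValueError("need at least one test case")
    rng = random.Random(seed)
    # Draw one fresh minibatch and reuse it for all candidates so scores are comparable.
    batch = rng.sample(list(cases), min(batch_size, len(cases)))
    scored = [(sum(score_fn(p, c) for c in batch) / len(batch), p) for p in candidates]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return scored


# Toy stand-in: a "prompt" passes an integer case when it is at least that long.
ranking = rank_candidates(["a", "bbb"], [1, 2], lambda p, c: len(p) >= c, seed=0)
```

Scoring all candidates on the same minibatch keeps the comparison fair within a round, while drawing a new batch each round reduces overfitting to any fixed subset.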
Select and iterate. Keep the top-performing candidates (beam search). Return to the BACS classification step with the new best prompt. Repeat for 10-20 rounds or until accuracy plateaus (less than 0.5% improvement over 3 consecutive rounds).
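Beam selection and the plateau check can be sketched as below. Two interpretations are assumed here: "0.5%" is read as 0.005 in absolute accuracy, and a plateau means each of the last 3 rounds improved by less than that amount. `select_beam` and `has_plateaued` are hypothetical names.

```python
from typing import List, Sequence, Tuple


def select_beam(scored: Sequence[Tuple[float, str]], beam_size: int = 4) -> List[str]:
    """Keep the prompts with the highest scores."""
    ranked = sorted(scored, key=lambda pair: pair[0], reverse=True)
    return [prompt for _, prompt in ranked[:beam_size]]


def has_plateaued(history: Sequence[float], min_gain: float = 0.005,
                  patience: int = 3) -> bool:
    """True when each of the last `patience` rounds gained less than `min_gain`.

    `history` holds the best accuracy per round, with the baseline at index 0.
    """
    if len(history) <= patience:
        return False  # not enough rounds to judge
    recent = history[-(patience + 1):]
    return all(later - earlier < min_gain
               for earlier, later in zip(recent, recent[1:]))


# Assumed toy data.
beam = select_beam([(0.70, "v1"), (0.82, "v2"), (0.78, "v3")], beam_size=2)
still_improving = has_plateaued([0.72, 0.78, 0.80, 0.84])
flat = has_plateaued([0.72, 0.78, 0.80, 0.801, 0.802, 0.803])
```

The outer loop would stop when `has_plateaued` returns True or when the round budget (10-20 rounds) runs out, whichever comes first.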
Final validation. Run the best prompt against the full held-out test set. Compare against baseline. Report the improvement and the specific changes made.
Document the evolution trace. Output a summary showing: baseline accuracy, per-round accuracy, the key edit themes that persisted (high momentum), and the final optimized prompt.
Example 1: Optimizing a sentiment classification prompt
User: "I have a prompt for classifying customer reviews as positive/negative/neutral but it's only 72% accurate on my test set. Help me optimize it."
Approach:
Output after 8 rounds:
Baseline accuracy: 72%
Final accuracy: 84%
Key persistent edits (high momentum):
1. Added explicit neutral classification criteria (+6% lift)
2. Added sarcasm detection instruction (+4% lift)
3. Specified tie-breaking rule for mixed sentiment (+2% lift)
Decayed (noise) suggestions filtered out:
- "Add emoji interpretation" (appeared once, iteration 3)
- "Use chain-of-thought" (appeared once, iteration 5, no improvement)
Example 2: Improving a code review prompt for a smaller model
User: "My code review prompt works with GPT-4 but degrades badly with a 7B model. Can you optimize it for the smaller model?"
Approach:
Output:
Baseline (7B model): 45%
Optimized (7B model): 71%
Optimized prompt changes:
- Added structured output template (issue/location/fix format)
- Added "only cite issues present in the provided code" instruction
- Added "review for one category at a time" sequential approach
- Removed abstract instructions the 7B model couldn't follow
Example 3: Evolving a data extraction prompt with contradictory failures
User: "My prompt for extracting dates from legal documents keeps oscillating -- when I fix US date formats, European formats break."
Approach:
Output after 12 rounds:
Baseline: 68%
After naive iteration (no momentum): 70% (oscillating)
After C-MOP optimization: 82%
Key insight surfaced by momentum: Format disambiguation requires
jurisdiction detection as a prerequisite step, not a per-case rule.
Paper: C-MOP: Integrating Momentum and Boundary-Aware Clustering for Enhanced Prompt Evolution (Yan et al., 2026). Look for: Algorithm 1 (BACS tripartite sampling), Algorithm 2 (MGSC momentum update with temporal decay), and Table 2 (benchmark results showing 3B model surpassing 70B via optimized prompts).
Code: github.com/huawei-noah/noah-research/tree/master/C-MOP -- reference implementation with configurable rounds (default 20), beam_size (default 4), and UCB bandit evaluation.