skills/consistency-meets-verification-enhancing/SKILL.md
Generate high-reliability test suites without ground-truth implementations using the ConVerTest pipeline: Self-Consistency voting, Chain-of-Verification refinement, and Dual Execution Agreement. Use when asked to 'generate tests for this spec', 'write tests before implementation', 'create a test suite without reference code', 'test-driven development for this feature', 'generate reliable unit tests', or 'validate tests without a working implementation'.
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
npx skillsauth add ndpvt-web/arxiv-claude-skills consistency-meets-verification-enhancing
This skill enables Claude to generate reliable, high-coverage test suites for functions or modules before a reference implementation exists. It applies the ConVerTest pipeline from Taherkhani et al. (2026), which combines three strategies — Self-Consistency majority voting, Chain-of-Verification iterative refinement, and Dual Execution Agreement — to produce tests that are valid, comprehensive, and free of hallucinated assertions. The technique improves test validity by up to 39%, line coverage by 28%, and mutation detection by 18% over naive LLM test generation.
The core problem: When an LLM generates tests without a reference implementation, it frequently hallucinates expected values in assertions. A single-shot generation has no way to verify whether assertEqual(result, 42) is correct. ConVerTest solves this by generating many candidates and using statistical agreement as a proxy for correctness.
The pipeline has two stages. Stage 1 runs two parallel tracks: (a) Self-Consistency (SC) generates M diverse test stubs (setup + inputs + function calls, no assertions), then for each stub prompts the LLM N times to complete the assertions. The completions are parsed into ASTs, grouped by functionally identical logic, and the most frequent assertion set wins via majority vote. (b) Chain-of-Verification (CoVe) generates Z candidate code implementations, each refined through an iterative loop: generate a baseline solution, formulate verification questions (edge cases, constraint adherence, error handling), answer those questions against the code, and regenerate if issues are found.
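The voting step in track (a) can be sketched as follows. This is a minimal illustration, not the paper's implementation: the names `normalize_assertions` and `vote` are hypothetical, the sketch only recognizes method-style `assert*` calls (such as `self.assertEqual`), it treats assertion order as irrelevant, and it counts unparseable completions in the denominator so they dilute agreement rather than vanish.

```python
import ast
from collections import Counter


def normalize_assertions(completion: str) -> tuple[str, ...]:
    """Return a canonical, order-independent form of the assert* calls.

    ast.dump omits line/column attributes by default, so formatting
    differences such as `[1, 2]` vs `[1,2]` normalize to the same string.
    """
    tree = ast.parse(completion)
    dumped = []
    for node in ast.walk(tree):
        if (
            isinstance(node, ast.Call)
            and isinstance(node.func, ast.Attribute)
            and node.func.attr.startswith("assert")
        ):
            dumped.append(ast.dump(node))
    return tuple(sorted(dumped))


def vote(completions: list[str], threshold: float = 0.5):
    """Return (winning_completion_or_None, agreement_ratio).

    The winner is None when no assertion set exceeds `threshold`.
    """
    groups: Counter = Counter()
    first_seen: dict[tuple[str, ...], str] = {}
    for text in completions:
        try:
            key = normalize_assertions(text)
        except SyntaxError:
            continue  # assumption: unparseable completions cast no vote
        groups[key] += 1
        first_seen.setdefault(key, text)
    if not groups:
        return None, 0.0
    key, count = groups.most_common(1)[0]
    agreement = count / len(completions)
    return (first_seen[key] if agreement > threshold else None), agreement


completions = [
    "self.assertEqual(result, [1, 2])",
    "self.assertEqual(result, [1,2])",   # same AST as the first
    "self.assertEqual(result, [2, 1])",
    "self.assertEqual(result, [1, 2])",
    "self.assertEqual(result, [])",
]
winner, agreement = vote(completions)
print(winner, agreement)  # self.assertEqual(result, [1, 2]) 0.6
```

Grouping on the dumped AST rather than the raw text is what lets cosmetically different completions pool their votes.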
Stage 2 performs Dual Execution Agreement. Every (code candidate, test case) pair is executed, producing a pass/fail matrix. Code candidates are clustered into agreement sets — groups that produce identical pass/fail patterns across all tests. Each set is scored by (tests_passed) * sqrt(set_size), rewarding both robustness (many tests passed) and consensus (many implementations agree). The highest-scoring set's representative solution becomes the reference, and tests are labeled valid if they pass against it. This eliminates tests with hallucinated expectations and implementations with subtle bugs in one step.
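A small sketch of the clustering and scoring described above, assuming the pass/fail matrix is already available as a dictionary. The function name `select_consensus`, the implementation labels, and the tie-breaking rule (first set encountered wins) are illustrative assumptions. The sample numbers mirror the merge_intervals example later in this document.

```python
import math
from collections import defaultdict


def select_consensus(matrix: dict[str, list[bool]]):
    """Cluster implementations by identical pass/fail vectors and score each set.

    `matrix` maps an implementation label to its outcomes over the same
    ordered list of tests. Returns (representative, valid_test_indices, score).
    """
    agreement_sets: dict[tuple[bool, ...], list[str]] = defaultdict(list)
    for name, outcomes in matrix.items():
        agreement_sets[tuple(outcomes)].append(name)

    def score(item) -> float:
        pattern, members = item
        # tests_passed * sqrt(set_size)
        return sum(pattern) * math.sqrt(len(members))

    best = max(agreement_sets.items(), key=score)
    pattern, members = best
    valid_tests = [i for i, passed in enumerate(pattern) if passed]
    return members[0], valid_tests, score(best)


# Set A: three implementations passing 6 of 7 tests.
# Set B: one implementation passing 7 of 7 tests.
matrix = {
    "impl_1": [True] * 6 + [False],
    "impl_2": [True] * 6 + [False],
    "impl_3": [True] * 6 + [False],
    "impl_4": [True] * 7,
}
representative, valid_tests, best_score = select_consensus(matrix)
print(representative, valid_tests, round(best_score, 1))
# impl_1 [0, 1, 2, 3, 4, 5] 10.4
```

The square root damps the effect of set size, so a large set of agreeing implementations does not automatically outrank a smaller set that passes many more tests.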
1. Extract the specification. Parse the user's input to isolate the function signature, docstring, parameter types, return type, and any behavioral constraints. If only a natural-language description exists, formalize it into a clear spec with input/output contracts.
2. Generate M diverse test stubs (Self-Consistency, phase 1). Produce M (aim for 5-8) test skeletons that each set up a different input scenario: typical cases, boundary values, error conditions, empty inputs, large inputs. Each stub includes setUp, input variables, and the function call, but no assertions yet. Vary the scenarios deliberately to maximize coverage diversity.
3. Complete each stub N times via majority voting (Self-Consistency, phase 2). For each of the M stubs, generate N (aim for 5-7) independent completions that add assertions. Parse each completion's assertions, group functionally identical ones (same expected values and assertion types), and select the assertion set with the highest vote count. Discard stubs where no assertion set receives a clear majority (>50% agreement).
4. Generate Z candidate implementations via CoVe. Produce Z (aim for 3-5) independent code implementations of the spec. For each candidate, run a verification loop: (a) generate the baseline code, (b) formulate 4-6 verification questions covering correctness, edge cases, constraint adherence, and error handling, (c) answer each question against the code, (d) if any answer reveals an issue, regenerate with the verification context appended to the prompt.
5. Build the execution matrix. Execute every (implementation, test) pair and record pass/fail for each cell. This produces a matrix with Z rows and one column per test that survived voting (at most M).
6. Cluster into agreement sets. Group implementations that have identical pass/fail vectors across all tests. Two implementations are in the same set only if they agree on every single test outcome.
7. Score and select the consensus set. Score each agreement set as tests_passed * sqrt(set_size). Select the highest-scoring set. Pick any member of that set as the representative implementation.
8. Filter tests by validity. Label each test as valid (passes against the representative) or invalid (fails). Discard invalid tests, since these likely contain hallucinated assertions.
9. Output the final test suite. Return the validated tests as a clean, runnable test file. Include a brief comment on each test noting the input scenario it covers. Optionally return the representative implementation as a reference.
10. Report coverage and confidence. Summarize how many stubs were generated, how many survived voting, the agreement set size, and the consensus score. Flag any tests that won with a weak majority (<60% agreement) as lower-confidence.
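The "Build the execution matrix" and "Filter tests by validity" steps above might look like the sketch below. It assumes each test is a plain callable that receives the implementation under test and raises on failure. The names `run_pair`, `build_matrix`, and `filter_valid`, and the toy `abs` example, are hypothetical. Running generated code in-process like this is only suitable for illustration; a real setup would isolate each run in a sandbox with a timeout.

```python
from typing import Callable

Impl = Callable[..., object]
Test = Callable[[Impl], None]  # raises (e.g. AssertionError) on failure


def run_pair(impl: Impl, test: Test) -> bool:
    """Execute one (implementation, test) pair; any exception counts as a fail."""
    try:
        test(impl)
        return True
    except Exception:
        return False


def build_matrix(impls: dict[str, Impl], tests: dict[str, Test]) -> dict[str, list[bool]]:
    """One row per implementation, one column per test (in dict order)."""
    return {
        name: [run_pair(impl, test) for test in tests.values()]
        for name, impl in impls.items()
    }


def filter_valid(representative: Impl, tests: dict[str, Test]) -> list[str]:
    """Keep only the tests that pass against the representative implementation."""
    return [name for name, test in tests.items() if run_pair(representative, test)]


# Toy spec: "return the absolute value of x".
def check_positive(f):
    assert f(3) == 3


def check_negative(f):
    assert f(-3) == 3


def check_hallucinated(f):
    assert f(0) == 1  # wrong expectation, as a hallucinated assertion would be


impls = {"good": abs, "buggy": lambda x: x}
tests = {
    "check_positive": check_positive,
    "check_negative": check_negative,
    "check_hallucinated": check_hallucinated,
}
print(build_matrix(impls, tests))
# {'good': [True, True, False], 'buggy': [True, False, False]}
print(filter_valid(abs, tests))
# ['check_positive', 'check_negative']
```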
Example 1: TDD for a string utility
User: "Write tests for a function anagram_combos(sub_words: list[str], whole_word: str) -> list[str] that returns all permutations of sub_words whose concatenation equals whole_word."
Approach:
[...] anagram_combos(["ab", "cd"], "abcd") returns ["abcd"]; select this.
Output:
import unittest

from solution import anagram_combos


class TestAnagramCombosBasic(unittest.TestCase):
    def test_single_valid_permutation(self):
        result = anagram_combos(["ab", "cd"], "abcd")
        self.assertEqual(sorted(result), ["abcd"])

    def test_multiple_valid_permutations(self):
        result = anagram_combos(["ab", "cd", "ef"], "abcdef")
        self.assertIn("abcdef", result)
        self.assertTrue(all(len(p) == 6 for p in result))

    def test_no_valid_permutation(self):
        result = anagram_combos(["ab", "cd"], "abdc")
        self.assertEqual(result, [])

    def test_single_element_list(self):
        result = anagram_combos(["hello"], "hello")
        self.assertEqual(result, ["hello"])

    def test_empty_list(self):
        result = anagram_combos([], "abc")
        self.assertEqual(result, [])

    def test_duplicate_sub_words(self):
        result = anagram_combos(["ab", "ab"], "abab")
        self.assertEqual(len(result), 1)  # only one unique concatenation
Example 2: Generating tests for a data processing function from a docstring
User: "Generate a test suite for this spec: def merge_intervals(intervals: list[tuple[int, int]]) -> list[tuple[int, int]] — merges overlapping intervals, returns sorted non-overlapping intervals."
Approach:
[...] (start, end) tuples, return sorted merged result.
[...] (1,3) and (3,5)), nested intervals.
[...] (1,3) and (3,5) merge into (1,5): marginal majority. Flag as lower-confidence. All other stubs show 4/5+ agreement.
[...] 6 * sqrt(3) = 10.4. Set B (1 implementation, passes 7/7) scores 7 * sqrt(1) = 7.0. Set A wins.
Output:
import unittest

from solution import merge_intervals


class TestMergeIntervals(unittest.TestCase):
    def test_no_overlap(self):
        self.assertEqual(merge_intervals([(1, 2), (5, 7)]), [(1, 2), (5, 7)])

    def test_partial_overlap(self):
        self.assertEqual(merge_intervals([(1, 4), (3, 6)]), [(1, 6)])

    def test_full_overlap(self):
        self.assertEqual(merge_intervals([(1, 10), (3, 5)]), [(1, 10)])

    def test_single_interval(self):
        self.assertEqual(merge_intervals([(2, 5)]), [(2, 5)])

    def test_empty_list(self):
        self.assertEqual(merge_intervals([]), [])

    def test_nested_intervals(self):
        self.assertEqual(
            merge_intervals([(1, 10), (2, 3), (4, 8)]), [(1, 10)]
        )

    # NOTE: Adjacent interval behavior (e.g., (1,3) + (3,5)) had weak
    # consensus (3/5 votes). The spec should clarify whether touching
    # endpoints merge. This test was excluded from the validated suite.
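For readers who want something to run the suite against, here is one possible shape for a representative implementation. It is a hypothetical sketch, not the implementation the pipeline selected. The `merge_touching` flag is an added assumption that makes the unresolved adjacency question explicit, and its default of `False` is an arbitrary choice rather than something the spec states. All six validated tests above pass with either setting.

```python
def merge_intervals(
    intervals: list[tuple[int, int]],
    merge_touching: bool = False,  # hypothetical flag, not part of the original spec
) -> list[tuple[int, int]]:
    """Merge overlapping intervals and return them sorted by start."""
    merged: list[tuple[int, int]] = []
    for start, end in sorted(intervals):
        if merged:
            last_start, last_end = merged[-1]
            # Strict comparison leaves (1, 3) and (3, 5) separate;
            # the inclusive one merges them into (1, 5).
            overlaps = start <= last_end if merge_touching else start < last_end
            if overlaps:
                merged[-1] = (last_start, max(last_end, end))
                continue
        merged.append((start, end))
    return merged


print(merge_intervals([(1, 3), (3, 5)]))                       # [(1, 3), (3, 5)]
print(merge_intervals([(1, 3), (3, 5)], merge_touching=True))  # [(1, 5)]
```

Once the spec settles the touching-endpoints question, the flag can be removed and the excluded adjacency test restored with the agreed expectation.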
Example 3: Hardening an existing test suite
User: "I have these 4 tests for calculate_discount(price, tier) but I'm not sure the expected values are right. Can you verify them?"
Approach:
[...] calculate_discount.
[...] assertEqual(result, [1, 2]) and assertEqual(result, [1,2]) are functionally identical.
Taherkhani, H., DaghighFarsoodeh, A., Chowdhury, M., Pham, H. V., & Hemmati, H. (2026). Consistency Meets Verification: Enhancing Test Generation Quality in Large Language Models Without Ground-Truth Solutions. arXiv:2602.10522v1. https://arxiv.org/abs/2602.10522v1
Key sections: Section 3 for the full pipeline algorithm, Section 4 for the dual execution agreement scoring formula score = tests_passed * sqrt(set_size), and Section 5 ablations showing each component's individual contribution.