skills/evolve-evolutionary-search-llm-based/SKILL.md
Evolutionary search framework for LLM-driven Verilog/RTL generation and PPA optimization. Uses MCTS for functional correctness and Idea-Guided Refinement for optimization, with structured testbench generation for rapid feedback. Triggers: 'generate Verilog module', 'optimize RTL design', 'evolutionary search for hardware', 'MCTS Verilog generation', 'reduce PPA for circuit', 'fix Verilog functional correctness'.
Install this skill globally with one command; it works with Claude Code, Cursor, and Windsurf:

`npx skillsauth add ndpvt-web/arxiv-claude-skills evolve-evolutionary-search-llm-based`
This skill enables Claude to apply the EvolVE framework — a dual-strategy evolutionary search system — to generate functionally correct Verilog/RTL code and optimize it for Power, Performance, and Area (PPA). Instead of single-shot code generation, Claude iteratively refines hardware designs using Monte Carlo Tree Search (MCTS) when chasing functional correctness, and Idea-Guided Refinement (IGR) when optimizing an already-correct design. A Structured Testbench Generation (STG) phase provides continuous scoring feedback to accelerate convergence.
Dual-strategy evolutionary search. EvolVE recognizes that functional correctness and PPA optimization are fundamentally different objectives requiring different search strategies. For correctness, MCTS explores a tree of code variants where each node N = (V, S, F) holds Verilog code V, a continuous correctness score S (fraction of testbench vectors passed), and diagnostic feedback F. The UCT formula balances exploitation of high-scoring nodes against exploration of under-visited branches, with expansion constant c = 1.4 and 3 child nodes per expansion. For optimization, Idea-Guided Refinement (IGR) first generates k high-level architectural ideas (e.g., "convert output-stationary to weight-output-stationary dataflow", "share multipliers across pipeline stages"), then each idea spawns a chain of m differential refinement steps. The best node across all k x m candidates wins.
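The standard UCT rule (UCB1 applied to trees) is shown below for reference. Whether EvolVE uses exactly this form is an assumption. Here $\bar{S}_j$ is the mean score backpropagated through child $j$, $n_j$ is that child's visit count, and $n_p$ is its parent's visit count:

$$
\mathrm{UCT}(j) = \bar{S}_j + c\,\sqrt{\frac{\ln n_p}{n_j}}, \qquad c = 1.4
$$

Take a child visited once under a parent visited 10 times. With c = 1.4, it receives an exploration bonus of $1.4\sqrt{\ln 10} \approx 2.12$. That bonus exceeds the whole [0, 1] range of the score term, which is why rarely visited branches still get selected.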
Structured Testbench Generation (STG) eliminates the bottleneck of hand-written testbenches. STG parses the DUT port interface with regex to classify signals into Clock/Reset, Control, and Datapath groups. Control signals with width w <= 8 get exhaustive 2^w state coverage; wider signals use constrained-random sampling. Datapath inputs are seeded with corner cases (zero, max, alternating bits). The output is a continuous score P_stg in [0,1] — the pass rate across all test vectors — which provides much richer gradient signal than binary pass/fail. This scoring function drives both MCTS backpropagation and IGR candidate ranking.
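Here is a minimal Python sketch of the regex-based port parse and the three-way classification. It assumes ANSI-style port lists with one name per declaration and simple parameter expressions in ranges. The name tokens used as clock/reset and control hints are hypothetical heuristics, not EvolVE's published rules.

```python
from __future__ import annotations

import re
from dataclasses import dataclass

# One ANSI-style port declaration, e.g. "input wire [WIDTH-1:0] din".
# Assumption: one name per declaration ("input a, b" lists are not handled).
PORT_RE = re.compile(
    r"\b(input|output|inout)\s+(?:(?:wire|reg|logic)\b\s*)?(?:signed\b\s*)?"
    r"(?:\[([^:\]]+):([^\]]+)\]\s*)?(\w+)"
)

# Hypothetical naming heuristics; tune them to the project's conventions.
CLOCK_RESET_TOKENS = {"clk", "clock", "rst", "reset", "rstn", "resetn"}
CONTROL_TOKENS = {"en", "enable", "valid", "ready", "start", "sel", "mode", "op", "load", "we", "re"}


@dataclass
class Port:
    direction: str
    name: str
    width: int
    group: str  # "clock_reset", "control", or "datapath"


def classify(name: str, width: int) -> str:
    tokens = set(name.lower().split("_"))
    if tokens & CLOCK_RESET_TOKENS:
        return "clock_reset"
    # Named control tokens, or any remaining 1-bit signal, count as control.
    if tokens & CONTROL_TOKENS or width == 1:
        return "control"
    return "datapath"


def parse_ports(verilog: str, params: dict[str, int] | None = None) -> list[Port]:
    env = dict(params or {})
    ports = []
    for direction, msb, lsb, name in PORT_RE.findall(verilog):
        width = 1
        if msb:
            # Evaluate simple range expressions such as "WIDTH-1" against the
            # given parameters (sketch only: eval is unsafe for untrusted input).
            hi = eval(msb.strip(), {"__builtins__": {}}, env)
            lo = eval(lsb.strip(), {"__builtins__": {}}, env)
            width = abs(hi - lo) + 1
        ports.append(Port(direction, name, width, classify(name, width)))
    return ports
```

Only input ports need stimulus. Outputs are classified too, so a testbench generator can choose which ones to check.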
Scoring separates the two phases.

- **Generation (correctness):** S_gen = p_i / |T|, where p_i is the number of test vectors candidate i passes and |T| is the total number of vectors.
- **Optimization:** S_opt = -A * L / eta, the negative area-latency product (A = area, L = latency) normalized by a constant eta. This score applies only if all tests pass. Otherwise a heavy penalty C_penalty = -10^5 forces the search back toward correctness.

This constraint ensures optimization never sacrifices functionality.
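The two scoring rules translate directly into Python, sketched below. The text does not specify the normalization constant eta, so the default of 1.0 here is an assumption.

```python
C_PENALTY = -10**5  # heavy penalty for any optimization candidate that fails a test


def s_gen(passed: int, total: int) -> float:
    """Generation score S_gen = p_i / |T|: fraction of testbench vectors passed."""
    if total <= 0:
        raise ValueError("testbench has no vectors")
    return passed / total


def s_opt(passed: int, total: int, area: float, latency: float, eta: float = 1.0) -> float:
    """Optimization score S_opt = -A * L / eta, gated on full correctness."""
    if passed < total:
        return C_PENALTY
    return -area * latency / eta
```

Normalization matters. With eta = 1.0, a large design's raw A·L product can exceed 10^5, and the penalty would then no longer dominate. Dividing by, for example, the baseline design's A·L (an assumption, not a documented choice) keeps passing scores on the order of -1, far above C_penalty.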
Parse the hardware specification. Extract the module name, port interface (inputs, outputs, widths), behavioral description, and any constraints (target clock period, area budget, latency requirement). If the spec is ambiguous, ask the user to clarify before proceeding.
Generate Structured Testbench (STG). Classify each port signal as Clock/Reset, Control, or Datapath using its name and width.

- **Control signals of width <= 8 bits:** enumerate all 2^w input combinations.
- **Wider control signals:** use constrained-random samples.
- **Datapath signals:** generate corner-case vectors plus 20-50 constrained-random vectors. The corner cases are all zeros, all ones (which is also the maximum unsigned value), and the alternating patterns 0xAA.../0x55....

Write a self-checking testbench that computes a continuous pass rate.
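The vector-generation rules in this step might look like the Python sketch below. It makes three assumptions:

- Inputs have already been grouped, for example by a regex port parser.
- Uniform random values stand in for true constrained-random sampling.
- The function names are illustrative.

```python
from __future__ import annotations

import itertools
import random


def corner_cases(width: int) -> list[int]:
    """All zeros, all ones (the max unsigned value), then 0xAA.../0x55... patterns."""
    ones = (1 << width) - 1
    alt_hi = sum(1 << i for i in range(1, width, 2))  # ...1010 (0xAA for 8 bits)
    alt_lo = sum(1 << i for i in range(0, width, 2))  # ...0101 (0x55 for 8 bits)
    return list(dict.fromkeys([0, ones, alt_hi, alt_lo]))  # dedupe, keep order


def stimulus(inputs: dict[str, tuple[str, int]], n_random: int = 20, seed: int = 0) -> list[dict[str, int]]:
    """inputs maps name -> (group, width). Clock/reset are driven by the testbench itself."""
    rng = random.Random(seed)

    def values(group: str, width: int) -> list[int]:
        if group == "control":
            if width <= 8:
                return list(range(1 << width))  # exhaustive 2^w coverage
            # Uniform random stands in for constrained-random sampling here.
            return [rng.getrandbits(width) for _ in range(n_random)]
        return corner_cases(width) + [rng.getrandbits(width) for _ in range(n_random)]

    ctrl = {n: values(g, w) for n, (g, w) in inputs.items() if g == "control"}
    data = {n: values(g, w) for n, (g, w) in inputs.items() if g == "datapath"}
    # Datapath signals advance together (row i uses pattern i of every signal;
    # zip stops at the shortest list), avoiding a cross-product explosion.
    data_rows = [dict(zip(data, row)) for row in zip(*data.values())] or [{}]
    vectors = []
    for combo in itertools.product(*ctrl.values()):  # joint control states
        for row in data_rows:
            vectors.append({**dict(zip(ctrl, combo)), **row})
    return vectors
```

For the FIFO in Example 1, `{wr_en, rd_en}` yields the 4 joint control states. Each state is paired with every `din` pattern.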
Produce an initial Verilog implementation. Generate a straightforward, correct-first design from the spec. Prioritize clarity and correctness over optimization. Use synchronous design patterns with explicit reset logic.
Evaluate via simulation. Run the design against the STG testbench using iverilog and vvp. Compute the continuous correctness score (pass rate). Capture any compiler errors or simulation mismatches as diagnostic feedback F.
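A small Python harness for this step is sketched below. It assumes `iverilog` and `vvp` are on PATH. It also assumes the testbench prints a summary line in a made-up `STG PASSED <p> OF <T>` format; that format is hypothetical and is not part of STG as described.

```python
from __future__ import annotations

import re
import subprocess
from pathlib import Path

# Hypothetical testbench convention: the STG testbench prints one summary line
# like "STG PASSED 37 OF 40" before $finish. Adapt the regex to your testbench.
SUMMARY_RE = re.compile(r"STG PASSED (\d+) OF (\d+)")


def parse_score(log: str) -> tuple[float, str]:
    """Return (pass rate, diagnostic feedback F). No summary line means score 0."""
    m = SUMMARY_RE.search(log)
    if m is None:
        return 0.0, log
    passed, total = int(m.group(1)), int(m.group(2))
    return (passed / total if total else 0.0), log


def simulate(dut: Path, tb: Path, workdir: Path, timeout_s: float = 60.0) -> tuple[float, str]:
    """Compile with iverilog, run with vvp, and score the log."""
    exe = workdir / "sim.vvp"
    try:
        comp = subprocess.run(
            ["iverilog", "-g2012", "-o", str(exe), str(tb), str(dut)],
            capture_output=True, text=True, timeout=timeout_s,
        )
        if comp.returncode != 0:
            return 0.0, comp.stderr  # compiler errors become feedback F
        run = subprocess.run(["vvp", str(exe)], capture_output=True, text=True, timeout=timeout_s)
    except subprocess.TimeoutExpired:
        return 0.0, f"simulation exceeded {timeout_s}s (missing $finish or deadlock?)"
    return parse_score(run.stdout + run.stderr)
```

The full log is returned as feedback so that mismatch messages and compiler errors can be passed verbatim into the next expansion.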
If correctness < 100%, apply MCTS-style search. Treat the current code as a tree node. For each iteration: (a) select the most promising node via UCT balancing, (b) generate a child variant by applying targeted fixes guided by the diagnostic feedback, (c) simulate the child, (d) backpropagate the score to all ancestors. Repeat for up to N iterations (start with 10-20; the paper shows monotonic improvement up to 300 nodes). Use differential edits — fix specific bugs rather than regenerating the entire file.
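The loop can be sketched in Python as follows. `evaluate` and `propose_fix` are hypothetical hooks:

- `evaluate` would wrap an iverilog/vvp run that returns (pass rate, feedback).
- `propose_fix` would wrap an LLM call that applies a targeted, differential fix.

Visit counts start at 1 so that the UCT term is always defined; that initialization is a choice of this sketch.

```python
from __future__ import annotations

import math
from dataclasses import dataclass, field
from typing import Callable, Optional


@dataclass(eq=False)
class Node:
    code: str                # V: Verilog source
    score: float             # S: pass rate in [0, 1]
    feedback: str            # F: compiler / simulation diagnostics
    parent: Optional[Node] = field(default=None, repr=False)
    children: list[Node] = field(default_factory=list)
    visits: int = 1
    total: float = 0.0       # sum of scores backpropagated through this node

    def uct(self, c: float) -> float:
        assert self.parent is not None
        exploit = self.total / self.visits
        explore = c * math.sqrt(math.log(self.parent.visits) / self.visits)
        return exploit + explore


def mcts_repair(
    code: str,
    evaluate: Callable[[str], tuple[float, str]],
    propose_fix: Callable[[str, str], str],
    iterations: int = 20,
    c: float = 1.4,
    branching: int = 3,
) -> Node:
    score, feedback = evaluate(code)
    root = Node(code, score, feedback, total=score)
    best = root
    for _ in range(iterations):
        if best.score >= 1.0:
            break
        # (a) Selection: descend by UCT until reaching an unexpanded node.
        node = root
        while node.children:
            node = max(node.children, key=lambda ch: ch.uct(c))
        for _ in range(branching):
            # (b) Expansion: a differential fix guided by the leaf's feedback.
            child_code = propose_fix(node.code, node.feedback)
            # (c) Simulation: score the child against the testbench.
            s, f = evaluate(child_code)
            child = Node(child_code, s, f, parent=node, total=s)
            node.children.append(child)
            if s > best.score:
                best = child
            # (d) Backpropagation: credit the child's score to every ancestor.
            anc: Optional[Node] = node
            while anc is not None:
                anc.visits += 1
                anc.total += s
                anc = anc.parent
    return best
```

Each iteration expands exactly one previously unexpanded leaf into `branching` children. The tree therefore has at most 1 + 3 × iterations nodes: 61 for 20 iterations, and roughly 100 iterations for the 300-node figure above.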
Once functionally correct, decide if optimization is needed. If the user requested PPA optimization or the design has obvious inefficiencies (e.g., combinational multipliers that could be pipelined, redundant registers), proceed to IGR. Otherwise, deliver the correct design.
Apply Idea-Guided Refinement (IGR) for optimization. Generate 3-5 high-level optimization ideas appropriate to the design type. Examples: "pipeline the datapath to reduce critical path", "share ALU resources across time-multiplexed operations", "convert to systolic array dataflow", "use Gray coding to reduce switching activity". For each idea, produce an implementation and then refine it through 3-4 differential edit steps.
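A compact sketch of the IGR loop follows. `propose_ideas`, `implement`, and `refine` are hypothetical LLM-backed hooks. `score` is assumed to return S_opt, which means C_penalty for any variant that fails a test.

Counting the idea's first implementation as step 0 gives exactly k × m candidates. The 3-4 refinement steps after the implementation would therefore correspond to m = 4 or 5.

```python
from __future__ import annotations

from typing import Callable, Iterable, Optional

C_PENALTY = -10**5


def idea_guided_refinement(
    base_code: str,
    propose_ideas: Callable[[str, int], Iterable[str]],
    implement: Callable[[str, str], str],
    refine: Callable[[str, str], str],
    score: Callable[[str], float],
    k: int = 3,
    m: int = 3,
) -> tuple[float, str, Optional[str]]:
    """Explore k ideas x m chain steps; return (best score, best code, winning idea)."""
    best_score, best_code, best_idea = score(base_code), base_code, None
    for idea in propose_ideas(base_code, k):
        anchor = base_code  # last variant in this chain known to pass every test
        for step in range(m):
            # Step 0 implements the idea; later steps are differential edits.
            cand = implement(base_code, idea) if step == 0 else refine(anchor, idea)
            s = score(cand)
            if s <= C_PENALTY:
                continue  # broken variant: keep refining from the last correct one
            anchor = cand
            if s > best_score:
                best_score, best_code, best_idea = s, cand, idea
    return best_score, best_code, best_idea
```

Keeping `anchor` at the last passing variant means a broken step never becomes the base for further edits. The incumbent design is returned if no idea beats it.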
Score optimization candidates. For each variant that passes all testbench vectors, compute the PPA proxy: area * latency. If synthesis tools are available (yosys), run synth -top <module>; stat to get cell count and estimated area. Rank candidates by the optimization score.
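For the synthesis step, a Python wrapper might look like the one below. It assumes `yosys` is on PATH. It also assumes the `Number of cells:` line that `stat` has traditionally printed; newer yosys releases may format this output differently. Cell count stands in for area, which is only a proxy without a technology library.

```python
from __future__ import annotations

import re
import subprocess

# Matches the classic yosys `stat` line, e.g. "   Number of cells:   1847".
# Assumption: newer yosys releases may format `stat` output differently.
CELLS_RE = re.compile(r"Number of cells:\s+(\d+)")


def parse_cell_count(log: str) -> int | None:
    counts = CELLS_RE.findall(log)
    return int(counts[-1]) if counts else None  # last report = final netlist


def yosys_cells(verilog_file: str, top: str, timeout_s: float = 600.0) -> int | None:
    script = f"read_verilog {verilog_file}; synth -top {top}; stat"
    res = subprocess.run(["yosys", "-p", script], capture_output=True, text=True, timeout=timeout_s)
    if res.returncode != 0:
        raise RuntimeError(res.stdout[-2000:] + res.stderr)  # e.g. unsupported constructs
    return parse_cell_count(res.stdout)


def rank(candidates: list[tuple[str, int, int, bool]]) -> list[tuple[int, str]]:
    """(name, cells, latency, passes_all) -> [(cells*latency, name)], best first.

    Cell count is used as an area proxy; variants failing any test are dropped.
    """
    scored = [(cells * lat, name) for name, cells, lat, ok in candidates if ok]
    return sorted(scored)
```

Dropping failing variants before ranking has the same effect as the C_penalty gate: they can never outrank a correct design.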
Select and deliver the best design. Present the winning variant to the user with a summary of: (a) correctness status, (b) architectural approach chosen, (c) PPA metrics if available, (d) trade-offs compared to alternatives explored.
Iterate if the user requests further refinement. The IGR phase can be re-entered with new ideas informed by synthesis feedback. Each round should target a specific metric (area, timing, power) rather than all at once.
Example 1: Generating a FIFO buffer from spec
User: "Write a synchronous FIFO with 8-entry depth, 32-bit data width, full/empty flags, and parameterizable depth."
Approach:
1. Spec: module `sync_fifo`, params `DEPTH=8`, `WIDTH=32`; ports `clk`, `rst_n`, `wr_en`, `rd_en`, `din[31:0]`, `dout[31:0]`, `full`, `empty`.
2. STG:
   - Enumerate `{wr_en, rd_en}` (4 states).
   - Seed `din` with `0x00000000`, `0xFFFFFFFF`, `0xAAAAAAAA`, `0x55555555`, plus random vectors.
   - Test sequences: fill to full, read to empty, simultaneous read/write, overflow/underflow edge cases.
3. Diagnostic feedback: `empty` asserted one cycle early during simultaneous R/W. Generate a child node fixing the count update priority.
4. Re-simulate: score 1.0.

Output:
```verilog
// Synchronous FIFO. DEPTH must be a power of two (>= 2): pointers wrap
// modulo 2^(PTR_W+1), and the extra MSB distinguishes full from empty.
module sync_fifo #(
    parameter DEPTH = 8,
    parameter WIDTH = 32
)(
    input  wire             clk,
    input  wire             rst_n,   // active-low asynchronous reset
    input  wire             wr_en,
    input  wire             rd_en,
    input  wire [WIDTH-1:0] din,
    output reg  [WIDTH-1:0] dout,
    output wire             full,
    output wire             empty
);
    localparam PTR_W = $clog2(DEPTH);

    reg [WIDTH-1:0] mem [0:DEPTH-1];
    // One extra pointer bit lets the difference range over 0..DEPTH.
    reg  [PTR_W:0] wr_ptr, rd_ptr;
    wire [PTR_W:0] count = wr_ptr - rd_ptr;

    assign full  = (count == DEPTH);
    assign empty = (count == 0);

    always @(posedge clk or negedge rst_n) begin
        if (!rst_n) begin
            wr_ptr <= 0;
            rd_ptr <= 0;
        end else begin
            // Write when not full and read when not empty; both may occur in one cycle.
            if (wr_en && !full) begin
                mem[wr_ptr[PTR_W-1:0]] <= din;
                wr_ptr <= wr_ptr + 1;
            end
            if (rd_en && !empty) begin
                dout <= mem[rd_ptr[PTR_W-1:0]];
                rd_ptr <= rd_ptr + 1;
            end
        end
    end
endmodule
```
Example 2: Optimizing a combinational multiplier for area
User: "I have a working 16x16 unsigned multiplier but it uses too much area. Optimize it."
Approach:
1. Existing design: combinational `assign product = a * b;` style.
2. STG: the cross-product of `{0x0000, 0xFFFF, 0x0001, 0x8000, 0x5555, 0xAAAA}` plus random pairs. Verify that the existing design passes 100%.
3. With `yosys` available, synthesize each variant: `read_verilog mult.v; synth; stat`.

| Variant | Cells | Latency (cycles) | Area × Latency |
|---|---|---|---|
| Original (combinational) | 1847 | 1 | 1847 |
| Shift-and-add (16 cycles) | 312 | 16 | 4992 |
| Booth radix-4 | 923 | 1 | 923 |
| Pipelined array (4 stages) | 1105 | 4 (throughput: 1 result/cycle) | 1105\* |

\* Computed with the pipeline's throughput of one result per cycle instead of its 4-cycle latency.
Recommendation: Booth radix-4 cuts area by 50% with no latency penalty.
Example 3: Debugging a failing FSM
User: "My UART transmitter FSM fails simulation — the stop bit is never sent."
Approach:
1. STG on `tx_start` and `data_in[7:0]`: enumerate all FSM transitions with data corner cases.
2. Diagnostic feedback: `tx_out` stays low during the STOP state. The bit counter never reaches 8 because the reset condition uses `<` instead of `<=`.
3. First child: change `bit_cnt < 8` to `bit_cnt <= 7` (equivalent, but clarifies intent). Re-simulate: score 0.75 (same issue, wrong fix).
4. Next expansion: the transition checks `bit_cnt == 8` while the counter only reaches 7. Fix the transition condition to `bit_cnt == 7`. Re-simulate: score 1.0.

The correctness penalty (C_penalty = -10^5) exists for a reason.

PPA numbers from yosys (even rough ones) are far more reliable than guessing from code structure alone.

| Problem | Cause | Resolution |
|---------|-------|------------|
| iverilog compilation error | Syntax error in generated Verilog | Feed the exact error message as diagnostic feedback F into the next MCTS expansion. Common fixes: missing end, wrong port widths, undeclared wires. |
| Testbench hangs (no $finish) | Design deadlocks or testbench timeout missing | Add `initial begin #(MAX_TIME) $finish; end` to the STG testbench. If the design deadlocks, check the FSM for states with no exit transition. |
| All MCTS children score 0 | Spec misunderstood or fundamentally wrong approach | Backtrack to root. Re-read the specification. Generate a completely different architectural approach rather than refining the broken one. |
| Optimization variant breaks correctness | Aggressive transformation introduced bug | Score falls to C_penalty. Discard this branch. The IGR chain should continue from the last correct variant, not the broken one. |
| yosys synthesis fails | Unsupported constructs (e.g., real, system tasks) | Remove simulation-only constructs from the synthesizable module. Keep testbench and DUT cleanly separated. |
| Scores plateau across iterations | Search is stuck in a local optimum | Inject new ideas into IGR (try a fundamentally different architecture). For MCTS, increase the exploration constant c to encourage visiting unexplored branches. |
Without yosys or a commercial synthesis flow, PPA estimates are heuristic at best. Cell counts from yosys do not map directly to silicon area without a technology library.

Paper: EvolVE: Evolutionary Search for LLM-based Verilog Generation and Optimization (Hsin et al., 2026). Look for Algorithm 1 (unified MCTS/IGR framework), the STG signal classification scheme (Section III-C), and the IC-RTL benchmark results (Table 5) showing up to 66% PPA reduction over human contest submissions.
Benchmark code: github.com/weiber2002/ICRTL — Six industry-scale RTL problems with testbenches, reference solutions, and open-source simulation scripts (iverilog + yosys).