skills/chipbench-next-step-benchmark-evaluating/SKILL.md
Evaluate and improve LLM-generated hardware designs using ChipBench methodology: structured Verilog generation with hierarchical decomposition, systematic RTL debugging across four bug categories, and cross-language reference model generation. Use when: 'generate Verilog for this module', 'debug this RTL code', 'create a Python reference model for this hardware design', 'verify my Verilog against a reference implementation', 'benchmark my chip design workflow', 'translate this hardware spec to SystemC'.
npx skillsauth add ndpvt-web/arxiv-claude-skills chipbench-next-step-benchmark-evaluating
Install this skill globally with one command. Works with Claude Code, Cursor, and Windsurf.
This skill enables Claude to apply the ChipBench evaluation methodology to real hardware design tasks. ChipBench (arXiv:2601.21448v2) exposes critical gaps in LLM chip design capabilities by evaluating across three axes: Verilog generation (with hierarchical complexity tiers), systematic RTL debugging (across four bug categories), and heterogeneous reference model generation (Python, SystemC, CXXRTL). The key insight is that structuring hardware design tasks by complexity tier and applying cross-language verification catches errors that single-language testing misses entirely.
Tiered Complexity Decomposition. ChipBench categorizes hardware design tasks into three difficulty tiers that mirror real industrial workflows: (1) self-contained modules — single-module designs with no sub-module dependencies, averaging ~48 lines; (2) hierarchical modules — designs that instantiate sub-modules, where the LLM receives sub-module implementations and must generate only the top-level integration; (3) CPU/IP-level components — real modules from open-source CPU projects (ALUs, branch predictors, register files) where even state-of-the-art models achieve under 22% accuracy. This tiering is critical because LLMs that score 95%+ on flat benchmarks collapse when faced with hierarchical structure. When generating Verilog, always identify which tier a design falls into and adjust the approach accordingly.
Systematic Bug Taxonomy for Debugging. Rather than treating debugging as a monolithic task, ChipBench classifies RTL bugs into four actionable categories: arithmetic bugs (operator misuse, e.g., * instead of +), assignment bugs (incorrect constant values to registers/wires), timing bugs (wrong clock cycle assignments, blocking vs. non-blocking confusion), and state machine bugs (flawed FSM transition logic). This taxonomy drives targeted debugging strategies — timing bugs require clock-cycle-aware analysis while arithmetic bugs need expression-level scrutiny.
Heterogeneous Cross-Language Verification. The Heterogeneous Test Engine (HTE) validates hardware designs by generating reference models in multiple languages, compiling golden Verilog to C++ via Verilator, and running 1,000+ random stimulus iterations while comparing outputs. The critical detail: reset is toggled once and outputs are compared only during the normal operation phase, matching real-world verification practice. This catches bugs that same-language testing misses: when a Python model and a SystemC model fail differently on the same stimulus, the specification itself is ambiguous.
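As a rough illustration of the tier classification described above, the sketch below separates self-contained from hierarchical designs by looking for module instantiations in the Verilog source. It is a regex heuristic, not a parser, and it is not part of ChipBench: the names `instantiated_modules` and `classify_tier` are hypothetical, and it makes no attempt to recognize the CPU/IP-level tier, which depends on where the module comes from rather than on its syntax.

```python
import re

# First words that can open a "<word> <word> (" pattern without being an
# instantiation (keywords and built-in gate primitives).
_NOT_MODULES = {
    "module", "else", "if", "always", "assign", "function", "task",
    "initial", "generate", "input", "output", "inout", "wire", "reg",
    "and", "or", "not", "nand", "nor", "xor", "xnor", "buf",
}

# "<module_type> [#(params)] <instance_name> (" at the start of a line.
_INSTANCE_RE = re.compile(
    r"^\s*([A-Za-z_]\w*)\s+(?:#\s*\([^;]*?\)\s*)?([A-Za-z_]\w*)\s*\(",
    re.MULTILINE,
)


def instantiated_modules(verilog_src):
    """Return the sorted module types that the source appears to instantiate."""
    # Drop // and /* */ comments so commented-out instances are ignored.
    src = re.sub(r"//.*?$|/\*.*?\*/", "", verilog_src,
                 flags=re.MULTILINE | re.DOTALL)
    found = set()
    for match in _INSTANCE_RE.finditer(src):
        module_type = match.group(1)
        if module_type not in _NOT_MODULES:
            found.add(module_type)
    return sorted(found)


def classify_tier(verilog_src):
    """Heuristic: any instantiation makes the design hierarchical."""
    return "hierarchical" if instantiated_modules(verilog_src) else "self-contained"


if __name__ == "__main__":
    top = """
    module alu_top (input [31:0] a, b, output [31:0] y);
        wire [31:0] add_out;
        adder u_add (.a(a), .b(b), .sum(add_out));
        assign y = add_out;
    endmodule
    """
    print(classify_tier(top), instantiated_modules(top))
```

The list returned by `instantiated_modules` doubles as the sub-module checklist for a hierarchical design: each entry is an interface that has to be provided or requested before the top level is written.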
Classify the design tier. Determine if the target module is self-contained (no sub-module dependencies), hierarchical (instantiates other modules), or CPU/IP-level. This dictates prompt structure and expected difficulty.
Extract the I/O interface. Parse the specification to identify all input/output ports, their widths, clock/reset signals, and any parameterizable values. Write the module declaration skeleton first.
For hierarchical designs, enumerate sub-modules. List every sub-module the top-level design must instantiate. Provide or request their interface definitions. Generate the top-level wiring and instantiation before filling in glue logic.
Generate the RTL body using specification-to-RTL mapping. Translate each behavioral requirement into combinational or sequential logic. For FSMs, explicitly enumerate states and draw the transition table before writing case blocks.
Apply syntax validation. Check the generated Verilog compiles with iverilog or a similar tool. Fix syntax errors before functional verification.
Run functional verification against testbenches. Execute the design against both corner-case tests and random stimulus (1,000+ vectors). Compare outputs against golden reference or specification.
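The FSM step above asks for an explicit transition table before any case block is written. One way to make that table checkable is to hold it as a dictionary and test it for gaps, as in the sketch below. The function name `check_transition_table` and the three-state example FSM are hypothetical and serve only as illustration.

```python
from itertools import product


def check_transition_table(states, inputs, table, reset_state):
    """Return a list of problems in an FSM transition table.

    table maps (state, input_symbol) -> next_state.
    """
    problems = []
    # Every (state, input) pair needs a defined, known next state.
    for state, symbol in product(states, inputs):
        nxt = table.get((state, symbol))
        if nxt is None:
            problems.append(f"missing transition: state={state}, input={symbol}")
        elif nxt not in states:
            problems.append(
                f"unknown next state {nxt!r} from state={state}, input={symbol}")
    # Every state should be reachable from the reset state.
    reachable, frontier = {reset_state}, [reset_state]
    while frontier:
        state = frontier.pop()
        for symbol in inputs:
            nxt = table.get((state, symbol))
            if nxt in states and nxt not in reachable:
                reachable.add(nxt)
                frontier.append(nxt)
    for state in states:
        if state not in reachable:
            problems.append(f"unreachable state: {state}")
    return problems


if __name__ == "__main__":
    # Hypothetical 3-state FSM driven by a 1-bit input.
    states = ["IDLE", "START", "DATA"]
    inputs = [0, 1]
    table = {
        ("IDLE", 0): "IDLE", ("IDLE", 1): "START",
        ("START", 0): "DATA", ("START", 1): "DATA",
        ("DATA", 0): "IDLE",  # ("DATA", 1) left out on purpose
    }
    for problem in check_transition_table(states, inputs, table, "IDLE"):
        print(problem)  # prints "missing transition: state=DATA, input=1"
```

A missing entry in the table usually corresponds to a missing case arm or default in the RTL, which is a common origin of stuck-state bugs.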
Categorize the bug type. Analyze the failing behavior to classify it: arithmetic (wrong computation results), assignment (incorrect constant/initial values), timing (output appears on wrong cycle, blocking/non-blocking mismatch), or state machine (stuck states, wrong transitions).
Apply category-specific debugging. For timing bugs: trace clock edges and check blocking (=) vs. non-blocking (<=) usage. For arithmetic: isolate each operator and verify against the spec. For assignment: check all constant values and reset conditions. For FSMs: enumerate all state transitions and compare against the specification.
Verify the fix against the original testbench and confirm the corrected module produces expected outputs across all stimulus patterns.
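For the categorization step above, a quick first test is whether the failing output is simply the expected output shifted in time. The sketch below, with the hypothetical helper `find_cycle_shift`, compares a reference trace and a DUT trace at small cycle offsets. A non-zero offset hints at a timing bug; no alignment at all points toward the arithmetic, assignment, or state machine categories. It is a triage heuristic under those assumptions, not a classifier from the paper.

```python
def find_cycle_shift(ref_trace, dut_trace, max_shift=3):
    """Find k such that dut_trace[t] == ref_trace[t - k] on the overlap.

    k > 0: DUT output is k cycles late; k < 0: DUT output is |k| cycles early;
    k == 0: traces match; None: no shift within max_shift aligns the traces.
    """
    n = min(len(ref_trace), len(dut_trace))
    # Try the smallest shifts first so identical traces report 0.
    for k in sorted(range(-max_shift, max_shift + 1), key=abs):
        pairs = [(ref_trace[t - k], dut_trace[t])
                 for t in range(n) if 0 <= t - k < n]
        if pairs and all(r == d for r, d in pairs):
            return k
    return None


if __name__ == "__main__":
    ref = [0, 0, 5, 7, 9, 0, 0, 3]
    early = ref[1:] + [0]                 # every value appears one cycle early
    wrong = [0, 0, 6, 8, 10, 0, 0, 4]     # right timing, wrong values
    print(find_cycle_shift(ref, early))   # prints -1
    print(find_cycle_shift(ref, wrong))   # prints None
```

Traces that are constant or very short align at many offsets, so the check is only meaningful on stimulus that makes the output change often.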
Select the target language based on design complexity. Use Python for simple combinational/sequential modules (fastest iteration). Use SystemC or CXXRTL for complex designs requiring cycle-accurate simulation or when cross-language verification is needed.
Implement the reference model preserving cycle-level behavior. Match the hardware's clock-by-clock semantics — don't abstract away timing. For Python models, use explicit step() functions that advance one clock cycle. Reset the model once, then compare only during normal operation.
Cross-verify against the Verilog implementation. Run identical stimulus through both the reference model and the RTL, comparing outputs at every cycle boundary. Mismatches indicate bugs in either the reference model or the RTL.
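The cross-verification step above can be sketched as a small harness that resets both sides once and then compares outputs on every cycle of a seeded random stimulus. Everything here is illustrative: `cross_verify` and the two accumulator classes are hypothetical, and the "DUT" is a second Python object standing in for whatever wrapper drives the real RTL simulation.

```python
import random


class AccumulatorRef:
    """Hypothetical 8-bit accumulator: output is the registered sum."""

    def reset(self):
        self.acc = 0

    def step(self, din):
        out = self.acc                       # value visible during this cycle
        self.acc = (self.acc + din) & 0xFF   # state committed at the clock edge
        return out


class AccumulatorBuggy(AccumulatorRef):
    """Stand-in DUT with an injected bug: wrong mask constant."""

    def step(self, din):
        out = self.acc
        self.acc = (self.acc + din) & 0x7F
        return out


def cross_verify(ref, dut, n_cycles=1000, seed=0):
    """Drive identical random stimulus into both models; return mismatches."""
    rng = random.Random(seed)   # seeded so a failure can be replayed
    ref.reset()                 # reset once, then compare only normal operation
    dut.reset()
    mismatches = []
    for cycle in range(n_cycles):
        din = rng.randrange(256)
        expected, actual = ref.step(din), dut.step(din)
        if expected != actual:
            mismatches.append((cycle, din, expected, actual))
    return mismatches


if __name__ == "__main__":
    bad = cross_verify(AccumulatorRef(), AccumulatorBuggy())
    print(f"{len(bad)} mismatching cycles, first at cycle {bad[0][0]}")
```

Recording the cycle number and the stimulus with each mismatch keeps the first failure reproducible, which matters because a mismatch may come from either the reference model or the RTL.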
Example 1: Self-Contained Verilog Module Generation
User: "Generate a Verilog module for a 4-bit priority encoder with valid output."
Approach:
Interface: input [3:0] in, output reg [1:0] pos, output reg valid.
Implementation: casez or priority if-else chain.
Output:
module priority_encoder (
    input [3:0] in,
    output reg [1:0] pos,
    output reg valid
);
    always @(*) begin
        valid = 1'b1;
        casez (in)
            4'b1???: pos = 2'd3;
            4'b01??: pos = 2'd2;
            4'b001?: pos = 2'd1;
            4'b0001: pos = 2'd0;
            default: begin
                pos = 2'd0;
                valid = 1'b0;
            end
        endcase
    end
endmodule
Example 2: Debugging an FSM Timing Bug
User: "My UART transmitter sends data one cycle too early. Here's the RTL..."
Approach:
Compare when tx_data is loaded vs. when tx_start is asserted.
Check for blocking (=) vs. non-blocking (<=) misuse in sequential always blocks.
Diagnosis:
// BUG: Using blocking assignment in sequential block
always @(posedge clk) begin
    tx_reg = shift_reg[0];        // <-- Should be <=
    shift_reg = shift_reg >> 1;   // <-- Should be <=
end

// FIX: Non-blocking assignments for sequential logic
always @(posedge clk) begin
    tx_reg <= shift_reg[0];
    shift_reg <= shift_reg >> 1;
end
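With blocking assignments, tx_reg and shift_reg change as soon as the always block executes, so another always block triggered by the same clock edge may read the new tx_reg instead of the previous one. Downstream logic then sees the data one cycle too early. Non-blocking assignments defer every update until all right-hand sides for that edge have been evaluated, which removes the race.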
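The effect can be reproduced in a few lines of Python. The sketch below models the transmitter block plus a hypothetical downstream register, `sampled`, that captures tx_reg on the same clock edge. The blocking branch assumes the simulator happens to evaluate the transmitter block before the downstream block; real simulators do not guarantee an order, which is exactly why the blocking version is a race.

```python
def simulate(shift_init, n_cycles, blocking):
    """Return the per-cycle value of `sampled`, a register that captures tx_reg."""
    state = {"shift_reg": shift_init, "tx_reg": 0, "sampled": 0}
    trace = []
    for _ in range(n_cycles):
        if blocking:
            # Blocking: writes land immediately, so the downstream block
            # (assumed to run second) already reads the NEW tx_reg.
            state["tx_reg"] = state["shift_reg"] & 1
            state["shift_reg"] >>= 1
            state["sampled"] = state["tx_reg"]
        else:
            # Non-blocking: evaluate every right-hand side from the OLD
            # state, then commit all updates together.
            nxt = {
                "tx_reg": state["shift_reg"] & 1,
                "shift_reg": state["shift_reg"] >> 1,
                "sampled": state["tx_reg"],
            }
            state = nxt
        trace.append(state["sampled"])
    return trace


if __name__ == "__main__":
    print(simulate(0b1011, 5, blocking=False))  # prints [0, 1, 1, 0, 1]
    print(simulate(0b1011, 5, blocking=True))   # prints [1, 1, 0, 1, 0]
```

The blocking trace is the non-blocking trace advanced by one cycle, which matches the "one cycle too early" symptom in the bug report.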
Example 3: Python Reference Model for a FIFO
User: "Create a Python reference model for my synchronous FIFO so I can cross-verify."
Approach:
Output:
class SyncFifoModel:
    def __init__(self, depth=8, width=8):
        self.depth = depth
        self.mem = [0] * depth
        self.wr_ptr = 0
        self.rd_ptr = 0
        self.count = 0

    def reset(self):
        self.wr_ptr = 0
        self.rd_ptr = 0
        self.count = 0

    def step(self, wr_en, rd_en, wr_data):
        """Advance one clock cycle. Returns (rd_data, full, empty)."""
        rd_data = self.mem[self.rd_ptr] if (rd_en and self.count > 0) else 0
        # Update state (matches non-blocking RTL semantics)
        next_count = self.count
        if wr_en and self.count < self.depth:
            self.mem[self.wr_ptr] = wr_data
            self.wr_ptr = (self.wr_ptr + 1) % self.depth
            next_count += 1
        if rd_en and self.count > 0:
            self.rd_ptr = (self.rd_ptr + 1) % self.depth
            next_count -= 1
        self.count = next_count
        return rd_data, (self.count == self.depth), (self.count == 0)
Run 1,000+ random stimulus vectors through both this model and the Verilog FIFO, comparing rd_data, full, and empty at each cycle. Toggle reset once at the start, then compare only during normal operation.
Example 4: Hierarchical Module — ALU Top-Level Integration
User: "I have adder, multiplier, and shifter sub-modules. Generate the top-level ALU that selects between them."
Approach:
Output:
module alu_top (
    input clk,
    input [31:0] a, b,
    input [2:0] op,
    output reg [31:0] result
);
    wire [31:0] add_out, mul_out, shift_out;

    adder u_add (.a(a), .b(b), .sum(add_out));
    multiplier u_mul (.a(a), .b(b), .product(mul_out));
    shifter u_shift (.a(a), .amt(b[4:0]), .out(shift_out));

    always @(*) begin
        case (op)
            3'b000: result = add_out;
            3'b001: result = mul_out;
            3'b010: result = shift_out;
            default: result = 32'b0;
        endcase
    end
endmodule
The key is generating only the glue logic and instantiation — the sub-modules are provided, not generated.
Use non-blocking assignments (<=) in sequential always @(posedge clk) blocks and blocking assignments (=) in combinational always @(*) blocks. Mixing these is the #1 source of timing bugs.
Give reference models explicit step() or eval() functions that advance exactly one clock cycle. Never abstract away cycle boundaries.
Check for common syntax errors: missing end, mismatched port widths, undeclared signals.
Paper: ChipBench: A Next-Step Benchmark for Evaluating LLM Performance in AI-Aided Chip Design. Look for the tiered complexity taxonomy (Section 3), the four-category bug classification (Section 4), the Heterogeneous Test Engine cross-language verification methodology (Section 5), and the automated training data toolbox (Section 6).
Code: https://github.com/zhongkaiyu/ChipBench