Hardware Metric Reasoning using LLMs

Background and Motivation

Recent advances in Large Language Models (LLMs) have transformed the landscape of hardware design automation. LLMs have been applied successfully to tasks such as Verilog code generation, EDA tool scripting, accelerator design, and RTL error fixing. However, while these applications demonstrate that LLMs can generate and refine HDL code, a crucial gap remains: LLMs are not yet proficient at reasoning about post-synthesis metrics such as area, delay, and static power. These metrics are critical to hardware designers, as they directly impact design performance, energy efficiency, and manufacturability.

Existing approaches often use LLMs to iteratively refine code to meet design constraints (e.g., area or power budgets) through prompting or search-based techniques. Yet, such methods do not give LLMs deeper insight into how HDL-level design choices propagate into synthesis outcomes. To address this gap, MetRex proposes a direct framework for metric reasoning—enabling LLMs to estimate synthesis results from code itself.

What is MetRex?

MetRex is the first benchmark designed to evaluate LLM reasoning about post-synthesis metrics of Verilog HDL designs. It provides:

- A large-scale dataset of 25,868 Verilog designs, annotated with three key metrics:
  - Area
  - Delay
  - Static power
- A Chain of Thought (CoT) template, which structures intermediate reasoning steps such as gate counts, per-gate metrics, and critical path analysis, guiding LLMs toward more accurate predictions.
- An automated data-cleaning flow using Verilog compilers, synthesis tools, and LLM agents to ensure that only clean, synthesizable designs are included.

Example of Chain of Thought

By integrating reasoning steps into dataset annotations, MetRex helps LLMs move beyond surface-level code generation to deeper code-to-metric inference.
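
To make the intermediate steps concrete, the following is a minimal sketch of how gate counts, per-gate metrics, and a critical path could combine into the three metrics. It is an illustration under stated assumptions, not the MetRex template itself: the cell names, the per-cell numbers, and the function `estimate_metrics` are hypothetical, and the sketch assumes that area and static power are sums over all cells while delay is the sum of cell delays along the critical path.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Cell:
    area: float     # cell area (illustrative units)
    delay: float    # cell propagation delay (illustrative units)
    leakage: float  # cell static (leakage) power (illustrative units)


# Hypothetical per-cell values, NOT taken from any real standard-cell library.
LIBRARY = {
    "nand2": Cell(area=4.0, delay=0.125, leakage=2.0),
    "xor2": Cell(area=9.0, delay=0.25, leakage=4.0),
    "dff": Cell(area=20.0, delay=0.5, leakage=8.0),
}


def estimate_metrics(gate_counts, critical_path):
    """Combine CoT-style intermediate steps into area, delay, and static power.

    gate_counts:   mapping of cell name -> number of instances in the netlist
    critical_path: ordered list of cell names on the longest timing path
    """
    # Step 1: total area = sum over cells of (count * per-cell area).
    area = sum(LIBRARY[name].area * n for name, n in gate_counts.items())
    # Step 2: static power = sum over cells of (count * per-cell leakage).
    static_power = sum(LIBRARY[name].leakage * n for name, n in gate_counts.items())
    # Step 3: delay = sum of per-cell delays along the critical path.
    delay = sum(LIBRARY[name].delay for name in critical_path)
    return {"area": area, "delay": delay, "static_power": static_power}


if __name__ == "__main__":
    counts = {"nand2": 3, "xor2": 2, "dff": 1}
    path = ["dff", "xor2", "nand2"]
    print(estimate_metrics(counts, path))
```

Keeping each step as a separate, inspectable quantity is what makes an estimate of this kind checkable: a wrong final number can be traced to a wrong gate count or a wrong path.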

Key Contributions

The paper, "MetRex: A Benchmark for Verilog Code Metric Reasoning Using LLMs" (Abdelatty et al., 2025), was published at ASP-DAC '25.

The GitHub repository is available at https://github.com/scale-lab/MetRex.

The project makes five main contributions:

- Benchmark Dataset: MetRex introduces the first large-scale dataset for metric reasoning, covering diverse Verilog sources such as RTL-Coder, VeriGen, ISCAS benchmarks, OpenCores, and NVDLA designs.
- Automated Workflow: A synthesis-and-repair loop integrates LLMs with tools like Yosys, Icarus Verilog, and Cadence Genus to fix errors and annotate metrics automatically.
- Chain-of-Thought Reasoning: The CoT template provides interpretable reasoning steps for metric estimation, improving accuracy compared to direct prompting by up to 8.9% across different metrics.
- Supervised Fine-Tuning (SFT): Fine-tuning LLMs with MetRex improves performance by 37.0% (area), 25.3% (delay), and 25.7% (static power) compared to few-shot prompting.
- Comparative Analysis: Against regression-based baselines such as MasterRTL, MetRex-trained LLMs achieve up to 17.4% higher accuracy within 5% error margins, while also being 1.7× faster by eliminating preprocessing.
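
The synthesis-and-repair loop mentioned above can be sketched roughly as follows. This is a hedged illustration of the general pattern, not the project's actual pipeline: `clean_design`, `check`, `repair`, and `max_attempts` are hypothetical names. In a real flow, `check` would presumably wrap a compiler or synthesis tool such as Icarus Verilog or Yosys and return its error log, and `repair` would call an LLM with the failing code and that log. Both are passed in as plain callables here so that the sketch runs without any external tool.

```python
from typing import Callable, Optional, Tuple

# check(source) -> (ok, log): hypothetical wrapper around a compiler/synthesis run.
CheckFn = Callable[[str], Tuple[bool, str]]
# repair(source, log) -> new_source: hypothetical wrapper around an LLM call.
RepairFn = Callable[[str, str], str]


def clean_design(
    source: str,
    check: CheckFn,
    repair: RepairFn,
    max_attempts: int = 3,
) -> Optional[str]:
    """Return a version of `source` that passes `check`, or None if repair fails.

    The design is checked once as-is, then repaired and re-checked up to
    `max_attempts` times. Designs that never pass are dropped (None).
    """
    candidate = source
    for attempt in range(max_attempts + 1):
        ok, log = check(candidate)
        if ok:
            return candidate
        if attempt < max_attempts:
            # Feed the tool's error log back to the repair step.
            candidate = repair(candidate, log)
    return None


if __name__ == "__main__":
    # Toy stand-ins: the "tool" only checks for a closing keyword,
    # and the "LLM" fixes the design by appending it.
    def toy_check(src: str) -> Tuple[bool, str]:
        ok = "endmodule" in src
        return ok, "" if ok else "error: missing endmodule"

    def toy_repair(src: str, log: str) -> str:
        return src + "\nendmodule"

    print(clean_design("module m;", toy_check, toy_repair))
```

Bounding the number of repair attempts keeps the flow from looping forever on designs that the repair step cannot fix, so those designs are simply excluded from the dataset.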

Broader Impact

MetRex highlights the unique strengths of LLMs in hardware design:

- Direct Verilog processing: Unlike conventional ML models that require feature extraction (e.g., ASTs or operator graphs), LLMs can analyze raw HDL code without lossy transformations.
- Interpretability: Through CoT reasoning, LLM predictions are explainable, offering gate-level breakdowns and logical derivations rather than opaque numerical outputs.
- Scalability: With fine-tuning, LLMs can provide accurate estimates at scale, reducing dependence on costly synthesis runs during early design exploration.

Ultimately, MetRex lays the groundwork for a new direction in LLM-powered EDA, where models not only generate RTL but also reason about its implications, enabling faster, more insightful design space exploration.

Large Language Models (LLMs) have been applied to various hardware design tasks, including Verilog code generation, EDA tool scripting, and RTL bug fixing. Despite this extensive exploration, LLMs have yet to be used for post-synthesis metric reasoning and estimation of HDL designs. In this paper, we assess the ability of LLMs to reason about post-synthesis metrics of Verilog designs. We introduce MetRex, a large-scale dataset comprising 25,868 Verilog HDL designs and their corresponding post-synthesis metrics, namely area, delay, and static power. MetRex incorporates a Chain of Thought (CoT) template to enhance LLMs' reasoning about these metrics. Extensive experiments show that Supervised Fine-Tuning (SFT) boosts the LLM's reasoning capabilities on average by 37.0%, 25.3%, and 25.7% on area, delay, and static power, respectively. While SFT improves performance on our benchmark, it remains far from achieving optimal results, especially on complex problems. Compared with state-of-the-art regression models, our approach delivers accurate post-synthesis predictions for 17.4% more designs (within a 5% error margin), in addition to offering a 1.7× speedup by eliminating the need for pre-processing. This work lays the groundwork for advancing LLM-based Verilog code metric reasoning.
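
As a rough illustration of what "within a 5% error margin" can mean in practice, the sketch below computes the fraction of designs whose predicted metric falls within a given relative error of the synthesized value. The function name `fraction_within_margin` and the use of relative error against the reference value are assumptions made for illustration; the paper's exact metric definition may differ.

```python
from typing import Sequence


def fraction_within_margin(
    predicted: Sequence[float],
    actual: Sequence[float],
    margin: float = 0.05,
) -> float:
    """Fraction of predictions whose relative error is at most `margin`.

    Relative error is |predicted - actual| / |actual|, written below in a
    multiplied-out form so that a reference value of 0 does not divide by zero.
    """
    if len(predicted) != len(actual):
        raise ValueError("predicted and actual must have the same length")
    if len(actual) == 0:
        raise ValueError("at least one design is required")
    hits = sum(
        1 for p, a in zip(predicted, actual) if abs(p - a) <= margin * abs(a)
    )
    return hits / len(actual)


if __name__ == "__main__":
    # Hypothetical area predictions against hypothetical synthesis results.
    predicted_area = [100.0, 104.0, 120.0, 50.0]
    synthesized_area = [100.0, 100.0, 100.0, 50.0]
    print(fraction_within_margin(predicted_area, synthesized_area))
```

A threshold-based score like this rewards predictions that are close enough to be useful and ignores how far off the remaining ones are, which is why it is usually reported alongside the margin used.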
