6.3

/10

Spotlight6 位审稿人

最低5最高8标准差1.1

3.7

置信度

正确性2.8

贡献度2.7

表达3.2

NeurIPS 2024

Scalable and Effective Arithmetic Tree Generation for Adder and Multiplier Designs

Yao Lai,Jinxin Liu,David Z. Pan,Ping Luo

OpenReview PDF

提交: 2024-05-14更新: 2024-11-06

TL;DR

Employing RL methods for arithmetic tree generation yields superior designs for adders and multipliers.

摘要

关键词

Reinforcement LearningComputer ArithmeticElectronic Design Automation

评审与讨论

审稿意见

评分: 7置信度: 32024-07-08

This paper presents a new method for designing arithmetic modules by modeling tasks as single-player tree generation games, using reinforcement learning techniques. This approach combines prefix and compressor tree modules to find optimal multipliers. Experiments show that the developed 128-bit adders and multipliers outperform the latest designs, significantly reducing delay and size. The method enhances computational efficiency and hardware size within hours and integrates seamlessly into advanced technology, offering valuable insights for future hardware design.

优点

The strengths of the paper are listed below: The paper introduces an innovative method for designing arithmetic modules that outperforms traditional human design techniques.

The paper is well-presented, offering a clear explanation of its methodology. It improves results against other method PrefixRL

缺点

The weaknesses of the paper are listed below:

the paper innovation limited to application of RL and Tree Search to a circuit design problem, which are studied in other previous work such as: Mirhoseini, Azalia, et al. "A graph placement methodology for fast chip design." Nature 594.7862 (2021): 207-212.

问题

A few question and notes are listed below:

It seems the main additions of this work are two-level Retrieval and MCTS compared to PrefixRL. Are there other inovation that distingush this work vs PrefixRL?

How one would extend to other arithmetic operations, such as exponentiation? Seems a weakness to the method is that one needs to design the "game" for addition operators. Maybe There are work that design envs that author could try: Ma, Yecheng Jason, et al. "Eureka: Human-level reward design via coding large language models." arXiv preprint arXiv:2310.12931 (2023).

局限性

The authors adequately addressed the limitations and, if applicable, potential negative societal impact of their work

作者回复

2024-08-06

Thank you so much for your positive comments.

Weakness 1 [Innovation]

Circuit design is a broad field encompassing various tasks. Our work focuses on arithmetic unit design (adders and multipliers), which differs significantly from graph placement in search space and evaluation metrics. Graph placement involves optimizing the positions of circuit modules with graph structures, while arithmetic unit design deals with the structures of prefix and compressor trees, resulting in vastly different combinatorial space and optimization criteria.
Moreover, our approach includes several innovations:
- When designing adders, we use the MCTS framework to balance exploring new design possibilities and refining existing ones, ensuring a comprehensive and efficient search across the design space.
- For the compressor tree search in MultGame, the search performance is enhanced by redefining actions and states, and building the design from scratch, providing greater search flexibility.
- The optimization process is also optimized by two-level retrieval and pruning strategies, enabling the efficient scaling of the design's bit-width (128-bit adder and 64-bit multiplier).
- Our co-design framework allows simultaneous optimization of prefix and compressor trees, further enhancing the performance of designed multipliers.

Question 1 [Difference with PrefixRL]

Both our approach and PrefixRL [17] apply an RL framework to circuit design. However, our method incorporates several critical enhancements that we would like to clarify:

Advancement in the RL (state, action, and reward) formulation:
- State: In PrefixRL, prefix trees are represented using a binary matrix, which is solely applicable to the prefix tree structure. In contrast, our model does not directly represent the prefix tree structure. Instead, each node, representing a prefix tree, is evaluated by $W(s)$ , a scoring function designed to effectively balance exploration and exploitation.
- Action: PrefixRL’s actions include adding and deleting cells. However, adding cells does not improve the theoretical metrics (level/size) of the adder it represents. Therefore, we have strategically omitted actions that add cells when optimizing these metrics. This refinement significantly boosts search efficiency, especially in the case of a 128-bit adder, by focusing on actions that directly contribute to performance improvement.
- Reward: The reward in PrefixRL is calculated based on the performance improvement at each step. Our method, however, utilizes the performance value of the final node reached during the simulation phase of MCTS. This shift reduces the computational overhead from multiple time-consuming simulations and provides a more direct assessment of performance outcomes.
Advancement in Task Setting and Method Design:
- Task Setting: PrefixRL only addresses adder design, whereas we work on adder and multiplier designs.
- Method Design: We have introduced a novel framework for the co-optimization of the multiplier’s compress tree and prefix tree. This integrated approach significantly enhances the overall effectiveness of multiplier design.

Question 2 [Extend to Exponentiation Operation]

Exponentiation operations, whether employing the naive method or the method of exponentiation by squaring, involve numerous multiplication operations. Thus, our existing multiplier design approach is inherently well-suited to optimize these procedures. By harnessing the repeated multiplication principle intrinsic to both naive and binary exponentiation techniques, we can extend our RL framework, initially developed for multipliers, to enhance the design and efficiency of exponentiation units. This extension also allows us to further improve our model to accommodate the iterative nature of exponentiation, ensuring that our optimization effectively addresses the distinctive characteristics and performance demands of these hardware modules.

Question 3 [Need for Designing Game]

We understand your concern regarding the need to design the "game" for our work. Indeed, the majority of problems solved using RL necessitate the explicit definition of states, actions, and rewards, much like designing a game. The benefit of this approach is that a well-defined structure can significantly enhance the agent's ability to optimize the target efficiently and effectively.

Regarding the work "Eureka", we fully agree that it represents an excellent contribution to the field. We will discuss Eureka in our revised paper and consider applying its reward design methodologies to arithmetic unit design in our future work. We believe that leveraging Eureka's approach can further improve our algorithm's efficiency and performance.

评论- rebuttal acknowledgment

2024-08-08

thank you for clarifications and addressing the questions

2024-08-08

Thank you very much for the update. We appreciate your time and effort in helping us improve our paper.

审稿意见

评分: 8置信度: 42024-07-11

The paper introduces a novel approach to optimizing arithmetic module designs, particularly adders and multipliers. By modeling design tasks as single-player tree generation games and employing RL techniques (MCTS and PPO), the authors effectively explore the design space to uncover superior structures. Their method significantly enhances computational efficiency and reduces hardware size, outperforming existing techniques in both aspects. This work demonstrates the effective application of RL in hardware design, presenting an innovative tree-generation strategy for improved design optimization.

优点

Originality:

Few competing works address adder-multiplier co-design automation using RL or machine learning techniques. This paper stands out as an interesting and pioneering effort in optimizing such designs with RL.
The introduction of tree-generation games and the application of RL techniques to optimize arithmetic module designs are novel and innovative.

Quality:

Based on the experimental results, this paper has discovered several optimal 128-bit adder architectures, a noteworthy achievement given the vast search space explored.
The experiments are thorough, validated across different technology nodes and using both open-source and commercial tools, confirming the method's effectiveness.

Clarity:

The paper is well-organized, featuring a clear introduction, detailed methodology, and systematic presentation of results. This structure facilitates understanding the complex concepts introduced.
The introduction video is a nice touch and helps in understanding the proposed method.

Significance:

The authors have made the research open-source and documented most of the experimental details comprehensively, which significantly enhances the reproducibility of the results.
The improvements of the proposed method over existing techniques in terms of delay and area are significant and demonstrate the potential of the proposed approach in optimizing hardware design.

缺点

This study primarily focuses on optimizing adders and multipliers, with limited exploration of other modules within the hardware system.
While the proposed methods demonstrate notable improvements for 128-bit adders and 64-bit multipliers, their scalability to even larger bit widths needs further investigation.

Minors:

I suggest the authors consistently use either the abbreviation 'RL' or the full term 'reinforcement learning' throughout the paper.
In Equation 3, $\sum_{i=0}^{n}$ should be $\sum_{i=1}^{N}$ .
Lines 226-227: '... from step 0 to step T' should be '... from step 1 to step T'.

问题

Could you provide detailed information on the design flow? Moreover, how are the floorplan and placement parameters set in OpenROAD?
Could you elaborate on how your method might be extended or adapted to optimize other hardware units?

局限性

N/A

作者回复

2024-08-06

Thank you very much for your positive comments.

Weakness 1 [Focus on Adders and Multipliers]

While our current results focus on adders and multipliers, these operations are among the most time-consuming on GPUs, making their optimization highly significant.
We would like to clarify that our work can also be extended to other arithmetic modules. Taking exponentiation as an example, exponentiation operations require multiple multiplication operations. By leveraging the underlying principle of repeated multiplication in exponentiation, we can apply our RL framework, initially designed for optimizing adders and multipliers, to enhance the design and efficiency of exponentiation units. This extension would involve tuning our model to accommodate the iterative nature of exponentiation, ensuring that the optimization captures the unique characteristics and performance requirements of these hardware modules.

Weakness 2 [Scalability]

We understand your concern regarding the scalability of our methods to larger bit widths. Currently, most computer systems predominantly use integer widths within the 128-bit range (with 64-bit multipliers producing 128-bit results), covering most practical applications. Moreover, we have employed pruning and two-level retrieval techniques to effectively explore and discover superior 128-bit prefix adder structures. Additionally, through optimized synthesis, we have scaled our multiplier designs from 16-bit to 64-bit, demonstrating significant improvements over the RL-MUL approach. Thus, our approach addresses current practical requirements and showcases the potential for scalability and enhanced performance through advanced optimization techniques.

Minors

Thank you for pointing out. We will correct them in the revised version.

Question 1 [Details about the synthesis]

We have provided detailed information on the timing aspects of the design flow in the appendix of our paper. Specifically, Fig. 12 and Fig. 13 contain comprehensive script codes that outline our approach to both logical and physical synthesis. For Fig. 12, we have maintained consistency with the RL-MUL methodology, while for Fig. 13, we utilized the default physical synthesis flow parameters provided by OpenROAD. These figures should provide the detailed information you have requested regarding the timing aspects and the specific parameters used in OpenROAD.

Question 2 [Extension to Other Hardware Units]

Our method can also be extended to exponentiation units. See Weakness 1.

2024-08-14

Thank you for your response. I would like to keep my score (strong accept).

评论- Thank you for the update

2024-08-14

We are very grateful for your positive comments once again. We also truly appreciate the time and effort you have put into reviewing our work.

审稿意见

评分: 6置信度: 42024-07-11

This paper discusses the application of reinforcement learning to optimize the design of arithmetic circuits, specifically adders, and multipliers. Two single-player tree generation games, AddGame and MultGame, are designed to formulate adder and multiplier design problems. AddGame re-designs the search method, following a similar state space and action space as PrefixRL. MultGame optimizes the compressor tree from scratch instead of starting from existing solutions. Results show they outperform PrefixRL and RL-MUL.

优点

The paper re-designs the search techniques for adders and multipliers. Experimental results demonstrate the effectiveness of the proposed methods. The findings are significant for hardware design.
Main concerns are addressed which include the performance of designed adders and multipliers and the time cost of the searching process.
The writing is clear and well-structured.

缺点

The paper would benefit from a more detailed introduction of the states and actions for both AddGame and MultGame in the main part. It was confusing to understand how a compressor tree is represented until I referred to Appendix A.4.
Some details are missing. For example, it is unclear whether the results shown in Figure 6 are synthesized using Nangate45 technology.

问题

The paper states, “The compressor tree is built from scratch instead of starting from existing solutions for more design flexibility.” Does building from scratch indeed offer more design flexibility while maintaining performance?
Is the objective of Table 7 to minimize delay? Can you present a result with a trade-off objective?
Please provide the definition of accuracy mentioned in Appendix B.6.

局限性

Limitations and societal impacts are discussed in the paper.

作者回复

2024-08-06

Thank you very much for your positive comments.

Weakness 1 [Add Detailed Introduction in Main Text]

Thank you for the suggestion. In our revised paper, we will include the introduction of the states and actions for both AddGame and MultGame, and the tree representations in the Appendix, to the main text.

Weakness 2 [Technology Used in Fig. 6]

The Fig. 6 results are based on Nangate45 PDK (45 nm). We will add this information to the caption of Fig. 6 in our revised version.

Question 1 [Benefits of Building from Scratch]

Building from scratch theoretically allows for the exploration of a broader range of compressor tree structures. All compressor trees can be viewed as being composed of basic compressor units, and building from scratch begins with adding these fundamental compressors. This approach grants greater freedom compared to modifying an existing compressor tree. This is because modifying existing compressor trees limits the exploration of new structures due to constraints such as the number of iterations.

Additionally, the number of compressors in a compressor tree is finite, which means the steps in our building-from-scratch method are also limited. Consequently, optimization performance can be maintained within an acceptable range. As shown in Table 8, the optimization time remains within a reasonable timeframe. This demonstrates that building from scratch offers more design flexibility while maintaining performance.

Question 2 [Objective in Table 7]

The designs in Table 7 are obtained by performing trade-off optimization with open-source tools and then directly testing these optimized circuits with commercial tools. The increase in area can be attributed to the differences between the open-source and commercial tools used. Due to variations in their synthesis methodologies, the results from open-source tools might not achieve optimal performance when used with commercial tools. However, our approach still demonstrates significant advantages in terms of delay. We will clarify this in the new version of our paper.

Question 3 [Accuracy Definition in Table 9]

As we introduced in the two-level retrieval, we defined a fast flow and a full flow. The fast flow is quick but may not be accurate without routing, whereas the full flow takes longer but provides precise delay and area outputs. We assume that the delay and area predicted by the fast flow are $d'$ and $a'$ , respectively, while the delay and area achieved from the full flow are $d$ and $a$ . The accuracy reported in Table 9 is defined as $\frac{d'}{d}$ and $\frac{a'}{a}$ . Experimental results show that the fast flow tends to slightly underestimate delay, while its area predictions are completely accurate.

审稿意见

评分: 7置信度: 42024-07-13

This paper aims to leverage reinforcement learning (RL) techniques for automatic arithmetic circuit design. The authors propose to cast the design tasks as single-player tree generation games, and leverage reinforcement learning techniques to optimize these arithmetic tree structures. For adder circuits, the proposed approach discovers designs of 128-bit adders that achieve Pareto optimality in theoretical metrics. Moreover, the proposed approach significantly outperforms previous state-of-the-art RL-based approaches for adders and multipliers design.

优点

The paper is well-written and logically sound.
The paper explores the application of RL methods to arithmetic circuit design, presenting a unique intersection of interest for RL community and AI chip development.
The authors introduce a novel approach by modeling arithmetic module design tasks as single-player tree generation games, namely AddGame and MultGame, leveraging the well-established RL capabilities for complex decision-making tasks.
The method exhibits high flexibility and scalability, making it applicable to both 7nm technology and higher-bit units. These characteristics are crucial for practical hardware design and industrial applications.
Experiments show that the proposed approach discovers designs of 128-bit adders that achieve Pareto optimality in theoretical metrics, which is an impressive result.

缺点

The paper employs 45nm and 7nm PDKs. They are open-source and academically oriented, but not commercial-grade industry PDKs. This choice might introduce variations in the designs when transitioning from theoretical models to real-world industrial applications. Thus, the authors are encouraged to test their approach on real industry PDKs to validate the practicality in future work.

A minor grammatical correction is needed on line 114, where “it derived” should be revised to “it is derived.”

问题

Why the authors do not test on real industry PDKs?
Does the proposed method have any limitations when applied to real industry PDKs?
In Fig. 6, why do various RL approaches produce a range of designs, whereas traditional designs like BK and Sklansky adders result in only a single design?

局限性

N.A.

作者回复

2024-08-06

Thank you very much for your positive feedback.

Weakness 1 [PDK Selection]

Although the 45nm and 7nm PDKs employed in our study are open-source and academically oriented, they are designed to closely mimic corresponding industry nodes and are widely used for benchmarking purposes within the academic community. This ensures complete reproducibility of our results, which is a critical aspect of rigorous scientific research. For example, both PrefixRL and RL-MUL utilize the open-source Nangate45 PDK. In contrast, commercial-grade industry PDKs are proprietary and restrict the sharing of detailed process information. While we acknowledge the importance of validating our approach with industry PDKs in future work to confirm its practicality in real-world applications, open-source PDKs provide a robust foundation for demonstrating our methodology's feasibility and effectiveness within this paper's scope.

Weakness 2 [Grammatical Error]

A good catch. We will correct this in our revised version.

Question 1 [Usage of Industry PDK]

The PDKs used in our study are designed to mimic corresponding industry nodes and are widely adopted for benchmarking purposes. They help enhance reproducibility, while industry PDKs, being proprietary, may present some challenges with reproducibility.

Question 2 [Applicability to Real Industry PDKs]

Our proposed method can be directly applied to real industry PDKs. The only requirement is to modify the library files corresponding to the parameters during synthesis.

Question 3 [Design Number Variability]

Traditional designs like Brent-Kung and Sklansky adders have fixed prefix structures, resulting in a single, consistent design each time they are implemented. In contrast, RL methods explore a wide design space during a single execution, generating various tree structures and diverse design outcomes. This flexibility allows RL methods to produce multiple designs within one optimization process.

2024-08-14

Thanks for the detailed response. I will keep my current score.

评论- Thanks for the reply

2024-08-14

Thank you for your positive feedback and the time spent reviewing our work.

审稿意见

评分: 5置信度: 22024-07-15

This paper presents a reinforcement-learning-based approach to optimizing arithmetic units, specifically adders and multipliers, to enhance computational efficiency and reduce area consumption. The key idea is to frame the design as tree generation games. The evaluation shows that the proposed method can generate 128-bit adders with up to 26% reduced delay and 30% smaller size, and multipliers achieving up to 49% faster speeds and 45% size reductions compared to existing techniques.

优点

The paper is in good writing style. Figures are well plotted.
The use of reinforcement learning and tree generation games for Adder and Multiplier generation is novel and interesting.
The evaluation setup is clear and the comparison between different methods is comprehensive. The trade-off between the area and delay of the resulted ALU designs is well demonstrated.
The reported enhancements in speed and size for adders and multipliers with large bit width are substantial.

缺点

The improvement of PPO-based method becomes smaller as the bit width becomes lower. The paper claims that multipliers and adders are more important for large models in AI applications. However, the modern large models are typically running with 16 bits or even lower bit width. The applicability of the proposed method in modern applications remain questionable.
The evaluation is mostly performed on integer data types. It is unclear how proposed method works on the adders and multipliers of floating-point data type.
While the results are promising, the specific modeling seems to only work for adders and multipliers. The generalization of the proposed approach on other basic arithmetic units are unclear.

问题

Please refer to the weakness.

局限性

Authors have adequately addressed the limitations.

作者回复

2024-08-06

Thank you so much for your constructive comments.

Weakness 1 [Application in Modern LLM]

We appreciate this observation regarding the diminishing improvements of our PPO-based method as the bit width decreases. Indeed, the paper highlights the significance of multipliers and adders for large models, which traditionally have used higher bit widths. However, it is essential to note that while modern AI models often utilize 16 bits or even lower for efficiency, the computational demands and precision requirements can vary significantly depending on the application and deployment scenario. Recent studies [1] have shown that the performance of LLAMA 3 significantly degrades with low-bit quantization. This indicates that our method for high-bit precision has substantial potential for improving the computational speed of future large models and high-performance computing tasks.
In addition to theoretical improvements, we are actively collaborating with industry partners to validate and refine our approach in real-world scenarios, ensuring that our method remains relevant and effective for contemporary AI applications, irrespective of their bit-width requirements.

[1] How Good Are Low-bit Quantized LLAMA3 Models? An Empirical Study

Weakness 2 [About Floating-Point Data]

Thank you for raising the concern regarding the performance of our method on floating-point v.s. integer data types. It is important to clarify that at the fundamental level, the arithmetic operations of addition and multiplication share core similarities between floating-point and integer data types. For example, in floating-point multiplication, the exponents are added (integer addition), and the significands are multiplied (integer multiplication). This indicates that the core computational workload lies in integer arithmetic. Thus, while our current evaluation is on integer data types, the principles and optimizations easily apply to floating-point arithmetic.
Although our initial evaluations focused on integer operations due to their straightforward implementation and common usage in specific applications, the core optimizations are inherently applicable to floating-point arithmetic as well. Future work will explicitly focus on floating-point evaluations to demonstrate this applicability and to fine-tune our method for any floating-point-specific optimizations that may be required.

Weakness 3 [Significance in Addition and Multiplication. Extension to Other Arithmetic Units]

While our current results focus on adders and multipliers, these operations are among the most time-consuming on GPUs, making their optimization highly significant.
We would like to clarify that our work can also be extended to other arithmetic modules. Taking exponentiation as an example, exponentiation operations require multiple multiplication operations. By leveraging the underlying principle of repeated multiplication in exponentiation, we can apply our RL framework, initially designed for optimizing adders and multipliers, to enhance the design and efficiency of exponentiation units. This extension would involve tuning our model to accommodate the iterative nature of exponentiation, ensuring that the optimization captures the unique characteristics and performance requirements of these hardware modules.

审稿意见

评分: 5置信度: 52024-07-17

A reinforcement learning-based method is proposed to design adders and multipliers within a framework of single-player tree generation games, named AddGame and MultGame. The method leads to the discovery of superior designs for 128-bit adders and multipliers, achieving significant improvements in delay and size compared to previous methods.

优点

The results look quite promising, especially on large designs.

缺点

The novelty is limited, as the RL environment (state, action, reward) is pretty much similar to previous work [17], and MCTS is also not something new in a game-playing problem.
Some additional data points in the results might not look reasonable. See questions below.

问题

What's the specific representation of adder and multiplier in the proposed method? Is it similar to previous works (e.g., a matrix in [17])?
Did the obtained design go through an equivalence-checking process to ensure the functionality correctness? If so, the details should be provided.
Figure 6 presents that the proposed method achieves promising results. However, it is observed that the Sklansky adder achieves even better delay than the Kogge-stone, which looks questionable and degrades the convincingness of the experiment flow.

局限性

Yes.

作者回复

2024-08-06

Thank you very much for your constructive feedback.

Weakness 1 [Distinction and Improvement over PrefixRL [17]]

Both our approach and PrefixRL [17] apply an RL framework to circuit design. However, our method incorporates several critical enhancements that we would like to clarify:

Advancement in the RL (state, action, and reward) formulation:
- State: In PrefixRL, prefix trees are represented using a binary matrix, which is solely applicable to the prefix tree structure. In contrast, our model does not directly represent the prefix tree structure. Instead, each node, representing a prefix tree, is evaluated by $W(s)$ , a scoring function designed to effectively balance exploration and exploitation.
- Action: PrefixRL’s actions include adding and deleting cells. However, adding cells does not improve the theoretical metrics (level/size) of the adder it represents. Therefore, we have strategically omitted actions that add cells when optimizing these metrics. This refinement significantly boosts search efficiency, especially in the case of a 128-bit adder, by focusing on actions that directly contribute to performance improvement.
- Reward: The reward in PrefixRL is calculated based on the performance improvement at each step. Our method, however, utilizes the performance value of the final node reached during the simulation phase of MCTS. This shift reduces the computational overhead from multiple time-consuming simulations and provides a more direct assessment of performance outcomes.
Advancement in Implementation:
- Two-level Retrieval: We use the fast synthesis flow to retrieve potential designs, and then use full synthesis flow to get the final performance results, improving the efficiency.
Advancement in Task Setting and Method Design:
- Task Setting: PrefixRL only addresses adder design, whereas we work on adder and multiplier designs.
- Method Design: We have introduced a novel framework for the co-optimization of the multiplier’s compress tree and prefix tree. This integrated approach significantly enhances the overall effectiveness of multiplier design.

Weakness 2 [Data Points]

See the response in Question 3.

Question 1 [State Representation]

Specific representation of adder and multiplier:

Adder: We represent the prefix tree structure based on the score $W(s)$ at the nodes of the search tree, eliminating the need to explicitly design a matrix representation model in PrefixRL [17] and making it more generalizable.
Multiplier: We select high-level meta-features, such as the maximum estimated delay value of bits, different from the matrix representation in counterparts (e.g. RL-MUL). Utilizing these higher-level features enhances the model's ability to learn the construction of multipliers with detailed information, thereby improving design outcomes.

Difference with PrefixRL [17]:

Please refer to our response in [Weakness 1] for the advantages of our approach over PrefixRL.

Question 2 [Did the obtained design go through an equivalence-checking process to ensure the functionality correctness?]

Sure, we did it. Specifically, we used corresponding testbenches for the generated Verilog code, which includes hundreds of sets of addition or multiplication operations. By observing the outputs, we verified the correctness of the computations.

Question 3 [Sklansky v.s. Koggle-Stone]

We agree with the reviewer that, in general, the Kogge-Stone adder generally has lower latency than the Sklansky adder due to its highly parallel, low fanout, and balanced prefix computation structure. As observed in our paper (Figure 6, middle subfigure), the Kogge-Stone adder exhibits lower latency compared to the Sklansky adder.

However, in some specific experimental conditions, the Sklansky adder can also exhibit lower latency, as observed in our experiments (Figure 6, left subfigure) and also reported in the literature [1, 2, 3, 4]. The specific reasons for this observation are the following:

Impact by the Interconnection Complexity: In real-world implementations, the Kogge-Stone adder's structure results in very complex interconnections. Typically, with abundant hardware resources, this interconnection complexity can be effectively managed. However, when actual hardware resources are limited, this complexity can lead to significant wiring congestion and increased actual delay. In contrast, the Sklansky adder, with its reduced interconnection, offers simpler wiring. This simplicity decreases the likelihood of congestion and potentially enhances overall performance under resource limitations.
Impact by Synthesis Tools: Delay is influenced not only by the differences in the prefix tree structures (in Kogge-Stone and Sklansky adders) but also by the subsequent synthesis processes. Typically, the high fan-out in the Sklansky adder tends to increase its delay. However, modern synthesis tools (e.g., Yosys, OpenROAD) are equipped with advanced optimization techniques that can effectively manage high fan-out by balancing the load, buffering critical paths, and optimizing the placement of logic elements. These tools can significantly reduce the adverse impact of high fan-out on delay, ensuring that the Sklansky adder performs efficiently.

To further validate our experimental results, we also consulted an expert in adder design, who believes that it is entirely acceptable for the Sklansky adder to have a slightly lower delay than the Kogge-Stone adder after physical synthesis.

[1] Performance comparison among various parallel prefix adder topologies

[2] PrefixRL: Optimization of parallel prefix circuits using deep reinforcement learning

[3] Implementation of 64 Bit Arithmetic Adders

[4] An Efficient Design and Performance Analysis of Novel 8 Bit Modified Wallace Multiplier Using Sklansky Adder in Comparison with Kogge-Stone Adder (KSA)

评论- rebuttal acknowledgment

2024-08-08

Thanks for the clarification. Most of the concerns are addressed. A few more comments:

If the logic equivalence checking is performed, please add the description in future versions.
Regarding the clarification on Kogge-stone vs. Sklansky, if it is due to interconnection, the authors may inspect the final layout and/or tool settings to verify.

The score is raised accordingly.

评论- Thank you for the update

2024-08-09

Thank you very much for the update.

We will add the checking description in our revised version.
Thank you for the suggestion. Our codebase, available to the research community, includes detailed settings that facilitate such investigations. We will further inspect the final layout and tool settings, and report them in revision.

We appreciate your time and effort in helping us improve our paper.

最终决定Accept (spotlight)

2024-09-25

This submission models Adder and Multiplier design as a single player tree generation problem for RL agents. AddGame supports the prefix tree generation for Adder/Multiplier design and utilizes the Monte Carlo Tree Search RL agent to find the optimal adder. MultGame supports compressor tree generation for Multiplier design and utilizes Proximal Policy Optimization solver to find the optimal multiplier. Optimal adder/multiplier configs are selected based on the delay/area tradeoff each design config. The authors use a synthesis tool, optimized for runtime (skip routing step w/o impact to area/delay estimate), to generate the delay/area metrics using N7 tech PDK. The experiments section compares Adder/Multiplier designs generated by AddGame/MultGame against previous state-of-the-art Adder/Multipliers generated by Manually/via Default Synthesis Recipe/Other Learning based techniques. The comparison data shows superior area/delay designs generated by AddGame/MultGame for a range of bit sizes (8-64 bit Mult, 32-128 bit Add). These results have tremendous practical application as Adder/Multipliers are the primary compute blocks used for computationally intensive benchmarks, AIML and otherwise.

During the discussion phase reviewers asked several clarification questions related to the RL agent setup, the Synthesis/PnR tool settings, tech node selection and potential applicability of the proposed techniques to HW modules other than Adder/Multipliers. The authors have done a great job of adding detailed responses to all such questions.

Reviewer RWUd, raised a great question regarding specific datapoints in Figure 6, left chart, showing Sklansky adder with lower delay than Kogge-stone adder. The reviewers provide a convincing explanation of this datapoint highlighting the limitations of HW resources (routing/wire congestion) as the cause of this non-intuitive datapoint.

The paper is well organized, and the figures and the text explain all key concepts well. I recommend this paper be accepted for NeurIPS.

Edit: Updated my recommendation to spotlight (prev Oral) post discussion with SAC.