ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization
We propose accelerating pretrained LLMs through a "post-training" shift and add reparameterization, towards efficient multiplication-less LLMs, dubbed ShiftAddLLM.
Abstract
Reviews and Discussion
In this paper, “ShiftAddLLM: Accelerating Pretrained LLMs via Post-Training Multiplication-Less Reparameterization,” the authors propose a novel post-training reparameterization method that enables multiplication-free operations in large language models (LLMs). Through a modified APoT (Additive Powers-of-Two) quantization and a new post-training procedure, the authors achieve higher accuracy than other quantization methods on LLMs while reducing quantization errors. The method also offers a fast quantization process without a retraining procedure, which is typically time-consuming and effort-intensive for LLMs. Once the parameters are quantized, multiplication can be performed with simple adders and a LUT-based query multiplier that stores precomputed multiplication results.
This work makes a significant contribution to the field of LLM optimization by introducing a novel method that enhances efficiency and reduces computational demands without sacrificing accuracy. Its strengths lie in its innovative approach to error reduction and automated bit allocation, which together offer a compelling solution for deploying LLMs on resource-constrained edge devices. However, the method's reliance on specialized hardware support and the need for further testing in diverse scenarios are notable limitations. Despite these weaknesses, the paper's advancements in quantization strategies and performance improvements make it a valuable addition to current research efforts in model compression and efficient AI deployment.
Strengths
By replacing multiplication operations in both the attention and multi-layer perceptron (MLP) layers, the logic area and power can be reduced. Unlike traditional quantization methods that reduce either activation errors or weight errors, this work reduces both by making each error-reduction step aware of the other term (the multi-objective optimization method). To my knowledge, this is the first work that attempts to reduce errors for both weights and activations. Finally, the method is supported by an automated bit allocation strategy, which further simplifies applying it to LLMs.

Many APoT quantization variants [2, 3, 4] have appeared since the original method was introduced in 2019 [1], so the quantization format itself is established rather than new. However, I think this paper makes meaningful adaptations compared to those similar works. First, the performance of this PTQ (post-training quantization) method is indeed better than that of other recent quantization strategies. APTQ [5] also builds on Optimal Brain Quantization (OBQ) and uses the Hessian trace as a sensitivity metric to reduce quantization error in an attention-aware manner. Compared to APTQ, another SoTA work, this work shows better performance under the same quantization setting (3 bits). Table 1 of [5] reports the WikiText-2 perplexity of LLaMA-1-7B; this work achieves a better perplexity of 6.04, which is 0.72 lower than APTQ's. Unfortunately, APTQ does not report results for larger models (>13B), so this is the only comparison that could be made.
[1] Li, Yuhang, Xin Dong, and Wei Wang. "Additive Powers-of-Two Quantization: An Efficient Non-uniform Discretization for Neural Networks." International Conference on Learning Representations, 2020. (First posted on arXiv in 2019.)
[2] Geng, Xinkuang, et al. "Compact Powers-of-Two: An Efficient Non-Uniform Quantization for Deep Neural Networks." 2024 Design, Automation & Test in Europe Conference & Exhibition (DATE). IEEE, 2024.
[3] Przewlocka-Rus, Dominika, et al. "Power-of-two quantization for low bitwidth and hardware compliant neural networks." arXiv preprint arXiv:2203.05025 (2022).
[4] Oh, Sangyun, et al. "Automated log-scale quantization for low-cost deep neural networks." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2021.
[5] Guan, Ziyi, et al. "APTQ: Attention-aware Post-Training Mixed-Precision Quantization for Large Language Models." arXiv preprint arXiv:2402.14866 (2024).
Weaknesses
The multi-objective optimization method is quite novel and impressive, but I think it needs to be tested on more cases. As I mentioned, this is the first work to introduce this method. One drawback is that the quality of quantization is not greatly improved, and we do not know how much performance improvement there would be if the column-wise scaling factors were used, since GPU kernels supporting them do not currently exist. It does, however, offer another layer of trade-off between latency and accuracy by letting the user choose between column-wise and block-wise scaling factors. Another point I would like to note is that the automated bit allocation may not be fully effective, since the differing sensitivity among an LLM's layers and the use of mixed precision degrade the advantage of this work. The use of a LUT-query-based multiplier and mixed precision for layers with different sensitivities is a good idea, but the allocation is derived for one LLM model and may not be applicable to different LLM models.
Questions
N/A
Limitations
N/A
We greatly appreciate your careful review and constructive suggestions. Below are our detailed responses to your concerns.
W1: The multi-objective optimization method is quite novel and impressive, but it needs to be tested with more cases.
Thank you for acknowledging the novelty of our multi-objective optimization method! Following your suggestions, we’ve conducted additional tests and comparisons on the OPT and LLaMA model families.
First, we conducted additional ablation studies using both OPT and LLaMA models, supplementing the results in Table 8 of the submitted manuscript. The new results are presented in Table 5 of the attached PDF in our global response. Our multi-objective optimization demonstrates superior performance, achieving average perplexity reductions of 20.85/8.56/112.26 on OPT models and 1.00/1.21/2.25 on LLaMA models compared to the weight-only objective, the activation-only objective, and their vanilla combination, respectively.
Second, we tested the average layer-wise quantization errors for a direct comparison. The results are shown in the table below. Our multi-objective optimization achieves significantly lower quantization errors for both weights and output activations. Specifically, we observed average per-parameter weight quantization error reductions of 0.4 and 0.1, and total output activation error reductions of 18.9 and 53.1 compared to OPTQ and LUT-GEMM, respectively.
| Model | Methods | Wei. Error (Per Param.) | Output Act. Error (Total) |
|---|---|---|---|
| OPT-350M | Wei. Obj. | 0.08 | 31.15 |
| OPT-350M | Act. Obj. | 0.32 | 23.14 |
| OPT-350M | Ours Multi-Obj. | 0.02 | 8.83 |
| OPT-2.7B | Wei. Obj. | 0.12 | 155.38 |
| OPT-2.7B | Act. Obj. | 0.55 | 68.07 |
| OPT-2.7B | Ours Multi-Obj. | 0.01 | 27.15 |
| LLaMA-2-7B | Wei. Obj. | 0.17 | 12.22 |
| LLaMA-2-7B | Act. Obj. | 0.37 | 5.24 |
| LLaMA-2-7B | Ours Multi-Obj. | 0.02 | 3.50 |
These tests demonstrate that our multi-objective optimization achieves lower quantization errors and better model accuracy compared to previous methods focused solely on weight-only or activation-only objectives.
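For concreteness, the error metrics in the table above can be expressed as in the minimal NumPy sketch below (our illustration; the paper's actual multi-objective formulation and optimization procedure are described in Section 4.2 of the manuscript, and the weighting term `alpha` here is a hypothetical stand-in):

```python
import numpy as np

def weight_error(W, W_hat):
    # per-parameter weight quantization error
    return float(np.mean((W - W_hat) ** 2))

def activation_error(X, W, W_hat):
    # total output-activation error on calibration activations X
    return float(np.sum((X @ W - X @ W_hat) ** 2))

def multi_objective(X, W, W_hat, alpha=1.0):
    # illustrative combination of both error terms; the actual objective
    # and its weighting in ShiftAddLLM are defined in the manuscript
    return weight_error(W, W_hat) + alpha * activation_error(X, W, W_hat)

# toy usage with random data standing in for a real layer
rng = np.random.default_rng(0)
X = rng.standard_normal((16, 64)).astype(np.float32)    # calibration activations
W = rng.standard_normal((64, 64)).astype(np.float32)    # original FP weights
W_hat = W + 0.01 * rng.standard_normal(W.shape).astype(np.float32)  # reparameterized weights
print(weight_error(W, W_hat), activation_error(X, W, W_hat))
```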
W2: The use of mixed precision for layers with different sensitivities is a good idea, but it is designed for one LLM model. How effective is the automated bit allocation? And how applicable is this approach to different LLM models?
Thank you for your constructive questions! We agree that the effectiveness and applicability of our automated bit allocation need to be validated across different LLM models.
To address your questions, we evaluated our mixed bit allocation strategy and compared Ours (Mixed) with Ours (Lat.). The results are shown in Table 6 of the attached PDF in our global response. Ours (Mixed) further reduces perplexity by an average of 96.86, 3.23, and 2.63 for OPT, LLaMA, and Gemma models, respectively, under comparable or even less latency. This set of experiments further validates the applicability of our automated bit allocation strategy to different LLMs.
In addition, we want to clarify that, for each model, we search for the optimal bit allocation with negligible overhead (e.g., 1%~10% of the reparameterization time). For example, it takes 0.5 seconds for searching versus 72 seconds for reparameterizing OPT-125M with a single bit configuration, and 1 minute for searching versus 13 minutes for reparameterizing OPT-13B with a single bit configuration. This is achieved by leveraging the proposed proxy criteria (as shown in Eq. 3 of the submitted manuscript), instead of searching according to the reparameterization errors, which is time-consuming and requires running models at each bit. Using the proxy criteria, the bit allocation candidate rankings are highly correlated with the rankings obtained using actual reparameterization errors, with a Kendall's τ of 0.910/0.905/0.915 for OPT-125M/1.3B/13B and 0.931/0.929/0.897 for LLaMA-7B/13B/8B, respectively.
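To make the validation procedure concrete, here is a minimal sketch of how proxy-based rankings can be checked against true-error rankings with Kendall's τ (the real proxy criterion is Eq. 3 of the manuscript; `proxy_score` and `true_error` below are hypothetical stand-ins for illustration):

```python
import numpy as np
from scipy.stats import kendalltau

def proxy_score(candidate):
    # stand-in for the cheap proxy criterion (Eq. 3 of the manuscript)
    return -float(np.mean(candidate))

def true_error(candidate, rng):
    # stand-in for the actual reparameterization error, which in practice
    # requires reparameterizing the model once per candidate allocation
    return -float(np.mean(candidate)) + 0.05 * rng.standard_normal()

rng = np.random.default_rng(0)
# candidate mixed-bit allocations: one bit-width (2 or 3) per linear layer
candidates = [rng.choice([2, 3], size=24) for _ in range(50)]

proxy = [proxy_score(c) for c in candidates]
actual = [true_error(c, rng) for c in candidates]
tau, _ = kendalltau(proxy, actual)
print(f"Kendall's tau between proxy and true-error rankings: {tau:.3f}")
```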
In summary, our proposed automated bit allocation strategy is effective and applicable to different LLMs with negligible overhead using the proposed proxy criteria.
I will maintain my positive score for this paper.
Dear Reviewer 4dkb,
We thank you for the prompt response and for maintaining the positive rating score! We appreciate your constructive suggestions and will incorporate the new experimental results and corresponding analysis into our final revised manuscript.
Best regards,
Paper 12025 Authors
In this paper, the authors propose the ShiftAddLLM method to simplify complex matrix multiplications using simple shift and add operations. To enable efficient computation, they suggest assigning scaling factors in a column-wise and block-wise manner, where these scaling factors follow the form of powers of 2, replacing scalar multiplications with bit shifts. This scaling factor assignment provides an efficient trade-off between latency and accuracy. The authors conduct experiments using several well-known large language models (LLMs) across various tasks.
Strengths
- The proposed methods consider quantization techniques not only for accuracy but also for efficient computation, addressing the growing demand for LLM services with an increasing number of parameters.
- The bit allocation method, though simple, proves effective and is demonstrated through various models.
- The authors show that it is possible to constrain scaling factors to follow a specific form, enabling shift operations instead of multiplications, with reasonable accuracy degradation.
- The experiments encompass various design explorations.
Weaknesses
- It appears that the authors do not discuss the impact of batch size on performance and focus solely on latency. What would be the limitations when the batch size increases in terms of throughput? While latency is an important factor, the overall service cost is dominated by throughput, which can be improved by increasing the batch size and by reducing memory footprint through quantization. The experimental results should include a description of batch sizes. If only a batch size of 1 is considered in the manuscript, the authors should explain why this consideration is practical.
- The authors claim that replacing multiplications with shift operations is important. While this might benefit ASIC design for AI inference, GPUs already have numerous multiplication units. The authors demonstrate these benefits using an Eyeriss-like hardware simulation. If this paper focuses on new ASIC design, it should be rewritten to discuss the necessary ASIC design comprehensively. Discussing shift operations alone is insufficient for a thorough discussion on new ASIC design.
- The authors do not include recently published state-of-the-art quantization methods. For example, FlexRound and OmniQuant are more advanced schemes that should be considered. LUT-GEMM does not suggest new quantization formats but addresses efficient computation methods for previously existing quantization schemes.
- This reviewer is skeptical about using perplexity (PPL) as the main metric for experiments. Measuring scores for MMLU and conducting A/B tests (using AlpacaEval with GPT-4) would better represent the quality of the proposed scheme.
- For 3-bit and 2-bit experiments, even though the proposed method might be superior to previous ones, the authors do not present results for a 4-bit setup. Compared to full-precision, score degradation is noticeable with 3-bit and 2-bit methods. This reviewer cannot understand why the authors selected such extremely low-bit methods only.
Questions
Please refer to the weaknesses above.
Limitations
- The authors do not discuss various batch sizes, which are highly relevant to the limitations of the proposed scheme.
- Only very low-bit quantization schemes are considered, and the authors do not explain why this selection was made.
- It is difficult to estimate latency improvement from the manuscript. Do the authors plan to release open-source code?
We appreciate your time and suggestions in reviewing our work. Below are our detailed responses to your concerns.
W1 & L1: Discuss the impact of batch sizes on throughput. If only a batch size of 1 is considered, explain why this is practical?
Following your suggestion, we have further tested the throughput of our CUDA kernels and end-to-end models with increased batch sizes, as demonstrated in Figure 1 of the attached PDF in our global response. Our ShiftAddLLM still outperforms all three baselines at a batch size of 8 in terms of accuracy-efficiency trade-offs, achieving on average 3.37x/2.55x/1.39x throughput improvements compared to OPTQ, AWQ, and LUT-GEMM at similar or much better accuracy.
Previously, we assumed a batch size of one for mobile applications where only one user is using the LLM. This assumption also stems from the sequential nature of LLMs during generation, i.e., generating one token at a time based on all previously generated contexts. The assumption of a batch size of 1 is also used in previous literature, such as AWQ, OPTQ, and LUT-GEMM, to measure the latency or throughput for LLM serving. We will clarify this assumption.
W2: The authors show the benefits via an Eyeriss-like hardware simulation, but GPUs already have numerous multiplication units. Should the paper be rewritten to address the new ASIC design if that is the focus?
We humbly clarify and emphasize that we reported real-measured GPU latency (see Lines 281-282 and Figure 6 of the submitted manuscript) instead of simulated results to demonstrate the up to 65% latency savings, which benefit from our proposed shift-and-add reparameterization and dedicated CUDA kernel optimization for GPUs.
Regarding energy savings, since we cannot directly measure energy on GPUs, we used existing simulators to estimate the energy savings. However, our focus remains on showing the practical benefits of current GPUs. We are not aiming to propose a new ASIC design in this work, as that is not our main focus.
W3: Benchmark with FlexRound and OmniQuant?
As suggested, we further compare our ShiftAddLLM with both FlexRound and OmniQuant on OPT and LLaMA models. As shown in Tables 2 & 3 of the attached PDF in our global response, our ShiftAddLLM consistently shows better accuracy-efficiency trade-offs, achieving average perplexity reductions of 0.15 (4-bit) / 0.39 (3-bit) and 0.30 (4-bit) / 0.52 (3-bit) compared to FlexRound and OmniQuant, respectively. Note that the baseline results are obtained directly from the original papers and the follow-up work LRQ [1]. In addition, we tested OmniQuant at 2 bits ourselves and found that it fails on OPT models, whereas ours performs well on OPT models and also achieves an average 1.96 perplexity reduction compared to OmniQuant on LLaMA at 2 bits.
[1] LRQ: Optimizing PTQ for LLMs by Learning Low-Rank Weight-Scaling Matrices, arXiv'24
W4: Skeptical about using perplexity as the main metric. Show scores for MMLU and conduct A/B tests?
We acknowledge that perplexity is not a gold-standard metric and therefore also provided the accuracy on seven downstream tasks in Table 5 of the submitted manuscript.
Furthermore, as suggested, we extend the evaluation to include MMLU and an A/B test using AlpacaEval with GPT-4. The results are shown in Table 4 and Figure 2 of the attached PDF in our global response. Our ShiftAddLLM consistently achieves 3.58% accuracy improvements over OPTQ and 3.83% over LUT-GEMM on MMLU when using the OPT-66B model. For the A/B tests, we used the GPT-4 score to evaluate the quantized models against the FP16 counterpart. Our ShiftAddLLM achieves 8.6%/20.7% higher winning rates than OPTQ (29.3%) and LUT-GEMM (17.2%) when using the LLaMA-2-7B model. We provide an example of the generation comparison in Figure 2 of the attached PDF.
W5 & L2: Do not present results for a 4-bit setup. Why select such extremely low-bit methods like 3/2 bits only?
As requested, we have provided the 4-bit results in Table 1 of the attached PDF in our global response. These results show that ShiftAddLLM consistently outperforms the baselines at 4 bits, achieving average perplexity reductions of 0.90/1.32/1.00 and 0.44/0.22/0.02 as compared to OPTQ / LUT-GEMM / AWQ, using OPT models and LLaMA models, respectively. In addition, we have also included comparisons with FlexRound and OmniQuant at 4 bits (see response to W3).
We consider lower-bit quantization because we aim to push the accuracy-efficiency boundary to lower bits with minimal accuracy compromise. This is meaningful for large-scale LLMs, which remain memory-bound even at 3 bits. As analyzed using the Roofline model shown in Figure 5 of [2], for Nvidia A6000 GPUs, the turning point from memory-bound to compute-bound is at an arithmetic intensity of 200 OPs/byte. For LLaMA-7B models, all the operators in the decode/generation phase have an arithmetic intensity of around 1 or less, as shown in Table 1 of [2]. Even at 4 bits, the arithmetic intensity is only approximately 1 × 32/4 = 8 (same OPs but only 4/32 of the memory accesses), which is far below the turning point of 200 and thus remains memory-bound, let alone for larger models like LLaMA-70B or beyond. Reducing from 4 bits to 2 bits helps increase the arithmetic intensity and thus the theoretical maximum performance by 2x, from 6144G OPS to 12288G OPS. If memory is not a bottleneck, as in much smaller cases or prefill stages, higher bits can be used for better accuracy. Our goal is to offer an additional option and trade-off for large, memory-bound cases, without forcing the exclusive use of 2 bits.
[2] LLM Inference Unveiled: Survey and Roofline Model Insights, arXiv'24
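For reference, the back-of-the-envelope check above can be reproduced with the following sketch (the ~200 OPs/byte ridge point, the ~1 OPs/byte baseline, and the 32-bit reference width are the numbers quoted from [2] and the response above; they are assumptions of this illustration, not new measurements):

```python
# Roofline back-of-the-envelope check using the numbers quoted above.
RIDGE_POINT = 200   # OPs/byte where an A6000 turns compute-bound, per [2]
BASELINE_AI = 1     # decode-phase arithmetic intensity at full precision, per [2]
BASELINE_BITS = 32  # reference weight width used in the calculation above

def arithmetic_intensity(weight_bits):
    # same number of OPs, but only weight_bits/BASELINE_BITS of the weight bytes
    return BASELINE_AI * BASELINE_BITS / weight_bits

for bits in (4, 3, 2):
    ai = arithmetic_intensity(bits)
    regime = "memory-bound" if ai < RIDGE_POINT else "compute-bound"
    print(f"{bits}-bit weights: arithmetic intensity ~ {ai:.1f} OPs/byte -> {regime}")
```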
L3: Difficult to estimate latency improvement. Plan to release the code?
Yes, we do plan to open-source the code to ensure reproducibility, as we also promised in the abstract.
We have reported the real-measured GPU latency. Our ShiftAddLLM achieves 6.5% ~ 65.0% latency reductions for OPT/LLaMA models, at similar or even lower perplexity.
I sincerely appreciate the authors' efforts to address my concerns. I am particularly pleased with the additional experimental data provided in response to my previous comments. The extended experiments presented in the attached PDF have effectively demonstrated the impact of the multi-objective function proposed in the manuscript.
However, I still have the following concerns:
- The proposed approach appears to be a straightforward combination of MSFP and LUT-GEMM. While the manuscript provides a thorough introduction to LUT-GEMM, it would be beneficial to also introduce MSFP and clarify how the revised manuscript differentiates this approach.
- The results related to the ASIC design remain unclear. If possible, including detailed explanations about Eyeriss in the appendix would be helpful. Without this, it is challenging to fully understand the claims regarding power and area reduction.
Overall, I am raising my score.
Dear Reviewer 7Vce,
Thank you very much for your feedback and for raising your score. We are particularly glad that our additional experiments and clarifications have effectively addressed many of your concerns. Regarding the remaining points you raised, please find our answers below:
C1: Clarification on MSFP and ShiftAddLLM (Our Approach)
Thank you for your suggestion to introduce MSFP more thoroughly and to differentiate our approach from it. We will ensure that MSFP [1] is clearly explained in the revised manuscript. Our approach builds on the foundation of LUT-GEMM and DenseShift [2] but differs significantly from MSFP. Specifically, while MSFP employs a shared exponent across a group of elements, our method applies powers-of-two quantization only to the scaling factors, allowing each to have a different exponent with a zero mantissa, while keeping the activations in standard FP16 format without shared exponents.
This differentiation, combined with our proposed incorporation of shifts, adds, LUT components, and tailored scaling factor patterns, enables our multi-objective optimization and mixed-bit allocation strategies. These contributions allow our approach to achieve extremely low-bit weight quantization (e.g., 2-bit) with minimal accuracy loss—something challenging for MSFP, DenseShift, LUT-GEMM, or their simple combinations.
[1] Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point, NeurIPS’20
[2] DenseShift: Towards Accurate and Efficient Low-Bit Power-of-Two Quantization, ICCV’23
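To make the exponent-only scaling concrete, below is a minimal NumPy sketch (ours, for illustration only; it is not the paper's CUDA kernel): multiplying an FP16 activation by a power-of-two scaling factor 2^k reduces to adding k to the activation's 5-bit exponent field.

```python
import numpy as np

def mul_by_power_of_two(act_fp16, k):
    # Multiply FP16 activations by 2**k by adding k to the exponent field.
    # FP16 layout: 1 sign bit | 5 exponent bits | 10 mantissa bits.
    # Sketch only: it ignores zeros, subnormals, infinities, and overflow,
    # which a real kernel must handle.
    bits = act_fp16.view(np.uint16).astype(np.int32)
    return (bits + (k << 10)).astype(np.uint16).view(np.float16)

x = np.array([1.5, -0.75], dtype=np.float16)
print(mul_by_power_of_two(x, 3))   # [12.0, -6.0], i.e., x * 2**3 without a multiply
```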
C2: Additional Clarifications on ASIC Design and Eyeriss
Thank you for highlighting this. To enhance the accessibility of our paper, we will include detailed explanations about the Eyeriss [3] architecture and how we adapt it. Specifically, we modify the MAC array by replacing selected MAC units with shift, add, and LUT units, facilitating our proposed design’s area and energy efficiencies (26%~89% savings). We will also incorporate a figure inspired by NASA [4] to visually demonstrate this modification, which we hope will clarify our claims regarding power and area reduction.
[3] Eyeriss: An Energy-Efficient Reconfigurable Accelerator for Deep Convolutional Neural Networks, JSSC’17
[4] NASA: Neural Architecture Search and Acceleration for Hardware Inspired Hybrid Networks, ICCAD’22
We will add these additional discussion points to our final revised manuscript. We hope that our response, together with the inclusion of these discussion points in the paper, fully addresses your concerns.
Thank you very much once again for the time and consideration you’ve given to our paper. Your review, together with other reviewers, has certainly helped improve and strengthen our work.
Best regards,
Paper 12025 Authors
This response is somewhat disappointing.
First, the MSFP format allows for a shared exponent to be used as a scaling factor, with the exponent represented as an integer. Fundamentally, this means that MSFP and ShiftAddLLM employ very similar schemes for scaling factor assignment. While it is true that MSFP assumes quantization of activations, when it comes to weights, this reviewer sees no significant differences between the formats used by MSFP and ShiftAddLLM.
Second, this paper is confusing regarding the intended target system. The authors present performance results using GPUs, which are already commercialized and optimized for large batch sizes. Then, they suddenly introduce experimental results based on an ASIC design, focusing on power and area reduction using the proposed scheme. Furthermore, the authors unexpectedly claim that small batch sizes were employed for mobile applications. This creates a very confusing narrative. What is the actual target? Are the authors suggesting that ShiftAddLLM is effective across all platforms (mobile, GPUs, and ASIC)? It would be much more effective to select a single, clear target system and discuss the pros and cons of the proposed scheme in that context.
At this point, I am unsure whether this paper demonstrates clear novelty or a well-defined target system.
Dear Reviewer,
Could you please review the rebuttal, discuss it with the peer reviewers, and finalize your rating?
Thank you for your efforts!
Regards,
AC
Response [1/2]
Dear Reviewer 7Vce,
We were very encouraged by your previous feedback that our initial rebuttal successfully addressed your major concern regarding batching results and clarified both the motivation for very low-bit quantization and our real-GPU latency (as opposed to using a new ASIC). We believe that adding these additional experimental results and clarifications will further strengthen our work and its contributions to the community. Regarding the new discussion, we humbly seek to clarify our points and hope to reach a consensus in our understanding.
For the first point, there were potential misunderstandings. We agree with you that MSFP shares exponents and shifts the mantissa accordingly, mimicking multiplication by powers of two. However, while we fully recognize MSFP's unique contributions and innovations, we humbly clarify that our approach differs from MSFP in two key aspects:
- Nature of Approach: MSFP uses shared exponents but relies on various shifted mantissas to represent the weights; without this, all weights would collapse to the same value. In contrast, we do not use shared exponents for scaling factors and eliminate the need for a mantissa. In particular, each scaling factor is represented as a distinct power-of-two integer (equivalent to the exponent in a floating-point number, completely removing the mantissa bits). In this way, the multiplication between a floating-point activation and a power-of-two integer scaling factor can be simplified to adding the corresponding integer to the exponent bits of the floating-point activation, as described in Figure 1(c) of the submitted manuscript. In addition, rather than sharing the exponents, the entire scaling factor in ShiftAddLLM is shared across groups of binary weights in a column/block-wise manner, as illustrated in Figure 3(a) and detailed in Section 4.2 of the submitted manuscript, carefully designed to optimize both weight quantization and output activation errors without conflicts. Hence, there are clear differences between the MSFP datatype and our quantization scheme. In fact, our method is orthogonal to MSFP and can be combined with it by representing input activations in MSFP for more aggressive performance improvements.
- Determining Shared Exponents or Scaling Factors: The method for determining shared exponents in MSFP or shared scaling factors in our quantization scheme is different. MSFP selects the maximum exponent to share across the bounding-box size, i.e., the number of elements sharing one exponent [1], which is simpler to implement yet may not be as adaptive. In contrast, in our ShiftAddLLM, the reparameterized binary weights and scaling factors result from multi-objective optimization, which adaptively designs scaling factor patterns to avoid conflicts between optimizing weight errors and optimizing output activation errors.
Finally, in terms of the performance outcomes, MSFP at 4 bits (1-bit sign and 3-bit mantissa) already suffers from large quantization errors, as evidenced by the significant KL divergence shown in Figure 3 of [1]. In contrast, our ShiftAddLLM at 3 or 4 bits can still achieve comparable accuracy to FP baselines.
To directly compare ShiftAddLLM with MSFP, we conducted additional experiments to compare (1) quantization errors and (2) KL divergence for both methods against their floating-point counterparts. We randomly selected ten weight matrices from OPT-350M and quantized or reparameterized them using both methods. The results, summarized in the table below, indicate that ShiftAddLLM consistently outperforms MSFP, achieving lower KL divergence by 0.0065, 0.0271, and 0.0952, and reducing quantization errors by 1707.3, 3251.1, and 5862.0 at 4-bit, 3-bit, and 2-bit quantization, respectively.
| Methods | Bits | Avg. KL Divergence | Avg. Quant. Error |
|---|---|---|---|
| MSFP (bounding-box size = 128) | 4 | 0.0117 | 4129.1 |
| ShiftAddLLM (group size = 128) | 4 | 0.0052 | 2421.8 |
| MSFP (bounding-box size = 128) | 3 | 0.0434 | 7859.9 |
| ShiftAddLLM (group size = 128) | 3 | 0.0163 | 4608.8 |
| MSFP (bounding-box size = 128) | 2 | 0.1485 | 14355.7 |
| ShiftAddLLM (group size = 128) | 2 | 0.0533 | 8493.7 |
[1] Pushing the Limits of Narrow Precision Inferencing at Cloud Scale with Microsoft Floating Point, NeurIPS’20
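For transparency, the two metrics in the table can be computed as in the sketch below (our illustration: `fake_quantize` is a trivial uniform quantizer standing in for MSFP or the ShiftAddLLM reparameterization, and the histogram binning used for the KL divergence is an implementation choice of this sketch, not necessarily the one used in [1] or in our experiment):

```python
import numpy as np

def kl_divergence(orig, quant, bins=256, eps=1e-8):
    # KL divergence between value histograms of original and quantized weights
    lo = min(orig.min(), quant.min())
    hi = max(orig.max(), quant.max())
    p, _ = np.histogram(orig, bins=bins, range=(lo, hi))
    q, _ = np.histogram(quant, bins=bins, range=(lo, hi))
    p = p.astype(np.float64) + eps
    q = q.astype(np.float64) + eps
    p /= p.sum()
    q /= q.sum()
    return float(np.sum(p * np.log(p / q)))

def quant_error(W, W_hat):
    # total squared quantization error of one weight matrix
    return float(np.sum((W - W_hat) ** 2))

def fake_quantize(W, bits=3):
    # placeholder quantizer for illustration; replace with MSFP or the
    # ShiftAddLLM reparameterization to reproduce the comparison above
    scale = np.abs(W).max() / (2 ** (bits - 1) - 1)
    return np.round(W / scale) * scale

rng = np.random.default_rng(0)
W = rng.standard_normal((512, 512)).astype(np.float32)
W_hat = fake_quantize(W, bits=3)
print(kl_divergence(W.ravel(), W_hat.ravel()), quant_error(W, W_hat))
```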
Thank you for your thoughtful and detailed responses. However, I feel the need to clarify the points I raised in my previous comments:
MSFP vs ShiftAddLLM: The authors primarily discuss how to obtain specific numbers for two different formats. While I fully acknowledge the efficient optimization methods proposed, my concern is that if the Shift and Add operations are crucial (as implied by their inclusion in the title), why is there no reference to MSFP in the paper? MSFP allows scaling factors to follow a particular structure that also enables efficient operations, such as shift and adder for exponent computations. If the authors intend to emphasize that the optimization method is key rather than the format itself, this distinction should be clearly described, while acknowledging that the format itself has parallels, such as with MSFP (row-wise or column-wise assignment would be minor in this sense). Unfortunately, the authors address quantization error or KL divergence, which I did not mention in my previous comment. My point is that if the ‘shift’ and ‘add’ operations are so significant as to be highlighted in the title, why do the authors focus more on accuracy rather than on the unique features of ShiftAddLLM, especially in comparison to MSFP? When I ask specific questions, the authors seem to provide indirect answers.
Target System: I am inquiring about the pros and cons of the proposed method for different target systems. I am also asking about the fundamental limitations of the proposed computation scheme. It’s understood that there may not be a single perfect computation engine, but a particular scheme could be especially efficient for certain scenarios.
Overall, I appreciate the merits of the work, particularly the further optimization to obtain quantized weights and efficient operations, especially for small batch sizes. However, the rebuttal and manuscript seem to mix too many messages, with limitations and relevant works not adequately recognized.
Nonetheless, as the authors have made significant efforts to address my concerns, I have decided to raise the score to 5.
Response [2/2]
For the second point, as emphasized in our submitted manuscript, our primary focus is on GPU acceleration, specifically through the development of dedicated CUDA kernel support. It is worth noting that we intentionally did not delve into specific ASIC designs, which were referenced only to demonstrate potential energy savings. We hope to humbly clarify that while we appreciate and follow your suggestion to describe the ASIC design used in our energy-saving experiment, it is not the primary target of our work. In addition, our ShiftAddLLM has been evaluated and outperforms the baselines on both (1) a single GPU (bs=1) for single-user interactions, which follows the settings in SOTA prior works like AWQ, OPTQ, and LUT-GEMM and can be used for mobile applications, and (2) cloud data-center setups (bs > 1) supporting multiple users.
Finally, as recognized by the other two reviewers, our method's contributions extend beyond reparameterization. They also include (1) the crucial multi-objective optimization and (2) the automated mixed-bit allocation strategy, for which we have provided additional results to fully demonstrate their effectiveness (please refer to our response to Reviewer 4dkb’s W1/W2 for more details).
In summary, we have provided experiments and clarifications to address all the concerns and comments you raised:
- The impact of batch sizes on throughput and why consider batch size of 1:
- We have clarified in our reply to your W1 & L1 and included the results in Figure 1 of the attached PDF.
- ASIC Design vs. GPU Acceleration:
- We have clarified that our results are real-measured GPU speedups.
- Comparison with FlexRound and OmniQuant:
- We have included the results in Tables 2 and 3 of the attached PDF.
- Show scores for MMLU and conduct A/B tests using AlpacaEval with GPT-4:
- We have included the results in Table 4 and Figure 2 of the attached PDF.
- Lack of results for a 4-bit setup:
- We have included the results in Table 1 of the attached PDF.
- Why reduce from 4 bits to 3 bits or even 2 bits:
- We have clarified in our reply to your W5 & L2.
- Comparison with MSFP (asked during the discussion phase):
- We have clarified the differences and included the comparison in this response.
Best regards,
Paper 12025 Authors
The presented work replaces multiplications with shift and add operations as a post-training processing step for LLM neural network models. The proposed quantization improves substantially over SOTA methods through better trade-offs and better control of the quantization error. Despite bit-level operations, the resulting models still seem to execute well and fast on GPUs.
Strengths
Unlike SOTA approaches, the proposed method is a post-processing step applied to a model and does not require any data-intensive steps (e.g. (re-)training or fine-tuning).
Weaknesses
Clarity of the presentation could be improved.
The technique is not specific to LLMs; the presentation even lacks reference to any specific NN architecture. Therefore it is a bit unfortunate that even the title already names the technique ShiftAddLLM. As the major contribution suggests replacing multiplications in the context of matrices by shift and add operations, the field of applications may extend well beyond the restricted focus of LLMs, even outside machine learning. At least a discussion of this would be great, but I even suggest finding a better name for the technique and clearly distinguishing between the generality of the approach and the specific field of application in the discussed context.
Lines 41-42 mentions "... up to 31x energy and 26x area reductions (see Tab. 1)." Please refer to the exact number pairs in the Table to help the reader.
Lines 83-84: please add here a mention of the correct baseline for these perplexity improvements, as the experimental results are in fact perplexity degradations from the FP16 baseline.
Lines 141-146: the presentation of energy (and area) savings is a bit misleading. In fact, a naive implementation of 32-bit multiplication from shifts and adds as per Table 1 consumes 2x the energy of the 32-bit multiplication listed there. In addition, you don't list the LUT operations. This section could use some improvement: sum over all the operations in Figure 1 and compare it to the equivalent operations that it replaces, instead of just repeating some maximum savings from Table 1.
Lines 218-219: The numbers in the text are not mentioned in the referenced table of Figure 3. Please correct and explain in more detail.
Line 241, but also Figure 4 & 8: Please briefly introduce abbreviations of the layer identifiers, e.g. Q/K, V, Out., FC1, FC2.
Table 3: to me, moving from 3 bits to 2 bits hurts substantially w.r.t. perplexity while being at best marginally faster. If relevant at all, only memory savings are noticeable and enable the use of 2 bits for the first time (over SOTA baselines) with some slight compromises on perplexity. A discussion of this would be useful.
Table 5 lacks numbers of an FP16 baseline.
Figure 8: although the patterns support readability in other figures (4 & 7), they don't help here and I personally find them rather disturbing. I suggest either using different patterns for different colors here as well, or just colors with different luminance, as the order of colors is the same in both diagrams.
The two "Insights." sections (lines 181-185 and 230-234) do not add information and I suggest to remove them to gain space to improve clarity and add discussions elsewhere where needed.
The order and placement of tables and figures does not support the flow of reading the publication. A few suggestions:
Figure 1 is better placed in front of Section 4.1 (referenced there first).
Figure 2 is better placed in front of Section 4.2 (referenced there first).
Table 2 is better placed in front of Section 5.2 (referenced there first).
Tables 5 and 4 should switch numbers.
Tables 7 and 6 should switch numbers.
Section 5.4: the section title "Limitation and Broader Societal Impact Discussion" hints at more than limitations, but the section only discusses limitations. Please correct the title.
Typos:
Line 348: "... ShiftAdLLM." -> "... ShiftAddLLM."
Questions
Figure 1 could use some improved explanations how the Shifted Activations are turned into LUTs. It is also unclear where the "8-bit key" in the ShiftAddLLM block comes from and how it is being used. By intuition I would say that the binary weights select from the LUTs. Why is then "another" 8-bit key necessary? LUT output is FP16 if not mistaken, please state the format as well to add clarity.
Line 282: "... using an Eyeriss-like hardware accelerator..." Did you really use that in this work or in fact the cited "DNN-chip predictor" [67]? Please add clarity, especially if energy consumption was calculated and not measured.
Limitations
The authors already list a limitation: despite the fact that GPUs can still execute the resulting quantized model well and fast, different quantization schemes require customized CUDA kernels. Reference implementations using FP16 math run without any additional effort and usually require little customization to speed them up (e.g. enable using tensor cores in ML graphs).
We greatly appreciate your positive comments and constructive suggestions. Below are our detailed responses to your questions.
W1: The clarity of the presentation could be improved.
(1) Technique applicability beyond LLMs: discussion needed. You are correct that the idea is general and can be extended to other smaller models like CNNs [1] or ViTs [2]. Meanwhile, this work’s implementation is specifically dedicated to large-scale LLMs: It is the first instance of applying the shift-and-add technique at the scale of LLMs with billions of parameters. While many ideas perform well with models having millions of parameters, they often fail to scale effectively. Unlike previous methods that require additional training and do not yield good results for large-scale LLMs, our approach is uniquely tailored for LLMs. We incorporate "post-training" reparameterization and carefully designed scaling factor patterns, enabling multi-objective optimization for LLMs and ensuring superior performance compared to prior quantization methods. We will add this discussion to the final revision.
[1] ShiftAddNet: A Hardware-Inspired Deep Network, NeurIPS'20
[2] ShiftAddViT: Mixture of Multiplication Primitives Towards Efficient ViTs, NeurIPS'23
(2) Lines 41-42: Refer to the exact numbers. Thanks for pointing this out! The number is derived by comparing adds to multiplications in terms of the INT32 format. The energy savings are 3.1 / 0.1 = 31x, and the area savings are 3495 / 137 ≈ 26x. We will clarify this in the tables.
(3) Lines 83-84: Perplexity reduction or improvements? We apologize for the misuse of the word "improvements." We are actually referring to perplexity reductions, as lower perplexity denotes better results. We will correct this.
(4) Lines 141-146: The presentation of energy savings could be improved. We greatly value this suggestion and have computed a summed energy comparison between equivalent computations. We tested the matrix multiplication between weights and activations from one MLP layer of OPT-66B using: (1) FP16 MACs, (2) OPTQ with 3-bit weights, and (3) our ShiftAddLLM with 3-bit weights. The resulting energy consumption is 80.36 J, 18.48 J, and 9.77 J, respectively. Our method achieves energy savings of 87.8% compared to FP16 and 47.1% compared to OPTQ.
(5) Lines 218-219: Refer to numbers in the table. The number is derived by comparing the first two rows of Figure 3 (b). The perplexity is reduced by 16.3 - 9.6 = 6.7, and the latency overhead is (44.1 - 33.2) / 44.1 ≈ 24.7%. We will make this clear.
(6) Line 241: Briefly introduce abbreviations. Sure, in self-attention, Q/K/V refers to linear layers for queries, keys, and values, respectively. Out, on the other hand, refers to the output linear layer. In MLPs, FC1 and FC2 refer to the two adopted linear layers. We will clarify this.
(7) Table 3: Discussion on why reducing from 3 bits to 2 bits. Great point! We will add the following discussion: We aim to push the accuracy-efficiency boundary to lower bits with minimal accuracy compromise. This is meaningful for large-scale LLMs, which remain memory-bound even at 3 bits. As analyzed using the Roofline model shown in Figure 5 of [3], for Nvidia A6000 GPUs, the turning point from memory-bound to compute-bound is at an arithmetic intensity of 200 OPs/byte. For LLaMA-7B models, all the operators in the decode/generation phase have an arithmetic intensity of around 1 or less, as shown in Table 1 of [3]. Even at 3 bits, the arithmetic intensity is only approximately 1 × 32/3 ≈ 10 (same OPs but only 3/32 of the memory accesses), which is far below the turning point of 200 and thus remains memory-bound, let alone for larger models like LLaMA-70B or beyond. Reducing from 3 bits to 2 bits helps increase the arithmetic intensity and thus the theoretical maximum performance by 1.5x, from 8192G OPS to 12288G OPS. If memory is not a bottleneck, as in much smaller cases or prefill stages, higher bits can be used for better accuracy. Our goal is to offer an additional option and trade-off for large, memory-bound cases, without forcing the exclusive use of 2 bits. We will add this discussion to the final revision.
[3] LLM Inference Unveiled: Survey and Roofline Model Insights, arXiv'24
(8) Table 5 lacks an FP16 baseline. We provide the FP16 baseline results in Table 4 of the attached PDF in our global response. Our ShiftAddLLM at 3 bits achieves comparable or even better accuracy than the FP16 baseline, for example, 72.45 vs. 69.82 for BoolQ and on average 0.52% accuracy gain across eight tasks using the OPT-66B model.
(9) Figure 8: Remove the patterns. We will remove the patterns and use the recommended colors.
(10) Remove “Insights” sections. Will remove.
(11) Change the order of tables/figures. Will change orders accordingly.
(12) Correct Section 5.4 title. Will correct the title.
(13) Typos. Will correct it.
Q1: Figure 1 could benefit from improved explanations. Intuitively, the binary weights select from the LUTs? LUT output is FP16?
Your intuition is correct. We use the binary weights as the keys for the LUTs. The "8-bit key" refers to grouped binary weights, with eight binary weights forming an INT8 key. To construct the LUTs, we precompute the 256 (2^8) possible values for every eight elements of the shifted activations. If the shifted activation is an n-dimensional vector, we obtain n/8 LUTs, where the grouped binary weights are used as keys and the precomputed partial sums are stored as values. Yes, the LUT output is in FP16 format. We will add these details to the final revision for clarity.
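To make this concrete, here is a minimal NumPy sketch of the LUT construction and lookup described above (our illustration, not the actual CUDA kernel; it assumes the activations have already been shifted by the power-of-two scaling factors and that the binary weights take values in {+1, -1}):

```python
import numpy as np

def build_luts(shifted_act):
    # For every group of 8 shifted activations, precompute all 2**8 possible
    # partial sums over {+1, -1} binary weights, indexed by an 8-bit key.
    n = shifted_act.shape[0]
    assert n % 8 == 0
    groups = shifted_act.reshape(n // 8, 8).astype(np.float32)
    keys = np.arange(256, dtype=np.uint16)
    # signs[k, j] = +1 if bit j of key k is set, else -1
    signs = np.where((keys[:, None] >> np.arange(8)) & 1, 1.0, -1.0)
    return (groups @ signs.T).astype(np.float16)        # (n // 8, 256) LUTs

def lut_dot(luts, binary_weights):
    # Pack each group of 8 binary weights into an 8-bit key, then sum the
    # corresponding precomputed partial sums; the output stays in FP16.
    bits = (binary_weights.reshape(-1, 8) > 0).astype(np.uint16)
    keys = (bits << np.arange(8)).sum(axis=1)
    return np.float16(luts[np.arange(luts.shape[0]), keys].sum())

# toy check against a direct dot product
rng = np.random.default_rng(0)
a = rng.standard_normal(32).astype(np.float16)    # shifted activations
w = rng.choice([-1.0, 1.0], size=32)              # binary weights
print(lut_dot(build_luts(a), w), np.float16(a.astype(np.float32) @ w))
```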
Q2: Clarify whether using Eyeriss or DNN-chip Predictor.
We used the cited DNN-chip Predictor to simulate and calculate the energy (claimed to be within 18% of Eyeriss's chip measurement results). We will clarify this.
I thank the authors for addressing my comments well and with a lot of care.
I considered raising my scores, but decided against it, because I also agree with some comments from reviewer 7Vce, in particular the lack of batching results. This becomes more and more important for the deployment of LLMs.
Dear Reviewer 3fe7,
Thank you very much for taking the time to check our rebuttal, providing your positive feedback, and considering raising your score. We are very encouraged to hear that our rebuttal has addressed your comments well. Regarding the batching aspect, we would like to provide further experiment results and clarification as follows,
1. Sensitivity Analysis with Larger Batch Sizes.
In line with Reviewer 7Vce’s suggestions and to address the importance of batching in the deployment of LLMs, we have conducted additional experiments focusing on the throughput of our CUDA kernels and end-to-end models with increased batch sizes, from 1 to 8.
As shown in Figure 1 of the rebuttal PDF in our global response, our ShiftAddLLM continues to outperform all three baselines (OPTQ, AWQ, and LUT-GEMM) at a batch size of 8. Specifically, our method achieves throughput improvements of 3.37x, 2.55x, and 1.39x at iso-quality, respectively, while maintaining similar or even better accuracy (see Figure 1a and Figure 1b in the rebuttal PDF).
2. Why Batch Size = 1?
Our initial focus on a batch size of one was guided by the prevalent scenario in mobile applications, where individual user interactions typically involve sequential token generation. This assumption, which has been adopted in prior works like AWQ, OPTQ, and LUT-GEMM, reflects the real-world latency and throughput concerns in LLM serving. But we agree that the cloud is also an important use case, and we will make sure to discuss this case and provide ablation studies in the revised manuscript.
We hope these new results and clarification can effectively address your concerns regarding throughput at larger batch sizes. We greatly appreciate that your constructive review has helped further improve and strengthen our work, making our research more valuable for the community.
Best regards,
Paper 12025 Authors
Dear ACs and Reviewers,
First, we deeply appreciate the time and effort you have devoted to providing reviews for our paper, particularly given the substantial scale of a conference like NeurIPS. Your efforts are truly valued.
We are immensely grateful for the positive feedback our paper has received. The accolades, highlighting its innovative approach, significant contributions to the field of LLM optimization, well-executed and fast GPU implementation, absence of data-intensive steps like retraining or finetuning, reduced logic area and power, and a simple yet effective method, along with extensive and thorough experiments, are all deeply gratifying. It is particularly encouraging that these aspects have garnered such appreciation from the reviewers.
In addition to the aforementioned commendations, we have also received requests for additional experiments and further clarifications from reviewers. In response, we have conducted the requested experiments and provided detailed clarifications to the questions raised, as summarized below.
To summarize, the following experiments have been supplied:
- (1) Lack the numbers for an FP16 baseline:
- We have responded to Reviewer 3fe7’s W1-(8) and included the results in Table 4 of the attached PDF.
- (2) The impact of batch sizes on throughput:
- We have responded to Reviewer 7Vce’s W1 & L1 and included the results in Figure 1 of the attached PDF.
- (3) Comparison with FlexRound and OmniQuant:
- We have responded to Reviewer 7Vce’s W3 and included the results in Tables 2 and 3 of the attached PDF.
- (4) Show scores for MMLU and conduct A/B tests using AlpacaEval with GPT-4:
- We have responded to Reviewer 7Vce’s W4 and included the results in Table 4 and Figure 2 of the attached PDF.
- (5) Lack of results for a 4-bit setup:
- We have responded to Reviewer 7Vce’s W5 and included the results in Table 1 of the attached PDF.
- (6) More test cases for the multi-objective optimization:
- We have responded to Reviewer 4dkb’s W1 and included the results in Table 5 of the attached PDF.
- (7) More evaluation of the effectiveness and applicability of the automated bit allocation:
- We have responded to Reviewer 4dkb’s W2 and included the results in Table 6 of the attached PDF.
To summarize, the following questions have been clarified:
- (1) The technique's applicability beyond LLMs:
- We clarify this in our response to Reviewer 3fe7’s W1-(1).
- (2) Presentation details and clarity improvements:
- We clarify those in our response to Reviewer 3fe7’s W1-(2-6), (9-13).
- (3) Why reduce from 4 bits to 3 bits or even 2 bits:
- We clarify this in our response to Reviewer 3fe7’s W1-(7) and Reviewer 7Vce’s W5 and L2.
- (4) Figure 1 could benefit from improved explanations:
- We clarify this in our response to Reviewer 3fe7’s Q1.
- (5) Whether using Eyeriss or DNN-chip Predictor:
- We clarify this in our response to Reviewer 3fe7’s Q2.
- (6) Why consider the batch size of one:
- We clarify this in our response to Reviewer 7Vce’s W1 and L1.
- (7) New ASIC design and simulation instead of GPU acceleration:
- We clarify that our results are real-measured GPU speedups in our response to Reviewer 7Vce’s W2.
- (8) Plan to release the code:
- We promised this in the abstract of our submitted manuscript and here also clarify this in our response to Reviewer 7Vce’s L3.
Regarding Reviewer 7Vce’s questions about batch sizes, hardware simulation or real-measured GPU latency, downstream task performance, and comparison with FlexRound and OmniQuant, we've provided clarifications and results within the length limitations. We are open to providing further details in case any points still need to be clarified. As committed in the abstract, we will release both the codebase and pre-trained models, enabling others to replicate our results effectively.
We would greatly appreciate it if you could review our rebuttal responses. We hope that the new experiments and clarifications address your concerns. We are always willing to engage in further discussion, so please let us know if our responses do not fully resolve your concerns, and we will be happy to provide additional clarifications. Thank you!
Best regards,
Paper 12025 Authors
Dear Reviewers,
Thank you for your efforts. Please review the rebuttals, engage in the discussions, and provide your final ratings.
Thank you again for your valuable contributions.
AC
The paper introduces a novel method that replaces complex matrix multiplications with shift-and-add operations as a post-training optimization step in neural network models. This approach improves computational efficiency, particularly in terms of energy and area reductions, without the need for re-training, unlike state-of-the-art (SOTA) methods. Despite utilizing bit-level operations, the optimized models maintain strong performance on GPUs. Extensive experiments validate the effectiveness of the method. After a thorough discussion, all reviewers recommend accepting the paper. The AC agrees and suggests acceptance, with the authors expected to incorporate all revisions in the final version.