PaperHub
Overall rating: 6.0 / 10 (Poster; 4 reviewers; individual ratings 7 / 3 / 7 / 7; min 3, max 7, std 1.7)
Soundness: 2.8 · Contribution: 2.3 · Presentation: 3.0 · Confidence: 3.8
NeurIPS 2024

Mixture of Scales: Memory-Efficient Token-Adaptive Binarization for Large Language Models

Submitted: 2024-05-15 · Updated: 2024-11-06
TL;DR

BinaryMoS introduces memory-efficient token-adaptive binarization for LLMs, reducing model size and enhancing the representational ability of binarized weights.


关键词 (Keywords)
Large Language Models, Binarization, Quantization, Model Compression, Efficient LLM

Reviews and Discussion

Review (Rating: 7)

This paper studies the problem of post-training binarization of large language models. Building upon OneBit, this work proposes BinaryMoS, which uses a mixture of scaling weights for the linear layers of binarized LLMs. Instead of the static scales in OneBit, BinaryMoS employs a set of scaling weight experts and adaptively combines them given the input representation of the current token. As a result, BinaryMoS offers stronger representation capabilities while incurring minimal computation and memory overhead. Experimental results on various LLM benchmarks show the proposed technique compares favorably to other binarization methods for LLMs.

Strengths

  1. BinaryMoS has minimal computational and memory overhead, and keeps the architecture simple without introducing sparse weight matrices.
  2. The mixture-of-experts technique could potentially be applied to other quantization methods.
  3. The method works for a wide range of LLMs.
  4. The proposed method outperforms other binarization techniques on several LLM benchmarks.

Weaknesses

  1. The paper lacks a discussion of the cost of post-training adaptation. The training cost seems comparatively high, as it requires three epochs over the selected dataset.
  2. The improvement over OneBit is small, and the improved binarization still performs significantly worse than other, less aggressively quantized methods.

Questions

  1. While post-training binarization performs poorly compared to the original base model, methods like BitNet seem to bridge the gap by training low-bit models from scratch. Can your method extend to the scenario where the model is trained from scratch?

Limitations

The authors have adequately addressed the limitations of the work.

Author Response

We appreciate the constructive reviews, and here we address your comments in detail:

W1. High training cost

Though training three epochs over the selected dataset may seem costly, the datasets used for fine-tuning LLMs for quantization, particularly C4 and Wikitext2, are generally very small, making the training cost affordable. Our selected dataset consists of a total of 195M tokens, with 192M tokens from the C4 dataset and 3M tokens from the Wikitext2 dataset.

We will highlight the exact size of the selected dataset to clarify the training cost in the revised paper.

W2. Impact of BinaryMoS

Because the accuracy improvement of BinaryMoS over OneBit is not as dramatic as the improvement of OneBit over PB-LLM and BiLLM, there may be concerns that the proposed solution's improvement is marginal. However, we would like to highlight that the improvement of BinaryMoS over OneBit is substantial enough to make binary LLMs more applicable in practice. For example, both PB-LLM and BiLLM fail to generate grammatically correct sentences, while both OneBit and BinaryMoS can generate grammatically correct sentences. However, as presented in Table A4-1, BinaryMoS can generate contextually proper answers, whereas OneBit fails to generate correct answers. This is because BinaryMoS processes each token with token-adaptive scaling factors which contain contextual information.

We will also update the above discussion and Table A4-1 in the Appendix of the revised paper.

Table A4-1. Comparison of generation quality on the LLaMA-1-13B models with BinaryMoS and OneBit.

Prompt: A cowboy rides a _
BinaryMoS: A cowboy rides a wild, powerful horse around the prairie.
OneBit: A cowboy rides a pistol.
 
Prompt: There are a number of ways to reduce air pollution, such as _
BinaryMoS: There are a number of ways to reduce air pollution, such as using clean-burning fuels like natural gas. Natural gas provides better emissions than coal, or oil.
OneBit: There are a number of ways to reduce air pollution, such as cleaning machines more often for longer periods. Cleaning materials and products are less toxic.
 
Prompt: The capital of the state of New York is _
BinaryMoS: The capital of the state of New York is Albany, situated along the west bank of the Hudson.
OneBit: The capital of the state of New York is located in the eastern part of the northern and the central part of the south region of the United States.

Q1. Training from scratch.

Thank you for your question and for sharing your insight. The main novelty of the proposed BinaryMoS is its ability to increase the representational power of binary models by introducing the concept of Mixture-of-Scale, which can adaptively calculate proper scaling factors based on the context. This BinaryMoS architecture can also be trained from scratch. As you highlighted, post-training binarization inherently performs poorly compared to training-from-scratch approaches like BitNet, as post-training binarization involves far fewer training iterations. Hence, if we have enough training facilities, we can further improve the accuracy of binary models by adopting training-from-scratch approaches for BinaryMoS.

Thank you again for your valuable feedback. If you have any additional questions, please let us know.

Review (Rating: 3)

This paper proposes using a mixture of scales for binarizing continuous latent weights. It includes a router implemented by a linear layer, outputting a softmax score which is then used as the combining matrix for the scale basis (called scale experts). The method uses two scale bases, one for the input ($S_{in}$) and one for the output ($S_{out}$), each consisting of a number of scale vectors. The authors conducted an experimental analysis of the number of experts to be used, and evaluated the proposed method on different language models, with results showing performance gains from the mixture of scales.

Strengths

The proposed idea is pretty nice, simple, and shows performance gains.

Weaknesses

Weak significance: Although the idea is nice and valid, its scientific and technological significance and its originality are low. The paper could be more suitable for a workshop than a major venue like NeurIPS. The experimental analysis of the number of scaling experts shows that the performance does not always improve with the number of experts. This also strengthens the above opinion of low significance.

There are several minor typos to be corrected, such as using a consistent term for MoS.

Questions

Related to Eq. (4), where the utilized scales are obtained by combining the scale basis using $G$, the scale experts $S_{in}$ and $S_{out}$ are of primary importance in the method. How are these scale experts constructed and/or learnt?

If the scale experts are learnt, what are the benefits of learning $S_{in, out}$ and then computing $\hat{S}_{in, out}$, instead of learning these scales directly?

Limitations

Yes

Author Response

We appreciate the review, and hope our rebuttal can address your concerns and change your stance on our paper.

W1. Weak significance and low originality

In this work, we propose a new binary LLM architecture to increase the representational power of binary models with negligible inference overhead by introducing the concept of Mixture-of-Scale, which can adaptively calculate proper scaling factors based on the context. We would like to highlight that good model architecture design is significant in the field of deep learning, as the model architecture determines the fundamental capability of the models. For this reason, many papers proposing modifications to model architecture are published in major venues. For example, Bi-Real Net, which applied a residual path to every convolution layer to enhance the accuracy of 1-bit convolutional neural networks (CNNs), has been published in ECCV [R3-1].

[R3-1] Liu, Zechun, et al. "Bi-real net: Enhancing the performance of 1-bit cnns with improved representational capability and advanced training algorithm." Proceedings of the European conference on computer vision (ECCV). 2018.

W2. Performance of the number of scale experts

We agree that performance does not always improve as the number of experts increases. Meanwhile, we want to emphasize that the key point of the proposed BinaryMoS is the introduction of multiple scaling experts (e.g., 2, 4, 8) to generate token-adaptive scaling factors, not the specific number of experts. In this way, BinaryMoS can improve the accuracy of binary models compared to OneBit, which uses fixed scaling factors. As shown in Tables 3 and 4 of our paper, BinaryMoS significantly enhances the accuracy of binary models. This highlights the effectiveness of token-adaptive scaling factor generation in BinaryMoS for improving the accuracy of binary models.

Q1. Effectiveness of training scale experts

As you correctly pointed out, the choice of scale experts $S_{in}$ and $S_{out}$ significantly influences the accuracy of BinaryMoS, so it is important to find proper $S_{in}$ and $S_{out}$. If we randomly set $S_{in}$ and $S_{out}$ as you questioned, BinaryMoS fails to achieve sufficient accuracy. Therefore, during the training process of BinaryMoS, both $S_{in}$ and $S_{out}$ are also trained. Before training, BinaryMoS adopts the SVD-based $S_{in}$ and $S_{out}$ initialization method proposed in OneBit, so that $S_{in}$ and $S_{out}$ are initialized to minimize the binarization error of the pre-trained weights, rather than being randomly initialized. We will clarify that both $S_{in}$ and $S_{out}$ are trained, as well as their initialization method, in the revised version of the paper.
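For concreteness, here is a minimal sketch of how such an SVD-based initialization of the scale vectors could look, assuming a rank-1 decomposition of $|W|$ in the spirit of OneBit's SVID initialization. This is our own illustration, not the authors' code, and all names are hypothetical.

```python
import torch

def svd_rank1_scale_init(w_fp: torch.Tensor):
    """Illustrative rank-1 initialization of input/output scales from |W|.

    Assumes an SVID-style decomposition |W| ~= s_out s_in^T for w_fp of shape
    (n, m), so that W ~= Sign(W) * (s_out s_in^T).
    """
    u, s, vh = torch.linalg.svd(w_fp.abs(), full_matrices=False)
    s_out = u[:, 0] * s[0].sqrt()     # length-n output scale vector
    s_in = vh[0, :] * s[0].sqrt()     # length-m input scale vector
    return s_in, s_out

w_fp = torch.randn(8, 16)             # hypothetical pre-trained weight
s_in, s_out = svd_rank1_scale_init(w_fp)
# Each of the e scale experts could be initialized from this rank-1
# approximation (e.g., replicated with small perturbations) before training.
w_approx = torch.sign(w_fp) * torch.outer(s_out, s_in)
```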

If we directly learn $\hat{S}_{in}$ and $\hat{S}_{out}$, it works exactly the same as OneBit. During inference, the scaling factors trained with this approach are fixed regardless of the input context. Meanwhile, in the proposed BinaryMoS, on top of training $S_{in}$ and $S_{out}$, the router weight $W_R$ is also trained to generate the gating score $G$ based on the input context $X$, so that token-adaptive scaling factors can be generated as described in Eqs. (3) and (4) of our paper.

Thank you again for your valuable feedback. If you have any additional questions, please let us know.

Comment

I thank the authors for the response, which partially answered my concerns; therefore, I'm happy to increase my rating by 1.

Comment

Regarding my rating being lower than those of the other reviewers: I read the paper again to make sure that I hadn't missed something important, and I didn't find anything. Instead, I have the following suggestions about Eqs. (2)-(5) for the authors to improve the quality of the paper:

  1. The notation $\odot$ seems to denote the element-wise product. First, it should be specified.
  2. Suppose that my interpretation in 1. is correct. Given that $A \odot B$ requires $A$ and $B$ to have the same dimensions, Eqs. (2)-(5) need to be corrected mathematically. In particular, Eq. (2) in its current form does not read well.
  3. The dimensions of the variables need to be revised. Take your current notation of variable dimensions, $X: k \times m$ and $W_{FP}: n \times m$:
    • For Eq. (3), it should be $W_R: m \times e$ (currently $n \times e$). This results in $G: k \times e$.
    • Are $S_{in}: m \times e$ and $S_{out}: n \times e$ right? If so, Eq. (4) has a dimension mismatch issue, and should maybe be corrected as $\hat{S}_{in} = G S_{in}^T$, $\hat{S}_{out} = G S_{out}^T$?
    • If corrected as above, then $\hat{S}_{in}: k \times m$ and $\hat{S}_{out}: k \times n$. This leads to a problem with Eq. (5): the $\odot$ products on the LHS have mismatched dimensions, while its RHS should be corrected as (please double-check my suggestion here) $X \odot \hat{S}_{in} \, \text{Sign}(W^T) \odot \hat{S}_{out}$.
Comment

Thank you for your additional comments. We acknowledge that there are some typos and unclear explanations in the equations presented in our paper. We will address them in this response and revise the manuscript accordingly.

Response to 1 and 2. Notation $\odot$ and Dimension Mismatch in the $\odot$ Operation

As you suggested, we will explicitly state in the revised paper that the notation $\odot$ represents element-wise multiplication. Additionally, we want to clarify that we assume element-wise multiplication with broadcasting to a common shape, which functions exactly as in current deep learning frameworks. This means that in $A \odot B$, the dimensions of $A$ and $B$ do not need to match exactly for the operation to proceed. For example, if $A$ has dimensions $n \times m$ and $B$ has dimensions $1 \times m$, $B$ will be broadcast to the common shape $n \times m$ and then multiplied by $A$ element-wise; that is, along the first dimension, the same values of $B$ are multiplied with each row of $A$. We will also clarify this broadcasting behavior in the revised paper.

Response to 2 and 3. Clarification of Eqs. (2)-(5)

Eq. (2): This equation was adopted from the OneBit operation. To ensure compatibility with the subsequent equations, we will swap the row and column dimensions of $S_{in}$ and $S_{out}$. In the revised version, Eq. (2) will be expressed as $X [S_{in}^T \odot \text{Sign}(W^T_{FP}) \odot S_{out}] = [(X \odot S_{in})\,\text{Sign}(W^T_{FP})] \odot S_{out}$, where $S_{in} \in \mathbb{R}^{1 \times m}$ and $S_{out} \in \mathbb{R}^{1 \times n}$.
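As a quick illustration of this broadcasting convention, the equivalence above can be checked numerically; this is our own sketch with arbitrary shapes, not code from the paper.

```python
# Illustrative check of Eq. (2) under element-wise multiplication with
# broadcasting, as in PyTorch/NumPy. Shapes are arbitrary.
import torch

k, m, n = 3, 6, 5
X = torch.randn(k, m)                # activations, k x m
W_fp = torch.randn(n, m)             # latent full-precision weight, n x m
B = torch.sign(W_fp.T)               # Sign(W_FP^T), m x n
S_in = torch.randn(1, m)             # input scales, broadcast over rows of B / columns of X
S_out = torch.randn(1, n)            # output scales, broadcast over rows

lhs = X @ (S_in.T * B * S_out)       # scales folded into the binary matrix
rhs = (X * S_in) @ B * S_out         # scales applied to activations and outputs
assert torch.allclose(lhs, rhs, atol=1e-5)
```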

Eq. (3): As you correctly pointed out, $W_R$ should have dimensions $m \times e$, but it was mistakenly written as $n \times e$. We will correct this in the revision.

$S_{in}$ and $S_{out}$ in Eq. (4): Given that $e$ scaling experts are used, the dimensions of $S_{in}$ and $S_{out}$ should be $e \times m$ and $e \times n$, respectively. We will clarify these dimensions in the revision. Meanwhile, as the dimension of $G$ is $k \times e$, the matrix multiplication in Eq. (4) between $G$ and $S_{in/out}$ is correct.

Eq. (5): As you mentioned, the transpose notation should be removed from $\hat{S}_{in}$ and $\hat{S}_{out}$. We will correct Eq. (5) in the revision to $[(X \odot \hat{S}_{in})\,\text{Sign}(W^T_{FP})] \odot \hat{S}_{out}$.
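Putting the corrected dimensions together, the following minimal PyTorch sketch traces Eqs. (3)-(5) end to end. It is our own illustration; the variable names and the softmax router are assumptions based on the discussion above, not the paper's code.

```python
import torch
import torch.nn.functional as F

k, m, n, e = 4, 6, 5, 4                 # tokens, in-dim, out-dim, number of scale experts

X = torch.randn(k, m)                   # input activations, k x m
W_fp = torch.randn(n, m)                # latent full-precision weight, n x m
B = torch.sign(W_fp.T)                  # Sign(W_FP^T), m x n (binary weights)
W_r = torch.randn(m, e)                 # router weight, m x e            (Eq. 3)
S_in = torch.randn(e, m)                # input scale experts,  e x m
S_out = torch.randn(e, n)               # output scale experts, e x n

G = F.softmax(X @ W_r, dim=-1)          # gating scores, k x e            (Eq. 3)
S_in_hat = G @ S_in                     # token-adaptive input scales,  k x m  (Eq. 4)
S_out_hat = G @ S_out                   # token-adaptive output scales, k x n  (Eq. 4)

Y = ((X * S_in_hat) @ B) * S_out_hat    # Eq. (5): output of shape k x n
assert Y.shape == (k, n)
```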

We apologize for any confusion these equations may have caused. We will include clearer descriptions and correct typos in the final version of the paper.

Lastly, we would like to emphasize the major contribution of this work once again. We propose a new LLM binarization method called BinaryMoS. It introduces the concept of Mixture-of-Scale to adaptively generate scaling factors, thereby enhancing the representational power of binary models.

The accuracy evaluation results in our paper and the latency measurements provided in the global rebuttal demonstrate that the proposed method achieves state-of-the-art accuracy while maintaining the latency advantage of binary models.

We hope our response has clarified your concerns. Thank you again for your feedback.

Review (Rating: 7)

This paper proposes a binarization technique for LLMs, inspired by the mixture-of-experts (MoE) model. In the proposed approach, multiple scaling factors for the binarized matrices are available, each treated as an expert just like in MoE. The model infers a weighted combination of these scaling factors adaptively at each time step, hence the binarization strategy is different for each input token. The paper shows that this adaptive binarization technique does not increase memory overhead as much as traditional MoE does, but can effectively improve model quality, as compared with existing 1-bit or 2-bit quantization baselines.

Strengths

The proposed approach effectively improved the quality of binarized LLMs, which has always been a challenging task. The introduction of adaptive binarization makes a lot of sense and deserves broader attention and experimentation.

Weaknesses

More detailed analysis in terms of efficiency, scaling behaviors and ablation studies could have been given.

Questions

  1. The performance of BinaryMoS depends critically on the choice of static experts S_in and S_out. How robust is the proposed approach to random initialization of these experts? Does the magnitude of these scaling experts matter a lot?

  2. Are there any interesting patterns learned for expert weights? For example, whenever one expert gets assigned larger weights, do the input sequences contain specific n-grams or syntactic structures? It would be great to find some insights into how and why the model learned to make expert choices, for better interpretability.

  3. Related to question 2, if no clear, interpretable expert-assignment patterns can be found, what if you just determine the expert weights randomly without training? This should be a baseline to validate the effectiveness of the learning procedure.

  4. How important is the knowledge distillation loss? It seems there is no analysis of the relative importance of distillation.

  5. How does the quantization performance scale with model size? If there is a clear scaling law of BinaryMoS which appears to be more efficient than full-precision training, then this approach should become a standard for very large model training and inference.

Limitations

The paper addressed limitations in the "Discussion and future work" section.

Author Response

We appreciate the constructive reviews, and here we address your comments in detail:

Q1. Importance of scale experts

As you correctly pointed out, the choice of static experts $S_{in}$ and $S_{out}$ significantly influences the accuracy of BinaryMoS, so it is important to find proper $S_{in}$ and $S_{out}$. If we randomly set $S_{in}$ and $S_{out}$ as you questioned, BinaryMoS fails to achieve sufficient accuracy. Therefore, during the training process of BinaryMoS, both $S_{in}$ and $S_{out}$ are also trained.

The term 'static' might cause confusion, but $S_{in}$ and $S_{out}$ are static only during the inference stage, not during the training stage. Moreover, before training, BinaryMoS adopts the SVD-based $S_{in}$ and $S_{out}$ initialization method proposed in OneBit, so that $S_{in}$ and $S_{out}$ are initialized to minimize the binarization error of the pre-trained weights, rather than applying random initialization.

We will clarify that both $S_{in}$ and $S_{out}$ are trained, as well as their initialization method, in the revised version of the paper.

Q2. Linguistic pattern of assigning expert

The weights of the experts vary across input tokens, and we were not able to identify any interesting patterns learned for the expert weights. However, we have included example sentences and the corresponding expert weights assigned to each token in Figure 1 of the PDF file attached to the author rebuttal. This is provided in case others might find some interesting patterns. If you have any further insights or opinions on the patterns, please let us know.

Q3. Influence of random expert weights

Thank you for the valuable insight. As you suggested, we measured the accuracy of BinaryMoS with expert weights set randomly and without training, as shown in Table A2-1. The evaluation results demonstrate significant accuracy degradation, underscoring the importance of training to fully achieve the advantage of the proposed Mixture-of-Scale scheme. We will add Table A2-1 and the above discussion to the Appendix of the revised paper.

Table A2-1. Accuracy evaluation of BinaryMoS model with random expert weights

| Model        | PPL (Wiki2) | PPL (C4) | Avg. acc. |
|--------------|-------------|----------|-----------|
| LLaMA-1-7B   | 1034.07     | 718.39   | 37.81     |
| LLaMA-1-13B  | 111.07      | 141.95   | 39.84     |

Q4. Importance of knowledge distillation

Knowledge distillation (KD) has been widely adopted to improve the accuracy of QAT techniques, and it has also been utilized in previous work such as OneBit. To verify the importance of the KD loss, we compare the accuracy of both OneBit and BinaryMoS with and without KD in Table A2-2. The binarized models without KD show worse perplexity and accuracy results compared to those with KD. However, regardless of whether KD is applied, BinaryMoS consistently outperforms OneBit, demonstrating the effectiveness of the proposed Mixture-of-Scale approach. We will add Table A2-2 and the above discussion to the Appendix of the revised paper.

Table A2-2. Perplexity and averaged zero-shot accuracy results of OPT-1.3B model

| Method     | w/ KD | PPL (C4) | Zero-shot acc. (avg.) |
|------------|-------|----------|-----------------------|
| OneBit     | True  | 20.76    | 47.50                 |
| BinaryMoS  | True  | 18.83    | 49.34                 |
| OneBit     | False | 22.95    | 45.46                 |
| BinaryMoS  | False | 20.19    | 46.05                 |
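For reference, a typical logit-level distillation loss used in QAT-style training looks like the sketch below. This is an illustrative example of the general technique; the exact loss used in the paper may differ, and all names are hypothetical.

```python
import torch
import torch.nn.functional as F

def logit_distillation_loss(student_logits, teacher_logits, temperature=2.0):
    """Common KD loss for quantization-aware training: KL divergence between
    the FP16 teacher's and the binarized student's softened output distributions."""
    t = temperature
    p_teacher = F.softmax(teacher_logits / t, dim=-1)
    log_p_student = F.log_softmax(student_logits / t, dim=-1)
    return F.kl_div(log_p_student, p_teacher, reduction="batchmean") * (t * t)

# Hypothetical usage: logits of shape (batch * seq_len, vocab_size)
student = torch.randn(32, 32000)
teacher = torch.randn(32, 32000)
loss = logit_distillation_loss(student, teacher)
```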

Q5. Scaling law of BinaryMoS

As you pointed out, if there is a clear scaling law for BinaryMoS, it could become a standard for compressing very large models. However, since BinaryMoS adopts QAT-based strategies, which require training the full weight parameters to adapt the model to quantization, it is very challenging to evaluate the effect of BinaryMoS on very large models, such as LLaMA 30B and 70B, due to the high training cost. Hence, it is difficult to determine the scaling law of BinaryMoS at this point. Please note that this limitation is not specific to BinaryMoS but is a general limitation of QAT-based strategies, including previous works like OneBit. To scale BinaryMoS to very large models, we need to find a way to integrate the BinaryMoS approach with parameter-efficient fine-tuning (PEFT) techniques such as LoRA [R2-1]. This integration would make the training procedure feasible for very large models. Therefore, we believe that this is an important future direction for BinaryMoS to extend its usability and effectiveness to much larger models.

[R2-1] Yelysei Bondarenko, et al. “Low-Rank Quantization-Aware Training for LLMs”, arxiv:2406.06385.

Thank you again for your valuable feedback. If you have any additional questions, please let us know.

Comment

Thanks for your response, I will keep my original rating.

Review (Rating: 7)

This paper introduces a method to compress large language models (LLMs) by quantizing weight values to 1-bit. The goal is to mitigate performance degradation seen in previous quantization methods applied to LLMs. They utilize ideas from Mixture-of-Experts and introduce token-adaptive scaling factors that help control the effective values of binarized weights. Their approach results in high compression ratios compared to earlier binarization methods, maintaining memory efficiency.

Strengths

  • Quantization helps lower the barriers to deploying large language models in compute-constrained environments. While previous approaches have been able to achieve this, it has come at the expense of linguistic utility. BinaryMoS is an attempt to lower these barriers without these costs.

  • Through experiments on various benchmarks, they demonstrate the effectiveness of BinaryMoS in memory-efficient quantization.

  • The paper offers a clear explanation of existing research work on binarization and this helps with clarity and readability.

Weaknesses

  • The motivation for extending the ideas from MoE to One-Bit is not entirely clear. Is the paper focused on applying binarization to MoE-style models? Please clarify this aspect.

  • The benefits of using 4 experts compared to 2 experts do not seem significant across tasks. Given the additional memory overhead introduced by these experts, wouldn't it be better to advocate for using two experts instead?

  • The analysis section on token-adaptive scaling factors does not adequately explain why BinaryMoS is preferable to One-Bit. While the variation in gating scores for experts across tokens is shown, there is no context provided about the sentences being analyzed. It seems possible that BinaryMoS is only useful and sensitive to certain domains and tasks and may not apply to every task.

  • The omission of latency measurements compared to other binarization methods raises questions about the robustness and efficiency of BinaryMoS. Since latency measurements are critical for real-world deployments of large language models, a discussion on this topic would be great.

Questions

See weaknesses

Limitations

The authors claim to address the limitations of their work in the "Discussion and Future Work" section, but this discussion is incomplete. They do not mention latency requirements and potential failure cases of BinaryMoS are also omitted and not listed as limitations. Additionally, they should acknowledge that performance loss is still expectedly incurred compared to FP16.

Author Response

We appreciate the constructive reviews, and here we address your comments in detail:

W1. Motivation of BinaryMoS

The motivation stems from the fact that previous binarization methods, including OneBit, still have low representational power. To push the limits of binarized weights, we propose token-adaptive scaling factors inspired by MoE. Typically, MoE is used not for model compression techniques like quantization but to enhance the original model's capabilities by duplicating the weights of FFN layers according to the number of experts. Simply combining binarization methods with traditional MoE can enhance the model's capabilities, but the memory overhead is large, conflicting with the goal of extreme compression inherent in binarization. Therefore, we propose a novel binarization method that adopts the concept of using multiple experts from MoE and applies it to the scaling factors of binarization. In this way, while OneBit uses fixed scaling factors regardless of context, the proposed BinaryMoS can generate token-adaptive scaling factors which can improve the representation power of the binarized model.

In summary, our BinaryMoS scheme is inspired by MoE-style models, but it does not employ multiple experts with separate weights. Instead, it uses multiple scaling experts to generate token-adaptive scaling factors while fixing binarized weights, thereby retaining the memory efficiency of binarized models.
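As a rough, back-of-the-envelope illustration of this point, the following arithmetic (our own, using a LLaMA-7B-sized projection of 4096 × 11008 as in Table G-1) compares the extra parameters added by duplicating expert weights, as in traditional MoE, against those added by BinaryMoS's scaling experts:

```python
# Illustrative per-layer parameter overhead (our own arithmetic, not from the paper).
m, n, e = 4096, 11008, 4           # in-dim, out-dim, number of experts

moe_extra = (e - 1) * n * m        # traditional MoE: duplicate the full weight per expert
mos_extra = m * e + e * m + e * n  # BinaryMoS: router (m x e) + S_in (e x m) + S_out (e x n)

print(f"extra params, traditional MoE : {moe_extra / 1e6:.1f} M")
print(f"extra params, BinaryMoS scales: {mos_extra / 1e6:.3f} M")
# ~135 M vs ~0.08 M extra parameters per layer: the scaling experts are negligible.
```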

W2. Benefits of using 4-expert

As the reviewer pointed out, in terms of zero-shot accuracy, the average accuracy of 2-experts and 4-experts is 50.64% and 50.61%, respectively, showing no significant difference. However, in terms of perplexity, the 4-expert configuration achieves remarkable improvements compared to the 2-expert configuration by lowering perplexity from 12.18 to 11.85 (reported in Table 2 of our paper). Hence, we can expect that 4-experts will generally achieve better language modeling capabilities compared to 2-experts. Moreover, due to the low memory cost of adopting scaling experts, as shown in Table A1-1, the model sizes of 2-expert and 4-expert BinaryMoS are similar, while both significantly reduce the model size compared to the original Float16 model. In other words, the cost of adopting the 4-expert configuration is low, while we can expect an improvement in language modeling capabilities. Based on these observations, we chose to employ 4-experts in BinaryMoS.

Table A1-1. Comparison of memory size for 2-expert and 4-expert BinaryMoS.

| Model          | Float16  | 4-expert | 2-expert |
|----------------|----------|----------|----------|
| LLaMA-1/2-7B   | 13.51 GB | 1.40 GB  | 1.38 GB  |
| LLaMA-1/2-13B  | 26.20 GB | 2.33 GB  | 2.30 GB  |

W3. Relationship between gating score and context

While Figure 3 shows the variation in gating scores for experts across tokens in an exemplary sentence, we observed a similar tendency throughout our experiments for various tasks reported in Section 4.4. Each token is assigned a different scaling factor with our token-adaptive binarization method. Although our experiments are still limited, we believe this trend could extend to other LLM tasks as well, since the concept of applying multiple scaling experts for weight binarization is general.

W4. Latency measurements compared to other binarization methods

Thanks for pointing out the latency measurements. As you highlighted, the latency is a critical component for real-world deployment. However, previous binarization papers, including PB-LLM, BiLLM, and OneBit, have not reported their latency due to the lack of a CUDA kernel for matrix multiplication between FP activation and 1-bit weights. Therefore, we first developed a custom CUDA kernel for 1-bit matrix multiplication by modifying the CUDA kernel for multi-bit matrix multiplication [R1-1, R1-2]. Then, we also developed a custom CUDA kernel for BinaryMoS by fusing the operations of scaling experts and routers on top of the 1-bit matrix multiplication CUDA kernel.

We measured the latency of the linear layers in LLaMA-7B and LLaMA-13B (batch size: 1) and reported the latency results in Tables G-1 and G-2 of the global response. As illustrated in our paper, PB-LLM and BiLLM require extra matrix multiplications, so they tend to show similar or larger latency compared to the original FP16 models. OneBit, which employs the simplest binarization scheme, achieves a significant improvement over the original FP16 model and shows the minimum latency. Meanwhile, although our BinaryMoS introduces additional operations for processing the scaling experts, these require far fewer operations than the matrix multiplication, so BinaryMoS shows latency similar to OneBit. This demonstrates that the multi-scaling-factor module in BinaryMoS improves performance in terms of perplexity and zero-shot accuracy with minimal latency overhead.

[R1-1] Taesu Kim, et al., “QUICK: Quantization-aware Interleaving and Conflict-free Kernel for efficient LLM inference”, arxiv:2402.10076

[R1-2] https://github.com/SqueezeBits/QUICK
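As a rough sanity check on why the scaling-expert path adds little latency (our own estimate, not derived from the CUDA kernels above), the per-token operation count of the router and scaling experts is a tiny fraction of the binary matrix multiplication:

```python
# Rough per-token multiply-accumulate counts for one linear layer
# (illustrative estimate only). LLaMA-7B-like shape: 4096 -> 11008.
m, n, e = 4096, 11008, 4

matmul_macs = m * n                 # binary weight matrix multiplication
router_macs = m * e                 # gating scores  G = softmax(X W_R)
scale_macs = e * m + e * n          # S_in_hat = G S_in, S_out_hat = G S_out
apply_macs = m + n                  # element-wise scaling of input and output

overhead = (router_macs + scale_macs + apply_macs) / matmul_macs
print(f"scaling-expert overhead: {overhead * 100:.2f}% of the matmul MACs")  # well under 1%
```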

L1. Limitation of BinaryMoS

Thanks for the suggestion.

First, we will include the latency results which we reported in the response to W4 and global response in the revised manuscript. Regarding other limitations, we will include the following parts in the limitation section of the revised manuscript.

  1. Our results are still limited to relatively small models up to 13B. We plan to work on larger models such as 30B.

  2. The performance loss from the FP16 model is still substantial, so further innovations regarding binarization are needed to reduce the performance gap.

Thank you again for your valuable feedback. If you have any additional questions, please let us know.

Comment

Thanks a lot for the clarifications! I will increase my rating to 7!

Comment

Thank you for your feedback! We appreciate the constructive reviews for improving our work.

Author Response (Global)

Dear reviewers,

Thank you all for your valuable feedback on our work.

We appreciate all the insightful questions and comments provided, and we have responded to each reviewer's comments as thoroughly as possible. Moreover, an additional one-page PDF file is attached to provide supplementary figures. We hope our responses will help resolve any concerns or curiosity.

Many reviewers have raised questions about the latency of BinaryMoS, so we have evaluated the latency of previous binary models and the proposed BinaryMoS by developing appropriate CUDA kernels for binary models. We measured the latency of the linear layers in LLaMA-7B and LLaMA-13B (batch size: 1) and reported the results in Tables G-1 and G-2. As illustrated in our paper, PB-LLM and BiLLM require extra matrix multiplications, making them very slow. OneBit, which employs the simplest binarization scheme, achieves significant improvement over the original FP16 model and shows the minimum latency. Meanwhile, our BinaryMoS introduces additional operations for processing scaling experts, which require far fewer operations compared to matrix multiplication. Consequently, BinaryMoS also shows similar latency results to OneBit. This demonstrates that the multi-scaling factor module in BinaryMoS improves performance in terms of perplexity and zero-shot accuracy with minimal overhead to latency.

Table G-1. Latency (msec) of Linear Layer in LLaMA-1/2-7B.

| Weight Size | 4096 × 4096 | 4096 × 11008 | 11008 × 4096 |
|-------------|-------------|--------------|--------------|
| Float16     | 0.06815     | 0.15172      | 0.14346      |
| PB-LLM      | 0.09607     | 0.17751      | 0.16833      |
| BiLLM       | 0.08711     | 0.09638      | 0.10420      |
| OneBit      | 0.03266     | 0.03370      | 0.03494      |
| BinaryMoS   | 0.03449     | 0.03690      | 0.03695      |

Table G-2. Latency (msec) of Linear Layer in LLaMA-1/2-13B.

| Weight Size | 5120 × 5120 | 5120 × 13824 | 13824 × 5120 |
|-------------|-------------|--------------|--------------|
| Float16     | 0.09558     | 0.22408      | 0.21355      |
| PB-LLM      | 0.12273     | 0.24367      | 0.23466      |
| BiLLM       | 0.09523     | 0.12421      | 0.13095      |
| OneBit      | 0.03338     | 0.04144      | 0.04258      |
| BinaryMoS   | 0.03561     | 0.04339      | 0.04445      |

Thanks again for your valuable comments. If you have any further questions or comments, please let us know.

Final Decision

The paper proposes BinaryMoS, a token-adaptive binarization method for LLMs that leverages the concept of MoE for scaling factors. This allows the binary weight values to be adjusted contextually, thereby enhancing representational power compared to traditional static binarization techniques.

While one reviewer maintains a lower score (no engagement), their concerns are primarily focused on the paper's significance and presentation, which can be further addressed through the suggested improvements (e.g., scaling to larger models, subject to resource availability of course, to understand its scalability; investigating the relationship between the learned scaling factors, the input tokens, and their impact on performance). The remaining reviewers, after the rebuttal, view the paper favorably, recognizing its technical contribution and potential impact.

I believe this paper makes a valuable contribution to the field of LLM compression by introducing a novel and effective binarization method. The proposed method achieves accuracy high enough to be considered state of the art for binary LLMs while maintaining memory efficiency and low latency. The paper is well written and the results are convincing.