PaperHub
NeurIPS 2024 — Rejected (4 reviewers)
Overall rating: 4.8 / 10 (individual ratings: 5, 3, 5, 6; min 3, max 6, std. dev. 1.1)
Confidence: 4.3
Correctness: 2.5 · Contribution: 2.5 · Presentation: 2.3

decoupleQ: Towards 2-bit Post-Training Uniform Quantization via decoupling Parameters into Integer and Floating Points

OpenReview · PDF
Submitted: 2024-04-26 · Updated: 2024-11-06
TL;DR

transform model quantization into a constrained optimization problem

Abstract

Keywords
quantization; large language model; optimization

Reviews and Discussion

Review (Rating: 5)

The paper introduces decoupleQ, a novel method that decouples model parameters into integer and floating-point parts. This approach transforms the quantization problem into a mathematical constrained optimization problem, avoiding the limitations of traditional heuristic quantization methods. DecoupleQ achieves a significant improvement over existing methods in LLMs, especially at extremely low bit-widths (2-bit), and the authors also release the W2A16 CUDA kernel.

Strengths

  1. DecoupleQ eliminates the need for ad-hoc techniques to handle outliers and sensitive channels, focusing solely on optimizing model accuracy under extremely low-bit quantization.
  2. DecoupleQ achieves a notable advancement over existing methods in LLMs, particularly at extremely low bit-widths, and the W2A16 CUDA kernel has been released.
  3. The decoupleQ approach can be readily extended to supervised fine-tuning (SFT) to enhance model accuracy, or adapted for downstream sub-tasks.

Weaknesses

  1. Please correct me if I am wrong. It seems that decoupleQ combines several existing approaches. Specifically, it uses Adaround to get the integer part in ResNets and GPTQ to get the integer part in LLMs. Additionally, it integrates PTQ and QAT by applying PTQ to the integer part while using supervised training for the floating-point part.
  2. Regarding your point from lines 58-61, I believe GPTQ clearly outlines how to calculate the scale and zero-point in their code. Moreover, GPTQ can be seen as a constrained optimization problem, where the constraints align with yours: each integer weight is confined within $[\alpha, \beta]$, which is a default constraint in GPTQ.
  3. Further experiments on LLMs are essential. For example, evaluating decoupleQ's performance in multi-task settings and within the Llama 3 family would provide valuable insights.
  4. Could you provide more ablation studies in the second stage, such as experiments without training norm layers?
  5. There is a typo in line 125. The first letter of 'decoupleQ' should be capitalized.

Questions

Please refer to the weaknesses.

Limitations

Please refer to the weaknesses.

Author Response

Thank you very much for your careful reading of our paper, your generally positive comments, and your accurate summary highlighting our strengths. We will try to respond to your questions in as much detail as possible, and we would be grateful if you could point out any omissions.

Weakness 1:

The core value of this paper lies in transforming a quantization problem into a mathematical optimization problem (refer to formula (6) in the original paper), and this quantization method is general, regardless of whether the model is a ResNet or an LLM.

From an algorithmic perspective, our process is as follows:

  1. First, we focus on a linear (or conv) layer in the model. We aim to minimize the difference between pre- and post-quantization within a linear layer (refer to formula (5)). Then, substituting (4) into (5) and taking into account per-channel quantization, we get the objective function (7). Adding the constraints on $w$, we get the optimization problem (6). This is a constrained mathematical optimization problem. After the problem is solved, we obtain the integer part and the floating-point part; that is, we get both parts by solving the optimization problem (6). (A schematic code sketch of this decoupled solve is given after this list.)
  2. Second, since minimizing the quantization error within a linear layer does not mean minimizing the model quantization error, we perform an optimization at the block level (formula (2)). In this stage, we freeze the integer part and only train the floating-point part.
  3. The above two steps both fall within the scope of PTQ, and we found that with only these two steps we can already obtain reasonable model accuracy (refer to Table 3). In practice, if we want to further improve model accuracy, we can fine-tune the floating-point part of the whole model on a labelled dataset. This process is similar to QAT, but it is very lightweight training, because we have fixed the integer part, which occupies the vast majority of the parameters.
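For concreteness, here is a minimal PyTorch sketch of the alternating view behind step 1. The function and variable names, the min-max initialization, and the plain least-squares/rounding alternation are illustrative assumptions for exposition; they are not our exact solver for problem (6), which updates the integer part by minimizing the layer-output error rather than by naive rounding.

```python
# Schematic sketch only (assumed names): alternating minimization of the decoupled
# per-channel objective  min_{w, s, z} ||X (w*s + z) - X W0||_2^2,
# with the integer part w constrained to [alpha, beta].
import torch

def _round_int(W0, s, z, alpha, beta):
    # Naive constrained rounding against W0; the actual method instead minimizes
    # the layer-output error when updating the integer part.
    s_safe = torch.where(s.abs() < 1e-8, torch.full_like(s, 1e-8), s)
    return torch.clamp(torch.round((W0 - z) / s_safe), alpha, beta)

def decoupled_quantize_layer(X, W0, n_bits=2, n_iters=3):
    # X: (num_samples, in_features) calibration activations; W0: (in_features, out_features)
    alpha, beta = 0.0, 2.0 ** n_bits - 1.0
    w_min, w_max = W0.min(dim=0).values, W0.max(dim=0).values
    s = (w_max - w_min) / (beta - alpha)          # min-max init, only a starting point
    z = w_min - alpha * s
    w = _round_int(W0, s, z, alpha, beta)
    Y0 = X @ W0                                   # full-precision layer output (target)
    ones = X.sum(dim=1, keepdim=True)             # contribution of the zero-point term
    for _ in range(n_iters):
        # (1) fix the integer part w, solve (s, z) per output channel by least squares
        for j in range(W0.shape[1]):
            A = torch.cat([(X @ w[:, j]).unsqueeze(1), ones], dim=1)   # (N, 2)
            sol = torch.linalg.lstsq(A, Y0[:, j:j + 1]).solution       # (2, 1)
            s[j], z[j] = sol[0, 0], sol[1, 0]
        # (2) fix (s, z), refresh the constrained integer part
        w = _round_int(W0, s, z, alpha, beta)
    return w, s, z, w * s + z                     # integer part, floats, de-quantized weight
```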

Weakness 2:

  1. I'm sorry we were not clear here. We originally wanted to express that the core contribution of GPTQ is how to efficiently update the remaining elements, without specifically studying how to obtain a better $(s, z)$. In the code of GPTQ, they use min-max or MSE search to get $(s, z)$, while in decoupleQ, we solve the optimization problem (6) to get $(s, z)$. Thank you very much for pointing out the problem with this sentence. We will delete it in the revision to avoid misunderstanding.
  2. What we meant by "unconstrained" in GPTQ is that "the optimization problem formulated for updating the remaining elements is unconstrained". For simplicity, suppose there are three elements in the weight $w = [w_1, w_2, w_3]$, and suppose $scale = 1$ and $zero = 0$. In GPTQ, they first fake-quantize $w_1$ to $\widetilde{w_1}$, where $\widetilde{w_1}$ is constrained within the interval $[\alpha, \beta]$; they then update $w_2$ and $w_3$ to $w_2' = w_2 + \Delta_2$ and $w_3' = w_3 + \Delta_3$. However, $w_2'$ and $w_3'$ are not constrained within $[\alpha, \beta]$; that is, the update to the remaining elements is unconstrained. In decoupleQ, we propose two levels of approximation when updating the remaining elements, (10) and (11), as a quantization-time versus model-accuracy trade-off: (10) is constrained while (11) is unconstrained. (A toy numerical illustration is given below.)
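A toy numerical illustration of this distinction (the compensation coefficients below are invented for illustration only; GPTQ derives the actual update from the Hessian):

```python
# Toy illustration: with scale = 1 and zero = 0, the 2-bit integer range is [0, 3].
import torch

alpha, beta = 0.0, 3.0
w = torch.tensor([2.4, 2.9, 0.7])                       # w = [w1, w2, w3]

w1_q = torch.clamp(torch.round(w[0]), alpha, beta)      # fake-quantize w1: constrained
err = w[0] - w1_q                                       # quantization error on w1
delta = err * torch.tensor([0.8, -0.3])                 # illustrative compensation for w2, w3

w_rest_unconstrained = w[1:] + delta                    # GPTQ-style: may leave [alpha, beta]
w_rest_constrained = torch.clamp(w[1:] + delta, alpha, beta)   # projected (constrained) variant
```

Here `w_rest_unconstrained[0]` ends up above 3, i.e. outside the feasible integer range, while the constrained variant projects it back.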

Weakness 3:

Thanks very much for your suggestions. The lack of rich public experiments was indeed our shortcoming. We will make our code public on GitHub (regardless of whether the paper is finally accepted or rejected), and continue to add richer experimental results. We hope that reviewers can pay more attention to our innovation in theory and scheme. We also believe that the novelty of a method may outweigh the number of experiments, especially at NeurIPS. In addition, the work is tangible and can be applied in industry. We have indeed launched 2-bit quantized large speech models in multiple consecutive releases of our company's core products. After the reviewing period is completed, the identity of our company and the products launched will be made public. Also, we have released the W2A16 CUDA kernel used in our products, which is under review for merging into the NVIDIA TensorRT repo. We believe that our work will make a certain contribution to the industry and the open-source community.

Weakness 4:

We have not yet tried fixing the norm layers and only training $s$ and $z$ in the SFT stage. In all of our private experiments, we only fix the integer part $\widehat{W}$ and train all the floating-point parts, including $s$, $z$, and the norm layers. Thank you very much for your suggestion. We have also heard that training the norm layers has a certain impact on model accuracy [1]. We plan to do some experiments, and if the results are significantly different, we will update them on GitHub.

[1] Tuning LayerNorm in Attention: Towards Efficient Multi-Modal LLM Finetuning

Weakness 5:

Sincerely thank you for looking at our paper in such detail. We lowercase the "d" on purpose. "decouple" is lowercase and "Q" is uppercase, isn't that interesting? Just like "iPhone" or "eBay".

 

Thank you again for your valuable time and comments. And I am very happy to discuss these issues with you. If you have any questions, we will reply to you as soon as possible.

Comment

Thank you for the detailed rebuttal. The author has addressed some of my concerns. However, conducting experiments without training norm layers is crucial, as other methods typically do not involve training norm layers. Additionally, I agree with other reviewers: They address a portion of the optimization problem using GPTQ and another portion similar to BRECQ. The novelty is limited, resembling more an aggregation of existing techniques rather than a novel contribution.

Comment

Thank you very much for your reply. We are sorry that we did not explain it clearly in our first reply. It is a pleasure to continue discussing with you.

We have urgently conducted some ablation studies, and more experimental results are on the way. In this experiment, we choose to freeze or not freeze the LayerNorm in Llama when training $(s, z)$ in the block-wise minimization, and other settings are consistent with those in the original paper. Each element in the table represents (PPL on WikiText-2, PPL on C4).

|           | Llama-1-7B    | Llama-2-7B    | Llama-1-13B  | Llama-1-30B  | Llama-1-65B  |
|-----------|---------------|---------------|--------------|--------------|--------------|
| Train LN  | (9.49, 11.41) | (9.74, 11.83) | (7.86, 9.86) | (6.37, 8.51) | (5.59, 7.63) |
| Freeze LN | (9.69, 11.54) | (9.84, 11.97) | (7.90, 9.92) | (6.37, 8.52) | (5.48, 7.64) |

When LN is frozen, PPL generally increases slightly. This may be due to the reduction in learnable parameters. Nevertheless, our results are still significantly better than the previous results shown in Table 3. Thanks for your comments; we will add these results in the final version.

 

Now, please give us some time to compare decoupleQ with GPTQ and BRECQ in detail.

Both decoupleQ and GPTQ contain a step that minimizes the loss between pre- and post-quantization within a linear layer. However, this layer-wise minimization is a very simple and common idea [1,2,3,4,5]. The core contribution of these papers is not that they use layer-wise minimization, but how they achieve it.

For example, AWQ [1] achieves this goal via per-channel scaling that protects the salient weights; GPTQ [5] achieves it by updating the remaining elements while quantizing the preceding ones; and AdaRound [2] by learning whether to round up or down. In decoupleQ, we achieve this goal via the constrained mathematical optimization (6). We believe that formula (6) abstracts the essence of the problem, transforming a model quantization problem into a mathematical optimization problem, so that it no longer needs to focus on some of the minutiae unique to quantization. $w$, $s$, and $z$ are now totally independent of the original weight $W_0$ in the optimization process, as long as the final output error is minimized. This is the essential difference between decoupleQ and previous methods.

When solving formula (6), we provide two approximate solutions, (10) and (11). Both are common convex optimization problems. We used the idea from GPTQ to solve (11), but this is not the only solution, because we can choose to solve (10) to obtain a more accurate result.

 

The comparison with BRECQ is similar. Block-wise minimization is used in works [3,6,7]. In BRECQ, they train the rounding of the GEMM weights, whereas in decoupleQ, we train $(s, z)$ and the norm layers while keeping the GEMM weights frozen (because they have already been quantized to integers at this step). This is very lightweight training compared with BRECQ, because the GEMM weights occupy most of the model parameters, while $(s, z)$ occupies only a very small part.
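To illustrate why this stage is lightweight, here is a minimal sketch (with assumed module and parameter names; not our actual training code) of block reconstruction with the integer part frozen:

```python
# Minimal sketch of the block-level stage: the integer weights are frozen buffers,
# and only (s, z) and the norm parameters receive gradients during block reconstruction.
import torch
import torch.nn as nn
import torch.nn.functional as F

class QuantLinear(nn.Module):
    def __init__(self, w_int, s, z):
        super().__init__()
        self.register_buffer("w_int", w_int)   # frozen integer part (no gradient)
        self.s = nn.Parameter(s)               # trainable per-channel scale
        self.z = nn.Parameter(z)               # trainable per-channel zero-point (float)

    def forward(self, x):
        return x @ (self.w_int * self.s + self.z)   # de-quantize on the fly

def block_reconstruction_step(quant_block, fp_block, x, optimizer):
    # optimizer is built only over s, z, and norm-layer parameters of quant_block
    loss = F.mse_loss(quant_block(x), fp_block(x).detach())
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```

In such a setup, the optimizer only sees `s`, `z`, and the norm-layer parameters, which is why the trainable parameter count is a tiny fraction of the block.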

 

In addition, if we only consider the first stage, i.e., minimizing the loss between pre-quantization and post-quantization within the linear layer, without considering the second stage, our results still outperform the others (especially GPTQ, the fairest comparison), as shown in Table 4 and Table 3.

 

[1] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration
[2] Up or Down? Adaptive Rounding for Post-Training Quantization
[3] AffineQuant: Affine Transformation Quantization for Large Language Models
[4] QuIP: 2-Bit Quantization of Large Language Models With Guarantees
[5] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
[6] OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models
[7] BRECQ: Pushing the Limit of Post-Training Quantization by Block Reconstruction

 

Thank you again for your valuable time and high-quality comments. We sincerely look forward to further discussions with you.

Review (Rating: 3)

This paper proposes a linear and uniform quantization method, decoupleQ, which abandons the traditional heuristic quantization paradigm, decouples the model parameters into integer and floating-point parts, and then transforms the quantization problem into an optimization over these integer and floating-point parts. Experiments show that decoupleQ achieves accuracy comparable to fp16/bf16 in the 2-bit weight quantization setting.

Strengths

Experiments show that decoupleQ achieves accuracy comparable to fp16/bf16 in the 2-bit weight quantization setting.

Weaknesses

  1. Experiments are based on W2A16; lower activation bit-widths (<= 8 bits) should also be evaluated.
  2. The novelty is limited. The core idea of decoupleQ is similar to normalization (Batch Norm or Layer Norm): the learnable floating-point part of decoupleQ is equivalent to learnable normalization parameters.
  3. More existing quantization methods should be compared, such as NWQ [1] and PD-Quant [2].

[1] Leveraging Inter-Layer Dependency for Post-Training Quantization
[2] PD-Quant: Post-Training Quantization Based on Prediction Difference Metric

Questions

  1. What is the gain of decoupleQ at lower activation bit-widths (<= 8 bits)?
  2. Experimental comparison with more existing quantization methods.

Limitations

NA

Author Response

Thanks for reviewing our paper, and we respond to your concerns as follows:

Weakness 1:

As we state in line 83, we focus on weight-only quantization.

In the era of large language models, weight-only quantization has important industrial value because during the inference process with latency constraints, the batch size of decoding will be very small, so that only a few tokens are decoded at a time, which makes the inference process IO-bound in most GPUs. In this situation, quantizing only weights can reduce IO overhead and thus speed up inference.

There have been many weight-only quantization works in recent years, such as GPTQ [1] and AWQ [2].

Weakness 2:

decoupleQ is a quantization method, whereas normalization is a technique for effective model training. These are two completely different things.

Weakness 3:

In our paper, we focus on weight-only quantization, while NWQ and PD-Quant focus on weight-activation quantization and do not report results in weight-only settings. Nevertheless, we find these two papers creative, and we will cite them in the revision.

Questions 1-2: please refer to the responses to the weaknesses above.

 

[1] GPTQ: Accurate Post-Training Quantization for Generative Pre-trained Transformers
[2] AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

Comment

Thank the authors for the rebuttal. The rebuttal does not resolve my concern. The novelty is limited: DecoupleQ is the same as Batch Normalization or Group Normalization if you make the BN or GN parameters learnable and merge them into the normal quantization process. The effect, from my point of view, is the same. Further, comparison to current CNN quantization methods is not fully provided. Thus I keep my score and reject this paper: it does not reach the bar of NeurIPS, lacking both a demonstration of novelty and sufficient experimental comparison.

Comment

Thank you for your reply. The core of our work is quantization. To this end, we transform a quantization problem into a mathematical optimization problem. In decoupleQ, the quantization process includes not only the solution of the floating-point part $(s, z)$, but also the solution of the integer part $\widehat{W}$.

However, Batch Normalization or Group Normalization is a technique for effective training. In some model structures, their parameters can indeed be merged into $(s, z)$, but this is not the case for all model structures, especially in LLMs. If conventional quantization cannot obtain a good solution for $\widehat{W}$, then no matter how we train the normalization parameters, it will be difficult to obtain high model accuracy. decoupleQ solves the integer part $\widehat{W}$ and the floating-point part $(s, z)$ together, rather than considering only the floating-point part.

 

Thank you again for your reply. If we have not clarified your concerns, we sincerely hope that you can raise them in time, and we will reply to you as soon as possible.

Review (Rating: 5)

The paper presents decoupleQ, a post-training quantization method that improves the accuracy of quantized models, particularly at very low bit-widths (2-bit). It achieves this by separating model parameters into integer and floating-point components and formulating the quantization process as a constrained optimization problem. This approach eliminates the need for traditional quantization techniques like outlier handling and focuses on optimizing the core objective.

Strengths

  1. The paper introduces a fresh perspective on quantization by abandoning traditional methods and reframing it as a constrained optimization problem.
  2. decoupleQ demonstrates impressive results in 2-bit quantization, achieving accuracy comparable to higher precision formats like fp16/bf16 in large speech models.
  3. The quantization process is linear and uniform, making it easier to implement in hardware compared to non-uniform methods.

Weaknesses

  1. The paper's writing lacks cohesion and clarity regarding its ultimate goal. The paper also has several spelling mistakes.
  2. The authors claim to separate the model parameters into integers and floating-point components. However, as far as I understand, this practice is not a novel contribution but rather a common approach in quantization.
  3. They address a portion of the optimization problem using GPTQ and another portion similar to BRECQ.
  4. The authors acknowledge that their solution may not be optimal.
  5. The quantization process in decoupleQ can be more time-consuming than other methods.

Questions

  1. The paper mentions achieving state-of-the-art accuracy in Llama-1/2. It would be helpful to see a more detailed comparison with other state-of-the-art quantization methods on this specific model.
  2. The authors could elaborate on the potential impact of the weak correlation between PPL (perplexity) and loss in LLMs. How might this affect the practical application of decoupleQ for LLM quantization?
  3. The paper briefly mentions the risk of overfitting. Providing more insights into mitigation strategies for this risk would be beneficial, especially when dealing with underdetermined matrices.
  4. Given the longer runtime compared to other methods, it would be interesting to see a more comprehensive analysis of the trade-off between quantization time and model accuracy.
  5. The authors could consider adding a section on future work, outlining potential directions for further research and improvement of the decoupleQ method.

Limitations

Yes, however, it will be beneficial to divide the current Discussion section into separate Conclusion and Limitations sections.

Author Response

Thank you so much for taking the time to review our work, giving an accurate summary, and outlining our strengths. We will address your concerns in detail. Due to page limitations, we put the responses to the weaknesses in the "Official Comment" box.

Question 1:

GPTQ, AWQ, and OmniQuant are three of the most influential recent works on quantization, and they all report their results on Llama-1/2. We would be very grateful if you could point us to additional papers to compare with.

Question 2:

In lines 283-290, we find that the block reconstruction loss decreases monotonically as the number of iterations $K$ increases, and that the model's best PPL occurs at $K = 1$ and then fluctuates within a range as $K$ continues to increase. The loss we are talking about here is the loss between the pre- and post-quantization of the block. In the field of PTQ, we cannot directly optimize the PPL of the model, only some proxy. For example, GPTQ tries to minimize the loss of the linear layer, and BRECQ tries to minimize the loss of the block. In this situation, the weak correlation (if any) between block loss and PPL is a general problem, not specific to decoupleQ. However, this does not negate the value of PTQ, because broadly speaking, reducing the loss does lower the PPL of the model.

Question 3:

When $H$ is an underdetermined matrix (even though the size of the calibration dataset is large enough), a very effective solution is to enhance the diagonal of the matrix, i.e. $H \leftarrow H + \lambda I$. This is not just a mathematical trick, but has a strong physical meaning. Our initial optimization goal (Eq. (5) in the original paper) admits the following transformation:
$$\arg\min_{\widetilde{W}} \|X\widetilde{W}-XW_0\|_2^2 = \mathrm{tr}\{(\widetilde{W}-W_0)^T H (\widetilde{W}-W_0)\} \;\Rightarrow\; \mathrm{tr}\{(\widetilde{W}-W_0)^T (H+\lambda I) (\widetilde{W}-W_0)\} = \mathrm{tr}\{(\widetilde{W}-W_0)^T H (\widetilde{W}-W_0) + \lambda (\widetilde{W}-W_0)^T (\widetilde{W}-W_0)\}$$
The last term $\lambda (\widetilde{W}-W_0)^T(\widetilde{W}-W_0)$ is independent of $H$, and thus of the calibration dataset. It plays the role of regularization, regularizing $\widehat{W}$ to be close to $W_0$ under the naive L2 metric.
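In code, this damping is essentially a one-liner; a minimal sketch (the relative damping factor below is an illustrative choice, not the value used in our experiments):

```python
# Minimal sketch of diagonal damping: when the Hessian proxy H = X^T X is
# (near-)singular, add lambda * I before solving.
import torch

def damped_hessian(X, rel_damp=0.01):
    H = X.T @ X                                  # Hessian proxy from calibration activations
    lam = rel_damp * torch.diag(H).mean()        # lambda proportional to the average diagonal
    return H + lam * torch.eye(H.shape[0], dtype=H.dtype, device=H.device)
```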

Question 4:

In our industrial practice, the runtime of quantization via decoupleQ depends mainly on the size of the calibration dataset. As for the number of iterations $N$ defined in Alg. 1, we find that $N = 2$ or $3$ is enough for good convergence. The time cost mainly comes from the forward pass of the data through the model, so we plot the trade-off between the size of the calibration dataset and model accuracy in Figure 5. Specifically, for the 2-bit quantized speech model that has been launched in our company, which contains 32 transformer blocks with 7 billion parameters, we use 8 million tokens as the calibration dataset and train 1 epoch in each block-wise minimization process. The total time for this process is about 22 hours.

Question 5:

Thank you very much for your suggestion. After the submission, we plan to add a section on future work. It will mainly include the step-by-step derivation of the formulas when activation quantization is present. We also think further research on SFT (which trains only the floating-point part) is valuable and necessary, as it is very lightweight training.

 

Thank you again for taking the time and effort to review our paper carefully and then asking these high-quality questions. We look forward to further discussions with you. We would be grateful if you could raise the rating after addressing your concerns.

(Due to page limitations, responses to weaknesses and questions must be placed in two boxes.)

Comment

Thank you so much for taking the time to review our work, giving an accurate summary, and outlining our strengths. We will address your concerns in detail. Due to page limitations, we put the responses to the questions in the "Rebuttal" box.

Weakness 1:

The ultimate goal of our paper is to propose a novel quantization method that contributes to the industry and the community. In the organization of the paper, we first formulate the quantization problem step by step as the optimization objective (6) in Sec. 3.1 and Sec. 3.2, which is the core of our paper. Once the optimization objective (6) is proposed, it is natural to think about how to solve it, so we propose a solution in Sec. 3.3 and Sec. 3.4. After solving the optimization objective (6) with the proposed method, a linear layer within a Transformer block is quantized. In Sec. 3.5, we further fine-tune $(s, z)$ at the block level to further improve model accuracy.

Thank you very much for your careful reading, and we will check for spelling mistakes carefully in the revised version.

Weakness 2:

In the traditional quantization paradigm, the integer part and the floating-point part are calculated using the following procedure (or one of its variations):
$$s=\frac{\max(W_0)-\min(W_0)}{2^N-1} \tag{a}$$
$$z=\min(W_0)-\alpha s \tag{b}$$
$$\widehat{W} = \mathrm{clip}\left(\left\lfloor \frac{W_0-z}{s} \right\rceil, \alpha, \beta\right) \tag{c}$$
where the meanings of the above notations are explained in Sec. 3.1.

In this set of formulas, the solutions for the floating-point part and the integer part are interdependent. What we call "decoupling the model parameters into integer and floating-point parts" aims to completely decouple the two and use the optimization formulation (6) to find the optimal solution. The feasible regions of $(s, z)$ and $w$ have no dependency on each other, and they can be regarded as independent primal variables.

Judging from the results, all common quantization schemes will result in integer and floating-point parts after the model is quantized. However, we focus on the solution process and the ideas and methodology embodied in this process.
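For contrast with the decoupled formulation, the (a)-(c) procedure above can be written out directly; a minimal sketch, assuming an unsigned integer range with $\alpha = 0$ and $\beta = 2^N - 1$ (an assumption for illustration; Sec. 3.1 of the paper defines the exact notation):

```python
# The traditional per-channel min-max procedure (a)-(c): (s, z) and the integer
# part are all derived from W0 and are therefore interdependent.
import torch

def minmax_quantize(W0, n_bits):
    alpha, beta = 0.0, 2.0 ** n_bits - 1.0
    w_min = W0.min(dim=0).values                                  # per output channel
    w_max = W0.max(dim=0).values
    s = (w_max - w_min).clamp(min=1e-8) / (2 ** n_bits - 1)       # (a) scale (eps guards constant channels)
    z = w_min - alpha * s                                         # (b) zero-point
    W_int = torch.clamp(torch.round((W0 - z) / s), alpha, beta)   # (c) integer part
    return W_int, s, z
```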

Weakness 3:

When solving formula (11), we used the solution from GPTQ. But this is only one optional step in our solution process. Our core contribution is not the solution to this step, but the construction of formula (6). Formula (6) is an optimization objective, and there should be many ways to solve it; we just provide one. BRECQ aims to use block reconstruction to train the rounding of all the weights within the block, while in our work, we use block reconstruction to train $(s, z)$ while freezing the integer part. In fact, we have given three levels of optimization solutions:

  1. At the linear layer level, we should solve formula (6) to get the integer part and floating-point parts as the first step;
  2. At block level, we freeze the integer part and only train the floating-point parts via block reconstruction to further improve the model accuracy;
  3. At the model level, we can use a labelled dataset to train all the floating-point parts in the model to further improve model accuracy, or to adapt to downstream sub-tasks.

All of the three levels revolve around a common idea, which is to optimize the integer part and the floating-point part of the model separately.

Weakness 4:

The value of our paper lies in treating the integer part and floating-point part of model quantization independently, and then transforming a quantization problem into a mathematical optimization problem (6). As for how to solve this formula (6) in isolation, it is not the core contribution of this paper. There are many solution methods in the field of mathematical optimization, and we only provide one. We admit that this solution may not be the best, but even a suboptimal solution can still achieve high quantization accuracy. Doesn’t this further illustrate the superiority of our method ( transforming a quantization problem into a mathematical optimization problem (6))?

Weakness 5:

We unabashedly admitted in the paper that our quantization method is more time-consuming than other methods such as GPTQ. However, we also provide specific time costs so that engineers can choose the most appropriate time-accuracy trade-off for their application. In industry, it takes less than 40 hours (as reported in the original paper, but this can now be reduced to 20 hours by setting $J=1$, defined in line 208) to produce a quantized, high-accuracy large language model. We think this is of high industrial value.

 

Thank you again for taking the time and effort to review our paper carefully and then asking these high-quality questions. We look forward to further discussions with you. We would be grateful if you could raise the rating after addressing your concerns. Many thanks again!

Comment

Thank you for the detailed rebuttal. The authors have adequately addressed my concerns, and I am prepared to increase my rating accordingly. However, I believe the original manuscript could benefit from improved structural clarity. To enhance readability, I recommend the authors focus on improving the overall organization of the paper for the final submission.

Comment

Thank you for your recognition of our work and your valuable suggestions. We will carefully review the organization of the paper (especially Sec. 1 and Sec. 3), introduce our ideas step by step, highlight the key points, and enhance the coherence of the text. We will also check carefully to avoid spelling errors. Thank you again, and we wish you all the best.

Review (Rating: 6)

This paper proposes a novel post-training quantization method to achieve 2-bit uniform quantization on large language and speech models. The proposed method decouples the quantized values into integer and floating-point parts, which are then optimized via a constrained optimization problem that can be solved with off-the-shelf solutions. The proposed method allows uniform quantization down to extreme bits.

Strengths

  1. This paper proposes a novel optimization-based method to conduct PTQ on large models. The proposed method is solid and unique from previous methods.
  2. The proposed method achieves good performance with only uniform quantization, without special procedure for outliers etc., providing direct benefit to the runtime of the quantized model on general hardware.
  3. The limitations and future directions are clearly discussed in the paper.

Weaknesses

  1. The distinction between the proposed decoupleQ and traditional quantization methods is not clearly derived in Sec. 3.2. The statement that "(s,z) lost the traditional meaning" on line 138 is not clear. My understanding is that W, s, and z are now totally independent of the original weight w0 in the optimization process, as long as the final output error is minimized? I think adding a comparison with the optimization objective/procedure of traditional quantization here would help.
  2. The proposed method appears to be sensitive to the size of the calibration set, so that the calibration size reported in the experiments are much larger than that of the previous baselines. As it is understandable that the optimization process may require more data to avoid overfitting, it would be more fair if the baseline methods are also calibrated with the same dataset/training cost.
  3. For the LLM experiments, only ppl is used as metric. However, the ppl has been shown to be an inaccurate metric to reflect the utility of the LLM after compression. More evaluations such as zero-shot performance on downstream tasks and the instruction following ability etc., as in SqueezeLLM and OmniQuant papers, would be helpful to see if the quantized model still retains the ability as the FP one.

Questions

Please see the weakness section.

Limitations

The limitations and potential social impacts are adequately addressed.

Author Response

Thank you very much for reading our paper carefully and for your generally positive comments. We will respond to your concerns in detail as much as possible and would be grateful if you could point out any omissions.

Weakness 1:

In its traditional meaning, the whitepaper [1] explains in Sec. 2.1 that: "The scale specifies the step size of the quantizer and floating point zero maps to zero-point. Zero-point is an integer, ensuring that zero is quantized with no error."
The quantization paper [2] says in Sec. 2.1 that "the constant $S$ (for "scale") is an arbitrary positive real number... The constant $Z$ (for "zero-point") is of the same type as quantized values $q$, and is in fact the quantized value $q$ corresponding to the real value 0."

Since then, most papers have calculated $(s, z)$ and $\widehat{W}$ from $W_0$ using the following (a)-(c) procedure or its variants [1,2,3,4]:
$$s=\frac{\max(W_0)-\min(W_0)}{2^N-1} \tag{a}$$
$$z=\min(W_0)-\alpha s \tag{b}$$
$$\widehat{W} = \mathrm{clip}\left(\left\lfloor \frac{W_0-z}{s} \right\rceil, \alpha, \beta\right) \tag{c}$$
In this context, $s$ stands for the step size of the quantizer, which is generally a positive number, and $s$, $z$, and $\widehat{W}$ all depend on the full-precision $W_0$.

In decoupleQ, by contrast, $s$ and $z$ are just two primal variables without any constraints in problem (6). The optimal solution of $s$ can be 0 or negative, and we completely abandon formula (3), which is a very important step in the traditional quantization method. As you understand, $s$, $z$, and $\widehat{W}$ are now totally independent of the original weight $W_0$ in the optimization process, as long as the final output error is minimized. We obtain the values of $s$, $z$, and $\widehat{W}$ only from the optimization problem (6), not from the above (a)-(c) procedure.

Thank you for pointing out this misleading point. We can modify the sentence to: "$s$ has deviated from its meaning in the traditional quantization context and is now an unconstrained optimization variable, whose optimal solution may be 0 or even negative."

[1]: Quantizing deep convolutional networks for efficient inference: A whitepaper

[2]: Quantization and Training of Neural Networks for Efficient Integer-Arithmetic-Only Inference

[3]: AWQ: Activation-aware Weight Quantization for LLM Compression and Acceleration

[4]: OmniQuant: Omnidirectionally Calibrated Quantization for Large Language Models

Weakness 2:

Very sorry we were not clear on this point. The data size of decoupleQ is the same as the previous baseline.

In ResNet compression, previous baselines such as GPTQ used calibration datasets consisting of 1024 random training samples, and then applied standard flipping and cropping augmentations to artificially increase the size of this dataset by 10×, resulting in a total of 10240 images as the input of the model. In our experiment, we randomly choose 10240 images, with standard augmentation but not increasing the size of dataset, resulting in a total of 10240 images as the input of the model. Thank you for pointing this out. Since we did not find the source code of GPTQ to process the ImageNet dataset, we directly took 10,240 training images without increasing size to save trouble. We sincerely admit that we did not strictly align them with the previous baseline.

For Llama compression, as we write in lines 217-218 and 251-253, we use the source code from GPTQ, and we use 128 segments, each of which contains 2048 elements. This setup is strictly consistent with the previous baselines. The fair comparison results are shown in Table 3, and our results outperform previous methods by a clear margin in W2A16.

In addition, since our solution to Eq. (6) depends heavily on the Hessian matrix $H$, we find that increasing the size of the calibration data can further improve the accuracy of the quantized model in decoupleQ. This is shown in Figure 5. We think this is a very good feature. In industry, we can use enough calibration data (the total time cost for quantization is less than 20 hours at the moment) to quantize our large speech model to W2A16g64 with the same accuracy as the unquantized baseline, and it has been launched in our company's core products.

Weakness 3:

We sincerely admit that the lack of rich public experiments was indeed our shortcoming, although our PPL in Table 3 is lower than others by a large margin in W2A16. We will make our code public on GitHub (regardless of whether the paper is finally accepted or rejected), and continue to add more experimental results. We hope that reviewers can pay more attention to our innovation in theory and scheme. We also believe that the novelty of a method may outweigh the number of experiments, especially at NeurIPS. In addition, the work is tangible and can be applied in industry. We have indeed launched 2-bit quantized large speech models in multiple consecutive releases of our company's core products. After the reviewing period is completed, the identity of our company and the products launched will be made public. Also, we have released the W2A16 CUDA kernel used in our products, which is under review for merging into the NVIDIA TensorRT repo. We believe that our work will make a certain contribution to the industry and the open-source community.

   

Finally, thank you again for your valuable time reviewing our work! We are very happy and look forward to discussing any issues with you further. And we will reply to all your comments immediately.

Comment

I would like to thank the authors for the rebuttal. The rebuttal resolves my concern about the calibration size. Generally speaking, while I agree with the other reviewers that more experiments could be done to show the effectiveness of the proposed method, I believe the existing evidence is enough to show the proposed method is applicable. Given the novelty and significance of the proposed method, I increase my score to weak accept.

Comment

Thank you very much for your reply and for your recognition of our novelty and significance.

Thanks again!

Comment

Dear reviewers,

 

We are very grateful to all reviewers for their valuable comments, and we are thankful that most reviewers generally appreciate that the work is novel, solid, and fresh, as well as for their accurate summaries of our work and its strengths.

We sincerely admit that the lack of rich public experiments was indeed our shortcoming, although our PPL in Table 3 is lower than others by a large margin in W2A16. We will make our code public on GitHub (regardless of whether the paper is accepted or rejected finally), and continue to add more experimental results.

The work is tangible and can be applied in industry. We have indeed launched 2-bit quantized large speech models in multiple consecutive releases of our company's core products, with model accuracy close to bf16. After the reviewing period is over, the identity of our company and the products launched will be made public. Also, we have released the W2A16 CUDA kernel used in our products, which is under review for merging into the NVIDIA TensorRT repo.

We believe that the value of research lies in the novelty and its industrial application, rather than the number of experiments, especially in NeurIPS. And we believe that our work will make certain contributions to the industry and open-source community.

 

We deeply appreciate all the reviewers' time spent reviewing and wish you all the best!

 

Thanks a lot,
Authors

Final Decision

During the discussion period, there was much dispute about novelty. In particular, there was a mention that the proposed method is very similar to batch norm or layer norm. There were also concerns about a lack of sufficient experimental comparison. Treating the integer and floating-point parts separately is not a new direction, but the particular method suggested does seem to actually decouple the quantization process, which is new, as the authors rebut. decoupleQ's quantization does involve normalization, and there are split views on the novelty of this aspect of the paper. The approach also utilizes many existing components, although that in itself may not be a weak point. Overall, there is certainly value in this paper, but the authors should work further on including fair comparisons with more recent weight-focused quantization methods as well as improving presentation quality, especially with regard to selecting proper baselines.