Nearly Lossless Adaptive Bit Switching
Abstract
Reviews and Discussion
The paper addresses challenges in model quantization for deep neural networks (DNNs), focusing on optimizing quantization-aware training (QAT) across multiple bit-widths with weight-sharing. To this end, this paper introduces a novel quantization method that exploits the highest integer precision to achieve nearly lossless bit-switching, reducing storage without relying on full precision. Key contributions include: (1) Adaptive Learning Rate Scaling: a technique that dynamically adjusts learning rates for different precisions to address competitive interference and inconsistent-gradient issues during one-shot joint training. (2) Double Rounding: an extension of the one-step rounding quantizer used in fixed-precision quantization to improve accuracy. Experimental results on the ImageNet-1K dataset show that the proposed methods surpass state-of-the-art approaches in both multi-precision and mixed-precision scenarios, achieving higher efficiency and accuracy.
Strengths
- This submission is well-written and includes good figures in Sec. 4.
- The authors conduct extensive experiments on multiple datasets and multiple networks.
Weaknesses
- Some analysis is missing. For example, I wonder whether the second rounding leads to more quantization error: the first rounding produces INT8 weights, and the second rounding then quantizes them to a lower bit-width, so quantizing twice may cause additional clipping and rounding errors. Such an analysis would strengthen the proposed method.
- Some designs should be further clarified, e.g., why is ALRS applied only to the scaling factors? Intuitively, the weights at small bit-widths suffer large gradient variance induced by the STE, so they should also benefit from a smaller LR.
- Fig. 1 is a bit confusing; some colored arrows are not well explained.
- This work essentially lies in the area of mixed-precision quantization, so it would be better to compare with more MPQ methods (e.g., HAQ, DNAS, LIMPQ) in Sec. 4. Moreover, some recent papers on multi-bit-width quantization are missing, e.g., [1] (PTQ-based) and [2][3] (QAT-based), which could be included in the Related Work.
[1] Xu, Ke, et al. "PTMQ: Post-training Multi-Bit Quantization of Neural Networks." Proceedings of the AAAI Conference on Artificial Intelligence. Vol. 38. No. 14. 2024.
[2] Tang, Chen, et al. "Retraining-free model quantization via one-shot weight-coupling learning." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2024.
[3] Zhong, Yunshan, et al. "MultiQuant: A Novel Multi-Branch Topology Method for Arbitrary Bit-width Network Quantization." arXiv preprint arXiv:2305.08117 (2023).
Questions
Please refer to the weaknesses.
Limitations
Please refer to the weaknesses. Overall, this paper currently needs more experiments and analysis to show that some of its designs are reasonable.
We would like to thank you for your careful reading, helpful comments, and constructive suggestions, which have significantly improved the presentation of our manuscript. We are delighted with the identification of the novelty and effectiveness of the proposed method. We have carefully addressed all comments, and below are our responses:
W1: Does the second rounding lead to more quantization errors?
A: Thank you for your valuable comment. Firstly, we acknowledge that rounding twice introduces more quantization error than rounding once. However, compared to previous methods, the advantage of our method is that it only saves the quantization parameters (i.e., scales) for the highest bit (e.g., INT8). This allows us to store only the highest-bit weights and switch to lower bits while eliminating the multiplication and division operations brought by multiple scales during adaptive bit-switching, thereby accelerating the model's inference. Additionally, model training can compensate for this loss, achieving nearly lossless bit-switching. The ultimate goal of this approach is to achieve faster inference with adaptive bit-switching on hardware.
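For illustration only, here is a minimal sketch of the double-rounding idea described above; the function name, the symmetric clipping ranges, and the tensor shapes are our own assumptions, not the exact implementation:

```python
import torch

def double_rounding(w_fp, scale8, target_bits):
    """Sketch: weights are first rounded to INT8 with a single shared scale,
    then rounded again to a lower bit-width. Only `scale8` (the 8-bit scale)
    needs to be stored; the lower-bit scale follows the power-of-two relation."""
    qmax8 = 2 ** 7 - 1
    w_int8 = torch.clamp(torch.round(w_fp / scale8), -qmax8 - 1, qmax8)  # first rounding
    if target_bits == 8:
        return w_int8, scale8
    k = 8 - target_bits
    qmax = 2 ** (target_bits - 1) - 1
    w_low = torch.clamp(torch.round(w_int8 / 2 ** k), -qmax - 1, qmax)   # second rounding
    return w_low, scale8 * 2 ** k

w = torch.randn(16)
s8 = w.abs().max() / (2 ** 7 - 1)
w8, _ = double_rounding(w, s8, target_bits=8)   # stored INT8 weights
w4, s4 = double_rounding(w, s8, target_bits=4)  # switched to 4-bit, no extra scale stored
```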
W2: Why is ALRS applied only to the scaling factors and not to the weights of small bit-widths?
A: Thank you for your kind comments. We apologize for the lack of clarification in that statement. Firstly, ALRS is only applicable to the quantization scale factor because it is quite sensitive during quantization training and determines the final model's convergence performance. In our multi-precision or mixed-precision scenarios, weights are shared, and directly scaling their values would cause other precisions to fail to converge due to severe competition between different precisions. Additionally, applying the ALRS strategy to the quantization scales of different precisions (including small bit-widths) is indirectly equivalent to scaling the weights of that bit-width, because
$s \leftarrow s - \eta_s \cdot g_s$,
$w \leftarrow w - \eta_w \cdot g_w$,
where $g_s$ and $g_w$ denote the gradients of $s$ and $w$ respectively, $\eta_s$ denotes the adaptive learning rate of the quantization scaling factors, and $\eta_w$ denotes the learning rate of the weights. Even so, we also tried applying ALRS to the weights but did not achieve better performance. We hope this explanation clarifies the rationale behind the argument.
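As a purely illustrative sketch of where the per-precision learning rate enters these updates (the gradient-norm-based adaptation below is a stand-in, not the exact ALRS rule from the paper):

```python
import torch

def alrs_sgd_step(weight, scale, lr_w=1e-3, base_lr_s=1e-3, eps=1e-8):
    """One simplified SGD step after back-propagating the loss of one precision.
    The shared weights use a single learning rate; the quantization scale uses a
    learning rate adapted from its gradient norm (a stand-in for the ALRS rule)."""
    with torch.no_grad():
        lr_s = base_lr_s / (scale.grad.norm() + eps)  # adaptive LR for the scale only
        weight -= lr_w * weight.grad                  # w <- w - eta_w * g_w
        scale -= lr_s * scale.grad                    # s <- s - eta_s * g_s
        weight.grad = None
        scale.grad = None
```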
W3: Fig. 1 is a bit confusing; some colored arrows are not well explained.
A: Thank you for your kind feedback. In Figure 1: (1) the red arrows indicate the representations of different precisions obtained through double rounding; the green arrows indicate the joint optimization using the ALRS technique to obtain the multi-precision model, followed by the HASB technique to further obtain a mixed-precision model. (2) The black arrows in the network structure indicate the data flow between networks. (3) The gray dashed lines represent the techniques required at different stages. We will further refine Figure 1 in the revised manuscript.
W4: It is better to compare with more MPQ research (e.g., HAQ, DNAS, LIMPQ) and to include [1] (PTQ-based) and [2][3] (QAT-based) in the Related Work.
A: Thank you for your suggestions. We have carefully read the literature you provided: HAQ [4], DNAS [5], and LIMPQ [6]. We find that DNAS mainly designs a differentiable NAS to search for a hardware-aware efficient network without considering quantization. However, we will incorporate DNAS's efficient NAS techniques to further enhance the performance of our mixed-precision quantization. Regarding HAQ, it uses reinforcement learning to integrate the hardware accelerator's feedback into the design loop for the quantization strategy. However, the learned networks are hardware-customized and do not follow specific bit-width standards, so a direct fair comparison with our method may not be possible. Nevertheless, we will attempt to combine HAQ's techniques in the future to further improve the performance of our method. For LIMPQ, please refer to our response to reviewer bT7w W4. Lastly, we promise to include the literature you suggested, [1] (PTQ-based) and [2][3] (QAT-based), in the related work section of the revised manuscript.
[4] HAQ: Hardware-Aware Automated Quantization with Mixed Precision
[5] FBNet: Hardware-Aware Efficient ConvNet Design via Differentiable Neural Architecture Search
[6] Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance
Dear reviewer T2Zq,
Thanks for your positive comments. We will add the additional results, discussions, and references in the final revision. We are pleased to have answered your questions and concerns. We sincerely hope you will consider raising the final score; if so, we will be greatly encouraged and will continue contributing to the community. Thanks again for your time and effort on this paper.
Best regards
Authors
Dear Reviewer T2Zq,
Sorry to bother you again. Thank you for your valuable reply. We have carefully addressed the points raised and have incorporated the suggestions provided to further strengthen the manuscript. Notably, other reviewers have expressed positive evaluations, which we have reflected upon and integrated into the revisions. Given these enhancements and the overall positive reception, we kindly ask if you would consider re-evaluating your score, taking into account the improvements made.
Your thoughtful reassessment would be greatly appreciated and would contribute significantly to the overall quality and impact of our work.
Thank you once again for your time and consideration.
Best regards,
Authors
Thanks for the response. I have carefully read the rebuttal and the other reviews; the response addressed my concerns. I will increase my rating, and I hope the authors can incorporate these additional results, discussions, and references into the revision. BTW, DNAS refers to the paper entitled "Mixed precision quantization of convnets via differentiable neural architecture search".
The paper proposes a QAT scheme to jointly optimize a single model with different precisions. The authors apply their scheme to various CNN-based models on the CIFAR-10 and ImageNet datasets.
Strengths
- The paper is well-written
- The ablation study is strong in my opinion and they evaluate various aspects of their scheme
Weaknesses
- I think the main limitation of the paper is the models and datasets. I believe the study should be done on larger models (LLMs, for example) as the target architecture. For example, the authors show that they do not save an FP32 master copy of the model in their scheme. However, ResNet-style models (or MobileNet) easily fit on even moderate GPUs, and I don't think the FP32 master copy is a big problem in that case (please correct me if I'm wrong).
- I couldn't find source code to reproduce the results of the paper on my side.
Questions
- The paper assumes a power-of-two relation between different precisions. Do you have any theoretical intuition for this assumption, or is it just because of fast HW implementation?
- What is the cost of HMT in different networks? How can we compare it against the e2e runtime?
Limitations
yes.
We would like to thank you for your careful reading, helpful comments, and constructive suggestions, which have significantly improved the presentation of our manuscript. We have carefully addressed all comments, and below are our responses:
W1: The study should be done on larger models (LLMs), and the FP32 master copy is not a big problem on moderate GPUs.
A: Thank you for your kind comments. Sorry for the confusion. Firstly, the advantage of saving only the highest integer bit-width (INT8), as we propose, may lie in small terminal devices (e.g., smartphones, small drones, etc.). Hardware implementation is not limited to GPUs but may include CPUs or FPGAs with limited resources and communication bandwidth, as well as Arm processors. Secondly, we acknowledge your point that the technique we propose may have greater advantages in the field of LLMs. Due to limited time and resources, we conducted multi-precision experiments on a small LLM [1] without using ALRS and distillation; please refer to Table 1 of the attached PDF in the global response. Note that we do not quantize the embedding layer or the head layer, and because the sensitivity of the SiLU activation causes non-convergence, we also do not quantize the SiLU activation in the MLP and set batch size = 16. The training process is shown in Figure 1 (TinyLLaMA-1.1B) of the attached PDF.
W2: Couldn't find source code to reproduce the results of the paper.
A: Thank you for your kind feedback. Our open-source code can be accessed by clicking the last word "here" in the abstract, which is a hyperlink. As reviewer rFvS also points out, one of our advantages is that "Code is given."
Q1: Is there any theoretical intuition for assuming a power-of-two relation between different precisions, or is this just for fast HW implementation?
A: Thanks for the nice question. We apologize for the lack of clarification in that statement. The “power-of-two relation” arises because we use the same quantization parameters, i.e., scale, for models of different precisions when learning the weights. Switching from a higher bit-width to a lower bit-width is equivalent to clipping the lower bits of the weight values, retaining the higher bits. For example, the first four bits of the 8-bit weight are the same as the 4-bit weight. This is also equivalent to multiplying the scale value by 2 to the power of k, where k is the difference between the two bit-widths. Of course, in conventional multi-precision training where quantization parameters are not shared among different precisions, there is no “power-of-two relation” in their scales. The ultimate goal of this approach is to achieve faster inference with adaptive bit-switching on hardware. We hope this explanation clarifies the rationale behind the argument.
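A tiny numeric illustration of this point (our own hypothetical example, not taken from the paper):

```python
# Hypothetical example: an INT8 weight value and its 4-bit counterpart
# obtained by keeping only the high bits.
w_int8 = 0b01101101            # 109, stored as INT8
k = 8 - 4                      # difference between the two bit-widths
w_int4 = w_int8 >> k           # keep the high bits: 0b0110 = 6 (floor-style shift;
                               # the paper's double rounding rounds instead of flooring)
s = 0.02                       # shared scale tied to the 8-bit level
print(w_int8 * s)              # 2.18 (8-bit dequantized value)
print(w_int4 * (s * 2 ** k))   # 1.92 (4-bit dequantized value; scale multiplied by 2^k)
```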
Q2: What is the cost of HMT in different networks? How can we compare it against the e2e runtime?
A: Thank you for your kind question. The cost of computing the HMT (Hessian Matrix Trace) in different networks takes a few minutes (e.g., approximately 2 minutes for ResNet18 and 5 minutes for ResNet50). Since this computation is done offline and then directly used for end-to-end inference, this part of the computation can be considered negligible.
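For context, Hessian traces in this line of work are commonly estimated stochastically with Hutchinson's method (as in HAWQ-style approaches); the generic sketch below is our own illustration and not necessarily the exact offline HMT procedure used here:

```python
import torch

def hutchinson_trace(loss_fn, params, n_samples=50):
    """Estimate the Hessian trace tr(H) = E[v^T H v] with Rademacher vectors v.
    Generic sketch, not necessarily the paper's exact offline HMT computation."""
    trace = 0.0
    for _ in range(n_samples):
        loss = loss_fn()                                            # forward pass on a mini-batch
        grads = torch.autograd.grad(loss, params, create_graph=True)
        vs = [torch.randint_like(p, 2) * 2.0 - 1.0 for p in params]  # Rademacher +/-1
        gv = sum((g * v).sum() for g, v in zip(grads, vs))
        hvs = torch.autograd.grad(gv, params)                       # Hessian-vector product H v
        trace += sum((hv * v).sum().item() for hv, v in zip(hvs, vs)) / n_samples
    return trace
```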
[1] TinyLlama: An Open-Source Small Language Model.
Thank you so much for answering my questions and providing more results. For some reason, I couldn't find the code :-)
I will increase my score.
Dear reviewer Aa14,
Thanks for your positive comments. We are sorry that you cannot find the code we provided for some reason. We will forward an anonymous code link to you via the AC, and we promise to make the code publicly available on GitHub later. We are pleased to have answered your questions and concerns. We will add the relevant discussions in the final version. We sincerely hope you will consider raising the final score; if so, we will be greatly encouraged and will continue contributing to the community. Thanks again for your time and effort on this paper.
Best regards
Authors
Dear Reviewer Aa14,
Sorry to bother you again. Thank you for your valuable reply. We have carefully addressed the points raised and have incorporated the suggestions provided to further strengthen the manuscript. Notably, other reviewers have expressed positive evaluations, which we have reflected upon and integrated into the revisions. Given these enhancements and the overall positive reception, we kindly ask if you would consider re-evaluating your score, taking into account the improvements made.
Your thoughtful reassessment would be greatly appreciated and would contribute significantly to the overall quality and impact of our work.
Thank you once again for your time and consideration.
Best regards,
Authors
This paper discusses advanced methods in multi-bit model quantization. Specifically, this paper proposes a method for one-shot joint training of multiple precisions. To this end, the authors introduce a double-rounding quantizer that leverages the highest integer precision to achieve nearly lossless bit-switching while reducing storage requirements. Moreover, they also propose an Adaptive Learning Rate Scaling technique that adjusts learning rates dynamically for different precisions. The two proposed techniques mitigate the competitive interference between bit-widths caused by inconsistent gradients of different precisions during biased gradient estimation. They also extend their Double Rounding method to support one-shot mixed-precision training and develop a Hessian-aware Bit-width sampling strategy. Experimental results on the ImageNet-1K classification task show that their methods outperform state-of-the-art one-shot joint QAT in both multi-precision and mixed-precision scenarios.
Strengths
- Eliminating the costs of retraining for mixed-precision quantization is a meaningful and challenging topic.
- The end-to-end experiments are sufficient, and the presentation is good.
Weaknesses
- More uniqueness analysis is needed. The use of Hessian information seems a bit trivial; each layer's Hessian is just compared with the averaged Hessian trace. Firstly, as shown in recent zero-cost NAS research [1], architectural proxies become less effective as training goes on, so I'm not sure the Hessian information obtained on the initial full-precision model remains useful as the quantization-aware training continues. Moreover, the sampling probability is modified with a simple ascending heuristic, which is not Hessian-aware.
- Also applies here: the design of the double-rounding quantizer is similar to Bit-Mixer, Adabits, and ABN. Specifically, ABN also uses
- ALRS needs further ablations. In ALRS, the authors use fixed scaling ratios for different bit-widths, e.g., 8-bit is 1, 6-bit is 0.1, and 4-bit is 0.01; the choice of these scaling factors still requires more ablation studies and discussion.
- More comparisons needed. Since this paper adopts an ILP-based search algorithm to find optimal subnets, it is better to compare with these ILP-based mixed-precision quantization papers, e.g., [2] and [3].
[1] A Deeper Look at Zero-Cost Proxies for Lightweight NAS.
[2] Mixed-Precision Neural Network Quantization via Learned Layer-wise Importance, ECCV 2022.
[3] HAWQ-V2: Hessian Aware Trace-Weighted Quantization of Neural Networks, NeurIPS 2020.
Questions
- How do the authors perform KD for the proposed method? Is an external teacher used, or is the model only distilled from the highest precision with in-place distillation?
Limitations
NA
We would like to thank you for your careful reading, helpful comments, and constructive suggestions, which have significantly improved the presentation of our manuscript. We are delighted with the identification of the novelty and effectiveness of the proposed method. We have carefully addressed all comments, and below are our responses:
W1: Using Hessian information seems a bit trivial, and the sampling probability is modified with a simple ascending heuristic, which is not Hessian-aware.
A: Thank you for your valuable comment. Firstly, we acknowledge that this method still has room for improvement, and we will refine it in the future by incorporating zero-cost NAS research or more advanced algorithms. The sampling probability is indeed modified using a simple heuristic based on the sensitivity (Hessian trace) of different layers during the training phase. During the inference phase, when using ILP, the Hessian information is also used as a constraint factor to align with the training phase, thereby avoiding the need for retraining, which can arguably be considered Hessian-aware. We hope this explanation clarifies the rationale behind the argument.
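To make the heuristic concrete, here is a rough sketch of this kind of Hessian-guided bit-width sampling; the specific probability weights are our own assumption, and only the comparison against the average trace follows the description above:

```python
import random

def sample_bitwidth(layer_trace, avg_trace, candidates=(8, 6, 4, 2)):
    """Illustrative Hessian-aware sampling: layers more sensitive than average
    (larger Hessian trace) are biased toward higher bit-widths, and less
    sensitive layers toward lower ones. The exact weights are our assumption."""
    if layer_trace >= avg_trace:
        weights = (4, 3, 2, 1)   # favor 8/6-bit for sensitive layers
    else:
        weights = (1, 2, 3, 4)   # favor 4/2-bit for insensitive layers
    return random.choices(candidates, weights=weights, k=1)[0]

# Usage sketch: bits = [sample_bitwidth(t, sum(traces) / len(traces)) for t in traces]
```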
W2: The design of the double-rounding quantizer is similar to Bit-Mixer, Adabits, and ABN.
A: Thank you for your thoughtful comment. Similar to Bit-Mixer, Adabits, and ABN, our method learns quantization parameters for different bits through shared weights, but the specific implementations differ. Firstly, Adabits and Bit-Mixer primarily achieve bit-switching through the Floor operation, while our method switches using two Rounding operations. Secondly, ABN's formula appears similar to our double rounding, but the key difference is that in ABN different precisions use different quantization parameters, whereas our double rounding only updates the quantization parameters of the highest INT weight during training. Finally, in bit-switching scenarios, storing only one shared quantization parameter with our double rounding is hardware-friendly and reduces the computation overhead of the scale (a floating-point value).
W3: The scaling factors of ALRS need further ablations.
A: Thank you for your valuable suggestion. We have conducted more ablation experiments on scaling factors, and the results can be seen in Table 4 of the attached PDF in the global response. It can be observed that the performance differences of the model under different scaling factor settings are not significant, further proving the effectiveness of ALRS.
W4: It is better to compare with these ILP-based mixed-precision quantization papers, e.g., [2] and [3].
A: Thanks for your kind suggestion. We have carefully read the literature [2] and [3] you provided. However, we find that only the result in Table 2 of [2] for ResNet18's 3MP (Top-1: 69.7) can be fairly compared with ResNet18's 3MP (Top-1: 69.92) in Table 3 of our main text. The other configurations refer either to ResNet50's w-bits=3MP, a-bits=4MP in [2] or to ResNet50's w-bits=2MP, a-bits=4MP in [3]. We will further attempt similar bit-width configurations for a fair comparison in the revised version of the paper.
Q1: How is KD performed for the proposed method?
A: Thank you for your valuable question. We drew inspiration from the progressive in-place distillation of [4], i.e., self-distillation. However, we apply it to the multi-precision quantization method in this paper. Specifically, higher bit-widths distill to their neighboring lower bit-widths, such as 8-bit distills to 6-bit, 6-bit distills to 4-bit, and 4-bit distills to 2-bit.
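A minimal sketch of this neighbor-to-neighbor in-place distillation (the loss weighting, the KL formulation, and the bit-switchable forward API are illustrative assumptions, not the exact recipe):

```python
import torch
import torch.nn.functional as F

def multi_precision_loss(model, x, y, bit_list=(8, 6, 4, 2), kd_weight=1.0):
    """Each precision is supervised by the labels plus a KD term from the logits
    of the next-higher precision (8->6, 6->4, 4->2), with the higher branch
    detached so it acts as the teacher."""
    total, teacher_logits = 0.0, None
    for bits in bit_list:                       # assumes descending bit order
        logits = model(x, bits=bits)            # hypothetical bit-switchable forward
        loss = F.cross_entropy(logits, y)
        if teacher_logits is not None:
            loss = loss + kd_weight * F.kl_div(
                F.log_softmax(logits, dim=-1),
                F.softmax(teacher_logits.detach(), dim=-1),
                reduction="batchmean",
            )
        total = total + loss
        teacher_logits = logits
    return total
```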
[4] Self-Knowledge Distillation with Progressive Refinement of Targets.
Thanks for addressing the questions and comments in the previous round. I have also read the other reviewers' comments and remain positive about my rating.
Dear Reviewer bT7w,
Thanks for your positive comments. We are pleased to have addressed your questions and concerns. We will add the relevant discussions in the final version. We sincerely hope you will consider raising the final score; if so, we will be greatly encouraged and will continue contributing to the community. Thanks again for your time and effort on this paper.
Best regards,
Authors
Dear Reviewer bT7w,
Sorry to bother you again. Thank you for your valuable reply. We have carefully addressed the points raised and have incorporated the suggestions provided to further strengthen the manuscript. Notably, other reviewers have expressed positive evaluations, which we have reflected upon and integrated into the revisions. Given these enhancements and the overall positive reception, we kindly ask if you would consider re-evaluating your score, taking into account the improvements made.
Your thoughtful reassessment would be greatly appreciated and would contribute significantly to the overall quality and impact of our work.
Thank you once again for your time and consideration.
Best regards,
Authors
The authors propose a bit-switching quantization method using Double Rounding, which applies rounding twice to achieve nearly lossless switching without storing a full-precision model. They also introduce Adaptive Learning Rate Scaling (ALRS) to adjust learning rates dynamically across precisions, ensuring consistent quantization updates. Additionally, they develop Hessian-Aware Stochastic Bit-switching (HASB) for one-shot mixed-precision training, optimizing bit-width distribution based on layer sensitivity, thus eliminating retraining stages.
Strengths
- The ALRS heuristic can help practitioners who wish to train a multi-precision model.
- The authors conduct extensive experiments on vision models and compare to previous methods.
- Most sections are well written.
- Code is given.
Weaknesses
Novelty is limited, and I am not strongly convinced that the problem is important.
- The main contribution is to not store the 32-bit weights and separate quantization parameters, but only the highest bit-width, using a pretty straightforward idea of double rounding during training.
- The ALRS is based on an observation and a heuristic to fix it. It is nice and also helps when trying to use 2 bits. Yet, I am not sure it is important for methods that don't use double rounding.
Motivation
- Since we usually don't switch models based on data, I am not sure why this is important. Do we really have edge devices that switch model precision on a daily basis and thus need to store the 32-bit model in small local memory? Can you elaborate on why and where multi-precision is really important?
- No results on more recent models (LLMs).
Questions
I don't understand why you claim the ALRS method was inspired by LARS. The only similarity seems to be the name. Can you explain the connection?
Can you provide a scenario where your method is particularly important?
You state that "if different precision losses separately compute gradients and directly update shared parameters at each forward process, it attains better accuracy when combined with our ALRS training strategy." However, this involves updating the shared parameters four times more frequently, which is inefficient (4x backward passes and optimizer steps). This seems equivalent to small-batch versus large-batch training. Have you considered simply increasing the learning rate with the conventional multi-precision training approach? It might achieve the same results.
In Table 4, have you run the same number of forward/backward passes with the {8, 6, 4, 2} and {4, 3, 2} bit-widths? Since you have only three bit-widths in the latter, you might need to run 4/3 more iterations to ensure the total number of updates is the same. Have you tried that?
Limitations
The authors partially discuss limitations.
We would like to thank you for your careful reading, helpful comments, and constructive suggestions, which have significantly improved the presentation of our manuscript. We have carefully addressed all comments, and below are our responses:
W1: The method uses a pretty straightforward idea of double rounding during training.
A: Thank you for your comment. Although the concept of double rounding appears simple, learning shared quantization parameters that maintain the characteristics of low integer precision makes it possible to switch to other lower bits almost losslessly. Our study verifies both the feasibility and efficiency of this method.
W2: Is ALRS important for methods that don’t use double rounding?
A: Thanks for your constructive comment. In fact, ALRS is a general technique for multi-precision training. We have conducted ablation experiments on other methods; please refer to Table 2 of the attached PDF in the global response. The results further validate the effectiveness of our proposed ALRS.
W3: Are there edge devices that switch model precision daily, and why and where is multi-precision important?
A: Thank you for your valuable comment. Unfortunately, there currently seem to be no specific edge devices that implement adaptive bit-switching during the inference phase. However, existing hardware already has the capability for such bit-switching (for example, the Nvidia A100 supports INT8, INT4, and Binary [2], and small AI chips support INT4 and INT2 [3]). This potential has not yet been fully exploited, but we believe this technology will become widespread in the future for the following reasons:
- Model Compression: Different terminal devices provide different storage and computational resources. To reduce the cost of repeated quantization training while providing multiple versions of models with different sizes [1], multi-precision is an effective means to solve this problem.
- Scenario-Based Precision Switching: Depending on different scenarios, real-time switching of model precision can be implemented. For instance, in the field of autonomous driving, complex road scenarios require high precision to ensure real-time accuracy, while simple road conditions can switch to lower precision to save energy.
- Large Language Models (LLMs): For complex logical tasks, a high-precision model may be needed to provide more reliable answers (similar to GPT4-pro), whereas for simple conversational tasks, a low-precision model can provide reliable answers (similar to GPT4o or GPT4o-mini) to accelerate inference.
W4: No results on more recent models (LLMs)?
A: Thank you for your nice comment. Due to limited time and resources, we conducted multi-precision experiments on a small LLM [4] without using ALRS and distillation; please refer to Table 1 of the attached PDF. Note that we do not quantize the embedding layer or the head layer, and because the sensitivity of the SiLU activation causes non-convergence, we also do not quantize the SiLU activation in the MLP and set batch size = 16. The training process is shown in Figure 1 (TinyLLaMA-1.1B) of the attached PDF.
Q1: Explain the connection between the ALRS method and LARS, which inspired it.
A: Thank you for your comment. LARS first uses a separate learning rate for each layer instead of each weight. Secondly, it controls the magnitude of updates according to the weight norm to better manage the training speed. Our ALRS borrows LARS's norm-based update strategy but replaces the weight norm with the gradient norm, adapting the learning rate to the different precisions. We hope this explanation clarifies the rationale behind the argument.
Q2: Provide a scenario where your method is particularly important.
A: Please refer to the answer to W3.
Q3: Simply increasing the learning rate with the conventional multi-precision training approach?
A: Thanks for your kind comment. Initially, we attempted conventional multi-precision training by simply increasing the learning rate, but this led to non-convergence and even training collapse. We attribute this to the excessive sensitivity of the quantization scale and the severe competition between different precisions.
Q4: Need to run 4/3x more iterations to ensure the same total number of updates.
A: Thanks for the nice suggestion. We have conducted experiments on {8, 6, 4, 2}-bit and {4, 3, 2}-bit using more epochs. Please refer to Table 3 of the attached PDF, where we find a slight improvement.
[1] Once-for-all: Train One Network and Specialize It for Efficient Deployment.
[2] NVIDIA A100 Tensor Core GPU Architecture.
[3] A 7-nm Four-Core Mixed-Precision AI Chip With 26.2-TFLOPS Hybrid-FP8 Training, 104.9-TOPS INT4 Inference, and Workload-Aware Throttling.
[4] TinyLlama: An Open-Source Small Language Model.
I would like to thank the authors for their detailed responses. The authors addressed my questions effectively. Due to their additional experiments and the fact that ALRS has been shown to improve accuracy for other multi-precision methods, I have raised my score to 5.
Dear reviewer rFvS,
Thanks for your positive comments. We are pleased to have answered your questions and concerns. We will add the relevant discussions in the final version. We sincerely hope you will consider raising the final score; if so, we will be greatly encouraged and will continue contributing to the community. Thanks again for your time and effort on this paper.
Best regards
Authors
Dear Reviewer rFvS,
Sorry to bother you again. Thank you for your valuable reply. We have carefully addressed the points raised and have incorporated the suggestions provided to further strengthen the manuscript. Notably, other reviewers have expressed positive evaluations, which we have reflected upon and integrated into the revisions. Given these enhancements and the overall positive reception, we kindly ask if you would consider re-evaluating your score, taking into account the improvements made.
Your thoughtful reassessment would be greatly appreciated and would contribute significantly to the overall quality and impact of our work.
Thank you once again for your time and consideration.
Best regards,
Authors
We extend our sincerest gratitude to the AC and reviewers for their constructive comments, which greatly improve this work!
The paper proposes algorithms for multi-bit-width quantization where parameters can be chosen at different bit-widths and trained in a single-shot manner. The double-rounding quantizer helps reduce storage requirements, and ALRS dynamically adjusts the learning rate per precision. The reviewers highlight limited novelty, missing comparisons to recent literature on multi-bit-width quantization, weak results on domains outside of CV (LLMs), and limited ablation studies. I recommend rejection in its current state, but encourage the authors to address the reviewer comments and strengthen the manuscript for a future publication.