PaperHub
Overall rating: 4.8/10 (Rejected; 4 reviewers)
Individual ratings: 5, 5, 6, 3 (min 3, max 6, std. dev. 1.1)
Confidence: 3.8
Correctness: 2.8 | Contribution: 2.5 | Presentation: 2.8
ICLR 2025

Dissecting Bit-Level Scaling Laws in Quantizing Vision Generative Models

OpenReview | PDF
Submitted: 2024-09-27 | Updated: 2025-02-05

Abstract

Keywords

quantization, visual generative models, scaling laws

Reviews and Discussion

Review 1 (Rating: 5)

This paper explores bit-level scaling laws for model quantization. In addition, TopKLD is introduced to improve the bit-level scaling performance of decoder-only models.

Strengths

  1. This paper conducts extensive experiments on VAR and DiT to explore scaling laws at the bit level.
  2. The finding that language-style models enjoy better bit-level scaling laws is interesting.
  3. TopKLD appears effective across various quantization settings for VAR.

Weaknesses

  1. The paper reads more like an experimental report than a research paper. The comparison between VAR and DiT is too lengthy, while the treatment of TopKLD is too brief.
  2. The VAR models studied are small. Is there a sufficient need to quantize such small models?
  3. Can you provide a direct visualization result that clearly shows the bit-level scaling law?

Questions

See weaknesses.

Comment

References

[1] Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis[J]. Advances in neural information processing systems, 2021, 34: 8780-8794.

[2] Ho J, Saharia C, Chan W, et al. Cascaded diffusion models for high fidelity image generation[J]. Journal of Machine Learning Research, 2022, 23(47): 1-33.

[3] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 10684-10695.

[4] Peebles W, Xie S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 4195-4205.

[5] Gao S, Zhou P, Cheng M M, et al. Masked diffusion transformer is a strong image synthesizer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 23164-23173.

[6] Liu Q, Zeng Z, He J, et al. Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models[J]. arXiv preprint arXiv:2406.09416, 2024.

[7] Gu S, Chen D, Bao J, et al. Vector quantized diffusion model for text-to-image synthesis[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 10696-10706.

[8] Tang Z, Gu S, Bao J, et al. Improved vector quantized diffusion models[J]. arXiv preprint arXiv:2205.16007, 2022.

[9] Chang H, Zhang H, Jiang L, et al. Maskgit: Masked generative image transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11315-11325.

[10] Li T, Katabi D, He K. Self-conditioned image generation via generating representations[J]. arXiv preprint arXiv:2312.03701, 2023.

[11] Yu L, Lezama J, Gundavarapu N B, et al. Language Model Beats Diffusion--Tokenizer is Key to Visual Generation[J]. arXiv preprint arXiv:2310.05737, 2023.

[12] Yu Q, Weber M, Deng X, et al. An Image is Worth 32 Tokens for Reconstruction and Generation[J]. arXiv preprint arXiv:2406.07550, 2024.

[13] Weber M, Yu L, Yu Q, et al. Maskbit: Embedding-free image generation via bit tokens[J]. arXiv preprint arXiv:2409.16211, 2024.

[14] Razavi A, Van den Oord A, Vinyals O. Generating diverse high-fidelity images with vq-vae-2[J]. Advances in neural information processing systems, 2019, 32.

[15] Esser P, Rombach R, Ommer B. Taming transformers for high-resolution image synthesis[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 12873-12883.

[16] Lee D, Kim C, Kim S, et al. Autoregressive image generation using residual quantization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11523-11532.

[17] Yu J, Li X, Koh J Y, et al. Vector-quantized image modeling with improved vqgan[J]. arXiv preprint arXiv:2110.04627, 2021.

[18] Tian K, Jiang Y, Yuan Z, et al. Visual autoregressive modeling: Scalable image generation via next-scale prediction[J]. arXiv preprint arXiv:2404.02905, 2024.

[19] Sun P, Jiang Y, Chen S, et al. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation[J]. arXiv preprint arXiv:2406.06525, 2024.

[20] Chung Y A, Tang H, Glass J. Vector-quantized autoregressive predictive coding[J]. arXiv preprint arXiv:2005.08392, 2020.

Comment

Weakness 2

Thank you very much for your suggestion. To address your concerns regarding the model, we conducted the same experiments on other models, as detailed in Appendix C. It can be observed that, owing to the influence of its continuous representation space, MAR, despite exhibiting excellent scaling laws, does not demonstrate superior bit-level scaling laws, similar to DiT. In contrast, LLaMaGen, which shares VAR's discrete representation space, exhibits outstanding bit-level scaling laws.

Additionally, we explain the impact of model size on bit-level scaling laws based on their underlying principles. First, our research focuses on analyzing the differences in scaling trends between models that already exhibit superior scaling laws, rather than on specific model sizes. To ensure clarity, we have aligned the initial total bits in Figure 1 of the paper, providing a clearer picture. To assess whether a model exhibits strong bit-level scaling laws, one must compare the internal trends of the model (e.g., 8-bit VAR vs. 16-bit VAR). As shown in Figure 1 of the paper, regardless of the quantization method, when full-precision VAR is quantized to lower bit precision, the overall scaling-law curve of the model shifts towards the lower-left corner. DiT does not exhibit this behavior. This outstanding characteristic of the discrete model enables us to increase model parameters through quantization under limited resources, leading to better generative performance, which is not possible for continuous models. This is precisely what bit-level scaling laws aim to demonstrate.

More importantly, our work shows that quantization is no longer just about reducing model size. By optimizing both model design and quantization techniques to achieve superior bit-level scaling laws, we can obtain better generative performance under the same resource constraints. This outstanding feature is something that researchers should pay more attention to.
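To make "shifts towards the lower-left corner" concrete, the following minimal Python sketch shows how such a comparison can be read off the numbers in this thread: parameter counts are taken from Table 1 below and the FID values from the W16A16 and W4A16 TopKLD rows of the quantization tables; the script itself is ours, for illustration only.

```python
# FID vs. total model bits for VAR: full precision vs. 4-bit (TopKLD).
params = {"d16": 310e6, "d20": 600e6, "d24": 1.0e9, "d30": 2.0e9}  # Table 1
fid = {
    16: {"d16": 3.30, "d20": 2.57, "d24": 2.19, "d30": 1.92},  # W16A16 FP
    4:  {"d16": 3.82, "d20": 2.95, "d24": 2.53, "d30": 2.12},  # W4A16 TopKLD
}
for bits, curve in fid.items():
    for size in ("d16", "d20", "d24", "d30"):
        total_bits = params[size] * bits  # model bits = #params x weight width
        print(f"{size} @ W{bits}: {total_bits:.2e} model bits, FID {curve[size]:.2f}")
```

Running this shows, for instance, that 4-bit d30 reaches FID 2.12 with 8.0e9 model bits, while 16-bit d20 needs 9.6e9 model bits to reach only FID 2.57: the quantized curve sits to the lower-left of the full-precision one, which is exactly the shift described above.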

Weakness 3

We apologize for any confusion caused by the phrasing in the paper. We hope that the explanation of Figure 1 in the main text helps clarify the concept of bit-level scaling laws and this interesting phenomenon. As shown in Figure 1, VAR, after being quantized to lower bit precision, shows a shift of its scaling-law curve towards the lower-left region, which reflects its superior bit-level scaling laws. By leveraging this outstanding feature, we can increase the model parameters under limited resources (e.g., in specific deployment scenarios such as mobile devices or edge computing) while maintaining efficiency, ultimately improving generative capabilities. In contrast, for continuous diffusion-style models, regardless of the quantization method used, the quantized model shows almost no improvement compared to full precision. Bit-level scaling laws thus serve as a strong predictor of model performance.

This paper indicates that achieving optimal bit-level scaling behavior requires a synergistic interaction between model design and quantization algorithms. Our study is an essential step towards understanding how various models and quantization methods influence bit-level scaling behavior, and it also provides valuable recommendations for future work.

评论

Weakness 1

We greatly appreciate your valuable review comments. We apologize for any confusion caused by the phrasing in the paper and hope our response can clarify your concerns regarding the statement: "The paper is more like an experimental report than a research paper." Furthermore, TopKLD is just one part of our research. The goal of this paper is not merely to propose an improvement to existing methods, but rather to conduct an in-depth study of the bit-level scaling laws in vision generative models, addressing the "What," "Why," and "How" from the perspective of extensive experimental design.

Exploration of bit-level scaling laws must be based on the internal patterns derived from a large number of experiments. These patterns can guide future research, which is why we conducted numerous experiments, as it is essential for uncovering these insights.

The analysis of VAR and DiT represents research into two mainstream development directions in the vision generative model field. As shown in Table 1, there has been ongoing debate regarding the use of discrete versus continuous representation spaces (e.g., [17,18,19,20]). Both approaches have shown strong performance in terms of scaling laws. This work, however, takes a different perspective by investigating the impact of these representation spaces on the scaling laws of quantized models. We find that, despite achieving comparable performance at full precision, discrete autoregressive models consistently outperform continuous models across various quantization settings. To validate the effectiveness and broad applicability of our conclusions, we conducted the same experiments on other models, as detailed in Appendix C. This indicates that our work provides general guidance for subsequent model design and for applications in specific deployment scenarios (e.g., mobile devices, edge computing).

Secondly, while low-bit precision representation often focuses on trading performance for efficiency, this work demonstrates that by optimizing either the model or the quantization algorithm, models can achieve superior bit-level scaling laws. This outstanding characteristic enables the use of lower bit precision to increase model parameters, ultimately enhancing generative capability without sacrificing efficiency. This is a key feature that we hope researchers will pay particular attention to.

As such, you can see the tremendous potential of low-bit precision in the context of bit-level scaling laws. However, existing methods fail to further improve the bit-level scaling laws of models. To address this, we introduced TopKLD, which enhances the bit-level scaling behaviors of language-style models by one level.

Our study is an essential step toward understanding how various models and quantization methods influence bit-level scaling behavior, and it also provides recommendations for future work. We hope the reviewer will take into account the contributions of this work to model design and the application of quantization algorithms. Thank you again!

Comment

Rebuttal Revision Paper Modifications

We greatly appreciate your valuable review comments. We have revised the paper according to your suggestions and submitted the rebuttal version. For detailed modifications, please refer to the rebuttal version PDF and appendix C: Supplementary materials for rebuttal. Below, we address your identified weaknesses and questions, hoping to resolve your concerns and improve our score.

Table 1

| Model Type | Discrete/Continuous | Model | #Params | FID | IS | Date | Scaling ability |
|---|---|---|---|---|---|---|---|
| Diffusion-style | continuous | ADM [1] | 554M | 10.94 | 101 | 2021.07 | No |
| Diffusion-style | continuous | CDM [2] | - | 4.88 | 158.7 | 2021.12 | No |
| Diffusion-style | continuous | LDM-8 [3] | 258M | 7.76 | 209.5 | 2022.04 | No |
| Diffusion-style | continuous | LDM-4 | 400M | 3.6 | 247.7 | | No |
| Diffusion-style | continuous | DiT [4] | 458M | 5.02 | 167.2 | 2023.03 | Yes |
| Diffusion-style | | | 675M | 2.27 | 278.2 | | |
| Diffusion-style | | | 3B | 2.1 | 304.4 | | |
| Diffusion-style | | | 7B | 2.28 | 316.2 | | |
| Diffusion-style | continuous | MDTv [5] | 676M | 1.58 | 314.7 | 2024.02 | No |
| Diffusion-style | continuous | DiMR [6] | 505M | 1.7 | 289 | 2024.07 | No |
| Diffusion-style | Discrete | VQ-diffusion [7] | 370M | 11.89 | - | 2022.03 | No |
| Diffusion-style | Discrete | VQ-diffusion-V2 [8] | 370M | 7.65 | - | 2023.02 | |
| Language-style | Discrete | MaskGIT [9] | 177M | 6.18 | 182.1 | 2022.02 | No |
| Language-style | Discrete | RCG (cond.) [10] | 502M | 3.49 | 215.5 | 2023.12 | No |
| Language-style | Discrete | MAGVIT-v2 [11] | 307M | 1.78 | 319.4 | 2023.04 | No |
| Language-style | Discrete | TiTok [12] | 287M | 1.97 | 281.8 | 2024.07 | No |
| Language-style | Discrete | MaskBit [13] | 305M | 1.52 | 328.6 | 2024.09 | No |
| Language-style | Discrete | VQVAE [14] | 13.5B | 31.11 | 45 | 2019.06 | No |
| Language-style | Discrete | VQGAN [15] | 1.4B | 5.2 | 175.1 | 2021.07 | No |
| Language-style | Discrete | RQTran [16] | 3.8B | 3.8 | 323.7 | 2022.03 | No |
| Language-style | Discrete | VITVQ [17] | 1.7B | 3.04 | 227.4 | 2022.07 | No |
| Language-style | Discrete | VAR [18] | 310M | 3.3 | 274.4 | 2024.04 | Yes |
| Language-style | | | 600M | 2.57 | 302.6 | | |
| Language-style | | | 1B | 2.09 | 312.9 | | |
| Language-style | | | 2B | 1.92 | 323.1 | | |
| Language-style | Discrete | LlamaGen [19] | 343M | 3.07 | 256.06 | 2024.07 | Yes |
| Language-style | | | 775M | 2.62 | 244.1 | | |
| Language-style | | | 1.4B | 2.34 | 253.9 | | |
| Language-style | | | 3.1B | 2.18 | 263.3 | | |
| Language-style | continuous | MAR [20] | 208M | 2.31 | 281.7 | 2024.07 | Yes |
| Language-style | | | 479M | 1.78 | 296 | | |
| Language-style | | | 943M | 1.55 | 303.7 | | |
Comment

Dear reviewer:

Thank you for your great efforts in reviewing our paper and providing constructive suggestions and comments. To address the weaknesses you raised, we have conducted extensive experiments in Appendix C and Figure 1 of the main paper to alleviate concerns regarding the size of the VAR model. Additionally, this work focuses on investigating the impact of whether the representation space in vision generation models is continuous or discrete. Furthermore, we propose strategies to optimize bit-level scaling laws under various quantization scenarios. Our exploration of model design and quantization methods provides significant insights for guiding future applications in specific deployment scenarios, such as mobile devices and edge computing. If our rebuttal does not address your concerns, you are warmly welcomed to raise further questions. If our responses have addressed your concerns, we sincerely request that you consider raising our score.

Best Wishes!

Authors

评论

Dear reviewer:

Thank you for your great efforts in reviewing our paper and providing constructive suggestions and comments. To address the weaknesses you raised, we have conducted extensive experiments in Appendix C and Figure 1 of the main paper to alleviate your concerns. If our rebuttal does not address your concerns, you are warmly welcomed to raise further questions. If our responses have addressed your concerns, we sincerely request that you consider raising our score.

Best Wishes!

Authors

Review 2 (Rating: 5)

This paper investigates the impact of quantization on the performance of image generation models. Through comprehensive experiments along many axes, such as model bits (MT) vs. compute bits (CT), post-training quantization (PTQ) vs. quantization-aware training (QAT), and diffusion models (DiT) vs. auto-regressive models (VAR), the authors observe that image generation models obey bit-level scaling laws. They further discover that VAR is more robust to quantization than DiT due to its discrete representation space. Finally, they propose a knowledge-distillation-based quantization method, called TopKLD, to improve the bit-level scaling laws of VAR.

Strengths

This paper demonstrates the bit-level scaling laws of image generative models through comprehensive experiments in terms of model bits and compute bits. By analyzing the reconstruction error of intermediate representations in VAR and DiT, the paper concludes that VAR is more robust to quantization and that this conclusion could generalize to other discrete auto-regressive models. Further, the paper proposes TopKLD, a quantization-aware training process, to improve the scaling behavior of VAR in the low-bit region.

Weaknesses

Bit-level scaling laws and the robustness of discrete auto-regressive models seem intuitive and straightforward; therefore, the main contribution of this paper is the proposed quantization method, TopKLD. For a knowledge-distillation-based quantization-aware training method, the comparisons and ablation studies are insufficient.

Questions

  1. TopKLD should be compared to more distillation loss functions beyond forward and reverse KL divergence, such as logits MSE, JS divergence, and so on.
  2. How the "top-K sampling" parameter affects the scaling behavior should be studied.
  3. The "Figure 5" in line 427 should be "Figure 7(a)".
Comment

Weakness1

We greatly appreciate your valuable review comments and hope that our response addresses your concerns regarding the statement: "Bit-level scaling laws and the robustness of discrete auto-regressive models seem to be intuitive and straightforward."

Firstly, as shown in Table 1, in the field of vision generative models there has been ongoing debate regarding the use of discrete versus continuous representation spaces (e.g., [17,18,19,20]). Both approaches have shown strong performance in terms of scaling laws. This work, however, takes a different perspective by investigating the impact of these representation spaces on the scaling laws of quantized models. We find that, despite achieving comparable performance at full precision, discrete autoregressive models consistently outperform continuous models across various quantization settings. To validate the effectiveness and broad applicability of our conclusions, we conducted the same experiments on other models, as detailed in Appendix C. This indicates that our work provides general guidance for subsequent model design and for applications in specific deployment scenarios (e.g., mobile devices, edge computing).

Secondly, while low-bit precision representation often focuses on trading performance for efficiency, this work demonstrates that by optimizing either the model or quantization algorithm, models can achieve superior bit-level scaling laws. This outstanding characteristic enables the use of lower bit precision to increase model parameters, ultimately enhancing generative capability without sacrificing efficiency.

To validate the effectiveness and broad applicability of our conclusions, we conducted the same experiments on other models, as detailed in Appendix C. It can be observed that, owing to the influence of its continuous representation space, MAR, despite exhibiting excellent scaling laws, does not demonstrate superior bit-level scaling laws, similar to DiT. In contrast, LLaMaGen, which shares VAR's discrete representation space, exhibits outstanding bit-level scaling laws.

This work provides a deeper, foundational understanding of bit-level scaling laws in visual generative models, from both the model design and quantization algorithm perspectives, supported by rigorous experimental design.

Our study is an essential step toward understanding how various models and quantization methods influence bit-level scaling behavior, and it also provides recommendations for future work.

We hope the reviewer will take into account the contributions of this work to model design and the application of quantization algorithms. Thank you again!

Comment

References

[1] Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis[J]. Advances in neural information processing systems, 2021, 34: 8780-8794.

[2] Ho J, Saharia C, Chan W, et al. Cascaded diffusion models for high fidelity image generation[J]. Journal of Machine Learning Research, 2022, 23(47): 1-33.

[3] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 10684-10695.

[4] Peebles W, Xie S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 4195-4205.

[5] Gao S, Zhou P, Cheng M M, et al. Masked diffusion transformer is a strong image synthesizer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 23164-23173.

[6] Liu Q, Zeng Z, He J, et al. Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models[J]. arXiv preprint arXiv:2406.09416, 2024.

[7] Gu S, Chen D, Bao J, et al. Vector quantized diffusion model for text-to-image synthesis[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 10696-10706.

[8] Tang Z, Gu S, Bao J, et al. Improved vector quantized diffusion models[J]. arXiv preprint arXiv:2205.16007, 2022.

[9] Chang H, Zhang H, Jiang L, et al. Maskgit: Masked generative image transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11315-11325.

[10] Li T, Katabi D, He K. Self-conditioned image generation via generating representations[J]. arXiv preprint arXiv:2312.03701, 2023.

[11] Yu L, Lezama J, Gundavarapu N B, et al. Language Model Beats Diffusion--Tokenizer is Key to Visual Generation[J]. arXiv preprint arXiv:2310.05737, 2023.

[12] Yu Q, Weber M, Deng X, et al. An Image is Worth 32 Tokens for Reconstruction and Generation[J]. arXiv preprint arXiv:2406.07550, 2024.

[13] Weber M, Yu L, Yu Q, et al. Maskbit: Embedding-free image generation via bit tokens[J]. arXiv preprint arXiv:2409.16211, 2024.

[14] Razavi A, Van den Oord A, Vinyals O. Generating diverse high-fidelity images with vq-vae-2[J]. Advances in neural information processing systems, 2019, 32.

[15] Esser P, Rombach R, Ommer B. Taming transformers for high-resolution image synthesis[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 12873-12883.

[16] Lee D, Kim C, Kim S, et al. Autoregressive image generation using residual quantization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11523-11532.

[17] Yu J, Li X, Koh J Y, et al. Vector-quantized image modeling with improved vqgan[J]. arXiv preprint arXiv:2110.04627, 2021.

[18] Tian K, Jiang Y, Yuan Z, et al. Visual autoregressive modeling: Scalable image generation via next-scale prediction[J]. arXiv preprint arXiv:2404.02905, 2024.

[19] Sun P, Jiang Y, Chen S, et al. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation[J]. arXiv preprint arXiv:2406.06525, 2024.

[20] Li, Tianhong, et al. "Autoregressive Image Generation without Vector Quantization." arXiv preprint arXiv:2406.11838 (2024).

Comment

Rebuttal Revision Paper Modifications

We greatly appreciate your valuable review comments. We have revised the paper according to your suggestions and submitted the rebuttal version. For detailed modifications, please refer to the rebuttal version PDF and appendix C: Supplementary materials for rebuttal. Below, we address your identified weaknesses and questions, hoping to resolve your concerns and improve our score.

Table 1

| Model Type | Discrete/Continuous | Model | #Params | FID | IS | Date | Scaling ability |
|---|---|---|---|---|---|---|---|
| Diffusion-style | continuous | ADM [1] | 554M | 10.94 | 101 | 2021.07 | No |
| Diffusion-style | continuous | CDM [2] | - | 4.88 | 158.7 | 2021.12 | No |
| Diffusion-style | continuous | LDM-8 [3] | 258M | 7.76 | 209.5 | 2022.04 | No |
| Diffusion-style | continuous | LDM-4 | 400M | 3.6 | 247.7 | | No |
| Diffusion-style | continuous | DiT [4] | 458M | 5.02 | 167.2 | 2023.03 | Yes |
| Diffusion-style | | | 675M | 2.27 | 278.2 | | |
| Diffusion-style | | | 3B | 2.1 | 304.4 | | |
| Diffusion-style | | | 7B | 2.28 | 316.2 | | |
| Diffusion-style | continuous | MDTv [5] | 676M | 1.58 | 314.7 | 2024.02 | No |
| Diffusion-style | continuous | DiMR [6] | 505M | 1.7 | 289 | 2024.07 | No |
| Diffusion-style | Discrete | VQ-diffusion [7] | 370M | 11.89 | - | 2022.03 | No |
| Diffusion-style | Discrete | VQ-diffusion-V2 [8] | 370M | 7.65 | - | 2023.02 | |
| Language-style | Discrete | MaskGIT [9] | 177M | 6.18 | 182.1 | 2022.02 | No |
| Language-style | Discrete | RCG (cond.) [10] | 502M | 3.49 | 215.5 | 2023.12 | No |
| Language-style | Discrete | MAGVIT-v2 [11] | 307M | 1.78 | 319.4 | 2023.04 | No |
| Language-style | Discrete | TiTok [12] | 287M | 1.97 | 281.8 | 2024.07 | No |
| Language-style | Discrete | MaskBit [13] | 305M | 1.52 | 328.6 | 2024.09 | No |
| Language-style | Discrete | VQVAE [14] | 13.5B | 31.11 | 45 | 2019.06 | No |
| Language-style | Discrete | VQGAN [15] | 1.4B | 5.2 | 175.1 | 2021.07 | No |
| Language-style | Discrete | RQTran [16] | 3.8B | 3.8 | 323.7 | 2022.03 | No |
| Language-style | Discrete | VITVQ [17] | 1.7B | 3.04 | 227.4 | 2022.07 | No |
| Language-style | Discrete | VAR [18] | 310M | 3.3 | 274.4 | 2024.04 | Yes |
| Language-style | | | 600M | 2.57 | 302.6 | | |
| Language-style | | | 1B | 2.09 | 312.9 | | |
| Language-style | | | 2B | 1.92 | 323.1 | | |
| Language-style | Discrete | LlamaGen [19] | 343M | 3.07 | 256.06 | 2024.07 | Yes |
| Language-style | | | 775M | 2.62 | 244.1 | | |
| Language-style | | | 1.4B | 2.34 | 253.9 | | |
| Language-style | | | 3.1B | 2.18 | 263.3 | | |
| Language-style | continuous | MAR [20] | 208M | 2.31 | 281.7 | 2024.07 | Yes |
| Language-style | | | 479M | 1.78 | 296 | | |
| Language-style | | | 943M | 1.55 | 303.7 | | |
Comment

Question 1

Thank you very much for your suggestion. In this paper, we have provided additional experiments to demonstrate the effectiveness of TopKLD, as shown in the table below.

| Setting | Method | d16 | d20 | d24 | d30 |
|---|---|---|---|---|---|
| W16A16 | FP16 | 3.3 | 2.57 | 2.19 | 1.92 |
| W8A16 | GPTQ | 3.41 | 2.66 | 2.12 | 1.97 |
| W8A16 | GPTVQ | 3.40 | 2.637 | 2.398 | 2.11 |
| W8A16 | OmniQ | 3.62 | 2.72 | 2.2098 | 2.0636 |
| W8A16 | MSE | 3.55 | 2.71 | 2.35 | 2.05 |
| W8A16 | JS Divergence | 3.50 | 2.69 | 2.22 | 2.05 |
| W8A16 | Forward-KLD | 3.41 | 2.636 | 2.40 | 2.05 |
| W8A16 | Reverse-KLD | 3.41 | 2.636 | 2.41 | 2.04 |
| W8A16 | TopKLD | 3.40 | 2.634 | 2.394 | 2.01 |
| W4A16 | GPTQ | 4.64 | 3.247 | 2.572 | 2.277 |
| W4A16 | GPTVQ | 3.92 | 2.96 | 2.634 | 2.226 |
| W4A16 | OmniQ | 4.08 | 3.17 | 2.56 | 2.55 |
| W4A16 | MSE | 3.97 | 3.12 | 2.69 | 2.25 |
| W4A16 | JS Divergence | 3.92 | 3.01 | 2.65 | 2.23 |
| W4A16 | Forward-KLD | 3.95 | 3.06 | 2.63 | 2.21 |
| W4A16 | Reverse-KLD | 3.89 | 3.05 | 2.59 | 2.18 |
| W4A16 | TopKLD | 3.82 | 2.95 | 2.53 | 2.12 |
| W3A16 | GPTQ | 27.75 | 16.11 | 15.45 | 13.48 |
| W3A16 | GPTVQ | 12.69 | 9.01 | 6.29 | 5.52 |
| W3A16 | OmniQ | 18.18 | 10.67 | 6.15 | 3.93 |
| W3A16 | MSE | 4.56 | 3.89 | 3.54 | 3.01 |
| W3A16 | JS Divergence | 4.45 | 3.72 | 3.25 | 2.51 |
| W3A16 | Forward-KLD | 4.27 | 3.45 | 2.96 | 2.55 |
| W3A16 | Reverse-KLD | 4.02 | 3.25 | 2.91 | 2.55 |
| W3A16 | TopKLD | 3.85 | 3.17 | 2.66 | 2.25 |
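As a reading aid for the rows above, the PyTorch sketch below spells out the compared objectives on one batch of logits. The top-K-restricted loss is our guess at the TopKLD idea (KLD renormalized over the teacher's top-K tokens), not necessarily the paper's exact formulation, and the default k=600 simply mirrors the ablation under Question 2.

```python
import math
import torch
import torch.nn.functional as F

def distillation_losses(student_logits, teacher_logits, k=600):
    """Objectives compared in the table above, computed on raw logits."""
    p = F.log_softmax(teacher_logits, dim=-1)  # teacher log-probs
    q = F.log_softmax(student_logits, dim=-1)  # student log-probs

    forward_kld = F.kl_div(q, p, log_target=True, reduction="batchmean")  # KL(T||S)
    reverse_kld = F.kl_div(p, q, log_target=True, reduction="batchmean")  # KL(S||T)
    logits_mse = F.mse_loss(student_logits, teacher_logits)

    # JS divergence via the log of the mixture M = (P + Q) / 2
    m = torch.logsumexp(torch.stack([p, q]), dim=0) - math.log(2.0)
    js = 0.5 * (F.kl_div(m, p, log_target=True, reduction="batchmean")
                + F.kl_div(m, q, log_target=True, reduction="batchmean"))

    # Top-K-restricted KLD: renormalize both models over the teacher's top-K
    idx = teacher_logits.topk(k, dim=-1).indices
    p_top = F.log_softmax(teacher_logits.gather(-1, idx), dim=-1)
    q_top = F.log_softmax(student_logits.gather(-1, idx), dim=-1)
    top_kld = F.kl_div(q_top, p_top, log_target=True, reduction="batchmean")

    return {"MSE": logits_mse, "JS": js, "Forward-KLD": forward_kld,
            "Reverse-KLD": reverse_kld, "TopKLD-like": top_kld}

# e.g. distillation_losses(torch.randn(8, 4096), torch.randn(8, 4096))
```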

Question 2

Thank you very much for your suggestion. To investigate the impact of top-K sampling on the bit-level scaling behavior of the model, we performed ablation experiments using different values of K. The results in the table below show that:

1. While the choice of K does affect the final generation results to some extent, it does not influence the overall trend of the bit-level scaling laws.

2. The best performance is achieved when the value of K matches the K used in top-K sampling during the model's image generation process.

| Setting | Method | d16 | d20 | d24 | d30 |
|---|---|---|---|---|---|
| W16A16 | FP | 3.3 | 2.57 | 2.19 | 1.92 |
| W3A16 | TopKLD (K=400) | 3.95 | 3.21 | 2.77 | 2.29 |
| W3A16 | TopKLD (K=500) | 3.91 | 3.24 | 2.71 | 2.24 |
| W3A16 | TopKLD (K=600) | 3.85 | 3.17 | 2.66 | 2.25 |
| W3A16 | TopKLD (K=700) | 3.92 | 3.19 | 2.72 | 2.25 |
| W3A16 | TopKLD (K=800) | 3.96 | 3.19 | 2.73 | 2.29 |
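For context, the K in these rows refers to the top-K filtering step used when the model samples tokens during image generation; a minimal sketch of that step is shown below (the vocabulary size and batch shape in the usage line are illustrative assumptions).

```python
import torch

def sample_top_k(logits, k=600):
    """Standard top-k sampling: keep the k largest logits, renormalize,
    and draw one token id per row. The ablation above matches this k
    against the K used inside TopKLD."""
    v, _ = torch.topk(logits, k, dim=-1)
    cutoff = v[..., -1:]                                   # k-th largest logit
    filtered = logits.masked_fill(logits < cutoff, float("-inf"))
    probs = torch.softmax(filtered, dim=-1)
    return torch.multinomial(probs, num_samples=1)         # [batch, 1] token ids

# e.g. sample_top_k(torch.randn(4, 4096))  # 4096-way codebook assumed
```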

Question 3

Thank you very much for your correction. We will revise the error accordingly.

Comment

Dear reviewer:

Thank you for your great efforts in reviewing our paper and providing constructive suggestions/comments. This work focuses on investigating the impact of whether the representation space in vision generation models is continuous or discrete. Furthermore, we propose strategies to optimize bit-level scaling laws under various quantization scenarios. To address the weaknesses you raised, we have provided extensive examples related to the recent debate on continuous versus discrete representation spaces in vision generation models, as shown in Table 1. Through numerous experiments presented in Appendix C and Figure 1 of the main text, we demonstrate the significant impact and unique differences of this feature on vision generation models. These findings provide valuable insights into model design and quantization methods, offering guidance for future applications in specific deployment scenarios, such as mobile devices and edge computing. Additionally, to address your Questions 1 and 2 and demonstrate the advantages of TopKLD, we have conducted extensive comparisons with current state-of-the-art methods across various settings, as per your suggestion. If our rebuttal does not address your concerns, you are warmly welcomed to raise further questions. If our responses have addressed your concerns, we sincerely request that you consider raising our score.

Best Wishes!

Authors

Comment

Thanks for the authors' supplementary experiments. The current results are adequate to illustrate the advantage of TopKLD.

While "Knowledge Distillation in Quantization-Aware Training" (KD-QAT) can enhance VAR's scaling ability at low bits, I am wondering whether KD-QAT also works for DiT.

Comment

Thank you very much for your response!

The following experiments demonstrate that KD-QAT is also effective for DiT. Additionally, we have validated its effectiveness for LlamaGen in Appendix C.

However, it is important to note that, compared to discrete representation space models like LlamaGen and VAR, the bit-level scaling laws of DiT are inherently limited by its continuous representation space. If our rebuttal does not address your concerns, you are warmly welcomed to raise further questions. If our responses have addressed your concerns, we sincerely request that you consider raising our score.

| Method | #Bits | DiT-L/2 | DiT-XL/2 | L-DiT-3 | L-DiT-7 |
|---|---|---|---|---|---|
| FP16 | W16A16 | 5.02 | 2.27 | 2.1 | 2.28 |
| GPTQ | W8A16 | 5.89 | 3.01 | 2.48 | 2.35 |
| QAT | W8A16 | 5.33 | 2.46 | 2.45 | 2.33 |
| KD-QAT | W8A16 | 5.15 | 2.32 | 2.26 | 2.27 |
| GPTQ | W4A16 | 7.8 | 4.52 | 2.56 | 2.31 |
| QAT | W4A16 | 5.76 | 3.23 | 2.76 | 2.45 |
| KD-QAT | W4A16 | 5.32 | 3.08 | 2.32 | 2.29 |
| GPTQ | W3A16 | 32.76 | 25.77 | 12.23 | 14.35 |
| QAT | W3A16 | 11.23 | 6.34 | 4.76 | 3.78 |
| KD-QAT | W3A16 | 9.23 | 5.12 | 4.21 | 3.05 |
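For clarity, here is a minimal sketch of what one KD-QAT update looks like in this setting. The straight-through fake-quantizer and all names are our assumptions for illustration rather than the authors' implementation; `loss_fn` stands for a distillation objective such as the ones discussed earlier in this thread.

```python
import torch

def fake_quant(w, bits=3):
    # Symmetric uniform fake-quantization with a straight-through estimator:
    # the forward pass sees quantized weights, the backward pass is identity.
    qmax = 2 ** (bits - 1) - 1
    scale = w.detach().abs().max() / qmax
    w_q = torch.round(w / scale).clamp(-qmax - 1, qmax) * scale
    return w + (w_q - w).detach()

def kd_qat_step(student, teacher, batch, loss_fn, optimizer):
    with torch.no_grad():
        t_out = teacher(batch)      # frozen full-precision teacher
    s_out = student(batch)          # student forward uses fake-quant weights
    loss = loss_fn(s_out, t_out)    # distillation loss (e.g., TopKLD-style)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
    return loss.item()
```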
Comment

Dear reviewer:

Thank you for providing constructive suggestions. We would like to kindly ask whether our responses and additional experiments have addressed all your concerns. If so, we would greatly appreciate it if you could reconsider the score in light of the clarifications and new evidence provided.

Best Wishes!

Authors

Comment

Dear reviewer:

Thank you for your great efforts in reviewing our paper and providing constructive suggestions/comments. To address the weaknesses you raised, we have conducted extensive experiments in Appendix C and Figure 1 of the main paper to alleviate your concerns. If our rebuttal does not address your concerns, you are warmly welcomed to raise further questions. If our responses have addressed your concerns, we sincerely request that you consider raising our score.

Best Wishes!

Authors

Review 3 (Rating: 6)

The paper investigates bit-level scaling laws in quantized vision generative models, specifically comparing diffusion-style and language-style models. The authors find that while both models perform similarly in full precision, language-style models consistently exhibit superior bit-level scaling across various quantization settings. This robustness is attributed to the discrete representation space of language-style models, which enhances resilience to quantization noise. The authors propose TopKLD, a novel knowledge distillation method that balances implicit and explicit knowledge transfer, thereby further optimizing bit-level scaling in quantized models. Their findings provide valuable insights into efficient quantization strategies and underscore the potential of language-style models for low-bit precision applications.

Strengths

  1. The paper investigates bit-level scaling laws in quantized vision generative models, specifically comparing diffusion-style and language-style models. The authors find that while both models perform similarly in full precision, language-style models consistently exhibit superior bit-level scaling across various quantization settings. This robustness is attributed to the discrete representation space of language-style models, which enhances resilience to quantization noise.
  2. The authors propose TopKLD, a novel knowledge distillation method that balances implicit and explicit knowledge transfer, thereby further optimizing bit-level scaling in quantized models. Their findings provide valuable insights into efficient quantization strategies and underscore the potential of language-style models for low-bit precision applications.

Weaknesses

  1. Inconsistent Scaling Comparison in Figure 1: The paper aims to show that language-style models have superior bit-level scaling compared to diffusion-style models. However, the models compared in Figure 1 have different initial total model bits and compute bits, which may itself cause scaling variations. This discrepancy introduces an additional variable that weakens the effectiveness of Figure 1 in supporting the authors’ claim. Aligning initial bit settings could help provide a clearer, more controlled comparison.
  2. Limited Advantage of TopKLD in High-Bit Settings: While the authors introduce TopKLD to enhance bit-level scaling, Figure 7(c) and Figure 5(a) suggest that in the W8A8 setting, TopKLD performs similarly to existing methods like SmoothQuant, without a clear improvement. Given that TopKLD introduces extra training overhead, its benefit seems marginal in these high-bit settings. Providing a comparison across a broader range of bit settings could clarify the scenarios where TopKLD is genuinely advantageous.
  3. Insufficient Experimental Validation of TopKLD’s Effectiveness: The effectiveness of TopKLD is only partially validated, as shown by its comparison with ForwardKLD and ReverseKLD at 3-bit in Figure 7(b). However, a more comprehensive evaluation against other mainstream quantization methods under varied conditions would provide a stronger basis for its practical effectiveness.
  4. Lack of Analysis on the Computational Overhead of TopKLD: TopKLD introduces an additional training overhead, but the paper does not quantify the computational cost compared to existing methods. A detailed analysis of training time, computational resources, and memory requirements would provide a more complete view of its trade-offs, particularly for resource-constrained applications.

Questions

  1. Could you provide a more controlled comparison in Figure 1 with equivalent initial model and compute bits for both language-style and diffusion-style models? The initial bit settings differ between the models, which complicates the interpretation of bit-level scaling behaviors. A more controlled experiment with similar initial bit allocations would strengthen the comparison and isolate the scaling differences more effectively.
  2. What specific advantages does TopKLD offer over existing methods in low-bit settings, and could you clarify its computational cost? While TopKLD is introduced to enhance bit-level scaling, its benefit seems marginal in higher-bit configurations, as shown in Figure 7(c). Could you provide additional data on TopKLD's performance in low-bit settings and quantify the extra training cost, as well as its memory and computational overhead, compared to other methods like SmoothQuant?
  3. Can you expand the experimental validation of TopKLD with comparisons to other mainstream quantization methods across more bit configurations? The effectiveness of TopKLD is primarily shown in comparison with ForwardKLD and ReverseKLD in the 3-bit setting. Including a broader range of comparisons with other quantization approaches (e.g., OmniQuant, GPTQ) across different bit levels would give a clearer picture of where TopKLD has a distinct advantage.
  4. Could you provide additional insights into the potential applications of your findings on bit-level scaling laws? The study primarily focuses on theoretical scaling improvements, but practical insights or applications for specific deployment scenarios (e.g., mobile devices, edge computing) would make the results more actionable. Could you elaborate on specific scenarios where the bit-level improvements from language-style models might offer a tangible benefit?
Comment

Weakness 3

Thank you very much for your suggestion. We have provided a comparison of TopKLD with the current mainstream quantization methods. Please refer to the results under Weakness 2 for further details.

Weakness 4

Thank you very much for your suggestion. TopKLD is an optimization of current mainstream distillation loss functions. It balances the "implicit knowledge" and "explicit knowledge" derived from full-precision models, thereby enhancing the bit-level scaling behaviors of language-style models by one level. As a result, it does not incur any additional resource overhead compared to methods like ForwardKLD. For your reference, we provide the training times for TopKLD on an A100 GPU below:

| d16 | d20 | d24 | d30 |
|---|---|---|---|
| 5.1 | 8.9 | 13.6 | 21.2 |
Comment

Rebuttal Revision Paper Modifications

We greatly appreciate your valuable review comments. We have revised the paper according to your suggestions and submitted the rebuttal version. For detailed modifications, please refer to the rebuttal version PDF and appendix C: Supplementary materials for rebuttal. Below, we address your identified weaknesses and questions, hoping to resolve your concerns and improve our score.

Weakness 1

Thank you very much for your valuable reminder. We have aligned the initial bit settings to better compare the bit-level scaling laws of language-style and diffusion-style models, and have modified Figure 1 in the main paper accordingly, as shown in the rebuttal version of the PDF.

To determine whether a model exhibits superior bit-level scaling laws, we compare the internal trends of the model (e.g., 8-bit VAR vs. 16-bit VAR), rather than making a direct comparison of generative quality between two types of models at the same total bit precision (e.g., 8-bit DiT vs. 8-bit VAR). As shown in Figure 1, regardless of the quantization method, when full-precision VAR is quantized to lower bit precision, its scaling law shifts towards the lower-left region. In contrast, DiT does not show such a shift. This is precisely what we mentioned in the caption: "Quantized VAR exhibits better bit-level scaling laws than full-precision VAR, while Quantized DiT shows almost no improvement compared to full precision."

The characteristics demonstrated by the discrete language-style model allow us to increase the model's parameters through quantization under limited resources, thereby achieving better generative capability. This is a feature that continuous diffusion-style models lack, and it is precisely what bit-level scaling laws aim to showcase.

Weakness 2

Thank you for your suggestion. We provide a more comprehensive evaluation against other mainstream quantization methods to demonstrate TopKLD's practical effectiveness: GPTQ, GPTVQ, SmoothQuant, OmniQuant, and TopKLD. The experiments below demonstrate our superior performance across various bit precisions.

| Setting | Method | d16 | d20 | d24 | d30 |
|---|---|---|---|---|---|
| W16A16 | FP16 | 3.3 | 2.57 | 2.19 | 1.92 |
| W8A16 | GPTQ | 3.41 | 2.66 | 2.12 | 1.97 |
| W8A16 | GPTVQ | 3.40 | 2.637 | 2.398 | 2.11 |
| W8A16 | OmniQ | 3.62 | 2.72 | 2.2098 | 2.0636 |
| W8A16 | Forward-KLD | 3.41 | 2.636 | 2.40 | 2.05 |
| W8A16 | Reverse-KLD | 3.41 | 2.636 | 2.41 | 2.04 |
| W8A16 | TopKLD | 3.40 | 2.634 | 2.394 | 2.01 |
| W4A16 | GPTQ | 4.64 | 3.247 | 2.572 | 2.277 |
| W4A16 | GPTVQ | 3.92 | 2.96 | 2.634 | 2.226 |
| W4A16 | OmniQ | 4.08 | 3.17 | 2.56 | 2.55 |
| W4A16 | Forward-KLD | 3.95 | 3.06 | 2.63 | 2.21 |
| W4A16 | Reverse-KLD | 3.89 | 3.05 | 2.59 | 2.18 |
| W4A16 | TopKLD | 3.82 | 2.95 | 2.53 | 2.12 |
| W3A16 | GPTQ | 27.75 | 16.11 | 15.45 | 13.48 |
| W3A16 | GPTVQ | 12.69 | 9.01 | 6.29 | 5.52 |
| W3A16 | OmniQ | 18.18 | 10.67 | 6.15 | 3.93 |
| W3A16 | Forward-KLD | 4.27 | 3.45 | 2.96 | 2.55 |
| W3A16 | Reverse-KLD | 4.02 | 3.25 | 2.91 | 2.55 |
| W3A16 | TopKLD | 3.85 | 3.17 | 2.66 | 2.25 |

| Setting | Method | d16 | d20 | d24 | d30 |
|---|---|---|---|---|---|
| W16A16 | FP | 3.3 | 2.57 | 2.19 | 1.92 |
| W8A8 | SmoothQ | 3.81 | 2.68 | 2.23 | 2.01 |
| W8A8 | OmniQ | 3.75 | 2.75 | 2.18 | 2.08 |
| W8A8 | Forward | 3.8 | 2.72 | 2.16 | 2.10 |
| W8A8 | TopKLD | 2.75 | 2.7 | 2.18 | 1.98 |
| W4A8 | SmoothQ | 7.21 | 4.32 | 3.21 | 2.65 |
| W4A8 | OmniQ | 6.92 | 4.35 | 3.11 | 2.69 |
| W4A8 | Forward | 6.62 | 3.95 | 3.01 | 2.35 |
| W4A8 | TopKLD | 5.89 | 3.62 | 2.81 | 2.15 |

As shown, whether in high-bit or low-bit settings, and whether quantizing only weights or both weights and activations, TopKLD consistently exhibits superior performance. Even at higher bit precision, TopKLD still leads to noticeable improvements in model accuracy.

Regarding your comment on the “Limited Advantage of TopKLD in High-Bit Settings,” the reason for this is that our focus is not solely on improving model accuracy but also on scaling laws. At high precision levels, models retain sufficient precision, resulting in minimal degradation compared to full-precision models. Thus, there is no significant enhancement in bit-level scaling in these settings.

Through the explanation under Weakness 1, we believe you can see that models with excellent bit-level scaling laws demonstrate enhanced capabilities under low-bit conditions. The goal of TopKLD is to improve the scaling ability of models under low-bit conditions. As shown in the results in Section 3.3 of the paper, TopKLD enhances the bit-level scaling behaviors of language-style models by one level.

Comment

Question 1

Thank you very much for your reminder. We have used similar initial bit allocations to strengthen the comparison. Please refer to the details under Weakness 1.

Question 2

When a model exhibits excellent bit-level scaling laws, by leveraging this outstanding feature, we can increase the model parameters under limited resources (e.g., in specific deployment scenarios such as mobile devices or edge computing) while maintaining efficiency, ultimately improving generative capabilities. However, as shown in Figure 5 of the main text, existing methods fail to achieve better bit-level scaling laws under low-bit settings, which hinders further enhancement of model capabilities in specific deployment scenarios. If you wish to further improve model generation quality under limited resource conditions, TopKLD is an excellent choice. Although it incurs some additional computational cost, it results in a significant improvement in model performance.

Question 3

Thank you for your valuable suggestion. We have provided a comparison with existing mainstream quantization methods under Weakness 2. As shown in Figure 5 of the main text, existing methods fail to achieve better bit-level scaling laws under low-bit settings, which hinders further enhancement of model capabilities in specific deployment scenarios (such as mobile devices or edge computing). TopKLD addresses this issue by balancing the "implicit knowledge" and "explicit knowledge" derived from full-precision models, enhancing the bit-level scaling behaviors of language-style models by one level.

Question 4

1. Potential of bit-level scaling laws: As shown under Weakness 1, if a model or quantization algorithm is optimized to achieve excellent bit-level scaling laws, it is possible to increase model parameters using lower bit precision while maintaining better generative capability under current resource constraints. This outstanding feature plays a significant role in specific deployment scenarios, such as mobile devices and edge computing.

2. Insights for model design: In the field of vision generative models, there has been ongoing debate regarding the use of discrete versus continuous representation spaces (e.g., [1,2,3,4]). Both approaches have shown strong performance in terms of scaling laws. This work, however, takes a different perspective by investigating the impact of these representation spaces on the scaling laws of quantized models. We find that, despite achieving comparable performance at full precision, discrete autoregressive models consistently outperform continuous models across various quantization settings.

3. Introduction of a new method: We introduced the TopKLD method, which enhances knowledge transfer from full-precision models by effectively balancing explicit and implicit knowledge, thereby improving the bit-level scaling performance of language-style models.

[1] Li T, Tian Y, Li H, et al. Autoregressive Image Generation without Vector Quantization[J]. arXiv preprint arXiv:2406.11838, 2024.

[2]Tian K, Jiang Y, Yuan Z, et al. Visual autoregressive modeling: Scalable image generation via next-scale prediction[J]. arXiv preprint arXiv:2404.02905, 2024.

[3]Peebles W, Xie S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 4195-4205.

[4]Sun P, Jiang Y, Chen S, et al. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation[J]. arXiv preprint arXiv:2406.06525, 2024.

Review 4 (Rating: 3)
  • This paper presents a systematic analysis of the impact of quantization on vision generative models, particularly comparing diffusion-style and language-style models. Under the bit-level scaling laws that have been studied in language modeling, the authors show that language-style models consistently outperform diffusion-style models.

  • The authors also provide explanations and investigations into the reason for their distinctive behaviors in low-bits.

  • To further enhance the bit-level scaling of language-style models, the TopKLD-based distillation method is proposed by balancing implicit knowledge and explicit knowledge.

Strengths

  • The paper provides a comprehensive study of how quantization affects two major paradigms of vision generative models, which is crucial for deploying these models efficiently. The finding that language-style models have superior bit-level scaling laws compared to diffusion-style models, might also shed light on further model optimization and deployment.

  • The proposed TopKLD method for knowledge distillation during the quantization process is innovative and shows experimental promise in improving bit-level scaling laws.

Weaknesses

  • The major weakness of this work is its limited scope. As VAR and DiT are specific instances of the language-style and diffusion-style families of vision generative models, their behavior may not apply to other types of vision generative models. Compared to the original paper on k-bit inference scaling laws, the model scope is relatively small, which makes it unclear whether the conclusions generalize to different model types.

  • The authors provide some analysis of the reasons behind the models' scaling behaviors and discuss the relevance of the discrete representation. However, vision AR and diffusion models are not distinctive from the representation side (see questions).

Questions

  • The authors should consider adding different model types into the investigations, that cover more typical language-style and diffusion-style vision generative models.

  • Language-style vision generative models follow the autoregressive modeling in language modeling, while not necessarily being discrete. Similarly, diffusion-style models do not always adopt a continuous representation. How would the analysis apply to discrete diffusion and continuous AR?

  • Meanwhile, the error analysis from the discrete and continuous domains does not seem conclusive for language-style and diffusion-style models (related to Q2).

Comment

References

[1] Dhariwal P, Nichol A. Diffusion models beat gans on image synthesis[J]. Advances in neural information processing systems, 2021, 34: 8780-8794.

[2] Ho J, Saharia C, Chan W, et al. Cascaded diffusion models for high fidelity image generation[J]. Journal of Machine Learning Research, 2022, 23(47): 1-33.

[3] Rombach R, Blattmann A, Lorenz D, et al. High-resolution image synthesis with latent diffusion models[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 10684-10695.

[4] Peebles W, Xie S. Scalable diffusion models with transformers[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 4195-4205.

[5] Gao S, Zhou P, Cheng M M, et al. Masked diffusion transformer is a strong image synthesizer[C]//Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023: 23164-23173.

[6] Liu Q, Zeng Z, He J, et al. Alleviating Distortion in Image Generation via Multi-Resolution Diffusion Models[J]. arXiv preprint arXiv:2406.09416, 2024.

[7] Gu S, Chen D, Bao J, et al. Vector quantized diffusion model for text-to-image synthesis[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2022: 10696-10706.

[8] Tang Z, Gu S, Bao J, et al. Improved vector quantized diffusion models[J]. arXiv preprint arXiv:2205.16007, 2022.

[9] Chang H, Zhang H, Jiang L, et al. Maskgit: Masked generative image transformer[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11315-11325.

[10] Li T, Katabi D, He K. Self-conditioned image generation via generating representations[J]. arXiv preprint arXiv:2312.03701, 2023.

[11] Yu L, Lezama J, Gundavarapu N B, et al. Language Model Beats Diffusion--Tokenizer is Key to Visual Generation[J]. arXiv preprint arXiv:2310.05737, 2023.

[12] Yu Q, Weber M, Deng X, et al. An Image is Worth 32 Tokens for Reconstruction and Generation[J]. arXiv preprint arXiv:2406.07550, 2024.

[13] Weber M, Yu L, Yu Q, et al. Maskbit: Embedding-free image generation via bit tokens[J]. arXiv preprint arXiv:2409.16211, 2024.

[14] Razavi A, Van den Oord A, Vinyals O. Generating diverse high-fidelity images with vq-vae-2[J]. Advances in neural information processing systems, 2019, 32.

[15] Esser P, Rombach R, Ommer B. Taming transformers for high-resolution image synthesis[C]//Proceedings of the IEEE/CVF conference on computer vision and pattern recognition. 2021: 12873-12883.

[16] Lee D, Kim C, Kim S, et al. Autoregressive image generation using residual quantization[C]//Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2022: 11523-11532.

[17] Yu J, Li X, Koh J Y, et al. Vector-quantized image modeling with improved vqgan[J]. arXiv preprint arXiv:2110.04627, 2021.

[18] Tian K, Jiang Y, Yuan Z, et al. Visual autoregressive modeling: Scalable image generation via next-scale prediction[J]. arXiv preprint arXiv:2404.02905, 2024.

[19] Sun P, Jiang Y, Chen S, et al. Autoregressive Model Beats Diffusion: Llama for Scalable Image Generation[J]. arXiv preprint arXiv:2406.06525, 2024.

[20] Li, Tianhong, et al. "Autoregressive Image Generation without Vector Quantization." arXiv preprint arXiv:2406.11838 (2024).

Comment

Weakness 1

We greatly appreciate your valuable review comments. As shown in Table 1, the field of vision generation models currently has two main development paths, and some models along these paths exhibit excellent scaling laws. This paper builds on the models that had already demonstrated scaling laws at the time of our research, namely DiT and VAR, for further exploration.

Due to the chronological order of submissions, scaling laws have also recently been observed in the continuous autoregressive model domain, specifically in MAR. Therefore, following your suggestion, we conducted the same experiments to verify the correctness of our conclusions. Additionally, we validated our findings on LlamaGen, a model similar to VAR, to further enhance the generalizability of our conclusions.

The results of these experiments can be found in Appendix C2 of the rebuttal revision, titled "Empirical Validation Through Additional Models."

Weakness 2

From Table 1, we can observe that in the field of vision generative models, there has been ongoing debate regarding the use of discrete versus continuous representation spaces.[20] Both approaches have shown strong performance in terms of scaling laws. This work, however, takes a different perspective by investigating the impact of these representation spaces on the scaling laws in quantized models. We find that, despite achieving comparable performance at full precision, discrete models consistently outperform continuous models across various quantization settings. Additionally, through our additional experiments on continuous autoregressive models and discrete AR models in Appendix C, as well as the analysis in Section 3.2, it becomes evident that the nature of the representation space—discrete or continuous—has a significant impact on determining whether AR and diffusion models exhibit superior bit-level scaling laws.

Question 1

We conducted a statistical summary of vision generation models and, based on this analysis, selected models that have already reported scaling laws for further exploration: MAR[20], VAR[18], DiT[4], and LlamaGen[19]. The details of this overview can be found in Appendix C1 of the rebuttal revision, titled "Overview."

Question 2

As shown in Table 1, current discrete diffusion models do not exhibit scaling laws, making it impossible to explore their bit-level scaling laws. Therefore, we focused on supplementary research into continuous autoregressive models. The results show that, due to their continuous representation space, the bit-level scaling laws of continuous autoregressive models are not as strong as those observed in discrete models, which aligns with our conclusions.

At the same time, we observed that LlamaGen, a discrete autoregressive model, demonstrates the same excellent bit-level scaling laws. This suggests that the observed scaling behavior is not specific to VAR but is instead a result of the discrete representation space, as discussed in Section 3.2. Since the representation space has been abstracted, this characteristic holds universally across various discrete models, as detailed in Section 3.2.

Question 3

Thank you very much for your valuable suggestions. In the field of vision generative models, there has been ongoing debate regarding the use of discrete versus continuous representation spaces (e.g., [17,18,19,20]). Both approaches have shown strong performance in terms of scaling laws. This work, however, takes a different perspective by investigating the impact of these representation spaces on the scaling laws in quantized models. We find that, despite achieving comparable performance at full precision, discrete autoregressive models consistently outperform continuous models across various quantization settings.

Our study is an essential step toward understanding how various models and quantization methods influence bit-level scaling behavior, and it also provides the following recommendations for future work:

From our exploration, we can conclude that discrete representation space reconstruction offers a more stable foundation for scaling at low bit precision. Moreover, we introduced the TopKLD method, which enhances knowledge transfer from full-precision models by effectively balancing explicit and implicit knowledge, thereby improving bit-level scaling performance. This study indicates that achieving optimal bit-level scaling behavior requires a synergistic interaction between model design and quantization algorithms.
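To give intuition for why a discrete representation space offers this stability, here is a toy Python experiment (ours, not the paper's analysis; the shapes and noise scale are arbitrary assumptions). A small perturbation of token logits usually leaves the argmax-selected codebook entry unchanged, so the discrete path absorbs much of the noise, whereas a continuous latent carries the full error downstream.

```python
import torch

torch.manual_seed(0)
logits = torch.randn(10000, 4096)           # token logits over a 4096-way codebook
noise = 0.05 * torch.randn_like(logits)     # stand-in for quantization error

tokens_clean = logits.argmax(dim=-1)        # discrete path: snap to a token
tokens_noisy = (logits + noise).argmax(dim=-1)
flip_rate = (tokens_clean != tokens_noisy).float().mean()

latent_mse = noise.pow(2).mean()            # continuous path keeps all the error
print(f"token flip rate: {flip_rate:.2%}, continuous latent MSE: {latent_mse:.4f}")
```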

Comment

Rebuttal Revision Paper Modifications

We greatly appreciate your valuable review comments. We have revised the paper according to your suggestions and submitted the rebuttal version. For detailed modifications, please refer to the rebuttal version PDF and appendix C: Supplementary materials for rebuttal. Below, we address your identified weaknesses and questions, hoping to resolve your concerns and improve our score.

Table 1

| Model Type | Discrete/Continuous | Model | #Params | FID | IS | Date | Scaling ability |
|---|---|---|---|---|---|---|---|
| Diffusion-style | continuous | ADM [1] | 554M | 10.94 | 101 | 2021.07 | No |
| Diffusion-style | continuous | CDM [2] | - | 4.88 | 158.7 | 2021.12 | No |
| Diffusion-style | continuous | LDM-8 [3] | 258M | 7.76 | 209.5 | 2022.04 | No |
| Diffusion-style | continuous | LDM-4 | 400M | 3.6 | 247.7 | | No |
| Diffusion-style | continuous | DiT [4] | 458M | 5.02 | 167.2 | 2023.03 | Yes |
| Diffusion-style | | | 675M | 2.27 | 278.2 | | |
| Diffusion-style | | | 3B | 2.1 | 304.4 | | |
| Diffusion-style | | | 7B | 2.28 | 316.2 | | |
| Diffusion-style | continuous | MDTv [5] | 676M | 1.58 | 314.7 | 2024.02 | No |
| Diffusion-style | continuous | DiMR [6] | 505M | 1.7 | 289 | 2024.07 | No |
| Diffusion-style | Discrete | VQ-diffusion [7] | 370M | 11.89 | - | 2022.03 | No |
| Diffusion-style | Discrete | VQ-diffusion-V2 [8] | 370M | 7.65 | - | 2023.02 | |
| Language-style | Discrete | MaskGIT [9] | 177M | 6.18 | 182.1 | 2022.02 | No |
| Language-style | Discrete | RCG (cond.) [10] | 502M | 3.49 | 215.5 | 2023.12 | No |
| Language-style | Discrete | MAGVIT-v2 [11] | 307M | 1.78 | 319.4 | 2023.04 | No |
| Language-style | Discrete | TiTok [12] | 287M | 1.97 | 281.8 | 2024.07 | No |
| Language-style | Discrete | MaskBit [13] | 305M | 1.52 | 328.6 | 2024.09 | No |
| Language-style | Discrete | VQVAE [14] | 13.5B | 31.11 | 45 | 2019.06 | No |
| Language-style | Discrete | VQGAN [15] | 1.4B | 5.2 | 175.1 | 2021.07 | No |
| Language-style | Discrete | RQTran [16] | 3.8B | 3.8 | 323.7 | 2022.03 | No |
| Language-style | Discrete | VITVQ [17] | 1.7B | 3.04 | 227.4 | 2022.07 | No |
| Language-style | Discrete | VAR [18] | 310M | 3.3 | 274.4 | 2024.04 | Yes |
| Language-style | | | 600M | 2.57 | 302.6 | | |
| Language-style | | | 1B | 2.09 | 312.9 | | |
| Language-style | | | 2B | 1.92 | 323.1 | | |
| Language-style | Discrete | LlamaGen [19] | 343M | 3.07 | 256.06 | 2024.07 | Yes |
| Language-style | | | 775M | 2.62 | 244.1 | | |
| Language-style | | | 1.4B | 2.34 | 253.9 | | |
| Language-style | | | 3.1B | 2.18 | 263.3 | | |
| Language-style | continuous | MAR [20] | 208M | 2.31 | 281.7 | 2024.07 | Yes |
| Language-style | | | 479M | 1.78 | 296 | | |
| Language-style | | | 943M | 1.55 | 303.7 | | |
Comment

Dear reviewer:

Thank you for your great efforts in reviewing our paper and providing constructive suggestions/comments. To address the weakness you raised regarding limited scope, we have provided extensive examples of the recent debate on continuous versus discrete representation spaces in vision generation models, as presented in Table 1. Additionally, following your suggestion, we conducted numerous experiments, detailed in Appendix C and Figure 1 of the main text, to demonstrate the validity and generalizability of our conclusions on bit-level scaling laws for vision generation models. These experiments cover the various mainstream research directions in vision generation models. If our rebuttal does not address your concerns, you are warmly welcomed to raise further questions. If our responses have addressed your concerns, we sincerely request that you consider raising our score.

Best Wishes!

Authors

评论

Dear reviewer:

Thank you for your great efforts in reviewing our paper and providing constructive suggestions/comments. To address the weaknesses you raised, we have conducted extensive experiments in Appendix C and Figure 1 of the main paper to alleviate your concerns. If our rebuttal does not address your concerns, you are warmly welcomed to raise further questions. If our responses have addressed your concerns, we sincerely request that you consider raising our score.

Best Wishes!

Authors

AC Meta-Review

The paper received mostly negative ratings. The reviewers cited limited scope, insufficient experimental evaluation, and a lack of computational overhead analysis. They also raised a number of questions. The authors tried to address the concerns during the discussion period and provided many additional evaluations and details. Unfortunately, the reviewers were largely unengaged during this period, with one exception, and the scores did not improve. The AC believes the paper did not find enough support from the community. The authors went further and wrote a message to the ACs and PCs, in which they explained their concerns about the proper evaluation of their manuscript. The AC went through the reviews, the responses, and the message to the ACs, and looked through the paper. The AC believes that while the reviewers could indeed have been more responsive, the number of issues they raised clearly shows that the paper did not get enough traction with the community. Hence the decision.

Additional Comments on Reviewer Discussion

There was no extensive discussion between reviewers and authors, which is uncommon for ICLR.

Final Decision

Reject