Outlier-Aware Post-Training Quantization for Discrete Graph Diffusion Models
Abstract
Reviews and Discussion
This paper introduces Bit-DGDM, a post-training quantization framework for Discrete Graph Diffusion Models (DGDMs), addressing the long inference times caused by heavy computational load and the presence of outliers in weights and activations. It proposes decomposing activations into dense, easily quantizable parts and sparse, non-quantizable parts based on precomputed thresholds. For weights, it introduces a decomposition algorithm based on the assumption that weights can be split into a sparse part with α-sparsity and a low-rank dense part. The approach ultimately enables low-bit dense computation together with high-bit sparse-dense computation, and the implemented computational kernels achieve significant practical acceleration.
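For intuition, the following is a minimal NumPy sketch of the general sparse-dense decomposition pattern summarized above. The magnitude threshold, the toy quantizer, and the plain thresholded weight split are simplified stand-ins for illustration, not the paper's precomputed thresholds, low-rank-plus-sparse weight algorithm, or CUDA kernels.

```python
import numpy as np

def split_by_threshold(x, tau):
    """Split a matrix into a dense part (entries <= tau in magnitude) and a
    sparse outlier part (entries > tau), handled separately downstream."""
    mask = np.abs(x) > tau
    return np.where(mask, 0.0, x), np.where(mask, x, 0.0)

def fake_quant(x, n_bits=4):
    """Toy symmetric uniform quantize-dequantize, for illustration only."""
    qmax = 2 ** (n_bits - 1) - 1
    scale = np.abs(x).max() / qmax + 1e-8
    return np.clip(np.round(x / scale), -qmax, qmax) * scale

rng = np.random.default_rng(0)
x = rng.standard_normal((8, 64));  x[0, 3] = 40.0    # an activation outlier
w = rng.standard_normal((64, 32)); w[5, 7] = 25.0    # a weight outlier

x_dense, x_sparse = split_by_threshold(x, tau=6.0)   # precomputed threshold
w_dense, w_sparse = split_by_threshold(w, tau=6.0)   # stand-in for the low-rank + sparse split

# Bulk of the work runs as a low-bit dense GEMM; the few outliers run as
# high-precision sparse GEMMs and are added back to the output.
y = (fake_quant(x_dense) @ fake_quant(w_dense)
     + x_sparse @ w
     + x_dense @ w_sparse)
print(np.linalg.norm(y - x @ w) / np.linalg.norm(x @ w))  # small relative error
```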
Questions to Authors
- I'm curious what theoretical advancements the ill-conditioned low-rank decomposition proposed in this paper offers compared to previous work.
- A discussion of work on low-rank factorization should be added, along with a discussion of methods that divide weights by importance into quantized and unquantized (or higher-precision) parts, e.g., PB-LLM [1].
- What is the basis of rank selection for the low-rank components? Is there any experimental verification?
[1] Shang, Y., Yuan, Z., Wu, Q., & Dong, Z. (2023). PB-LLM: Partially binarized large language models. arXiv preprint arXiv:2310.00034.
Claims and Evidence
The claims made in the submission are supported by clear and convincing evidence.
Methods and Evaluation Criteria
The proposed methods and evaluation criteria make sense.
Theoretical Claims
All proofs for theoretical claims are correct.
Experimental Design and Analysis
Yes
Supplementary Material
Yes, the kernel code is valid.
Relation to Prior Work
The paper designs its ill-conditioned low-rank weight decomposition algorithm inspired by [1] and [2].
[1] Candès, E. J., Li, X., Ma, Y., & Wright, J. (2011). Robust principal component analysis?. Journal of the ACM (JACM), 58(3), 1-37.
[2] Tong, T., Ma, C., & Chi, Y. (2021). Accelerating ill-conditioned low-rank matrix estimation via scaled gradient descent. Journal of Machine Learning Research, 22(150), 1-63.
Missing Important References
There are some quantization methods that should be discussed and compared [a, b, c, d]:
[a] QVD: Post-training Quantization for Video Diffusion Models, in ACM MM
[b] Ptq4sam: Post-training quantization for segment anything, in CVPR
[c] Post-training Quantization on Diffusion Models, in CVPR
[d] Towards Accurate Post-training Quantization for Diffusion Models, in CVPR
Other Strengths and Weaknesses
Strengths:
- The paper is well-written and clearly presents its contributions. The reader can easily follow the authors' logic.
- This article actually proposes a general quantization framework that can accelerate the model when there are outliers in both the activation values and weights.
- Compared with many other LLM-oriented baseline methods, the advantages of this method are demonstrated.
- The supplementary material is provided in detail, which enhances the reproducibility and transparency of the research.
Weaknesses:
- The proposed method has no advantage in reducing the runtime memory footprint compared to the BF16 baseline, although the memory usage is indeed relatively small.
- The idea of dividing weights and activations into difficult-to-quantize and easy-to-quantize parts is common.
- Whether DGDM has practical and wide applications is worth discussing.
Other Comments or Suggestions
No
We very much appreciate your positive comments on our paper.
Q1: Some quantization methods [a, b, c, d], low-rank factorization methods, and importance-based weight-splitting methods (e.g., PB-LLM) should be discussed.
A1: We sincerely appreciate your valuable suggestion. Due to the page limit of the rebuttal, we have included the comparison with these related works in this link. We will further expand our related work section to incorporate a detailed discussion of these studies.
Q2: The proposed method shows no advantage over the BF16 baseline in runtime memory.
A2: Thank you for your valuable comment. Our Bit-DGDM shows notable memory usage advantages compared to the BF16 baseline. In Table 1, on the QM9 dataset, Bit-DGDM requires only 2.3 GB of memory compared to BF16's 3.4 GB, a 32% reduction. The results are consistent across other datasets. Since DGDM is primarily a computation-intensive model rather than a large-parameter model, our main focus remains on improving inference speed while maintaining performance. We will incorporate memory analysis in our experimental results.
Q3: The idea of dividing weights and activations into difficult-to-quantize and easy-to-quantize parts is common.
A3: Thank you for your feedback. Though the idea of separating weights and activations has been explored before, our method introduces several key innovations that distinguish it from prior work. (i) For easy-to-quantize weights, we are the first to propose ill-conditioned low-rank weight decomposition, enabling stable decomposition into low-rank components even in the presence of significant outliers. (ii) For difficult-to-quantize weights, we propose the use of α-Sparsity, a structured sparsity property that facilitates subsequent inference acceleration. (iii) For activation outliers, our approach eliminates the need for calibration data to select thresholds, making it more practical.
Q4: Whether DGDM has practical and wide application is worth discussing.
A4: Thank you for your insightful question. Compared to Gaussian denoising diffusion models, DGDM possesses unique capabilities in modeling discrete data, enabling multiple real-world applications, particularly in biological and chemical tasks. In this work, we validate DGDM’s effectiveness through two critical applications: molecular generation and inverse protein folding. Furthermore, our proposed Bit-DGDM is specifically designed to facilitate the practical deployment of DGDM under real-world computational constraints.
Q5: What theoretical advancements does the proposed ill-conditioned low-rank decomposition offer compared to previous work?
A5: The key theoretical advancement of our ill-conditioned low-rank decomposition lies in its fundamental difference from conventional SVD-based methods when handling matrices with significant outliers. In standard SVD, the presence of outliers introduces several critical limitations [1,2]. They distort the singular value spectrum by amplifying small singular values associated with high-frequency noise components, consequently degrading the signal-to-noise ratio in the resulting low-rank approximation. This distortion manifests most prominently in the impaired rank selection capability, where the outlier-contaminated singular value spectrum exhibits slowed decay, making traditional truncation criteria unreliable. Building on prior ill-conditioned decomposition works [3,4], our method establishes rigorous recovery guarantees for the ill-conditioned low-rank decomposition of weights. The theoretical framework ensures stable outlier isolation even in the presence of significant outliers, while simultaneously maintaining better control over the condition number of the dense matrix throughout the decomposition process.
[1] Estimating the number of hidden neurons in a feedforward network using the singular value decomposition. 2006.
[2] Robust PCA via outlier pursuit. NeurIPS, 2010.
[3] Accelerating ill-conditioned low-rank matrix estimation via scaled gradient descent. JMLR, 2021.
[4] Learned robust PCA: A scalable deep unfolding approach for high-dimensional outlier detection. NeurIPS, 2021.
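To make the contrast with plain SVD concrete, here is a small, self-contained NumPy experiment: a generic robust-PCA-style alternation (threshold the residual into a sparse matrix, refit the low-rank part) applied to an outlier-contaminated matrix. This is an illustrative baseline under assumed toy data, not the scaled-gradient-descent decomposition used in the paper, and the rank, threshold, and outlier magnitudes are arbitrary choices.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 128, 64, 8
W_clean = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))
W = W_clean.copy()
idx = rng.choice(m * n, size=50, replace=False)
W.ravel()[idx] += rng.choice([-30.0, 30.0], size=50)       # inject weight outliers

def truncated_svd(A, rank):
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    return (U[:, :rank] * s[:rank]) @ Vt[:rank]

def lowrank_plus_sparse(A, rank, thresh=10.0, iters=20):
    """Naive alternation: fit a rank-r part, move large residuals into a
    sparse matrix S, refit.  Illustrative only; not the paper's algorithm."""
    S = np.zeros_like(A)
    for _ in range(iters):
        L = truncated_svd(A - S, rank)
        R = A - L
        S = np.where(np.abs(R) > thresh, R, 0.0)
    return L, S

L_svd = truncated_svd(W, r)                      # factors distorted by outliers
L_rob, _ = lowrank_plus_sparse(W, r)             # outliers isolated into S
rel_err = lambda X: np.linalg.norm(X - W_clean) / np.linalg.norm(W_clean)
print(f"plain truncated SVD error: {rel_err(L_svd):.3f}")
print(f"low-rank + sparse error:   {rel_err(L_rob):.3f}")
```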
Q6: What is the basis of rank selection for the low-rank components?
A6: Thank you for raising this important question. The selection of rank is based on a systematic trade-off analysis between computational efficiency and precision. Through experiments on the CATH dataset, we observed that a low rank (rank=8) led to significant degradation in graph generation quality, while a higher rank (rank=32) provided marginal quality improvements at substantial latency overhead. We will include detailed studies on rank selection in the revised manuscript.
| | Perplexity | Recovery (%) | Speedup (×) | Mem. (GB) |
|---|---|---|---|---|
| Rank=8 | 5.1 | 46.5 | 2.6 | 4.7 |
| Rank=16 | 4.5 | 51.6 | 2.5 | 4.9 |
| Rank=32 | 4.5 | 51.8 | 2.2 | 5.4 |
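As a concrete illustration of this kind of selection, the sketch below sweeps candidate ranks and returns the smallest one whose quality stays within a tolerance of the best observed; `build_quantized_model`, `evaluate_perplexity`, and `measure_latency` are hypothetical placeholders for the actual evaluation pipeline, so this is a sketch of the procedure rather than the paper's implementation.

```python
def select_rank(candidate_ranks, build_quantized_model, evaluate_perplexity,
                measure_latency, tol=0.05):
    """Hypothetical rank sweep: evaluate quality and latency at each candidate
    rank, then return the fastest rank whose perplexity stays within `tol`
    (relative) of the best perplexity seen in the sweep."""
    results = []
    for r in candidate_ranks:
        model = build_quantized_model(rank=r)     # hypothetical constructor
        ppl = evaluate_perplexity(model)          # hypothetical quality metric
        lat = measure_latency(model)              # hypothetical latency probe
        results.append((r, ppl, lat))
    best_ppl = min(p for _, p, _ in results)
    acceptable = [t for t in results if t[1] <= best_ppl * (1 + tol)]
    return min(acceptable, key=lambda t: t[2])[0]
```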
This paper focuses on the quantization of discrete diffusion models for graph data. To achieve this, the authors introduce sparse-dense activation quantization and low-rank decomposition with hardware support. Experimental results demonstrate that the proposed method enhances quantization performance while improving speed.
Questions to Authors
None
Claims and Evidence
- In lines 60--65, the authors argue that existing quantization methods are insufficient to address the challenges posed by the computation bottleneck, citing several quantization techniques for LLMs. However, prior work specifically targeting quantization for diffusion models [1, 2, 3] also addresses these challenges. These methods should be discussed in this context to provide a more comprehensive comparison.
- In lines 275--277, the authors state that the matrix is ill-conditioned but provide limited explanation of its impact. It would be helpful to clarify how this ill-conditioning affects the convergence of SGD and whether it introduces optimization difficulties. Additional theoretical or empirical analysis would strengthen this claim.
[1] Li, Xiuyu, et al. "Q-Diffusion: Quantizing diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Shang, Yuzhang, et al. "Post-training quantization on diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[3] So, Junhyuk, et al. "Temporal dynamic quantization for diffusion models." Advances in Neural Information Processing Systems 36 (2023): 48686-48698.
Methods and Evaluation Criteria
- The proposed method can be viewed as a combination of existing techniques, and its components lack sufficient novelty. As I understand it, the authors decompose both weights and activations into outliers and a central part. For weight decomposition, the approach closely resembles techniques used in LoRA [1]. Additionally, the concept of decomposed outliers has been previously introduced in [2]. Given these similarities, the paper should better highlight its unique contributions beyond existing work. While the method primarily builds on existing techniques, I acknowledge its contribution, as the implementation of this combination is non-trivial. The approach requires careful design and integration, which adds value despite the lack of fully novel components.
- There is increased computational cost due to the formulation. The proposed method relies on multiple integer multiplications to compensate for quantization errors, which may introduce additional computational overhead. It would be beneficial for the authors to provide a more detailed analysis of the computational complexity and conduct experiments to evaluate the actual impact on efficiency.
[1] Hu, Edward J., et al. "LoRA: Low-rank adaptation of large language models." In ICLR, 2022.
[2] Dettmers, Tim, et al. "GPT3.int8(): 8-bit matrix multiplication for transformers at scale." Advances in Neural Information Processing Systems 35 (2022): 30318-30332.
Theoretical Claims
Correct
Experimental Design and Analysis
Strengths
- The experimental evaluation is fair. The authors compare both memory usage and computation cost in terms of real acceleration, which requires a CUDA implementation. This adds credibility to the experiments and strengthens their impact.
- The proposed method demonstrates good performance, as it effectively improves quantization accuracy. While there is a minor trade-off between memory and computation cost, such trade-offs are expected in this field and do not detract from the overall contribution.
Weakness
- While there are existing quantization techniques for diffusion models, they are not included in the experimental comparisons [1, 2, 3]. A direct comparison with these methods would provide a clearer understanding of the proposed approach's advantages and limitations.
[1] Li, Xiuyu, et al. "Q-Diffusion: Quantizing diffusion models." Proceedings of the IEEE/CVF International Conference on Computer Vision. 2023.
[2] Shang, Yuzhang, et al. "Post-training quantization on diffusion models." Proceedings of the IEEE/CVF Conference on Computer Vision and Pattern Recognition. 2023.
[3] So, Junhyuk, et al. "Temporal dynamic quantization for diffusion models." Advances in Neural Information Processing Systems 36 (2023): 48686-48698.
Supplementary Material
Checked the CUDA implementation.
Relation to Prior Work
None
Missing Important References
None
Other Strengths and Weaknesses
Strengths
- The paper addresses an important but underexplored problem, quantization for discrete graph diffusion models. It also highlights the issue of weight outliers, which is crucial for improving quantization performance.
- The paper is well-structured, with a clear presentation of the motivation, background, methodology, and experiments, making it easy to follow.
- The inclusion of a CUDA implementation enhances the practical applicability of the method by supporting hardware acceleration.
- The paper provides a theoretical guarantee for the proposed method, strengthening its validity and reliability.
Other Comments or Suggestions
None
We highly appreciate your positive reviews and constructive suggestions.
Q1: These methods [1,2,3] should be discussed in this context to provide a more comprehensive comparison.
A1: Thank you for your constructive suggestion. We note that these methods [1,2,3] were designed for image diffusion models (IDMs). Specifically, [1,2] optimize quantized weights by minimizing the MSE between quantized and full-precision continuous outputs, and [3] introduces time-step-specific encoding for image diffusion models. However, these approaches cannot be directly applied to Discrete Graph Diffusion Models (DGDMs) due to fundamental incompatibilities. As shown in Remark 3.1, in DGDMs the intermediate node attributes and graph structures are obtained through discrete sampling from categorical distributions, while IDMs rely on Gaussian noise and continuous denoising processes. Our method advances existing methods through three innovations. (i) Recognizing the significant outliers in model weights, we first propose an ill-conditioned low-rank weight decomposition. This contrasts with a basic SVD approach, allowing our method to achieve better numerical stability. (ii) For the residual component derived from the raw weight and the low-rank decomposition, our method enforces α-sparsity, enabling efficient sparse matrix multiplication during inference. (iii) For activation outliers, our approach eliminates the need for calibration data to select thresholds, making it more practical.
We will add detailed discussion of these works in our related work. We sincerely appreciate your valuable feedback.
[1] Q-Diffusion: Quantizing diffusion models. In ICCV, 2023.
[2] Post-training quantization on diffusion models. In CVPR, 2023.
[3] Temporal dynamic quantization for diffusion models. In NeurIPS, 2023.
Q2: In lines 275--277, the authors state that the matrix is ill-conditioned but provide limited explanation of its impact. It would be helpful to clarify how this ill-conditioning affects the convergence of SGD and whether it introduces optimization difficulties. Additional theoretical or empirical analysis would strengthen this claim.
A2: We sincerely appreciate the reviewer's insightful suggestion. To better illustrate how ill-conditioning affects the optimization of Eqn. (9), we have conducted additional empirical analysis examining the relationship between the number of outliers and the reconstruction error (measured by the Frobenius norm between the reconstructed matrix and the original matrix). These results are provided in this link. We observe that the reconstruction error (Frobenius norm) increases monotonically with the number of outliers. This correlation indicates that stronger ill-conditioning (caused by more outliers) leads to greater optimization difficulties in matrix reconstruction. Moreover, in this link, we present a comprehensive derivation of the two scale terms in Eqn. (13).
We will include a more detailed description of this mechanism in the revised version to better clarify these points. Thank you for this valuable suggestion to improve the clarity of our work.
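To illustrate the qualitative trend reported in this reply, the following small NumPy sketch measures the Frobenius reconstruction error of a fixed-rank approximation as progressively more outliers are injected into a synthetic low-rank matrix. It uses a plain truncated SVD as a stand-in for the optimization of Eqn. (9), so only the direction of the trend, not the actual numbers, is meaningful.

```python
import numpy as np

rng = np.random.default_rng(0)
m, n, r = 128, 64, 8
base = rng.standard_normal((m, r)) @ rng.standard_normal((r, n))  # clean rank-r matrix

def rank_r_reconstruction_error(A, rank):
    """Frobenius norm of the residual left by a rank-`rank` SVD approximation."""
    U, s, Vt = np.linalg.svd(A, full_matrices=False)
    L = (U[:, :rank] * s[:rank]) @ Vt[:rank]
    return np.linalg.norm(A - L)

for k in (0, 10, 50, 200):                       # number of injected outliers
    A = base.copy()
    idx = rng.choice(m * n, size=k, replace=False)
    A.ravel()[idx] += 30.0
    print(k, round(rank_r_reconstruction_error(A, r), 2))
# The reconstruction error grows with the outlier count, mirroring the
# trend described in the reply above.
```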
Q3: It would be beneficial for the authors to provide a more detailed analysis of the computational complexity and conduct experiments to evaluate the actual impact on efficiency.
A3: We sincerely thank the reviewer for raising this important point regarding the computational cost of our method. We agree that the additional multiplications introduced to compensate for quantization errors incur some computational overhead, but this overhead is marginal in practice.
To evaluate this, we conducted an ablation study (Figure 4 and Sec 5.4). The results show that the ablation variant (iii) Bit-DGDM-, which removes multiplications for sparse components, achieves only a modest speed improvement (from 2.5× to 2.6×) but suffers significant degradation in generation quality, as evidenced by the perplexity increase (4.5 → 4.7). This performance decline highlights the critical role of these operations in mitigating high-magnitude outliers. Furthermore, compared to threshold-based alternatives (variant (ii)), our method not only preserves competitive generation quality but also delivers better inference speed (2.5× vs. 2.2×). This efficiency gain stems from our use of ill-conditioned low-rank decomposition. Collectively, these experiments demonstrate that the additional computational overhead is negligible and does not substantially impact inference speed.
In the revised version, we will expand our empirical analysis to better demonstrate the impact of each proposed component in our work. Once again, we deeply appreciate this valuable feedback, which will undoubtedly help improve the clarity and rigor of our work.
Thanks to the authors for their reply. I will keep my positive rating for now.
Thank you for your response. If you have any further suggestions, feel free to tell us. Thank you again for your constructive comments.
This paper presents Bit-DGDM, an advanced post-training quantization framework developed for Discrete Graph Diffusion Models. The proposed framework introduces two innovations, i.e., (1) a sparse-dense activation quantization mechanism and (2) an ill-conditioned low-rank weight decomposition technique, to effectively address two key challenges, i.e., the computational bottleneck in DGDMs and the outliers in weights and activations. The efficacy of the framework is rigorously validated through extensive experiments across a variety of graph generation tasks, demonstrating superior performance compared to existing quantization baselines.
Questions to Authors
See above. Besides, I would like to know the time cost for the quantization of DGDMs.
Claims and Evidence
The paper's claims are supported by clear evidence.
Methods and Evaluation Criteria
The framework proposed in this study shows a high degree of alignment with the challenges intrinsic to DGDMs. The authors present an incisive analysis of prevailing issues, particularly emphasizing the outliers and computational intensity. The evaluation is rigorously designed, incorporating well-established benchmark datasets and scientifically validated metrics, thereby substantiating the efficacy of the proposed method.
Theoretical Claims
I have checked the proofs and theoretical claims presented in the paper. The authors provide clear references for the work they rely on and explicitly delineate the underlying premises and assumptions that validate their theoretical proposition. The theoretical exposition is systematically organized.
Experimental Design and Analysis
I have reviewed the paper's experimental design and results. The selected baselines are the most advanced methods in the fields of LLMs and image diffusion models, and the datasets utilized for validation are mainstream graph generation benchmarks. The experimental design is convincing, and the validation procedures demonstrate the efficacy of the proposed method.
Supplementary Material
I have checked the theoretical proofs, the analysis of time complexity, and the implementation details of the kernel. The proofs are well-organized, leveraging basic principles from robust PCA, with each lemma and theorem appropriately referenced, thereby ensuring academic integrity. The time complexity analysis is exceptionally thorough. Besides, the kernel implementation is clear, with well-documented pseudocode. The supplementary materials are comprehensive and provide robust support for the paper's claims.
Relation to Prior Work
The contributions of this paper significantly extend the scope of prior quantization research by uncovering critical insights tailored to discrete graph diffusion models. The authors prove that computation, instead of memory loading, constitutes the primary bottleneck in DGDMs. Besides, the study reveals the presence of outliers in both activations and weights within DGDMs. To solve this, the authors innovatively leverage robust PCA for weight decomposition and employ sparse matrices to handle outliers.
Missing Important References
The important references I am familiar with have been mentioned in this paper.
Other Strengths and Weaknesses
The other strengths of the paper are as follows:
(1) The paper is well-structured and easy to follow.
(2) The proposed quantization method is effective in solving specific challenges in DGDMs.
(3) The experimental results show the effectiveness of the proposed method in both generation performance and inference speed.
Weaknesses:
(1) The font in the figures of the paper should be consistent to make it easier to read.
(2) The time complexity analysis should be extensively introduced in the main paper.
Other Comments or Suggestions
I suggest that the authors use a consistent font in the figures of the paper to make it easier to read.
We very much appreciate your constructive comments. For your concerns:
Q1: The font in the figures of the paper should be consistent to make it easier to read.
A1: Thank you for your helpful feedback. We have updated all figures to use Times New Roman font, ensuring formatting consistency. The updated figures are available here: https://anonymous.4open.science/r/ICML_Refined_Figures_9670
Q2: The time complexity analysis should be extensively introduced in the main paper.
A2: We appreciate your valuable suggestion. Due to space limitations in the main manuscript, we have included the detailed complexity analysis in Appendix D. In the revised version, we will add a concise summary of the key complexity results in Sec 4.3 of the main paper.
Q3: The time cost for the quantization of DGDMs.
A3: Thank you for your insightful question. As detailed in Appendix F.3, the quantization process demonstrates practical efficiency. Specifically, for the DiGress model on the QM9 dataset, the ill-conditioned decomposition requires 13.6 minutes and memory usage remains manageable at 2.1 GB. We emphasize that this computational overhead is highly acceptable given the significant inference acceleration achieved through quantization.
This paper proposes post-training quantization (PTQ) methods to quantize discrete graph diffusion models (DGDMs). The paper first analyzes the outlier distributions of weights and activations in DGDMs. For activations, the proposed method splits activation matrices into a high-precision sparse matrix (outliers) and a low-precision dense matrix. Then, weights are split into sparse matrices and low-rank matrices that represent the dense parts. By utilizing low-bit and sparse-dense matrix multiplication CUDA kernels, the proposed method not only achieves better accuracy and perplexity, but also fast computation.
Questions to Authors
The questions of the reviewer can be summarized as follows:
- What are the differences between the proposed method and SVDQuant?
- How significant is the tradeoff between the efficiency and effectiveness of the proposed method? This question is related to the design of the method as it contains high-precision multiplications in addition to the low-bit quantized operations.
Claims and Evidence
The paper’s claim is convincing and supported by both empirical and theoretical evidence.
Methods and Evaluation Criteria
The proposed method makes sense and remedies the computation bottlenecks of DGDM. Also, the motivational studies show that the proposed method is valid and aims to solve realistic problems. The paper also shows extensive evaluation, ablation, and sensitivity studies.
However, the reviewer finds that SVDQuant (Li et al., 2024) shares a similar idea regarding outliers and low-rank computation. Therefore, the reviewer suggests adding more comparisons with SVDQuant.
Theoretical Claims
No issues.
Experimental Design and Analysis
The experimental results show clear advantages of the proposed method in terms of accuracy and perplexity. However, the proposed method shows lower speedup and higher memory usage compared to some of the baselines. Therefore, as there exists some trade-off between accuracy and efficiency, the reviewer suggests adding more analysis regarding the tradeoff.
Supplementary Material
The reviewer appreciates algorithms, detailed experimental settings, and more analyses.
Relation to Prior Work
The contribution of the paper is related to neural network quantization as the paper analyzes and reduces the impact of outliers in target matrices.
Missing Important References
N/A
Other Strengths and Weaknesses
The paper is well-written and provides a detailed explanation of the preliminary and related work section.
Other Comments or Suggestions
The reviewer suggests that using the Times font in the figures would help make the paper more consistent in format.
We very much appreciate your positive comments and constructive suggestions.
Q1: What are the differences between the proposed method and SVDQuant?
A1: Thank you for your insightful question. Our method introduces two key innovations over SVDQuant. (i) Recognizing the presence of significant outliers in model weights, we propose an ill-conditioned low-rank weight decomposition method. This contrasts with SVDQuant's conventional SVD approach, allowing our method to achieve better numerical stability. (ii) For the residual component derived from the raw weight and the low-rank decomposition, while SVDQuant maintains a dense structure, our method enforces α-sparsity. This constraint ensures that no more than an α proportion of elements in each row and column are non-zero, enabling efficient sparse matrix multiplication during inference. Consequently, our approach not only preserves precision but also achieves consistent acceleration, whereas SVDQuant's dense residuals incur higher computational costs.
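For clarity on what this structured sparsity means operationally, here is a minimal NumPy sketch that only checks the α-sparsity property on a given residual matrix; it illustrates the constraint as described above and is not the paper's decomposition routine, and the example matrix and α value are arbitrary.

```python
import numpy as np

def satisfies_alpha_sparsity(S, alpha):
    """Check that no more than an `alpha` fraction of the entries in every row
    and in every column of S are non-zero (the constraint described above)."""
    nz = S != 0
    row_frac = nz.mean(axis=1)   # fraction of non-zero entries per row
    col_frac = nz.mean(axis=0)   # fraction of non-zero entries per column
    return bool(row_frac.max() <= alpha and col_frac.max() <= alpha)

rng = np.random.default_rng(0)
S = np.zeros((64, 32))
S[rng.integers(0, 64, size=20), rng.integers(0, 32, size=20)] = 25.0  # scattered outliers
print(satisfies_alpha_sparsity(S, alpha=0.1))   # True for this sparse example
```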
Q2: How significant is the tradeoff between the efficiency and effectiveness of the proposed method? This question is related to the design of the method as it contains high-precision multiplications in addition to the low-bit quantized operations.
A2: Thank you for your question. The trade-off between computational efficiency and model effectiveness is a central consideration in our model design. Our experimental results (Figure 4) demonstrate that while the high-precision multiplications do introduce some computational overhead, they are essential for maintaining the quality of generated graphs in precision-critical applications like molecular generation and protein inverse folding. Specifically, while removing the multiplication operations for sparse components (as in variant (iii) Bit-DGDM-) yields a slight speed improvement (2.5× → 2.6×), it comes at a significant cost in generation quality, as seen in the perplexity degradation (4.5 → 4.7). This confirms that the additional operations play a crucial role in handling high-magnitude outliers effectively. Moreover, compared to the threshold-based variant (ii), which removes weight outliers through thresholds, our method not only maintains competitive generation quality but also achieves better inference speed (2.5× vs. 2.2×). This gain stems from our use of ill-conditioned low-rank decomposition, which remains computationally efficient and effective in scenarios with significant outliers.
Q3: The reviewer suggests using Times font in the figures helps to make the paper more consistent in format.
A3: Thank you for your constructive suggestion. We have revised the font in all figures to Times font to ensure consistency with the paper’s format. The updated figures are available at the following link: https://anonymous.4open.science/r/ICML_Refined_Figures_9670. Please let us know if you have any additional feedback. We would be happy to make further improvements.
The author's rebuttal successfully resolved the reviewer's concerns, so this reviewer stays in a positive rating for this paper. I appreciate the authors' efforts and time in preparing the rebuttal.
Thank you for your response. If you have any further suggestions, feel free to tell us. Thank you again for your constructive comments and suggestions.
This paper introduces Bit-DGDM, a post-training quantization framework tailored for Discrete Graph Diffusion Models (DGDMs), addressing computational inefficiencies and outlier sensitivity in both weights and activations. The approach leverages a novel sparse-dense activation quantization and an ill-conditioned low-rank weight decomposition with α-sparsity to enable hardware-efficient inference without calibration data.
Pros:
- the paper addresses a practically important and underexplored problem—quantization in DGDMs—backed by solid theoretical motivation and well-executed empirical validation.
- all reviewers praised the methodological soundness, implementation rigor, and comprehensive evaluation across benchmarks and ablations.
- the rebuttal effectively clarified novelty over prior methods (e.g., SVDQuant, Q-Diffusion) and addressed concerns on theoretical contributions, computational trade-offs, and practical relevance.
- CUDA implementation and time/memory analyses enhance the work's real-world applicability.
Cons:
- some components (e.g., decomposition of quantizable/unquantizable parts) overlap with prior ideas.
- memory reduction over BF16 is modest, though justified by the compute-bound nature of DGDMs.
- additional references were requested and addressed in the rebuttal.
Overall, the paper makes a meaningful, well-substantiated contribution to the efficient deployment of DGDMs, combining theoretical depth with practical impact. All concerns raised during review were satisfactorily resolved.