AccuQuant: Simulating Multiple Denoising Steps for Quantizing Diffusion Models
Abstract
Reviews and Discussion
The paper proposes a post-training method to quantize diffusion models for image generation that is more robust than state-of-the-art techniques, providing better image quality at the same quantization bit width. The main idea is to optimize the quantization process by jointly considering multiple timesteps of the diffusion process in order to prevent accumulation of quantization errors. This is also implemented efficiently via a gradient approximation approach that avoids backpropagation through the selected group of timesteps in the sampling chain.
Strengths and Weaknesses
Strengths:
- the paper is overall well written and clearly motivated
- the method seems novel and it is scientifically sound. The idea to jointly consider multiple timesteps when optimizing the quantization process indeed helps attenuate the error accumulation that affects models optimizing individual timesteps
- experimental results are extensive enough to properly characterize the performance of the method compared to the state-of-the-art under various quantization bit widths and to assess the design decisions such as the gradient approximation
Weaknesses:
- some minor presentation issues. For example: the FID2FP32 metric is mentioned in the introduction before being defined and may confuse readers. Also, it needs some explanation of why this metric is needed and why measuring the regular FID alone is not enough
- results are good but they are not always the best. It seems that for the experiments where numbers are reported, TAC [43] is a strong baseline. The authors should better discuss this method.
- The ablation on the group size suggests that beyond a certain size the proposed method will result in (significantly) worse performance than working independently on each timestep. It is not very clear how one would choose the optimal group size.
- It would be interesting to check the performance of the method on recent models using very few timesteps, so that a single group could be formed for optimization.
Questions
The fact that the approximated gradients work better than the "correct" counterpart is quite surprising. The authors attempt an explanation suggesting that the removed term is quite noisy, but I am not fully convinced by this explanation. The method without approximation should minimize Eq. (11) having the full-precision result as ground truth and the current denoiser, so how is it possible that the correct objective is noisier than the simplified objective? Is it possible this might be due to a small batch size for the experiment without approximation?
Limitations
Some limitations are acknowledged in the Appendix. The authors should integrate them in the main text. Societal impact of more efficient generative models is not discussed.
Final Justification
The method is sufficiently novel, theoretically justified and has good experimental performance. Authors addressed some of the minor points I raised in a satisfactory manner.
Formatting Issues
Formatting is ok.
We sincerely thank Reviewer W9wt for the thorough review and valuable comments. Below are our responses to the reviewer's concerns.
W1. Some minor presentation issues. For example: the FID2FP32 metric is mentioned in the introduction before being defined and may confuse readers. Also, it needs some explanation of why this metric is needed and why measuring the regular FID alone is not enough.
FID2FP32 measures the FID score between the outputs of the quantized and full-precision models. In diffusion model quantization, the standard FID alone is insufficient because it only evaluates the generative quality with respect to real images, not how well the quantized model replicates the behavior of its original full-precision counterpart. Therefore, FID2FP32 is necessary to specifically quantify the performance drop caused by quantization by directly measuring the similarity between quantized and full-precision model outputs. We will add a clear definition of FID2FP32 and an explanation of why it is needed.
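To make the definition concrete, here is a minimal sketch of how FID2FP32 could be computed; we assume the pytorch-fid package and that samples from both models have been saved to disk (folder names are placeholders):

```python
# Hypothetical sketch: FID2FP32 is a standard FID computation in which the
# reference set consists of samples from the full-precision model rather than
# real images. We assume the pytorch-fid package and pre-saved image folders.
from pytorch_fid.fid_score import calculate_fid_given_paths

fid2fp32 = calculate_fid_given_paths(
    ["samples_fp32/", "samples_quantized/"],  # full-precision vs. quantized samples
    batch_size=50,
    device="cuda",
    dims=2048,  # InceptionV3 pool3 features, the standard choice for FID
)
print(f"FID2FP32: {fid2fp32:.2f}")
```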
W2. Results are good but they are not always the best. It seems that for the experiments where numbers are reported, TAC is a strong baseline. The authors should better discuss this method.
TAC assumes that the output of the quantized model can be represented as a simple linear function of the full-precision output, parameterized by two coefficients, and compensates for the quantization error using these coefficients obtained during calibration. However, in low-bit quantization settings, we observe that this assumption does not hold due to the large accumulated errors. We report in Table R11 the R² scores (coefficients of determination from linear regression) between the outputs of the full-precision and quantized models. Notably, the R² value is lower for the 3/6- and 4/6-bit settings than for the 4/8-bit setting, indicating that the full-precision outputs are not accurately regressed from the quantized outputs under low-bit settings. This explains the suboptimal performance of TAC in low-bit settings, as observed in Table 1 (e.g., 3/6 bits for DDIM CIFAR-10 and 3/8 bits for LDM Churches). Additionally, TAC requires extra computation to compensate for the error during the inference phase, which can further limit its efficiency.
In contrast, AccuQuant minimizes the accumulated quantization error directly by optimizing the quantization parameters through a gradient-based approach. This allows AccuQuant to more effectively capture the non-linear relationship between quantized and full-precision outputs, resulting in superior performance compared to TAC, especially in the challenging low bit-width regimes. Furthermore, our method does not require any additional operations during the inference phase.
Table R11: R² scores (coefficients of determination) between the outputs of the quantized and full-precision models. The scores are computed using the scikit-learn package; a minimal sketch of the computation is given after the table.
| Bit-width (W/A) | 4/8 | 4/6 | 3/6 |
|---|---|---|---|
| | 0.91 | 0.68 | 0.39 |
| | 0.92 | 0.71 | 0.62 |
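As referenced above, one possible way to obtain such scores with scikit-learn is sketched below; we regress the flattened full-precision outputs on the quantized ones with a single slope and intercept, matching the linear assumption discussed above (variable and function names are illustrative):

```python
# Illustrative sketch: fit y_fp ~ a * x_quant + b over all output values and
# report the coefficient of determination (R^2) of this linear fit.
import numpy as np
from sklearn.linear_model import LinearRegression

def r2_between_outputs(quant_out: np.ndarray, fp_out: np.ndarray) -> float:
    """quant_out, fp_out: arrays of model outputs with identical shapes."""
    x = quant_out.reshape(-1, 1)  # quantized outputs as the regressor
    y = fp_out.reshape(-1)        # full-precision outputs as the target
    reg = LinearRegression().fit(x, y)
    return reg.score(x, y)        # R^2 of the linear fit
```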
W3. The ablation on the group size suggests that beyond a certain size the proposed method will result in (significantly) worse performance than working independently on each timestep. It is not very clear how one would choose the optimal group size.
It may seem ideal to set the group size equal to the total number of timesteps, as it minimizes the final accumulated error directly; however, this approach often leads to suboptimal quantization parameters in practice. When the group size is excessively large, the accumulated error becomes substantial. This makes it difficult to optimize the quantization parameters effectively, resulting in suboptimal performance.
Therefore, the group size should balance the trade-off between capturing long-range accumulated error and maintaining stable optimization. In our settings, we treat the group size as a hyperparameter, chosen to balance two factors: (1) capturing accumulated errors across multiple timesteps, and (2) maintaining stable optimization of quantization parameters. We empirically found that splitting the timesteps into 10–20 groups works well in practice. We believe that finding the optimal group size for different settings would be a promising direction for future research, as mentioned in the limitations in the Appendix.
W4. It would be interesting to check the performance of the method on recent models using very few timesteps, so that a single group could be formed for optimization.
Thank you for this excellent suggestion. In response, we present in Table R12 our quantization results on a distilled diffusion model with very few timesteps. Specifically, we use PD [1] with a total of 8 steps (W4A8 bits, CelebA 256×256), applying quantization to the pre-trained distilled model provided in the official GitHub repository. We summarize our findings as follows: (1) AccuQuant consistently outperforms Q-Diffusion across all metrics. This indicates that AccuQuant generalizes well even to distilled models with very few denoising steps. (2) Group size 1 yields lower performance compared to group sizes 4 and 8. This suggests that mitigating accumulated error remains important, even for models distilled to only a few timesteps. (3) Group size 4 is superior to group size 8 for certain metrics. This implies that, even when the total number of timesteps is small, setting the group size to include all steps (i.e., a single group) is not always optimal. This finding reinforces our earlier discussion regarding the trade-offs of group size selection.
Table R12: Quantitative results for PD [1] on CelebA 256×256. We report FID2FP32, LPIPS, PSNR, and SSIM.
| Method | FID2FP32 | LPIPS | PSNR | SSIM |
|---|---|---|---|---|
| Q-Diffusion | 33.31 | 0.2696 | 22.3109 | 0.8312 |
| Ours (group size 1) | 23.154 | 0.2051 | 24.1612 | 0.9193 |
| Ours (group size 4) | 13.49 | 0.1768 | 25.3458 | 0.9282 |
| Ours (group size 8) | 14.219 | 0.1648 | 24.7452 | 0.9299 |
Reference: [1] Progressive Distillation for Fast Sampling of Diffusion Models, ICLR, 2022.
Q1. How is it possible that the correct objective is noisier than the simplified objective? Is it possible this might be due to a small batch size for the experiment without approximation?
Although the full gradient without approximation (Eq. (11)) is theoretically expected to provide the most accurate optimization direction, our empirical results demonstrate that this is not always true in practice. The key reason lies in the Jacobian term, which is present in the full gradient but omitted in our approximation.
As discussed in our answer to L1 of Reviewer YJ1y, the Jacobian component is significantly smaller in average magnitude than the dominant scalar component, but it exhibits a high dynamic range, introducing substantial noise into the full gradient. This high-variance noise causes the quantization parameters to change inconsistently at each step, leading to an unstable and unreliable calibration process. In contrast, our approximation excludes the noisy Jacobian term and utilizes only the dominant scalar component, effectively removing this undesirable noise while preserving the overall mean gradient magnitude and the relative proportions of gradients across timesteps. As a result, our method enables a more stable and reliable optimization process.
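For intuition, the kind of approximation described above can be illustrated with the following PyTorch-style sketch; the function and variable names are ours, not the paper's, and this is a simplified sketch rather than the exact implementation:

```python
# Simplified sketch of the gradient approximation: the multi-step quantized
# output keeps its value, but its gradient flows only through the single-step
# output, scaled by a scalar coefficient, so the noisy Jacobian of the
# remaining steps never enters the backward pass.
import torch
import torch.nn.functional as F

def approx_step_loss(fp_target, quant_x_M_nograd, quant_x_m, g_m):
    # fp_target        : x_(t-M) from the full-precision model (fixed reference)
    # quant_x_M_nograd : quantized multi-step output, computed under torch.no_grad()
    # quant_x_m        : quantized output after m steps, carrying gradients
    # g_m              : scalar coefficient replacing the Jacobian of the later steps
    surrogate = quant_x_M_nograd - quant_x_m.detach() + quant_x_m
    return g_m * F.mse_loss(surrogate, fp_target)
```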
Thank you for your response. My comments have been addressed and I will keep my score.
Thank you for your comments and for considering our response. We will carefully revise the paper accordingly.
This paper proposes AccuQuant, a novel post-training quantization method for diffusion models that reduces accumulated quantization errors by simulating multiple denoising steps during quantization. It also introduces a memory-efficient objective and demonstrates strong performance across various tasks and models.
Strengths and Weaknesses
Strengths:
- The paper presents a strong motivation by identifying and analyzing the unique challenge of error accumulation in quantized diffusion models.
- AccuQuant is a thoughtful approach that simulates multiple denoising steps during quantization, aligning closely with the actual inference behavior of diffusion models. This is a meaningful departure from prior per-step quantization strategies.
- The thorough evaluation on standard benchmarks strengthens the paper and showcases real-world applicability.
Weaknesses:
1. The method is tested across various diffusion models and benchmarks, but it is unclear whether tasks beyond generation (e.g., image restoration, inpainting, or editing) were considered; if these tasks cannot be evaluated, reasonable explanations should be given.
2. The work makes a strong assumption that minimizing output discrepancies over a few denoising steps is sufficient to mitigate long-range error accumulation. Some error behaviors might emerge only over longer sampling chains, and the authors should discuss this issue in more detail.
3. Although the paper introduces an efficient O(1)-memory implementation, there may still be computational overhead from simulating multiple denoising steps during quantization, especially for large-scale models. A more detailed analysis of runtime and energy trade-offs during quantization would help readers understand its deployment feasibility.
Questions
1: How does AccuQuant determine the optimal number of denoising steps to simulate during quantization to balance accuracy and efficiency?
2: How well does AccuQuant generalize to tasks beyond image generation, such as image restoration (a small-scale experiment will do)?
3: What are the trade-offs regarding computational cost between AccuQuant and previous step-wise quantization methods during the quantization process?
Limitations
The major limitation of this work is its adaptation to other domains such as inpainting, restoration, etc.
Final Justification
I thank the authors for their efforts in addressing all my concerns. As all my questions have been satisfactorily answered, I will raise my score accordingly.
Formatting Issues
N.A.
We sincerely thank Reviewer iFPY for the thorough review and valuable comments. Below are our responses to the reviewer's concerns.
W1, Q2, L1. It is unclear whether tasks beyond generation (e.g., image restoration, inpainting, or editing) were considered; if these tasks cannot be evaluated, it is better to give reasonable explanations.
Thank you for this valuable suggestion. The experiments reported in the main paper were conducted using the standard benchmarks for diffusion model quantization. However, following your comment, we present experimental results on tasks beyond image generation, including image restoration (Gaussian deblurring) and style transfer.
For image restoration, we leverage the pre-trained DDPG [1] model from the official GitHub repository on the CelebA dataset (256 × 256 resolution, 1K images), injecting noise with σᵧ = 0.05. Also, we use 100 total timesteps with a group size of 5 and calibrate with 64 samples obtained from a full-precision model. In Table R8, AccuQuant consistently outperforms Q-Diffusion across both PSNR and LPIPS metrics, demonstrating its effectiveness in the restoration setting.
For style transfer, we use Stable Diffusion v1.4 with the prompt 'A cartoon style', evaluated on the COCO validation set, with 35 total timesteps using the DDIM sampler. In Table R9, AccuQuant outperforms Q-Diffusion on all evaluated metrics, highlighting its robustness for diverse generative tasks beyond standard image synthesis.
Table R8: Quantitative results for Gaussian deblurring on CelebA 256×256. We report PSNR and LPIPS for the 4/8- and 3/8-bit settings, comparing our method with Q-Diffusion.
| Method | W/A | PSNR | LPIPS (w.r.t GT) |
|---|---|---|---|
| Full precision | 32/32 | 31.23 | 0.0584 |
| Q-Diffusion | 4/8 | 27.63 | 0.2131 |
| Ours | 4/8 | 30.57 | 0.0984 |
| Q-Diffusion | 3/8 | 24.7 | 0.4574 |
| Ours | 3/8 | 26.15 | 0.3979 |
Table R9: Quantitative results for style transfer on the COCO validation set (512×512). We report SSIM, PSNR, LPIPS, style loss, content loss, and FID2FP32. All scores are calculated using the output of the full-precision model, since ground truth images are not available for style transfer.
| Method | W/A | SSIM | PSNR | LPIPS | Style loss | Content loss | FID2FP32 (5k) |
|---|---|---|---|---|---|---|---|
| Q-Diffusion | 4/8 | 0.6645 | 18.5 | 0.331 | 0.0012 | 13.2267 | 14.2 |
| Ours | 4/8 | 0.7444 | 20.49 | 0.2594 | 0.0006 | 9.3687 | 10.81 |
[1] Image Restoration by Denoising Diffusion Models with Iteratively Preconditioned Guidance, CVPR, 2024
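For completeness, the PSNR and LPIPS numbers in Tables R8 and R9 could be computed roughly as in the sketch below; the lpips package, the AlexNet backbone, and the [0, 1] input range are our assumptions:

```python
# Rough sketch for computing PSNR and LPIPS between two batches of images.
import torch
import lpips  # pip install lpips

lpips_fn = lpips.LPIPS(net="alex")  # assumed backbone; VGG is another common choice

def psnr(x: torch.Tensor, y: torch.Tensor, max_val: float = 1.0) -> torch.Tensor:
    """x, y: image tensors in [0, 1] of shape (B, C, H, W)."""
    mse = torch.mean((x - y) ** 2)
    return 10.0 * torch.log10(max_val ** 2 / mse)

def lpips_dist(x: torch.Tensor, y: torch.Tensor) -> torch.Tensor:
    """LPIPS expects inputs scaled to [-1, 1]."""
    with torch.no_grad():
        return lpips_fn(2 * x - 1, 2 * y - 1).mean()
```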
W2. Is minimizing output discrepancies over a few denoising steps sufficient?
It may seem intuitive to use a very large group size to directly account for long-range error accumulation. However, our experiments show that dividing the timesteps into smaller groups is an effective way to substantially reduce accumulated error in diffusion model quantization. By optimizing quantization parameters for each group, we can address error accumulation locally. This approach yields much lower total accumulated error compared to considering long-range error accumulation with excessively large group sizes, as shown in Table 4. This is because, when the group size is excessively large, the accumulated error becomes substantial. This makes it difficult to optimize the quantization parameters effectively, resulting in suboptimal performance. Therefore, the group size should balance the trade-off between capturing long-range accumulated error and maintaining stable optimization.
Q1. How to determine the optimal group size?
The group size is treated as a hyperparameter that balances two objectives: (1) sufficiently capturing accumulated errors across multiple timesteps, and (2) ensuring stable optimization of quantization parameters. As shown in the results, if the group size is too small, it may fail to capture long-term error accumulation; if it is too large, the optimization becomes unstable, as a single set of parameters must account for the diverse behaviors of many timesteps. The optimal group size can vary depending on the model and dataset, but we empirically found that dividing the timesteps into 10 to 20 groups yields consistently strong results across our experiments. As mentioned in the limitations section of the Appendix, we believe that finding the optimal group size for different settings would be a promising direction for future research.
W3, Q3. A more detailed analysis of runtime and energy trade-offs compared to previous step-wise quantization methods during quantization.
In Table R10, we compare AccuQuant with Q-Diffusion in terms of computational overhead, specifically calibration time (GPU hours on an A6000 GPU) and average energy consumption during calibration (Wh); a measurement sketch is given after the table. We can see that AccuQuant achieves both faster calibration time and superior performance compared to Q-Diffusion. Although AccuQuant may require more computational cost for one loss calculation, we find that 50 epochs per group are sufficient for the quantization parameters to converge. In contrast, Q-Diffusion calibrates each layer and residual block for 5000 epochs. Although AccuQuant incurs a marginal increase in energy during calibration due to the multi-step optimization, it achieves substantially enhanced performance with shorter calibration time, making it a favorable trade-off for real-world deployment.
Table R10: Calibration computational overhead for DDIM CIFAR-10 with 100 timesteps.
| Method | Bits (W/A) | Batch size | Calibration time (hours) | Energy (Wh) | FID↓ | FID2FP32↓ |
|---|---|---|---|---|---|---|
| FP | 32/32 | 8 | – | – | 4.26 | 0 |
| Q-Diffusion | 6/6 | 8 | 5.97 | 519.39 | 30.46 | 35.24 |
| Ours | 6/6 | 8 | 5.56 | 561.56 | 5.79 | 3.3 |
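As referenced above, one way such Wh figures could be obtained is by polling GPU power draw during calibration and integrating over time; the sketch below uses pynvml and a one-second polling interval, both of which are our assumptions about the measurement setup:

```python
# Sketch: integrate GPU power draw (via NVML) over the calibration run to get Wh.
import time
import threading
import pynvml

def measure_energy_wh(run_calibration, device_index=0, poll_s=1.0):
    pynvml.nvmlInit()
    handle = pynvml.nvmlDeviceGetHandleByIndex(device_index)
    samples, stop = [], threading.Event()

    def poll():
        while not stop.is_set():
            # nvmlDeviceGetPowerUsage returns milliwatts
            samples.append(pynvml.nvmlDeviceGetPowerUsage(handle) / 1000.0)
            time.sleep(poll_s)

    thread = threading.Thread(target=poll, daemon=True)
    start = time.time()
    thread.start()
    run_calibration()  # the calibration routine to be measured
    stop.set()
    thread.join()
    elapsed_s = time.time() - start
    avg_watts = sum(samples) / max(len(samples), 1)
    pynvml.nvmlShutdown()
    return avg_watts * elapsed_s / 3600.0  # watt-hours
```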
I thank the authors for their efforts in addressing all my concerns. As all my questions have been satisfactorily answered, I will raise my score accordingly.
Thank you for considering our rebuttal. We will carefully incorporate the results and discussions into the revised paper.
This paper proposes that PTQ for diffusion models should not be based solely on single-step denoising alignment, but rather on multi-step joint alignment to better account for accumulated errors. To address the high computational cost of aligning multiple steps, the authors introduce an efficient gradient approximation method based on empirical observations. Extensive experiments demonstrate that the proposed approach achieves state-of-the-art PTQ performance for diffusion models.
Strengths and Weaknesses
Strengths
- The paper presents a clear and well-motivated rationale for aligning multiple denoising steps to mitigate cumulative errors, and accordingly introduces a novel and reasonable optimization objective.
- The authors thoroughly consider the computational complexity involved in aligning cumulative errors and effectively address it through mathematical reformulation and gradient approximation techniques.
- Extensive experiments demonstrate that the proposed method consistently outperforms existing PTQ approaches for diffusion models across multiple evaluation dimensions.
Weaknesses
- Increasing the group size consistently leads to suboptimal results, which is rather puzzling. This may suggest that the alignment objectives defined in Equations (10) and (12) are potentially inaccurate or not well-justified.
- The manuscript does not report the overall cost of the proposed method, including actual runtime and memory efficiency, which are crucial for evaluating its practical applicability.
Questions
- Could enforcing multi-step alignment lead to suboptimal results? For example, if we keep the same calibration process but only include the loss between the final-step outputs of the quantized and full-precision models, could that potentially lead to better optimization? This concern may be reflected in Table 4, where increasing the group size actually leads to performance degradation.
- What is the actual computational overhead of the proposed multi-step joint optimization? Given the complexity, would a lightweight QAT approach achieve better results with similar resource costs?
- Why does removing gradient approximation lead to worse results? The results in Table 5 are counterintuitive. This may suggest that the optimization objective in Equation (12) is inherently flawed, as also questioned in Point 1 above.
Limitations
No further discussion is needed.
Final Justification
Although the choice of group size is based on practical considerations rather than perfect theoretical backing, I still acknowledge the innovative contributions presented in the manuscript.
Formatting Issues
No
We sincerely thank Reviewer kHiT for the thorough review and valuable comments. Below are our responses to the reviewer's concerns.
W1, Q1: Increasing the group size consistently leads to suboptimal results, which is rather puzzling. This may suggest that the alignment objectives defined in Equations (10) and (12) are potentially inaccurate or not well-justified.
It may seem ideal to set the group size equal to the total number of timesteps, as it minimizes the final accumulated error directly; however, this approach often leads to suboptimal quantization parameters in practice. When the group size is excessively large, the accumulated error becomes substantial. This makes it difficult to optimize the quantization parameters effectively, resulting in suboptimal performance. Therefore, the group size should balance the trade-off between capturing long-range accumulated error and maintaining stable optimization. We believe finding the optimal group size for different settings would be a promising direction for future research, as mentioned in the limitations in the Appendix.
W2: The manuscript does not report the overall cost of the proposed method, including actual runtime and memory efficiency, which are crucial for evaluating its practical applicability.
In Table R5, we compare AccuQuant with Q-Diffusion in terms of computational performance and generation quality. We can see that AccuQuant achieves superior performance with shorter calibration time compared to Q-Diffusion. We note that although AccuQuant may require more computational cost for one loss calculation, it only requires 50 epochs per group. In contrast, Q-Diffusion calibrates each layer and residual block for 5000 epochs. Although AccuQuant incurs a marginal increase in energy consumption, it delivers a favorable trade-off for real-world deployment.
In addition, we evaluate the inference efficiency of our quantized diffusion model against both Q-Diffusion and the full-precision (FP) baseline. Since the official PyTorch quantization API does not support bit-widths lower than 8, we quantize both weights and activations to 8 bits and measure the memory usage and runtime latency using ONNX Runtime with the Intel Xeon Gold 6226R CPU. As shown in Table R6, both quantized models achieve over a 3 times speedup compared to the FP model. Our method incurs a marginally higher latency (by less than 0.18%) and larger model size (by 0.006 MB) compared to Q-Diffusion, which we attribute to the separate quantization parameters per group.
In summary, although our method marginally increases latency and model size compared to Q-Diffusion, the benefits of reduced calibration time and improved generation quality offer a favorable trade-off for practical deployment.
Table R5: Calibration computational overhead for DDIM CIFAR-10 with 100 timesteps.
| Method | Bits (W/A) | Batch size | Calibration time (hours) | Energy (Wh) | FID↓ | FID2FP32↓ |
|---|---|---|---|---|---|---|
| FP | 32/32 | 8 | – | – | 4.26 | 0 |
| Q-Diffusion | 6/6 | 8 | 5.97 | 519.39 | 30.46 | 35.24 |
| Ours | 6/6 | 8 | 5.56 | 561.56 | 5.79 | 3.3 |
Table R6: Computational cost of real-time CPU inference. Experiments are conducted on CIFAR-10 with DDIM over 100 timesteps.
| Method | Bits (W/A) | Batch size | GBops | Memory (MB) | CPU latency (s) | Model size (MB) | Speedup |
|---|---|---|---|---|---|---|---|
| FP | 32/32 | 64 | 6597 | 1726 | 94.589 | 143.08 | ×1 |
| Q-Diffusion | 8/8 | 64 | 798 | 1541.7 | 31.322 | 36.57 | ×3.02 |
| Ours | 8/8 | 64 | 798 | 1541.7 | 31.38 | 36.58 | ×3.01 |
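A minimal sketch of how the CPU latency in Table R6 could be measured with ONNX Runtime is given below; the model path, input layout, and the absence of a timestep input are placeholders and simplifications on our part:

```python
# Sketch: time CPU inference of an exported (quantized) model with ONNX Runtime.
import time
import numpy as np
import onnxruntime as ort

sess = ort.InferenceSession("quantized_unet.onnx", providers=["CPUExecutionProvider"])
input_name = sess.get_inputs()[0].name
x = np.random.randn(64, 3, 32, 32).astype(np.float32)  # CIFAR-10-sized batch
# Note: a real diffusion UNet would also take a timestep input; omitted here.

for _ in range(3):  # warm-up runs
    sess.run(None, {input_name: x})

runs = 10
start = time.time()
for _ in range(runs):
    sess.run(None, {input_name: x})
print(f"CPU latency: {(time.time() - start) / runs:.3f} s per batch")
```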
Q2: What is the actual computational overhead of the proposed multi-step joint optimization? Given the complexity, would a lightweight QAT approach achieve better results with similar resource costs?
We compare in Table R7 AccuQuant with a lightweight QAT approach, EfficientDM [1]. We report the results both under the same resource constraints (denoted as EfficientDM*) and using the official training recipe with larger training data and longer training time (denoted as EfficientDM-Full). We find that under an identical setting, AccuQuant achieves substantially better performance than EfficientDM*, suggesting that QAT frameworks cannot converge sufficiently within limited resource budgets. Furthermore, although EfficientDM-Full benefits from extended training time and larger datasets, AccuQuant still outperforms it in terms of FID2FP32, demonstrating the effectiveness of our method once again.
Table R7: Comparison against lightweight QAT. Experiments are conducted on ImageNet 256×256 with LDM-4 over 20 timesteps.
| Method | W/A | Training time (hours) | # of training data | FID↓ | sFID↓ | IS↑ | FID2FP32↓ |
|---|---|---|---|---|---|---|---|
| EfficientDM* | 4/8 | 4.34 | 5120 | 12.43 | 25.07 | 197.86 | 14.55 |
| EfficientDM-Full | 4/8 | 6.5 | 32000 | 9.92 | 7.40 | 351.79 | 1.63 |
| Ours | 4/8 | 4.34 | 5120 | 9.39 | 7.41 | 356.48 | 0.65 |
[1] : EfficientDM: Efficient Quantization-Aware Fine-Tuning of Low-Bit Diffusion Models, ICLR, 2024
Q3: Why does removing the gradient approximation lead to worse results? This may suggest that the optimization objective in Equation (12) is inherently flawed.
Although the full gradient without approximation (Eq. (14)) is expected to provide the most accurate optimization direction, our empirical results demonstrate that it leads to unstable and difficult convergence. The key reason is that the quantization parameters have not yet converged, which results in large output changes in response to small input variations and consequently leads to a noisy Jacobian term. In Table R4 (see our answer to L1 of Reviewer YJ1y), the Jacobian component is smaller in average magnitude than the dominant scalar component, but it exhibits a high dynamic range, which introduces substantial noise into the full gradient (up to 21% at peak, and above 6% for most timesteps). Therefore, we approximate the full gradient by omitting the Jacobian term and utilizing only the dominant scalar component. This removes the undesirable noise effectively while preserving the overall mean gradient magnitude and the relative proportions of gradients across timesteps. As a result, our method enables a more stable and reliable optimization process. In summary, the results in Table 5 do not indicate a fundamental flaw in the optimization objective itself, but rather show that, in practice, the exact gradient is dominated by high-frequency noise due to the Jacobian term.
Thank you for your detailed response. Although the choice of group size is based on practical considerations rather than perfect theoretical backing, I still acknowledge the innovative contributions presented in the manuscript. I will maintain my rating.
Thank you for considering our response. We will carefully revise the paper accordingly.
Prior work on post-training quantization (PTQ) has attempted to compress diffusion models by calibrating quantization parameters one step at a time, but ignores the fact that quantization errors accumulate across the sequence. This paper introduces AccuQuant, a novel PTQ method that explicitly addresses error accumulation. A key technical innovation is an efficient gradient approximation technique that reduces memory usage. Empirically, it achieves stronger fidelity to the full-precision model than prior quantization methods on a variety of benchmarks.
Strengths and Weaknesses
Strengths:
- Novel multi-step calibration: Unlike previous PTQ approaches that minimize error per step in isolation, AccuQuant explicitly calibrates over groups of steps. This is a significant innovation: by “simulating” the sampling trajectory, the method directly accounts for the accumulated error from previous steps. This better aligns the quantized model’s outputs with the full-precision model over time, addressing a core limitation of existing methods.
- Memory-efficient implementation: The paper presents a clever gradient approximation that avoids storing all intermediate activations when backpropagating through multiple steps. This reduces the memory complexity of multi-step calibration to O(1), making it practical for large models.
Weaknesses:
- Reproducibility (no released code): The authors state they do not release code due to copyright restrictions. This is particularly concerning because some of the reported results are close to the baselines, which makes independent verification important.
- Variable gains at higher bit-widths: In some settings, the improvements over baselines are modest. For instance, on CIFAR-10 at 6/6 bits the gap between AccuQuant (FID=5.79) and the best prior (TFMQ-DM, FID=7.84) is clear, but at 4/8 bits the advantage is smaller (FID 4.75 vs TAC’s 4.89). This suggests that when models are already near full precision, the benefit may be incremental. It would be helpful to understand when multi-step calibration provides the biggest payoff (e.g. extremely low bits vs. moderate bits).
Questions
- Accumulated quantization error has been studied for CNN and Transformer architectures. Compared to these literature works, how is the current method different?
- The paper primarily compares with quantization methods specifically designed for diffusion models. However, it would strengthen the evaluation to include more recent diffusion-focused approaches such as PTQ4DM and ViDiT-Q. Additionally, since diffusion models often incorporate Transformer-based components, it would be valuable to compare against advanced Transformer quantization methods such as SmoothQuant and RepQ-ViT, to better assess general applicability and robustness.
[1] RepQ-ViT: Scale Reparameterization for Post-Training Quantization of Vision Transformers.
[2] ViDiT-Q: Efficient and Accurate Quantization of Diffusion Transformers for Image and Video Generation.
[3] Post-Training Quantization on Diffusion Models.
[4] SmoothQuant: Accurate and Efficient Post-Training Quantization for Large Language Models.
- The authors have conducted experiments on group size. However, will the optimal group size change when the dataset or model changes?
- The authors have conducted experiments analyzing the gradient approximation. Would it be possible to also compare the loss computed using full-precision models, rather than focusing solely on the quantized models?
Limitations
Gradient approximation effects: The proposed gradient approximation makes the method feasible, but it is inherently an approximation. The paper does not deeply analyze whether this approximation affects the optimality of calibration. It is possible that the loss surface changes and the learned quantization parameters are suboptimal compared to a full-gradient approach (if that were practical). Quantifying any trade-off would strengthen the work. It is not clear how the gradient approximation affects the overall loss.
Final Justification
Thanks for the authors' response; it has addressed most of my concerns. However, the response to the question about "Variable gains at higher bit-widths" is not convincing to me, nor aligned with the results in Table 1. The performance may even depend on the specific combination of weight and activation precisions. Therefore, I will keep my score.
Formatting Issues
NA
We sincerely thank Reviewer YJ1y for the thorough review and valuable comments. Below are our responses to the reviewer's concerns.
W1. Reproducibility
For reproducibility, we have included a detailed description of the AccuQuant pipeline in Section 3. We have also provided an algorithm table and detailed hyperparameter settings for each experiment in the Appendix. Additionally, we present the following PyTorch-style pseudocode, which will be included in our paper upon acceptance.
# M : group size
# T : total number of timesteps
# x_T : random Gaussian noise
# fp_model : full-precision model
# quant_model : quantized model
# quant_params : quantization parameters of the current group
# gather_output(model, x, t, n, ...) : run n denoising steps starting from x at timestep t
# sqrt_alpha_* : noise-schedule coefficients used for the gradient scaling factor
# sg() : stop-gradient operation (i.e., tensor.detach())

# Initialize
x_t = x_T
t = T
group_index = 0

for i in range(number_of_groups):
    # Gather x_(t-M) from the full-precision model (fixed target for this group)
    with torch.no_grad():
        fp_x_M = gather_output(fp_model, x_t, t, M)

    # Reconstruction stage: optimize the quantization parameters of the current group
    quant_model.set_group(group_index)
    optimizer = torch.optim.Adam(quant_params, learning_rate)
    for epoch in range(epochs):
        # Gather \tilde x_(t-M) with stop gradient (full multi-step quantized output)
        with torch.no_grad():
            sg_quant_x_M = gather_output(quant_model, x_t, t, M, quant_params)

        optimizer.zero_grad()
        # Accumulate gradients over the M steps of the group
        for m in range(1, M + 1):
            # Calculate the gradient scaling factor
            g_m = sqrt_alpha_M / sqrt_alpha_m
            # Gather \tilde x_(t-m) with gradients
            quant_x_m = gather_output(quant_model, x_t, t, m, quant_params)
            # Compute \tilde x_(t-M) so that gradients flow only through \tilde x_(t-m)
            quant_x_M = sg_quant_x_M - sg(quant_x_m) + quant_x_m
            # Compute the loss for the current step
            loss_accuquant = torch.mean((fp_x_M - quant_x_M) ** 2) * g_m
            loss_accuquant.backward()
        # Update quantization parameters with the accumulated gradients
        optimizer.step()

    # Update indices for the next group
    x_t = fp_x_M
    t = t - M
    group_index += 1
...
W2. Variable gains at higher bit-widths
AccuQuant mitigates accumulated quantization errors, which grow significantly in low-bit settings. We verify this in Table R1 by measuring the cosine distances between the outputs of the full-precision and quantized models (a computation sketch is given after the table). By explicitly addressing the problem, AccuQuant delivers strong performance in such scenarios. While the problem is less severe in high-bit settings, AccuQuant still offers consistent improvements over prior methods, confirming its robustness.
Table R1: Cosine distance between the outputs of the full-precision model and the quantized model across timesteps for various bit-width settings. Min-max quantization is applied to weights and activations.
| Bit-width (W/A) \ Timestep | 100 | 80 | 60 | 40 | 20 | 0 |
|---|---|---|---|---|---|---|
| 4/8 | 0 | 0.0084 | 0.0269 | 0.0796 | 0.1698 | 0.2032 |
| 4/6 | 0 | 0.0782 | 0.122 | 0.2098 | 0.3799 | 0.4523 |
| 3/6 | 0 | 0.0819 | 0.1624 | 0.3298 | 0.5842 | 0.6864 |
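For clarity, the cosine distances in Table R1 could be computed per timestep roughly as follows (a sketch under our assumptions about output shapes):

```python
# Sketch: cosine distance between full-precision and quantized model outputs.
import torch
import torch.nn.functional as F

def cosine_distance(fp_out: torch.Tensor, quant_out: torch.Tensor) -> float:
    """fp_out, quant_out: outputs of shape (B, C, H, W) at the same timestep."""
    a = fp_out.flatten(start_dim=1)
    b = quant_out.flatten(start_dim=1)
    return (1.0 - F.cosine_similarity(a, b, dim=1)).mean().item()
```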
Q1. Accumulation Quantization Error has been used in CNN architectures or Transformer architectures. Compared to literature works, how is current method different?
In typical CNNs and transformers, quantization errors are accumulated only across the layers. However, in diffusion models, quantization errors are not only propagated through the layers but are also accumulated across different denoising steps. As a result, quantizing diffusion models requires a specialized approach that considers both the architectural and temporal aspects, which is more challenging and cannot be addressed sufficiently with conventional methods.
Q2. Comparisons with recent diffusion-focused approaches and transformer quantization methods.
We compare in Table R2 AccuQuant to recent diffusion-focused quantization methods (PTQ4DM, PTQ4DiT) and advanced transformer quantization methods (RepQ-ViT). We can see that AccuQuant achieves the best performance across all metrics, suggesting that AccuQuant is robust to architectures (e.g., CNNs, transformers).
Please note that our comparison with PTQ4DiT can be viewed as an indirect comparison with SmoothQuant as PTQ4DiT adopts a similar approach. Unfortunately, we were unable to include ViDiT-Q in our experiments. This method targets large-scale models (e.g., OpenSora), which require extensive computational resources, and adapting AccuQuant to these models was not feasible during the short rebuttal period.
Table R2: Quantization results of DiT-XL on ImageNet (256×256).
| Method | W/A | FID | sFID |
|---|---|---|---|
| Full-precision | 32/32 | 12.41 | 19.23 |
| PTQ4DM | 4/8 | 213.66 | 85.11 |
| RepQ-ViT | 4/8 | 224.14 | 81.24 |
| TFMQ-DM | 4/8 | 143.47 | 61.09 |
| PTQ4DiT | 4/8 | 28.90 | 34.56 |
| Ours | 4/8 | 18.60 | 18.83 |
Q3. Will the optimal group size changes when dataset or model changes?
Yes, the optimal group size can vary depending on the model architecture and dataset. We choose the group size to balance two factors: (1) capturing accumulated errors across multiple timesteps, and (2) maintaining stable optimization. Using a small group size may fail to reflect errors accumulated over multiple timesteps, whereas using a large group size could complicate optimization due to the substantial accumulated error. We empirically found that splitting the timesteps into 10–20 groups works well in practice. As mentioned in the limitations section of the Appendix, we believe that finding the optimal group size for different settings would be a promising direction for future research.
Q4. Would it be possible to also compare the loss computed using full-precision models, rather than focusing solely on the quantized models?
We would like to clarify that computing gradients w.r.t. the loss from a full-precision (FP) model is not feasible in our setup. This is because, as described in Eq. (11), we treat the FP model as a fixed ground-truth reference while optimizing the quantized model. Since the FP model is frozen during optimization, its loss gradients are not available for comparison.
Instead of gradients w.r.t. the loss, we compare in Table R3 the gradients between consecutive denoising steps (the last term in Eq. (14)) computed with the quantized diffusion model and with the full-precision model. We can see that the gradients of the quantized model have a much wider range than those of the full-precision model. As discussed in the paper, the Jacobian term in the full gradient is noisy when the quantization parameters have not yet converged, and our gradient approximation removes this noise from the full gradient, resulting in a more stable and reliable optimization process for the quantized model.
Table R3: Mean and min–max range of gradients computed every 20 timesteps. The first to third rows show results for the full-precision model, the quantized model, and our approximated gradient, respectively.
| Timestep | 100 | 80 | 60 | 40 | 20 |
|---|---|---|---|---|---|
| | 1.022 (0.112) | 1.059 (0.059) | 1.027 (0.034) | 1.008 (0.034) | 1.001 (0.071) |
| | 1.032 (0.044) | 1.060 (0.050) | 1.028 (0.029) | 1.008 (0.023) | 1.001 (0.027) |
| | 1.146 | 1.068 | 1.029 | 1.009 | 1.001 |
L1. Loss landscape comparisons between full gradient and approximated gradient.
We’d like to clarify that both using the full gradient and our approximation share the same forward pass, meaning that the loss value is identical. Therefore, the loss surface itself does not change.
Although the loss surface remains unchanged, the gradient approximation stabilizes optimization. We show in Table R4 the same statistics as in Figure 3 of the paper, but report the min–max range to better illustrate the variability of the gradients. The full gradients are highly variable and unstable, since unconverged quantization parameters result in a noisy Jacobian term, as discussed in Q4. In contrast, our gradient approximation omits the noisy Jacobian term and uses only the scalar coefficient, resulting in much more stable gradients. This leads to more stable convergence during calibration and ultimately yields more optimal results, as demonstrated in Table 5 in the paper.
Table R4: Comparison of the full gradient and the Jacobian term. The first row shows the full gradient between consecutive denoising steps from the quantized model, the second row shows the Jacobian term, and the third row shows our approximation. We report the mean and min–max range (max − min) of the gradient at every 20 timesteps.
| Timestep | 100 | 80 | 60 | 40 | 20 |
|---|---|---|---|---|---|
| Full gradient | 1.0221 (0.1119) | 1.0586 (0.0586) | 1.0273 (0.0343) | 1.0082 (0.0341) | 1.0008 (0.0711) |
| Jacobian term | -0.1236 (0.1119) | -0.0097 (0.0586) | -0.0015 (0.0343) | -0.0006 (0.0341) | -0.0002 (0.0711) |
| Ours (approximation) | 1.1457 | 1.0683 | 1.0288 | 1.0088 | 1.0010 |
Thank you for your acknowledgment of our response. If you have any remaining questions or concerns, please let us know, and we will be glad to provide further clarification.
This paper proposes AccuQuant, a novel post-training quantization method for diffusion models that reduces accumulated quantization errors by simulating multiple denoising steps during quantization. The proposed method is well-motivated, scientifically sound, and demonstrates consistent improvements over prior PTQ approaches across multiple benchmarks. Before the rebuttal, the reviewers raised concerns, including limited code availability, modest gains at higher bit-widths, and some open questions regarding computational overhead, group size selection, and generalization to other tasks. The authors have addressed these concerns during the rebuttal; therefore, I would like to recommend accepting this paper.