PaperHub
Score: 4.8/10
Poster · 3 reviewers
Ratings: 2, 3, 3 (min 2, max 3, std 0.5)
ICML 2025

VCT: Training Consistency Models with Variational Noise Coupling

OpenReview · PDF
Submitted: 2025-01-23 · Updated: 2025-07-24
TL;DR

Improving Consistency Training with a learned data-noise coupling.

Abstract

Keywords

Consistency Models, Generative Models

Reviews and Discussion

Official Review (Rating: 2)

The authors propose an improved consistency training (CT) method by introducing a variational noise coupling scheme. The core idea involves training a data-dependent noise emission model using an encoder architecture inspired by Variational Autoencoders (VAEs). The method is theoretically linked to the VAE framework by deriving a loss function analogous to the Evidence Lower Bound (ELBO). Empirical evaluations on multiple image datasets demonstrate the superiority of the proposed method.

Questions for the Authors

  1. $\beta$ selection: Table 1 shows that $\beta$ significantly impacts performance. How was $\beta$ chosen for each experiment? Can the authors provide guidelines for selecting $\beta$ in practice?

Claims and Evidence

Yes

Methods and Evaluation Criteria

Yes

Theoretical Claims

Yes. The theoretical derivation in Appendix B (consistency lower bound) is basically correct but contains a minor error. See Other Comments or Suggestions for details.

Experimental Design and Analysis

Yes. Two issues should be addressed:

  1. Gradient clipping: The proposed method uses gradient clipping (clipping value=200), while baselines do not. It is unclear whether this technique alone contributes to performance gains.

  2. Training iterations for ImageNet: The authors increased baseline training iterations from 100k to 200k for fairness but did not report results at 100k. This obscures the true computational trade-offs. Including 100k results and discussing training costs would strengthen the comparison.

Supplementary Material

Yes, particularly the proof of the consistency lower bound in Appendix B.

Relation to Prior Literature

The work builds on consistency training [1] and leverages VAE-inspired coupling [2]. The idea of enhancing noise-data coupling aligns with [3], but differs by learning the coupling via an encoder instead of relying on the prediction of the consistency model itself during training. This connection is appropriately discussed.

[1] Song, Y., Dhariwal, P., Chen, M., and Sutskever, I. Consistency models. In International Conference on Machine Learning, 2023c. URL https://api.semanticscholar.org/CorpusID:257280191.

[2] Kingma, D. P. Auto-encoding variational bayes. International Conference on Learning Representations, 2013.

[3] Issenhuth, T., Santos, L. D., Franceschi, J.-Y., and Rakotomamonjy, A. Improving consistency models with generator-induced coupling. arXiv preprint arXiv:2406.09570, 2024.

Missing Essential References

No essential references appear missing.

Other Strengths and Weaknesses

Strengths:

  • The method is intuitively reasonable and grounded in established frameworks (VAEs, Flow Matching).

  • Extensive experiments validate the approach across datasets.

Weaknesses:

  • Theoretical justification: While the proposed loss is derived as an upper bound to the VAE loss, the tightness of this bound is not discussed. In VAEs, the ELBO is a tight bound that achieves equality when the variational posterior matches the true posterior (i.e., optimality). However, the paper does not clarify whether the proposed upper bound can similarly reach equality or under what conditions this would occur. This raises questions about the theoretical validity of the method compared to the VAE framework, as a loose upper bound might weaken the connection to the original ELBO’s guarantees.

  • (Minor) Marginal gains on ImageNet: The improvement (5.13 to 4.93 in 1-step FID) is modest, raising questions about scalability to higher resolutions and more complex datasets.

Other Comments or Suggestions

  1. Typos: In Appendix B, Equations (43), (48), and (49) should use $\geq$ instead of $\leq$.
Author Response

We thank the reviewer for the insightful comments.

W1. On Eq. (8), Typos, and Its Tightness

R1. The correct form of Eq. (8) should be:

$$N \sum_{i=0}^{N} \bigl\Vert f_\theta(\psi_{t_{i+1}}(x_0; x_1), t_{i+1}) - f_{\theta^-}(\psi_{t_i}(x_0; x_1), t_i) \bigr\Vert^2.$$

This follows from the Cauchy–Schwarz inequality. We establish its connection to the continuous-time CM as $N \rightarrow \infty$ and analyze their optimality. Applying a Taylor expansion under the assumption that $\Delta t := t_{i+1} - t_i = \frac{1}{N}$ (which can be relaxed) and the above inequality, we obtain:

$$-\log p_\theta(x_0) \leq \frac{1}{2\sigma^2}\, E_{q_\phi(x_1 \mid x_0)} \bigl\Vert x_0 - f_\theta(x_1, 1) \bigr\Vert^2 + \mathrm{KL}\bigl(q_\phi(z \mid x_0) \,\Vert\, p(z)\bigr) + C$$

$$\leq \frac{1}{2\sigma^2}\, N \sum_{i=0}^{N} \Bigl\Vert \frac{d}{dt} f_\theta(\psi_t, t) \Bigr|_{t=t_i} \Bigr\Vert^2 \Bigl(\frac{1}{N}\Bigr)^2 + \mathrm{KL}\bigl(q_\phi(z \mid x_0) \,\Vert\, p(z)\bigr) + C$$

for a constant $C$.

Taking $N \rightarrow \infty$ is reasonable even in practical scenarios, as CT and ECT propose designing the $N$ scheduler in a coarse-to-fine manner. Additionally, we observe that at the optimal values of $\theta$ and $\phi$, the reconstruction losses (i.e., the first terms in the upper bounds) specifically recover the consistency function. Consequently, the bound becomes tight at the optimum. We will incorporate this discussion in the camera-ready version.
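To spell out the Cauchy–Schwarz step behind the corrected Eq. (8) (our reading; the telescoping decomposition of the reconstruction gap into per-step consistency differences $a_i := f_\theta(\psi_{t_{i+1}}, t_{i+1}) - f_{\theta^-}(\psi_{t_i}, t_i)$ is an assumption based on the standard consistency-model argument, not quoted from the paper):

$$\Bigl\Vert \sum_{i=0}^{N} a_i \Bigr\Vert^2 \leq (N+1) \sum_{i=0}^{N} \Vert a_i \Vert^2,$$

which is the squared variant of the triangle inequality and is where the multiplier linear in $N$ on the right-hand side comes from.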

W2. On Gradient Clipping.

R2. To evaluate the impact of gradient clipping, we applied this technique to our baseline model, iCT-VE, on CIFAR-10. In the following table, we report results for both the baseline and our VC method with and without gradient clipping (GC):

| Method                    | 1-Step FID | 2-Step FID |
|---------------------------|------------|------------|
| iCT-VE (w/o GC)           | 3.61       | 2.79       |
| iCT-VE (w/ GC)            | 3.52       | 2.57       |
| (Ours) iCT-VE-VC (w/o GC) | 3.20       | 2.45       |
| (Ours) iCT-VE-VC (w/ GC)  | 2.86       | 2.32       |

From the table it is clear that the main improvement comes from the learned coupling. Interestingly, using GC for the baseline also resulted in a performance improvement, even though GC is generally not mentioned in the CM literature; we originally added it to our method to prevent early training instabilities in the learned noise distribution. However, we agree that applying GC to the baselines ensures a fairer comparison. We will include this discussion and report results for all baselines with GC.
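For reference, gradient clipping at the value 200 mentioned in the review amounts to a single call per optimizer step. A minimal PyTorch sketch, assuming norm-based clipping (the function and variable names here are hypothetical placeholders, not the authors' code):

```python
import torch

def training_step(model, loss_fn, batch, optimizer):
    # Hypothetical training step; only the clipping call reflects the
    # setting described above (clipping value = 200).
    optimizer.zero_grad()
    loss = loss_fn(model, batch)
    loss.backward()
    # Rescale gradients so their global L2 norm is at most 200.
    torch.nn.utils.clip_grad_norm_(model.parameters(), max_norm=200.0)
    optimizer.step()
    return loss.item()
```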

W3. On ImageNet.

R3. We also ran the ImageNet experiments with 100k iterations. For ECM-LI-VC we increased $\beta$ to $\beta = 100$, as $\beta = 90$ diverged during training. Note that for runs with 100k iterations we sometimes encountered divergences for our models with small $\beta$, and occasionally for the baselines as well, while this appears to be resolved when training for 200k iterations. The 1-step/2-step FID results are as follows:

| Method           | 1-Step FID | 2-Step FID |
|------------------|------------|------------|
| ECM-VE           | 5.66       | 3.78       |
| ECM-LI           | 5.63       | 3.48       |
| (Ours) ECM-VE-VC | 5.67       | 3.67       |
| (Ours) ECM-LI-VC | 6.34       | 3.77       |

In the settings with 100k iterations, our method performs similarly to or slightly worse than the baselines. We believe this is because the encoder in our model requires more iterations to learn the coupling, as demonstrated by the improved results at 200k iterations. We agree with the reviewer that including these results in the paper is important. We will add them, along with the corresponding results with OT coupling, to the camera-ready version.

W4. On $\beta$-Selection.

R4. In our experiments, $\beta$ was tuned with a coarse grid search over values spaced 10 apart. For iCT on CIFAR-10, we initially tested $\beta \in \{10, 20, 30, 40\}$, of which $\beta = 30$ gave the best performance; we then also tested $\beta \in \{25, 35\}$, which did not improve performance. Similarly, for ECM we tested $\beta \in \{10, 20, 30, 40\}$, and after achieving the best performance with $\beta = 10$, we tested $\beta \in \{5, 15\}$, which did not improve performance. The tuning was done with the VE kernel, and the best values were also used for the LI kernel. We used the best values of $\beta$ on FashionMNIST and FFHQ as well, without additional tuning. For ImageNet we observed in early runs that $\beta$ needed to be much larger, so we initially tuned over $\beta \in \{30, 60, 90, 120\}$. After achieving the best results with the VE kernel for $\beta = 90$, we further tuned over $\beta \in \{70, 80, 90, 100, 110\}$ for both VE and LI kernels, and found $\beta = 100$ to be best for VE and $\beta = 90$ for LI. We will add a discussion of this process in the camera-ready version, as we agree with the reviewer that guidelines on how to choose $\beta$ are important for practitioners.
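As a minimal illustration of this coarse-to-fine procedure (hypothetical code; `train_and_eval_fid` stands in for a full training run that returns the FID):

```python
def tune_beta(train_and_eval_fid, coarse=(10, 20, 30, 40), step=5):
    # Coarse pass: train once per candidate beta and record FID (lower is better).
    scores = {b: train_and_eval_fid(b) for b in coarse}
    best = min(scores, key=scores.get)
    # Fine pass: probe the immediate neighbors of the coarse optimum.
    for b in (best - step, best + step):
        if b > 0 and b not in scores:
            scores[b] = train_and_eval_fid(b)
    return min(scores, key=scores.get)
```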

Reviewer Comment

Thanks for the authors' response. One of my concerns remains unresolved: regarding the tightness of the upper bound, if we take $N \to \infty$, the following equality should be proved according to the authors' response:

$$E_{x_1 \sim q_\phi(x_1 \mid x_0)}\bigl[\Vert x_0 - f_\theta(x_1, 1) \Vert^2\bigr] = \int_0^1 E\Bigl[\Bigl\Vert \frac{d}{dt} f_\theta(\psi_t, t) \Bigr\Vert^2\Bigr] dt.$$

So I have the following questions about this equality:

  • What is the input of $\psi_t$?
  • Why does this equality hold?
Author Comment

We thank the reviewer for the additional comments. In the derivation, $f_\theta(\psi_t, t)$ refers to the network evaluated at time $t$ with the corresponding input given by $\psi_t$. $\psi_t$ is defined in the paper as the flow function $\psi_t(x_0; x_1)$ conditioned on both $x_0$ and $x_1$. The bound is given by

$$\frac{1}{2\sigma^2}\, E_{q_\phi(x_1 \mid x_0)} \Vert x_0 - f_\theta(x_1, 1) \Vert^2 = \frac{1}{2\sigma^2} \int_0^1 E_{x_1 \mid x_0}\Bigl[\Bigl\Vert \frac{d}{dt} f_\theta(\psi_t(x_0; x_1), t) \Bigr\Vert^2\Bigr] dt.$$

Regarding the equality: in the optimal case, the reconstruction loss in $\frac{d}{dt} f_\theta$ is minimized, meaning that $f_\theta$ is constant along the trajectory and equals the trajectory origin, following the consistency function's definition; this also gives $x_0 = f_\theta(x_1, 1)$.
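Spelled out (our paraphrase of the argument above, not a quote from the authors): at the optimum,

$$\frac{d}{dt} f_\theta(\psi_t(x_0; x_1), t) \equiv 0 \qquad \text{and} \qquad f_\theta(x_1, 1) = x_0,$$

so both sides of the claimed equality vanish, and the bound is tight there.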

Official Review (Rating: 3)

The paper aims to improve the training dynamics of Consistency Training (CT) by replacing the independent joint distribution between the source (data) and target (Gaussian noise) with a learned coupling. This coupling is parameterized as an encoder that maps each data point to a conditional noise distribution. Both the consistency model and the encoder are trained end-to-end using a mixture of consistency loss and a KL-divergence term between the outputs of the encoder and the prior. Experimental results show the benefits of this approach, with the learned coupling achieving improved sample quality, as measured by FID across multiple image datasets.
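To make this training setup concrete, below is a minimal PyTorch sketch of one such end-to-end step, under our own simplifying assumptions (a diagonal-Gaussian encoder, a linear interpolant standing in for the paper's $\psi_t$, and hypothetical function names; this is not the authors' implementation):

```python
import torch
import torch.nn.functional as F

def vc_training_step(encoder, f_theta, f_theta_minus, x0, t, t_next, beta):
    # q_phi(x1 | x0): the encoder outputs a diagonal Gaussian over the noise.
    mu, logvar = encoder(x0)
    # Reparameterized sample of the coupled noise x1 ~ q_phi(x1 | x0).
    x1 = mu + torch.exp(0.5 * logvar) * torch.randn_like(mu)

    # Linear interpolant as a simple stand-in for the paper's flow psi_t(x0; x1).
    psi_t = (1 - t) * x0 + t * x1
    psi_t_next = (1 - t_next) * x0 + t_next * x1

    # Consistency loss between adjacent timesteps; the target network
    # f_theta_minus receives no gradients (stop-gradient / EMA teacher).
    ct_loss = F.mse_loss(f_theta(psi_t_next, t_next),
                         f_theta_minus(psi_t, t).detach())

    # KL(q_phi(. | x0) || N(0, I)) in closed form for diagonal Gaussians,
    # weighted by beta as in the combined objective described above.
    kl = -0.5 * torch.mean(1 + logvar - mu.pow(2) - logvar.exp())
    return ct_loss + beta * kl
```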

Update after rebuttal

I thank the authors for their response. While I still have some reservations about the novelty of the core idea, the paper convincingly demonstrates the effectiveness of the proposed approach for consistency training through extensive experiments. The method also shows promise for integration with other consistency-based techniques. In light of this, I will raise my score.

Questions for the Authors

  1. Is it possible to expand upon the baselines you compare against to also include more recent consistency works, as well as other few-step generative models?

Claims and Evidence

All claims made in the paper are properly explained and well supported, except for Eq. 8, which seems either incomplete or to contain a typo.

According to the triangle inequality, the left-hand side (unsquared) is less than or equal to the summation on the right-hand side without the squared terms. The inequality as written is incorrect and requires either removing the squares or including a constant multiplier $N$ on the right-hand side to fix it.

Methods and Evaluation Criteria

The proposed method makes sense. The idea of changing the independent coupling is a fairly popular one and has been shown time and time again to lead to improvements in standard diffusion- and flow-based generative modelling.

The evaluation criteria are also reasonable, with a good selection of datasets. However, as this is a few-step generative modelling work, the set of baselines compared against is quite small, and including more methods would be better. Furthermore, the reported FID numbers are at times significantly worse than the baselines, which raises concerns about the fairness of the evaluation and how well the baselines were tuned compared to the new method.

Theoretical Claims

Only a single theoretical claim is made, connecting the proposed loss function with a VAE-style ELBO, which is sufficiently explained in Appendix B.

Experimental Design and Analysis

The experiments are sound, if a bit lacking in the number of baselines.

Supplementary Material

I have read through all appendices.

Relation to Prior Literature

As mentioned earlier, the main idea of a learnt coupling between the source and target distributions is quite popular and has been applied in the context of diffusion and flow-based models before [4]. These ideas have also been used in the context of distillation; for example, minibatch OT has been shown to improve consistency models in [1]. However, the reported FID numbers in the experiments are significantly worse than more recent SOTA few-step consistency training models such as sCT [2] and SCT [3], as well as distillation-based approaches.

[1] Li, Yiheng, et al. "Immiscible diffusion: Accelerating diffusion training with noise assignment." arXiv preprint arXiv:2406.12303 (2024).

[2] Lu, Cheng, and Yang Song. "Simplifying, stabilizing and scaling continuous-time consistency models." arXiv preprint arXiv:2410.11081 (2024).

[3] Wang, Fu-Yun, Zhengyang Geng, and Hongsheng Li. "Stable Consistency Tuning: Understanding and Improving Consistency Models." arXiv preprint arXiv:2410.18958 (2024).

[4] Albergo, Michael S., et al. "Stochastic interpolants with data-dependent couplings." arXiv preprint arXiv:2310.03725 (2023).

Missing Essential References

The paper references the majority of prior works needed to understand the full context and all essential references are included.

Other Strengths and Weaknesses

Strengths:

  • The paper is very well written and nicely structured.
  • The task of improving CMs is an important one, and the idea of changing the independent coupling is an interesting one.
  • A good amount of ablations are performed that clarify the impact of each design decision.

Weaknesses:

The paper lacks a bit of novelty, as the main idea has already been shown to work in flow-matching scenarios. Other similar ideas have been shown to transfer over and work well on consistency models, so it isn't incredibly surprising that this idea falls in the same category. The improvements over the baselines are quite marginal, and the final reported FID scores are mostly much worse than those of more recent consistency-based approaches.

Other Comments or Suggestions

No comments.

Author Response

We are grateful to the reviewer for the careful and thoughtful review. Below we address some of the points raised by the reviewer, especially the correctness of Eq. (8), how our results compare with the baselines, and the novelty of the method.

W1. About Eq. (8).

R1. We thank the reviewer for pointing out the mistake in Eq. (8). The correct form of Eq. (8) should be:

$$N \sum_{i=0}^{N} \bigl\Vert f_\theta(\psi_{t_{i+1}}(x_0; x_1), t_{i+1}) - f_{\theta^-}(\psi_{t_i}(x_0; x_1), t_i) \bigr\Vert^2.$$

This follows from the Cauchy–Schwarz inequality. We note that this modification does not affect our discussion. In particular, we demonstrate the connection between our training objective (an upper bound of the negative log-likelihood, as in a VAE) and the continuous-time Consistency Model as $N \rightarrow \infty$ in our response R1. to Reviewer zkoD. We will incorporate this correction and discussion in the camera-ready version.

W2. On Related Baselines and Their Comparisons.

R2. We agree with the reviewer that additional baselines can be beneficial for the paper. For the camera-ready version, we plan to add a more comprehensive table, similar to Table 1 from TCM [3], including results from other relevant consistency model works such as SCT [5], sCT [4], and TCM.

W3. On FID Comparison.

R3. Regarding our FID results, on CIFAR-10 the aforementioned methods achieve 1-step/2-step FID of 2.92/2.02 (SCT), 2.85/2.06 (sCT), and 2.46/2.05 (TCM), which are comparable to ours, especially in the 2-step regime. It is also important to consider that those improvements are orthogonal to ours and could in principle be combined. On ImageNet $64 \times 64$, our results are generally worse than those reported by the aforementioned methods, but it is important to take into account that we used minimal settings in terms of network size and training budget due to computational constraints. Regarding the iCT baseline, there is no official open-source implementation available, and not being able to reproduce the exact results seems to be a common problem found in other papers too (see, for example, [1, 2]).

W4. On Novelty.

R4. We agree that the concept of coupling and its advantages have been explored before in other works. However, in the context of Flow Matching, it has been shown that coupling generally results in straighter trajectories, which yield improved generation with fewer function evaluations; this does not entail that coupling would necessarily improve performance in CMs. While, as pointed out by the reviewer, minibatch OT coupling was already explored in CMs, our learned coupling shows improved scalability compared to minibatch OT with respect to data dimensionality and batch size, as we can see from our experiments on $64 \times 64$ images.

References

[1] Issenhuth, T., Santos, L. D., Franceschi, J.-Y., and Rakotomamonjy, A. Improving consistency models with generator-induced coupling. arXiv preprint arXiv:2406.09570, 2024.

[2] Lee, J., Park, J., Yoon, J., and Lee, J. Stabilizing the training of consistency models with score guidance. In ICML 2024 Workshop on Structured Probabilistic Inference & Generative Modeling, 2024a.

[3] Lee, S., Xu, Y., Geffner, T., Fanti, G., Kreis, K., Vahdat, A., and Nie, W. Truncated consistency models. arXiv preprint arXiv:2410.14895, 2024b.

[4] Lu, C. and Song, Y. Simplifying, stabilizing and scaling continuous-time consistency models. arXiv preprint arXiv:2410.11081, 2024.

[5] Wang, F.-Y., Geng, Z., and Li, H. Stable consistency tuning: Understanding and improving consistency models. arXiv preprint arXiv:2410.18958, 2024.
Official Review (Rating: 3)

This paper proposes a method that combines a VAE with a Consistency Model, using the encoder to predict the noise corresponding to the data. The resulting data-noise coupling is used to train the consistency model. The authors claim that this approach can reduce the variance in consistency model training. Experimental results in the paper validate the effectiveness of this method in improving generation performance.

Update after rebuttal

I thank the authors for their response, and I will maintain my score of Weak Accept. However, I still think that a comparison with CD is necessary; otherwise the significance of this paper is reduced.

Questions for the Authors

In Line 300, it looks like $\beta$ does not affect the weighting. Can the authors explain this?

Claims and Evidence

The paper attempts to demonstrate that the proposed method can reduce the training variance of CT, but the experimental results (Figure 3) do not provide sufficient direct support for this claim, and the improvement is not significant enough.

Methods and Evaluation Criteria

Yes.

Theoretical Claims

This paper makes no new theoretical claims.

Experimental Design and Analysis

It is recommended to add further baselines, such as consistency distillation, to the experiments.

Supplementary Material

I reviewed the experimental part in the supplementary materials.

Relation to Prior Literature

This paper proposes a method to reduce the variance of consistency model training, which may have an impact on the field of generative models and visual generation.

Missing Essential References

The paper discusses related work relatively comprehensively.

Other Strengths and Weaknesses

Strengths

  • The paper is clearly written and easy to understand. The proposed method is simple and straightforward, and looks promising.
  • The robustness experiments with respect to $\beta$ strengthen the case for the method's effectiveness.

Weaknesses

  • In Figure 3, for gradient variance and FID metrics, iCT-VC does not show a clear advantage over iCT, with some intervals performing worse than iCT.
  • It is recommended to include a comparison with CD, as one of the disadvantages of CT compared to CD is training variance, and this method aims to alleviate that variance, making a comparison with CD meaningful.

Other Comments or Suggestions

It is recommended to conduct experiments on continuous-time consistency model training, as its theoretical upper limit is higher.

Author Response

We thank the reviewer for the comments and feedback. We address here some of the points and concerns raised.

W1. Figure 3 Does Not Provide Enough Support.

R1. We believe that the initial disadvantage of our method compared to the baseline is due to the encoder still being in early training, which results in an initial increase in variance. As training proceeds and the encoder's quality improves, the variance decreases accordingly and the FID of our method surpasses that of the baseline. While the improvement may at first seem only marginal, we would like to highlight that for models that already perform relatively well on a given dataset, small FID improvements can correspond to significant image quality enhancements.

W2. On Comparison with CD.

R2. We appreciate the reviewer’s suggestion that including additional baselines could strengthen the paper. Our focus is on training-from-scratch methods that do not assume access to a pre-trained teacher model. Thus, we primarily compare our approach with the CT counterpart. Nevertheless, we recognize the importance of a comprehensive evaluation and will include the FID results in a table for clarity in the camera-ready version.

W3. $\beta$-Weighting.

R3. We thank the reviewer for spotting the mistake. The correct formula for $\lambda_{\mathrm{kl}}$ is $\beta \lambda_{ct}(t_N)$ when using the adaptive loss, and simply $\beta$ when using EDM-style weighting. We will update the paper accordingly in the camera-ready version.

Final Decision

The authors propose a method that combines a VAE with a Consistency Model, using an encoder to predict the noise corresponding to the data. Two reviewers are positive, with one initially negative. After the rebuttal, the negative reviewer appears satisfied by the theoretical justification and the evaluations on ImageNet.