Improving Consistency Models with Generator-Augmented Flows
Abstract
Reviews and Discussion
The paper studies some theoretical aspects of consistency models (in particular, the consistency training regime). As a practical contribution, the authors propose to augment consistency training with so-called generator-augmented flows. The idea is to substitute the basic independent/mini-batch OT coupling used in consistency training with a more sophisticated coupling that (typically) depends on the trained consistency generator itself. The authors provide theoretical reasoning behind their methodology and conduct practical benchmarking.
Questions For Authors
- How do the results of Theorem 1 correspond to those of [Song et al.]? What was lost in the analysis of [Song et al.] such that they "missed" the discrepancy between CT and CD? A more elaborate discussion comparing the theoretical results would be welcome.
- Theorem 1, quadratic-loss case. How could it be the case that the limiting gradients are equal, but the objectives themselves are not, and the difference between the objectives does not vanish?
[Song et al.] Consistency Models.
Some misprints/minors:
- What is the variance of the data, $\sigma_d$? (lines 147-148)
- Regarding the claim noted in lines 116-117: from what does it follow? Some references?
- Lines 254-255, second column: the symbol should have a subscript.
Claims And Evidence
Everything is more or less OK.
Methods And Evaluation Criteria
The practical validation methodology seems to be fine.
Theoretical Claims
I went over the proof of Theorem 1 (not very attentively); it seems to be fine.
I only took a glance at the proof of Theorem 2, but I (seem to) understand the idea behind it. I did not carefully check the other theoretical results.
Some questions/issues/clarifications:
- [C] Proof of Theorem 2, lines 791-793. "Most importantly, the velocity term is not ..., but ...". But, as I understand, shouldn't the two coincide?
Experimental Designs Or Analyses
I think the authors have a good suite of experiments. In particular, the authors make several theoretical claims/insights about their method and try to support these claims/insights practically. At the end of the manuscript, the authors compare their approach with baselines on several benchmark datasets. This is a good experimental design.
Some questions/issues/clarifications:
- [Q] I do not completely understand the experimental setup in Sections 4.2.1 and 4.2.2. What was chosen as the "ideal consistency model" in this experiment? I didn't find the experimental details.
- [C] Figures 2, 3, x-axis. What are the "timesteps"? Is it the overall time T divided by delta t?
- [I] The gap in the reported values in Figure 2 does not seem to be significant. Do we really need to fight for such an improvement?
- [I] Table 1: you report the FID metric for the iCT-IC baseline. In the original work on iCT [Song and Dhariwal, 2024], the authors also provide FID scores (see Table 2 in their paper), with values of ~2.5-3. Why do you report ~2x larger FID values? Also, how many function evaluation steps (1, 2, or more) were used to obtain the FID?
Supplementary Material
I checked some proofs, took a glance at the other parts.
Relation To Existing Literature
To the best of my knowledge, the idea with Generator-Augmented flow and its theoretical investigation are novel (at least, in the context of Consistency Models).
Regarding the survey of Schrödinger-bridge-related papers (lines 209-211, second column), some good references:
- Solving Schrödinger Bridges via Maximum Likelihood (Entropy)
- Diffusion Bridge Mixture Transports, Schrödinger Bridge Problems and Generative Modeling (JMLR)
- Adversarial Schrödinger Bridge Matching (NeurIPS'24)
Essential References Not Discussed
No
Other Strengths And Weaknesses
Consistency models are known to be not very stable, especially when the step size approaches zero and in the continuous-time limit. I expect that the use of the generator flow could lead to even worse stability. Do the authors experience such problems in practice?
Other Comments Or Suggestions
To be honest, I do not understand why the discrepancy between the CT and CD losses matters. OK, the losses $\mathcal{L}_{CT}$ and $\mathcal{L}_{CD}$ are a bit different (e.g., in the continuous-time limit), but why is it an issue that needs to be addressed, as outlined in the abstract and contributions? As I understand, the aim of CT is to learn a generative model, so why does the difference between $\mathcal{L}_{CT}$ and $\mathcal{L}_{CD}$ cause poorer performance of CT? I expect that the main problem (it is also mentioned in the text) is the variance of the estimator. Please elaborate more on this point.
We thank the reviewer for their positive and constructive review. We address the concerns below.
Theoretical Claims / 1. [C] Lines 791-793: as I understand, shouldn't the two velocity terms coincide?
However, a pair of training GC points starts from the same initial point, which determines the velocity term accordingly.
Experimental Designs Or Analyses / 1. [Q] I do not completely understand the experimental setup in Sections 4.2.1 and 4.2.2. What was chosen as the "ideal consistency model"?
The ideal consistency model in all experiments of Sections 4 and 5 is a consistency model trained in a standard manner on independent coupling (iCT-IC), as described in Section 5.1. We apologize for the confusion and will clarify this point in Section 4.
Experimental Designs Or Analyses / 2. [C] Figures 2, 3, x-axis. What are the "timesteps"?
It is the diffusion noise's standard deviation.
Experimental Designs Or Analyses / 3. [I] The gap in the reported values in Figure 2 does not seem to be significant. Do we really need to fight for such an improvement?
Note that at large timesteps, there is an order of magnitude of difference between IC and GC. Moreover, the relationship between improvements in these values and improvements in the final sampling quality is unknown. We hypothesize that even a small improvement could lead to a significant change in sampling quality, as our analysis suggests.
Experimental Designs Or Analyses / 4. [I] Table 1: In the original work on iCT [Song and Dhariwal, 2024], the authors report FID values ~2x smaller than yours. Why such a difference? Also, how many function evaluation steps were used to obtain the FID?
We report FID with 1 NFE. The difference from iCT stems from the training budget: iCT uses 400k iterations with batch size 1024, while we use 100k iterations with batch size 512. For fairness, all models were trained and evaluated in the same setting. Note that our ECT experiments follow the original setting from the authors, and that we present new results (see the rebuttal to Reviewer PJfJ).
Other Strengths And Weaknesses. Does the generator flow worsen the stability issues of consistency models?
We observed instabilities in two settings: with mixed precision in the iCT setting, and with large pre-trained models in the short-training setting of ECT. In both cases, instabilities arose for both IC and GC.
Other Comments Or Suggestions. I do not understand why the discrepancy between CT and CD losses matters. [...] Why is it an issue that needs to be addressed, as outlined in the abstract and contributions? [...] I expect that the main problem (it is also mentioned in the text) is the variance of the estimator. [emphasis ours]
CD provably approximates the target diffusion flow, ensuring convergence to the data distribution. We show that CT is not equivalent to CD due to a Jacobian regularization term (Theorem 1), which prevents CT from learning the true diffusion flow. This challenges widely accepted knowledge and suggests that the generator learned by CT may not converge to the data distribution. We hypothesize that addressing this discrepancy improves performance, as confirmed by our experiments.
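For reference, a schematic of the two objectives (weighting functions omitted; notation follows Song et al., 2023, with $x_{t_{n+1}} = x + t_{n+1} z$, $\theta^-$ the stop-gradient/EMA parameters, and $\hat{x}^{\phi}_{t_n}$ one ODE-solver step of the teacher $\phi$ from $x_{t_{n+1}}$):

$$\mathcal{L}_{\mathrm{CD}} = \mathbb{E}\big[ d\big(f_\theta(x_{t_{n+1}}, t_{n+1}),\, f_{\theta^-}(\hat{x}^{\phi}_{t_n}, t_n)\big) \big], \qquad \mathcal{L}_{\mathrm{CT}} = \mathbb{E}\big[ d\big(f_\theta(x + t_{n+1} z,\, t_{n+1}),\, f_{\theta^-}(x + t_n z,\, t_n)\big) \big].$$

CT replaces the teacher's ODE step with the single-sample estimate $x + t_n z$; the mismatch between this estimate and the true conditional-expectation velocity is what produces the Jacobian regularization term of Theorem 1.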
Questions / 1. How do the results of Theorem 1 correspond to those of [Song et al.]? What was lost in the analysis of [Song et al.] such that they "missed" the discrepancy between CT and CD? [emphasis ours]
The difference comes from two issues:
- in their Theorem 2 ($\mathcal{L}_{CT} = \mathcal{L}_{CD} + o(\Delta t)$), the $o(\Delta t)$ term is actually too large compared to the other term, and consequently the result is uninformative;
- their Theorem 6 (limiting gradient equality) is stated with a general distance function, but the requirements on its Hessian restrict the theorem's validity to a quadratic loss.
We will clarify those differences in the next version of the paper.
Questions / 2. Theorem 1, quadratic-loss case. How could it be the case that the limiting gradients are equal, but the objectives themselves are not, and the difference between the objectives does not vanish?
The quadratic case is particular: the objectives are not equal, and the gradients of the limit loss are not equal. However, the limiting gradients are equal because of the interplay between the "stop-gradient trick" (whose effect disappears only in continuous time and not in the limit) and the quadratic loss (cf. proof of Theorem 1, L668). This equality is in some sense a coincidence.
Minor / 1. What is $\sigma_d$?
As in EDM (Karras et al., 2022) and later works, $\sigma_d = 0.5$ in our experiments.
Minor / 2. Regarding the claim noted in lines 116-117 [...]: from what does it follow? Some references?
This can be seen in Liu et al. (2023), Definition D.1.
Update
My criticism is minor, and I think the authors have adequately addressed it. As I have already given a positive review, I decided not to change my evaluation.
Old Summary
A consistency model is a type of generative model that sends points along the sampling trajectory of a diffusion model to the last point (i.e., a noiseless data item) on that trajectory. There are two well-known algorithms for training consistency models. The first is consistency distillation (CD), which involves distilling a teacher diffusion model. The second is consistency training (CT), which does not require a teacher. Models trained with CT generally underperform those trained with CD.
The paper points out a theoretical difference between CD and CT, which may help explain the discrepancy in observed performance. In particular, it shows that the loss functions for CD and CT are different even in the limit of the time step size approaching zero. Moreover, when the distance function used in the losses is not the L2 distance, the gradients of the losses are different. This implies that CD and CT would likely lead to different converged models.
The paper then proposes a way to improve CT by altering the points at which the model under training is evaluated during training. In CT, we would sample a data point, a noise sample, and a time step, and then try to minimize the distance between the model's outputs at two adjacent points on the corresponding noising trajectory. Assuming access to an idealized consistency model, the paper proposes to first compute the final (generated) point with that model and then to minimize the distance between outputs evaluated on the trajectory built from this generated point instead. This modification is called "generator-augmented coupling" (GC).
The paper mathematically shows that GC has two benefits. The first is that GC reduces a "proxy term" that measures the discrepancy between the loss functions of CT and CD. The second is that it reduces the expected L2 distance between a Gaussian noise sample and the point it is mapped to.
However, GC as just stated is not practical because, if we already had the idealized consistency model, we would not be interested in training a new consistency model in the first place. The paper offers two remedies.
The first remedy is to train a consistency model with CT and use it as an approximation of the ideal model in the next round of CT, now with GC. The paper finds that this results in faster convergence in the second round. Still, this approach is not practical, as it makes the training process twice as long.
The second remedy is to modify CT so that, with probability $\mu$, CT-with-GC is used instead of the normal CT loss computation, with the model under training itself used as the approximate ideal model when doing GC. The paper found that, with a suitable choice of $\mu$, this yielded models that achieved better scores (FID, KID, and IS) on the CIFAR-10, ImageNet 32x32, CelebA 64x64, and LSUN Church 64x64 datasets.
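To make the procedure concrete, here is a rough sketch of one training step with the mixed IC/GC coupling as I understand it (PyTorch-style pseudocode of my own; the function names, the noise schedule, and the exact time at which the generator is applied are hypothetical placeholders, not the authors' implementation):

```python
import torch

def ct_step_with_gc(f_theta, f_ema, x0, t_cur, t_next, mu=0.3):
    """One consistency-training step with generator-augmented coupling (GC).

    With probability mu, the data endpoint is replaced by a sample generated by
    the model itself; otherwise the standard independent coupling (IC) is used.
    All names here are hypothetical placeholders.
    """
    z = torch.randn_like(x0)  # independent Gaussian noise
    if torch.rand(()).item() < mu:
        with torch.no_grad():
            # GC: push a noised data point through the current model to get a
            # generated sample, then reuse the *same* noise z for the pair below.
            x0 = f_theta(x0 + t_next * z, t_next)
    # Two adjacent points on the same (EDM-style) noising trajectory.
    x_next, x_cur = x0 + t_next * z, x0 + t_cur * z
    # Consistency loss with a stop-gradient (EMA) target on the less-noisy point.
    return ((f_theta(x_next, t_next) - f_ema(x_cur, t_cur).detach()) ** 2).mean()
```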
Questions For Authors
N/A
Claims And Evidence
Experiment results were easy to understand. Improved FID scores on 4 datasets over vanilla CT and OT coupling methods clearly show that GC has merit.
Methods And Evaluation Criteria
The benchmark datasets and the metrics used to evaluate trained models seem reasonable. However, using datasets with higher resolution images would make the experimental results stronger. Baselines to compare GC against (i.e., iCT-IC and iCT-OT) are reasonable as well.
Theoretical Claims
I think the theoretical results are clearly stated. I also briefly checked the proof, and I found no glaring issues.
However, I think there might be an issue with the use of the proxy term. The paper claims that it is "a proxy term for" the quantity capturing the discrepancy between the CT loss and the CD loss, but it does not provide any mathematical justification for this. Honestly, I failed to find a clear connection between the two by myself. What I noticed is that the proxy term measures the variance of the target velocity values in the conditional flow matching loss [1], and lower variance is better for training a flow matching model. However, we are training consistency models with a different loss here, and it is unclear how the target variance of the conditional flow matching loss would relate to the consistency training loss. I encourage the authors to make this connection clearer.
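To spell out what I mean (a sketch under the assumption of a linear interpolation path $x_t = (1-t)\,x_0 + t\,x_1$ with independent coupling, following the conditional flow matching framework of [1]):

$$\mathcal{L}_{\mathrm{CFM}}(\theta) = \mathbb{E}_{t,\,x_0,\,x_1}\big\|\, v_\theta(x_t, t) - (x_1 - x_0) \,\big\|^2,$$

whose minimizer is the conditional expectation $\mathbb{E}[x_1 - x_0 \mid x_t]$, so the irreducible part of this loss is the conditional variance $\mathbb{E}_{t,\,x_t}\big[\operatorname{Var}(x_1 - x_0 \mid x_t)\big]$. Couplings that make $x_1 - x_0$ less ambiguous given $x_t$ reduce this variance and help flow matching; my question is how this quantity enters the consistency training loss.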
Citation
- [1] Lipman et al. "Flow Matching for Generative Modeling." 2022.
Experimental Designs Or Analyses
I think the experiments are sound.
Supplementary Material
I read through the proofs and found no major issues. I skimmed the sections on implementation details and found no issues either.
Relation To Existing Literature
The paper provides new insights into why CT underperforms CD and proposes a new way to improve the CT training algorithm. The authors rightly situate their approach among methods that use more sophisticated couplings between the data distribution and the noise distribution.
Essential References Not Discussed
I believe the references are adequate.
Other Strengths And Weaknesses
N/A
Other Comments Or Suggestions
- "A most remarkable" -> "The most remarkable"
- "consists in" -> "involves"
- Dash in Latex is "---" instead of "--". Moreover, there should be no spaces before and after it. Hence, "velocity field -- thereby changing the target flow -- to reduce" should have been "velocity field---thereby changing the target flow---to reduce"
- "Two Diracs" -> "Two Dirac delta functions" or "Two impulses"
- "" -> ""
- " the variance of the data" -> "\sigma_d is the standard deviation of the data"
- "leaving the (asymptotic) quadratic loss" -> "except the version of the loss where the L2 distance function is used."
- "While not fully solving the alignment issue either" -> remove "either"
- "At -> "For "
- "we observe fast convergence but early divergence" -> "the FID score decreases faster than other configurations early in the training process, but it soons increases as training progresses further."
We would like to thank the reviewer for their detailed and positive assessment. We address the raised questions/weaknesses below.
The paper claims that it is "a proxy term for" the discrepancy between the CT and CD losses, [...] but does not provide any mathematical justification for this. [...] I encourage the authors to make this connection clearer. [emphasis ours]
We can make this connection clearer in the case of the quadratic loss. Indeed, we can bound the regularization term by the proxy term thanks to the Jacobian's maximum singular value, which is bounded since typical networks are Lipschitz:
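Schematically, writing $J$ for the Jacobian of the network with respect to its input, $\sigma_{\max}(J) \le L$ for its largest singular value, and $v$ for the stochastic velocity estimate whose conditional variance the proxy term measures, the bound is of the form:

$$\mathbb{E}\,\big\|\, J\,(v - \mathbb{E}[v \mid x_t]) \,\big\|^2 \;\le\; L^2\, \mathbb{E}\,\big\|\, v - \mathbb{E}[v \mid x_t] \,\big\|^2.$$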
Moreover, under additional assumptions, for example that the model is close to a scaling function for large timesteps (see Corollary 2), an analogous relation between the two terms holds.
We will include this discussion in the next version of our paper.
Typos. We thank the reviewer for all the suggestions and the careful reading. We will modify the paper accordingly.
The paper analyzes and discusses a discrepancy between consistency distillation and consistency training in consistency models, and proposes a novel consistency training procedure to ameliorate the problem by leveraging the solution of the probability flow ODE learned by the model during training.
Update after rebuttal
I confirm that the approach seems promising and well motivated. However, the fact that the authors use different training settings makes it hard to compare the experimental results with the ones from related literature. Therefore I will keep my score as is.
Questions For Authors
-
- How would you explain that using minibatch optimal transport on LSUN slightly outperforms the generator coupling? I would expect the minibatch OT coupling to work worse given the small batch size.
Claims And Evidence
Yes, the claims are supported by the empirical evaluation in Sections 4 and 5; see below.
Methods And Evaluation Criteria
The experiments are suited to evaluate the proposed method. In particular, the datasets are standard benchmarks for consistency models, and the baselines iCT and ECT are the two most common baselines for consistency training. However, the experimental results for iCT are not convincing, as they are quite far from the results reported in the original paper, and the model has been tested only in small settings for ECT. See 'Other strengths and weaknesses'.
Theoretical Claims
I checked the correctness of the claims in section 4 and the proofs in appendix A.
Experimental Designs Or Analyses
The authors provide an extensive evaluation of the method compared to baselines using independent coupling and minibatch OT, showing the effective advantage of using the proposed method. The analyses in Figures 2 and 3 confirm the claims made in Section 4.
Supplementary Material
I reviewed and appreciated the extended details provided in Sections A, B, C and D. The hyperparameters for iCT differ from the ones reported in the iCT paper, which could explain the difference in FID.
Relation To Existing Literature
Consistency models are a promising recent class of generative models for one-step generation. The proposed method exploits an improved training procedure, which is relevant both for advancing the understanding of these models and for practical applications.
Essential References Not Discussed
I would suggest adding "Minimizing Trajectory Curvature of ODE-based Generative Models" from Sangyun Lee, Beomsu Kim, and Jong Chul Ye, to the discussion of coupling methods in section 3.
Other Strengths And Weaknesses
As a weakness, the reported empirical results differ significantly from the original iCT baseline, which makes it hard to assess the validity of the proposed method. One problem is surely due to the lack of an open-source official implementation of iCT, but the differences are a bit too large. However, the authors do include some experiments on ECT, whose original implementation is open-sourced, and there the results are convincing. I would suggest that the authors improve their iCT baseline and include more experiments based on ECT, as, when using a pretrained network, the experiments are relatively fast and can be done with a small batch size.
A clear strength of the proposed method is the fact that it does not add any extra model parameters and maintains the same sampling speed as equivalent baselines with independent coupling, for only a minor overhead in training time due to the extra forward pass. The method can potentially scale to much higher data dimensions and smaller batch sizes compared to minibatch optimal transport.
Other Comments Or Suggestions
I wonder if the authors have tried annealing the parameter $\mu$ from 0 to 1 during training. Intuitively, I would expect that at early training iterations, the model would benefit from using mostly the independent coupling, as the learned solution is probably still quite off, and then, as training progresses and the sample quality improves, the model can rely more and more on the generator.
We would like to thank the reviewer for their thorough and constructive review. We address the raised questions/weaknesses below.
1. How would you explain that using minibatch optimal transport on LSUN slightly outperforms the generator coupling? [emphasis ours]
One of our contributions is to explain why using batch-OT in consistency models is a strong baseline via our theoretical analysis (cf. Sections 2.3 and 3). This further highlights the overall better performance of GC as suggested in Section 4.2.2 and confirmed in Table 1. We do not have a clear reason why LSUN is an exception. Note however that GC and OT are nearly in the same confidence interval of the FID on this dataset, while GC is significantly superior to batch-OT on CIFAR-10 and CelebA.
2. As a weakness, the reported empirical results differ significantly from the original baseline iCT. One problem is surely due to the lack of an open source official implementation of iCT. I would suggest the authors to improve their iCT baseline, and to include more experiments based on ECT, as the experiments are relatively fast and can be done with small batch size. [emphasis ours]
The difference in FID values stems from the following differences: the original iCT baseline uses 400k iterations with a batch size of 1024, while we use 100k iterations with a batch size of 512. Still, the comparison of all models in our paper is fair: they are trained and evaluated in the same setting, yielding sound conclusions. Moreover, in the compute-efficient ECT setting, we are still superior to independent coupling. As recommended, we provide new results in the ECT setting on a larger-resolution ImageNet (64x64) and on FFHQ (a new dataset), which we will include in Section 5.3:
ECT Results (FID)
| Model | FFHQ-64 (Short) | FFHQ-64 (Long) | ImageNet-64 (Short) | ImageNet-64 (Long) |
|---|---|---|---|---|
| ECT-IC | 13.29 ± 0.11 | 9.68 ± 0.08 | 10.82 ± 0.18 | 5.84 ± 0.21 |
| ECT-GC (μ=0.3) | 11.73 ± 0.09 | 8.51 ± 0.12 | 10.31 ± 0.22 | 6.39 ± 0.20 |
Overall, GC is superior to IC on all datasets for iCT (4/4), and in 5/6 settings for ECT. Note that in the ECT setting, we could run the experiments with the same training time as in the original paper, and thus report similar FID values. This validates the benefits of using GC in practice.
3. I wonder if the authors have tried annealing the parameter $\mu$ from 0 to 1 during training. Intuitively, I would expect that at early training iterations, the model would benefit from using mostly the independent coupling, as the learned solution is probably still quite off, and then as training progresses and the sample quality improves, the model can rely more and more on the generator. [emphasis ours]
Indeed, this is an interesting question. We did experiment with this type of annealing, which resulted in lower performance than with a fixed $\mu$. One explanation is that it does not interact well with the schedule on the number of timesteps, which is periodically increased during training: the generator is not optimal on newly introduced timesteps.
4. I would suggest adding "Minimizing Trajectory Curvature of ODE-based Generative Models" from Sangyun Lee, Beomsu Kim, and Jong Chul Ye, to the discussion of coupling methods in section 3.
We thank the reviewer for this interesting reference, which we will include in our discussion. In this work, the authors propose to learn an encoder from data to noise and use this encoder to construct a coupling when training a flow model. In comparison to our method, their algorithm is based on joint training of the velocity field and the noise encoder, which requires an additional network.
This paper examines consistency models, a technique for achieving single-step (or few-step) sampling in diffusion-based generative modeling and proposes to modify the data-noise coupling used during training so as to reduce the discrepancy between consistency training and consistency distillation. Specifically, the work introduces generator-augmented flows, which leverage an auxiliary generator (i.e., a consistency model proxy) to form “better” couplings between data points and noise. The authors show that this approach can reduce the variance in the velocity-field estimation (i.e., bridging the gap between training and distillation) and also decrease the transport cost from noise to data.
The paper also clarifies that consistency training and consistency distillation do not coincide, even in the continuous-time limit, because the Monte Carlo estimate of the velocity field used in consistency training differs from the true velocity in distillation. The authors propose generator-augmented coupling (GC), in which an already-trained or concurrently-trained consistency model generates better endpoints for the intermediate noised data. This approach yields reduced discrepancy and lower transport cost compared to the standard “independent coupling” of data and noise.
Questions For Authors
Please refer to the "weaknesses" section.
Claims And Evidence
The paper’s theoretical proofs and experiments are well aligned.
Methods And Evaluation Criteria
The authors adopt mainstream metrics (FID, KID, IS) for generative image quality, consistent with typical evaluation practice in diffusion and consistency modeling. The theoretical framework is couched in well-established techniques (Wasserstein distances, “transport cost” definitions, analyzing velocity fields in PF-ODE form).
As for the datasets and benchmarks, the chosen datasets (CIFAR-10, CelebA 64x64, ImageNet 32x32, LSUN Church 64x64) are standard, ensuring results are comparable to prior consistency and diffusion papers.
Theoretical Claims
The derivations about the discrepancy between training and distillation (Theorem 1) are well grounded. The arguments for reduced transport cost are fairly straightforward but well explained (they rely on the fact that the transport cost can be shown to decrease under certain conditions).
Experimental Designs Or Analyses
The authors train on well-known tasks with standard settings for consistency or diffusion-based generative modeling (e.g., iCT from Song & Dhariwal 2024). They measure the relevant metrics (FID, KID, IS) using recognized library code. The authors also compare their results with iCT-IC and iCT-OT, and examine short vs. long training schedules.
Supplementary Material
Yes, I reviewed the entire supplement.
Relation To Existing Literature
The paper provides a refined theoretical view on the difference between consistency training and distillation, clarifying a claim about them “coinciding” in the continuous-time limit. This clarifies or partially corrects earlier suggestions by Song et al. that the discrepancy would vanish entirely. It builds on a line of research that tries to enhance data-noise coupling in generative modeling (OT-based or reflow-based methods). Its “generator-augmented flow” is a distinctive addition, conceptually simpler than full OT yet apparently effective. The authors have cited all the major literature, including consistency modeling (Song et al. 2023; Song & Dhariwal 2024) and prior works on alternative couplings (Pooladian et al., Dou et al.).
Essential References Not Discussed
I don't think there are major references missing.
Other Strengths And Weaknesses
Strengths:
The authors systematically analyze the limiting behavior of consistency losses, which is rarely spelled out in prior literature. Generator-augmented flows are easy to implement and require no heavy OT solvers. More importantly, on standard datasets, the approach yields consistent improvements, beyond batch-OT.
Weaknesses:
- The results primarily cover moderate-scale (32x32 or 64x64) tasks. Additional experiments at higher resolutions could strengthen the claims of broad applicability. Can the authors reply to this, or add experiments along these lines?
- The concept partly assumes an "ideal" or well-trained generator for the GC approach. The paper addresses this with "joint learning," but theoretically that is more complicated, and some of the formal results rely on the generator's accuracy. What do you think about this?
- Could the approach generalize easily to other cost functions in the coupling (e.g., robust divergences)? Does the preference for the quadratic cost (in the "transport cost" sense) strictly matter for performance in your experiments?
Other Comments Or Suggestions
Please refer to the "weaknesses" section.
We would like to thank the reviewer for their detailed and positive assessment. We address the raised questions/weaknesses below.
1. The experiments cover moderate scales. Could the authors add an experiment at a higher resolution? [emphasis ours]
We cannot conduct experiments at larger resolutions from scratch for computational reasons. Moreover, the only pre-trained diffusion models available in the compute-efficient ECT setting are on ImageNet at a resolution that is also too large given our resources.
Still, we were able to obtain new results on a larger-resolution ImageNet (64x64, conditional) and on a new dataset (FFHQ 64x64) in the ECT setting, which we will include in Section 5.3:
ECT Results (FID) on ImageNet-64 and FFHQ-64
| Model | FFHQ-64 (Short) | FFHQ-64 (Long) | ImageNet-64 (Short) | ImageNet-64 (Long) |
|---|---|---|---|---|
| ECT-IC | 13.29 ± 0.11 | 9.68 ± 0.08 | 10.82 ± 0.18 | 5.84 ± 0.21 |
| ECT-GC (μ=0.3) | 11.73 ± 0.09 | 8.51 ± 0.12 | 10.31 ± 0.22 | 6.39 ± 0.20 |
2. The concept partly assumes an “ideal” or well-trained generator for the GC approach. The paper addresses this with “joint learning,” but theoretically that is more complicated, and some of the formal results rely on the generator’s accuracy. What do you think about this?
Indeed, our theoretical assumption requires an ideal generator to construct GC. Our joint learning approach allows us to approximate this ideal generator and, when the generator is close enough to an ideal one, joint learning succeeds. Interestingly, we show that when the generator is not trained enough on IC, GC does not work anymore (see Section C.1 and Figure 7 in the Appendix). This demonstrates the effectiveness of the joint learning approach: the approximation is good enough to construct GC trajectories.
3. Could the approach generalize easily to other cost functions in the coupling, e.g., robust divergences? Does the preference for the quadratic cost (in the "transport cost" sense) strictly matter for performance in your experiments? [emphasis ours]
GC is agnostic to the choice of distance function, whether the one used in consistency model training (Eq. 4/6/15) or the transport cost used in the analysis in Section 4.2.2. Regarding the transport cost, the algorithm is not built to minimize this distance directly, as opposed to batch-OT. We did not study this further, but we intuitively expect GC to also reduce the transport cost with respect to other cost functions.
We would be happy to provide further clarification, should we have misinterpreted the reviewer's question.
This submission provides an in-depth theoretical analysis of consistency training (CT) and distillation (CD) objectives and how they differ. It additionally proposes a new "generator-augmented" training for CT that comes with lower transport cost between data and noise and reduced discrepancy between CD and CT. While there were some minor concerns on the evaluation setup, the paper overall provides substantial insights into consistency models, which are of high interest to the community. Therefore, I strongly support accepting the submission.