6.5 / 10

Poster · 4 reviewers

Lowest: 4 · Highest: 10 · Std. dev.: 2.2

Confidence: 4.3

Correctness: 2.8

Contribution: 2.5

Presentation: 3.0

NeurIPS 2024

Improving the Training of Rectified Flows

Sangyun Lee,Zinan Lin,Giulia Fanti

OpenReview PDF

Submitted: 2024-05-14 · Updated: 2024-11-06

TL;DR

We propose improved techniques for training rectified flows, allowing them to compete with knowledge distillation methods even in the low NFE setting

Abstract

Diffusion models have shown great promise for image and video generation, but sampling from state-of-the-art models requires expensive numerical integration of a generative ODE. One approach for tackling this problem is rectified flows, which iteratively learn smooth ODE paths that are less susceptible to truncation error. However, rectified flows still require a relatively large number of function evaluations (NFEs). In this work, we propose improved techniques for training rectified flows, allowing them to compete with knowledge distillation methods even in the low NFE setting. Our main insight is that under realistic settings, a single iteration of the Reflow algorithm for training rectified flows is sufficient to learn nearly straight trajectories; hence, the current practice of using multiple Reflow iterations is unnecessary. We thus propose techniques to improve one-round training of rectified flows, including a U-shaped timestep distribution and LPIPS-Huber premetric. With these techniques, we improve the FID of the previous 2-rectified flow by up to 75% in the 1 NFE setting on CIFAR-10. On ImageNet 64$\times$64, our improved rectified flow outperforms the state-of-the-art distillation methods such as consistency distillation and progressive distillation in both one-step and two-step settings and rivals the performance of improved consistency training (iCT) in FID. Code is available at https://github.com/sangyun884/rfpp.

Keywords

generative modeling, rectified flow, diffusion model

Reviews and Discussion

Review

Rating: 10 · Confidence: 4 · 2024-06-19

By retraining with rectified flows, the authors straighten the ODE, allowing sampling with a small number of steps. They propose rectified flow as a replacement for complex distillation methods such as consistency models. They reflow only once to straighten the paths, which makes the method as efficient as distillation. They also propose a different distribution over t and an LPIPS loss. They obtain better performance than the original rectified flow and results competitive with distillation.

The authors provide geometric intuition for why one reflow should be enough. They provide well-detailed ablations of each aspect of their improvements on multiple datasets.

They combine Pseudo-Huber (a smooth variant of the Huber loss that is less sensitive to outliers while remaining differentiable) with LPIPS (a better distance metric than L2 for image quality). They take a weighted combination of both losses. If one were not generating images, one could envision using Pseudo-Huber with L2, so the method could still be useful in that case. The focus here is on images, though.
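
As a rough illustration of the Pseudo-Huber part of the loss described above, the sketch below implements the common form $\sqrt{\|r\|_2^2 + c^2} - c$ on a plain list of residuals. The function name `pseudo_huber` and the constant `c` are illustrative assumptions, and the LPIPS term is omitted because it requires a pretrained feature network; this is not the authors' implementation.

```python
import math

def pseudo_huber(residual, c=0.03):
    """Pseudo-Huber penalty sqrt(||r||^2 + c^2) - c for a residual vector.

    `c` is an assumed smoothing constant: the penalty is roughly quadratic
    when ||r|| << c and roughly linear (like an L2 norm) when ||r|| >> c.
    """
    sq_norm = sum(r * r for r in residual)
    return math.sqrt(sq_norm + c * c) - c

# Small residual: close to ||r||^2 / (2c), i.e. quadratic behaviour.
small = pseudo_huber([1e-4, 0.0])
# Large residual (an "outlier"): close to ||r|| - c, i.e. linear behaviour.
large = pseudo_huber([3.0, 4.0])
print(small, large)
```

Because the penalty grows only linearly for large residuals, a few badly mismatched samples contribute less to the gradient than they would under a squared L2 loss.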

Edit: I apologize, I added some extra text to the summary that was not related. I just removed it.

Strengths

The experiments are on multiple datasets, focusing on the 1-2 NFE case, which is arguably the most important problem to solve in diffusion and flow models. The results are quite impressive. Reflow seems more efficient than distillation, and it is an especially clean and natural solution. The experiments and discussion are quite thorough.

Weaknesses

Honestly, this is a near-perfect paper in my view. The only thing I can think of is that the focus is only on images, so the method may not fully generalize across domains.

Questions

The only thing I would like added is the difference in memory between the first flow and the reflow that is due solely to the use of LPIPS. Basically, I would like to know whether LPIPS massively increases the memory cost over L2; if so, it would be important to mention it as a limitation. The authors already mentioned the 15% speed reduction (which is quite minor); it would be great to know about memory.

Limitations

The authors have addressed the limitations well. I only ask that they also discuss the memory cost of LPIPS.

Author Response

2024-08-06

Thank you for the review. On ImageNet 64x64, the additional memory overhead of using LPIPS is less than 5%. We expect this would be even smaller in larger-scale settings, since the generative model would be much larger than a feature extractor such as AlexNet or VGG.

Comment: response

2024-08-08

The fact that memory is almost the same with LPIPS is very good. I leave my score as it is.

Review

Rating: 6 · Confidence: 5 · 2024-06-24

The paper targets efficient training of a class of flow models called Rectified Flow (RF), trained with the flow matching objective. The paper has two broad contributions: (1) a justification that 2-RF (‘reflow’-ed once) is close to optimal, and (2) the use of those findings to improve the training of 2-RF.

The authors argue that, in practical scenarios, the pairs generated by the optimal 1-RF model are ‘crossing-free’, i.e., the stochastic interpolation paths rarely cross each other. They provide intuitive and empirical evidence for this. They claim this motivates two improvements to 2-RF training: a new timestep distribution $p_t$ and a new distance measure for the regression task of the diffusion objective.

Good empirical performance is shown in terms of FID on several datasets and models. An ablation study is also done for the relevant part of the finding.

Strengths

The problem statement chosen and the arguments provided are quite credible. It is well known that, for RF to work well, one requires several reflows. Reflows are expensive, hence the contributions of the paper (if credible) can be significant.
A good intuitive analysis is done by the authors to justify their argument of 2-RF being near-optimal.

Weaknesses

I like the overall outcome of the paper in terms of empirical performance. I also like the problem statement. However, some concerns remain. The following points explain the issues and point to further questions in the Questions section.

Major concerns

This is a major concern for me. The paper has two parts: (1) justifying that 2-RF is near optimal & providing intuitive and empirical evidence; (2) proposing two new measures for training improvement. I think, (1) & (2) both are overall correct on their own. I just don’t think that (1) is the right motivator for (2).
Continuing the above point, I felt that the rationale behind both ‘improvements’ is weakly connected to the empirical observations (in Section 3). After all, what the authors ultimately proposed is ad hoc and exists in the literature. (L162) “.. focusing on the tasks where training loss is high ..” is a known technique, which the authors admitted themselves. Using a non-L2 loss is also not unheard of [1]. More importantly, I don’t think (not) using the L2 loss has anything to do with your observation in Section 3 (as argued in L195). The L2 loss has its origin in score matching, which yields theoretical benefits, but I suspect it is not necessary. Q1 is related to this.
Let’s talk about the observation of section 3 itself (i.e. 2-RF being optimal). Authors must be clear about whether they are making a theoretical assertion (L64-65) or just an empirical observation (they used words like “under realistic setting”, “rarely intersect each other”). If you are providing a guarantee, you must provide better formal proofs. To clarify, I think the argument provided in section 3 is indeed correct and seems reasonable. But if it is a guarantee, empirical evidence is not enough. Q2 is a related question.

[1] “Improving Diffusion Models's Data-Corruption Resistance using Scheduled Pseudo-Huber Loss”, Khrapov et al., arXiv 2403.16728.

Presentations

The notation used in Sections 2 and 3 is sometimes confusing. There are three pairs of notation: $(\mathbf{x}, \mathbf{z})$, $(\mathbf{x}_0, \mathbf{z}_0)$ and $(\mathbf{x}_1, \mathbf{z}_1)$. I am confused about which is which. I would recommend that the authors follow the same notation as the original RF paper by Liu et al.
Notations like $\mathbf{z}_0 \sim p_{\mathbf{x}}$ (L86) are very confusing.
The factor $\frac{t}{1-t}$ in Fig 2(a) (at the top left) should be $\frac{1-t}{t}$ , right ?
Fig. 2 caption says $\mathbf{z}'' = \mathbf{z} + (\mathbf{x}' - \mathbf{x}'')$ — where is the factor $\frac{1-t}{t}$ ? Did you assume $t = 0.5$ ?
Again, it is hard to parse notations like $\mathbf{x}, \mathbf{x}', \mathbf{x}''$ or $\mathbf{z}, \mathbf{z}', \mathbf{z}''$ .
Eq.4: Can you please denote the suffix of the $\mathbb{E}_{??}\left[ \cdot \right]$ properly ? It is hard to read otherwise.

Results/Experiments

Result section is okay-ish. The following are some comments/questions.

Table.1: Need clarification: ‘Base (A)’ is a 2-RF, and the other ones are written as “(A) + <something>” — does that mean they are 3-RF (meaning 2 reflows) ?
Section 5.2 seems totally unnecessary. That has nothing to do with the core contributions of the paper.
Section 5.3 is also very much unnecessary. I even doubt its correctness. You seem to be proposing a new sampler/solver with a very weak (intuitive) motivation. Designing a solver requires a lot more than that. And then “.. detailed analysis is provided in appendix E”, yet appendix E barely has any details! Also, an obvious question: why are the FIDs going up (Fig. 4) with higher NFE? Does that even make sense?
Fig.5(b): The inverted noise norm distribution still looks quite different (higher variance) from the true noise. Just having the norm closer to the truth isn’t necessarily making a good case for your method.

Questions

Authors said “any other premetric .. deviates from the posterior expectation” — but is it really true ? Can you mathematically show why ?
Is it possible to have trajectory crossings at all when samples are from $p_{xz}^2$ ? If no, can you prove it formally ?
L99: “.. use a specific non-linear interpolation” — what is that exactly ? I thought the interpolation is still linear.
Eq.4: How did you decompose the loss — can you show the steps ? And what is $\bar{\mathcal{L}}$ ?
L65: “training loss of 2-RF has zero lower bound” — please clarify: Isn’t it true that any L2 loss has zero lower bound ? How does it matter whether trajectories cross or not ? Even the FM loss with independent coupling $p_{\mathbf{xz}}^1$ has zero lower bound — no ?
I think it is unclear which quantity the authors are arguing to be zero when trajectories do not cross. The term ‘curvature’ was used some times (L124) — what does that mean ? Can you write this object in mathematical terms ? Just curious, what is the equation that needs to be proved if one wants to show that no trajectories from 1-RF coupling cross each other (ideal case) ? (Related to Q2 above).

Limitations

Some limitations are mentioned, which are reasonable.

Comment: References

2024-08-06

[1] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

[2] Elucidating the Design Space of Diffusion-Based Generative Models

[3] Minimizing Trajectory Curvature of ODE-based Generative Models

[4] Scaling rectified flow transformers for high-resolution image synthesis

[5] Consistency Models

Author Response

2024-08-06

We appreciate the reviewer's valuable comments.

Why does the training loss of rectified flows (and FM with independent coupling) have a nonzero lower bound?

When two interpolation trajectories cross, given the intersection, there are two possible directions to go. As the neural net we use is deterministic, if there are two possible regression targets, the loss cannot be zero.
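
A toy numerical check of this argument, under assumed one-dimensional values that are not from the paper: if two interpolation paths pass through the same point at the same time but head in different directions, the best a deterministic predictor can do is output the mean target, and the squared loss stays strictly positive.

```python
# Two (data, noise) pairs whose linear interpolations x_t = (1 - t) * x + t * z
# meet at t = 0.5. Values are hypothetical and chosen only for illustration.
pairs = [(-1.0, 1.0), (1.0, -1.0)]  # (x, z)
t = 0.5

points = [(1 - t) * x + t * z for x, z in pairs]
targets = [z - x for x, z in pairs]  # velocity target of each path

# Both paths occupy the same location at time t ...
assert points[0] == points[1]

def mse(prediction):
    """Mean squared error of one shared prediction against both targets."""
    return sum((prediction - v) ** 2 for v in targets) / len(targets)

# ... so a deterministic model must emit a single value there. The minimiser
# of the squared loss is the mean of the targets (the conditional expectation).
best_prediction = sum(targets) / len(targets)
best_loss = mse(best_prediction)
print(best_prediction, best_loss)
```

Any other shared prediction gives a larger error, so the positive value of `best_loss` is a floor that no amount of training can remove for this pair of paths.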

How did you decompose the loss? What is $\bar {\mathcal L}$ ?

As we described in the general response above, the L2 squared loss is lower-bounded by $\frac{1}{t^2} \mathbb E [|| \mathbf x - \mathbb E[\mathbf x | \mathbf x_t]||_2^2]$ . $\bar {\mathcal L}$ is simply defined as the difference between $\mathcal L$ and its theoretical lower bound (i.e., $\bar {\mathcal L}:= \mathcal L - \frac{1}{t^2} \mathbb E [|| \mathbf x - \mathbb E[\mathbf x | \mathbf x_t]||_2^2]$ ), which is the only term we can reduce. The decomposition in Eq. 4 holds by definition.
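
The lower bound follows from the standard decomposition of a squared-error regression loss. The sketch below writes it for a generic predictor $f$ of $\mathbf x$ given $\mathbf x_t$; the symbol $f$ is introduced here for illustration, and the $\frac{1}{t^2}$ weighting is taken from the response above.

```latex
\begin{aligned}
\mathbb E\big[\|\mathbf x - f(\mathbf x_t)\|_2^2\big]
&= \mathbb E\big[\|\mathbf x - \mathbb E[\mathbf x \mid \mathbf x_t]\|_2^2\big]
 + \mathbb E\big[\|\mathbb E[\mathbf x \mid \mathbf x_t] - f(\mathbf x_t)\|_2^2\big] \\
&\ge \mathbb E\big[\|\mathbf x - \mathbb E[\mathbf x \mid \mathbf x_t]\|_2^2\big].
\end{aligned}
```

The cross term vanishes because $\mathbf x - \mathbb E[\mathbf x \mid \mathbf x_t]$ has zero conditional mean given $\mathbf x_t$. Multiplying both sides by $\frac{1}{t^2}$ gives the bound quoted above. The first term on the right does not depend on $f$; assuming $\mathcal L$ is the $\frac{1}{t^2}$-weighted squared error of such a predictor, the second term corresponds to $\bar{\mathcal L}$.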

“any other premetric .. deviates from the posterior expectation” — Why?

See general response.

Proposed techniques are ad-hoc, exist in the literature. “Focusing on the tasks where training loss is high” is a known technique.

“Putting emphasis on tasks with high training loss” does not always help; knowing how to apply it requires care.

E.g., one cannot naively apply this technique to 1-rectified flow training. Figure 1 in the attached PDF shows that the training loss of 1-rectified flow is also U-shaped, like the 2-rectified flow loss. However, the current state-of-the-art 1-rectified flow uses a logit-normal distribution [4], which has the opposite shape.

On the other hand, for 2-rectified flow, we can design $p_t$ based on the loss shape as Sec. 3 tells us that the lower bound of the loss is nearly zero.

Sec. 3: If you are providing a guarantee, you must provide better formal proofs.

We are not making a formal statement—we do not expect our intuition to hold for every dataset/model. We found empirically that for many natural image datasets, our intuition appears to hold, allowing us to design significant training improvements.

Making section 3 formal poses several challenges. For example, we reason that if we add the difference of two synthetic data points to a noise, it is not a common noise realization that is used in the training. To make this formal, we would need to add restrictive assumptions to the data distribution, model, and training algorithm, as we are dealing with synthetic data. We also assumed that the model would not generate high-quality data if the input is very different from the noise it has seen during training, which is true in practical cases but also requires additional assumptions. As such, we ultimately decided to leave the intuition at an empirical level; we felt that adding a stylized theoretical result would not significantly strengthen the main results.

Q3: “use a specific non-linear interpolation”: isn’t the interpolation linear?

The variance-preserving diffusion model is a special case of rectified flows with the interpolation defined as $\mathbf x_t = \alpha(t) \mathbf x + \sqrt{1-\alpha(t)^2} \mathbf z$, where $\alpha(t) = \exp(-\frac{1}{2} \int_0^t (19.9s + 0.1)ds)$ (see Liu et al. [1]).
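
For concreteness, the integral in $\alpha(t)$ has the closed form $\alpha(t) = \exp(-4.975\,t^2 - 0.05\,t)$. The sketch below evaluates it and pairs it with the noise coefficient $\sqrt{1-\alpha(t)^2}$, so that the squared coefficients sum to one, which is what variance preservation requires. The function names are illustrative.

```python
import math

def vp_alpha(t):
    """alpha(t) = exp(-0.5 * integral_0^t (19.9 s + 0.1) ds), in closed form."""
    integral = 19.9 * t * t / 2.0 + 0.1 * t
    return math.exp(-0.5 * integral)

def vp_coefficients(t):
    """Return (data coefficient, noise coefficient) of the VP interpolation."""
    a = vp_alpha(t)
    return a, math.sqrt(1.0 - a * a)

for t in (0.0, 0.25, 0.5, 1.0):
    a, s = vp_coefficients(t)
    # The path is curved: the coefficients are not (1 - t, t).
    print(f"t={t:.2f}  alpha={a:.4f}  sigma={s:.4f}  linear=({1 - t:.2f}, {t:.2f})")
```

Comparing the printed coefficients with the linear pair $(1-t, t)$ shows why this interpolation is called non-linear even though it still mixes data and noise.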

Q6:

What does “curvature” mean ?

When there is no intersection, the resulting (optimal) rectified flow's ODE trajectories are completely straight, or in other words, have zero curvature (Theorem 3.6 in Liu et al. [1]). Curvature [2,3] represents the degree to which the trajectory deviates from a straight path; one widely used definition is $\int_0^1 \|\mathbf{z}_1-\mathbf{z}_0-\frac{\partial}{\partial t} \mathbf{z}_t\|_2^2 dt$.
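
A minimal numerical sketch of this curvature measure, assuming a trajectory given as equally spaced samples and finite-difference velocities. The helper name `curvature` and the two example paths are hypothetical.

```python
def curvature(path):
    """Approximate int_0^1 ||z_1 - z_0 - dz_t/dt||^2 dt for a sampled path.

    `path` is a list of points (each a list of floats) at equally spaced
    times in [0, 1]; velocities are forward finite differences.
    """
    n = len(path) - 1          # number of time intervals
    dt = 1.0 / n
    chord = [b - a for a, b in zip(path[0], path[-1])]  # z_1 - z_0
    total = 0.0
    for k in range(n):
        velocity = [(q - p) / dt for p, q in zip(path[k], path[k + 1])]
        total += sum((c - v) ** 2 for c, v in zip(chord, velocity)) * dt
    return total

steps = 100
straight = [[k / steps, 2 * k / steps] for k in range(steps + 1)]
bent = [[k / steps, (k / steps) ** 2] for k in range(steps + 1)]

print(curvature(straight))  # numerically ~0 for a straight, constant-speed path
print(curvature(bent))      # strictly positive for the parabolic path
```

For the parabolic path the exact value of the integral is $\int_0^1 (1 - 2t)^2 dt = 1/3$, which the discretisation approaches as `steps` grows.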

What equation needs to be proved to show that no trajectories from 1-RF coupling cross each other?

To show that there is no intersection, one needs to show that $\mathbb E[\mathbf x | \mathbf x_t = (1-t)\mathbf x' + t\mathbf z'] = \mathbf x'$ for all pairs of $\mathbf x'$ and $\mathbf z'$ and for all $t$ .

Table.1:“(A) + something” — does that mean they are 3-RF (meaning 2 reflows) ?

All of our models are 2-rectified flows. “(A) + something” means we add the technique something to the base model configuration (A). We will clarify.

Sections 5.2 and 5.3 seem unnecessary.

We aim to draw attention to the potential of rectified flows. The computational cost to generate synthetic pairs is sometimes cited [5] as a downside, and we provide Section 5.2 as a counterargument. We introduced the new sampler in Section 5.3 to show that 2-rectified flow++ can be further improved by designing a better sampler, which is another advantage over other methods.

I doubt correctness of 5.3. Designing a solver require a lot more.

We did not mean to imply that our proposed update rule is itself an ODE solver, and will clarify. We use terms like “sampler” or “update rule” that are applied to existing solvers like Euler and Heun, instead of the term “solver” to refer to the new sampling algorithm. We do not even believe it converges to the true ODE solution in the limit as its behavior is very different from Euler solver’s in a large NFE setting.

Our claim is that the new update rule achieves better FID with a smaller NFE (Fig. 4).

Why are FIDs going up (fig.4) with higher NFE ?

The new sampler's FID goes up after some NFEs because it is not a proper solver, but its performance at low NFEs is better than the original sampler's best FID with many more NFEs (Figure 4).

Fig.5(b): The inverted noise norm distribution still looks quite different from the true noise... Not necessarily a good case for your method.

The inverted noise norm distribution is not the same as the true noise distribution but closer to it than the baseline. Also, the norm distribution is meant to complement the visual comparison in Figure 5(c), where we see that the inverted noises of the baseline exhibit a strong spatial correlation.

Fig. 2 caption: where is the factor $\frac{1-t}{t}$ ? Did you assume $t=0.5$ ?

Yes.

Comment: Response to author's rebuttal

2024-08-10

I thank the authors for providing a response to my queries.

Overall, I would say the rebuttal is, for the most part, quite well written and convincing. The majority of my technical doubts were clarified, and the authors did so quite well. However, I would raise the following points:

One of my core concerns was the link between the observations and the proposed solution. The "using pre-metric" part was well explained in the rebuttal and I think I now vaguely understand what they meant. The other one, the "timestep distribution" part, I still think is weakly connected.
The other concern was about how "formal" the idea is. The authors admitted that it is more intuitive than formal, which isn't quite ideal. However, I do recognize that not everything can be easily formalized. But then you run into the issue of the observation being not true in every possible scenario. I guess that's a fair trade-off.
I would still advise against the new "update rule". I think the paper is better off without Section 5.3. Showing that a new update rule is better only in one region of the NFE space does NOT make it correct. Prefer correctness over "content" in the paper. Do more study on that and consider submitting it separately in the future. That's my opinion.

Anyway, I think overall the authors made their case well. I am increasing the score. I hope these discussions will reflect in the final version of the paper in some way.

Thank you.

Comment: Thank you for the comments!

2024-08-13

Thank you very much for your thoughtful review and engagement with us during the discussion phase. We really appreciate your time and suggestions and will take them into account when revising our paper.

We understand your concern about Section 5.3 and will remove it from the main body of our paper and conduct a more thorough investigation separately.

Regarding the timestep distribution, the motivation is that we shouldn’t focus naively on high training loss tasks, but rather consider how far the loss is from its optimum (i.e., is there room for improvement or is the loss already very well optimized?). We will attempt to explain more clearly how this connects to Section 3. Say you have loss_1 and loss_2 whose values are 10 and 1, respectively. Assume loss_1 and loss_2 are lower-bounded by 9.99 and 0. Our claim is that we need to focus on loss_2 because 10-9.99 < 1-0. But from the observations in Section 3, we know that both losses are lower-bounded by 0, so we can decide which one to focus on simply by looking at their values. That is what Section 4.1 argues.
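
The arithmetic in this example can be written out directly. The snippet uses the numbers from the comment above; the variable names and the `headroom` helper are illustrative.

```python
def headroom(loss, lower_bound):
    """How much a loss can still decrease: its value minus its lower bound."""
    return loss - lower_bound

# Case 1: the lower bounds differ, so raw loss values are misleading.
loss_1, bound_1 = 10.0, 9.99
loss_2, bound_2 = 1.0, 0.0
assert headroom(loss_1, bound_1) < headroom(loss_2, bound_2)  # focus on loss_2

# Case 2: both lower bounds are (nearly) zero, as argued for 2-rectified flow.
# Headroom then equals the loss value, so ranking by raw loss is justified.
assert headroom(loss_1, 0.0) > headroom(loss_2, 0.0)          # focus on loss_1
```

The two cases rank the losses in opposite order, which is why knowing the lower bound matters before weighting timesteps by their training loss.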

Review

Rating: 4 · Confidence: 4 · 2024-07-02

This paper mainly improves the training of rectified flows empirically, making them comparable to distillation methods in performance at a small number of steps.

Strengths

This paper provides a comprehensive analysis of a single Reflow iteration, and there is a clear motivation for the improvements.
The authors mainly improve the training of the whole network in three respects: timestep sampling, loss function, and initialization from the diffusion model.
The improved rectified flow method is comparable to state-of-the-art distillation methods, and it also supports operations such as inversion, which may be important in applications such as image editing or translation.

Weaknesses

The main weakness of this paper is that all the improvements are incremental and empirical. In addition, many of the improved techniques overlap with existing diffusion improvements, such as Pseudo-Huber loss [1] and Initialization with pre-trained diffusion models [2].
The comparison in Section 5.2 may be somewhat unfair. Although Reflow requires a forward pass for each training iteration, it also needs to run the ODE in advance to generate noise-sample pairs, whereas CD does not need to simulate. The forward passes used to generate the noise-sample pairs should therefore also be counted.

[1] Yang Song and Prafulla Dhariwal. Improved techniques for training consistency models.

[2] Liu X, Zhang X, Ma J, et al. Instaflow: One step is enough for high-quality diffusion-based text-to-image generation.

Questions

See Weaknesses.

Limitations

The authors briefly discuss their limitations.

Author Response

2024-08-06

We appreciate the reviewer's valuable comments.

The main weakness of this paper is that all the improvements are incremental and empirical.

We respectfully disagree with the reviewer's comment. Our proposed techniques improve the FID of 2-rectified flow from 12.21 to 3.38 on CIFAR-10, 12.39 to 4.11 on AFHQ64, 8.84 to 5.21 on FFHQ64 compared to baseline, which is difficult to call incremental in any sense. Regarding our improvements being empirical, it is true that the paper is empirical at its core, motivated by the insights we provide in Sections 3 and 4. However, we disagree that this is a reason to reject. Indeed, many highly impactful papers in the area of generative modeling are empirical [3,4,5,6]. While we agree that having rigorous theoretical backing is more beneficial than not, we don’t believe it is always necessary or possible for all practical scenarios, especially those in deep learning.

In addition, many of the improved techniques overlap with existing diffusion improvements, such as Pseudo-Huber loss and Initialization with pre-trained diffusion models

Pseudo-Huber loss

The fact that Pseudo-Huber loss works well for consistency models does not necessarily mean that it should work well for rectified flows because of their fundamental differences in formulation. Consistency models' loss functions can be generalized to any metric function while rectified flow requires l2 squared loss (see the general response above). Otherwise, the obtained solution is not a posterior expectation and thus violates the marginal-preserving property of Liu et al. [1] (Theorem 3.3). Without our finding in Section 3, this generalization is not justified. Indeed, the 1-rectified flow’s FID gets worse if we use other losses, as we demonstrated in the supplementary results in the general response.

Pre-trained diffusion models

It is true that the Instaflow paper used the pre-trained diffusion models for 1) paired data generation and 2) initialization, but they did not provide any theoretical justification for why we can do so. This raises the question of whether doing so yields the same results as standard 2-rectified flow training since diffusion models are trained with a curved, nonlinear interpolation (see Figure 5 in Liu et al. [1]). An illustrative example is when we use the pre-trained variance-exploding EDMs. In this case, it doesn’t make sense to directly use them as they are trained in $t \in [0, +\infty)$ while rectified flows operate on the $[0, 1]$ range, so we can intuitively see that some sort of conversion is needed.

In fact, it turns out that one needs to convert the scale and time following Proposition 1 in our paper to make them compatible; the proof is provided in Appendix D. Proposition 1 makes sure that 1) the coupling generated from a pre-trained diffusion model is the same as 1-rectified flow coupling required by Reflow algorithm (Algorithm 1), and 2) 2-rectified flow is properly initialized by 1-rectified flow model as suggested by Liu et al. [1]. Table 1 in the attached pdf file shows that naively initializing with EDM without the proposed conversion leads to slower convergence.

This result is quite surprising, as it suggests that 1-rectified flow (with linear interpolation) and diffusion models (with nonlinear interpolation) not only belong to the same model class but are actually equivalent (i.e., they are interchangeable after training by simple time and scale conversion). We believe that this is not widely known in the literature, as significant community effort has been invested in developing new training techniques for 1-rectified flows/flow matching models, with the hope that the linear interpolation “connects data and noise on a straight line” and thus “has better theoretical properties” [2].

The comparison in Section 5.2 may be somewhat unfair. The number of forward passes should also be considered to generate noise-sample pairs.

It already does so. The caption of Table 5 says, "Reflow uses 395M forward passes for generating pairs and 1,433.6M for training." These sum to 1,828.6M.

[1] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

[2] Scaling rectified flow transformers for high-resolution image synthesis

[3] Diffusion Models Beat GANs on Image Synthesis

[4] Unsupervised Representation Learning with Deep Convolutional Generative Adversarial Networks

[5] Analyzing and Improving the Image Quality of StyleGAN

[6] Improved Techniques for Training GANs

2024-08-14

We are sorry to bother you, but we wanted to check if you had further suggestions as the discussion period is ending today.

Review

Rating: 6 · Confidence: 4 · 2024-07-11

The paper introduces one-stage training of rectified flows, avoiding the costly multi-iteration training of the earlier model. In particular, the authors propose a U-shaped timestep distribution for sampling and a modified LPIPS-Huber loss. The method demonstrates superior FID scores in the 1-NFE setting on CIFAR-10 and ImageNet $64\times 64$, outperforming other distillation methods such as consistency distillation.

Strengths

Provides an analysis of the Reflow algorithm and shows that one-stage reflow is sufficient for nearly straight solution trajectories.
Improves the training of rectified flow by introducing a U-shaped time distribution and a new loss function, LPIPS-Huber.

Weaknesses

Have the authors tested with more than two NFEs, e.g. 100, to compare directly with 2-rectified flow? Does the model still show consistent improvement?
What is the advantage of the U-shaped time distribution over the lognormal distribution? It seems to rely on already having straight solution paths, as in 2-rectified flow, so it cannot be applied directly to 1-rectified flow. The intuition for this part is not clearly presented.
Have the authors tested LPIPS-Huber-1/t (config F) for 1-rectified flow?
The new sampler seems to have an adverse effect when many NFEs (>6) are used. Could the authors explain this behaviour?

Overall, the novelty seems limited, as the method mainly involves engineering work in using a U-shaped distribution and the LPIPS-Huber loss (a combination of the Huber loss and the LPIPS loss) to improve model performance in few-NFE settings. Besides, although the performance shows a notable gain over RectifiedFlow and ConsistencyModels, it still falls behind ImprovedConsistencyModels and CTM. However, I appreciate the findings in the paper, and the proposed techniques can be useful for the community.

Questions

Limitations

This is included in the conclusion.

Author Response

2024-08-06

We appreciate the reviewer's valuable comments.

Have the authors tested with more than two-NFE like 100 to directly compare with 2-rectified flow? Does model still get consistent improvement?

We tested up to 16 NFEs and observed consistent improvement (see Figure 4). Compared to 2-rectified flow, 2-rectified flow++ with 2 NFEs outperforms 2-rectified flow with 110 NFEs by a large margin (2.76 vs 3.36, Table 2).

What is the advantage of U-shape time distribution over lognormal distribution? The intuition of this part does not clearly present.

Lognormal emphasizes the middle, while the U-shaped distribution emphasizes both ends of the interval. The intuition is that when training 1-rectified flow (or a diffusion model), there is little to learn at both ends because when t=1, a model just learns the dataset average, and when t=0, it just predicts the noise average (i.e., 0). In contrast, in 2-rectified flow, a model should learn to generate a very sharp sample at t=1, and a very accurate noise at t=0. Therefore, the most difficult parts are both ends.
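
To make the contrast concrete, the sketch below draws timesteps from one possible U-shaped density on $[0, 1]$, $p(t) \propto \cosh(a\,(t - 0.5))$, using inverse-CDF sampling. This particular density, the parameter `a`, and the function names are assumptions chosen for illustration; the paper's exact $p_t$ may differ.

```python
import math
import random

def u_shaped_from_uniform(u, a=4.0):
    """Map u in [0, 1] to t in [0, 1] with density proportional to cosh(a (t - 0.5)).

    The CDF is F(t) = (sinh(a (t - 0.5)) + sinh(a / 2)) / (2 sinh(a / 2)),
    which inverts in closed form via asinh.
    """
    return 0.5 + math.asinh((2.0 * u - 1.0) * math.sinh(a / 2.0)) / a

def sample_timesteps(n, a=4.0, seed=0):
    """Draw n timesteps from the assumed U-shaped density."""
    rng = random.Random(seed)
    return [u_shaped_from_uniform(rng.random(), a) for _ in range(n)]

ts = sample_timesteps(10_000)
near_ends = sum(t < 0.1 or t > 0.9 for t in ts) / len(ts)
near_middle = sum(0.4 <= t <= 0.6 for t in ts) / len(ts)
print(near_ends, near_middle)  # more mass near the ends than in the middle
```

Both windows have total width 0.2, so the comparison isolates the shape of the density: a distribution peaked in the middle would reverse the inequality.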

Have the authors tested LPIPS-Huber-1/t (config F) for 1-rectified flow?

We emphasize that our scope is to improve the performance of the 2-rectified flow. 1-rectified flow training is not intersection-free, so using loss functions other than the squared L2 violates the theory. Nonetheless, we tested LPIPS-Huber-1/t loss for 1-rectified flow and provide the result in the general response. We see that the result is pretty bad. These results confirm our unique insights in 2-rectified flow presented in Section 3.

New sampler seems to have counter effect when many-NFE (>6) is used. Could the authors explain this behaviour?

The goal here is to obtain as good FID as possible using as few NFE as possible. The new sampler's FID goes up after some NFE but its performance with a low NFE is better than the original sampler's best FID with many more NFEs (see Figure 4). We introduced the new sampler as a prototypical example to show that our 2-rectified flow++ can be further improved by designing a better sampler, which is another advantage over other distillation methods. It is certainly possible to design a better sampler that does not have this increasing FID behavior, and we leave this to future work.

2-rectified flow++ falls behind CTM and iCT.

Table 2 shows that our method outperforms CTM without GAN loss on CIFAR-10 by a large margin. 2-rectified flow++ can also be trained with GAN loss and may obtain a similar gain. Combining two different generative models is an interesting but orthogonal direction to our paper’s scope.

It is true that our 2-rectified flow++ falls behind iCT (FIDs 4.32 vs 4.01 on ImageNet64), but we want to emphasize that 2-rectified flow++ does not require real data during training while iCT does. These days, many foundation models such as SDXL [2] only disclose model weights but not training data. iCT cannot be trained in those cases.

[1] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

[2] SDXL: Improving Latent Diffusion Models for High-Resolution Image Synthesis

Comment: Gap between real data and synthetic data

2024-08-08

Thank you to the authors for the detailed answers. Another concern of mine is the mismatch between real data distribution and synthetic data distribution. Since the method is solely trained on synthetic data, generated by the pretrained model (similar to RectifiedFlow), does the model suffer from this issue? Additionally, the performance of the method still lags behind iCT. I believe that incorporating real data could address this problem and lead to a better FID score. If possible, I would appreciate it if the authors could conduct experiments on a small dataset like CIFAR-10 to test this hypothesis.

Comment: Thank you for the suggestions!

2024-08-13

We appreciate the reviewer’s great suggestion, thank you! We ran an experiment on CIFAR-10, and this improved FID on CIFAR-10 from 3.38 to 3.07 (a 0.31 improvement).

Specifically, we integrated the generative ODE of 1-rectified flow (EDM init) backward from t=0 to t=1 using NFE=128 to collect 50000 pairs of (real_data, synthetic_noise) on CIFAR-10. For a quick validation, we took the pre-trained 2-rectified flow++ model and fine-tuned it using the (real_data, synthetic_noise) pairs for 5000 iterations with a learning rate of 1e-5. In this fine-tuning setting, we tried using (synthetic_data, real_noise) pair with a probability of p, but we found that not incorporating (synthetic_data, real_noise) pairs at all (i.e., p=0) performs the best.

Given that only 5000 iterations of fine-tuning improved FID by a noticeable margin, and considering that the FID gap between iCT and 2-rectified flow++ on ImageNet is only 0.29, we believe incorporating real data from the beginning of training will substantially narrow, or even close, this performance gap on ImageNet. We are currently running this experiment, but it will not be done by the end of the discussion phase.

That being said, your idea introduces many open questions to be explored, e.g.:

How do we generate (real_data, synthetic_noise)? The number of pairs cannot exceed the number of training data samples if we naively generate synthetic noise from real data. For example, could we add a small amount of noise to the data, solve the ODE, and construct pairs from the resulting noises? That way, we could obtain multiple synthetic noises for each data sample.
How do we mix the (real_data, synthetic_noise) and (synthetic_data, real_noise) pairs?
How do we avoid storing synthetic_noises which are hard to compress losslessly?

Since this direction doesn't relate to our intuition from Section 3, and since there are many ways to implement your idea, we feel that thoroughly fleshing out this technique belongs in its own paper. But we intend to add a subsection in the evaluation about this technique to highlight it as a possible avenue for further improvements over our proposed techniques.

We thank the reviewer again for their valuable comments, and also for engaging with us during the discussion phase. We appreciate your time and input!

2024-08-14

We are sorry to bother you, but we wanted to check whether you had further suggestions, as the discussion period is ending today. Thank you again for your valuable feedback; we genuinely found it very useful.

2024-08-14

I appreciate the authors’ effort in conducting this experiment! It is interesting to see that combining real and synthetic data can further reduce the gap between them and thus improve performance. As the authors mentioned, this leaves plenty of room for exploration, and I look forward to seeing it in your future work. I am satisfied with the rebuttal, and there are no major concerns left in the other reviews, so I am raising my score to reflect this. In addition, here are some good resources for reference:

[1] Fan, Lijie, et al. "Scaling laws of synthetic images for model training... for now." In CVPR. 2024.

[2] Singh, Krishnakant, et al. "Is Synthetic Data all We Need? Benchmarking the Robustness of Models Trained with Synthetic Images." CVPR 2024 Workshop. 2024.

Author response

2024-08-06

General response

We thank the reviewers for their valuable comments. Here, we provide additional background to help clarify some of the points raised in the reviews. We have also fixed some typos and clarified notation. The supplementary PDF file is attached.

Question: Clarify notation

In our paper, we use $\mathbf x, \mathbf z$ as the random variables (data and noise, respectively), $\mathbf x_t$ as a function of them, and $\mathbf x', \mathbf x'', \mathbf z', \mathbf z''$ as the specific values of them. $\mathbf z_t$ is a dynamical system governed by the rectified flow ODE. $\mathbf z_0 \sim p_{\mathbf x}$ means that the initial value of $\mathbf z_t$ at t=0 is sampled from $p_{\mathbf x}$.

We will make the notations clear in the revised version.

Question: What does your intuitive argument in Section 3 have to do with the proposed loss functions? Alternative losses are widely used in many generative models.

We would like to summarize the logic below.

Minimizer of L2 squared loss

For a random variable $\mathbf x$ , $\arg \min_{\hat {\boldsymbol \mu}} \mathbb E_{\mathbf x}[||\mathbf x - \hat {\boldsymbol \mu}||^2]$ = $\mathbb E_{\mathbf x}[\mathbf x]$ , as $\nabla_{\hat {\boldsymbol \mu}} \mathbb E_{\mathbf x}[||\mathbf x - \hat {\boldsymbol \mu}||^2] = 2(\hat {\boldsymbol \mu} - \mathbb E_{\mathbf x}[\mathbf x]) = 0$ at $\hat {\boldsymbol \mu} = \mathbb E_{\mathbf x}[\mathbf x]$ . This minimum mean squared error (MMSE) estimator minimizes the l2 squared loss, but it is generally not a minimizer of other loss functions. For example, the minimizer of the l1 loss is the median. See e.g., [2], p. 176.

Lower bound. The l2 squared loss is lower bounded by $\mathbb E[||\mathbf x - \mathbb E_{\mathbf x}[\mathbf x]||^2]$, which is strictly positive unless $\mathbf x$ is deterministic, so the loss cannot reach zero for a genuinely random target.
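The two facts above can be checked numerically. The following sketch is only an illustration under assumed settings (an exponential sample and a grid search over constant estimates, neither of which comes from the paper): the l2 squared loss is minimized near the sample mean, the l1 loss near the sample median, and the smallest attainable l2 squared loss is the variance rather than zero.

```python
import numpy as np

rng = np.random.default_rng(0)
# Skewed sample so that the mean and the median clearly differ.
x = rng.exponential(scale=1.0, size=5001)

# Grid of candidate constant estimates mu_hat (spacing 0.005).
candidates = np.linspace(0.0, 3.0, 601)
diff = x[None, :] - candidates[:, None]

l2_loss = np.mean(diff ** 2, axis=1)     # E[(x - mu_hat)^2] per candidate
l1_loss = np.mean(np.abs(diff), axis=1)  # E[|x - mu_hat|] per candidate

mu_l2 = candidates[np.argmin(l2_loss)]   # close to the sample mean
mu_l1 = candidates[np.argmin(l1_loss)]   # close to the sample median

print("l2 minimizer:", mu_l2, "sample mean:", x.mean())
print("l1 minimizer:", mu_l1, "sample median:", np.median(x))
print("min l2 loss:", l2_loss.min(), "variance:", x.var())
```

The gap between the two minimizers is the point of the argument: a loss other than l2 squared generally recovers something other than the expectation when the regression target is random.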

Posterior expectation and marginal-preserving property

The goal of the rectified flow training is to obtain the vector field $\mathbb E[\mathbf z - \mathbf x |\mathbf x_t = \mathbf z_t]$ assuming the linear interpolation is used, because this vector field is shown to generate the same marginal distribution as the interpolation (Liu et al. [1], Theorem 3.3). This is called the “marginal preserving property”, and it is central to the efficacy of rectified flows. This is why Liu et al. [1] used the L2 squared loss for training. If we use a different loss function, the obtained solution is not a posterior expectation and the marginal-preserving property (Theorem 3.3) is inapplicable.

Our finding

Since we found that the intersection is generally not a problem in 2-rectified flow training (Section 3), our regression target is no longer a random variable. Now $\mathbb E[\mathbf x | \mathbf x_t = (1-t)\mathbf x' + t \mathbf z'] = \mathbf x'$ , and the loss can actually become zero:

$\mathbb E_{\mathbf x, \mathbf z, t}\left[\frac{1}{t^2} || \mathbf x - \mathbf x_\theta((1-t)\mathbf x + t\mathbf z, t)||_2^2\right] \geq \mathbb E_{\mathbf x, \mathbf z, t}\left[\frac{1}{t^2} || \mathbf x - \mathbb E[\mathbf x | \mathbf x_t]||_2^2\right] = \mathbb E_{\mathbf x, \mathbf z, t}\left[\frac{1}{t^2} || \mathbf x - \mathbf x||_2^2\right] = 0$

, where the optimum is achieved at $\mathbf x_\theta((1-t)\mathbf x' + t \mathbf z', t) = \mathbf x'$ . Since this is also the unique optimum of any premetric $m$ such that $m(\mathbf{a}, \mathbf{b})=0 \iff \mathbf{a}=\mathbf{b}$ , we can generalize the rectified flow loss to any premetric.
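A toy 1D experiment can make this contrast concrete. The sketch below is an illustration under stated assumptions, not the paper's setup: data is $N(0, s^2)$, the time is fixed at t=0.5 (so the $\frac{1}{t^2}$ weight is dropped), the intersection-free coupling is idealized as the deterministic map z = x / s, the predictor is a least-squares line through the origin, and the Pseudo-Huber constant c is arbitrary.

```python
import numpy as np

def pseudo_huber(a, b, c=0.03):
    # Pseudo-Huber premetric: equals zero if and only if a == b.
    return np.sqrt((a - b) ** 2 + c ** 2) - c

def best_linear_prediction(x_t, target):
    # Least-squares slope through the origin. For zero-mean Gaussians
    # the posterior mean E[x | x_t] is linear in x_t.
    slope = np.dot(x_t, target) / np.dot(x_t, x_t)
    return slope * x_t

rng = np.random.default_rng(0)
n, s, t = 10000, 2.0, 0.5
x = rng.normal(0.0, s, size=n)

# Independent coupling: noise is drawn independently of the data.
z_indep = rng.normal(0.0, 1.0, size=n)
pred_indep = best_linear_prediction((1 - t) * x + t * z_indep, x)

# Deterministic coupling: each data point has exactly one paired noise.
z_det = x / s
pred_det = best_linear_prediction((1 - t) * x + t * z_det, x)

l2_indep = np.mean((x - pred_indep) ** 2)
l2_det = np.mean((x - pred_det) ** 2)
ph_indep = np.mean(pseudo_huber(x, pred_indep))
ph_det = np.mean(pseudo_huber(x, pred_det))

print("independent coupling:   l2 =", l2_indep, " pseudo-huber =", ph_indep)
print("deterministic coupling: l2 =", l2_det, " pseudo-huber =", ph_det)
```

Under the independent coupling both losses stay bounded away from zero, while under the deterministic coupling both vanish at the same predictor, which is the property that lets the loss be swapped for another premetric.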

What if we use different losses for 1-rectified flow training?

In response to the reviewers’ questions, we provide additional experimental results comparing the FID of 1-rectified flow models trained with different losses.

Method	FID
L2 squared	2.66
LPIPS	47.24
Pseudo-Huber	2.92

Table: FID of 1-rectified flow trained with different losses on CIFAR-10. During sampling, the RK45 solver is used following Liu et al. [1]

This result is expected: unlike 2-rectified flow, 1-rectified flow training is not intersection-free, so only the L2 squared loss is valid.

Typos

Section 3: $p^1(\mathbf x) = \int p^1(\mathbf x, \mathbf z) d\mathbf z$ -> $p^2(\mathbf x) = \int p^2(\mathbf x, \mathbf z) d\mathbf z$

Figure 2(a) (top left): $\frac{t}{1-t}\mathbf x' - \mathbf x''$ -> $\frac{1-t}{t}(\mathbf x' - \mathbf x'')$

Section 4: log-normal -> logit-normal

[1] Flow Straight and Fast: Learning to Generate and Transfer Data with Rectified Flow

[2] https://probml.github.io/pml-book/book1.html

Final decision: Accept (poster)

2024-09-25

The paper presents an improved training approach for rectified flows, demonstrating competitive performance with state-of-the-art distillation methods, particularly in low NFE settings. Overall, the reviewers recognized the paper’s empirical contributions as significant and practically relevant. Reviewer Wooo commended the results, noting that “Reflow seems more efficient than distillation and is especially much more clean and natural as a solution.” Reviewer nW7F observed that improvements are consistent even when scaling to higher NFEs, emphasizing that “the proposed techniques can be useful for the community.” While some concerns were expressed about the incremental nature of the work and overlap with existing methods—such as those highlighted by Reviewer eZ8c, who pointed out that “many of the improved techniques overlap with existing diffusion improvements”—the consensus is that the paper’s contributions substantially advance the training of rectified flows by reducing complexity and enhancing efficiency. The presentation and clarity of the paper were also praised, with Reviewer Wooo describing the empirical studies as “thorough” and the presentation as “excellent.” Given the paper’s strong empirical performance, practical impact, and clear exposition, I recommend its acceptance.
