PaperHub

Overall rating: 4.6 / 10 · Decision: Rejected · 3 reviewers
Individual ratings: 2, 3, 4 (min 2, max 4, std 0.8) · Confidence: 4.0
Originality: 3.0 · Quality: 2.3 · Clarity: 3.3 · Significance: 2.3
NeurIPS 2025

One Step Diffusion via Flow Fitting

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

Diffusion and flow-matching models have demonstrated impressive performance in generating diverse, high-fidelity images by learning transformations from noise to data. However, their reliance on multi-step sampling requires repeated neural network evaluations, leading to high computational cost. We propose FlowFit, a family of generative models that enables high-quality sample generation through both single-phase training and single-step inference. FlowFit learns to approximate the continuous flow trajectory between latent noise \(x_0\) and data \(x_1\) by fitting a basis of functions parameterized over time \(t \in [0, 1]\) during training. At inference time, sampling is performed by simply evaluating the flow at the terminal time \(t = 1\), avoiding iterative denoising or numerical integration. Empirically, FlowFit outperforms prior diffusion-based single-phase training methods, achieving superior sample quality.
Keywords
Efficient generative models · Single-step diffusion

Reviews & Discussion

Review (Rating: 2)

The paper proposes to parametrize flow trajectories as linear combinations of basis functions, where the basis functions take analytical form and their coefficients are learned by neural networks. The parameterized flow map is decoupled in that the basis functions depend on $t$ while the neural network depends on the initial condition $x_0$. The time derivative can then be computed analytically without automatic differentiation. A joint training strategy is proposed, where a standard flow-matching network is trained and a second, FlowFit network is trained alongside it, distilling the flow-matching network in an online fashion.

Strengths and Weaknesses

Strengths:

  • The idea of decoupling time and space where one can learn space via neural network is neat, and one does not need automatic differentiation to calculate flow map derivative.
  • The proposed method is easy to understand.
  • One-step results on CelebA-HQ outperform prior end-to-end methods.

Weaknesses:

  • While I recognize that the novelty lies in decomposing the flow map into independent space and time components, the proposed distillation method is the same as Physics Informed Distillation [1], which leverages principles from Physics-Informed Neural Networks. However, this paper fails to discuss, compare with, or even cite this prior work. A proper discussion and comparison are needed.
  • The empirical investigation is limited, as the paper only compares results on CelebA-HQ. How about CIFAR-10 and ImageNet? How does this method compare with recent prior methods such as [2]?
  • The sinusoidal embedding used in standard flow-matching networks uses cosine and sine bases for the time-embedding layers, but those networks additionally feed these basis functions into the network to obtain more complex functions of $x$ and $t$. Multiplying the basis functions directly with neural-network outputs seems to severely limit representational capacity, or at least to require significantly more complex networks to fit complex joint $(x, t)$ functions. There may not even exist an arbitrarily complex neural network that can represent the joint $(x, t)$ function fully. The authors do not address these theoretical limitations in their discussion.

[1] Tee, Joshua Tian Jin, Kang Zhang, Hee Suk Yoon, Dhananjaya Nagaraja Gowda, Chanwoo Kim, and Chang D. Yoo. "Physics informed distillation for diffusion models." arXiv preprint arXiv:2411.08378 (2024).

[2] Zhou, Linqi, Stefano Ermon, and Jiaming Song. "Inductive moment matching." arXiv preprint arXiv:2503.07565 (2025).

Questions

I'd like the authors to address my questions above. A proper discussion on prior methods and more empirical evaluations are needed. I believe the paper, although exhibiting novelty in its idea, is not ready for publication yet.

Limitations

Some but not all technical limitations are addressed. No societal impact discussion is needed.

Final Justification

I keep my score because the authors have not sufficiently addressed my concerns.

Formatting Issues

N/A

Author Response

Thank you for your feedback. Please see the detailed response below.

Q1: Comparison with Physics Informed Distillation [1]

First, we appreciate that the reviewer recognizes the novelty of our method in decoupling time and space for modeling the flow map. Here, we discuss [1] while emphasizing that FlowFit remains clearly distinguishable from [1].

The first clear difference is that, as its name suggests, [1] is a distillation-based method that follows a two-stage training procedure and relies on a pre-trained diffusion model. In contrast, our method is a single-stage training approach. We note that training the velocity field $v$ does not depend on training the flow field $\psi$, so the two can be trained in parallel; the effective training time therefore corresponds to that of training a single model. FlowFit can be viewed as performing self-distillation during training, removing the need for a separate distillation step. We believe a direct comparison with [1] is therefore not apples-to-apples.

Now, beyond this key difference, although both FlowFit and [1] aim to model the flow map, the approaches differ significantly in their formulations. In [1], a single network, namely a Physics-Informed Neural Network (PINN), denoted $x_{\theta, t}$, is used to jointly model space and time. In contrast, FlowFit adopts a basis decomposition where $\theta$ and $t$ are decoupled. This design choice has a direct implication on how the flow derivative is computed. In FlowFit, we can obtain the exact derivative of the flow map analytically and at no additional cost. On the other hand, [1] relies on a first-order numerical approximation of the flow derivative, which may introduce approximation error and additional computational cost.

Thus, we believe that FlowFit is clearly distinguishable from [1]. We will include a summary of this comparison in our related work section.

[1] Tee, Joshua Tian Jin, Kang Zhang, Hee Suk Yoon, Dhananjaya Nagaraja Gowda, Chanwoo Kim, and Chang D. Yoo. "Physics informed distillation for diffusion models." arXiv preprint arXiv:2411.08378 (2024)

Q2: More empirical evaluations

We evaluate FlowFit on two datasets, unconditional CelebA-HQ and class-conditional ImageNet, both of which are commonly used in recent works such as [3], which uses the same benchmarks as ours. Unfortunately, due to time constraints, we were not able to include the ImageNet results in the main submission. However, these results are available in the supplementary material, and we plan to move them into the main paper in the revised version. We also note that the recent method [3], to which we compare, likewise performs its empirical evaluation on two datasets.

[3] One Step Diffusion via Shortcut Models. Frans et al. ICLR 2025.

Q3: Comparison to [2]

Thanks for pointing out this interesting work. However, we would like to clarify that [2] is a concurrent work to ours. It first appeared on arXiv on March 10, 2025. As per the NeurIPS 2025 FAQ:

“What is the policy on comparisons to recent work? Papers that appear online after March 1, 2025 are generally considered concurrent to NeurIPS submissions. Authors are not expected to compare to those.”

However, we appreciate the relevance of [2] and will include a brief comparison in our revised version. We also believe that [2] is clearly distinguishable from FlowFit, as it does not explore fitting the flow using a basis of functions at a conceptual level.

[2] Zhou, Linqi, Stefano Ermon, and Jiaming Song. "Inductive moment matching." arXiv preprint arXiv:2503.07565 (2025).

Q4: The sinusoidal embedding / discussion of the theoretical existence of a solution

Concerning the use of different basis functions, we agree that this is indeed an interesting direction to explore. In our work, we investigate specific choices of basis functions and show empirically that our approach achieves state-of-the-art performance. However, we do not claim that the basis we use is optimal. In particular, the trigonometric basis could benefit from alternative configurations, and exploring different basis functions or parameterizations is, in our view, a promising direction for future research.

Regarding the theoretical existence of a basis capable of fitting the flow, we would like to point out that in Lines 142–144 we explicitly raise the question of whether the trajectory $\psi(x, t)$ can be effectively approximated using a basis of functions. To support this, we include in the supplementary material the general Stone–Weierstrass theorem, which states that, on a compact domain, there exists a set of continuous functions capable of approximating any continuous function. In particular, the Weierstrass approximation theorem ensures that any continuous function defined on a compact space can be approximated arbitrarily well by polynomials. We believe this provides a theoretical foundation for the existence of such approximations.
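For reference, a minimal statement of the classical result invoked here (standard textbook form, not specific to our paper): for every continuous $f : [a, b] \to \mathbb{R}$ and every $\varepsilon > 0$, there exists a polynomial $p$ such that $\sup_{t \in [a, b]} |f(t) - p(t)| < \varepsilon$. The Stone–Weierstrass theorem generalizes this from polynomials to any subalgebra of continuous functions that separates points and contains the constants.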

Comment

Thanks for the response. The authors have addressed some of my concerns. However, I would like to keep my score because the exposition needs revision to put the work in a broader context of prior works such as PINN, and more experiments are needed to justify its empirical performance against other state-of-the-art methods.

Comment

Thanks for your follow-up.

We are happy to have addressed some of the reviewer’s concerns and here provide our response to the remaining ones.

“the exposition needs revision to put the work in a broader context of prior works such as PINN”

We commit to including the comparison with [5] from our response to Q1 in the prior-work discussion of our revision. If the reviewer suggests other relevant works, we would be happy to include those as well.

“more experiments are needed to justify its empirical performance against other state-of-the-art methods.”

We have added comparisons with two recent strong training-based baselines, iCT [2] and sCT [3], as suggested by reviewer 6gVd. We include results (FID) on both unconditional CelebA-HQ and conditional ImageNet, and will incorporate them in our revision. We believe this further justifies our empirical performance.

| Method | CelebA-HQ | ImageNet |
|---|---|---|
| CT [1] | 33.3 (reported in [4]) | 69.7 (reported in [4]) |
| iCT [2] | 21.7 | 43.3 |
| sCT [3] | 19.3 | 41.6 |
| Shortcut Models [4] | 20.5 | 40.3 |
| FlowFit (Ours) | 14.1 | 34.4 |

If the reviewer suggests other baselines, we are happy to include them in our revision.

[1] Consistency Models. Song et al. NeurIPS 2023

[2] Improved techniques for training consistency models. Song et al. ICLR 2024

[3] Simplifying, stabilizing and scaling continuous-time consistency models. Lu et al. ICLR 2025

[4] One Step Diffusion via Shortcut Models. Frans et al. ICLR 2025.

[5] Tee, Joshua Tian Jin, Kang Zhang, Hee Suk Yoon, Dhananjaya Nagaraja Gowda, Chanwoo Kim, and Chang D. Yoo. "Physics informed distillation for diffusion models." arXiv preprint arXiv:2411.08378 (2024)


We are happy to address any further concern or question.

Review (Rating: 3)

This paper proposed FlowFit for one-step or few-step generation for flow-based generative models. Specifically, the authors proposed directly learning the time-dependent trajectory, parameterized by a set of basis functions. By enforcing the consistency loss between the derivative of the flow and the vector field, the proposed FlowFit can achieve easy one-step generation. Experiments on the CelebAHQ dataset demonstrated better one-step generation quality.

Strengths and Weaknesses

Strengths

  • To the best of my knowledge, the proposed approach of utilizing a consistency-based loss is novel in comparison to existing consistency models.
  • On CelebAHQ, the proposed approach outperformed existing methods in one-step generation.

Weaknesses

  • Although the idea is new, this draft is clearly incomplete and lacks many details in theoretical validity and experimental setups.
  • The second condition in Equation 3 (the initial condition) prevents the model from collapsing, as an arbitrary constant added to the learned model will also satisfy the derivative constraint. This was never explained in the paper.
  • The experiments' scope is limited. For example, the current evaluation focused on end-to-end training, although it is clear that the proposed approach can also do distillation (two-phase) of a pre-trained flow-based model, as the consistency loss is essentially independent.
  • Newer consistency baselines are not compared, e.g., iCT/iCD [1], sCD/sCT [2], which exhibit significantly better performance than the Reflow and other distillation baselines in this paper.
  • There are noticeable formatting mistakes, e.g., the "time window $\alpha_t$" that was never used in Algorithm 1.

[1] Song, Yang, and Prafulla Dhariwal. "Improved techniques for training consistency models." arXiv preprint arXiv:2310.14189 (2023).

[2] Lu, Cheng, and Yang Song. "Simplifying, stabilizing and scaling continuous-time consistency models." arXiv preprint arXiv:2410.11081 (2024).

Questions

  • How to enforce the boundary condition in Equation 3 (Weakness 2)?
  • It is unclear why the authors chose to use a basis set instead of directly modeling the trajectory as one single network $\psi_\theta(x_0, t)$.
  • In Equation 7, it is unclear why the stop-gradient operator is used. Is it necessary for the consistency loss, or is it purely for computational concerns?
  • As the proposed approach can be easily adapted for distillation, it would be more beneficial to compare such results (Weakness 3).
  • sCM (Weakness 4) provided a robust baseline. Additional results with sCM would better demonstrate the effectiveness of the proposed approach.
  • The baseline results are identical to those in the Shortcut model paper. If the authors used these results, they should properly credit the original paper. Similarly, Section 5.5 is almost identical to that in the Shortcut model.

Limitations

The authors have discussed the potential limitations of this work.

Final Justification

The authors have successfully clarified some questions in my rebuttal and provided additional results to better demonstrate the effectiveness of the proposed approach. However, the evaluation scope remains limited. For example, the authors did not include the distillation setup, as is the common practice in previous CM work. Also, the manuscript is incomplete, and I would suggest a further major revision before the paper can be accepted.

Formatting Issues

The paper does not have major formatting concerns.

Author Response

Thank you for your feedback. Please see the detailed response below.

Q1: How did we enforce the boundary condition in Equation (3)?

Thanks for pointing this out. Equation (3) states that $\psi(x_0, 0) = x_0$, meaning that $\psi(x_0, 0) - x_0 = 0$. To enforce this condition, we incorporate it into our calculation of $\psi(x_0, t)$. In our implementation, we use a residual formulation of $\psi(x_0, t)$, where $\psi(x_0, t) = \psi(x_0, t) + (\psi(x_0, 0) - x_0) = x_0 + (\psi(x_0, t) - \psi(x_0, 0))$.
During training, when computing $\psi(x_0, t)$, we are implicitly enforcing $\psi(x_0, 0) = x_0$. We found that this ensures the boundary constraint.

It is worth noting that in the case of a polynomial basis, this constraint is naturally satisfied by omitting the constant term from the basis. This justifies our choice of the basis $\{t, t^2, t^3, \ldots\}$, which ensures that the residual $(\psi(x_0, t) - \psi(x_0, 0))$ is exactly zero at $t = 0$.

We will add those details in our revision.

Q2: Why did the authors choose to use a basis set instead of directly modeling the trajectory as a single network?

One of our main motivations for using a basis in $t$ (e.g., modeling $\psi_\theta = c_1(\theta) \cdot t + c_2(\theta) \cdot t^2 + c_3(\theta) \cdot t^3 + \ldots$) rather than a single network $\psi_\theta(x_0, t)$ is to decouple time and space in modeling the flow map. This allows us to obtain the exact derivative with respect to $t$ analytically ("for free") and leads to improved computational efficiency. Our method relies on matching the velocity field to the derivative of the flow with respect to $t$. With a basis formulation, this derivative can be computed analytically, for example $\frac{d\psi_\theta}{dt} = c_1(\theta) + 2 c_2(\theta) t + 3 c_3(\theta) t^2 + \ldots$, at no additional computational cost. In contrast, using a single network $\psi_\theta(x_0, t)$ would require backpropagating through the network with respect to $t$ (in addition to backpropagating through $\theta$), making training significantly more expensive.
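To make the decoupled parameterization concrete, here is a minimal PyTorch-style sketch (our illustration only: the `CoeffNet` module, its linear placeholder backbone, and the basis size are assumptions for exposition, not the paper's actual implementation):

```python
import torch

class CoeffNet(torch.nn.Module):
    """Hypothetical coefficient network: maps x0 to n_basis coefficient
    vectors c_k(theta). A real model would use an image backbone here."""
    def __init__(self, dim, n_basis=4):
        super().__init__()
        self.n_basis = n_basis
        self.net = torch.nn.Linear(dim, dim * n_basis)  # placeholder backbone

    def forward(self, x0):
        return self.net(x0).view(x0.shape[0], self.n_basis, -1)  # (B, n, D)

def flow_and_derivative(coeff_net, x0, t):
    """Evaluate psi(x0, t) and its exact time derivative for the polynomial
    basis {t, t^2, ..., t^n}; the constant term is omitted so that
    psi(x0, 0) = x0 holds by construction."""
    c = coeff_net(x0)                                    # (B, n, D)
    k = torch.arange(1, c.shape[1] + 1, dtype=x0.dtype)  # exponents 1..n
    basis = t[:, None] ** k                              # t^k,       (B, n)
    dbasis = k * t[:, None] ** (k - 1)                   # k t^(k-1), (B, n)
    psi = x0 + (c * basis[..., None]).sum(dim=1)         # x0 + sum_k c_k t^k
    dpsi_dt = (c * dbasis[..., None]).sum(dim=1)         # sum_k k c_k t^(k-1)
    return psi, dpsi_dt
```

Single-step sampling then amounts to one evaluation at the terminal time, e.g. `sample, _ = flow_and_derivative(net, torch.randn(8, 32), torch.ones(8))`, with no ODE solver involved.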

Q4: Results of the "two-stage"/distillation version of FlowFit

Here we provide the two-stage version of our method, where the velocity field is first trained, its weights are then frozen, and finally our FlowFit model is trained on top of it. We use the same number of iterations.

| Setting | CelebA-HQ | ImageNet |
|---|---|---|
| Train → Distill (Ours*) | 11.3 | 29.8 |
| Joint Training (Ours) | 14.1 | 34.4 |
| Shortcut Models [1] | 20.5 | 40.3 |

While the distillation/two-stage setting is certainly an interesting setup, it is not the focus of our work. We chose not to include it in the main paper, as that would require comparing against a wide range of recent two-stage methods, which we did not aim to cover.

We would like to emphasize that our work is focused on improving single-stage training and closing the performance gap with respect to two-stage methods, which we think is an interesting direction of research.

[1] One Step Diffusion via Shortcut Models. Frans et al. ICLR 2025.

Q5: Comparison with sCM

Thank you for pointing this out. We will include a comparison with sCM. We are currently working on obtaining the results and will share them here as soon as they are available.

Q6: Why is the stop-gradient operator used in Equation 7

Good point. The stop-gradient is not necessary for our loss. We experimented without it, and training still works. However, as the reviewer mentioned, removing the stop-gradient introduces additional computational overhead. We will include an ablation comparing versions with and without stop-gradient to highlight the differences in both performance and computational cost.
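For concreteness, a sketch of how the stop-gradient enters such a consistency objective (a paraphrase of the role of Eq. (7), not the paper's code; `v_net` and the hypothetical `flow_and_derivative` helper from the sketch in our answer to Q2 above are illustrative names):

```python
import torch

def flowfit_loss(coeff_net, v_net, x0, t):
    # Match the analytic flow derivative to the velocity field. Detaching
    # the target is the stop-gradient: it spares the backward pass through
    # v_net, so gradients only flow into the flow (coefficient) model.
    psi, dpsi_dt = flow_and_derivative(coeff_net, x0, t)
    target = v_net(psi, t).detach()  # .detach() acts as stop-gradient
    return ((dpsi_dt - target) ** 2).mean()
```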

Q7: Details about theoretical validity/experimental setup

We would be happy to clarify any further details regarding the theoretical validity or experimental setup.

Q8: The baseline results are identical to those in the Shortcut model paper/section 5.5

Between lines 182 and 183, we credit the Shortcut Models paper, mentioning that we used their baseline results:

“We benchmark our approach against a diverse set of generative models, closely following the evaluation protocol introduced in [3]. The results for the competing methods, shown in Table 1, are taken from [3]”

We will also reformulate Section 5.5 to avoid the similarity.

Q9: Formatting mistakes

Thank you for pointing this out. We will correct these issues in our revision.

Comment

I thank the authors for their detailed rebuttal. While some of my questions have been answered, I will elaborate on the remaining concerns, which have not been sufficiently addressed.

  • Boundary condition. I understand the boundary condition can be prescribed for polynomials. However, it remains unclear how to enforce it for the sinusoidal basis. Specifically, the cosines equal one at $t = 0$, which implicitly requires the summation of the coefficients to be $x_0$.
  • Experimental results. The current experimental results are still not comprehensive and convincing enough, given the lack of more recent baselines like sCM. The distillation setup seems more practically appealing, as the preliminary results already demonstrated better performance. In addition, it is probably easier to distill a pre-trained model than to train a new one from scratch.

In conclusion, I believe the idea presented in this paper is indeed new and has the potential to present a better generation quality in a few-step generation setup. However, the current version of the manuscript is far from complete. I would suggest that the authors further polish the work for a more comprehensive and solid revision.

Comment

First, we thank the reviewer for their follow-up and sincerely appreciate their acknowledgment of the novelty of our paper. Below, we provide our response regarding the boundary condition point.

(a) In-depth explanation of our answer to Q1

Given the boundary condition $\psi(x_0, 0) = x_0$, which implies that $\psi(x_0, 0) - x_0 = 0$, we plug this into our calculation of $\psi(x_0, t)$:

$\psi(x_0, t) = \psi(x_0, t) + 0 = \psi(x_0, t) + (\psi(x_0, 0) - x_0) = x_0 + (\psi(x_0, t) - \psi(x_0, 0)).$

Now, let us adopt the general case of a basis $\{b_1(t), b_2(t), \dots, b_n(t)\}$ (which covers the trigonometric basis) and compute $\psi(x_0, t) - \psi(x_0, 0)$:

$\psi(x_0, t) - \psi(x_0, 0) = c_1(\theta)(b_1(t) - b_1(0)) + c_2(\theta)(b_2(t) - b_2(0)) + \dots + c_n(\theta)(b_n(t) - b_n(0)).$

Substituting into the earlier expression:

$\psi(x_0, t) = x_0 + (\psi(x_0, t) - \psi(x_0, 0)) = x_0 + c_1(\theta)(b_1(t) - b_1(0)) + c_2(\theta)(b_2(t) - b_2(0)) + \dots + c_n(\theta)(b_n(t) - b_n(0)).$

Thus,

$\psi(x_0, t) - x_0 = c_1(\theta)(b_1(t) - b_1(0)) + c_2(\theta)(b_2(t) - b_2(0)) + \dots + c_n(\theta)(b_n(t) - b_n(0)).$

The boundary condition is satisfied by default because substituting $t = 0$ into $\psi(x_0, t) - x_0$ yields zero: $c_1(\theta)(b_1(0) - b_1(0)) + c_2(\theta)(b_2(0) - b_2(0)) + \dots + c_n(\theta)(b_n(0) - b_n(0)) = 0$.

This procedure fits the residual $\psi(x_0, t) - x_0$ using the shifted basis $\{b_1(t) - b_1(0),\ b_2(t) - b_2(0),\ \dots,\ b_n(t) - b_n(0)\}$.

(b) Ensuring Boundary Condition by Construction

The reviewer acknowledged that boundary conditions can be prescribed for polynomials. Inspired by this, we reflected on the difference between using a polynomial basis such as $\{1, t, t^2, t^3, \dots, t^n\}$ and a more general basis such as our trigonometric basis. We emphasize that our choice of the polynomial basis $\{1, t, t^2, t^3, \dots, t^n\}$ was made purely for simplicity. There are, in fact, infinitely many polynomial bases; for instance, $\{1, t - 3, t^2 + 5, t^3 - 8, \dots, t^n + 4\}$ is also a valid polynomial basis.

The key difference is that, in the case of $\{1, t, t^2, \dots, t^n\}$, all components except the constant vanish at $t = 0$. Thus, to enforce the boundary condition, we can simply omit the constant term, which is equivalent to subtracting its value at $t = 0$.

Given a general basis of functions $\{b_1(t), b_2(t), \dots, b_n(t)\}$, we can construct a modified basis that satisfies the boundary condition by subtracting each basis function's value at $t = 0$, yielding $\{b_1(t) - b_1(0),\ b_2(t) - b_2(0),\ \dots,\ b_n(t) - b_n(0)\}$. The residual $\psi(x_0, t) - x_0$ can be fitted on this modified basis, and the boundary constraint is satisfied by construction.

Specifically, $\psi(x_0, t) - x_0$ can be expressed as $\psi(x_0, t) - x_0 = c_1(\theta)(b_1(t) - b_1(0)) + c_2(\theta)(b_2(t) - b_2(0)) + \dots + c_n(\theta)(b_n(t) - b_n(0))$.

Ultimately, $\psi(x_0, t) = x_0 + c_1(\theta)(b_1(t) - b_1(0)) + c_2(\theta)(b_2(t) - b_2(0)) + \dots + c_n(\theta)(b_n(t) - b_n(0))$, and at $t = 0$ the boundary condition $\psi(x_0, 0) = x_0$ holds.

In the specific case of polynomials, the general basis is $\{1, t, t^2, t^3, \dots, t^n\}$, and the modified one (which we adopted in our paper) becomes $\{t, t^2, t^3, \dots, t^n\}$, ensuring all basis components evaluate to zero at $t = 0$.

To conclude, given any basis $\{b_1(t), b_2(t), \dots, b_n(t)\}$, fitting the residual $\psi(x_0, t) - x_0$ on the modified basis $\{b_1(t) - b_1(0),\ \dots,\ b_n(t) - b_n(0)\}$ guarantees the boundary condition by construction.
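As a concrete illustration (our example, with a generic frequency $\omega$): for a trigonometric basis with elements $b_k(t) = \cos(k\omega t)$ or $b_k(t) = \sin(k\omega t)$, the modified basis consists of $\cos(k\omega t) - 1$ and $\sin(k\omega t)$, since $\cos(0) = 1$ and $\sin(0) = 0$. Every element vanishes at $t = 0$, so $\psi(x_0, 0) = x_0$ holds for any learned coefficients $c_k(\theta)$.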

We believe this is a clear and principled way to enforce the boundary condition $\psi(x_0, 0) = x_0$ (and it is equivalent to the procedure mentioned in our answer to Q1). More importantly, this does not affect the validity of the results presented in our paper.


If the reviewer still has concerns regarding the boundary condition, we will be happy to provide further clarifications.

Comment

Here, we include our answer regarding the second point.

“The current experimental results are still not comprehensive and convincing enough, given the lack of more recent baselines like sCM”

Here, we include a comparison with the training-based methods of both iCT [2] and sCT [3] on CelebA-HQ. To the best of our knowledge, and based on our search, neither paper provides an official and publicly available implementation. Given this, we made our best effort to faithfully implement and train these baselines. For iCT [2], we based our implementation on https://github.com/Kinyugo/consistency_models, and for sCT [3], we used https://github.com/xandergos/sCM-mnist.

To ensure a fair comparison, we use the same backbone (DiT-B/2), batch size (64), and number of iterations (500K).

| Method | FID ↓ |
|---|---|
| CT [1] | 33.3 (reported in [4]) |
| iCT [2] | 21.7 |
| sCT [3] | 19.3 |
| Shortcut Models [4] | 20.5 |
| FlowFit (Ours) | 14.1 |

“The distillation seems more practically appealing, as the preliminary results already demonstrated better performance.”

Here, the reviewer raises a concern regarding the appropriate task that the paper should address. While we propose our method for the single-stage training setting (training from scratch), the same setting as in [4], the reviewer argues that it should instead be reformulated to address the problem of distillation (two-stage training). We would like to respond to this concern by briefly addressing the following three questions.

Q1) Is our task (single-stage training) an interesting task to solve? We strongly believe the answer is yes. Having to first train a model and then perform distillation is a real handicap, hence the recent growing trend of works aiming to obtain a single-step model directly, without this sequential process.

Q2) Is it already an established independent task? Yes: recent papers such as [4] exclusively address this task.

Q3) Does FlowFit meet the "requirements" of a single-stage training approach? We strongly believe the answer is yes. FlowFit does not require a pre-trained model, and its effective training time corresponds to that of training a single network, similar to [4].

Thus, we strongly believe that, objectively, our method deserves to be assessed on its own merit as a single-stage training approach. Moreover, we do not see why our work should be penalized for not being reformulated to address the distillation setting and subsequently compared to all the recent distillation methods.

Comparison to distillation methods and computational constraints

Beyond the justification provided in the previous point, one of the reasons we did not propose a distillation version of our method is that, unfortunately, we cannot afford the required computational resources. Distillation methods such as [2] and [3] typically rely on large batch sizes ranging from 512 to 4096, which are beyond our available compute budget.

Fortunately, the work of [4] provides a benchmark for single-stage methods using a moderate batch size of 64, which enabled us to evaluate our idea in a fair setting.

In this regard, we believe that research should be driven by ideas, not limited by obstacles that disproportionately affect those with fewer resources.

Concern about whether training from scratch would work, beyond the results presented in our paper.

We would like to point out that [4] already demonstrates that training the velocity model from scratch, in a manner similar to FlowFit, not only works but leads to state-of-the-art performance.

Here is the loss used in [4], as shown in Eq (5) of their paper:

$L(\theta) = \mathbb{E}_{x, t, d}\left[\left\|v_\theta(x_t, t, 0) - (x_1 - x_0)\right\|^2 + \lambda\left\|v_\theta(x_t, t, d) - s_{\text{target}}\right\|^2\right]$

Here, $v_\theta(x_t, t, 0)$ represents the vanilla velocity model, and $v_\theta(x_t, t, d)$ is the shortcut model parameterized by $0 < d \leq 1$, enabling single-step generation. Ideally, $s_{\text{target}}$ depends on a well-trained $v_\theta(x_t, t, 0)$; however, the authors start training both components from scratch, and they show that this approach works.

We note that here, too, training $v_\theta(x_t, t, 0)$ is independent of training $v_\theta(x_t, t, d)$. Intuitively, if $v_\theta(x_t, t, 0)$ were replaced with a pre-trained velocity model and only $v_\theta(x_t, t, d)$ were trained, the method in [4] would reduce to a distillation approach.

[1] Consistency Models. Song et al. NeurIPS 2023

[2] Improved techniques for training consistency models. Song et al. ICLR 2024

[3] Simplifying, stabilizing and scaling continuous-time consistency models. Lu et al. ICLR 2025

[4] One Step Diffusion via Shortcut Models. Frans et al. ICLR 2025.


We hope this addresses the reviewer’s concern. We would be happy to provide any additional clarification or address further questions.

Comment

I thank the authors for their comprehensive new results, especially those in comparison with the latest sCM/iCM models. The authors also explained the boundary condition issues. Based on the new results, I will raise my original scores. However, some problems remain, including a limited evaluation scope (e.g., distillation, as the common practice in previous CM models). Nonetheless, given the new results, I believe the approach proposed in this work has the potential to be a better few-step generative model.

Comment

First, we thank the reviewer for their follow-up.

The reviewer's final remaining concern is that we should also evaluate our idea on another task, namely two-stage distillation. To this, we responded that we strongly believe single-stage training has recently become an established independent task. Indeed, the recent works [1], [2], and [3] exclusively benchmark their methods on the task of single-stage training.

Beyond this objective argument, we also noted that, due to our limited compute budget, we could not conduct a fair evaluation in the context of distillation. We hope that this constraint will not lead to the disqualification of our work, especially considering that its novelty appears to be appreciated by all the reviewers (average originality score of 3 out of 4).

[1] One Step Diffusion via Shortcut Models. Frans et al. ICLR 2025.

[2] Mean Flows for One-step Generative Modeling. Geng et al. arxiv 2025. (Concurrent work to ours, first appeared on arXiv on May 19, 2025).

[3] Inductive moment matching. Zhou et al. ICML 2025 (Concurrent work to ours, first appeared on arXiv on March 10, 2025).

Review (Rating: 4)

The following work proposes a single-stage training algorithm for single-step sampling from probability-flow ODEs (including, but not limited to, flow matching and consistency models). The algorithm directly fits the flow-map solution at time $t$, bypassing the need to integrate over time during the sampling phase. This is achieved by representing the time-dependent ODE solution with either a Fourier (sinusoidal) or polynomial set of time-varying basis functions. The training algorithm alternates between fitting a standard vector-field model $v_t$ and matching the derivative of the basis-function model to the vector field with a consistency loss.

Strengths and Weaknesses

Strengths:

  • Formulation and writing were clear and easy to understand
  • Method appears effective and principled

Weaknesses:

  • I think prior work [1] uses a closely related formulation by also modeling the sampling trajectories in the Fourier domain, allowing them to also achieve 1-step sampling.
  • Whether one would really consider this a single-stage training algorithm is perhaps a bit nuanced. Based on my understanding of the algorithm, we can interpret this as distilling the vector-field model into the FlowFit model while the vector-field model is still training. It's not clear that this is more efficient than just allowing the standard vector-field model to train to completion on its own first, and then freezing it and distilling it into the FlowFit model. Furthermore, I believe progressive distillation can also be similarly formulated to train multiple stages simultaneously within the same training loop.
  • I feel the comparison between polynomial and trigonometric basis functions shouldn't be written this conclusively, as it does not preclude the existence of an untried configuration of trigonometric basis functions that might yet yield better results.

[1] https://proceedings.mlr.press/v202/zheng23d.html

Questions

  • Include relevant discussion wrt. prior works such as [1], highlighting differences and similarities
  • Respond to concern regarding single stage training claims.

Limitations

Limitations were included, but societal impact was not. It might be a good idea to include a standard blurb regarding the negative impact of generative models.

Final Justification

Overall, I really liked the general direction of this work, but I'm lowering my rating because

  1. I think this method should be further studied in the context of other distillation works
  2. I would like to see the boundary condition regarding sinusoidal basis functions addressed in a further revision.

Formatting Issues

None

Author Response

Thank you for your feedback. Please see the detailed response below.

Q1: discussing differences and similarities with respect [1]

Here, we discuss [1] while emphasizing that FlowFit remains clearly distinguishable from [1].

The first clear difference is that, as its name suggests, [1] is a distillation-based method that follows a three-stage procedure: it relies on a pre-trained diffusion model, constructs a synthetic training dataset, and then distills. In contrast, our method is a single-stage training approach. We note that training the velocity field $v$ does not depend on training the flow field $\psi$, so the two can be trained in parallel; the effective training time therefore corresponds to that of training a single model. FlowFit can be viewed as performing self-distillation during training, removing the need for a separate distillation step. We believe a direct comparison with [1] is therefore not apples-to-apples.

Now, beyond this key distinction, we note that while both FlowFit and [1] aim to directly model the flow map, that is, to predict any $x_t$ given $x_0$, the two approaches differ significantly in their formulations. FlowFit uses a basis of functions, decoupling space and time, and operates by matching the local velocity field to the time derivative of the flow map. In contrast, [1] focuses on learning a direct mapping from initial noise to its intermediate states, using a synthetic dataset. Additionally, [1] relies on a carefully designed architecture that incorporates temporal convolutions within the neural network while we do not.

Thus, we believe that FlowFit is clearly distinguishable from [1]. We will include a summary of this comparison in our related work section.

Q2: single stage training claims

Thanks for pointing this out. In fact, we were probably not clear enough when presenting our algorithm, which may have led to some confusion. Training the velocity field and the flow field is done jointly but does not need to be in an alternating fashion. A key point is that training the velocity field is completely independent of training the flow model, so the two models can be trained fully in parallel. That is, the forward and backward passes of the velocity model can be computed simultaneously with those of the flow model, without any drop in performance. Therefore, there is no need for alternating training, and the effective training time is that of training a single network, hence our claim that our method is a training-based (single-run) method and not a distillation one. Our method can be seen as performing self-distillation during training, and thus does not require a separate distillation step and is trained over a single run. We believe the key distinction between a training-based method and a distillation-based method lies in the effective training time, whether it is that of training two sequential networks (distillation, or two stages) or of training a single network (training, or one stage), and not in the number of networks trained. For instance, Generative Adversarial Networks (GANs) also train two networks but are still considered a training-based approach, not a distillation method. We would also like to emphasize that we do not rely on any tricks, such as extending the training time so much that it effectively becomes equivalent to first training the velocity field and then distilling. For example, our method is trained on ImageNet for 1 million iterations, which is a common number in the literature (the same as [1]).
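A minimal sketch of this joint, non-alternating training step (our illustration under stated assumptions: `flow_and_derivative` is the hypothetical basis-flow helper sketched in our response to reviewer 6gVd, `v_net(x, t)` is the velocity network, and evaluating the velocity target at $\psi(x_0, t)$ reflects our description above rather than released code):

```python
import torch

def joint_training_step(v_net, coeff_net, opt_v, opt_psi, x1):
    """One joint step. The two losses share no parameters, so the two
    updates are independent and could run fully in parallel."""
    x0 = torch.randn_like(x1)                     # latent noise
    t = torch.rand(x1.shape[0])                   # t ~ U[0, 1]
    xt = (1 - t)[:, None] * x0 + t[:, None] * x1  # linear interpolant

    # (1) Vanilla flow-matching update for the velocity network.
    loss_v = ((v_net(xt, t) - (x1 - x0)) ** 2).mean()
    opt_v.zero_grad(); loss_v.backward(); opt_v.step()

    # (2) FlowFit update: match the analytic flow derivative to the
    # current, detached velocity field (online self-distillation).
    psi, dpsi_dt = flow_and_derivative(coeff_net, x0, t)
    with torch.no_grad():
        target = v_net(psi, t)
    loss_psi = ((dpsi_dt - target) ** 2).mean()
    opt_psi.zero_grad(); loss_psi.backward(); opt_psi.step()
    return loss_v.item(), loss_psi.item()
```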

Regarding the comparison with a "single-run" or "training" version of progressive distillation: there is no clear reason to believe that starting from scratch in the first stage would consistently lead to improved results or strong performance. In progressive distillation, the number of denoising steps is halved every $T$ iterations, and if the first model is not well trained, its errors may propagate to subsequent stages. This could make training more fragile. Also, modifying progressive distillation to start from scratch would effectively define a new method, for which there is currently no empirical evidence that it would work better.

[1] One Step Diffusion via Shortcut Models. Frans et al. ICLR 2025.

Q3: Non-conclusive comparison between polynomial and trigonometric basis function

We agree. The results we report apply only to the specific configuration used in our experiments. It is indeed possible that with a different choice of trigonometric basis, the performance could improve and potentially even outperform the polynomial basis.

As such, one of our suggestions for future work is a more thorough investigation of various basis functions and their configurations. So, we are open to the reviewer's suggestion on how to present this result, whether we should further emphasize that it applies only to our current setting, move it to the supplementary material, or omit it entirely.

Comment

I appreciate the thoughtful responses from the authors. I am satisfied with the response for Q1 and Q3.

My point regarding Q2 was similarly brought up by 6gVd, and my primary concern regarding this point was that it is not clear to me that distilling from $v_{\theta'}$ prior to its convergence is beneficial in any way, as the targets are simply less accurate and noisier. I don't think comparing this to a GAN is very apt, as there is no complicated min-max game going on. As such, this work strikes me more as a distillation technique. I do appreciate the additional results provided in the other review responses, and I think that further exploration along the two-stage direction would significantly strengthen this work.

Overall, I really liked the general direction of this work, but I'm lowering my rating because

  1. I think this method should be further studied in the context of other distillation works
  2. I would like to see the boundary condition regarding sinusoidal basis functions addressed in a further revision.
Comment

We thank the reviewer for their follow-up and we appreciate that they "liked the general direction of our work". Here, we address the reviewer’s first concern.

“my primary concern regarding this point was that it is not clear to me that distilling from $v_{\theta'}$ prior to its convergence is beneficial in any way, as the targets are simply less accurate and noisier.”

The work of [1] already showed that distilling from $v_\theta$ prior to its convergence not only works but leads to state-of-the-art results in the training-based setting.

Here is the loss used in [1], as shown in Eq (5) of their paper:

$L(\theta) = \mathbb{E}_{x, t, d}\left[\left\|v_\theta(x_t, t, 0) - (x_1 - x_0)\right\|^2 + \lambda\left\|v_\theta(x_t, t, d) - s_{\text{target}}\right\|^2\right]$

Here, $v_\theta(x_t, t, 0)$ represents the vanilla velocity model (trained with the vanilla flow-matching loss), and $v_\theta(x_t, t, d)$ is the shortcut model parameterized by $0 < d \leq 1$, enabling single-step generation. Ideally, $s_{\text{target}}$ depends on a well-trained $v_\theta(x_t, t, 0)$; however, the authors start training both components from scratch, and they show that this approach works well.

We note that here, too, training $v_\theta(x_t, t, 0)$ is independent of training $v_\theta(x_t, t, d)$. Intuitively, if $v_\theta(x_t, t, 0)$ were replaced with a pre-trained velocity model and only $v_\theta(x_t, t, d)$ were trained, the method in [1] would reduce to a distillation approach.

That being said, we acknowledge that using a pre-trained model, either in our case or in the case of [1], should lead to improved results. However, this shifts the methods into a different category: two-stage training, which relies on different assumptions (access to a pre-trained model). Therefore, a direct comparison is not fair.

“As such, this work strikes me more as a distillation technique. I think this method should be further studied in the context of other distillation works.”

The reviewer raises a concern regarding the appropriate task that the paper should address. While we propose our method for the single-stage training setting (training from scratch), the same setting as in [1], the reviewer argues that it should instead be reformulated to address the problem of distillation (two-stage training). We would like to respond to this concern by briefly addressing the following three questions.

Q1) Is our task (single-stage training) an interesting task to solve? We strongly believe the answer is yes. Having to first train a model and then perform distillation is a real handicap, hence the recent growing trend of works aiming to obtain a single-step model directly, without this sequential process.

Q2) Is it already an established independent task? Yes: recent papers such as [1] exclusively address this task.

Q3) Does FlowFit meet the "requirements" of a single-stage training approach? We strongly believe the answer is yes. FlowFit does not require a pre-trained model, and its effective training time corresponds to that of training a single network, similar to [1].

Thus, we strongly believe that, objectively, our method deserves to be assessed on its own merit as a single-stage training approach. Moreover, we do not see why our work should be penalized for not being reformulated to address the distillation setting and subsequently compared to all the recent distillation methods.

Comparison to distillation methods and computational constraints

Beyond the justification provided in the previous point, one of the reasons we did not propose a distillation version of our method is that, unfortunately, we cannot afford the required computational resources. Distillation methods such as [2] and [3] typically rely on large batch sizes ranging from 512 to 4096, which are beyond our available compute budget.

Fortunately, the work of [1] provides a benchmark for single-stage methods using a moderate batch size of 64, which enabled us to evaluate our idea in a fair setting.

In this regard, we believe that research should be driven by ideas, not limited by obstacles that disproportionately affect those with fewer resources.

What is the value of our work beyond solving distillation?

We propose a novel idea and demonstrate state-of-the-art performance among single-stage training methods. While single-stage approaches, including FlowFit, do not yet match the FID of two-stage distillation methods, we believe advancing one-stage training and narrowing this gap is a valuable and promising research direction. In this context, FlowFit represents a meaningful step forward.

[1] One Step Diffusion via Shortcut Models. Frans et al. ICLR 2025.

[2] Improved techniques for training consistency models. Song et al. ICLR 2024

[3] Simplifying, stabilizing and scaling continuous-time consistency models. Lu et al. ICLR 2025


We hope this addresses the reviewer’s first concern and are happy to provide further clarifications if needed.

Comment

Here, we address the reviewer’s second concern.

Here we show a clear way to enforce the boundary condition in the general case, including sinusoidal basis functions, which does not affect the validity of the results presented in our paper.

Ensuring Boundary Condition by Construction

As acknowledged by reviewer 6gVd, boundary conditions can be easily prescribed for polynomials. Inspired by this, we reflected on the difference between using a polynomial basis such as $\{1, t, t^2, t^3, \dots, t^n\}$ and a more general basis such as our trigonometric basis. We emphasize that our choice of the polynomial basis $\{1, t, t^2, t^3, \dots, t^n\}$ was made purely for simplicity. There are, in fact, infinitely many polynomial bases; for instance, $\{1, t - 3, t^2 + 5, t^3 - 8, \dots, t^n + 4\}$ is also a valid polynomial basis.

The key difference is that, in the case of $\{1, t, t^2, \dots, t^n\}$, all components except the constant vanish at $t = 0$. Thus, to enforce the boundary condition, we can simply omit the constant term, which is equivalent to subtracting its value at $t = 0$.

Given a general basis of functions $\{b_1(t), b_2(t), \dots, b_n(t)\}$, we can construct a modified basis that satisfies the boundary condition by subtracting each basis function's value at $t = 0$, yielding $\{b_1(t) - b_1(0),\ b_2(t) - b_2(0),\ \dots,\ b_n(t) - b_n(0)\}$. The residual $\psi(x_0, t) - x_0$ can be fitted on this modified basis, and the boundary constraint is satisfied by construction.

Specifically, $\psi(x_0, t) - x_0$ can be expressed as $\psi(x_0, t) - x_0 = c_1(\theta)(b_1(t) - b_1(0)) + c_2(\theta)(b_2(t) - b_2(0)) + \dots + c_n(\theta)(b_n(t) - b_n(0))$.

Ultimately, $\psi(x_0, t) = x_0 + c_1(\theta)(b_1(t) - b_1(0)) + c_2(\theta)(b_2(t) - b_2(0)) + \dots + c_n(\theta)(b_n(t) - b_n(0))$, and at $t = 0$ the boundary condition $\psi(x_0, 0) = x_0$ holds.

In the specific case of polynomials, the general basis is $\{1, t, t^2, t^3, \dots, t^n\}$, and the modified one (which we adopted in our paper) becomes $\{t, t^2, t^3, \dots, t^n\}$, ensuring all basis components evaluate to zero at $t = 0$.

To conclude, given any basis $\{b_1(t), b_2(t), \dots, b_n(t)\}$, fitting the residual $\psi(x_0, t) - x_0$ on the modified basis $\{b_1(t) - b_1(0),\ \dots,\ b_n(t) - b_n(0)\}$ guarantees the boundary condition by construction.

We believe this is a clear and principled way to enforce the boundary condition $\psi(x_0, 0) = x_0$ (and it is equivalent to the procedure mentioned in our answer to Q1 of reviewer 6gVd). More importantly, this does not affect the validity of the results presented in our paper.


If the reviewer still has concerns regarding the boundary condition, we will be happy to provide further clarifications.

Comment

Thanks again for your responses. I'm raising my rating back to my initial borderline accept as a result. Ultimately the goal is to achieve some consensus, but I definitely find this work interesting, despite the need for some more evaluation in the two-stage setting.

Comment

Thank you for your feedback and we appreciate your positive assessment of our work.

As we explained earlier, our work focuses on the task of single-stage training, which we strongly believe has become an established independent task, as recent works such as [1], [2], and [3] exclusively benchmark on it.

However, based on the reviewer’s recommendation, we commit to including results from the distillation version of our method, along with a recent distillation baseline in our revision.

[1] One Step Diffusion via Shortcut Models. Frans et al. ICLR 2025.

[2] Mean Flows for One-step Generative Modeling. Geng et al. arxiv 2025. (Concurrent work to ours, first appeared on arXiv on May 19, 2025).

[3] Inductive moment matching. Zhou et al. ICML 2025 (Concurrent work to ours, first appeared on arXiv on March 10, 2025).

Final Decision

Tl;dr: Based on the reviews, rebuttal, and ensuing discussion, I recommend a reject decision.

Paper Summary

The paper proposes FlowFit, a new method for generative modeling designed to produce samples in a single step. The central idea is to directly parameterize the continuous flow from a noise latent to a data point using a basis function expansion over time. This transforms the problem from learning a local velocity field to learning a global trajectory. The original submission's empirical results on CelebA-HQ showed performance comparable to some single-step methods.

Key strengths and weaknesses

Strengths

  • Novelty: The core idea of parameterizing the flow trajectory is novel and mathematically appealing.
  • Efficiency: The method claims single-step generation through a single training stage, which is quite desirable and an advantage over multistep training or distillation methods (caveats below).

Weaknesses

  • Insufficient empirical validation: The experimental results in the initial submission were not sufficient. Multiple reviewers raised this concern. A significant portion of the critical experimental validation was conducted during the rebuttal period.
  • Single stage vs. online training: The proposed method involves the co-training of two separate models, where one is effectively guiding the other. As pointed out by Reviewer ptJr, this setup is conceptually very close to online distillation. If the models are large, it may not be feasible to do this, and one may have to resort to two-stage distillation. Further, the distilled version seemed to significantly outperform the online version. It is unclear why the online version would be preferred if the computational cost is similar and the performance is lower.
  • Insufficient investigation of method details: The choice of basis functions was not deeply analyzed or justified in the original manuscript. It also lacked discussion on how to enforce boundary conditions for general bases. Some of these reviewer concerns were clarified during the discussion period.

Decision justification

Overall, I feel that the original manuscript is far from ready for publication. The general reviewer opinion remained negative leaning as well. Single stage vs. distillation is another major concern (as stated in weaknesses above). The authors conducted a significant amount of new research during rebuttal/discussion, including implementing and testing new baselines and a new distillation version of their own model. While the effort and the new results are appreciated, the original work is simply not ready and needs a major revision.