PaperHub

Overall rating: 5.2/10 · Rejected · 3 reviewers
Individual ratings: 2, 5, 3 (min 2, max 5, std dev 1.2) · Confidence: 4.3
Novelty: 2.0 · Quality: 2.0 · Clarity: 3.0 · Significance: 2.0
NeurIPS 2025

Single-Step Diffusion via Direct Models

OpenReview · PDF
Submitted: 2025-05-12 · Updated: 2025-10-29

Abstract

We introduce Direct Models, a generative modeling framework that enables single-step diffusion by learning a direct mapping from initial noise $x_0$ to all intermediate latent states along the generative trajectory. Unlike traditional diffusion models that rely on iterative denoising or integration, Direct Models leverages a progressive learning scheme where the mapping from $x_0$ to $x_{t + \delta t}$ is composed as an update from $x_0$ to $x_t$ plus the velocity at time $t$. This formulation allows the model to learn the entire trajectory in a recursive, data-consistent manner while maintaining computational efficiency. At inference, the full generative path can be obtained in a single forward pass. Experimentally, we show that Direct Models achieves state-of-the-art sample quality among single-step diffusion methods while significantly reducing inference time.
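In symbols, one reading of this composition (a sketch using the parameterization $\phi(x_0, t) = x_0 + t\,w(x_0, t)$ described in the reviews below; not the paper's exact equations): the state at $t + \delta t$ is formed from the current estimate of $x_t$ plus a small velocity step, and the trajectory endpoint is read off at $t = 1$:

$$x_{t+\delta t} \;\approx\; \underbrace{x_0 + t\,w(x_0, t)}_{\text{estimate of } x_t} \;+\; \delta t\, v\big(x_0 + t\,w(x_0, t),\, t\big), \qquad x_1 \;\approx\; x_0 + w(x_0, 1).$$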
Keywords
Efficient generative models

Reviews and Discussion

Review
Rating: 2

The paper introduces Direct Models, a framework for performing single-step diffusion for image generation. The method is based on the flow matching concept, particularly the Conditional Flow Matching formulation [1]. The authors define the flow phi(x,t) as x + tw(x,t), where w is parameterized by a neural network. The framework proceeds by first training the vector field v using the formulation in [1], and then training w by matching (x + tw(x,t))' with v. Finally, inference can be carried out in one step using the learned w(x,t). Experiments on CelebAHQ-256 are provided, showing that the method performs relatively well compared to other approaches.

[1] Yaron Lipman, Ricky TQ Chen, Heli Ben-Hamu, Maximilian Nickel, and Matt Le. Flow matching for generative modeling

Strengths and Weaknesses

Strengths:

  • Technically, the idea is sound. It tries to approximate the entire flow path in 1 step, using a parameterization of the vector field.

  • The approach introduced is simple and easy to follow. The authors did a good job of explaining the core technical idea and results.

Weaknesses:

  • The paper title "Single-step diffusion" is misleading. There is no diffusion process; the entire paper is built on flow matching.

  • The mention of "Taylor expansion" seems like unnecessary jargon to me; it is just a simple finite difference, and there is no error quantification. Furthermore, equation (8) is confusing, since the authors mix continuous time and finite discrete time.

  • The authors should be upfront about the fact that the framework operates in latent space using the Stable Diffusion encoder, instead of only mentioning it in passing in the experiments.

  • The comparison to distillation methods by saying "we propose an end-to-end training approach that directly learns one-step generative model" (line 205) is also misleading. The training actually depends on a well-trained vector field v(x,t); the authors, instead of using a pre-trained v(x,t), decide to train v and w alternately in the same training loop. I believe this will make the training less stable than finishing training v first and then using that signal to train w, in which case this is essentially a distillation process. In [3], the training is also done in one loop, alternately, but the authors still mistakenly say that it is not end-to-end compared to the paper under review (lines 202-206).

  • The evaluation is lacking; the authors only evaluate on one dataset, CelebA-HQ. Since CelebA is a highly structured dataset (aligned human faces with good lighting conditions), the performance may not generalize to more diverse datasets like ImageNet. Also, the authors mention different distillation schemes, including [3], but only compare against three older methods published in 2022-2023, while the results reported in [3] already surpass them.

  • The submitted version looks very rushed in some sections. For example, the ablation study contains only two short sentences with no explanation of the results.

  • The authors failed to cite or compare against [2], which proposed 1-step flow matching using a principled formulation.

[2] Kornilov, Nikita, et al. "Optimal flow matching: Learning straight trajectories in just one step." Advances in Neural Information Processing Systems 37 (2024): 104180-104204.

[3] Yin, Tianwei, et al. "Improved distribution matching distillation for fast image synthesis." Advances in Neural Information Processing Systems 37 (2024): 47455-47487.

Questions

  • The authors choose to parameterize phi in a specific way; how does this affect the path and quality of generated samples? We see in [1] that different parameterizations can lead to highly different paths.

  • Would it be possible to have a single network approach?

  • There is no analysis or discussion of stability over multiple timesteps.

  • How would the model perform on different datasets, e.g., ImageNet?

  • How well would it compare against 1-step flow matching?

  • What is the motivation behind the stop-gradient? It would be interesting to explore its effect.

Limitations

yes

Final Justification

The paper is rushed, poorly motivated, and far from the quality expected at NeurIPS.

Formatting Issues

N/A

Author Response

Thank you for your feedback. Please see the detailed response below.

Q1: The term “diffusion” in “Single-step diffusion” is misleading

We follow the terminology used in [1], where "diffusion" refers broadly to both diffusion-based and flow-matching models. However, we agree that this terminology may be misleading and we will update our title to explicitly mention "flow matching" for clarity.

[1] One Step Diffusion via Shortcut Models. Kevin et al. ICLR 2025.

Q2: "Equation (8) is confusing" / "Taylor expansion" seems like unnecessary jargon

First, we would like to clarify how Eq. (8) is derived. It is obtained by directly combining the results from Eq. (6) and Eq. (7). Eq. (6) is derived by applying the standard derivative operator to Eq. (4), while in Eq. (7), we approximate the derivative using the forward finite difference method. We would be happy to provide a more detailed explanation if there is still any ambiguity regarding the derivation.
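To make the chain concrete, here is a sketch of the three steps, assuming Eq. (4) is the parameterization $\phi(x,t) = x + t\,w(x,t)$ summarized in the review (the exact forms follow the paper):

$$\frac{d}{dt}\phi(x,t) = w(x,t) + t\,\partial_t w(x,t) \quad \text{(Eq. 6: derivative of Eq. 4)},$$

$$\frac{d}{dt}\phi(x,t) \approx \frac{\phi(x, t+\delta t) - \phi(x, t)}{\delta t} \quad \text{(Eq. 7: forward finite difference)},$$

$$\frac{\phi(x, t+\delta t) - \phi(x, t)}{\delta t} \approx w(x,t) + t\,\partial_t w(x,t) \quad \text{(Eq. 8: combining Eqs. 6 and 7)}.$$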

Concerning the "Taylor expansion" terminology, we agree that it is unnecessary here, and we will remove it in our revision for clarity.

Q3: Should operating in the latent space be mentioned upfront rather than in the experiments section?

First, we would like to point out that, in theory, our method can operate in both pixel and latent spaces. Therefore, our intention was to present the method in an agnostic way and then clarify the specific experimental setting (i.e., which space we used for training) in the experiments section. This is similar to [1], where the same structure was followed. However, for clarity, we will mention in the introduction that all model training in our work is conducted in the latent space.

Q4: Concern about Direct Models being a training or distillation method?

“The training actually depends on a well-trained vector field v(x,t)”

One of the key findings of our work is that the method can be trained from scratch, without the need for a well-trained $v(x,t)$, with an effective training time equivalent to that required for training a single network. Despite this, it still outperforms all existing training-based methods.

"the authors, instead of using a pre-trained v(x,t), decide to train v and w alternately in the same training loop"

We were probably not clear enough when presenting our algorithm, which may have led to this confusion. The key point is that training $v(x,t)$ is completely independent of training $w(x,t)$, so the two models can be trained fully in parallel. That is, the forward and backward passes of $v$ can be computed simultaneously with those of $w$, without any drop in performance. Therefore, there is no need for alternating training, and thus the effective training time is that of a single network ($w(x,t)$), hence our claim that our method is a training-based method and not a distillation one. Our method can be seen as performing self-distillation during training time, and thus does not require a separate distillation step and is trained over a single run. We believe the key distinction between a training-based method and a distillation-based method lies in the effective training time, i.e., whether it is the time of training two sequential networks (distillation, or 2-stage) or of training a single network (training, or 1-stage), and not in the number of networks trained. For instance, Generative Adversarial Networks (GANs) also train two networks but are still considered a training-based approach, not a distillation method.
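To make the parallel training concrete, here is a minimal, illustrative sketch (PyTorch-style; the network interfaces, the linear interpolation path, and the exact form of the propagation target are simplifications for illustration, not our actual implementation):

```python
import torch

def training_step(v_net, w_net, opt_v, opt_w, x1, delta_t=0.01):
    """One joint step; the two branches share no gradients, so they can run in parallel."""
    x0 = torch.randn_like(x1)                                        # initial noise
    t = torch.rand(x1.shape[0], device=x1.device).view(-1, 1, 1, 1)  # assumes 4D latents

    # Velocity branch: standard conditional flow matching loss on a linear path.
    xt = (1 - t) * x0 + t * x1
    loss_v = ((v_net(xt, t) - (x1 - x0)) ** 2).mean()
    opt_v.zero_grad()
    loss_v.backward()
    opt_v.step()

    # Direct-model branch: propagation loss; the target is built under stop-gradient.
    with torch.no_grad():
        phi_t = x0 + t * w_net(x0, t)                 # current estimate of x_t
        target = phi_t + delta_t * v_net(phi_t, t)    # one small velocity step forward
    phi_next = x0 + (t + delta_t) * w_net(x0, t + delta_t)
    loss_w = ((phi_next - target) ** 2).mean()
    opt_w.zero_grad()
    loss_w.backward()
    opt_w.step()

    return loss_v.item(), loss_w.item()
```

Because `loss_v` never touches `w_net` and the propagation target is detached, the forward/backward passes of the two networks are independent and can be scheduled concurrently.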

We also would like to emphasize that we do not rely on any tricks, such as extending the training time so much that it effectively becomes equivalent to first training the velocity field and then distilling. For example, our method is trained on ImageNet for 1 million iterations, which is a common number in the literature.

"I believe this will make the training less stable than finishing training v first and then using that signal to train w, in which case this is essentially a distillation process"

We show that, although we train from scratch, our training is stable and outperforms existing training-based methods, such as the recent state-of-the-art [1]. In our paper, we do not claim that our training-based method outperforms distillation-based methods in general, or even the distillation version of our own method. These are different categories with different assumptions, and thus a direct comparison would be unfair.

We would like to emphasise that in this work, our goal is to advance the state of training-based methods, which we believe is a valuable direction of research.

Q5: Empirical validation/comparison to distillation methods including [3]

We evaluate Direct Models on two datasets: unconditional CelebA-HQ and class-conditional ImageNet, both of which are commonly used in recent works, such as [1], which uses the same benchmarks as ours. Unfortunately, due to time constraints, we were not able to include the ImageNet results in the main submission. However, these results are available in the supplementary material, and we plan to move them into the main paper in the revised version. We also note that the recent compared method [1] performs empirical evaluation on two datasets.

Regarding the comparison with distillation methods, as mentioned in our response to Q4, we believe that our approach should not be classified as a distillation method. Our method has distinct advantages that typical distillation techniques lack; for example, it is a single-stage process, making it both simpler and more practical. Therefore, we feel that a direct comparison with distillation methods may not be entirely appropriate. We note that we included some comparisons to distillation approaches for reference, similar to [1], but we do not claim that our method outperforms distillation-based methods. That said, we will include [3] in our related work section.

[1] One Step Diffusion via Shortcut Models. Kevin et al. ICLR 2025.

Q6: Improving Writing

Thanks for pointing this out. We will improve the manuscript, particularly the description of the ablation study.

Q7: Comparison to [2]

We are unsure how we could compare to [2]. [2] does not report results on the image generation task; in fact, it does not show results on datasets such as CelebA, CIFAR, or ImageNet. To our understanding, a comparison with the method in [2] is difficult because [2] uses specific neural network architectures (convex networks and two-layer MLPs) and, as the authors acknowledge, suffers from scalability issues. Could the reviewer suggest a potential comparison with [2]?

Q8: How does the reparameterization of $\phi$ affect the path and quality of generated samples?

We would like to point out that our derivation of $\phi$ and $v$ is based on the standard formulation of conditional flow matching in Eq. (2) and the work of [1]. Our approach involves a simple change of variables that, in theory, does not affect the solution. Specifically, instead of training a network to predict $\phi(x, t)$ directly, we train a network to predict $\frac{\phi(x, t) - x}{t}$. At inference time, we recover the flow by multiplying by $t$ and adding back $x$. In Figure 1 of our main paper, we show the outputs generated using both the vanilla training and our method; the resulting mappings appear quite similar. We will include in the supplementary material an example illustrating the intermediate noisy latents produced by both models.

Q9: Would it be possible to have a single-network approach?

This is a good question. We believe this is indeed possible. The most straightforward solution would be to modify the network architecture to include a binary conditioning variable that can switch between states. In this way, the same network could process inputs corresponding to both $v$ and $w$. We would be happy to hear any suggestions/comments from the reviewer if they have other possible thoughts/approaches.

Q10: How would the model perform on ImageNet?

We included a comparison on class-conditional ImageNet in the supplementary material. Our method outperforms the state-of-the-art training-based method [1]. We will move this result to the main paper in our revision.

Q11: How well does it compare against 1-step flow matching?

1-step flow matching (vanilla flow matching) results in poor performance. As shown in Table 1 of our main paper, on the CelebA-HQ dataset, 1-step flow matching achieves an FID of 280.5, while our method achieves 16.8. For conditional ImageNet, as reported in Table 1 of our supplementary material, 1-step flow matching reaches an FID of 324.8, compared to 36.5 for our method.

Q12: The motivation behind the stop-gradient

The stop-gradient is not necessary for our loss. We experimented without it, and training still works. However, removing the stop-gradient introduces additional computational overhead. We will include an ablation comparing versions with and without stop-gradient to highlight the differences in both performance and computational cost.
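For reference, a schematic of how the stop-gradient would enter the propagation loss (illustrative only; the exact form is Eq. (16) in the paper):

$$\mathcal{L}_{\text{prop}} = \mathbb{E}\Big[\big\|\, \phi_\psi(x_0, t + \delta t) \;-\; \mathrm{sg}\big[\phi_\psi(x_0, t) + \delta t\, v_\theta(\phi_\psi(x_0, t), t)\big] \,\big\|^2\Big], \qquad \phi_\psi(x_0, t) = x_0 + t\, w_\psi(x_0, t),$$

where $\mathrm{sg}[\cdot]$ denotes the stop-gradient and $\psi$, $\theta$ label the two networks (our notation here, for illustration). Removing $\mathrm{sg}$ lets gradients flow through the target branch as well, which is the source of the extra computational overhead mentioned above.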

Comment

It is still unclear to me why this approach, claimed by the authors to be 1-stage, is more practical. The amount of training FLOPs is essentially the same, and at inference only the final diffusion model is used. The comparison to GANs in the comment seems misleading: a GAN is composed of 2 different networks with conflicting losses, which is why they need to be trained alternately. For the introduced architecture, both networks share the same goal, optimizing similar targets, which is why the loss is stable. The loss of the final model depends on the estimates of the first model, hence the reason we claimed that training separately will be more stable and lead to better results. The authors can conduct some simple experiments to verify it. If that is indeed the case, there is no point in introducing more complexity by coupling them together during training.

Comment

Thank you for your follow-up. Here, we provide our response regarding the remaining concerns.

Direct Models should use a pre-trained model (be reformulated to address the distillation setting)?

The reviewer raises a concern regarding the appropriate task that our paper should address. While we propose our method for the single-stage training setting (training from scratch), same as in [1], the reviewer argues that it should instead be reformulated to address the problem of distillation (two-stage training). We would like to respond to this concern by briefly addressing the following three questions.

Q1) Is our task (single-stage training) an interesting task to solve? We strongly believe the answer is yes. The fact that one needs to first train a model and then perform distillation is a real handicap, hence the recent growing trend of work aiming to obtain a single-step model directly, without this sequential process.

Q2) Is it already an established, independent task? Yes; recent papers such as [1] and [2] exclusively address this task.

Q3) Does Direct Models meet the "requirements" of a single-stage training approach? We strongly believe the answer is yes. Direct Models does not require a pre-trained model, and its effective training time corresponds to that of training a single network, similar to [1].

Thus, we strongly believe that, objectively, our method deserves to be assessed on its own merits as a single-stage training approach. Moreover, we do not see why our work should be penalized for not using a pre-trained model (i.e., for not being reformulated to address the distillation setting) and subsequently compared to all the recent distillation methods.

hence the reason we claimed that training separately will be more stable and lead to better results.

(a) training stability

We would like to point out that, beyond the results in our paper, [1] already demonstrates that training the velocity model from scratch, in a manner similar to Direct Models, not only works but leads to state-of-the-art performance in the single-stage training setting.

Here is the loss used in [1], as shown in Eq (5) of their paper:

$$L(\theta) = E_{x, t, d} \left[ \left\| v_\theta(x_t, t, 0) - (x_1 - x_0)\right\|^2 + \lambda \left\| v_\theta(x_t, t, d) - s_{\text{target}} \right\|^2\right]$$

Here, $v_\theta(x_t, t, 0)$ represents the vanilla velocity model (trained with the vanilla flow matching loss), and $v_\theta(x_t, t, d)$ is the shortcut model parameterized by $0 < d \leq 1$, enabling single-step generation. Ideally, $s_{\text{target}}$ depends on a well-trained $v_\theta(x_t, t, 0)$; however, the authors start training both components from scratch, and they show that this approach works well.

We note that here, too, training $v_\theta(x_t, t, 0)$ is independent from training $v_\theta(x_t, t, d)$. Intuitively, if $v_\theta(x_t, t, 0)$ were replaced with a pre-trained velocity model and only $v_\theta(x_t, t, d)$ were trained, the method in [1] would reduce to a distillation approach.

(b) leading to better results

That being said, we acknowledge that using a pre-trained model, either in our case or in the case of [1], should lead to improved results. However, this shifts the methods into a different category: two-stage training, which relies on different assumptions (access to a pre-trained model). Therefore, a direct comparison is not fair.

It’s still unclear to me why this approach, claimed by the authors to be 1-stage, is more practical? The amount of training FLOPS is essentially the same

While we agree that Direct Models needs the same amount of training FLOPs as its "distillation version" (which we also believe to be the case for [1]), Direct Models addresses the central limitation of distillation, namely the need to first train a generative model and then distill it, which is clearly a handicap. In contrast, Direct Models does not rely on a pre-trained model and requires approximately the same training time as a single network, effectively halving the overall training time compared to distillation, hence our claim of being 1-stage. This makes it particularly useful in scenarios where a pre-trained model is not available.

We would like to emphasize that Direct Models, and single-stage methods in general, are particularly relevant in practical scenarios, such as industrial applications, where training a generative model followed by distillation can be prohibitively time-consuming. On large-scale, real-world datasets, this two-stage process can require several days of training, making single-stage alternatives more appealing.

[1] One Step Diffusion via Shortcut Models. Kevin et al. ICLR 2025.

[2] Mean Flows for One-step Generative Modeling. Geng et al. arxiv 2025. (Concurrent work to ours, first appeared on arXiv on May 19, 2025).


We hope this addresses the reviewer’s concern. We would be happy to provide any additional clarification or address further questions.

Review
Rating: 5

The authors introduce a generative modeling framework ("Direct Models") that enables single-step generation. To do this the authors learn both a velocity field and a residual field $w(x,t)$, where the latter object parameterizes the flow and can thus be used to do single-step generation. The velocity field is trained via a Conditional Flow Matching loss, while the residual field $w$ is trained via a "Local Velocity Propagation" loss that encourages (finite differences of) the residual field to be consistent with the velocity. Conceptually, Direct Models can be viewed as the "opposite" of Consistency models in that Direct Models map gaussian noise to all intermediate latents, while Consistency models map intermediate latents to the fully denoised datapoint. The result is a practical one-step generation method that is competitive with alternative approaches (at least as judged by the CelebAHQ-256-based empirical evaluation), including distillation-based methods and end-to-end methods like Consistency models.

Strengths and Weaknesses

Strengths:

  • The paper is clearly written and generally easy to follow
  • The method is conceptually simple, easy to understand, and entails a straightforward training procedure. As far as I know the method is novel.
  • The empirical performance, while limited to a single dataset, shows competitive performance with a mix of relevant baseline methods
  • The paper addresses a problem area, namely one-step generation, that is of interest to many in the community

Weaknesses:

  • While generally clearly written, the submission is a bit sparse on some details (see below)
  • The empirical validation is limited to a single dataset. It would be particularly valuable to see more benchmarking in the class conditioned setting (i.e. beyond what's in the supplement).

Questions

Questions:

  • It's nice to see the mini-ablation in Table 2, but what happens at different $\delta t$, for example $0.02$? Benchmarking only two choices limits what we can learn from this.
  • Have you tried non-uniform time sampling during training, especially for $t^\prime$ in Eqn 16?
  • What do you expect would happen if you first trained the velocity field, fixed its parameters, and then learned the residual field via the local propagation loss? Or if you also computed $\theta$ gradients for the local propagation loss? What trivial solutions, if any, do you need to guard against?

Comments:

  • Using parameters $\nu$ instead of e.g. $\phi$ is confusing, since $\nu$ can be easily confused for the velocity.
  • Writing out all the elementary algebra between Eqn 8 and 12 and then repeating Eqn 12 as Eqn 13 is a waste of space. This space could be used much more profitably by clarifying other aspects of your method, see below. Similarly, the conditional generation results can/should be moved to the main paper.
  • Line 69: "or sampling noise". This is hopelessly vague; please clarify.
  • Line 134: "Additionally, all models operate in the latent space provided by the sd-vae-ft-mse autoencoder." This is unnecessarily vague. Please be more specific.
  • Line 13 in the supplement: "For the ImageNet case, we adopt classifier-free guidance (CFG) during training.” This is an entirely inadequate description. Please be much more specific, with equations.

Limitations

Yes, as mentioned by the authors, one limitation is that the current formulation is restricted to single-step inference.

Final Justification

I choose to maintain my score. I agree with the authors that "advancing the state of the art in 1-stage approaches is a promising and important research direction". I am a bit puzzled by reviewer twLP's nitpicking about whether the method constitutes distillation or not.

Formatting Issues

N/A

Author Response

Thank you for your feedback. Please see the detailed response below.

Q1: Empirical validation

We evaluate Direct Models on two datasets: unconditional CelebA-HQ and class-conditional ImageNet, both of which are commonly used in recent works, such as [1], which uses the same benchmarks as ours. Unfortunately, due to time constraints, we were not able to include the ImageNet results in the main submission. However, these results are available in the supplementary material, and we plan to move them into the main paper in the revised version. We also note that [1], the most recent compared method, performs its empirical evaluation on two datasets.

Regarding the suggestion to include an additional benchmark on a class-conditioned dataset: to the best of our knowledge, most existing methods use class-conditioned ImageNet for this purpose. If the reviewer has another dataset in mind, we are happy to try and include results during the discussion period.

[1] One Step Diffusion via Shortcut Models. Kevin et al. ICLR 2025.

Q2: Ablation with different $\delta_t$

Thanks for pointing this out. Here we add ablations with other values of $\delta_t$, in particular $\delta_t = 0.02$. We are currently working on obtaining the results and will share them here as soon as they are available. We will also include them in our revision.

Q3: Did we try non-uniform sampling for $t'$?

Good point. Yes, we did experiment with scheduling $t'$. Specifically, we sampled $t'$ uniformly from the interval $[0, \min(1, \frac{i}{T - T_0})]$, where $T$ is the maximum number of iterations and $i$ is the current training iteration. In our case, $T = 500K$ and $T_0 = 100K$.
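For clarity, the schedule can be sketched as follows (function and variable names are ours, for illustration only):

```python
import torch

def sample_t_prime(i, T=500_000, T0=100_000, batch_size=128, device="cpu"):
    # The upper bound of the sampling interval grows linearly with the training iteration i.
    upper = min(1.0, i / (T - T0))
    return torch.rand(batch_size, device=device) * upper  # t' ~ U[0, upper]
```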

We observed that near the end of training, the loss suddenly increases, and the generated images begin to show artifacts. The Table below provides a comparison of FID scores for the two cases. We note that, for the scheduled sampling case, the FID is computed using the checkpoint just before the loss increase.

| Sampling strategy | FID |
| --- | --- |
| Uniform sampling of $t'$ | 16.8 |
| Scheduled sampling of $t'$ | 18.7 |

Overall, we found that uniform time sampling leads to more stable training and better image quality. We will include this ablation in the revised version of the paper.

Q4: What happens if we first train the velocity field, fix its parameters, then train the residual field?

Our method also works in the case of pure distillation—i.e., training the velocity field first, fixing its parameters, and then training the direct model. We are currently working on obtaining the results and will share them here as soon as they are available. We will also include them in our revision.

Q5: adding more details addressing the reviewer comments

Thanks for pointing this out. We will address all of the reviewer's comments in our revision.

Comment

Here, we include results involving additional training that we were unable to provide during the initial rebuttal period.

Q4: What happens if we first train the velocity field, fix its parameters, then train the residual field? [updated]

In the table below, we compare our proposed Direct Models (with joint training) against its distillation-based version. In the joint training setup, both the velocity model and the direct model are trained simultaneously in parallel processes. To simplify training, we always use the latest EMA (Exponential Moving Average) version of the velocity model, updated every 100 iterations, to compute the propagation loss (Eq. (16)) for training the direct model $w(x, t)$.

| Method | Setting | FID ↓ | Training Time ↓ |
| --- | --- | --- | --- |
| Distillation (train velocity, fix its parameters, then train direct model) | 2-stage | 14.3 | $\approx 2\times$ |
| Joint Training (simultaneous) | 1-stage | 17.0 | $1\times$ |
| Shortcut Models [1] | 1-stage | 20.5 | $\approx 1\times$ |

As expected, the 2-stage distillation leads to slightly better image quality (lower FID); however, this comes at the cost of two stages of training and thus more compute time. Direct Models outperforms the existing training-based methods and already demonstrates a narrow quality gap compared to distillation.

While we acknowledge that current single-run methods, including Direct Models, do not yet match the FID performance of two-step distillation methods, we believe that advancing the state of the art in 1-stage approaches is a promising and important research direction. In this context, Direct Models represent a meaningful milestone toward achieving that goal.
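For completeness, the EMA bookkeeping mentioned above could look like the following sketch (the decay value, refresh interval placement, and interfaces are illustrative assumptions, not our exact training code):

```python
import copy
import torch

@torch.no_grad()
def update_ema(ema_model, model, decay=0.999):
    # Exponential moving average of the velocity model's parameters.
    for p_ema, p in zip(ema_model.parameters(), model.parameters()):
        p_ema.mul_(decay).add_(p, alpha=1.0 - decay)

# Inside the training loop (hypothetical names):
#   update_ema(ema_v, v_net)                                # after every optimizer step on v
#   if step % 100 == 0:
#       v_for_propagation = copy.deepcopy(ema_v).eval()     # snapshot used in Eq. (16)
```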

[1] One Step Diffusion via Shortcut Models. Kevin et al. ICLR 2025.

Comment

I thank the authors for engaging with my feedback. I think the additional ablations are a useful and informative addition. I agree with the authors that "advancing the state of the art in 1-stage approaches is a promising and important research direction" and as such I choose to maintain my score and will continue to argue for acceptance.

Comment

We thank the reviewer for their constructive feedback and continued endorsement of our work.

Comment

Just as a reminder, the author-reviewer discussion is not a period to update/include new results. The question on first training the velocity field then training the residual field was asked by the original review, and the experiment results presented above were not included in the authors' original rebuttal. Hence, in order to preserve fairness of the review process, I ask that the reviewers do not take these new experimental results into account when making their final decisions.

Review
Rating: 3

This paper introduces Direct Models, a framework for training single-step generative models. Direct Models map the initial noise $x_0$ to any intermediate time $t$. The model's evolution is governed by matching the derivative at the endpoint $x_t$. This is achieved by jointly training two networks: a standard conditional flow matching (CFM) model to estimate the velocity field, and the direct model that learns to map noise from t=0 to t=1 in a single inference step. The authors claim their method achieves state-of-the-art sample quality among single-step methods on CelebA-HQ.

Strengths and Weaknesses

Strengths

  1. This paper is well-motivated in pursuing a high-quality, single-step generative model and addresses a significant bottleneck of inference speed in diffusion and flow-based models.
  2. The experimental results show improved FID scores compared to recent single-step models.
  3. The paper is generally well-written and the proposed method is described with good clarity for implementation.

Weaknesses

  1. The proposed Direct Models largely overlap with BOOT. Both models map noise to intermediate time steps, and the loss function matches the derivative (velocity or score) at the intermediate time steps. The stop-gradient is applied similarly. What are the benefits of Direct Models over BOOT?
  2. The "single run" claim does not seem valid in the present paper. Without a comprehensive comparison of distillation against training two models jointly, the claim that this method is "more practical and efficient" (line 25) is not supported.
  3. The manuscript presents inadequate ablation studies. The authors claim the formulation $x + w(x,t)$ leads to "unstable training" but provide no evidence. What happens with a large $\delta t$? What is the impact of the stop-gradient in the propagation loss? Does the model fail to train without it?

Questions

  1. The Flow Matching model is jointly trained with Direct Models. There is no evidence that this Flow Matching model necessarily needs to be jointly trained with Direct Models. How large is the difference from using a pre-trained teacher model as $v_\theta$?
  2. How does the training loss curve look?

Limitations

As discussed in the paper.

Final Justification

I appreciate the authors’ clarification. The response has partly addressed my concerns.

The additional ablation studies improve the quality of this work. However, according to the reviewing policy communicated by the Area Chair, additional results submitted after the rebuttal deadline should not be taken into account when making final decisions. I must follow the guidance to evaluate the submission. Following the reviewer gVds's comment to cut space, the authors are encouraged to include discussions and experiments using the extra space.

The clarification regarding BOOT is helpful. I recommend including this comparison directly in the paper and marking the two-stage baseline as BOOT. However, I remain unconvinced by the "single-run" claim. The current formulation still appears to resemble a form of distillation. I did not find a convincing theoretical or empirical justification.

Overall, I hold a mixed opinion. The rebuttal improved the clarity and experimental rigor of the paper, but the response still falls short of fully addressing my concerns.

Formatting Issues

NA

Author Response

Thank you for your feedback. Please see the detailed response below.

Q1: "Single run claim" and comparison of our method against the "distillation" version our method against training two models jointly

Thanks for pointing this out. In fact, we were probably not clear enough when presenting our algorithm, which may have caused this confusion. The key point is that training $v(x,t)$ is completely independent of training $w(x,t)$, so the two models can be trained fully in parallel. That is, the forward and backward passes of $v$ can be computed simultaneously with those of $w$, without any drop in performance. Therefore, there is no need for alternating training, and thus the effective training time is that of training a single network ($w(x,t)$), hence our claim that our method is a training-based (single-run) method and not a distillation one. Our method can be seen as performing self-distillation during training time, and thus does not require a separate distillation step and is trained over a single run. We believe the key distinction between a training-based method and a distillation-based method lies in the effective training time, i.e., whether it is the time of training two sequential networks (distillation, or 2-stage) or of training a single network (training, or 1-stage), and not in the number of networks trained. For instance, Generative Adversarial Networks (GANs) also train two networks but are still considered a training-based approach, not a distillation method. We also would like to emphasize that we do not rely on any tricks, such as extending the training time so much that it effectively becomes equivalent to first training the velocity field and then distilling. For example, our method is trained on ImageNet for 1 million iterations, which is a common number in the literature (same as [1]).

We acknowledge that current single-run methods do not yet match the FID of two-step distillation methods. Nonetheless, our single-run method makes good progress towards closing this gap and achieves state-of-the-art results among all other single-run methods.

We will also include a comparison with "the distillation version" of Direct Models. We are currently working on obtaining the results and will share them here as soon as they are available.

Q2: Comparison with BOOT and the benefit of Direct Models over BOOT

Thanks for pointing this out. BOOT is a distillation method and we will discuss it in the prior work section in our revision.

(a) comparison with BOOT:

We agree that both Direct Models and BOOT aim to model the direct mapping from the initial noise to all the intermediate latents, but here are some key differences:

  1. BOOT adopts a diffusion model formulation, whereas our Direct Models follow a flow matching formulation. Flow matching is different from diffusion models and is recognized as representing an independent class of generative models. This fundamental difference leads to distinct derivations and training losses. In this sense, our derivation and that of BOOT should be viewed as fundamentally distinct, just as diffusion models and flow matching are fundamentally different in their underlying principles.
  2. Beyond the differences in the underlying frameworks, derivations, and final training losses, a central distinction is that BOOT is a pure distillation framework (two-stage training) that relies on a pre-trained generative model, whereas Direct Models does not (as discussed in our response to Q1). We would like to emphasize that Direct Models demonstrates that joint training is not only feasible, but also achieves state-of-the-art performance in the single-training setting.

(b) benefit of using Direct Models over BOOT:

We believe Direct Models brings a valuable advantage by addressing a key limitation acknowledged by the BOOT authors themselves (mentioned in their Limitations section), namely the requirement of a pre-trained generative model. Moreover, here we quote from their Future Work section:

“As future research, we aim to investigate the possibility of jointly training the teacher and the student models... This exploration could provide insights into the applicability and benefits of BOOT in scenarios where a pre-trained model is not available.”

Our work directly addresses this direction by showing that joint training is feasible and that we can achieve competitive performance without requiring a pre-trained model.

To summarize, our method (i) proposes a novel derivation based on the flow matching framework, (ii) eliminates BOOT's two-stage training with only a small performance gap, and (iii) achieves state-of-the-art performance among single-stage methods, making it a promising direction for the community.

Q3: Missing ablations

Thank you for pointing this out. We agree that adding these ablations will strengthen our work. We are currently working on obtaining the results and will share them here as soon as they are available. We will also include them in our revision.

Q4: How does the training loss curve look?

The training loss decreases steadily. Unfortunately, we could not include figures in the rebuttal, but we will provide a plot of the training loss curve in the supplementary material.

Comment

I appreciate the authors’ clarification. The response has partly addressed my concerns.

The additional ablation studies improve the quality of this work. However, according to the reviewing policy communicated by the Area Chair, additional results submitted after the rebuttal deadline should not be taken into account when making final decisions. I must follow the guidance to evaluate the submission. Following the reviewer gVds's comment to cut space, the authors are encouraged to include discussions and experiments using the extra space.

The clarification regarding BOOT is helpful. I recommend including this comparison directly in the paper and marking the two-stage baseline as BOOT. However, I remain unconvinced by the "single-run" claim. The current formulation still appears to resemble a form of distillation. I did not find a convincing theoretical or empirical justification.

Overall, I hold a mixed opinion. The rebuttal improved the clarity and experimental rigor of the paper, but the response still falls short of fully addressing my concerns.

Comment

Here, we include results involving additional training that we were unable to provide during the initial rebuttal period.

Q1: Comparison with the "distillation version" of Direct Models [updated]

In the table below, we compare our proposed Direct Models (with joint training) against its distillation-based version. In the joint training setup, both the velocity model and the direct model are trained simultaneously in parallel processes. To simplify training, we always use the latest EMA (Exponential Moving Average) version of the velocity model, updated every 100 iterations, to compute the propagation loss (Eq. (16)) for training the direct model $w(x, t)$.

| Method | Setting | FID ↓ | Training Time ↓ |
| --- | --- | --- | --- |
| Distillation (train velocity, fix its parameters, then train direct model) | 2-stage | 14.3 | $\approx 2\times$ |
| Joint Training (simultaneous) | 1-stage | 17.0 | $1\times$ |
| Shortcut Models [1] | 1-stage | 20.5 | $\approx 1\times$ |

As expected, the 2-stage distillation leads to slightly better image quality (lower FID); however, this comes at the cost of two stages of training and thus more compute time. Direct Models outperforms the existing training-based methods and already demonstrates a narrow quality gap compared to distillation.

Concerning the "more practical and efficient" claim in line 25, we would like to clarify that this statement is made specifically in the context of comparisons with distillation methods. Below, we provide additional details to support this point:

More practical: the fact that one needs to first train a model and then perform distillation is a real handicap, hence the recent growing trend of work aiming to obtain a single-step model directly, without this sequential process. From this perspective, we believe that Direct Models (and 1-stage training methods in general) are conceptually more "practical" than distillation methods. That being said, we acknowledge that the term "practical" is broad and context-dependent. For some, a practical solution may simply be the one that achieves even slightly better FID, regardless of its cost.

More efficient: since the effective training time is roughly equivalent to that of training a single network, our method is more efficient in terms of training time than distillation methods. However, we acknowledge that the term "efficient" can refer to compute time or other compute resources, and we should have been more specific in our wording.

We will include these clarifications in our revision.

Lastly, we acknowledge that current single-stage methods, including Direct Models, do not yet match the FID performance of two-step distillation methods. However, we believe that advancing the state of the art in 1-stage approaches is a promising and important research direction. In this context, Direct Models represent a meaningful milestone toward achieving that goal.

[1] One Step Diffusion via Shortcut Models. Kevin et al. ICLR 2025.

Q3: Missing ablations [updated]

Here we provide the results of the ablations proposed by the reviewer.

(a) The effect of large $\delta_t$

The table below shows the FID scores obtained using different values of $\delta_t$. As expected, when $\delta_t$ is large, the finite difference approximation in Eq. (7) becomes less accurate, leading to a drop in performance.

| $\delta_t$ | 0.005 | 0.01 | 0.05 | 0.1 |
| --- | --- | --- | --- | --- |
| FID | 16.6 | 16.8 | 20.5 | 25.4 |

(b) The effect of stop gradient

In the table below, we report the FID scores of generated images when training with and without the stop-gradient. We observe that training without the stop-gradient still works and the loss steadily decreases. However, it leads to (i) lower image quality and (ii) approximately 20% additional computational overhead.

| Formulation | FID |
| --- | --- |
| w/ stop gradient | 16.8 |
| w/o stop gradient | 27.4 |

(c) $x + w(x,t)$ vs. $x + t\,w(x,t)$ training formulations

We note that these two formulations are mathematically equivalent. The key difference lies in the interpretation of $w(x, t)$. In the formulation $x + t \cdot w(x,t)$, we are learning the normalized direction from $x_0$ to $x_t$. In contrast, in $x + w(x,t)$, we are learning the unnormalized direction. In the unnormalized case, the function $w(x,t)$ tends to be close to zero when $t \approx 0$, and much more significant when $t \approx 1$.

| Formulation | FID |
| --- | --- |
| $x + t\,w(x,t)$ | 16.8 |
| $x + w(x,t)$ | > 100 |

In the case of training with the formulation $x + w(x,t)$, the loss keeps oscillating, and the generated images are consistently black. We note that a common practice in training diffusion models and flow matching is to learn a normalized score or velocity rather than the absolute residual.
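Concretely, under the conditional linear path $x_t = (1 - t)\,x_0 + t\,x_1$ commonly used in flow matching (an illustration of scale on the conditional path, not the exact marginal flow), the two regression targets behave as

$$\text{normalized: } \frac{x_t - x_0}{t} = x_1 - x_0, \qquad \text{unnormalized: } x_t - x_0 = t\,(x_1 - x_0) \xrightarrow{\; t \to 0 \;} 0,$$

so the normalized target keeps a roughly constant magnitude across $t$, whereas the unnormalized target vanishes near $t = 0$, consistent with the instability observed above.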

Comment

Similar to the other comment I posted, the author-reviewer discussion is not a period to update or include new results. These ablation studies were asked for in the original review, and the experimental results presented above were not included in the authors' original rebuttal. Hence, in order to preserve the fairness of the review process, I ask that the reviewers do not take these new experimental results into account when making their final decisions.

Comment

Again, we thank the reviewer for their follow-up.

Concerning the additional experimental results, we unfortunately could not finalize them during the rebuttal week due to limited computational resources. Training a single configuration requires about 4 days on our 4× NVIDIA RTX 3090 GPUs, and we do not have sufficient resources to run all experiments in parallel, which caused this delay.

We commit to including the proposed ablations in our revision. We will also include the comparison with BOOT and adopt it as a baseline among distillation methods.

We are happy to address any further concerns or questions.

Comment

Thank you for your follow-up.

Here, we explain why we strongly believe Direct Models are a single-stage (single-run) training method.

Argument (1)

To do so, we believe the key point is to clearly define what constitutes a single-run method and what constitutes a distillation method.

For this, let us assume there exists a unified network that requires a unified training time $T$, and that sufficient computational resources are available. We denote by $t_e$ (which, in this setting, is a multiple of $T$) the earliest training time for a method $A$ to be fully trained, that is, the minimum training time required for $A$ to operate as intended.

By definition, a distillation method first pre-trains a model and then distills it. This means it depends on a well-trained model to begin the second phase of distillation. Therefore, regardless of available computational resources, it is constrained by this sequential process. Thus, its earliest training time is $t_e = 2T$ (the pre-training time must be included since, without it, the method cannot operate). In contrast, a single-stage method is not constrained by such a sequential process, and its earliest training time is $t_e = T$.

Now, a central question: if someone trains a model composed of two networks trained in parallel, meaning that $t_e = T$, do we call this single-run training or distillation? We strongly believe this is still single-stage training and not distillation. To the best of our knowledge, the notion of distillation is not defined by FLOPs or by the number of networks trained, but rather by the presence of a sequential pre-training and distillation process.

Formally:

  • Distillation method: earliest training time $t_e \geq 2T$
  • Single-run method: earliest training time $t_e = T$

In our case, Direct Models do not require a pre-trained model and do not depend on a well-trained $v$, as we are training from scratch. This removes the need for a sequential process. The training of $v$ and $w$ can be done in parallel, especially because training $v$ does not depend on $w$. This means that during the forward/backward pass of $w$, we can also perform the forward/backward pass of $v$ simultaneously. (If the reviewer has concerns about how this can be implemented, we can provide further details.) Therefore, we believe that our earliest training time in this case is $t_e = T$, supporting the claim of a "single-run" method (at least in theory so far).

This is in theory. In practice, an important final consideration is how to define this unified training time $T$ (which translates to a certain number of training iterations). We believe $T$ should correspond to the training time of a vanilla generative model.

Thus, a key condition that a single-run method should fulfill is that its number of training iterations matches, or is very comparable to, that of a common generative model (for example the teacher or the student in the case of distillation).

In our case, we believe this condition is met, and we are not using any trick in this regard. On ImageNet, our method is trained for 1 million iterations, the same as [1] (a single-run method), and this is a common number of iterations for training a flow matching model on ImageNet in general when using a batch size of 128.

Argument (2)

Moreover, we believe that Direct Models are similar to Shortcut Models [1] in this regard, which are recognized as a single-run method.

The loss used in [1], as shown in Eq (5) of their paper, is:

$$L(\theta) = E_{x, t, d} \left[ \left\| v_\theta(x_t, t, 0) - (x_1 - x_0)\right\|^2 + \lambda \left\| v_\theta(x_t, t, d) - s_{\text{target}} \right\|^2\right]$$

Here, $v_\theta(x_t, t, 0)$ represents the vanilla velocity model (trained with the standard flow matching loss), and $v_\theta(x_t, t, d)$ is the shortcut model parameterized by $0 < d \leq 1$, enabling single-step generation. In [1], both components are trained from scratch. Training $v_\theta(x_t, t, 0)$ is independent from training $v_\theta(x_t, t, d)$.

Intuitively, if $v_\theta(x_t, t, 0)$ were replaced with a pre-trained velocity model and only $v_\theta(x_t, t, d)$ were trained, the method in [1] would become a distillation approach.

In this aspect, the difference between [1] and our approach is that they use a single network with two conditionings ($0$ and $d$). Because of this, they perform the forward passes for $v_\theta(x_t, t, 0)$ and $v_\theta(x_t, t, d)$ sequentially. In contrast, we use two separate networks, which enables easier parallel training.

We believe that both Direct Models and [1], and here we quote from [1]'s introduction, “can be seen as performing self-distillation during training time, and thus do not require a separate distillation step and are trained over a single run.”

[1] One Step Diffusion via Shortcut Models. Kevin et al. ICLR 2025.


We hope this addresses the reviewer’s concern and would be happy to provide further details if needed.

Final Decision

This paper presents a new flow-matching training objective to learn a direct, single-step model. Unlike previous approaches, the presented approach is not a distillation method, but the single-step model is jointly learned with the velocity model.

While the simplicity and the promise of such an approach is appreciated, the current form of the paper is insufficient for publication, as pointed out by several reviewers.

The main issue is insufficient experimental evaluation and comparison with other baselines. The paper itself only evaluates on two datasets (CelebA-HQ and ImageNet-256), does not contain adequate ablation studies (such as the impact of $\delta t$ and the stop-gradient in the loss, as noted by Reviewer twLP), and also misses an important baseline, BOOT (again as noted by Reviewer twLP). While some of the ablation studies were provided during the discussion period, since they were not provided in time for the initial rebuttal, as per the NeurIPS reviewing policy we must not factor this data into account for fairness reasons. Furthermore, while the authors respond in their rebuttal that the shortcut model paper [3] also evaluates only on these two datasets: (a) [3] includes substantially more thorough ablation studies (including, e.g., how performance is affected by model size), and (b) [3] also includes some diffusion policy experiments in non-image, robotics domains. I believe this paper would be substantially strengthened if the authors had more time to include more evaluations.

Some other notes:

  • Please do change the name of the paper to not include "diffusion": as pointed out by Reviewer Y4t7, this is quite confusing, as this paper is about single-step flow models.
  • While I understand that single-step and distillation methods are different implementation techniques, I found the authors' response to Reviewer twLP's question on this to be quite unconvincing. For example, suppose there was a distillation method that took 2x as long to train compared to a single-step direct method, but at inference time had orders of magnitude better performance with single-step generation than the single-step direct method. In this case, it might be hard for practitioners to argue against the utility of such high-quality and fast inference just because the training took 2x as long. Thus, I don't believe the evaluation of the two approaches should be considered separately, as they have the same underlying objectives (reducing inference time while maintaining sample quality).